It’s not news that high-quality data is an asset to any organization. Yet, companies still lose 15 to 25% of their revenue due to poor data quality. In the realm of AI, data and code (the model) are the most critical parts of any AI system. However, in AI projects, data is complicated to handle and curate, and as a result, ML teams spend a lot of time trying to optimize a model built on bad data.
In a recent Google research paper, the researchers reported that "data is the most undervalued and de-glamorized aspect of AI." The rise of data-centric AI is changing that perspective as data science teams begin to appreciate the impact high-quality data has on a model's outcomes. Put another way, teams are realizing how badly poor-quality data hurts those outcomes.
Data-centric AI is a rapidly growing, data-first approach to building AI systems: start with high-quality data and continually enhance the dataset to improve the model's performance. Under this approach, the model's accuracy depends primarily on the quality of the data.
In contrast, the model-centric approach involves running experiments to increase the performance of the ML model by improving the code, features, or configuration while holding the data fixed. With the data-centric approach, you focus on improving the data rather than the code.
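To make the contrast concrete, here is a minimal sketch of a single data-centric iteration: the model code stays fixed and only the training labels change. Everything in it is a hypothetical illustration rather than anything from a real project: the toy one-dimensional data, the "low"/"high" labels, and the tiny nearest-neighbor classifier are all assumptions chosen to keep the example self-contained.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-nearest-neighbor)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)


# Hypothetical toy task: values below 5 are "low", values of 5 or more are "high".
# Two training points (2.0 and 7.0) carry the wrong label.
noisy_train = [
    (1.0, "low"), (2.0, "high"), (3.0, "low"),
    (6.0, "high"), (7.0, "low"), (8.0, "high"),
]

# Data-centric step: a reviewer corrects the two bad labels.
# The model code above is not touched.
clean_train = [
    (1.0, "low"), (2.0, "low"), (3.0, "low"),
    (6.0, "high"), (7.0, "high"), (8.0, "high"),
]

test = [
    (1.9, "low"), (2.2, "low"), (3.1, "low"),
    (6.1, "high"), (6.8, "high"), (7.3, "high"),
]

noisy_acc = accuracy(noisy_train, test)
clean_acc = accuracy(clean_train, test)
print(f"accuracy with noisy labels:     {noisy_acc:.2f}")
print(f"accuracy with corrected labels: {clean_acc:.2f}")
```

A model-centric iteration would do the opposite: keep `noisy_train` as it is and change `nearest_neighbor_predict` or its settings. In this toy case, no change to the classifier can fully recover from the two wrong labels sitting right next to the test points.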
Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.
Experts like Andrew Ng are proposing a new way to approach AI by focusing on the data because constantly fine-tuning the model yields minimal improvements, especially if the model is running on inaccurate data.
An AI project relies on the training data, the model, and the configuration/code used to control the model. For a while now, the accuracy of AI systems has depended on the model and its configuration. Models are built around the problem, and then the data is engineered to fit them. When a current model does not solve the problem adequately, new models are developed to tackle it. This approach to building an AI system is "model-centric." Any focus on the data has usually been limited to increasing its size. While model optimization is essential, refining the data is likely even more critical.
Data-centric AI demands close coordination with a complete data science team in order to frame the problem appropriately and map an easier path through data gathering, cleaning, and analysis. The aim of the data-centric approach is to improve the consistency of labeled data and remove biases.
It is easier to see the effects of artificial intelligence in consumer-facing businesses like Netflix, Spotify, and Amazon. These companies have massive datasets at their disposal and consumers willing to provide more. They also have built-in feedback mechanisms where their users can easily hit like or dislike to help train and fine-tune their models.
B2B sectors like manufacturing, medical device production, agriculture, etc., have not fully experienced the power of AI despite exhaustive model-centric efforts. This is because most organizations in these sectors don't have a code or model problem. They have a data problem that stems from a lack of sophisticated systems and processes required for subject matter experts, data scientists, and annotation teams to collect and curate high-quality data needed to accurately train their models. Simply focusing on a model-centric approach for more complex or company-specific use cases will never yield results accurate enough for production.
With over 74,000 AI-centric job openings on LinkedIn, it’s clear that many companies are seeking candidates with general AI expertise. Rather than depend on AI architects who are already hard to come by, businesses can instead leverage the subject matter expertise and in-house experience of their existing staff to gather the right data and develop processes to turn it into high-quality training data.
The rise of data-centric AI will stimulate colossal growth in the machine learning operations (MLOps) field. A primary goal of MLOps is to ensure that high-quality data is available at every stage of the project lifecycle. MLOps teams do this by identifying suitable datasets, setting standards for data labeling and quality assurance, and gathering extra data where necessary. With data being the new focus in AI development workflows, there will be a strong need for more people in MLOps to make sure that the best data is available at every stage of the project.
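As a rough illustration of what a labeling standard with a quality-assurance check could look like in practice, the sketch below validates a batch of labeled records before they move to the next stage. The record fields (`id`, `content`, `label`), the allowed label set, and the three checks are assumptions made for this example, not a prescribed MLOps standard.

```python
from collections import defaultdict

# Hypothetical labeling standard: the only labels annotators may use.
ALLOWED_LABELS = {"defect", "ok"}


def validate_records(records, allowed_labels=ALLOWED_LABELS):
    """Return a dict mapping each issue type to the ids of offending records."""
    issues = {"missing_label": [], "unknown_label": [], "conflicting_duplicates": []}
    labels_by_content = defaultdict(set)
    ids_by_content = defaultdict(list)

    for rec in records:
        label = rec.get("label")
        if label is None or label == "":
            issues["missing_label"].append(rec["id"])
        elif label not in allowed_labels:
            issues["unknown_label"].append(rec["id"])
        else:
            # Track valid labels per content to spot identical items labeled differently.
            labels_by_content[rec["content"]].add(label)
            ids_by_content[rec["content"]].append(rec["id"])

    for content, labels in labels_by_content.items():
        if len(labels) > 1:
            issues["conflicting_duplicates"].extend(ids_by_content[content])
    return issues


# Hypothetical batch from a manufacturing inspection task.
batch = [
    {"id": 1, "content": "scratch on housing", "label": "defect"},
    {"id": 2, "content": "scratch on housing", "label": "ok"},
    {"id": 3, "content": "clean weld", "label": None},
    {"id": 4, "content": "dented panel", "label": "broken"},
    {"id": 5, "content": "clean weld", "label": "ok"},
]

report = validate_records(batch)
for issue, ids in report.items():
    print(issue, ids)
```

A gate like this could run each time new data enters the project, so that records failing a check go back to annotators or domain experts instead of reaching training.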
Adopting a data-centric approach to AI involves many specific processes, from data collection and augmentation to data cleaning and labeling. All of these processes depend on data scientists being able to create a dataset. Involving domain experts when creating the dataset helps to ensure quality across the ML project lifecycle and reduce errors. Data-centric AI development workflows revolve around the data, so by default they require people who know the data, can sample it, and can curate relevant subsets. Most importantly, these domain experts can identify when the data has errors.
“Consistent, quality data labeling is the bedrock of data-centric AI”
Data labeling has often been viewed as grunt work and, historically, was largely outsourced to third-party services. Labeling training datasets is a resource-intensive process, and while crowdsourcing and web scraping may be helpful, they add "noise" to training sets. Outsourcing data labeling also leads to inconsistency because it is harder to monitor the quality of the work and whether consistent processes are being followed.
The training data used in machine learning includes labeled text, images, video, audio, etc. If the data quality is poor, the model is bound to perform poorly when sent into production. This poor performance can have drastic negative effects depending on the application.
A study in 2017 examined how mislabeled training data affected the performance of deep learning algorithms that were used to classify rubble from the 2011 New Zealand earthquake. Most of the labeling mistakes they observed were due to “inaccurate geospatial delineation, caused by lack of training (e.g., misunderstanding what to include as rubble), or insufficient tools (e.g., irregularly shaped polygons labeling undamaged sidewalks as rubble).” If the labeling task had been given to domain experts, the inaccurate geospatial delineation error could have been avoided.
The best way to ensure consistency is to adopt a tool that supports internal data labeling and allows experts with more domain knowledge to be a part of the process while automating as much of the “grunt work” as possible.
Develop an effective data labeling strategy to ensure consistency. Manual data labeling can be slow and error-prone. Successful labeling often depends on how many annotators you can marshal, how well domain experts have trained them, and how long they work on the task. Consistent labeling practices help ensure that everyone on the team adheres to the same labeling requirements; inconsistent labels in the dataset are likely to confuse the model. When multiple annotators label the same samples, use agreement matrices to identify and resolve labeling issues. One way to ensure consistency is to label data internally so that domain experts have more control over the labeling process.
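One simple reading of an agreement matrix is a table of pairwise agreement rates between annotators, paired with a list of the samples they disagree on. The sketch below takes that reading; the annotator names, the labels, and the use of raw percent agreement are assumptions for illustration. Raw agreement does not correct for agreement that happens by chance, so a chance-corrected statistic such as Cohen's kappa may be a better fit for real projects.

```python
from itertools import combinations


def agreement_matrix(annotations):
    """Pairwise fraction of samples on which each pair of annotators agrees.

    annotations maps annotator name -> list of labels, all in the same sample order.
    """
    names = sorted(annotations)
    matrix = {a: {b: 1.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        pairs = list(zip(annotations[a], annotations[b]))
        rate = sum(x == y for x, y in pairs) / len(pairs)
        matrix[a][b] = matrix[b][a] = rate
    return matrix


def disputed_samples(annotations):
    """Indices of samples where the annotators do not all give the same label."""
    columns = zip(*annotations.values())
    return [i for i, labels in enumerate(columns) if len(set(labels)) > 1]


# Hypothetical labels from three annotators on the same five image tiles.
annotations = {
    "ann_a": ["rubble", "rubble", "sidewalk", "rubble", "sidewalk"],
    "ann_b": ["rubble", "sidewalk", "sidewalk", "rubble", "sidewalk"],
    "ann_c": ["rubble", "sidewalk", "sidewalk", "rubble", "rubble"],
}

matrix = agreement_matrix(annotations)
for name, row in matrix.items():
    print(name, {other: round(rate, 2) for other, rate in row.items()})

# Samples to send back to a domain expert for a final decision.
to_review = disputed_samples(annotations)
print("needs review:", to_review)
```

A low rate between two particular annotators points to differing interpretations of the labeling guidelines, while the disputed sample list gives domain experts a short queue of concrete cases to resolve.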
Machine learning has relied heavily on the traditional model-centric approach, which has brought the field to a stage where models are more accessible. Because the model-centric approach is focused on the code, machine learning engineers can easily go on GitHub to find code that matches the model they want to build.
However, successful machine learning requires both accurately curated data and well-built models. Because today's models are already so sophisticated, the most significant returns moving forward will come from approaches that prioritize the data. As data increasingly determines success and failure, iterative development must increasingly be driven by data.