In 1960, scientists at the Air Force Cambridge Research Laboratories successfully created a speech-to-text machine learning algorithm that could recognize 83 words. To create it, they used training data stored on 1,600 magnetic drum storage registers. Given that one magnetic drum stored about 10 kB at that time, that’s about 16 MB worth of training data. A short video you record on your phone takes up more space than that.
Fast forward to today. The speech-to-text machine learning algorithms have arguably been perfected, thanks to training datasets that are gigabytes in size. While storage devices have gotten significantly smaller, and computers now have incredible processing power, there’s a greater need for properly labeled training data to train AI to perform complex actions. That’s because machine learning’s applications, from voice recognition to self-driving cars, are becoming more advanced, and advanced technologies need bigger, better datasets to learn to perform their tasks accurately.
Manually creating training data by labeling unlabeled data is costly in terms of both time and money. Crowdsourcing the data labeling process compromises accuracy. Outsourcing the process compromises both accuracy and security. Giving it to an in-house team means huge overhead costs. To label large quantities of data quickly, accurately, and with minimal costs, organizations should look to intelligent data labeling, i.e., AI labeling unlabeled data with little-to-no human involvement.
The datasets needed to create advanced AI solutions can go from GBs to TBs and beyond in size. All of this data has to be labeled to train the AI model. The time and money needed to label large quantities of data call for intelligent data labeling.
Usually, video files have 24 frames per second. That means a one-minute video has 1,440 frames. When labeling video files, the image and audio components of each frame are labeled. The time it takes to label one frame can vary based on the data points being labeled.
Let’s say it takes one annotator one minute to label one frame. Since labeling every frame is impractical, you can work with keyframes instead, i.e., pick one or two frames from every second and label only those. At one keyframe per second, a one-minute video has 60 keyframes, so it will take one annotator one hour to label it manually. The time needed to label video data adds to the total cost. If you need higher volumes of data, you’ll need to hire many annotators to get it done.
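The arithmetic above can be checked with a short sketch. The function name and parameters are illustrative, and the one-minute-per-frame figure is the same assumption used in the text, not a measured value.

```python
FPS = 24  # typical frame rate mentioned in the text

def labeling_hours(video_minutes, keyframes_per_second=1, minutes_per_frame=1.0):
    """Estimate annotator hours needed to label the keyframes of a video."""
    keyframes = video_minutes * 60 * keyframes_per_second
    return keyframes * minutes_per_frame / 60

print(FPS * 60)                                      # 1440 frames in a one-minute video
print(labeling_hours(1))                             # 1.0 hour at one keyframe per second
print(labeling_hours(1, keyframes_per_second=2))     # 2.0 hours at two keyframes per second
print(labeling_hours(1, keyframes_per_second=FPS))   # 24.0 hours if every frame is labeled
```

Under these assumptions, labeling every frame of a single minute of video would take one annotator three full working days, which is why keyframes are used.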
Different vendors have different pricing for labeling unlabeled data. If you go with Google, it can cost you anywhere from $25 to $870 per 1,000 units per human labeler. The unit can be an image, a five-second video, an event in a 30-second video, and so on. The cost and turnaround time are high, making a case for intelligent data labeling. If you go for an online platform that supports intelligent data labeling, it can cost you up to $300 per user per month. The users, in this case, are the people who handle quality assurance.
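To get a feel for what the per-unit pricing means for a whole dataset, here is a rough sketch. It assumes cost scales linearly with the number of units and with the number of human labelers assigned to each unit; the dataset size and labeler count are made-up inputs, and actual vendor billing may work differently.

```python
def labeling_cost_range(units, labelers_per_unit=1,
                        low_per_1000=25.0, high_per_1000=870.0):
    """Return a (low, high) cost estimate in dollars for manual labeling.

    Assumes simple linear pricing per 1,000 units per human labeler.
    """
    thousands = units / 1000
    low = thousands * low_per_1000 * labelers_per_unit
    high = thousands * high_per_1000 * labelers_per_unit
    return low, high

# Hypothetical project: 10,000 units, each labeled by 3 people.
low, high = labeling_cost_range(units=10_000, labelers_per_unit=3)
print(f"${low:,.0f} to ${high:,.0f}")  # $750 to $26,100
```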
As datasets become larger and larger and machine learning algorithms evolve to perform highly complicated functions, such as gait detection and space exploration, there’s a need for data labeling to move from a purely manual activity to an intelligent one.
Intelligent data labeling relies heavily on the machine, which labels part or all of the unlabeled training data. There are two ways to do this: active learning and programmatic labeling.
In active learning, humans label a subset of the total data, and it’s used to train the machine to label another subset. If the machine achieves an acceptable level of accuracy, it’s allowed to label all of the unlabeled data. If the machine does not achieve an acceptable level of accuracy, it’s fed another subset of human-labeled data to further refine its algorithm.
There are usually many iterations of the process before an algorithm achieves acceptable accuracy. In every iteration, humans have to manually label smaller sets of data. While this does mean human involvement almost every step of the way, you can get a small team of subject matter experts to accurately label smaller sets of training data. This will help you create a smarter AI model and save time and money (as opposed to the conventional method of humans annotating all the data).
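The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular tool implements it: the data is one-dimensional in the range 0 to 1, `human_label` is a hypothetical stand-in for a human annotator, the model is a simple threshold, and the next batch is chosen by uncertainty sampling (the items closest to the decision boundary), which is one common selection strategy that the text does not specify.

```python
import random

def human_label(x):
    """Hypothetical stand-in for a human annotator: class 1 if x >= 0.6."""
    return int(x >= 0.6)

def train(labeled):
    """Fit a 1-D threshold: midpoint between the largest 0 and the smallest 1."""
    lo = max((x for x, y in labeled if y == 0), default=0.0)
    hi = min((x for x, y in labeled if y == 1), default=1.0)
    return (lo + hi) / 2

def predict(threshold, x):
    return int(x >= threshold)

def accuracy(threshold, validation):
    return sum(predict(threshold, x) == y for x, y in validation) / len(validation)

def active_learning(pool, validation, batch_size=5, target=0.95, max_rounds=20):
    pool = list(pool)
    labeled, rounds = [], 0
    batch, pool = pool[:batch_size], pool[batch_size:]   # seed batch
    while True:
        labeled += [(x, human_label(x)) for x in batch]  # humans label the batch
        threshold = train(labeled)                       # retrain on all human labels
        acc = accuracy(threshold, validation)
        rounds += 1
        if acc >= target or not pool or rounds >= max_rounds:
            break
        # Uncertainty sampling: send the items nearest the boundary to humans next.
        pool.sort(key=lambda x: abs(x - threshold))
        batch, pool = pool[:batch_size], pool[batch_size:]
    # Only let the machine label the rest if accuracy is acceptable.
    machine_labeled = [(x, predict(threshold, x)) for x in pool] if acc >= target else []
    return acc, labeled, machine_labeled

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]
validation = [(x, human_label(x)) for x in (rng.random() for _ in range(50))]
acc, labeled, machine_labeled = active_learning(pool, validation)
print(f"accuracy={acc:.2f} human-labeled={len(labeled)} machine-labeled={len(machine_labeled)}")
```

The point of the sketch is the shape of the loop: humans label a small batch, the model is retrained and checked against held-out human labels, and only once the accuracy target is met does the machine label everything that remains.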
Active learning as a data labeling method gained popularity in the early 1990s due to the limitations of the passive algorithms used for training AI models. The main limitation was the issue of generalization, i.e., how well a learned model can account for previously unencountered information. Active learning aims to solve that problem by using targeted sets of data to train a machine learning model and giving the model more control over training data.
In a 2019 study, Song, Berthelot, and Rostamizadeh applied the active learning method of data labeling to the CIFAR-10, CIFAR-100, and SVHN datasets and achieved an absolute accuracy improvement of 1.5% (i.e., the results were 1.5 percentage points more accurate than the baseline results). The study showed that active learning can be used to label large quantities of unlabeled data with great accuracy.
In programmatic labeling, the machine creates its own training data by taking unlabeled data and labeling it using labeling functions. Labeling functions are a specific set of rules for the machine to label unlabeled data. These rules can range from “if job title contains CFO” to Boolean searches and beyond.
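As a rough illustration, labeling functions can be written as plain Python functions that either vote for a label or abstain. The record fields, label names, and rules below are hypothetical, and combining votes by simple majority is just one straightforward option, not the only way to do it.

```python
from collections import Counter

ABSTAIN, NOT_EXEC, EXEC = -1, 0, 1  # hypothetical label scheme

def words(record):
    return record["job_title"].lower().split()

def lf_title_contains_cfo(record):
    return EXEC if "cfo" in words(record) else ABSTAIN

def lf_title_contains_chief(record):
    return EXEC if "chief" in words(record) else ABSTAIN

def lf_title_contains_intern(record):
    return NOT_EXEC if "intern" in words(record) else ABSTAIN

LABELING_FUNCTIONS = [lf_title_contains_cfo, lf_title_contains_chief, lf_title_contains_intern]

def apply_labeling_functions(record, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [v for v in (lf(record) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

records = [
    {"job_title": "CFO"},
    {"job_title": "Chief Financial Officer"},
    {"job_title": "Marketing Intern"},
    {"job_title": "Software Engineer"},
]
print([apply_labeling_functions(r) for r in records])  # [1, 1, 0, -1]
```

Records on which every function abstains stay unlabeled, which is a signal that more rules (or human review) are needed.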
Using labeling functions to label data puts a lot of power in the developer’s hands, but these functions must be thoroughly tested on smaller sets of data and checked for accuracy before being deployed on larger sets. The type of data will also affect the complexity of the labeling functions needed to label it. For example, labeling video data with labeling functions will require a higher complexity of rules as opposed to working with a text document. The good news is that a lot of work is being done to improve and perfect labeling functions.
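Testing a labeling function on a small set usually means checking two things: how often it fires (coverage) and how often it is right when it does (accuracy). The sketch below shows one way to compute both against a handful of hand-labeled examples; the rule, the texts, and the labels are invented for illustration.

```python
ABSTAIN = -1

def evaluate_labeling_function(lf, dev_set):
    """Return (coverage, accuracy) of a labeling function on hand-labeled data."""
    votes = [(lf(x), y) for x, y in dev_set]
    fired = [(v, y) for v, y in votes if v != ABSTAIN]
    coverage = len(fired) / len(votes)
    accuracy = sum(v == y for v, y in fired) / len(fired) if fired else 0.0
    return coverage, accuracy

def lf_mentions_refund(text):
    """Hypothetical rule: label 1 ("refund request") if the text mentions a refund."""
    return 1 if "refund" in text.lower() else ABSTAIN

# Small hand-labeled set: 1 = refund request, 0 = anything else.
dev_set = [
    ("I want a refund", 1),
    ("Your refund policy is great, thanks", 0),
    ("Where is my order?", 0),
    ("Please refund my purchase", 1),
]

coverage, accuracy = evaluate_labeling_function(lf_mentions_refund, dev_set)
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")  # coverage=0.75 accuracy=0.67
```

Here the rule fires on three of four examples but mislabels one of them, which is the kind of weakness you want to catch before running it over a large dataset.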
In 2021, Pierre Lison, Jeremy Barnes, and Aliaksandr Hubin used labeling functions for named entity recognition to identify seven entities: location, organization, person, money, date, time, and percent. They used the MUC6 corpus, a database of 318 annotated Wall Street Journal articles. The results of the experiment showed that labeling functions achieved F1 scores of 0.83 and 0.75 for the token-level evaluation method and entity-level evaluation method, respectively. F1 scores are on a scale of 0 to 1, and highly accurate models are closer to 1. The experiment showcased the effectiveness of labeling functions when working with text-based data.
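For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it at both the token level and the entity level for a tiny made-up example. It assumes that an entity counts as correct only when its span and type match exactly, which is a common convention but may differ from the exact setup used in the study.

```python
def f1_score(predicted, gold):
    """F1 = harmonic mean of precision and recall over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def to_tokens(spans):
    """Expand (start, end, label) spans into per-token (index, label) pairs."""
    return {(i, label) for start, end, label in spans for i in range(start, end)}

# Made-up example: spans are (start_token, end_token_exclusive, entity_type).
gold = [(0, 2, "person"), (3, 5, "organization")]
predicted = [(0, 2, "person"), (3, 4, "organization")]  # second span is cut short

token_f1 = f1_score(to_tokens(predicted), to_tokens(gold))
entity_f1 = f1_score(predicted, gold)
print(f"token-level F1={token_f1:.2f} entity-level F1={entity_f1:.2f}")
# token-level F1=0.86 entity-level F1=0.50
```

A partially correct span still earns credit token by token but counts as a miss at the entity level, which is one reason entity-level scores tend to come out lower.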
When it comes to video data, a study conducted in 2020 divided the data into audio and video streams and achieved accuracies of 55.9% and 57.9% using labeling functions. That accuracy may not yet be comparable to human-assisted labeling, but a lot of work is being done in this area, and researchers are trying to achieve higher accuracy by experimenting with different types of labeling functions.
While programmatic labeling is a completely automated way to label unlabeled data, it still requires human involvement for quality assurance. If you are planning to hire an in-house team of subject matter experts for data annotation, programmatic labeling will help you tackle larger sets of data and reduce overhead costs by dedicating your data annotators to quality assurance. But you will first have to create labeling functions, test them for accuracy, and keep improving them to a point where they deliver acceptable results. In some cases, the time and resources needed to do that may not be worth it, making active learning a better alternative for now.
While programmatic labeling is still being perfected, the active learning method of data labeling has achieved desirable accuracy. Label Studio supports active learning. You can create a customized active learning loop and scale your data labeling projects to handle larger datasets with increased accuracy.
Find out more about Label Studio’s active learning capabilities, or give it a try with our free trial, complete with sample templates to help you get started.