Over the last few years, the development of machine learning applications has started to shift from being model-centric to being data-centric. Data-centric AI is the new buzzword, but unlike most other buzzwords in AI, it is a very straightforward concept that delivers more business impact. Data-centric AI says that if you work on all the different aspects of improving your dataset, like labeling, augmentation, and curation, you'll achieve a more accurate and powerful model. While it sounds simple enough, applying it to production applications without the proper tools and processes can introduce a certain degree of complexity, and given its novelty, there is a limited number of tools and best practices.
Nikolai, Max, and I started Heartex in 2019 while thinking about how we could help more organizations transition to being what is today called data-centric. That conversation originated while we were in the Himalayan mountains, wandering in the wilderness, at times climbing to altitudes of 20,000 ft, sharing experiences and challenges we faced while working on our previous projects. Those experiences taught us that data labeling (aka data annotation) would be one of the most critical aspects of data-centric AI since it has the most significant impact on model accuracy, and it made us work long nights to create a piece of software that we named Label Studio.
Today, three years later, we’ve had more than 100,000 people use Label Studio, making it the most popular open-source labeling platform. We’ve also hit a big milestone for the company: securing our next funding round of $25 million, led by Redpoint, with participation from all our existing investors, Unusual Ventures, Bow Capital, and Swift Ventures.
The labeling process, in its essence, is a process of capturing decision making, creating a layer of knowledge on top of the raw dataset. While there are different tools to help your team with data labeling, we believe there are three critical components to achieving the best outcomes for data-centric ML/AI initiatives: 1) enabling internal teams & domain experts to add meaning to data, 2) the extensibility & accessibility of open source software, and 3) building a real community of data scientists and engineers, like us, who can share knowledge.
In the case of data labeling, there are different types of ‘knowledge’, and the skills and experience required to label a dataset depend on the model being built and the underlying data being annotated. Historically, organizations were using AI to solve what are known as "common knowledge" problems. As the name implies, common knowledge problems do not require expertise or specialized training to annotate (think of drawing a bounding box around a vehicle and labeling it "car"), allowing organizations to easily outsource the task to third-party service providers. While there is still a need for such annotation, organizations are starting to build models to solve more complex, unique, and integrated problems that require internal subject matter expertise. When organizations leverage internal expertise to annotate data, not only do they build more accurate models but they also develop processes that can be scaled with the help of algorithms and infrastructure.
Today, with the powerful combination of internal labeling operations and Label Studio Enterprise, organizations from a wide range of industries are experiencing the benefits of data-centric AI. Some examples are:
After working on these concepts over the last several years and through the many enlightening conversations we've had with practitioners in the data science community, we've realized that to accelerate the adoption of data-centric principles, we need to get Label Studio into as many hands as possible. That led us to publish an open-source package, Label Studio Community, which gives everyone the ability to easily label data and contribute their knowledge and expertise.
We are well on our way to getting everyone to label data. Adoption of Label Studio Community has been phenomenal, with users ranging from Fortune 10 companies to small startups and everything in between. The value generated from data labeling applies to every market segment where ML has a footprint. A huge thank you to our vibrant and rapidly growing community of thousands of AI professionals for helping push data-centric AI forward.
With the new financing, we will keep investing in the Label Studio community and bring the power of data-centric AI to every organization. We have exciting plans to bring new functionality that will enable organizations to truly scale labeling operations while maintaining high-quality results through labeling automation trained by your subject matter experts.
If Label Studio sounds like something you're interested in working on, please see our list of job openings.
Finally, we want to thank every member of our team for their dedication and hard work in building Label Studio, as well as our amazing community and all of our customers who believe in our approach.
Michael, Nikolai, Max