At HumanSignal, we're on a mission to help data science and machine learning teams do more with their data. In this blog, we'll walk you through the latest addition to Label Studio Enterprise: external taxonomies. This allows teams to load, manage, and maintain well-defined taxonomies of hundreds of thousands of items in less than a second.
Read on to learn more about this exciting new feature!
Taxonomies play a pivotal role in data labeling, acting as a comprehensive framework for organizing vast amounts of data. By providing a structured hierarchy of categories and subcategories, taxonomies enable more nuanced and accurate data classification.
Taxonomies are especially important when data diversity and volume of information are overwhelming. They provide a structured guide for labelers to follow, ensuring more consistency. This is essential for developing high-quality datasets and training sophisticated machine learning models.
This release gives teams the ability to load large taxonomies from external sources. Before this release, teams had to add each line item in their taxonomy by hand while configuring their labeling interface. While this may only take a second or two per item, the effort quickly snowballed for taxonomies with thousands of items. Now, teams can load large taxonomies of hundreds of thousands of items in seconds. Furthermore, the data only has to be loaded once. Once loaded, it is available locally with virtually no lag.
This release provides a few advantages over the old method of creating your taxonomy. To start, it eliminates the time-consuming task of manually adding taxonomy choices, streamlining the labeling configuration process. This update also improves standardization and consistency across projects that require the same taxonomy. Now, you can update your taxonomy file once and reuse it across projects instead of updating each.
Additionally, it enables external storage of taxonomy data. This is particularly beneficial for industries with strict regulations. Performance is also significantly improved. Asynchronous loading replaces the previous method of embedding large internal taxonomies directly in the labeling config, enhancing loading speed and the overall annotator experience.
As a technical lead or project manager
With this release, users with the owner, administrator, or manager role can load taxonomies into Label Studio Enterprise using formatted JSON files. This allows teams to perform classification tasks within a defined taxonomy or hierarchy of choices using both parent and nested child nodes. The release supports both internal proprietary taxonomies and domain-specific ones, such as SNOMED CT or FoodEx2.
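To make the idea of parent and nested child nodes concrete, here is a minimal Python sketch that builds a small hierarchy and serializes it to JSON. The field names ("items", "value", "children") and the helper functions are assumptions chosen for illustration, not the confirmed Label Studio Enterprise schema, so check the docs for the exact format before using it.

```python
import json

# Hypothetical field names: "value" for the label text and "children" for
# nested nodes. The real schema may differ; this only illustrates the shape.
def node(value, children=None):
    """Build one taxonomy node; leaf nodes simply have no children."""
    item = {"value": value}
    if children:
        item["children"] = children
    return item

taxonomy = {
    "items": [
        node("Europe", [
            node("France", [node("Paris"), node("Lyon")]),
            node("Germany", [node("Berlin")]),
        ]),
        node("Asia", [
            node("Japan", [node("Tokyo")]),
        ]),
    ]
}

def count_nodes(items):
    """Count every node in the hierarchy, parents and leaves alike."""
    return sum(1 + count_nodes(item.get("children", [])) for item in items)

# Serialize once; the resulting text is what you would save as a JSON file.
taxonomy_json = json.dumps(taxonomy, indent=2)
```

Generating the file from a script like this, rather than editing it by hand, tends to be easier once a taxonomy grows past a few dozen items.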
These JSON files can also be hosted in your external storage and protected using one of two methods. For less critical or sensitive taxonomies, you can secure access with a username and password. Alternatively, you can associate a storage connection from Amazon S3, Google Cloud Platform, or Azure with your project and, in your labeling config, specify the URL that points to the taxonomy file in that storage. When the taxonomy component is rendered, we pre-sign that URL using the credentials from the storage back end and convert it to a secure HTTPS link. Access then depends on holding that pre-signed link rather than on entering standard login information.
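If pre-signed links are new to you, the following Python sketch shows the general idea: a URL is stamped with an expiry time and a signature derived from a secret, so only someone holding the complete link can use it, and only until it expires. This is a simplified HMAC-based illustration, not the signing algorithm that Amazon S3, Google Cloud, or Azure actually use, and the function names, query parameters, and example URL are all hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

def presign(url, secret, expires_in, now=None):
    """Append an expiry timestamp and an HMAC signature to a URL."""
    now = int(time.time()) if now is None else now
    expires = now + expires_in
    payload = f"{url}|{expires}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{url}?{urlencode({'expires': expires, 'signature': signature})}"

def verify(signed_url, secret, now=None):
    """Accept the link only if it is unexpired and the signature matches."""
    now = int(time.time()) if now is None else now
    parsed = urlparse(signed_url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    # Recompute the signature over the URL without its query string.
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    payload = f"{base_url}|{expires}".encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return now <= expires and hmac.compare_digest(signature, expected)

# Example with a made-up URL and secret.
link = presign("https://example.com/bucket/taxonomy.json", b"storage-secret", 3600)
```

Because the signature covers both the path and the expiry time, changing either one invalidates the link.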
Once loaded, you can preview and test your new taxonomy within the settings tab of the labeling config dashboard. This release also provides mechanisms for controlling how your taxonomy is presented to users. For example, you can choose specific options to be preselected for your annotators. This can save significant time when certain choices apply to the vast majority of items in your dataset.
You can also control which levels of your taxonomy hierarchy can be selected, as well as how many choices your annotators can make within a label. This allows you to use the same large-scale taxonomies across projects while maintaining complete control over which areas are relevant to your data project and how detailed you want your labels to be. This release also lets you choose whether a selected taxonomy item is shown with an abbreviated or full results path, such as ‘continent’ + ‘country’ + ‘city’ instead of just ‘city.’
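The controls described above can be pictured with a short Python sketch. It is not how Label Studio Enterprise implements them. It simply models three rules on a toy taxonomy: restricting selection to leaf items, capping the number of choices, and displaying the full or abbreviated path. The node shape and the names `validate_selection`, `leaves_only`, `max_choices`, and `display` are hypothetical.

```python
# Hypothetical node shape: {"value": str, "children": [...]}.
TAXONOMY = [
    {"value": "Europe", "children": [
        {"value": "France", "children": [{"value": "Paris"}, {"value": "Lyon"}]},
    ]},
    {"value": "Asia", "children": [
        {"value": "Japan", "children": [{"value": "Tokyo"}]},
    ]},
]

def find_node(items, path):
    """Follow a path such as ["Europe", "France"]; return the node or None."""
    node = None
    for name in path:
        node = next((item for item in items if item["value"] == name), None)
        if node is None:
            return None
        items = node.get("children", [])
    return node

def validate_selection(paths, leaves_only=True, max_choices=None):
    """Return a list of problems with a selection (empty when it is valid)."""
    problems = []
    if max_choices is not None and len(paths) > max_choices:
        problems.append(f"too many choices: {len(paths)} > {max_choices}")
    for path in paths:
        node = find_node(TAXONOMY, path)
        if node is None:
            problems.append(f"unknown item: {' / '.join(path)}")
        elif leaves_only and node.get("children"):
            problems.append(f"not a leaf: {' / '.join(path)}")
    return problems

def display(path, full_path=True, separator=" / "):
    """Render a selected item as its full path or just its last segment."""
    return separator.join(path) if full_path else path[-1]
```

With these rules, selecting `["Europe", "France"]` is rejected when only leaves are allowed, while `["Europe", "France", "Paris"]` passes and can be shown either as the full path or simply as the city.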
You can also add explanatory information for each taxonomy item, as needed, directly within the JSON file. This is useful for describing differences between similar classifications or helping reduce onboarding and training time. It also allows your subject matter experts to distill their knowledge directly into your taxonomy and, by extension, the labeling interface and your labeling team.
As a user
When you enter a labeling project with an external taxonomy, you'll first notice how much faster everything is. To find and select one or more taxonomy choices, all you have to do is scroll through the available list or use a powerful search bar. The search bar is handy for long or otherwise complicated taxonomies. If set up in the labeling configuration, you can also mouse over a taxonomy item to get helpful hints or extra information via a tooltip.
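For a rough feel of what searching a nested taxonomy involves, here is a minimal Python sketch that walks a hierarchy and returns every item whose label contains the query, together with its path and any tooltip text. The "hint" field, the node shape, and the `search` function are assumptions made for illustration. The actual search in the labeling interface may work differently.

```python
# Hypothetical node shape with an optional "hint" field for tooltip text.
TAXONOMY = [
    {"value": "Fruit", "children": [
        {"value": "Citrus", "hint": "Oranges, lemons, limes, and similar",
         "children": [{"value": "Orange"}, {"value": "Lemon"}]},
        {"value": "Berry", "children": [{"value": "Strawberry"}]},
    ]},
]

def search(items, query, parents=()):
    """Yield (path, hint) for every node whose label contains the query."""
    needle = query.casefold()
    for item in items:
        path = parents + (item["value"],)
        if needle in item["value"].casefold():
            yield path, item.get("hint")
        # Keep descending so matches nested under non-matching parents are found.
        yield from search(item.get("children", []), query, path)

results = list(search(TAXONOMY, "berry"))
```

Matching is case-insensitive and substring-based here, so a query for "berry" finds both the "Berry" category and the "Strawberry" item beneath it.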
Unlock the power of taxonomies and ontologies
The structure that taxonomies provide can significantly enhance data reliability and consistency. This is especially true in scenarios involving multiple data labelers, where shared criteria keep different people from labeling the same kind of item in different ways.
Precise categorization reduces ambiguity and minimizes the errors caused by it. Taxonomies also facilitate knowledge transfer, making it easier for new labelers or different teams to grasp the data structure and classification logic. In addition, they enable non-experts to capture nuanced domain-specific details in classifications.
This release makes loading, managing, updating, and using large taxonomies easy and painless.
Maintain dynamic taxonomies
Managing and updating taxonomies has been streamlined with this release. It eliminates the cumbersome process of navigating Label Studio and updating labeling configurations for each project. This feature allows dynamic modifications to the taxonomy as new data types or categories emerge without disrupting existing classifications.
Securely host your taxonomy data
As mentioned, this release allows companies to manage their taxonomies outside of Label Studio Enterprise. This eliminates any security concerns or risks associated with storing proprietary taxonomies or other sensitive data within a labeling platform.
Interoperability and standardization with industry best practices
Lastly, because the taxonomy is contained within a standard JSON-formatted file, it can be easily shared, re-used, and integrated into other systems or platforms. This eliminates the need to maintain or recreate separate taxonomies across different systems. It also reduces management burdens by enabling the maintenance of a single JSON file.
This release also empowers organizations to put in place and maintain large, industry-specific taxonomies and ontologies, such as SNOMED CT, ICD, EUNIS, FoodEx2, and more. Aligning with these industry standards makes data labeling projects more effective and their results more widely applicable.
This release provides comprehensive support for implementing and managing your taxonomy asynchronously using a flat file, which you can securely load from your external storage. We are also working on adding comprehensive support for loading your taxonomy via API, with dynamic data loading for maximum performance. Stay tuned for that and other exciting new features!
We can’t wait to see how support for external large-scale taxonomies helps your organization more effectively categorize and label your data. To learn more about this and other recent releases, check out our docs. If you’d like to learn more about Label Studio Enterprise or get a demo of external taxonomies, our expert team of humans would be happy to chat with you.