Label Studio is known for its flexibility and user-friendly interface. But one of the often unsung heroes of the Label Studio backend is its powerful API and SDK, which can significantly enhance your data labeling workflow. Whether setting up a new project, managing large datasets, or configuring complex labeling tasks, these tools provide the flexibility and robustness needed for high-quality data preparation.
In this post, we will show five essential tips and tricks that will open your eyes to the power of the API and SDK backend. From automating project creation to schema modifications and export operations, we’ll address some of the most common beginner's problems and even some advanced practitioner situations. By the end of this guide, you'll be equipped with practical insights and best practices to leverage Label Studio's capabilities to their fullest, ensuring your datasets are of the highest quality.
The first tip is to know when to use the API and when to use the SDK. Label Studio’s web application is designed to simplify working with data regardless of the type. But if there’s one thing I’ve found, it’s that working with data is never simple. Whether it’s data preparation or connecting labeling tools into a workflow, there always seems to be a need for something custom. Luckily, Label Studio has extension points, the API and the SDK, that help you incorporate data labeling into any workflow. But when should we use each of them?
The Label Studio API is a RESTful interface that directly interacts with the Label Studio server. It features endpoints for various operations, including project creation, task import, annotation export, and user management. This versatile API can be utilized across any programming language via HTTP requests, making it ideal for system integrations and automated workflows in non-Python languages.
On the other hand, the Label Studio SDK, designed specifically for Python, offers a more intuitive layer over the API. It streamlines API interactions with user-friendly Python methods and classes. This simplification benefits routine tasks such as project setup, task handling, and data export.
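Usage recommendations: reach for the API when you need direct HTTP access or are working outside Python, and reach for the SDK for Python scripts and notebooks that handle routine project setup, task handling, and export.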
Initiating a new project is a fundamental step in utilizing Label Studio. Creating multiple smaller projects for better organization and automation is advantageous in specific workflows.
The first thing you will need is a Label Studio API token. As described in the documentation, you can create that token from the CLI or get one for your user from the UI.
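Once you have a token, it helps to keep it out of your scripts. The following is a minimal sketch, assuming you store the server URL and token in environment variables; the variable names `LABEL_STUDIO_URL` and `LABEL_STUDIO_API_KEY` and the helper `load_label_studio_settings` are illustrative choices, not something Label Studio requires.

```python
import os


def load_label_studio_settings(env=os.environ):
    """Read the server URL and API token from a mapping (os.environ by default).

    The variable names used here are assumptions made for this sketch.
    """
    url = env.get("LABEL_STUDIO_URL", "http://localhost:8080")
    token = env.get("LABEL_STUDIO_API_KEY")
    if not token:
        # Fail early with a clear message instead of sending unauthenticated requests.
        raise RuntimeError("Set LABEL_STUDIO_API_KEY to your Label Studio API token")
    return url.rstrip("/"), token


# Example with an explicit mapping so the sketch runs without touching your shell.
url, token = load_label_studio_settings({"LABEL_STUDIO_API_KEY": "your-api-token"})
print(url)
```

Passing the mapping as a parameter keeps the helper easy to test and makes it obvious where the credentials come from.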
While the direct API can be used for project creation, the complexity of these calls can quickly escalate, especially when embedding label configurations in requests, as shown in this example:
!curl -H "Content-Type:application/json" -H "Authorization: Token your-api-token" -X POST "http://localhost:8080/api/projects" --data "{\"title\": \"api_project\", \"label_config\": \"<View><Image name=\\\"image\\\" value=\\\"\$image\\\"/><Choices name=\\\"choice\\\" toName=\\\"image\\\"><Choice value=\\\"Dog\\\"/><Choice value=\\\"Cat\\\" /></Choices></View>\"}"
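Much of the pain in the curl call comes from escaping the XML by hand. If you do call the REST API from a script, a serializer can take care of that. Here is a minimal sketch that builds the same request with Python's standard library; it reuses the endpoint, headers, and payload from the curl example, while the helper name `build_create_project_request` is made up for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Same labeling config as in the curl example, written without manual escaping.
LABEL_CONFIG = """<View>
  <Image name="image" value="$image"/>
  <Choices name="choice" toName="image">
    <Choice value="Dog"/>
    <Choice value="Cat"/>
  </Choices>
</View>"""


def build_create_project_request(base_url, api_token, title, label_config):
    """Build (but do not send) a POST request to the /api/projects endpoint."""
    # json.dumps handles all quoting of the embedded XML.
    body = json.dumps({"title": title, "label_config": label_config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/projects",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_token}",
        },
        method="POST",
    )


request = build_create_project_request(
    "http://localhost:8080", "your-api-token", "api_project", LABEL_CONFIG
)

# To send it against a running Label Studio server, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```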
Conversely, the SDK simplifies this process with more intuitive functions and reduced boilerplate. Here’s how:
!pip install label-studio-sdk
from label_studio_sdk import Client
# Placeholders: point these at your own Label Studio instance and API token.
LABEL_STUDIO_URL = 'http://localhost:8080'
API_TOKEN = 'your-api-token'
ls = Client(url=LABEL_STUDIO_URL, api_key=API_TOKEN)
ls.check_connection()
project = ls.start_project(
    title='Image Classification Project',
    label_config='''
    <View>
        <Image name="image" value="$image"/>
        <Choices name="choice" toName="image">
            <Choice value="Dog"/>
            <Choice value="Cat" />
        </Choices>
    </View>
    '''
)
This code creates an image classification project with options for Dog and Cat labels. Using the SDK significantly reduces the complexity of project creation, making your workflow more efficient and less error-prone.
When dealing with datasets that have a large number of classes, like the COCO dataset with its 80 object classes, manual configuration can be cumbersome. The Label Studio SDK lets you automate the setup by generating the labeling configuration from an existing class list. Here's how you can set up a project for the COCO dataset:
import requests
def get_coco_classes():
    """Fetch the COCO class names (one per line) from a public list."""
    classes_url = "https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-2014_2017.txt"
    response = requests.get(classes_url, timeout=30)
    # Fail loudly rather than creating a project whose only label is an error message.
    response.raise_for_status()
    return [line.strip() for line in response.text.splitlines() if line.strip()]
coco_classes = get_coco_classes()
def create_xml_for_classes(classes):
    xml_structure = "<View>\n\t<Image name=\"image\" value=\"$image\"/>\n\t<Choices name=\"choice\" toName=\"image\">\n"
    for cls in classes:
        xml_structure += f"\t\t<Choice value=\"{cls}\"/>\n"
    xml_structure += "\t</Choices>\n</View>"
    return xml_structure
xml_with_coco_classes = create_xml_for_classes(coco_classes)
project = ls.start_project(
title='Coco Image Classification Project',
label_config=xml_with_coco_classes
)
Now, when we navigate to the project, we can see that we have all 80 classes in our project.
Project settings for our COCO image classification project.
This approach automates a complex project setup with numerous classes, ensuring efficiency and accuracy.
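String concatenation works for the COCO names, but a class name containing characters such as `&` or `"` would produce invalid XML. As one possible alternative, here is a minimal sketch that builds the same kind of configuration with Python's `xml.etree.ElementTree`, which escapes attribute values automatically. The helper name `build_choices_config` and the sample class list are illustrative only.

```python
import xml.etree.ElementTree as ET


def build_choices_config(classes, data_key="image"):
    """Return a labeling config string with one <Choice> per class name."""
    view = ET.Element("View")
    # "$image" tells Label Studio to read the image URL from the task's "image" field.
    ET.SubElement(view, "Image", name="image", value=f"${data_key}")
    choices = ET.SubElement(view, "Choices", name="choice", toName="image")
    for cls in classes:
        # ElementTree escapes special characters in attribute values for us.
        ET.SubElement(choices, "Choice", value=cls)
    return ET.tostring(view, encoding="unicode")


# Sample class names, including one that needs escaping.
config = build_choices_config(["person", "traffic light", "salt & pepper"])
print(config)
```

The resulting string can be passed as `label_config` in the same way as the hand-built XML above.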
Once your project is set up, the next phase involves importing tasks (data points). The Label Studio SDK streamlines this process as well:
project.import_tasks(
[
{'image': 'https://data.heartex.net/open-images/train_0/mini/0045dd96bf73936c.jpg'},
{'image': 'https://data.heartex.net/open-images/train_0/mini/0083d02f6ad18b38.jpg'}
]
)
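This example shows how to import images for classification. The process is similar for other data types like text or audio. Tasks can also carry extra fields alongside the data itself, and existing tasks can be updated after import: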
project.import_tasks(
[
{'link_source':'wikipedia', 'image': 'https://upload.wikimedia.org/wikipedia/commons/thumb/2/25/Siam_lilacpoint.jpg/294px-Siam_lilacpoint.jpg'}
]
)
# Backfill the 'link_source' field on tasks that were imported without it.
for t in project.get_tasks():
    d = t['data']
    if 'link_source' not in d:
        d['link_source'] = 'data.heartex.net'
        project.update_task(t['id'], data=d)
These steps demonstrate the ease with which you can manage your data labeling projects using the Label Studio SDK, streamlining the import and management of tasks to save time and reduce manual errors.
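When the list of files grows, building the task dictionaries by hand stops being practical. The sketch below shows one way to generate them from a list of URLs and split them into batches; `build_tasks` and `batched` are hypothetical helpers written for this post, and the batch size is an arbitrary choice. The actual `project.import_tasks` call is left as a comment because it needs a running server.

```python
from itertools import islice


def build_tasks(urls, data_key="image", **extra_fields):
    """Turn a list of URLs into task dicts, adding any extra fields to each task."""
    return [{data_key: url, **extra_fields} for url in urls]


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


urls = [
    "https://data.heartex.net/open-images/train_0/mini/0045dd96bf73936c.jpg",
    "https://data.heartex.net/open-images/train_0/mini/0083d02f6ad18b38.jpg",
]
tasks = build_tasks(urls, link_source="data.heartex.net")

for batch in batched(tasks, 100):
    # With a connected project from the SDK examples above:
    # project.import_tasks(batch)
    print(len(batch), "tasks ready to import")
```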
In machine learning, managing large datasets efficiently is pivotal. The Label Studio SDK offers an effective approach for bulk data exports, addressing common challenges like extended processing times and web request timeouts.
snapshot = project.export_snapshot_create("my_snapshot")
print(project.export_snapshot_list())
project.export_snapshot_download(snapshot['id'])
This approach allows asynchronous downloading of large datasets, facilitating a seamless and uninterrupted export process. These snapshots can also be tailored using filters, adding to their utility. For example, you might want to export only the newly added data since your last export for ongoing training. This level of customization in data exports makes the SDK a powerful tool in your machine-learning arsenal.
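Because snapshots are prepared asynchronously, a script may need to wait until the snapshot is ready before downloading it. Below is a minimal, generic polling sketch; `wait_for` is a hypothetical helper, and the commented wiring assumes your SDK version exposes `export_snapshot_status` with an `is_completed()` method, which you should verify against the version you have installed.

```python
import time


def wait_for(is_done, timeout=300.0, interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Call is_done() repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if the timeout expired first.
    """
    deadline = clock() + timeout
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in for a real status check: reports "done" on the third call.
calls = {"count": 0}


def fake_snapshot_ready():
    calls["count"] += 1
    return calls["count"] >= 3


finished = wait_for(fake_snapshot_ready, timeout=5, interval=0)
print("snapshot ready:", finished)

# Assumed wiring with the SDK (check that these methods exist in your version):
# ready = wait_for(lambda: project.export_snapshot_status(snapshot['id']).is_completed())
# if ready:
#     project.export_snapshot_download(snapshot['id'])
```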
These five tips show how much flexibility Label Studio's API and SDK offer for managing data labeling projects. From project creation and task imports to generated label configurations and bulk data exports, they give both beginners and advanced users a streamlined way to work. We hope these tips improve your workflow, save time, and help you prepare high-quality data for your machine-learning projects, whether you're dealing with large datasets or need specific configurations.