Data Discovery (beta)

Surface highly relevant data to label with a few clicks

Spend less time looking for data diamonds in the rough. Get a powerful data discovery interface with advanced vector search and proven labeling workflows.

Data selection is the secret to your projects’ success

It’s the difference between AI projects that see minimal to no impact and initiatives that make it to production and deliver results.

Intuitive data discovery combined with first-class labeling workflows

Curate the right text or image dataset for your model. Understand the composition and quality of your data. Generate data-driven insights that make an impact.

Start by connecting your unstructured data

Connect, manage, and explore all your cloud-based datasets. Data Discovery eliminates the need to build your own traditional indexing and embedding infrastructure to index and add context to your unstructured data.

  • Integrates seamlessly with GCP, AWS, and Azure storage
  • Generates embeddings to uncover similar data points
  • Supports metadata or context on imported data (coming soon)

Search through all your data in seconds

Finding the most impactful and relevant data to label first is key to reducing data preparation headaches. Data Discovery helps you surface data that matches the exact conditions your task requires and send it to your labeling team in just a few clicks.

  • Use semantic search for intuitive, natural language queries
  • Or find similar images using a reference image
  • Combine search methods for more complex and focused tasks
  • Create desired data subsets with similarity filtering
  • Send the data you find as tasks to new or existing projects
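
To make the search methods above concrete, here is a minimal sketch of embedding-based search in plain Python. It is an illustration only, not how Data Discovery is implemented: the three-dimensional vectors are hand-written stand-ins for model-generated embeddings, and the function names (normalize, cosine, combine, search) and the min_score threshold are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def combine(queries):
    """Average several unit-length query embeddings into one combined query."""
    unit = [normalize(q) for q in queries]
    return [sum(dim) / len(unit) for dim in zip(*unit)]

def search(index, query, min_score=0.0, top_k=None):
    """Rank items by similarity to the query, keeping those at or above min_score."""
    scored = [(item_id, cosine(vec, query)) for item_id, vec in index.items()]
    kept = [(item_id, score) for item_id, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k] if top_k is not None else kept

# Hand-written toy embeddings; a real index would hold model-generated vectors.
index = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.1, 0.9, 0.0],
    "invoice_scan": [0.0, 0.1, 0.9],
}
text_query = [1.0, 0.0, 0.0]       # stands in for an embedded text query
reference_image = [0.8, 0.2, 0.0]  # stands in for an embedded reference image

# Combine both search methods, then keep only close matches as a subset.
query = combine([text_query, reference_image])
subset = [item_id for item_id, _ in search(index, query, min_score=0.5)]
```

Raising min_score narrows the subset to near-duplicates of the query, while lowering it widens the net; top_k caps how many items are returned.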

Find data to remedy underperforming classes

Labeling your initial ground truth is only step one. Testing your model on diverse scenarios and then iterating on your data is critical. Data Discovery makes identifying data relevant to your underperforming classes easy.

Can’t find enough examples to label? Now you know exactly what to collect. More insights can be found from:

  • Label distribution charts to find underrepresented classes
  • Active learning workflows to identify impactful data points
  • Exporting labels to visualization tools for analysis

Use these insights to refine the composition of your datasets and improve model performance.
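
As a rough sketch of two of the ideas above, the snippet below computes a label distribution to flag underrepresented classes and ranks unlabeled items by prediction uncertainty, one common active learning heuristic. The class names, probabilities, the 10% threshold, and the helper names are all made up for illustration; they do not reflect how the product computes its charts or workflows.

```python
import math
from collections import Counter

def underrepresented(labels, min_share=0.1):
    """Return classes whose share of all labels falls below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted(c for c, n in counts.items() if n / total < min_share)

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(predictions, k):
    """Pick the k item ids whose predicted distributions have the highest entropy."""
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy labeled set: "bicycle" makes up only 1 of 20 labels.
labels = ["car"] * 12 + ["truck"] * 7 + ["bicycle"] * 1
rare = underrepresented(labels, min_share=0.1)

# Toy model outputs for unlabeled items (class probabilities per item).
predictions = {
    "img_1": [0.98, 0.01, 0.01],
    "img_2": [0.40, 0.35, 0.25],
    "img_3": [0.70, 0.20, 0.10],
}
to_label_next = most_uncertain(predictions, k=2)
```

Here rare points to the class that needs more examples, and to_label_next lists the items the model is least sure about, which are often the most useful ones to label next.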

Label and manage it all with one platform

Once you’ve found your diamonds in the rough, you’ll want to ensure they’re accurately labeled. Data Discovery is seamlessly integrated with all the labeling, QA, and management workflows you need to train, fine-tune, and validate models.

  • Label efficiently with automated and human-in-the-loop tools
  • Maintain quality with reviewer workflows and dashboards
  • Centrally manage all your data, projects, and teams
  • Keep your data secure with strict access controls

Data Discovery Explained

What is data discovery?

Data discovery is part of the data sourcing process and usually occurs after data collection. It generally involves cataloging and classifying gathered data for various purposes, such as analysis or annotation.

In the ML/AI space, data discovery workflows and tools help teams find relevant data for labeling. This covers a full spectrum of tasks, from finding data to include in your initial ground truth dataset to finding specific data points to remedy underperforming classes or address edge cases. Typically, a data discovery tool consists of embedding generation, a catalog view for exploring data in a visual manner, and semantic search and filtering.

How can I use and benefit from Data Discovery?

When you connect your unstructured data, Data Discovery automatically generates embeddings for it. This makes your data indexable and searchable based on similarity between data points. Once that's done, you can use Data Discovery to search for data with a natural language query or 'find similar images,' send the data you've uncovered to your team for labeling as a task in a new or existing project, and more.

This helps you significantly reduce the time and costs associated with data preparation tasks like finding, cataloging, and sending data for labeling. It also helps you improve the composition and relevance of your training datasets. Insights you uncover with Data Discovery can help drive more strategic decision-making around both collecting and labeling data.

What data inputs are supported?

Data Discovery currently supports both image and text data.

Can I use this to manage data I don’t intend to label?

Certainly! While Data Discovery was designed with dataset development workflows in mind, it can also be used for any application where indexing and understanding the semantic relationships between unstructured data would be helpful.

Which foundational model is used?

Embeddings are foundational to Data Discovery. We index and automatically categorize the unstructured data you connect using embeddings. These embeddings provide information on the potential relationships between different data points. This makes it possible to search, catalog, and label your data. Embeddings are currently generated for your data using an off-the-shelf CLIP model and require no engineering on your part. We may add support for additional models in a future release of Data Discovery.

Can I bring my own embeddings or models?

Not at this time. However, we may add support for bringing your own embeddings or embedding models in a future release of Data Discovery.

Is my data secure?

Yes! Your data is brought into memory only temporarily, to index it and generate embeddings, and is then securely purged. Our infrastructure is also SOC2 and HIPAA certified, so you can rest assured your data is always secure.

How is this different from the ML Backend workflow?

Data Discovery is the newest beta feature in Label Studio Enterprise. It is integrated directly into the wider platform and UI itself. This provides powerful semantic search that makes it easier than ever to find, catalog, and label high-impact data.

Our ML Backend workflow allows you to integrate popular models like Segment Anything, YOLO, and BERT, as well as custom models, to automate your data labeling workflow. This allows you to perform AI-assisted labeling tasks, including pre-labeling for human-in-the-loop reviews, interactive labeling, and active learning tasks to accelerate your labeling projects.
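
To give a feel for what pre-labeling produces, here is a minimal sketch that wraps a model's output in the task-with-predictions JSON shape that Label Studio accepts for pre-annotated data. It assumes a text classification project whose labeling configuration names its Choices tag "sentiment" and its Text tag "text"; those names, the toy_model keyword rule, and the "demo-v1" version string are hypothetical and would need to match your own project and model.

```python
import json

def toy_model(text):
    """Hypothetical stand-in for a real model: returns (label, confidence)."""
    if "great" in text.lower():
        return "Positive", 0.9
    return "Negative", 0.6

def to_prediction(label, score, from_name="sentiment", to_name="text",
                  model_version="demo-v1"):
    """Wrap one predicted class as a prediction for a Choices-style project."""
    return {
        "model_version": model_version,
        "score": score,
        "result": [{
            "from_name": from_name,  # assumed name of the Choices tag
            "to_name": to_name,      # assumed name of the Text tag
            "type": "choices",
            "value": {"choices": [label]},
        }],
    }

def build_task(text):
    """Build one importable task that carries the model's pre-label."""
    label, score = toy_model(text)
    return {"data": {"text": text}, "predictions": [to_prediction(label, score)]}

tasks = [build_task(t) for t in ["This tool is great", "Setup took too long"]]
payload = json.dumps(tasks, indent=2)
```

Reviewers then see the suggested label on each task and can accept or correct it, which is the human-in-the-loop step described above.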

What does Data Discovery cost?

Data Discovery is available free for all Label Studio Enterprise customers for the duration of the beta. This means you can use and benefit from this feature at no additional cost. Since this is a beta, not everything will be perfect yet. But with your help, we can make Data Discovery as powerful and easy to use as possible. Your feedback and requests are welcomed and encouraged, so please contact your customer success manager or email us at

How can I start using Data Discovery?

Getting started is super easy! If you're a current Label Studio Enterprise customer, all you need to do is reach out to your customer success manager, or email us at You'll receive a personalized onboarding and walkthrough, along with any additional assistance you may need as you get going with the beta.

Once you've been onboarded, you'll be able to find Data Discovery through the main interface in the left-hand navigation bar. You'll find it listed as Datasets (Beta) under the Projects tab.

Not an enterprise customer? We'd be happy to give you a personalized demo and show you how we can help solve your greatest data-related challenges.

Try Data Discovery Today

Data Discovery is now in open beta for all enterprise customers. Join us today to get exclusive access to this new feature.