As large language models become more central to labeling workflows, one question comes up again and again: how do you know if a model is performing well for your specific use case? Whether you're using an LLM to pre-label data, auto-generate annotations, or assist human reviewers, evaluating model outputs is critical to building trust and scaling your system with confidence.
In this blog, we’ll explore two approaches for assessing model quality in Label Studio. The first relies on ground truth data to generate accuracy scores. The second compares model predictions in cases where ground truth is unavailable, helping you identify gaps and make informed decisions about which models to use or fine-tune.
Using LLMs to label data offers a major speed boost, but without a clear evaluation method, it's easy to lose track of quality. Manual inspection is time-consuming and subjective. What’s needed is a structured, repeatable way to compare outputs, both against existing annotations and across different models or prompt versions.
Label Studio’s Prompts feature makes this possible. It allows users to:
- run LLM prompts over project tasks to generate predictions
- compare those predictions against ground truth annotations, or against another model's outputs
- create and track prompt versions, each with its own model, instructions, and results
- monitor agreement scores and inference costs as they iterate
With Prompts, you get a complete loop for testing, improving, and operationalizing LLM-based labeling.
When you already have a labeled dataset, evaluating a model is straightforward. In Label Studio, this starts by marking a subset of tasks as ground truth. You can then run a prompt using a selected base model and compare the model's predictions to the ground truth annotations.
Label Studio displays both outputs side by side, highlighting differences. It also calculates an agreement score for each task, as well as a macro-level score across all evaluated tasks. This helps you answer questions like:
- Is this model accurate enough for my use case?
- On which tasks does it disagree with human annotators?
- Does a new prompt version or base model actually improve results?
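To make the scoring concrete, here is a minimal sketch of how a per-task agreement and a macro-level score can be computed for classification-style outputs. It is a generic illustration that assumes each task reduces to a single predicted label and a single ground-truth label; it is not Label Studio's internal implementation.

```python
# Minimal sketch: per-task exact-match agreement plus a macro-level average.
# Assumes each task has one predicted label and one ground-truth label.

def task_agreement(prediction: str, ground_truth: str) -> float:
    """Exact-match agreement for one task: 1.0 if identical, else 0.0."""
    return 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0

def macro_agreement(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Average per-task agreement across all tasks that have ground truth."""
    scored = [
        task_agreement(predictions[task_id], label)
        for task_id, label in ground_truth.items()
        if task_id in predictions
    ]
    return sum(scored) / len(scored) if scored else 0.0

# Example: three ground-truth tasks, one disagreement -> macro score of 0.67
preds = {"t1": "cat", "t2": "dog", "t3": "bird"}
truth = {"t1": "cat", "t2": "dog", "t3": "fish"}
print(round(macro_agreement(preds, truth), 2))  # 0.67
```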
You can also evaluate multiple models or prompt variations within the same interface by creating new prompt versions. Each version is stored with its associated model, instruction, and results, making it easy to track what changes had the biggest impact.
In the live workshop, the team walked through a real-world image captioning project. The dataset contained 200 images that needed both classification and descriptive captions. Rather than labeling each image from scratch, the team used Prompts to test two models—GPT-4 and Gemini 1.5 Pro—against a small ground truth set.
The two models produced similar accuracy scores, but Gemini delivered them at a significantly lower cost. This allowed participants to see firsthand how tradeoffs between performance and price can shape production decisions.
When ground truth isn't available, Label Studio still gives you a way to evaluate. Instead of comparing model outputs to existing annotations, you compare two different model predictions to each other.
The steps look like this:
1. Run a prompt with one model to generate predictions for your tasks.
2. Convert those predictions into annotations.
3. Run a second model (or a new prompt version) on the same tasks.
4. Compare the second model's predictions against the first model's annotations to produce agreement scores.
This approach surfaces the tasks where the two models disagree. Those tasks often represent edge cases, ambiguous inputs, or gaps in model coverage.
From there, you can prioritize human review on the most problematic tasks or convert them into new ground truth examples. This iterative workflow helps you bootstrap evaluation even in early-stage projects.
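As a rough sketch of the underlying idea, the comparison amounts to flagging task IDs where the two sets of predictions differ. The task IDs and labels below are hypothetical, and the function is a generic illustration rather than a Label Studio API.

```python
# Hypothetical model-vs-model comparison when no ground truth exists.
# Task IDs and labels are illustrative placeholders.

def find_disagreements(model_a: dict[str, str], model_b: dict[str, str]) -> list[str]:
    """Return task IDs where the two models produced different labels."""
    shared = set(model_a) & set(model_b)
    return [tid for tid in sorted(shared) if model_a[tid] != model_b[tid]]

model_a = {"t1": "billing", "t2": "support", "t3": "sales"}
model_b = {"t1": "billing", "t2": "sales",   "t3": "sales"}

review_queue = find_disagreements(model_a, model_b)
print(review_queue)  # ['t2'] -> prioritize these tasks for human review or new ground truth
```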
Another example covered in the session was a text-based project involving conversation transcripts. The task was to detect PII and summarize call topics. Two models, Claude Haiku and GPT-4, were run on the same inputs, and their outputs were compared.
By converting predictions into annotations and scoring agreement, the team quickly pinpointed where the models aligned and where they diverged. This helped highlight sensitive cases that warranted closer inspection and potential ground truth creation.
When evaluating models, performance isn’t the only metric that matters. Label Studio also tracks inference cost, giving you visibility into what it would take to scale a particular model across your full dataset.
For example, if two models have similar agreement scores but one is three times more expensive, that may influence which model you deploy for production pre-labeling or auto-annotation. Being able to balance quality and cost is essential when scaling human-in-the-loop systems.
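As a back-of-the-envelope illustration, projecting per-task cost across the full dataset makes the tradeoff easy to see. The agreement scores and per-task costs below are made-up numbers for the sketch, not figures from the webinar.

```python
# Illustrative comparison of quality vs. projected cost at dataset scale.
# All numbers are hypothetical.

candidates = {
    "model_a": {"agreement": 0.91, "cost_per_task": 0.0030},
    "model_b": {"agreement": 0.89, "cost_per_task": 0.0010},
}
dataset_size = 200_000  # tasks you plan to pre-label at scale

for name, stats in candidates.items():
    projected_cost = stats["cost_per_task"] * dataset_size
    print(f"{name}: agreement={stats['agreement']:.2f}, "
          f"projected cost=${projected_cost:,.0f}")

# A 2-point agreement gap against a 3x cost difference is often an easy call,
# but the right tradeoff depends on how errors are caught downstream.
```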
The workflows described above work for both text and image tasks and support a wide range of annotation types, including:
- text and image classification
- image captioning and text summarization
- named entity recognition, such as PII detection
Prompts can be used with standard LLMs or connected to custom models hosted in your own environment. You can also customize agreement metrics in the settings, allowing for more advanced evaluations beyond simple accuracy.
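For instance, a custom metric might award partial credit on free-text outputs instead of requiring an exact match. The token-level F1 below is a generic sketch of that idea, not Label Studio's built-in agreement metric.

```python
# Sketch of a custom agreement metric: token-level F1 rather than exact match,
# which is often more appropriate for free-text outputs such as captions or summaries.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """F1 overlap between the token multisets of two strings."""
    pred_tokens = Counter(prediction.lower().split())
    ref_tokens = Counter(reference.lower().split())
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("a dog running on the beach",
                     "a dog runs along the beach"), 2))  # partial credit, ~0.67
```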
Evaluating and comparing models shouldn't require writing scripts or manually reviewing every output. Prompts in Label Studio give teams a fast, structured way to test models, adjust prompts, and build better labeling workflows over time.
Whether you're working with ground truth or starting from scratch, you can use these tools to build trust in your model’s predictions and speed up your data pipeline with confidence.
Watch the full webinar below: