
Manage Quality at Scale with Super-granular Agreement

Key takeaways

  • Agreement is how you align models to human judgment, and it’s a critical part of quality assurance workflows as you work toward business outcomes
  • Label Studio Enterprise gives you the most flexible and granular agreement metrics out of the box
  • With more granular agreement metrics, you can spot issues and take action sooner to ensure quality at scale

You need sharper agreement metrics to manage quality at scale

Agreement isn’t just a proxy for quality. It’s how you align models to human judgment.

When you’re working with high-stakes or subjective data, you need more annotators and subject-matter experts in agreement to align the models. Real-time visibility is critical so you can catch issues early, before they threaten model outcomes.

Teams evaluating genAI output need to account for nuance and more subjective judgments. As agents evolve, labeling interfaces become increasingly complex. The more you learn about your agents—their capabilities and their failure states—the more you need to adapt your labeling interfaces. The subjective nature of agent evaluation means consensus agreement is the only practical way to manage toward taste and quality.

In either scenario, what’s important is confirming that annotators are converging on the same answer. Whenever consensus agreement is low, you need to take action: bring in more subject-matter experts, provide better instructions, or rework the task so it’s less ambiguous. Otherwise, you can’t reliably use that data for training and fine-tuning.
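
To make that concrete, here’s a minimal sketch in Python (hypothetical data and threshold; not Label Studio’s internal formula) of one common way to score consensus: the share of annotators who match the modal answer, checked against a threshold you set.

```python
from collections import Counter

def consensus_score(labels: list[str]) -> float:
    """Fraction of annotators who agree with the most common label."""
    if not labels:
        return 0.0
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)

# Three tasks, each labeled by four annotators (hypothetical data).
tasks = {
    "task-1": ["toxic", "toxic", "toxic", "toxic"],  # full convergence
    "task-2": ["toxic", "safe", "toxic", "toxic"],   # mostly converged
    "task-3": ["toxic", "safe", "unsure", "safe"],   # needs attention
}

THRESHOLD = 0.75  # illustrative; tune per project
for task_id, labels in tasks.items():
    score = consensus_score(labels)
    action = "ok" if score >= THRESHOLD else "review instructions / add SMEs"
    print(f"{task_id}: {score:.2f} -> {action}")
```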

Most systems for agreement scoring give you a blunt signal of whether annotators are miscalibrated, but they don’t point you to the specific area where agreement is low.

Is one question on the task dragging the whole score down? Does a specific labeling instruction need to be rewritten? Chasing those answers has meant following hunches, interviewing annotators, and manually sifting through data: hours lost following breadcrumbs or falling down the wrong rabbit holes.

Now Label Studio Enterprise gives you more granular metrics so you can compare agreement at the question level, not just for the entire task. This makes it easier to quickly spot where disagreement happens so you can diagnose issues and manage quality.

What’s new

In this release, we’re introducing:

  • New question-level agreement metrics
  • Configurable methodologies and thresholds for consensus and pairwise agreement
  • Fine-grained control for agreement comparisons across humans, models, and ground truth

Let’s look at each of the updates in more detail.

New question-level agreement metrics

Now when you configure agreement in Label Studio Enterprise, you will find columns for the questions within tasks. This per-question view of agreement helps you spot where divergence is happening more quickly.
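
To illustrate why the per-question view matters, here’s a rough sketch with hypothetical data (again, not the product’s exact math): agreement is scored per question, and a single ambiguous question is what drags the task-level number down.

```python
from collections import Counter

def consensus_score(labels):
    """Fraction of annotators who agree with the most common answer."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Four annotators answered three questions on the same task (hypothetical).
answers_by_question = {
    "sentiment": ["positive", "positive", "positive", "positive"],
    "topic":     ["billing", "billing", "billing", "billing"],
    "sarcasm":   ["yes", "no", "no", "yes"],  # the ambiguous question
}

per_question = {q: consensus_score(a) for q, a in answers_by_question.items()}
task_level = sum(per_question.values()) / len(per_question)

for question, score in per_question.items():
    print(f"{question:9s} {score:.2f}")
print(f"task      {task_level:.2f}")  # 0.83: sarcasm (0.50) is the drag
```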

Read more in the docs.

Configurable methodologies for consensus and pairwise

Now when you configure agreement in Label Studio Enterprise, you can choose between two modes: consensus or pairwise. We recommend consensus agreement for high-volume, high-throughput projects with more than two annotators, where you want to understand convergence.
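
To see how the two modes differ, here’s a quick sketch under simplifying assumptions (categorical labels, exact match): pairwise averages agreement over every pair of annotators, while consensus measures alignment with the majority answer.

```python
from collections import Counter
from itertools import combinations

labels = ["safe", "safe", "toxic", "safe"]  # four annotators, one task

# Pairwise: average exact-match agreement over all annotator pairs.
pairs = list(combinations(labels, 2))
pairwise = sum(a == b for a, b in pairs) / len(pairs)

# Consensus: share of annotators matching the modal answer.
consensus = Counter(labels).most_common(1)[0][1] / len(labels)

print(f"pairwise:  {pairwise:.2f}")   # 3 matching pairs of 6 -> 0.50
print(f"consensus: {consensus:.2f}")  # 3 of 4 match the majority -> 0.75
```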

For each metric at the question or task level, you can specify the methodology you want to use, such as exact match, intersection over union (IoU), or Jaccard similarity with thresholds.
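
As a rough illustration of those methodologies (a sketch, not Label Studio’s implementation), here’s how an IoU check for bounding boxes and a Jaccard check for label sets might work, with a threshold deciding whether two annotations count as agreeing:

```python
def iou(box_a, box_b):
    """Intersection over union for (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

def jaccard(set_a, set_b):
    """Jaccard similarity between two sets of labels."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

THRESHOLD = 0.5  # two annotations "agree" if similarity clears this bar
print(iou((0, 0, 10, 10), (5, 5, 15, 15)) >= THRESHOLD)              # False (~0.14)
print(jaccard({"dog", "cat"}, {"dog", "cat", "bird"}) >= THRESHOLD)  # True (~0.67)
```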

These new methodologies give you maximum flexibility to measure agreement even when data is complex.

Learn more in the docs.

Fine-grained comparisons in Data Manager

Inter-annotator agreement lets you compare agreement between humans. But what if you need to compare humans vs. models? Or models vs. ground truth? LLMs vs. LLMs?

Along with the more granular agreement metrics, we’ve also added the ability to choose which annotators and/or models you include in your agreement calculations. Now you can slice, dice, and compare agreement across any combination of annotators, models, or LLMs. This unlocks some powerful use cases (one is sketched after the list below), such as:

  • LLMs as a jury
  • Model vs. model comparisons for challenger vs. champion testing
  • Model vs. ground truth to assess model performance and understand exactly where your model diverges
  • Annotator vs. ground truth so you can provide coaching or better instructions
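
For instance, a model-vs-ground-truth comparison is just an agreement calculation where one “annotator” is your model and the reference is your ground-truth set. A minimal sketch with hypothetical data:

```python
# Hypothetical predictions and ground truth, scored per question
# to show exactly where the model diverges.
predictions = {
    "task-1": {"intent": "refund",  "urgency": "high"},
    "task-2": {"intent": "billing", "urgency": "low"},
    "task-3": {"intent": "refund",  "urgency": "low"},
}
ground_truth = {
    "task-1": {"intent": "refund",  "urgency": "high"},
    "task-2": {"intent": "billing", "urgency": "high"},  # urgency diverges
    "task-3": {"intent": "refund",  "urgency": "low"},
}

for question in ("intent", "urgency"):
    matches = sum(
        predictions[t][question] == ground_truth[t][question] for t in predictions
    )
    print(f"{question:8s} agreement: {matches}/{len(predictions)}")
# intent   agreement: 3/3
# urgency  agreement: 2/3  -> urgency is where the model needs work
```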

Agreement out of the box

We hear this from data teams all the time: they’re already calculating agreement, but they’re doing it separately with a lot of manual labor. When agreement is treated as a custom solution, it’s something you have to build and manage on your own.

Often this requires duplicating data and coding custom workflows just to run calculations. Not only is this time-consuming but it leaves a lot of room for things to go wrong as you scale up operations.

Custom agreement calculations often get built on the periphery, in a vacuum, so they’re not connected to your labeling workflows. You have to traverse back and forth between notebooks and projects to put those calculations to use.

In Label Studio Enterprise, Agreement is a first-class feature. Everything is available out of the box, integrated directly into the Data Manager and your existing quality workflows. You don't need to build and manage a totally separate system to get actionable agreement data for quality.

Get started today

Super-granular agreement metrics are available today in Label Studio Enterprise. You’ll find the new columns in your project's Quality settings and in Data Manager.

If you'd like a walkthrough of how to set up consensus agreement or configure question-level metrics for your specific use case, contact us or reach out to your account manager. We’d love to hear from you.

Want to learn more? Join us for a webinar on March 26th at 11:30 AM EDT. We’ll walk you through how to pick the right methodologies and incorporate them into your quality workflows.
