Evaluating generative AI tools is hard. Generative models can give you a different answer each time, even when the inputs are identical. Ground truth isn’t clearly defined up front, and “correct” can mean something different each time, because answers need to be evaluated along many different dimensions. Relying on a single annotator to make these judgments often introduces bias into the data.
Moreover, in data as complex as that produced by generative AI, different people will likely evaluate each answer differently. Disagreement isn’t an edge case. It’s to be expected.
Instead of treating disagreement as noise to eliminate, high-performing teams treat it as a signal for defining quality. Consensus workflows are designed to capture and structure that signal.
Unlike pairwise agreement, which tells you how individual annotators perform relative to one another, consensus methodologies help you understand exactly where annotators do and do not converge on a single answer, whether for a given dimension or across a whole task.
In this way, consensus doesn’t just improve label accuracy. It transforms disagreement into structured insight, turning subjective judgments into something teams can measure, analyze, and iterate on.
Let’s dive into two cases where consensus is an especially powerful tool for Generative AI: LLM-as-a-Jury, and understanding output performance.
You may have heard of LLM-as-a-Judge, the technique where you use one LLM to judge the output of another model. LLM-as-a-Jury is a similar technique, but uses a panel of LLMs to judge a single output. The panel helps counter the run-to-run variability you’d see from any single LLM. It can also help you pick which LLM is right for a task, or use specialty LLMs for labeling and see where they converge.
When using LLM-as-a-Jury, consensus agreement helps us understand how multiple different LLMs have interpreted a single task. Where pairwise agreement might show us that a single model is out of step with the others, consensus gives us a quick way of knowing how much the LLMs have agreed with each other. This score can also be interpreted as the final LLM-as-a-Jury score.
When the jury is evaluating the output of a different genAI tool, things get a little more complicated than if the jury were just evaluating the output of something simple, like a classification model. With classification models, you’re looking for a “right” or “wrong”, a binary judgment call. With a genAI tool, you’re much more likely to be evaluating against a rubric, or a set of criteria that determines the quality of the output.
Let’s look at an example. Say that your original LLM was generating a paper based on a provided topic.
Your LLM-as-a-Jury is responsible for grading the paper against the following rubric:

- Does the paper address the provided topic? (yes/no)
- Is the answer clear? (yes/no)
- How good is the overall quality of the answer? (1–5 stars)
The prompt for the LLM jury will need to outline the criteria for each question on the rubric, and we’ll parse the response from each LLM juror to get the type of answer we need for each item on the rubric.
Imagine that we got the following outputs from a group of three jury LLMs:

- Addresses the topic: Yes / Yes / Yes
- Answer is clear: Yes / Yes / No
- Answer quality: 3 stars / 4 stars / 5 stars
The pairwise agreement score for this task would be 44%, while the consensus agreement is 66%. What does that tell us about the quality of the LLM jury? What should the answer be?
Think about the goal of LLM-as-a-Jury. Ultimately, we want to know how good the original model’s response (the generated paper) was. Agreement tells us how similar the answers from each juror were. The higher this agreement score, the more likely it is that the LLM jurors have agreed on the answer, making it trustworthy.
A pairwise score of 44% seems pretty low, especially considering that on two of the three questions, at least two jurors agree. A consensus score of 66% is a little more intuitive, telling us that, on average, about ⅔ of the jurors agree with each other.
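To make the two scores concrete, here is a small Python sketch that reproduces the 44% and 66% figures. The question names and juror answers are illustrative, and the calculation is the simple pair-counting and majority-share version, which may differ in detail from Label Studio’s exact implementation:

```python
from collections import Counter
from itertools import combinations

# Illustrative juror answers for the three rubric questions.
answers = {
    "addresses_topic": ["yes", "yes", "yes"],
    "is_clear":        ["yes", "yes", "no"],
    "quality_stars":   [3, 4, 5],
}

def pairwise_agreement(votes):
    """Average over all annotator pairs: did the pair give the same answer?"""
    pairs = list(combinations(votes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def consensus_agreement(votes):
    """Fraction of annotators who chose the most common answer."""
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes)

# Task-level scores: average the per-question scores.
pairwise = sum(pairwise_agreement(v) for v in answers.values()) / len(answers)
consensus = sum(consensus_agreement(v) for v in answers.values()) / len(answers)
print(f"pairwise ≈ {pairwise:.1%}, consensus ≈ {consensus:.1%}")
# prints "pairwise ≈ 44.4%, consensus ≈ 66.7%"
```

The gap between the two scores comes entirely from the quality question: with all three star ratings different, every pair disagrees (0% pairwise), but the “largest camp” still contains one of three jurors (33% consensus).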
The updated agreement in Label Studio also allows us to break down the agreement for each task into its dimensions, or the individual questions that make up a task. For consensus agreement on the example above, we get the following:

- Addresses the topic: 100%
- Answer is clear: 66%
- Answer quality: 33%
This is an especially helpful view of agreement to have. Now we can see that all three LLMs agree on whether the paper addresses the topic, two agree on whether the answer is clear, and none agree on the answer quality. Not only does this give us insight into how much we can trust our LLMs, it also tells us that there’s something fundamentally wrong with how we’re asking the LLMs to evaluate quality. This makes sense, because the quality question uses a 5-star rating scale, which LLMs are notoriously bad at. Seeing no consensus at all tells us that we need to fix our prompt or the question itself.
When it comes to Generative AI, there’s a decent amount of individual bias that comes into choosing a preferable answer. While most people can agree about factuality, elements like tone and style are much more subjective.
Agreement, as it’s implemented in Label Studio Enterprise, can help us better understand the opinions of our subject matter experts on this type of subjective data. Let’s say we have a task where humans are asked to pick their favorite of two possible answers. In this case, two of our three people have picked the same winner.
A pairwise score, which calculates the average agreement of every possible pair of annotators, would say that the agreement on this task is only 33%, because only 1 out of 3 possible pairs of annotators agree. This can be useful, because it tells us that one of the annotators is severely out of alignment with the other two, but doesn’t give us good insight into which answer people prefer. Consensus agreement, on the other hand, would give us a score of 66%, because 2 out of 3 people chose the same answer. This is what we’re looking for – an understanding of how people might prefer one LLM answer over another.
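As a quick sanity check on those numbers, here is a minimal Python sketch of the preference task. The answer labels are illustrative, and this is the simple pair-counting and majority-share calculation rather than a reproduction of Label Studio’s internals:

```python
from itertools import combinations

# Three annotators each pick their favorite of two candidate answers.
preferences = ["answer_a", "answer_a", "answer_b"]

# Pairwise: only 1 of the 3 possible annotator pairs agree.
pairs = list(combinations(preferences, 2))
pairwise = sum(a == b for a, b in pairs) / len(pairs)

# Consensus: 2 of the 3 annotators chose the majority answer.
winner = max(set(preferences), key=preferences.count)
consensus = preferences.count(winner) / len(preferences)

print(winner, round(pairwise, 2), round(consensus, 2))
# prints "answer_a 0.33 0.67"
```

The consensus score directly answers the question we care about here, which answer people prefer and by what margin, while the pairwise score only tells us how internally consistent the annotator pool is.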
When we ask humans to give a preference, we also want to know why they have that preference. Maybe it has to do with tone and style, or factuality, or something else. By breaking out our agreement score across all the questions within a task, we can also quickly identify the strengths and weaknesses of our GenAI model. High agreement in a factuality category means that we can trust the annotators when they say that something is factual, while low agreement in tone might mean that we need to re-educate our annotators on the aspects of tone that we care about.
Both consensus and pairwise agreement scores tell us important information about the people or models evaluating Generative AI outputs and the quality of the data that we get back from each. Knowing when to use which type of agreement, and how to understand it, is crucial to a robust and intelligent understanding of Generative AI outputs.
Ready to apply this to your project? Consensus agreement is available today on Label Studio Enterprise. Learn more.
Already on Enterprise? Play around with the different types of agreement, available in the quality settings within your project, and learn for yourself about the power of strong, configurable, granular agreement scores.