The idea of using a large language model (LLM) to evaluate the output of another LLM has been gaining traction. It’s easy to see why: manual evaluation is slow and expensive. LLM-as-a-Judge is faster, cheaper, and doesn’t get tired. But can it be trusted?
In this post, we explore the tradeoffs and common pitfalls of the approach, and look at how to evaluate the evaluator. If you're going to rely on a model’s judgment, you'd better make sure it's up to the task.
Traditionally, model evaluation starts with people. You take a slice of your data, label the ideal outputs by hand, and set that aside as your test set. This becomes the benchmark for Precision, Recall, F1, and other metrics.
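To make that concrete, here is a minimal sketch in Python of scoring a model against such a gold set, assuming a binary labeling task and scikit-learn for the metrics; the label lists are placeholders for your own data.

```python
# Minimal sketch: score a model against a hand-labeled gold set.
# Assumes binary labels (1 = acceptable, 0 = not); the lists below are
# placeholders for real annotations and real model outputs.
from sklearn.metrics import precision_recall_fscore_support

gold = [1, 0, 1, 1, 0, 1, 0, 0]       # human-annotated ideal labels
predicted = [1, 0, 1, 0, 0, 1, 1, 0]  # labels derived from model outputs

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, predicted, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```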
But human judgment takes time. And time costs money, especially when tasks are complex or require domain expertise. Even your most efficient annotators will need a few seconds per task. Multiply that across a large dataset and the costs add up quickly: at ten seconds per task, a 100,000-example dataset works out to nearly 280 hours of annotation time.
So what if we could offload some of that burden to machines?
The idea behind LLM-as-a-Judge is straightforward: if we can generate outputs with LLMs, maybe we can use them to score outputs too. The model could rate the relevance, correctness, or fluency of a given answer, mimicking human evaluation but at machine speed.
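A minimal sketch of what that could look like, assuming the OpenAI Python SDK and an API key in the environment; the judge model, prompt wording, and 1-5 scale are illustrative choices, not a fixed recipe.

```python
# LLM-as-a-Judge sketch: ask one model to grade another model's answer.
# The model name and 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer to a user question.
Question: {question}
Answer: {answer}
Rate the answer's relevance and correctness on a scale of 1-5.
Reply with only the number."""

def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())

print(judge("What is the capital of France?", "Paris is the capital of France."))
```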
In theory, this offers scalability without sacrificing quality. In practice, it's not that simple.
When you rely on a model to grade another model’s work, you're introducing a new layer of uncertainty. Several well-documented biases make this risky: position bias (favoring an answer simply because of where it appears in the prompt), verbosity bias (rewarding longer responses regardless of substance), and self-preference bias (rating outputs from the judge’s own model family more generously).
These tendencies mean that what looks like a reliable evaluation may be skewed in subtle, systemic ways.
If you're going to trust an LLM to score your outputs, you first need to test how well it performs.
Start by having humans annotate a small test set, just like you would for any model. Then compare the LLM judge’s scores to your human gold standard. Where do they align? Where do they diverge? This not only gives you confidence in the judge, it helps you tune your prompts and surface patterns of error.
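One way to quantify that comparison is sketched below, assuming binary verdicts and scikit-learn; the human and judge label lists are placeholders for your own annotations. Chance-corrected agreement such as Cohen's kappa is a useful complement to raw agreement, since a judge that always says "acceptable" can look deceptively aligned.

```python
# Compare the LLM judge's verdicts against human gold labels.
# Placeholder data: 1 = "acceptable", 0 = "not acceptable".
from sklearn.metrics import cohen_kappa_score, confusion_matrix

human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # gold standard from annotators
judge = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]  # LLM judge on the same items

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)  # agreement corrected for chance

print(f"raw agreement: {agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
print(confusion_matrix(human, judge))  # shows where the judge diverges from humans
```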
Don’t just ask if an answer is “correct.” Check for clarity, relevance, accuracy, and completeness. A more holistic scoring rubric helps you spot where your model or your judge is falling short.
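One possible shape for such a rubric is sketched below; the criteria, the 1-5 scale, and the JSON reply format are assumptions, not a standard.

```python
# Sketch of a multi-criterion rubric instead of a single "correct?" check.
# The criteria, descriptions, and JSON reply format are illustrative.
import json

RUBRIC = {
    "clarity": "Is the answer easy to follow?",
    "relevance": "Does it address the question that was asked?",
    "accuracy": "Are the factual claims correct?",
    "completeness": "Does it cover everything the question requires?",
}

RUBRIC_PROMPT = (
    "Score the answer from 1-5 on each criterion below and reply with JSON "
    "mapping criterion name to score.\n"
    + "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
)

# Pretend this JSON came back from the judge model (see the earlier sketch).
raw_reply = '{"clarity": 4, "relevance": 5, "accuracy": 3, "completeness": 2}'
scores = json.loads(raw_reply)

# Per-criterion scores reveal *where* an answer falls short, not just whether it does.
weakest = min(scores, key=scores.get)
print(f"weakest criterion: {weakest} ({scores[weakest]}/5)")
```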
Instead of relying on a single model’s judgment, sample multiple times or use several different LLMs. The “LLM-as-a-Jury” idea, proposed by Verga et al. in 2024, can help average out individual model quirks and reduce bias in the final decision.
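A minimal sketch of that aggregation, assuming each judge is a callable that returns a 1-5 score (for example, variations of the judge sketch above pointed at different models or sampled repeatedly); the median is just one reasonable way to combine verdicts.

```python
# LLM-as-a-Jury sketch: aggregate several judges' verdicts on the same answer.
# `judges` is a placeholder list of scoring callables.
from statistics import median

def jury_score(question: str, answer: str, judges) -> float:
    # Each judge returns a 1-5 score; the median dampens any single judge's quirks.
    verdicts = [j(question, answer) for j in judges]
    return median(verdicts)

# Usage with stub judges standing in for real models:
stub_judges = [lambda q, a: 4, lambda q, a: 5, lambda q, a: 2]
print(jury_score("What is 2 + 2?", "4", stub_judges))  # -> 4
```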
It’s tempting to view automation as the end goal: faster, cheaper, more scalable. But when it comes to evaluation, speed without reliability is a false economy. You can generate thousands of judgments instantly with an LLM, but if they’re subtly biased or context-blind, you’re just scaling error.
That’s why hybrid evaluation matters.
In a hybrid setup, LLMs handle the heavy lifting. They can quickly flag low-confidence outputs, score routine tasks, or narrow down options. Then human reviewers step in where nuance, context, or ethical judgment is required. It’s not just about catching mistakes. It’s about making sure your systems evolve in the right direction.
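A sketch of that routing logic, assuming the judge produces a 1-5 score; the thresholds and queue names are illustrative, and in practice you would tune them against your human-validated data.

```python
# Hybrid evaluation sketch: let the LLM judge settle clear-cut cases and
# route everything it is unsure about to human reviewers.
# The 1-5 scale, thresholds, and queues are illustrative assumptions.

def route(item: dict, judge_score: int,
          auto_pass: int = 4, auto_fail: int = 2) -> str:
    """Return where an item should go based on the judge's score."""
    if judge_score >= auto_pass:
        return "accepted"          # routine, high-confidence pass
    if judge_score <= auto_fail:
        return "rejected"          # clear failure, no human needed
    return "human_review"          # borderline: nuance or context required

queues = {"accepted": [], "rejected": [], "human_review": []}
for item, score in [({"id": 1}, 5), ({"id": 2}, 3), ({"id": 3}, 1)]:
    queues[route(item, score)].append(item)

print({name: len(items) for name, items in queues.items()})
# -> {'accepted': 1, 'rejected': 1, 'human_review': 1}
```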
This layered approach also creates valuable feedback loops. When you compare LLM judgments with human ones, you’re not just validating the model. You’re uncovering where your definitions of “quality” need refinement. Over time, this leads to better prompts, more robust test sets, and ultimately, more trustworthy models.
In short, automation alone won’t get you to safe or reliable AI. But automation plus human judgment is how you build systems that can scale and stay grounded in reality.
There’s no single answer for how to evaluate models, but one thing is clear: if you’re using LLM-as-a-Judge, you need to scrutinize it as closely as you would any other part of your system. That means understanding its biases, testing it against human standards, and watching where it fails.
Model evaluation is not just a back-end step. It shapes what gets deployed, what users experience, and what gets reinforced. Treating it as a thoughtful, iterative process is essential if you want to build AI systems that people can trust.