
Why Benchmarks Matter for Evaluating LLMs (and Why Most Miss the Mark)

Every successful AI project masters one fundamental challenge: AI evaluation. As sophisticated AI systems are adopted across industries, the difference between impressive demos and reliable production systems comes down to proper evaluation.

Similar to how traditional software teams write unit tests to ensure components behave as expected, AI teams write AI evaluations (evals) to test whether models produce outputs appropriate for the scenario. As teams grow and projects scale, benchmarks play a crucial role in both worlds by providing a standardized structure to running these tests.

Welcome to our AI Benchmarks series - where we break down:

  1. AI benchmarks and why they matter
  2. Types of AI benchmarks and when to use them
  3. How to run benchmarks effectively
  4. How to build custom benchmarks

In this post, we’ll start with why evaluation is both critical and complex for LLMs. Then, we’ll introduce AI benchmarks and explore what makes them truly useful.

Why structured evaluations matter

Teams building AI solutions naturally test “does this system solve the problem?” These evaluations reveal insights into a system’s strengths, weaknesses, and opportunities for improvement. Benchmarks, a more standardized evaluation method, help provide consistent, repeatable assessments of system quality. For enterprise organizations, however, evaluation is not just a matter of quality.

The EU AI Act (in force as of August 2024) established the world’s first comprehensive legal framework for AI in an evolving regulatory landscape. This means systematic AI evaluation for performance, reliability, and safety is now a compliance requirement. Evaluation can no longer be an ad hoc exercise; it needs to be a strategic, ongoing initiative.

What makes LLM evaluation harder

Classical ML development followed a predictable routine: split your labeled data, train your model, test on the holdout set, score with performance metrics (e.g. accuracy or F1), and iterate. This worked because classical ML systems focused on closed-domain problems and produced outputs that could be objectively measured against correct (i.e. ground truth) labels.
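To make the contrast concrete, here’s a minimal sketch of that classical loop using scikit-learn and its bundled digits dataset; the dataset and model choice are purely illustrative.

```python
# Classical ML evaluation: split labeled data, train, test on the holdout
# set, and score with objective metrics against ground-truth labels.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("macro F1:", f1_score(y_test, preds, average="macro"))
```

Every run against the same holdout set produces the same score, which is exactly the property that breaks down once outputs become open-ended.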

LLMs, however, are non-deterministic by design. They’re:

  • Open-ended: Tasks like text generation, reasoning, and summarization can have multiple valid answers. There may be no definitive ‘correct’ answer, only assessments against subjective criteria such as helpfulness.
  • Configurable post-training: Prompt design, temperature settings, context window size, and more can all greatly alter LLM outputs. In practice, the cost of a given base model and configuration also determines the ultimate usability of an AI system (a minimal sketch of this variability follows this list).
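To see the second point in practice, here’s a minimal sketch that sends the same prompt at several temperature settings. It assumes the OpenAI Python SDK with an API key in the environment; the model name and prompt are illustrative rather than a recommendation.

```python
# The same prompt under different sampling temperatures can yield different,
# equally plausible outputs, which is part of what makes LLM evaluation harder
# than scoring against a single ground-truth label.
# Assumes: OpenAI Python SDK installed and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
prompt = "Summarize the key risks of deploying an unevaluated AI system in two sentences."

for temperature in (0.0, 0.7, 1.2):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    print(f"--- temperature={temperature} ---")
    print(response.choices[0].message.content)
```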

Figure: Example use cases of classical ML and LLM applications

Instead of simple accuracy metrics, LLM evaluations rely on human assessment, model-based judgments, or sophisticated rubrics to evaluate performance on a task. This shift from objective measurement to nuanced assessment calls for a new framework for evaluation: benchmarks.
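As one example of a model-based judgment, here’s a minimal sketch of an “LLM as judge” that scores an answer against explicit criteria and returns structured scores. It assumes the OpenAI Python SDK; the judge model, rubric wording, and score schema are illustrative assumptions, not a prescribed setup.

```python
# Model-based judgment sketch: ask a judge model to grade a candidate answer
# against named criteria instead of comparing it to a single correct label.
# Assumes: OpenAI Python SDK installed and OPENAI_API_KEY set.
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (poor) to 5 (excellent) for helpfulness and for
factual accuracy. Respond only with JSON like:
{{"helpfulness": <int>, "accuracy": <int>}}"""


def judge(question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    # A production harness would validate the JSON and retry on parse errors.
    return json.loads(response.choices[0].message.content)


print(judge(
    "What does the EU AI Act require of high-risk AI systems?",
    "It requires documented risk management, testing, and ongoing monitoring.",
))
```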

AI benchmarks: standardized evaluations

Benchmarks standardize testing so you can repeatedly evaluate AI system performance on the skills covered by their tasks. You may have seen some popular ones used to compare model performance in LLM leaderboards. Examples include:

  • MMLU (Massive Multitask Language Understanding): a popular public benchmark that tests models on 15k multiple-choice questions across 57 subjects spanning the humanities, social sciences, and STEM (see the loading sketch below).
  • HLE (Humanity’s Last Exam): a challenging benchmark containing 2,500 publicly released questions (multiple choice and short-form answer) crowdsourced from subject matter experts around the world. Questions must pass a rigorous human review process before being approved for inclusion in the dataset, and a private question set is maintained by the creators.
  • HealthBench: a benchmark for AI in healthcare containing 5000 realistic patient queries to test model responses against physician-written rubric criteria. This benchmark was developed by OpenAI in partnership with 262 practicing physicians around the world to improve state-of-the-art model performance in health settings.

Source: OpenAI HealthBench publication, May 12, 2025
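For comparison, pulling tasks from a public benchmark like MMLU can take only a few lines. The sketch below assumes the Hugging Face datasets library and the community-hosted cais/mmlu dataset identifier; the subject config and field names follow that release and may differ for other mirrors.

```python
# Load one MMLU subject split and inspect a task.
# Assumes: `pip install datasets` and the "cais/mmlu" dataset on the Hugging Face Hub.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "college_medicine", split="test")

example = mmlu[0]
print(example["question"])
print(example["choices"])  # four answer options
print(example["answer"])   # index of the correct option
```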

These examples comprise the two core components that make up an AI benchmark:

  • A standardized set of tasks
  • A defined scoring methodology
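In code, these two components can be as small as a task list plus a scoring function. The structure below is an illustrative sketch, not any particular framework’s API.

```python
# A benchmark reduced to its two core components: standardized tasks and a
# defined scoring methodology. Names and structure are illustrative.
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkTask:
    prompt: str
    reference: str  # expected answer the scorer compares against


def exact_match(prediction: str, reference: str) -> float:
    """Simplest scoring methodology: 1.0 if normalized strings match, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


@dataclass
class Benchmark:
    tasks: list[BenchmarkTask]
    score_fn: Callable[[str, str], float]

    def run(self, model_fn: Callable[[str], str]) -> float:
        """Score every task with the same method and return the mean."""
        scores = [self.score_fn(model_fn(t.prompt), t.reference) for t in self.tasks]
        return sum(scores) / len(scores)


# Usage: plug in any callable that maps a prompt to a model output.
bench = Benchmark(
    tasks=[BenchmarkTask("What is 2 + 2?", "4")],
    score_fn=exact_match,
)
print(bench.run(lambda prompt: "4"))  # -> 1.0
```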

However, AI benchmarks have recently gotten a bad rap for serving more as vanity metrics than as useful evaluation. And it’s true: off-the-shelf benchmarks generally miss the mark when it comes to evaluating your own AI systems. Although they can be useful when initially selecting a foundation model, a question like “What is the prime factorization of 48?” probably won’t reflect the unique challenges your specific AI application faces.

Instead of relying on the test suites from public benchmarks, teams deploying consumer-facing AI systems collect or generate tasks that are tailored to their application’s users and use case. They include common, challenging, and even adversarial tasks, aiming to be representative of scenarios the system encounters in production. These tailored tasks serve as the foundation for custom benchmarks.

Note: We’ll dive deeper into the types of popular and custom benchmarks and when to use them in Part 2: A Guide to Types of AI Benchmarks. Stay tuned!

What makes a custom benchmark effective?

To recap, an AI benchmark contains a set of tasks and a scoring method. An effective benchmark is also highly relevant, interpretable, and practical for your use case. Generally, this means creating a custom benchmark with the following characteristics:

  • Utility relevance: Mirrors the tasks and challenges your AI system will face in its specific production usage, capturing the diversity and edge cases of real user behavior.
  • Clear evaluation criteria: Explicitly defines the goal of the evaluation and ensures the scoring method actually measures it. Whether you’re using rubric-based criteria from domain experts or ground-truth references, the scoring method must be repeatable across runs.
  • Granular insights: Breaks evaluation results down into error-type analysis and capability-specific performance. Consider how feedback on a student’s essay is more useful and constructive with a rubric-based assessment than with a pass-or-fail grade (see the sketch after this list).
  • Practical efficiency: Supports automation with appropriate human-in-the-loop review. As the test suite grows and becomes costlier to run, automation can help flag the most important tasks to focus on. A smaller, well-curated test suite can be just as effective as a larger one.
  • Version control and reporting: Tracks changes and run history as benchmarks evolve. As more production task samples are collected, use cases shift, or evaluation methods become more sophisticated, benchmark versioning is crucial to ensure you’re comparing apples to apples.
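To illustrate the granular-insights point, the sketch below aggregates per-criterion rubric scores into an overall score while keeping the per-criterion breakdown; the criterion names and weights are assumptions for the example, not a recommended rubric.

```python
# Granular reporting sketch: an overall score plus a per-criterion breakdown,
# so a regression can be traced to a specific capability rather than a bare fail.
from dataclasses import dataclass


@dataclass
class CriterionResult:
    name: str
    score: float   # 0.0-1.0 for this criterion
    weight: float  # relative importance in the overall score


def aggregate(results: list[CriterionResult]) -> dict:
    total_weight = sum(r.weight for r in results)
    overall = sum(r.score * r.weight for r in results) / total_weight
    return {
        "overall": round(overall, 3),
        "by_criterion": {r.name: r.score for r in results},  # the granular view
    }


report = aggregate([
    CriterionResult("factual_accuracy", 0.9, weight=0.5),
    CriterionResult("tone_and_safety", 1.0, weight=0.3),
    CriterionResult("formatting", 0.4, weight=0.2),
])
print(report)
```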

The future of AI development

As AI systems become core infrastructure, test-driven development will become the norm. In the same way software teams write test suites to protect their codebase, AI teams will maintain custom benchmarks to protect the quality of their systems. These benchmarks will encode the goals we want our applications to achieve and define the meaning of quality in our context.

Coming next: Types of AI benchmarks and when to use them

This is the first post in our AI Benchmarks series - your guide to building effective, scalable AI evaluations. Whether you’re experimenting with your own AI system or you’re a seasoned practitioner designing an enterprise evaluation workflow, this series has something for you. In the next post, we’ll break down the different types of AI benchmarks and how to use them effectively.
