Introducing LLM Evaluations and the HumanSignal Platform

Generative AI has the potential to be one of the most transformative innovations humanity has ever produced. The recent flood of research papers and the general availability of both open source and closed source LLMs have sparked a renaissance in computer science, and AI projects now feel like real inventions with tangible outcomes. GenAI’s promise has captured the imaginations of consumers and companies alike, and there is strong enterprise demand for leveraging GenAI’s strengths to cut costs, drive efficiency, and deliver new products to market.

However, in our conversations with thousands of enterprises, we have learned that while they have mandates to adopt and leverage GenAI, they are struggling to put models into production for high-stakes (and high-value) use cases. The feedback we get is that despite GenAI’s potential, the non-deterministic nature of Generative AI and the risk of a high-profile hallucination are substantial obstacles; the stakes are just too high. And every time an enterprise GenAI gaffe hits the headlines, those stakes get even higher.

“There is a lot we don't understand about how these systems work and whether they will remain aligned to human interests.”

Daniel Kokotajlo

OpenAI

So What Is The Solution?

While there are several emerging methods for dealing with this problem, one of the most effective ways to ensure that production GenAI is accurate, aligned, and unbiased is through model evaluation.

What is model evaluation? In brief, GenAI model evaluation refers to the process of measuring a given model’s performance against a set of benchmarks or metrics. This evaluation is something that you can do when you are determining which model you want to use for your project (“Which model performs best for my use case out of the box?”) or during model development (“As I make changes, how is the model improving against my chosen metrics?”). In this way, you can ensure that the model you put into production is within an acceptable margin for error.
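To make that concrete, here is a minimal sketch of a benchmark-style evaluation loop in plain Python. The function and data names are illustrative rather than part of any Label Studio or HumanSignal API; the point is simply that you score a model’s outputs against ground truth with a chosen metric and track that number as the model changes.

```python
# Minimal sketch of benchmark-style evaluation: score a model's outputs
# against ground-truth answers with a simple exact-match metric.
from typing import Callable


def exact_match_accuracy(
    benchmark: list[dict],             # each item: {"prompt": ..., "expected": ...}
    call_model: Callable[[str], str],  # stand-in for the model or API under test
) -> float:
    """Fraction of benchmark items the model answers exactly right."""
    correct = sum(
        call_model(item["prompt"]).strip().lower() == item["expected"].strip().lower()
        for item in benchmark
    )
    return correct / len(benchmark)


if __name__ == "__main__":
    # Toy benchmark and a stub "model", just to show the shape of the loop.
    benchmark = [
        {"prompt": "Capital of France?", "expected": "Paris"},
        {"prompt": "2 + 2 = ?", "expected": "4"},
    ]
    stub_model = lambda prompt: "Paris" if "France" in prompt else "5"
    print(f"exact-match accuracy: {exact_match_accuracy(benchmark, stub_model):.2%}")
```

Running the same benchmark before and after a model or prompt change tells you whether you moved against your chosen metric, which is the core loop behind both model selection and model development.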

However, many of the most popular evaluation methods have significant issues. Jesse Dodge, a scientist at the Allen Institute for AI, says the industry has reached an “evaluation crisis.” The crisis isn't due to a lack of evaluations; every new AI model undergoes extensive testing. The problem is that most standard evaluations poorly reflect real-world use. Furthermore, many of the benchmarks used for evaluation are more than three years old, dating from a time when AI systems were primarily used for research. But now, people are using generative AI in incredibly creative and diverse ways.

HumanSignal Is The Answer

While there is an increasing trend toward relying on automation to fill these gaps, at the end of the day, the only way to truly guarantee model performance is to leverage human supervision and expertise (albeit to varying degrees). Case in point: Apple, known for its consideration of human-machine interaction, recently launched Apple Intelligence. More intriguing than any feature they announced was seeing a historically secretive company share its research and methodology for building AI. Apple employed extensive human review to ensure data quality, using a combination of human-annotated and synthetic data. They also prioritized human evaluation because it correlates closely with user experience in their products. Ultimately, automated evaluations alone can't capture the nuanced insights humans provide. Is this a good song? Is this good advice? Is this a compelling image? These are fundamentally human questions.

HumanSignal has always been a leader in using automation and human supervision throughout the AI development cycle, making it as efficient as possible to achieve the highest possible quality. And today we are launching some new functionality that provides a better evaluation path (while also significantly improving auto-labeling and predictive model labeling functionality).

With Label Studio at the core, enabling efficient and advanced workflows for creating ground truth data, we are introducing new functionality that leverages human signal. Proximity to ground truth turns out to be a critical factor in the accuracy and reliability of AI systems. By keeping models closely aligned with verified data, we can significantly improve their performance and trustworthiness. This approach not only streamlines workflows but also ensures that AI outputs are more aligned, unbiased, and accurate.

Introducing Evaluations

The first new feature we want to highlight is called, fittingly, Evaluations. This new feature makes it extremely intuitive to evaluate models and track performance. The tool includes flexible evaluation workflows that range from fully automated to fully manual.

Fully Automated

Automate the evaluation process using other LLMs as judges. While this approach offers speed and efficiency, it may not match the precision of manual reviews.
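For intuition, here is a minimal sketch of the LLM-as-judge pattern, assuming an OpenAI-compatible chat API. The judge model, rubric, and pass/fail scheme are illustrative, not a description of how Evaluations implements its automated judging.

```python
# Sketch of LLM-as-judge: ask a second model to grade an answer against a rubric.
# Assumes an OpenAI-compatible chat API; model name and rubric are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_RUBRIC = (
    "You are grading an assistant's answer. "
    "Reply with a single word: PASS if the answer is accurate and on-topic, "
    "FAIL otherwise."
)


def judge(question: str, answer: str, judge_model: str = "gpt-4o-mini") -> bool:
    """Ask a judge LLM whether an answer passes the rubric."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```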

Hybrid

Combine manual and automated evaluations to balance accuracy and efficiency. Use automation for initial checks and deploy expert reviews for more complex or critical assessments.
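As a rough illustration of the routing logic a hybrid setup implies (illustrative only, not the platform’s API), the sketch below runs the automated score first and escalates low-confidence or high-stakes items to human reviewers.

```python
# Hybrid routing sketch: auto-accept confident, low-risk items;
# send everything else to a human review queue.
def route_for_review(item: dict, auto_score: float, threshold: float = 0.8) -> str:
    """Decide whether an evaluated item needs human review."""
    if item.get("critical", False):  # e.g. legal, medical, or financial content
        return "human_review"
    if auto_score < threshold:       # automated judge was not confident enough
        return "human_review"
    return "auto_accepted"
```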

Fully Manual

For the highest accuracy, leverage internal experts to manually review and validate LLM outputs. Ideal for critical tasks where precision is paramount, despite the higher cost and time investment.

The tool includes several Evaluators, pre-built configurations for popular use cases like PII detection or toxicity, so you can start evaluating models immediately. And of course, you can also create your own custom metrics to evaluate models based on your specific needs and use cases.
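To give a feel for the kind of check an evaluator performs, here is a toy PII detector in Python. The regexes and output format are deliberately simplistic and are not how the pre-built Evaluators are implemented; a production PII check is far more sophisticated.

```python
# Toy PII check, for intuition only: flag a few obvious patterns in model output.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def pii_findings(text: str) -> dict[str, list[str]]:
    """Return any PII-like matches found in a model response."""
    findings = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[name] = matches
    return findings


print(pii_findings("Reach me at jane@example.com or 555-867-5309."))
```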

Introducing Prompts

Prompts is another major new feature we’re releasing today. It allows you to evaluate prompts and generate high-quality datasets through the Prompts interface. Featuring a purpose-built UI and guardrail features, Prompts lets you constrain LLM generation to reliably prevent hallucinations. And to keep inference costs down, you can measure prompt performance against a ground truth dataset to confirm the LLM generates accurate labels before you auto-label at scale.
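The idea of measuring a prompt before scaling it is easy to sketch: grade the prompt’s labels against a small human-labeled subset first, and only spend inference budget auto-labeling the full dataset once accuracy clears your bar. The helper names below are hypothetical, not the Prompts API.

```python
# "Measure before you scale" sketch: validate a labeling prompt on ground truth,
# then auto-label the full dataset only if it clears an accuracy bar.
# `llm_label` is a placeholder for whatever prompt + model you are testing.
def prompt_accuracy(ground_truth: list[dict], llm_label) -> float:
    """Agreement between LLM-generated labels and human ground-truth labels."""
    hits = sum(llm_label(item["text"]) == item["label"] for item in ground_truth)
    return hits / len(ground_truth)


def maybe_autolabel(ground_truth, unlabeled, llm_label, min_accuracy=0.95):
    score = prompt_accuracy(ground_truth, llm_label)
    if score < min_accuracy:
        raise RuntimeError(f"Prompt only reached {score:.1%}; refine it before scaling.")
    # Accuracy bar cleared: now spend inference budget on the full dataset.
    return [{"text": t, "label": llm_label(t)} for t in unlabeled]
```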

Welcome To The HumanSignal Platform

Last year we changed our name to HumanSignal to highlight our focus on efficiently facilitating interaction between humans and AI. And while the language that we use to communicate with AI – data – will always be the same, we’re going to continue to innovate on how that interaction takes place.

To that end, by adding these new tools to Label Studio’s existing robust workflows, dashboards, quality analytics, and customizable labeling UI, we have expanded our product’s functionality past traditional data labeling. And the name of the solution needs to reflect that expansion.

So going forward, we’re changing the name of Label Studio Enterprise to the HumanSignal platform, which will comprise Label Studio as the UI, manual labeling interface, and orchestration layer; Prompts for prompt evaluation and auto-labeling; and Evaluations for GenAI model evaluation. You’ll also see an updated, cleaner look and feel in the platform as part of this change.

Of course, our open source project will continue to be named Label Studio, and we remain dedicated to unwavering support for our open source software and community.

All of these features are available as of today in the HumanSignal platform. If you’d like to see them in action, feel free to schedule a demo, or if you’re already a customer, contact your Customer Success rep.

Happy Evaluating!
