HumanSignal Introduces LLM Evaluation Capabilities to Ensure Trust & Safety

New Evals Interface Allows Data Science Teams to Customize the Degree of Automation vs Human Supervision According to their Business Requirements

San Francisco, June 26, 2024

June 26, 2024 - HumanSignal introduces new AI Evaluation capabilities, featuring LLM-as-a-judge evaluators combined with human-in-the-loop workflows, enabling data science teams to deliver high-quality, compliant GenAI applications. The new Evals features are now available in the HumanSignal platform, formerly known as Label Studio Enterprise. Additional updates to the product include fully automated data labeling, production-grade prompt engineering, and secure model integrations.

Human feedback is critical for building trusted AI applications

Stanford’s Human-Centered Artificial Intelligence group found that leading legal AI models hallucinate on at least one in six benchmark queries. The risk of hallucinations and inaccurate outputs is a top challenge for enterprises putting LLM-based applications into production for customer service, enterprise knowledge bases, product development, and more.

"If you do not perform continuous evaluation of your GenAI applications in-house, you’re essentially asking your customers to do that for you," said Michael Malyuk, CEO of HumanSignal. "The key is to combine human and machine judgment in a way that efficiently balances cost with reliability. This is precisely where we step in, scaling feedback through tools and workflows that automate away tedious tasks, while enabling human supervision at the most critical junctures."

Evaluation at Every Step of the AI Development Lifecycle

When the stakes are high, a combination of automated evals with human supervision yields the most reliable outputs. Data science teams are adopting new workflows to leverage LLMs, which involve comprehensive evaluations at every stage of the AI development lifecycle, from choosing the best model to optimizing model performance to continuously evaluating models in production. With HumanSignal, they can uniquely configure evaluation workflows to match their use case, using automation for initial checks and pilot projects, and deploying expert reviews or ground truth assessments for more complex and critical applications.
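As an illustration only, the split between automated checks and human supervision can be sketched as a simple routing step. This is not the HumanSignal API: `Sample`, `route_for_review`, `toy_judge`, and the 0.8 threshold are hypothetical names and values chosen for the example, and the judge is a stand-in for a real LLM-as-a-judge call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    prompt: str
    response: str


def route_for_review(
    samples: List[Sample],
    judge: Callable[[Sample], float],
    threshold: float = 0.8,  # assumed cutoff; tune per use case
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into an auto-accepted list and a human-review queue."""
    auto_accepted: List[Sample] = []
    needs_human: List[Sample] = []
    for sample in samples:
        # The judge returns a score in [0, 1]; higher means more acceptable.
        score = judge(sample)
        if score >= threshold:
            auto_accepted.append(sample)
        else:
            needs_human.append(sample)
    return auto_accepted, needs_human


def toy_judge(sample: Sample) -> float:
    """Hypothetical stand-in for an LLM judge: penalizes empty or hedging answers."""
    if not sample.response.strip():
        return 0.0
    return 0.3 if "not sure" in sample.response.lower() else 0.9


samples = [
    Sample("q1", "Paris is the capital of France."),
    Sample("q2", "I'm not sure."),
    Sample("q3", ""),
]
accepted, review = route_for_review(samples, toy_judge)
```

Raising the threshold sends more items to human reviewers, which suits higher-stakes applications. Lowering it favors automation for initial checks and pilot projects.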

In the new Evals interface, users can choose from pre-built metrics for popular use cases, such as PII and content moderation, or create custom metrics for bespoke use cases. New dashboards allow users to visualize, assess, and compare the quality of different models.
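To make the idea of a custom metric concrete, the following minimal sketch scores model outputs with a simple PII check and compares the flag rate across models. It is an assumption-laden example, not the platform's built-in PII metric: the email regex is deliberately simplistic, and `contains_email`, `flag_rate`, and `compare_models` are hypothetical helpers.

```python
import re
from typing import Callable, Dict, List

# Simplistic email pattern used only for illustration; real PII detection
# covers many more entity types and edge cases.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_email(text: str) -> bool:
    """Custom metric: True if the text appears to contain an email address."""
    return EMAIL_PATTERN.search(text) is not None


def flag_rate(outputs: List[str], metric: Callable[[str], bool]) -> float:
    """Fraction of outputs flagged by the metric (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(metric(output) for output in outputs) / len(outputs)


def compare_models(
    outputs_by_model: Dict[str, List[str]],
    metric: Callable[[str], bool],
) -> Dict[str, float]:
    """Apply the same metric to every model so the results are comparable."""
    return {name: flag_rate(outs, metric) for name, outs in outputs_by_model.items()}


results = compare_models(
    {
        "model_a": ["Contact me at jane@example.com", "No personal data here"],
        "model_b": ["All clear", "Nothing to report"],
    },
    contains_email,
)
```

Per-model rates like these are the kind of figures a comparison dashboard could plot side by side.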

The HumanSignal labeling interface also supports manual and semi-automated evaluations of side-by-side model inputs and outputs, response grading, ranking, and RLHF workflows, which can be used to evaluate LLMs and RAG pipelines.

Use HumanSignal to Efficiently Improve Model Reliability: What’s New

Renaming Label Studio Enterprise to HumanSignal highlights its expanded functionality, which supports the entire Generative AI development lifecycle in addition to classic predictive ML/AI pipelines. Enterprises can use HumanSignal not only to evaluate models, but also to improve model quality through prompt engineering and dataset preparation for fine-tuning.

New features in the platform align with the theme of automation and efficiency, while making it easier for subject matter experts to provide feedback and drive model alignment. In addition to support for LLM evaluations, new use cases enabled by the HumanSignal platform include:

  • Auto-labeling or bootstrapping the labeling of large-scale datasets using prompt engineering and proximity to ground truth data.
  • Production-grade prompt engineering using constrained generation and real-time feedback via quality metrics.
  • Easy, secure connections to popular LLMs or custom models, to predict annotations and to facilitate interactive labeling or model evaluation.
  • Streamlined integrations and management of ML/AI infrastructure, including cloud storage data I/O, batch annotation, model integrations, and configuring webhooks, via new API and SDK.
  • Faster and more robust dataset curation for training and fine-tuning via semantic search and similarity search.
  • More granular dashboards and workflows to manage teams of external annotators and reviews from internal subject matter experts.

HumanSignal continues to invest in its popular open source project, Label Studio, which remains a core component of the new HumanSignal platform. The latest version of Label Studio includes new UI configurations that support manual AI evaluation workflows, including RAG pipeline evaluation, side-by-side comparison of model outputs, response moderation, and grading. The open source community also benefits from a new version of the API/SDK, as well as new model integration examples including Llama 3, Ollama, and GLiNER, in addition to OpenAI, Segment Anything, and many other popular models.