HumanSignal Introduces LLM Evaluation Capabilities to Ensure Trust & Safety
New Evals Interface Allows Data Science Teams to Customize the Degree of Automation vs Human Supervision According to their Business Requirements
San Francisco, June 26, 2024—HumanSignal introduces new AI Evaluation capabilities, featuring LLM-as-a-judge evaluators combined with human-in-the-loop workflows, enabling data science teams to deliver high-quality, compliant GenAI applications. The new Evals features are now available in the HumanSignal platform, formerly known as Label Studio Enterprise. Additional updates to the product include fully automated data labeling, production-grade prompt engineering, and secure model integrations.
Human feedback is critical for building trusted AI applications
Stanford’s Human-Centered Artificial Intelligence (HAI) institute found that leading legal AI models hallucinate in one out of every six queries. The risk of hallucinations and inaccurate outputs is a top challenge for enterprises putting LLM-based applications into production for customer service, enterprise knowledge bases, product development, and more.
"If you do not perform continuous evaluation of your GenAI applications in-house, you’re essentially asking your customers to do that for you," said Michael Malyuk, CEO of HumanSignal. "The key is to combine human and machine judgment in a way that efficiently balances cost with reliability. This is precisely where we step in, scaling feedback through tools and workflows that automate away tedious tasks, while enabling human supervision at the most critical junctures."
Evaluation at Every Step of the AI Development Lifecycle
When the stakes are high, a combination of automated evals and human supervision yields the most reliable outputs. Data science teams are adopting new workflows to leverage LLMs, which involve comprehensive evaluations at every stage of the AI development lifecycle, from choosing the best model to optimizing model performance to continuously evaluating models in production. With HumanSignal, teams can configure evaluation workflows to match their use case, using automation for initial checks and pilot projects, and deploying expert reviews or ground-truth assessments for more complex and critical applications.
In the new Evals interface, users can choose from pre-built metrics for popular use cases, such as PII detection and content moderation, or create custom metrics for bespoke use cases. New dashboards allow users to visualize, assess, and compare the quality of different models.
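To make the pattern concrete, the sketch below shows a generic LLM-as-a-judge metric of the kind such evaluators implement: a judge model scores each response (here for PII exposure), and anything below a threshold is routed to a human reviewer. This is an illustration of the general technique, not the HumanSignal Evals API; the OpenAI client, the judge model name, and REVIEW_THRESHOLD are assumptions.

```python
"""Generic LLM-as-a-judge sketch (illustrative; not the HumanSignal Evals API).
A judge model scores each response for PII exposure; low scores are flagged
for human review. Assumes the OpenAI Python client and a hypothetical
REVIEW_THRESHOLD."""
import json
from openai import OpenAI

client = OpenAI()          # reads OPENAI_API_KEY from the environment
REVIEW_THRESHOLD = 4       # hypothetical: scores below this go to a human reviewer

def judge(prompt: str, response: str) -> dict:
    """Ask a judge model to rate one prompt/response pair for PII exposure."""
    instructions = (
        "You are an evaluator. Rate the RESPONSE from 1 (exposes personal data) "
        "to 5 (exposes no personal data). Reply with a JSON object containing an "
        'integer field "score" and a short string field "reason".\n\n'
        f"PROMPT: {prompt}\n\nRESPONSE: {response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",                      # assumed judge model
        messages=[{"role": "user", "content": instructions}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(completion.choices[0].message.content)

def evaluate(items: list[dict]) -> list[dict]:
    """Score each item and flag low scores for human review instead of auto-accepting."""
    results = []
    for item in items:
        verdict = judge(item["prompt"], item["response"])
        verdict["needs_human_review"] = verdict["score"] < REVIEW_THRESHOLD
        results.append({**item, **verdict})
    return results
```

In practice, the threshold controls the balance between automation and human supervision: raising it sends more borderline outputs to reviewers, lowering it leans more heavily on the automated judge.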
The HumanSignal labeling interface also supports manual and semi-automated evaluations of side-by-side model inputs and outputs, response grading, ranking, and RLHF workflows, which can be used to evaluate LLMs and RAG pipelines.
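For the manual side-by-side workflow described above, a rough sketch using the open source label_studio_sdk (assuming its pre-1.0 Client interface) might look like the following: a labeling configuration that shows a prompt with two candidate responses, a preference choice, a rating, and reviewer notes. The project title, field names, and sample task are illustrative assumptions, not built-in templates.

```python
"""Minimal sketch of a side-by-side response-grading project, assuming the
pre-1.0 label_studio_sdk Client interface. The labeling config and task
fields below are illustrative, not the product's built-in templates."""
from label_studio_sdk import Client

LABEL_CONFIG = """
<View>
  <Header value="Prompt"/>
  <Text name="prompt" value="$prompt"/>
  <Header value="Response A"/>
  <Text name="answer_a" value="$answer_a"/>
  <Header value="Response B"/>
  <Text name="answer_b" value="$answer_b"/>
  <Choices name="preference" toName="prompt" choice="single" showInLine="true">
    <Choice value="A is better"/>
    <Choice value="B is better"/>
    <Choice value="Tie"/>
  </Choices>
  <Rating name="quality" toName="prompt" maxRating="5"/>
  <TextArea name="notes" toName="prompt" placeholder="Reviewer notes"/>
</View>
"""

ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
project = ls.start_project(title="LLM side-by-side eval (sketch)",
                           label_config=LABEL_CONFIG)

# Each task pairs one prompt with two candidate model responses.
project.import_tasks([
    {"prompt": "Summarize the refund policy.",
     "answer_a": "Refunds are issued within 30 days of purchase.",
     "answer_b": "We do not offer refunds."},
])
```

Annotators then grade each pair in the labeling UI, and the resulting preferences and ratings can be exported for ranking analysis or RLHF-style fine-tuning datasets.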
Use HumanSignal to Efficiently Improve Model Reliability: What’s New
Renaming Label Studio Enterprise to HumanSignal highlights its expanded functionality, supporting the entire Generative AI development lifecycle, in addition to classic predictive ML/AI pipelines. Enterprises can use HumanSignal not only to evaluate models, but to improve model quality through prompt engineering and preparing datasets for fine-tuning.
New features in the platform align with the themes of automation and efficiency, while making it easier for subject matter experts to provide feedback and drive model alignment. In addition to support for LLM evaluations, new use cases enabled by the HumanSignal platform include fully automated data labeling, production-grade prompt engineering, and secure model integrations.
HumanSignal continues to invest in its popular open source project, Label Studio, which remains a core component of the new HumanSignal platform. The latest version of Label Studio includes new UI configurations that support manual AI evaluation workflows, including RAG pipeline evaluation, side-by-side comparison of model outputs, response moderation, and grading. The open source community also benefits from a new version of the API/SDK, as well as new model integration examples including Llama 3, Ollama, and GLiNER, in addition to OpenAI, Segment Anything, and many other popular models.
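As a rough illustration of how such model integrations typically plug in, the sketch below outlines a minimal ML backend, assuming the label_studio_ml package's LabelStudioMLBase class; the ResponseModerationBackend class and the call_model helper are hypothetical placeholders for any of the models listed above.

```python
"""Rough sketch of a Label Studio ML backend, assuming the label_studio_ml
package's LabelStudioMLBase base class. call_model() is a placeholder for
whichever model (e.g. an Ollama-served Llama 3) actually produces the label."""
from label_studio_ml.model import LabelStudioMLBase


def call_model(text: str) -> str:
    # Placeholder: swap in a real call to Ollama, OpenAI, GLiNER, etc.
    return "acceptable"


class ResponseModerationBackend(LabelStudioMLBase):
    def predict(self, tasks, **kwargs):
        """Return one pre-annotation per task so reviewers start from a draft."""
        predictions = []
        for task in tasks:
            label = call_model(task["data"]["response"])
            predictions.append({
                "result": [{
                    "from_name": "moderation",   # must match the labeling config
                    "to_name": "response",
                    "type": "choices",
                    "value": {"choices": [label]},
                }],
                "score": 0.5,  # placeholder confidence
            })
        return predictions
```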