Customer Story

How Mind Moves and HumanSignal Brought Trust to AI in Healthcare

In Conversation With

Nicole Sroka, CEO

Dr. James Shanahan, CIO

The Challenge: Hallucinations, Health, and High Stakes

Large language models can sound smart, but in medicine, sounding smart isn’t enough. At one of the most trusted institutions in health research, a team was piloting a new GenAI assistant aimed at delivering accurate, evidence-based responses to health questions. The system pulled from NIH-vetted sources like PubMed Central and MedlinePlus, but even with a retrieval-augmented generation (RAG) pipeline in place, one critical question remained:

Can we trust the model’s outputs, especially when the stakes are health-related?

That question isn’t theoretical. From policy alignment to health literacy to factual accuracy, every sentence generated needed scrutiny. The team needed a rigorous evaluation framework that could balance expert insight with scalable workflows.

The Solution: A Human-in-the-Loop Blueprint with Mind Moves and Label Studio

That’s where Mind Moves came in, bringing human-centered design and organizational strategy to help develop an AI evaluation process that was not only effective, but sustainable across teams. In partnership with HumanSignal’s Label Studio, they designed a six-phase human-in-the-loop workflow to assess the AI assistant’s outputs across dimensions and tasks such as:

  • Interpretability (was the meaning clear?)
  • Readability (was it accessible for users?)
  • Accuracy and Evidence Support (was it factually correct and properly cited?)
  • Alignment with NIH standards (did it stay within the guardrails?)
  • Fact‑checking (did each sentence align with its cited reference chunks?)
  • Reference response creation (was a reference‑grounded answer generated?)
  • Reference response selection (was the strongest reference‑supported option chosen?)

Instead of relying on spreadsheets or ad hoc review tools, the team used Label Studio to orchestrate a complex annotation project that spanned:

  • 4 reviewer groups, consisting of 20 annotators
  • 100 biomedical questions, ranging from expert-level to non-expert
  • 6 structured projects, each building on the last
  • 20,000+ annotation tasks, with varying levels of cognitive difficulty

Each group worked within a dedicated Label Studio workspace, a feature unique to the Enterprise platform, enabling separation of projects and reviewer types while maintaining centralized oversight. The team also customized the labeling interface using Enterprise Plugins, including “hoverable” question marks that surfaced contextual help and definitions at the project level. This reduced friction for first-time annotators and ensured high consistency on cognitively demanding tasks.
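To make this concrete, here is a minimal sketch of what one of these projects' labeling setups could look like. The XML tags used (View, Header, Text, Rating, Choices, TextArea) are standard Label Studio config elements, but the dimension names, rating scale, and sample task are illustrative assumptions rather than the team's actual configuration.

```python
# Illustrative only: the dimension names, scales, and sample task below are
# assumptions for demonstration, not the team's actual project setup.

# A Label Studio labeling config covering a few of the dimensions described
# above, written with standard Label Studio XML tags.
LABEL_CONFIG = """
<View>
  <Header value="Question"/>
  <Text name="question" value="$question"/>
  <Header value="AI assistant response"/>
  <Text name="response" value="$response"/>

  <Header value="Interpretability: was the meaning clear? (1-5)"/>
  <Rating name="interpretability" toName="response" maxRating="5"/>

  <Header value="Accuracy and evidence support"/>
  <Choices name="evidence_support" toName="response" choice="single">
    <Choice value="Fully supported by cited sources"/>
    <Choice value="Partially supported"/>
    <Choice value="Not supported"/>
  </Choices>

  <Header value="Reviewer notes"/>
  <TextArea name="notes" toName="response" rows="3"/>
</View>
"""

# One task in Label Studio's JSON import format: the keys under "data" map to
# the $question and $response variables referenced in the config above.
SAMPLE_TASK = {
    "data": {
        "question": "What are common early symptoms of type 2 diabetes?",
        "response": (
            "Early symptoms can include increased thirst, frequent "
            "urination, and fatigue [1]."
        ),
    }
}
```

Each of the six chained projects would presumably carry its own variant of a config like this, with the Enterprise plugin hover help layered on top in the UI.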

“Our objectives exceeded the capabilities of traditional tools like Excel. Label Studio provided a modular, extensible platform for managing annotation workflows, quality assurance, and consensus review at scale. The resulting benchmark dataset now underpins our LLM-as-a-judge framework for systematic, high-throughput model evaluation.” — Dr. James Shanahan, CIO, Mind Moves

A Hybrid Evaluation Pipeline

Mind Moves also helped develop an interoperable dual-pipeline system: one for RAG-based response generation, and one for evaluation.

  • The generation pipeline used secure, serverless infrastructure to retrieve and synthesize data from NIH-vetted sources.
  • The evaluation pipeline incorporated both human annotation (via Label Studio) and LLM-as-a-judge evaluations using GPT-4 and Claude 3 (a sketch of this kind of judge call follows below).
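As a sketch of how such a judge call might be structured: the snippet below asks a model to verify one sentence of an answer against its cited reference chunk, mirroring the sentence-level fact-checking phase described earlier. The call_judge parameter is a hypothetical stand-in for whichever model API is used (GPT-4, Claude 3, or another), and the prompt wording and verdict labels are illustrative assumptions rather than the team's actual rubric.

```python
import json
from typing import Callable

# Hypothetical stand-in for a real model API call (GPT-4, Claude 3, etc.):
# any function that takes a prompt string and returns the model's reply text.
JudgeFn = Callable[[str], str]

JUDGE_PROMPT = """You are evaluating a health-information assistant.

Question: {question}
Sentence from the assistant's answer: {sentence}
Cited reference excerpt: {reference}

Does the cited excerpt support the sentence? Reply with JSON only:
{{"verdict": "supported" | "partially_supported" | "unsupported", "rationale": "..."}}"""


def judge_sentence(call_judge: JudgeFn, question: str, sentence: str, reference: str) -> dict:
    """Ask the judge model whether one sentence is backed by its cited reference."""
    prompt = JUDGE_PROMPT.format(question=question, sentence=sentence, reference=reference)
    raw = call_judge(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # If the judge returns malformed JSON, flag the sentence for human
        # review rather than guessing a verdict.
        return {"verdict": "needs_human_review", "rationale": raw}
```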

By exporting annotation data from Label Studio in structured formats (CSV/JSON), the team was able to compare human and LLM evaluations head-to-head, revealing important gaps between the two. Notably, human reviewers were stricter and more conservative, especially on evidence support and clarity, highlighting the essential role of domain expertise.
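One way that head-to-head comparison could be scripted is sketched below: it pulls a single human verdict per task out of a Label Studio JSON export (tasks nest an annotations list whose result items carry value.choices) and tallies agreement against LLM-judge verdicts. The evidence_support field name and the shape of llm_verdicts are assumptions for illustration.

```python
from __future__ import annotations

import json
from collections import Counter


def human_verdict(task: dict, field: str = "evidence_support") -> str | None:
    """Return the first reviewer's choice for one field from a Label Studio
    JSON export (tasks -> annotations -> result -> value.choices)."""
    for annotation in task.get("annotations", []):
        for item in annotation.get("result", []):
            if item.get("from_name") == field and "choices" in item.get("value", {}):
                return item["value"]["choices"][0]
    return None


def compare(export_path: str, llm_verdicts: dict[int, str]) -> Counter:
    """Tally agreement between human choices and LLM-judge verdicts per task.

    llm_verdicts maps task id -> judge verdict (assumed to use the same label
    vocabulary as the human choices, so the comparison is like-for-like).
    """
    with open(export_path) as f:
        tasks = json.load(f)
    tally = Counter()
    for task in tasks:
        human = human_verdict(task)
        llm = llm_verdicts.get(task["id"])
        if human is None or llm is None:
            tally["missing"] += 1
        elif human == llm:
            tally["agree"] += 1
        else:
            tally["disagree"] += 1
    return tally
```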

Early Results and Lessons

The pilot yielded a number of early-stage insights:

  • Interpretability rated high: most outputs were clear, but not all.
  • Readability measured as acceptable: responses were accessible at or below an 8th-grade reading level, supporting health literacy (a readability-scoring sketch follows this list).
  • Alignment with NIH standards scored extremely high, affirming the strength of the prompt engineering in areas like not providing medical advice.
  • Overall acceptance rate landed at 50%: a strong starting point for an early-stage GenAI system in a high-risk domain.
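On the readability point: one common way to check a grade-level target is the Flesch-Kincaid grade formula, sketched below with a crude syllable heuristic. It is an illustrative proxy, not necessarily the measure the team used.

```python
import re


def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count contiguous vowel groups."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade_level(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59


# A response meets the health-literacy bar if it scores at or below grade 8.
sample = (
    "Drink plenty of water. Rest, and call your doctor if the fever "
    "lasts more than three days."
)
print(f"Estimated grade level: {fk_grade_level(sample):.1f}")
```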

Beyond the metrics, the project gave hands-on experience in AI evaluation to the 78% of reviewers who were first-time annotators, creating a cadre of annotators who will return to their own work better equipped to assess LLM outputs, deepening internal capacity and building trust from the ground up.

Why It Matters

HumanSignal and Mind Moves developed a repeatable, human-centered workflow that scales across annotators, safeguards against hallucinations, and upholds medical integrity.

This wasn’t just a tech problem. It was an institutional challenge, and the team met it with clarity and care.

“In a world where trust in health information is fragile, we knew we needed a system that reflected human judgment, not just machine outputs. This process helped us build that trust.” — Nicole Sroka, CEO, Mind Moves

Final Thoughts

Evaluating AI in critical domains requires more than benchmarks or dashboards. It requires real humans, thoughtful scaffolding, and tools that facilitate progress. With Label Studio, Mind Moves took a step toward AI that doesn’t just answer questions, but earns trust.