Large language models can sound smart, but in medicine, sounding smart isn’t enough. At one of the most trusted institutions in health research, a team was piloting a new GenAI assistant aimed at delivering accurate, evidence-based responses to health questions. The system pulled from NIH-vetted sources like PubMed Central and MedlinePlus, but even with a retrieval-augmented generation (RAG) pipeline in place, one critical question remained:
Can we trust the model’s outputs, especially when the stakes are health-related?
That question isn’t theoretical. From policy alignment to health literacy to factual accuracy, every sentence generated needed scrutiny. The team needed a rigorous evaluation framework that could balance expert insight with scalable workflows.
That’s where Mind Moves came in, bringing human-centered design and organizational strategy to help develop an AI evaluation process that was not only effective, but sustainable across teams. In partnership with HumanSignal’s Label Studio, they designed a six-phase human-in-the-loop workflow to assess the AI assistant’s outputs across dimensions like policy alignment, health literacy, and factual accuracy.
Instead of relying on spreadsheets or ad hoc review tools, the team used Label Studio to orchestrate a complex annotation project that spanned multiple reviewer groups and multiple projects.
Each group worked within a dedicated Label Studio workspace, a feature unique to the Enterprise platform, enabling separation of projects and reviewer types while maintaining centralized oversight. The team also customized the labeling interface using Enterprise Plugins, including “hoverable” question marks that surfaced contextual help and definitions at the project level. This reduced friction for first-time annotators and ensured high consistency on cognitively demanding tasks.
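For teams hoping to replicate this kind of setup, the sketch below shows what a minimal version of such a project could look like when created with the Label Studio Python SDK. The URL, API key, project title, dimension names, and rating scales are illustrative assumptions rather than Mind Moves’ actual configuration, and the Enterprise workspace assignment and plugin-based hover help are configured separately and not shown.

```python
# Minimal sketch (assumptions throughout): create one evaluation project with
# the Label Studio Python SDK. The connection details, dimension names, and
# rating scales are illustrative; Enterprise workspaces and the plugin-based
# hover help described above are configured separately and not shown here.
from label_studio_sdk import Client

LABEL_CONFIG = """
<View>
  <Text name="question" value="$question"/>
  <Text name="response" value="$response"/>
  <Choices name="evidence_support" toName="response" choice="single" showInline="true">
    <Choice value="Well supported"/>
    <Choice value="Partially supported"/>
    <Choice value="Unsupported"/>
  </Choices>
  <Choices name="clarity" toName="response" choice="single" showInline="true">
    <Choice value="Clear"/>
    <Choice value="Somewhat clear"/>
    <Choice value="Unclear"/>
  </Choices>
</View>
"""

def create_evaluation_project(url: str, api_key: str):
    """Create a Label Studio project for human review of assistant responses."""
    client = Client(url=url, api_key=api_key)
    client.check_connection()
    return client.start_project(
        title="GenAI Health Assistant Response Evaluation",
        label_config=LABEL_CONFIG,
    )

if __name__ == "__main__":
    project = create_evaluation_project("http://localhost:8080", "YOUR_API_KEY")
    # Each task pairs a user question with the RAG pipeline's generated answer.
    project.import_tasks([
        {"question": "Example health question", "response": "Generated answer to review"},
    ])
```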
“Our objectives exceeded the capabilities of traditional tools like Excel. Label Studio provided a modular, extensible platform for managing annotation workflows, quality assurance, and consensus review at scale. The resulting benchmark dataset now underpins our LLM-as-a-judge framework for systematic, high-throughput model evaluation.” — Dr. James Shanahan, CIO, Mind Moves
Mind Moves also helped develop an interoperable dual-pipeline system: one for RAG-based response generation, and one for evaluation.
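In outline, that dual-pipeline design can be thought of as two decoupled stages that share only the question and the retrieved evidence. The sketch below is a generic, assumed rendering rather than Mind Moves’ implementation: the retriever, generator, and judge are passed in as plain callables, so any RAG stack or LLM-as-a-judge model could fill those roles.

```python
# Hypothetical outline of an interoperable dual pipeline: one path produces
# RAG-grounded answers, the other scores them. The retriever, generator, and
# judge callables are assumptions to be replaced by real components.
from dataclasses import dataclass
from typing import Callable

Retriever = Callable[[str], list[str]]         # question -> evidence passages
Generator = Callable[[str, list[str]], str]    # question + evidence -> answer
Judge = Callable[[str, str, list[str]], dict]  # question + answer + evidence -> rubric scores

@dataclass
class GeneratedAnswer:
    question: str
    evidence: list[str]
    answer: str

def generation_pipeline(question: str, retrieve: Retriever,
                        generate: Generator) -> GeneratedAnswer:
    """RAG path: retrieve vetted sources, then generate a grounded answer."""
    evidence = retrieve(question)
    return GeneratedAnswer(question=question, evidence=evidence,
                           answer=generate(question, evidence))

def evaluation_pipeline(item: GeneratedAnswer, judge: Judge) -> dict:
    """Evaluation path: score one answer against the rubric dimensions.

    The same record can also be exported as a Label Studio task so human
    reviewers rate it against the identical rubric.
    """
    return judge(item.question, item.answer, item.evidence)
```

Keeping the two paths separate is what makes the head-to-head comparison described next possible: every generated answer can be scored twice, once by the automated judge and once by human reviewers in Label Studio.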
By exporting annotation data from Label Studio in structured formats (CSV/JSON), the team was able to compare human and LLM evaluations head-to-head, revealing important gaps. Notably, human reviewers were stricter and more conservative, especially on evidence support and clarity, highlighting the essential role of domain expertise.
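A comparison like that can be scripted directly against a Label Studio JSON export. The snippet below is a simplified sketch: the dimension names, the ordinal mapping, and the assumption that the LLM judge’s score was stored alongside each task’s data are all illustrative.

```python
# Sketch with assumed field names: compare human ratings from a Label Studio
# JSON export against LLM-judge scores assumed to be stored in each task's data.
import json
from statistics import mean

# Assumed ordinal rubric shared by human reviewers and the LLM judge.
SCALE = {"Unsupported": 0, "Partially supported": 1, "Well supported": 2,
         "Unclear": 0, "Somewhat clear": 1, "Clear": 2}

def human_scores(task: dict, dimension: str) -> list[int]:
    """Collect every reviewer's rating for one dimension from a task's annotations."""
    scores = []
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            if result.get("from_name") == dimension:
                scores.append(SCALE[result["value"]["choices"][0]])
    return scores

def compare(export_path: str, dimension: str = "evidence_support") -> None:
    """Report how human ratings differ from the LLM judge on one dimension."""
    with open(export_path) as f:
        tasks = json.load(f)

    gaps = []
    for task in tasks:
        humans = human_scores(task, dimension)
        llm_score = task["data"].get(f"llm_{dimension}")  # assumed key
        if humans and llm_score is not None:
            gaps.append(mean(humans) - llm_score)

    if not gaps:
        print(f"No comparable tasks found for {dimension}")
        return
    # A negative mean gap means humans rated more strictly than the LLM judge.
    print(f"{dimension}: mean human-minus-LLM gap = {mean(gaps):.2f} "
          f"over {len(gaps)} tasks")

if __name__ == "__main__":
    compare("label_studio_export.json")
```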
The pilot yielded a number of early-stage insights.
Beyond the metrics, the project also educated its reviewers, 78% of whom were first-time annotators, on how AI evaluation works, deepening internal capacity and building trust from the ground up.
Together, HumanSignal and Mind Moves developed a repeatable, human-centered workflow that scales across annotators, safeguards against hallucinations, and upholds medical integrity.
This wasn’t just a tech problem. It was an institutional challenge, and the team met it with clarity and care.
“In a world where trust in health information is fragile, we knew we needed a system that reflected human judgment, not just machine outputs. This process helped us build that trust.” - Nicole Sroka, CEO, Mind Moves
Evaluating AI in critical domains requires more than benchmarks or dashboards. It requires real humans, thoughtful scaffolding, and tools that facilitate progress. With Label Studio, Mind Moves took a step toward AI that doesn’t just answer questions, but earns trust.