Prompts

Scale AI evaluation and annotation

Accelerate labeling and evaluation workflows with prompts and LLMs built into Label Studio Enterprise.

Image Evals / AI Image Rubric Evaluation
v3 · Current gpt-5
System

You are a visual design expert. Your task is to score a generated image based on a strict rubric.

You will be given an image. The prompt is "portrait of airplane"

1. Prompt Alignment (1–5) Does the subject clearly match the prompt?
2. Visual Quality (1–5) Is the image sharp, well-lit, and artifact-free?
3. Composition (1–5) Is the framing, balance, and crop strong?
User
{{image}} variable
Overall Evaluation Results 20 tasks $0.025 avg
Results Alignment Versions Compare History
All Needs Review Has Error Columns ▾
Image Ovr Algn Qual Comp Comments Cost
5 5 5 4 Clear, sharp photo of airplane at airport — matches prompt, strong composition, no issues. $0.021
4 5 4 4 Alignment excellent — clearly matches "portrait of airplane". Image slightly soft but fully… $0.023
2 2 3 2 Subject is a person holding a model plane — not an airplane portrait. Prompt fundamentally… ⚠ Needs Review $0.019
5 5 5 5 Commercial airplane on runway — subject perfectly framed, excellent light, prompt… $0.020
10×
time-to-insight
93%
label accuracy
vs. manual
4-5×
lower labeling costs
faster labeling

Active Learning

An automated active-learning loop

Every annotation closes the loop. The model retrains on newly labeled data, predictions update automatically, and the least-confident tasks are routed to your team. Available exclusively in Label Studio Enterprise.

01

Annotate

An annotator labels a task. A webhook instantly notifies the connected model backend.

02

Retrain

The model retrains on the newly labeled data. No manual batching or scheduling.

03

Re-predict

Fresh predictions from the updated model flow back into Label Studio in real time.

04

Re-prioritize

Tasks are reordered so annotators always see the most uncertain, lowest-confidence cases next.
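
The re-prioritize step can be pictured as least-confidence sampling, one common way to rank uncertainty. The following is a minimal sketch, not the product's internal logic: it assumes each task carries the model's top-class probability in a hypothetical `confidence` field, and `reprioritize` is an illustrative name.

```python
from typing import Dict, List


def reprioritize(tasks: List[Dict]) -> List[Dict]:
    """Order tasks so the lowest-confidence predictions come first.

    Tasks with no prediction yet (no "confidence" key) are treated as
    maximally uncertain and sorted to the front of the queue.
    """
    return sorted(tasks, key=lambda t: t.get("confidence", 0.0))


# Hypothetical tasks; "confidence" is an assumed field name.
tasks = [
    {"id": 1, "confidence": 0.97},
    {"id": 2, "confidence": 0.51},
    {"id": 3},  # not yet predicted
    {"id": 4, "confidence": 0.73},
]

queue = reprioritize(tasks)
print([t["id"] for t in queue])  # [3, 2, 4, 1]
```

Other uncertainty measures, such as margin or entropy, would slot into the same sort key.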

Use LLM-as-a-judge or jury

Deploy models to act like a judge or a jury, scoring responses against a rubric, rating scale or pass/fail threshold. Measure consensus or identify uncertain cases to route to subject matter experts.

Judge model Overall Dims Verdict
Claude Sonnet 4 Anthropic
5
A · H · C

Accurate, well-reasoned, and grounded in the task context. The 60% efficiency claim is appropriately hedged and the core mechanism is explained clearly.

Accurate · Concise · On-task
GPT-5 OpenAI
4
A · H · C

Good explanation of the active learning loop. Could benefit from a concrete example or a mention of annotation tool integration to ground the efficiency claim.

Accurate · Needs example
Gemini 1.5 Pro Google DeepMind
4
A · H · C

Satisfactory summary. The response is clear and mostly correct. The 60% figure is presented without a citation — recommend flagging for human review before publication.

Clear · Unverified stat
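
Measuring consensus across a jury can be as simple as comparing the judges' scores. The sketch below assumes each judge returns one integer score on the same scale; the `jury_verdict` name, the judge keys, and the `max_spread` cutoff of 1 point are illustrative assumptions rather than product settings.

```python
from statistics import mean
from typing import Dict


def jury_verdict(scores: Dict[str, int], max_spread: int = 1) -> Dict:
    """Combine per-judge scores and flag disagreement for expert review."""
    values = list(scores.values())
    spread = max(values) - min(values)  # gap between harshest and kindest judge
    return {
        "mean_score": round(mean(values), 2),
        "spread": spread,
        "needs_expert_review": spread > max_spread,
    }


# Scores matching the three judges in the panel above.
print(jury_verdict({"judge_a": 5, "judge_b": 4, "judge_c": 4}))
# {'mean_score': 4.33, 'spread': 1, 'needs_expert_review': False}

# A wider split would be routed to a subject matter expert.
print(jury_verdict({"judge_a": 5, "judge_b": 2, "judge_c": 4})["needs_expert_review"])  # True
```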

Pre-label with AI so humans can focus on what matters

Set up a prompt and generate predictions for an entire dataset. Your annotation team can switch to reviewing and validating instead of manually writing labels from zero.

Datasets / Portrait Dataset
Image Classification Caption
indoor · candid Bearded man in a denim jacket smiling warmly in a café setting, with soft bokeh light in the background.
natural · portrait Woman in a cream linen top photographed in soft natural light next to large tropical foliage, looking off-camera.
workplace · portrait Professional with glasses and highlighted hair standing confidently in an open-plan office with a monitor visible behind her.
formal · indoor Professional woman in a navy blazer smiling in a formal office setting, with framed credentials visible in the background.
indoor · portrait Brown-haired woman in a dark formal shirt photographed with a warm smile, lit by diffused natural window light indoors.
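
For teams that script pre-labeling themselves, model output is typically attached to each task as a prediction. This is a rough sketch of that shape as generally understood from Label Studio's import format; the `from_name` and `to_name` values, the `model_version` string, and the image URL are placeholders that must match your own labeling configuration, and `to_prelabeled_task` is a hypothetical helper.

```python
import json
from typing import Dict, List


def to_prelabeled_task(image_url: str, tags: List[str], caption: str,
                       model_version: str = "prompt-v1") -> Dict:
    """Wrap model output as a task with one prediction attached."""
    return {
        "data": {"image": image_url},
        "predictions": [{
            "model_version": model_version,  # assumed label for the prompt version
            "result": [
                # Control names below are assumptions; use the ones in your config.
                {"from_name": "classification", "to_name": "image",
                 "type": "choices", "value": {"choices": tags}},
                {"from_name": "caption", "to_name": "image",
                 "type": "textarea", "value": {"text": [caption]}},
            ],
        }],
    }


task = to_prelabeled_task(
    "https://example.com/portrait-001.jpg",  # placeholder URL
    ["indoor", "candid"],
    "Bearded man in a denim jacket smiling warmly in a café setting.",
)
print(json.dumps(task, indent=2, ensure_ascii=False))
```

Reviewers then accept or correct the prediction instead of labeling from scratch.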

Bring your own model

Connect any model: your own model, a model from any commercial provider, or an AI gateway like OpenRouter or Hugging Face.

Empower SMEs

Give subject-matter experts the power to refine prompts

You don't need to be a masterful prompt engineer to improve a model's performance. Click to enhance prompts automatically, improving definitions, decision logic and examples.

Prompts / Product Review NLP
Base model
gpt-5-mini OpenAI
Prompt
Text Title

Given the product review and title, perform the following:

- sentiment: classify the review as positive, neutral, or negative

- entities: identify key product features and brands mentioned

Enhancement Complete Applied

Changes made:

The prompt lacked specificity around output structure and task scope. The following improvements were applied:

  1. Added output format: specified JSON structure ensures consistent, parseable responses across all model providers.
  2. Clarified entity scope: listed product features, brand names, and key adjectives as explicit target entity types.
  3. Handled ambiguous cases: added instructions for mixed-sentiment reviews and confidence scoring.
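
A fixed JSON output format pays off because replies can be checked mechanically. The sketch below assumes a reply with two keys, `sentiment` and `entities`; the exact schema the enhanced prompt specifies is not shown on this page, so treat these field names and the `parse_review_output` helper as illustrative.

```python
import json
from typing import Dict

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def parse_review_output(raw: str) -> Dict:
    """Parse a model reply and check it against the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed replies
    sentiment = data.get("sentiment")
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")
    entities = data.get("entities")
    if not isinstance(entities, list) or not all(isinstance(e, str) for e in entities):
        raise ValueError("entities must be a list of strings")
    return {"sentiment": sentiment, "entities": entities}


# Hypothetical model reply for a product review.
reply = '{"sentiment": "positive", "entities": ["battery life", "Acme"]}'
print(parse_review_output(reply))
# {'sentiment': 'positive', 'entities': ['battery life', 'Acme']}
```

Replies that fail the check can be retried or sent to a reviewer rather than stored as labels.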

Verify and align to ground truth. Then automate and scale.

How do you know your AI judges are actually right? Bad pre-labels bias your annotators. Unreliable judges give you false confidence in your models.

Score against ground truth first. Set a confidence threshold, verify accuracy on a sample set, and automate with confidence.

Prompts / MT Evaluation
Results Alignment Variants Compare History

Overall Evaluation Results

73.15% accuracy 20 tasks $0.055005 avg
All Correct Incorrect Has Error
# | Original text | Ground truth | LLM output | Score | Accuracy | Cost
24701 | That is why Mr Searle's report deserves our full support and congratulations. | C'est pourquoi le rapport de M. Searle mérite notre plein soutien et nos félicitations. | C'est pourquoi le rapport de M. Searle mérite notre plein soutien et nos félicitations. | 5 | 100% | $0.0027
24702 | I believe the Commission has shown great clarity on this very important matter. | Je crois que la Commission a fait preuve d'une grande clarté sur cette question importante. | Je pense que la Commission a été très claire dans son approche de cette importante question. | 4 | 84% | $0.0031
24703 | Liberalization and trade must go hand in hand with strong social protection measures. | La libéralisation et le commerce doivent aller de pair avec la protection sociale. | Les échanges commerciaux et la libéralisation sont importants pour la croissance économique. | 2 | 42% | $0.0029
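
The "score against ground truth first" step amounts to a gate on sample accuracy. Here is a minimal sketch using the three rows visible above (the full run covers 20 tasks); the 0.80 threshold and the `ground_truth_gate` helper are assumptions for illustration, not product defaults.

```python
from typing import Dict


def ground_truth_gate(task_accuracy: Dict[int, float], threshold: float = 0.80) -> Dict:
    """Decide whether a prompt is accurate enough to run unattended."""
    mean_acc = sum(task_accuracy.values()) / len(task_accuracy)
    # Tasks scoring under the threshold are listed for human review.
    review = sorted(t for t, acc in task_accuracy.items() if acc < threshold)
    return {
        "mean_accuracy": round(mean_acc, 4),
        "ready_to_automate": mean_acc >= threshold,
        "review_tasks": review,
    }


# Per-task accuracy from the three rows shown above.
sample = {24701: 1.00, 24702: 0.84, 24703: 0.42}
print(ground_truth_gate(sample))
# {'mean_accuracy': 0.7533, 'ready_to_automate': False, 'review_tasks': [24703]}
```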

Put prompts to work for your use cases

From classification to evaluation, prompts are built to handle your unique use cases for multimodal data.

PDF entity extraction
Q&A rubric generation
LLM response evaluation
Content moderation
PII detection and extraction
Multi-turn conversation evaluation

Pipelines & API

Automate the entire pipeline

Wire Label Studio into your stack with webhooks, a full REST API, and a Python SDK. Every task, annotation, and project change can trigger the next step so labeling, training, and evaluation run without manual hand-offs.

Trigger model training

Kick off training automatically once a project hits a threshold of new annotations.

Drive active learning

Fire the active-learning loop on every annotation event — no scheduler required.

Version your data

Push each new batch of labeled data into your dataset-versioning repository automatically.

Integrate external pipelines

Connect Label Studio to your stack with the REST API and Python SDK for fully programmatic workflows.
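
The "Trigger model training" card above can be approximated with a per-project counter fed by annotation events. This is a sketch under stated assumptions: `TrainingTrigger` and `start_training` are hypothetical names, and the counter lives in memory, so a real deployment would persist it.

```python
from collections import defaultdict
from typing import Callable, DefaultDict


class TrainingTrigger:
    """Start training once a project collects enough new annotations."""

    def __init__(self, threshold: int, start_training: Callable[[int], None]):
        self.threshold = threshold
        self.start_training = start_training
        self.pending: DefaultDict[int, int] = defaultdict(int)

    def record_annotation(self, project_id: int) -> bool:
        """Count one annotation; return True if training was started."""
        self.pending[project_id] += 1
        if self.pending[project_id] >= self.threshold:
            self.pending[project_id] = 0  # reset for the next batch
            self.start_training(project_id)
            return True
        return False


started = []  # stands in for a real training job launcher
trigger = TrainingTrigger(threshold=3, start_training=started.append)
results = [trigger.record_annotation(project_id=7) for _ in range(4)]
print(results, started)  # [False, False, True, False] [7]
```

Calling `record_annotation` from a webhook handler would connect it to annotation events.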

webhook_handler.py
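from flask import Flask, request

app = Flask(__name__)


def retrain_model(project_id, annotation):
    """Placeholder: replace with a call to your own training job."""
    print(f"Retraining project {project_id} on annotation {annotation.get('id')}")


@app.route("/ls-webhook", methods=["POST"])
def label_studio_webhook():
    # The event arrives as a JSON body; tolerate empty or invalid payloads.
    event = request.get_json(silent=True) or {}
    action = event.get("action")

    if action == "ANNOTATION_CREATED":
        task = event["task"]
        annotation = event["annotation"]
        # New label landed: kick off the active-learning loop
        retrain_model(project_id=task["project"], annotation=annotation)

    return {"status": "ok"}, 200
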
LABEL STUDIO ENTERPRISE

COMPREHENSIVE INFRASTRUCTURE

Make the most of your unique expertise and novel datasets as you train, benchmark, and evaluate AI in one common environment.