Features

Know Your Agents Work Before Users Do

Eight specialized evaluators, deployment gates, scheduled runs, and A/B experiments. Ship with confidence.

Evaluation System

Test, measure, improve

Everything you need to systematically raise agent quality.

Test Datasets

Define inputs and expected outputs. Measure accuracy across every scenario that matters.
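
As a concrete sketch, a test dataset is just a list of input/expected-output pairs. The field names below are illustrative, not Zentrr's actual schema:

```python
# Hypothetical shape of a test dataset entry (not Zentrr's actual schema).
test_dataset = [
    {
        "input": "Where is order #4312?",
        "expected": "Looks up order 4312 and reports its shipping status.",
    },
    {
        "input": "Cancel my subscription.",
        "expected": "Confirms the account, then processes the cancellation.",
    },
]
```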

A/B Experiments

Split traffic across agent versions with different prompts, models, or settings. Compare real performance side-by-side and pick the winner with data.

AI Judges

Eight specialized evaluators score every response automatically — from rubric-based accuracy to multi-turn conversation analysis.

Track Over Time

Catch regressions before users do. Get alerts when quality drops below your thresholds.

Eight Evaluator Types

Every dimension of quality, covered

From rubric-based scoring to full conversation trajectory analysis. Each evaluator targets a specific failure mode so nothing slips through.

Rubric-Based Scoring

Score responses against structured criteria with LLM-as-judge evaluation.

Output

Define custom rubrics to score any dimension specific to your domain.

Helpfulness

Is the response actionable, complete, and actually useful?

Faithfulness

Does the response stay grounded in source data, or does it hallucinate?
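
One way to picture rubric-based scoring: each criterion becomes part of a grading prompt handed to a judge model, which returns a score plus its reasoning. A minimal sketch, assuming an invented rubric format rather than Zentrr's internals:

```python
# Sketch of LLM-as-judge rubric scoring. The rubric structure and judge
# prompt are illustrative assumptions, not Zentrr's internals.
rubric = {
    "helpfulness": "Is the response actionable, complete, and useful?",
    "faithfulness": "Is every claim grounded in the provided source data?",
}

def build_judge_prompt(response: str, sources: str) -> str:
    criteria = "\n".join(f"- {name}: {q}" for name, q in rubric.items())
    return (
        "Score the response 0-100 on each criterion and explain each score.\n"
        f"Criteria:\n{criteria}\n\n"
        f"Source data:\n{sources}\n\n"
        f"Response:\n{response}"
    )

print(build_judge_prompt("Your order ships Friday.", "Order 4312: ships Friday."))
```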

Tool Accuracy

Verify agents pick the right tools and pass the right parameters every time.

Tool Selection

Did the agent choose the correct tool for the task?

Tool Parameters

Were the parameters passed to each tool call correct and complete?
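
Conceptually, both tool evaluators reduce to comparing what the agent did against what the test case expected. A toy version of that check, with hypothetical tool names and structures:

```python
# Toy version of the tool-accuracy check (hypothetical tools and structures).
expected_call = {"tool": "lookup_customer", "args": {"customer_id": "C-88"}}
actual_call = {"tool": "lookup_customer", "args": {"customer_id": "C-88"}}

tool_ok = actual_call["tool"] == expected_call["tool"]    # Tool Selection
params_ok = actual_call["args"] == expected_call["args"]  # Tool Parameters

print(f"tool selection: {'pass' if tool_ok else 'fail'}")
print(f"tool parameters: {'pass' if params_ok else 'fail'}")
```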

Conversation Analysis

Evaluate full multi-turn interactions, not just isolated responses.

Trajectory

Evaluate the full sequence of tool calls and decisions across a conversation.

Interactions

Score multi-turn back-and-forth dialogue quality over time.

Goal Success

Did the agent ultimately achieve the user's goal by the end of the conversation?
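
A trajectory evaluator judges the whole sequence rather than any single step. As a rough illustration (the field names and goal condition are invented), a multi-turn test case might record the expected order of tool calls plus a final goal check:

```python
# Rough illustration of trajectory and goal-success checks (invented names).
expected_trajectory = ["lookup_customer", "get_order", "issue_refund"]
actual_trajectory = ["lookup_customer", "get_order", "issue_refund"]

trajectory_ok = actual_trajectory == expected_trajectory
goal_achieved = "issue_refund" in actual_trajectory  # did the refund happen?

print(f"trajectory: {'pass' if trajectory_ok else 'fail'}")
print(f"goal success: {'pass' if goal_achieved else 'fail'}")
```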

Deployment Gates

No deploy until evals pass

Require eval suites to pass before an agent version can go live. Eval gates connect your quality bar directly to the deployment pipeline — regressions are caught automatically, not by users.

  • Block deploys when pass rate drops below threshold
  • Configurable per agent — strict for production, relaxed for staging
  • Automatic rollback if post-deploy checks fail
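
In code terms, a gate is a threshold check wired into the deploy path. A minimal sketch, assuming invented threshold values and an invented function name:

```python
# Minimal sketch of a deployment gate: block the deploy when the suite's
# pass rate is below the environment's bar. Thresholds are invented.
GATE_THRESHOLDS = {"production": 0.95, "staging": 0.80}

def can_deploy(env: str, pass_rate: float) -> bool:
    return pass_rate >= GATE_THRESHOLDS[env]

print(can_deploy("production", 0.953))  # True: 95.3% clears the 95% bar
print(can_deploy("production", 0.92))   # False: deploy blocked
```
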
Scheduled Evals

Run evals on autopilot

Configure cron schedules to run eval suites automatically. Catch model drift, knowledge base staleness, and subtle regressions without lifting a finger.

  • Hourly, daily, or weekly cron schedules
  • Alerts when scores regress from baseline
  • Historical trend tracking across every run
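
A schedule is typically nothing more than a cron expression attached to an eval suite. Something like this hypothetical config captures the idea:

```python
# Hypothetical schedule config: one cron expression per eval suite.
# "0 * * * *" = hourly, "0 6 * * *" = daily at 06:00, "0 6 * * 1" = Mondays.
schedules = [
    {"suite": "support-agent-core", "cron": "0 * * * *"},   # hourly
    {"suite": "knowledge-freshness", "cron": "0 6 * * *"},  # daily
    {"suite": "full-regression", "cron": "0 6 * * 1"},      # weekly
]
```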

Evaluation Dashboard

One dashboard, every signal

Pass rates, score distributions, and trends over time. Spot issues before they reach users.

[Dashboard mockup: Support Agent v2.4. Overall score: 95.3% pass rate, +2.4% vs last week. Score distribution across 150 test cases. Per-evaluator scores: Accuracy 94% (+2.1%), Helpfulness 91% (+1.8%), Tone 97% (+0.5%), Safety 99% (0%). Pass rate trend over the last 30 days, plus a log of recent test runs with pass/fail counts and durations.]

Continuous Improvement

Experiment, learn, and iterate

A/B test agent configurations and let the platform learn from real conversations to improve performance automatically.

A/B Experiments

Compare different agent versions head-to-head. Split live traffic across variants with different prompts, models, temperatures, or tool configurations — then let eval scores and real metrics pick the winner.

Variant A (Claude Sonnet, temp 0.3): 87% pass
Variant B (Claude Haiku, temp 0.7): 72% pass
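
Under the hood, an experiment is a weighted traffic split plus a comparison of outcomes. The simplified sketch below reuses the two variants above; the routing logic and field names are assumptions, not Zentrr's implementation:

```python
import random

# Simplified traffic split across variants (routing logic is an assumption).
variants = [
    {"name": "A", "model": "claude-sonnet", "temperature": 0.3, "weight": 0.5},
    {"name": "B", "model": "claude-haiku", "temperature": 0.7, "weight": 0.5},
]

def route_request() -> dict:
    return random.choices(variants, weights=[v["weight"] for v in variants])[0]

# After enough traffic, compare pass rates and promote the winner.
results = {"A": 0.87, "B": 0.72}
winner = max(results, key=results.get)
print(f"promote variant {winner} ({results[winner]:.0%} pass rate)")
```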

Heuristic Learning

The platform auto-learns strategic guidelines from real conversations and injects them into agent prompts. Behavioral patterns, tool usage best practices, knowledge gaps, and style preferences compound over time.

Behavioral: Always confirm order details before processing
Tool Usage: Use lookup_customer before checking_balance
Knowledge: Q4 pricing changed — cite updated rate sheet
Style: Keep responses under 3 sentences for status checks
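
Mechanically, injection can be as simple as prepending the learned guidelines to the agent's system prompt. A rough sketch using the examples above (the category labels and wiring are illustrative, not the platform's internals):

```python
# Rough sketch: prepend learned guidelines to the system prompt.
# Categories and wiring are illustrative, not the platform's internals.
learned_guidelines = [
    ("Behavioral", "Always confirm order details before processing"),
    ("Tool Usage", "Use lookup_customer before checking_balance"),
    ("Style", "Keep responses under 3 sentences for status checks"),
]

base_prompt = "You are a support agent."  # placeholder base prompt
rules = "\n".join(f"[{cat}] {rule}" for cat, rule in learned_guidelines)
system_prompt = f"{base_prompt}\n\nLearned guidelines:\n{rules}"
print(system_prompt)
```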

Metrics

Metrics that matter

Pass Rate

% of test cases that meet your criteria

Average Score

Weighted evaluation score across all tests

Per-Evaluator Breakdown

See scores by each evaluation criterion

Duration

How long each test run takes
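
In concrete terms, the two headline numbers reduce to simple arithmetic over per-test scores. A toy example with made-up numbers:

```python
# Pass rate and weighted average score for one run (made-up numbers).
scores = [92, 88, 45, 97, 81]        # one score per test case
weights = [1.0, 1.0, 2.0, 1.0, 1.0]  # e.g., weight critical cases higher
PASS_THRESHOLD = 80

pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
avg_score = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

print(f"pass rate: {pass_rate:.0%}")      # 80%
print(f"average score: {avg_score:.1f}")  # 74.7
```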

AI Judges

Criteria you define, scores you trust

Built-in rubrics for helpfulness, faithfulness, and safety — plus custom rubrics for any domain-specific criteria. Every score includes reasoning you can audit.

  • Custom rubrics for domain-specific evaluation
  • Detailed scoring explanations for every judgment
  • Configurable thresholds per evaluator per agent

Output — Custom rubric scoring on any dimension you define
Helpfulness — Is the response actionable and complete?
Faithfulness — Grounded in source data, no hallucinations
Goal Success — Did the agent achieve the objective?
Tool Selection — Picked the right tool for the job
Tool Parameters — Passed correct arguments to each tool
Trajectory — Full conversation path evaluation
Interactions — Multi-turn dialogue quality scoring
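
Per-evaluator thresholds mean each judge can fail a response independently of the overall score. A compact sketch of that check, with invented threshold values:

```python
# Each evaluator gets its own pass threshold (values are invented).
thresholds = {"Helpfulness": 85, "Faithfulness": 90, "Tool Selection": 95}
scores = {"Helpfulness": 91, "Faithfulness": 88, "Tool Selection": 100}

for evaluator, minimum in thresholds.items():
    verdict = "pass" if scores[evaluator] >= minimum else "fail"
    print(f"{evaluator}: {scores[evaluator]} (min {minimum}) -> {verdict}")
```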

Evaluation Workflow

Four steps to confidence

1

Create test dataset

Build a set of representative inputs and expected behaviors for your agent.

2

Run experiment

Your agent processes each test case and generates responses automatically.

3

AI judges score

Eight evaluators assess each response across rubric, tool, and conversation dimensions.

4

Review and improve

Identify failures, improve your agent, and re-run to verify improvements.
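
Stitched together, the four steps form one loop. The sketch below compresses them into plain Python; agent_respond and judge_response are stand-ins for your agent and the evaluator suite, not Zentrr APIs:

```python
# The four-step loop in miniature. agent_respond and judge_response are
# placeholders for your agent and the evaluators, not Zentrr APIs.
def agent_respond(prompt: str) -> str:          # step 2: run the agent
    return f"(agent answer mentioning order status for: {prompt})"

def judge_response(response: str, expected: str) -> int:  # step 3: score
    return 90 if expected in response else 40

dataset = [  # step 1: representative inputs and expected behaviors
    {"input": "Where is order #4312?", "expected": "order status"},
]

failures = []  # step 4: collect failures to review and fix
for case in dataset:
    score = judge_response(agent_respond(case["input"]), case["expected"])
    if score < 80:
        failures.append(case)

print(f"{len(dataset) - len(failures)}/{len(dataset)} passed")
```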

Stop guessing. Start measuring.

See how Zentrr helps you build agents you can trust. Schedule a demo.