Features
Know Your Agents Work Before Users Do
Eight specialized evaluators, deployment gates, scheduled runs, and A/B experiments. Ship with confidence.
Evaluation System
Test, measure, improve
Everything you need to systematically raise agent quality.
Test Datasets
Define inputs and expected outputs. Measure accuracy across every scenario that matters.
A/B Experiments
Split traffic across agent versions with different prompts, models, or settings. Compare real performance side-by-side and pick the winner with data.
AI Judges
Eight specialized evaluators score every response automatically — from rubric-based accuracy to multi-turn conversation analysis.
Track Over Time
Catch regressions before users do. Get alerts when quality drops below your thresholds.
8 Evaluator Types
Every dimension of quality, covered
From rubric-based scoring to full conversation trajectory analysis. Each evaluator targets a specific failure mode so nothing slips through.
Rubric-Based Scoring
Score responses against structured criteria with LLM-as-judge evaluation.
Output
Define custom rubrics to score any dimension specific to your domain.
Helpfulness
Is the response actionable, complete, and actually useful?
Faithfulness
Does the response stay grounded in source data, or does it hallucinate?
Tool Accuracy
Verify agents pick the right tools and pass the right parameters every time.
Tool Selection
Did the agent choose the correct tool for the task?
Tool Parameters
Were the parameters passed to each tool call correct and complete?
Conversation Analysis
Evaluate full multi-turn interactions, not just isolated responses.
Trajectory
Evaluate the full sequence of tool calls and decisions across a conversation.
Interactions
Score the quality of back-and-forth dialogue across multiple turns.
Goal Success
Did the agent ultimately achieve the user's goal by the end of the conversation?
No deploy until evals pass
Require eval suites to pass before an agent version can go live. Eval gates connect your quality bar directly to the deployment pipeline, so regressions are caught automatically rather than by users (see the sketch after this list).
- Block deploys when pass rate drops below threshold
- Configurable per agent — strict for production, relaxed for staging
- Automatic rollback if post-deploy checks fail
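As a rough illustration, a per-environment gate could encode those rules in a small config plus a single check. The field names and the `can_deploy` helper below are hypothetical, not Zentrr's actual configuration schema.

```python
# Hypothetical per-environment eval gate settings. Field names are
# illustrative only, not Zentrr's actual configuration schema.
EVAL_GATES = {
    "production": {
        "required_suites": ["core_regression", "safety"],
        "min_pass_rate": 0.95,                 # block the deploy below this
        "rollback_on_post_deploy_fail": True,
    },
    "staging": {
        "required_suites": ["core_regression"],
        "min_pass_rate": 0.80,                 # relaxed bar for staging
        "rollback_on_post_deploy_fail": False,
    },
}

def can_deploy(environment: str, suite_pass_rates: dict[str, float]) -> bool:
    """Return True only if every required suite meets the gate's pass rate."""
    gate = EVAL_GATES[environment]
    return all(
        suite_pass_rates.get(suite, 0.0) >= gate["min_pass_rate"]
        for suite in gate["required_suites"]
    )
```

The point of the sketch is the shape of the rule: the bar is defined per environment, and a deploy only proceeds when every required suite clears it.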
Run evals on autopilot
Configure cron schedules to run eval suites automatically. Catch model drift, knowledge base staleness, and subtle regressions without lifting a finger (see the sketch after this list).
- Hourly, daily, or weekly cron schedules
- Alerts when scores regress from baseline
- Historical trend tracking across every run
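For illustration, a scheduled run like the one described above might be expressed as a small definition of this shape. The field names are hypothetical and are not Zentrr's actual API.

```python
# Hypothetical scheduled-run definition: a nightly cron, an alert threshold
# relative to a rolling baseline, and trend retention. Illustrative only.
scheduled_run = {
    "suite": "core_regression",
    "cron": "0 2 * * *",              # every day at 02:00 UTC
    "alert_if_score_drops_by": 0.05,  # compared with the rolling baseline
    "baseline_window_runs": 7,        # baseline = average of the last 7 runs
    "keep_history": True,             # retain results for trend tracking
}
```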
Evaluation Dashboard
One dashboard, every signal
Pass rates, score distributions, and trends over time. Spot issues before they reach users.
[Dashboard preview: Support Agent v2.4, recent test runs]
Continuous Improvement
Experiment, learn, and iterate
A/B test agent configurations and let the platform learn from real conversations to improve performance automatically.
A/B Experiments
Compare different agent versions head-to-head. Split live traffic across variants with different prompts, models, temperatures, or tool configurations — then let eval scores and real metrics pick the winner.
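Purely as an illustration (this is not Zentrr's actual experiment API), a two-variant split might look like the sketch below; the model names and prompts are example values.

```python
# Hypothetical A/B experiment: two variants with different models, prompts,
# and temperatures, splitting live traffic 50/50. Illustrative only.
experiment = {
    "name": "support-agent-prompt-v2",
    "traffic_split": {"control": 0.5, "candidate": 0.5},
    "variants": {
        "control": {
            "model": "gpt-4o",
            "temperature": 0.3,
            "system_prompt": "You are a concise support agent...",
        },
        "candidate": {
            "model": "claude-sonnet-4",
            "temperature": 0.1,
            "system_prompt": "You are a concise support agent. Cite policy...",
        },
    },
    # The winner is chosen from eval scores plus live metrics.
    "decision_metrics": ["eval_pass_rate", "goal_success", "csat"],
}
```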
Heuristic Learning
The platform auto-learns strategic guidelines from real conversations and injects them into agent prompts. Guidelines covering behavioral patterns, tool-usage best practices, knowledge gaps, and style preferences compound over time.
Metrics
Metrics that matter
Pass Rate
% of test cases that meet your criteria
Average Score
Weighted evaluation score across all tests
Per-Evaluator Breakdown
See scores by each evaluation criterion
Duration
How long each test run takes
Criteria you define, scores you trust
Built-in rubrics for helpfulness, faithfulness, and safety, plus custom rubrics for any domain-specific criteria. Every score includes reasoning you can audit (see the sketch after this list).
- Custom rubrics for domain-specific evaluation
- Detailed scoring explanations for every judgment
- Configurable thresholds per evaluator per agent
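To make that concrete, here is a minimal sketch of a custom rubric with per-evaluator thresholds for one agent. The structure and field names are hypothetical, not Zentrr's actual schema.

```python
# Hypothetical custom rubric plus per-evaluator thresholds for one agent.
# Names and fields are illustrative only.
refund_policy_rubric = {
    "name": "refund_policy_compliance",
    "criteria": [
        {"description": "Quotes the correct refund window", "weight": 0.5},
        {"description": "Never promises refunds outside policy", "weight": 0.3},
        {"description": "Escalates edge cases to a human", "weight": 0.2},
    ],
    "scale": [1, 5],            # the judge scores each criterion from 1 to 5
    "require_reasoning": True,  # every score ships with an auditable explanation
}

agent_thresholds = {
    "helpfulness": 4.0,
    "faithfulness": 4.5,
    "refund_policy_compliance": 4.5,
}
```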
Evaluation Workflow
Four steps to confidence
Create test dataset
Build a set of representative inputs and expected behaviors for your agent.
Run experiment
Your agent processes each test case and generates responses automatically.
AI judges score
Eight evaluators assess each response across rubric, tool, and conversation dimensions.
Review and improve
Identify failures, improve your agent, and re-run to verify improvements.
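Sketched in code, the four steps form a simple loop. The `run_agent` and `judge_response` functions below are placeholders standing in for the run and scoring steps; they are not Zentrr's actual SDK, and the dataset and threshold are example values.

```python
# Hypothetical end-to-end eval loop: build a dataset, run the agent,
# score each response, and collect failures to review. Illustrative only.

def run_agent(case: dict) -> str:
    """Placeholder: call your agent on one test input."""
    return f"(agent response to: {case['input']})"

def judge_response(case: dict, response: str) -> dict:
    """Placeholder: score one response across evaluator dimensions."""
    return {"helpfulness": 4.2, "faithfulness": 4.8, "tool_accuracy": 5.0}

# 1) Create a test dataset of representative inputs.
test_dataset = [
    {"input": "Where is my order #4821?"},
    {"input": "Can I return an opened item?"},
]

THRESHOLD = 4.0

# 2) Run the agent on each case, 3) score every response.
failures = []
for case in test_dataset:
    response = run_agent(case)
    scores = judge_response(case, response)
    if min(scores.values()) < THRESHOLD:
        failures.append((case, scores))

print(f"{len(test_dataset) - len(failures)}/{len(test_dataset)} cases passed")
# 4) Review failures, improve the agent, and re-run to verify.
```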
Stop guessing. Start measuring.
See how Zentrr helps you build agents you can trust. Schedule a demo.