Evaluating AI Agents in Production: Beyond Offline Benchmarks
Offline evals tell you how an agent might perform. Production monitoring tells you how it actually performs. Here is how to bridge the gap with evaluation strategies that work at scale.
Why Offline Evals Fall Short
Every team building AI agents starts the same way: assemble an eval dataset of fifty to a hundred representative queries, run the agent against them, measure accuracy, and declare the system ready for production. The eval suite passes at 94%. The demo is flawless. Then users arrive, and within a week, the support queue fills with failure reports that the eval suite never predicted.
The core problem is distribution shift. Eval datasets are curated by the people who built the agent — which means they reflect the builders’ mental model of what users will ask, not what users actually ask. Real production traffic contains misspellings, ambiguous phrasing, multilingual code-switching, sarcasm, and queries that combine multiple intents in a single message. It also contains adversarial inputs — prompt injections, jailbreak attempts, and inputs deliberately designed to confuse the system. No hand-curated dataset captures this long tail.
There is a deeper issue: offline evals measure point-in-time performance against a frozen dataset, but agents operate in environments that change continuously. APIs return different data formats after upstream updates. User expectations shift as they learn what the agent can do. The knowledge the agent relies on becomes stale. An eval suite that passed last month may not represent the failure modes that matter today. Offline evals are necessary — they catch regressions and provide a reproducible baseline — but they are not sufficient. Production evaluation is a separate discipline entirely.
What to Measure in Production
Production evaluation requires measuring four dimensions simultaneously. Optimizing for any single dimension without watching the others leads to agents that are accurate but slow, fast but unreliable, or cheap but dangerous.
| Dimension | Key Metrics | Why It Matters |
|---|---|---|
| Quality | Correctness rate, relevance score, completeness, user satisfaction | An agent that gives wrong answers erodes trust permanently |
| Reliability | Tool-call success rate, retry rate, timeout rate, error rate | Intermittent failures frustrate users more than consistent ones |
| Cost | Tokens per task, dollars per successful completion, cost per failed attempt | Unchecked costs can make a viable product economically unsustainable |
| Safety | Guardrail trigger rate, out-of-scope handling, PII exposure rate | A single safety incident can outweigh months of good performance |
Quality is the hardest to measure automatically because “correct” depends on context. For structured outputs (data extraction, classification), you can compare against ground truth. For open-ended generation (summaries, advice, creative writing), you need either human evaluation or LLM-as-judge — a separate model that scores the agent’s output on defined rubrics. Both have trade-offs: human evaluation is expensive and slow; LLM-as-judge is cheaper but introduces its own biases.
Reliability metrics are often the most actionable because they surface mechanical failures — API timeouts, malformed tool calls, rate limit hits — that have clear engineering fixes. Track these at the span level: which specific tool call failed, at which step, with what parameters. A 2% overall error rate might hide a 40% failure rate on one specific tool that only fires for certain query types.
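The span-level breakdown described above takes only a few lines to compute. In this sketch the `spans` log schema — a `tool` name and an `ok` flag per tool call — is an assumption for illustration, not any specific tracing library's format:

```python
from collections import defaultdict

def per_tool_failure_rates(spans):
    """Aggregate tool-call outcomes by tool name.

    `spans` is a list of dicts with 'tool' and 'ok' keys
    (a hypothetical log schema for illustration).
    """
    counts = defaultdict(lambda: {"calls": 0, "failures": 0})
    for span in spans:
        entry = counts[span["tool"]]
        entry["calls"] += 1
        if not span["ok"]:
            entry["failures"] += 1
    return {
        tool: entry["failures"] / entry["calls"]
        for tool, entry in counts.items()
    }

# 101 tool calls, 3 failures: ~3% overall error rate...
spans = (
    [{"tool": "search", "ok": True}] * 95
    + [{"tool": "search", "ok": False}] * 1
    + [{"tool": "invoice_lookup", "ok": True}] * 3
    + [{"tool": "invoice_lookup", "ok": False}] * 2
)
rates = per_tool_failure_rates(spans)
# ...but invoice_lookup fails 40% of the time.
```

The aggregate metric looks healthy while one rarely-fired tool is broken — exactly the pattern span-level tracking exists to catch.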
The metric that matters most is cost per successful task completion, not cost per token. An agent that spends $0.02 on tokens but fails 30% of the time (requiring human escalation at $5 per incident) is far more expensive than one that spends $0.08 on tokens with a 98% success rate.
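The arithmetic behind that comparison is worth making explicit. A minimal cost model, assuming every failed attempt escalates to a human at a fixed cost (a simplification — real escalation costs vary):

```python
def cost_per_success(token_cost, success_rate, escalation_cost):
    """Expected cost per successful task completion.

    Simplified model: each attempt costs `token_cost`, and each
    failure additionally incurs `escalation_cost` for human handling.
    """
    expected_per_attempt = token_cost + (1 - success_rate) * escalation_cost
    return expected_per_attempt / success_rate

cheap = cost_per_success(0.02, 0.70, 5.00)   # ~ $2.17 per success
pricey = cost_per_success(0.08, 0.98, 5.00)  # ~ $0.18 per success
```

The "cheap" agent is roughly twelve times more expensive per successful completion once escalations are priced in.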
Eval-Driven Development
Eval-driven development is the practice of treating your eval suite as a living artifact that evolves with your agent. The workflow has five steps: collect production failures, categorize them into failure modes, add representative examples to your eval suite, improve the agent (prompts, tools, guardrails), and redeploy with monitoring. This cycle repeats continuously — it is not a one-time exercise.
The categorization step is critical. Not all failures are equal. A failure where the agent calls the wrong tool is different from a failure where it calls the right tool with wrong parameters, which is different from a failure where the tool returns unexpected data and the agent doesn’t recover gracefully. Each failure mode has a different root cause and a different fix. Lumping them together as “the agent got it wrong” prevents systematic improvement.
Practical failure mode categories for most agent systems include: wrong tool selection (the model picked an inappropriate tool), parameter hallucination (correct tool, fabricated arguments), context blindness (the answer was in the context but the model ignored it), instruction drift (the model deviated from its system prompt), recovery failure (a tool errored and the model couldn’t adapt), and safety bypass (the model produced output that should have been filtered). Tracking the distribution of these categories over time reveals whether your improvements are working.
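One way to make the taxonomy concrete is an enum plus a tally. The category names mirror the list above; the `FailureMode` type itself is illustrative, not a standard:

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    WRONG_TOOL = "wrong_tool_selection"
    PARAM_HALLUCINATION = "parameter_hallucination"
    CONTEXT_BLINDNESS = "context_blindness"
    INSTRUCTION_DRIFT = "instruction_drift"
    RECOVERY_FAILURE = "recovery_failure"
    SAFETY_BYPASS = "safety_bypass"

def failure_distribution(labeled_failures):
    """Fraction of failures per category for one time window."""
    counts = Counter(labeled_failures)
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}

# One week's labeled failures (toy data).
week = [FailureMode.WRONG_TOOL] * 6 + [FailureMode.RECOVERY_FAILURE] * 4
dist = failure_distribution(week)
```

Comparing `dist` across weeks shows whether, say, a prompt change actually reduced wrong-tool selections or merely shifted failures into another category.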
Start with twenty golden examples — ten successes and ten failures from real production traffic. Run every candidate agent change against these twenty cases before deploying. This takes minutes, catches obvious regressions, and builds the habit of eval-first development. Expand the suite to one hundred examples within the first month.
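A golden-suite runner can be a single function. Here `agent`, `check`, and the toy cases are placeholders for your own system, not any framework's API:

```python
def run_golden_suite(agent, golden_cases, check):
    """Run `agent` (a callable) over golden cases; report failures.

    `golden_cases` is a list of (query, expected) pairs and `check`
    compares an output against the expected value -- exact match for
    structured outputs, a judge or rubric score for open-ended ones.
    """
    failures = []
    for query, expected in golden_cases:
        output = agent(query)
        if not check(output, expected):
            failures.append((query, expected, output))
    passed = len(golden_cases) - len(failures)
    return passed, failures

# Toy agent and cases for demonstration only.
agent = lambda q: q.upper()
cases = [("refund policy", "REFUND POLICY"), ("hours", "HOURS")]
passed, failures = run_golden_suite(agent, cases, lambda o, e: o == e)
```

Wiring this into CI so it runs on every prompt or tool change is what turns it from a script into a habit.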
Building Eval Datasets from Production Traffic
The best eval datasets come from production, not from a brainstorming session. But sampling production traffic naively introduces bias — you over-represent common queries and under-represent the edge cases that cause the most damage. Stratified sampling addresses this: sample proportionally across query intent categories, complexity levels, and outcome types (success, partial success, failure).
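A sketch of one common stratified-sampling variant — fixed size per stratum, which deliberately over-weights rare strata relative to uniform sampling so edge cases are represented (field names are hypothetical):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum, seed=0):
    """Sample up to `per_stratum` records from each stratum.

    `key` extracts the stratum label (e.g. intent, complexity,
    or outcome type). A fixed seed keeps the sample reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label, bucket in strata.items():
        k = min(per_stratum, len(bucket))
        sample.extend(rng.sample(bucket, k))
    return sample

# 900 billing queries, 10 legal queries: uniform sampling would
# yield ~99% billing; per-stratum sampling yields 5 of each.
traffic = [{"intent": "billing"}] * 900 + [{"intent": "legal"}] * 10
sample = stratified_sample(traffic, key=lambda r: r["intent"], per_stratum=5)
```

In practice you would stratify on the cross-product of intent, complexity, and outcome, not a single field.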
Annotating eval examples requires deciding who defines “correct.” For factual tasks, domain experts annotate ground truth. For subjective tasks, you need multiple annotators and inter-rater agreement metrics (Cohen’s kappa or Krippendorff’s alpha) to quantify how much annotators disagree. If annotators can’t agree on what “correct” means, your agent can’t be expected to get it right either — the problem is underspecified, and the fix is clearer task definitions, not better prompts.
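Cohen's kappa is simple enough to compute directly. This sketch assumes two annotators labeling the same items, with at least some expected disagreement so the denominator is nonzero:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two annotators agree on 5 of 6 items; kappa corrects for chance.
a = ["ok", "ok", "bad", "ok", "bad", "ok"]
b = ["ok", "bad", "bad", "ok", "bad", "ok"]
kappa = cohens_kappa(a, b)  # ~0.67: substantial but imperfect agreement
```

A kappa well below ~0.6 is the signal mentioned above: the task definition, not the agent, needs work first.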
LLM-as-judge is increasingly viable for scaling annotation. Use a separate, typically larger model to score outputs on specific rubrics: factual accuracy (1–5), completeness (1–5), relevance (1–5), safety (pass/fail). Calibrate the judge against human annotations on a hundred examples to understand its biases. Common biases include verbosity preference (longer answers score higher), position bias (first option in a list scores higher), and self-preference (models rate their own outputs higher). Awareness of these biases lets you correct for them in your scoring pipeline.
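A minimal judge harness for the rubric above. The model call is abstracted behind a caller-supplied function, and the stub reply below stands in for a real judge model — the rubric wording and JSON schema are illustrative:

```python
import json

RUBRIC = """Score the answer on each criterion.
factual_accuracy: 1-5
completeness: 1-5
relevance: 1-5
safety: pass or fail
Return JSON with exactly those keys."""

def judge_output(call_model, question, answer):
    """Score one output with a judge model.

    `call_model` maps a prompt string to the judge's raw JSON
    reply; the actual client (hosted API or local model) is
    left to the caller.
    """
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    scores = json.loads(call_model(prompt))
    # Fail closed: a malformed or missing safety field counts as a failure.
    scores["safety"] = scores.get("safety") == "pass"
    return scores

# Stub judge for demonstration; a real judge is a model call.
stub = lambda prompt: (
    '{"factual_accuracy": 4, "completeness": 5, '
    '"relevance": 4, "safety": "pass"}'
)
scores = judge_output(stub, "What is our refund window?", "30 days.")
```

The calibration step — running this over a hundred human-annotated examples and comparing — is where the verbosity, position, and self-preference biases show up as systematic score offsets you can then correct for.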
Shadow Mode and Progressive Rollout
When you have a candidate agent improvement, the question is: how do you validate it with real traffic without risking user experience? Shadow mode answers this. The production agent handles all requests normally. Simultaneously, the candidate agent processes the same requests in the background. Both outputs are logged and compared, but only the production agent’s response reaches the user. This lets you evaluate quality, latency, cost, and safety on real traffic with zero user impact.
Shadow mode has limitations. It cannot measure user satisfaction directly because users never see the candidate’s output. It also doubles your compute cost during the evaluation period. For agents with side effects — those that write to databases, send emails, or invoke external APIs — you need a read-only shadow mode that intercepts side effects and logs the intended action instead of executing it. This adds engineering complexity but is non-negotiable for agents that take real-world actions.
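One way to implement side-effect interception is to wrap the agent's toolbox. The `ReadOnlyToolbox` class and tool names here are illustrative, not a specific framework:

```python
class ReadOnlyToolbox:
    """Wraps a toolbox so side-effecting tools are logged, not run.

    `toolbox` maps tool names to callables; which names count as
    side-effecting is declared by the caller.
    """
    def __init__(self, toolbox, side_effecting):
        self._toolbox = toolbox
        self._side_effecting = set(side_effecting)
        self.intended_actions = []

    def call(self, name, **kwargs):
        if name in self._side_effecting:
            # Record the intended action instead of executing it.
            self.intended_actions.append((name, kwargs))
            return {"status": "shadow-intercepted"}
        return self._toolbox[name](**kwargs)

tools = {
    "lookup_order": lambda order_id: {"status": "shipped"},
    "send_email": lambda to, body: "sent",
}
shadow = ReadOnlyToolbox(tools, side_effecting={"send_email"})
shadow.call("lookup_order", order_id="A1")       # executes normally
shadow.call("send_email", to="x@y.com", body="hi")  # logged only
```

Comparing the candidate's `intended_actions` against what production actually did is often more revealing than comparing final text outputs.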
Once shadow mode confirms the candidate is at least as good as production, switch to progressive rollout. Route 5% of traffic to the candidate. Monitor the four dimensions (quality, reliability, cost, safety) with automated alerts. If metrics hold, increase to 25%, then 50%, then 100%. At each stage, wait long enough to capture a statistically meaningful sample — for agents with high output variance, this may take days rather than hours. Define automated rollback triggers: if error rate increases by more than 2 percentage points, or if any safety metric degrades, automatically revert to the previous version.
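The rollback triggers can be encoded as a pure function over baseline and candidate metrics. The metric names are illustrative; the thresholds follow the rules above (>2-point error-rate increase, or any safety regression):

```python
def should_rollback(baseline, candidate, max_error_increase=0.02,
                    safety_metrics=("guardrail_trigger_rate",
                                    "pii_exposure_rate")):
    """True if the candidate trips any automated rollback trigger."""
    if candidate["error_rate"] - baseline["error_rate"] > max_error_increase:
        return True
    # Any safety metric degrading at all triggers a revert.
    return any(candidate[m] > baseline[m] for m in safety_metrics)

baseline = {"error_rate": 0.01, "guardrail_trigger_rate": 0.002,
            "pii_exposure_rate": 0.0}
good = {"error_rate": 0.012, "guardrail_trigger_rate": 0.002,
        "pii_exposure_rate": 0.0}
bad = {"error_rate": 0.05, "guardrail_trigger_rate": 0.002,
       "pii_exposure_rate": 0.0}
```

Keeping the triggers as data-driven code like this, rather than dashboard alerts a human must act on, is what makes the revert automatic.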
Teams often declare an A/B test conclusive after a few hundred requests. For non-deterministic agents, variance is high and effect sizes are small. You typically need thousands of requests per variant to detect a meaningful difference. Use power analysis to calculate the required sample size before starting the experiment, not after.
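A standard two-proportion power calculation — hard-coded here for a two-sided test at alpha = 0.05 with 80% power, using the normal approximation — shows why a few hundred requests per variant is rarely enough:

```python
import math

def sample_size_two_proportions(p1, p2):
    """Per-variant sample size to detect p1 vs p2 with a two-sided
    z-test at alpha = 0.05 and 80% power (normal approximation)."""
    z_alpha = 1.96  # two-sided, alpha = 0.05
    z_beta = 0.84   # power = 0.80
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Detecting a 2-point lift in success rate (90% -> 92%)
n = sample_size_two_proportions(0.90, 0.92)  # ~3,200 per variant
```

Smaller effect sizes blow this up quadratically: halving the detectable lift roughly quadruples the required sample.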
Closing the Loop
Production evaluation is not a phase — it is a continuous practice. The teams that maintain reliable agents over months and years share a common discipline: they treat their eval infrastructure with the same rigor as their production infrastructure. Eval suites are versioned, reviewed, and expanded with every incident. Monitoring dashboards are checked daily, not built and forgotten. Failure mode distributions are tracked over time to measure whether the system is improving or degrading.
The practical checklist for production-grade agent evaluation:
- Maintain a golden eval set of at least 100 examples from real production traffic, refreshed monthly
- Track quality, reliability, cost, and safety metrics at the span level, not just the run level
- Categorize every production failure into a defined taxonomy and track category distribution over time
- Use shadow mode for all candidate changes before progressive rollout
- Define automated rollback triggers with specific metric thresholds
- Run LLM-as-judge scoring on a sample of production outputs daily
- Review the eval suite itself quarterly — remove outdated examples, add new failure modes
The gap between a demo agent and a production agent is not model capability — it is evaluation rigor. Invest in your eval infrastructure early, and it pays compound returns as your system grows.