Evaluations

Score and compare AI agent outputs to measure quality and track improvements.

What are Evaluations?

Evaluations let you systematically score your AI agent's outputs using various metrics. Run evaluations online (in real time, as runs arrive) or offline (in batch, over past runs) to measure quality, detect regressions, and compare different configurations.

Evaluation Types

LLM-as-a-Judge

Use an LLM to evaluate outputs based on criteria like helpfulness, accuracy, or safety:

  • Relevance - Is the response relevant to the query?
  • Helpfulness - Does it actually help the user?
  • Accuracy - Is the information correct?
  • Safety - Is the response safe and appropriate?
  • Custom Criteria - Define your own evaluation prompts
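The criteria above can be sketched as a small judge function. This is an illustrative sketch only, not the TuringPulse implementation: the prompt template and the `call_llm` hook are assumptions, and `call_llm` stands in for whatever LLM client you inject.

```python
# Sketch of an LLM-as-a-judge evaluator. The prompt template and the
# call_llm hook are illustrative assumptions, not TuringPulse internals.

JUDGE_PROMPT = """Rate the response on {criterion} from 1 to 5.
Query: {query}
Response: {response}
Answer with a single integer."""

def judge(query, response, criterion, call_llm):
    """Score one output on one criterion using an injected LLM client."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, query=query,
                                 response=response)
    raw = call_llm(prompt)   # e.g. a thin wrapper around your provider's SDK
    return int(raw.strip())  # normalize the judge's reply to an integer score

# Example with a stubbed LLM that always answers "4":
score = judge("How do I reset my password?",
              "Click 'Forgot password' on the login screen.",
              "helpfulness", lambda prompt: "4")
```

Injecting the LLM call as a parameter keeps the evaluator testable with a stub, as shown in the last lines.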

Heuristic Metrics

Fast, deterministic metrics that don't require LLM calls:

  • JSON Validity - Check if output is valid JSON
  • Contains Keywords - Check for required terms
  • Length Checks - Validate response length
  • Regex Patterns - Match expected patterns
  • ROUGE/BLEU - Text similarity scores
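Most of these heuristics are a few lines each. The following sketch shows plausible standalone implementations of the first four checks (the function names are ours, not a TuringPulse API):

```python
import json
import re

def json_valid(output: str) -> bool:
    """JSON Validity: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def contains_keywords(output: str, keywords) -> bool:
    """Contains Keywords: are all required terms present (case-insensitive)?"""
    lowered = output.lower()
    return all(k.lower() in lowered for k in keywords)

def length_ok(output: str, min_chars: int = 1, max_chars: int = 2000) -> bool:
    """Length Check: is the response length within bounds?"""
    return min_chars <= len(output) <= max_chars

def matches_pattern(output: str, pattern: str) -> bool:
    """Regex Pattern: does the output match the expected pattern?"""
    return re.search(pattern, output) is not None
```

Because these checks are deterministic and cheap, they are well suited to running on every production response rather than a sample.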

Custom Evaluators

Write your own evaluation functions for domain-specific scoring.
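As a sketch of what a domain-specific evaluator could look like, here is a hypothetical one for a customer-support refund policy. The interface (a run dict in, a score dict out) is an assumption for illustration, not the actual TuringPulse plugin API:

```python
# Hypothetical custom-evaluator interface: the run-dict-in / score-dict-out
# shape is an assumption for illustration, not the TuringPulse plugin API.

def refund_policy_evaluator(run: dict) -> dict:
    """Domain-specific check: a refund answer must state the 30-day window."""
    output = run.get("output", "")
    passed = "30 days" in output or "30-day" in output
    return {"metric": "refund_policy", "score": 1.0 if passed else 0.0}
```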

Running Evaluations

Via UI

  1. Navigate to Analysis → Evaluations
  2. Select runs to evaluate (or use filters)
  3. Choose evaluation metrics
  4. Click Run Evaluation
  5. View results in the evaluation dashboard

Via CLI

Terminal
# Run evaluations via the TuringPulse CLI
tp evals run --config eval-config-id --workflow customer-support

# Or trigger via the REST API
# POST https://api.turingpulse.ai/v1/evals/evaluate
# {
#   "workflow_id": "customer-support",
#   "metrics": ["relevance", "helpfulness", "json_valid"],
#   "time_range": "24h",
#   "sample_size": 100
# }

Viewing Results

Evaluation results appear on the Evaluations page with:

  • Score Distribution - Histogram of scores
  • Trend Charts - Scores over time
  • Run Details - Click to see individual run scores
  • Comparison View - Compare different configurations

Configuring Evaluators

Configure evaluation settings in Analysis → Evaluations → Config:

  • LLM Provider - Which LLM to use for judging
  • Default Metrics - Metrics to run automatically
  • Sample Rate - Percentage of runs to evaluate
  • Custom Prompts - Define custom evaluation criteria
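One way a Sample Rate setting like the one above is commonly applied is deterministic hashing, so the same run is always in or out of the sample regardless of when the check runs. This is a sketch of that idea, not how TuringPulse implements it:

```python
# Illustrative sketch of applying a sample rate deterministically:
# hash the run ID into [0, 1) and compare against the configured rate.
import hashlib

def should_evaluate(run_id: str, sample_rate: float) -> bool:
    """Evaluate roughly `sample_rate` of runs, stable per run ID."""
    digest = hashlib.sha256(run_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

# sample_rate=1.0 evaluates every run; 0.0 evaluates none.
```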

Next Steps