Evaluations

Score and compare AI agent outputs to measure quality and track improvements.

What are Evaluations?

Evaluations let you systematically score your AI agent's outputs using various metrics. Run evaluations online (in real time, as runs arrive) or offline (in batch, over past runs) to measure quality, detect regressions, and compare different configurations.

Evaluation Types

LLM-as-a-Judge

Use an LLM to evaluate outputs based on criteria like helpfulness, accuracy, or safety:

  • Relevance - Is the response relevant to the query?
  • Helpfulness - Does it actually help the user?
  • Accuracy - Is the information correct?
  • Safety - Is the response safe and appropriate?
  • Custom Criteria - Define your own evaluation prompts
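The criteria above can be sketched as a small judge function. This is an illustrative sketch only, not the TuringPulse implementation: the prompt template and the `call_llm` hook are assumptions, and `call_llm` stands in for whatever LLM client you inject.

```python
# Sketch of an LLM-as-a-judge evaluator. The prompt template and the
# call_llm hook are illustrative assumptions, not TuringPulse internals.

JUDGE_PROMPT = """Rate the response on {criterion} from 1 to 5.
Query: {query}
Response: {response}
Answer with a single integer."""

def judge(query, response, criterion, call_llm):
    """Score one output on one criterion using an injected LLM client."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, query=query,
                                 response=response)
    raw = call_llm(prompt)   # e.g. a thin wrapper around your provider's SDK
    return int(raw.strip())  # normalize the judge's reply to an integer score

# Example with a stubbed LLM that always answers "4":
score = judge("How do I reset my password?",
              "Click 'Forgot password' on the login screen.",
              "helpfulness", lambda prompt: "4")
```

Injecting the LLM call as a parameter keeps the evaluator testable with a stub, as shown in the last lines.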

Heuristic Metrics

Fast, deterministic metrics that don't require LLM calls:

  • JSON Validity - Check if output is valid JSON
  • Contains Keywords - Check for required terms
  • Length Checks - Validate response length
  • Regex Patterns - Match expected patterns
  • ROUGE/BLEU - Text similarity scores
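Most of these heuristics are a few lines each. The following sketch shows plausible standalone implementations of the first four checks (the function names are ours, not a TuringPulse API):

```python
import json
import re

def json_valid(output: str) -> bool:
    """JSON Validity: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def contains_keywords(output: str, keywords) -> bool:
    """Contains Keywords: are all required terms present (case-insensitive)?"""
    lowered = output.lower()
    return all(k.lower() in lowered for k in keywords)

def length_ok(output: str, min_chars: int = 1, max_chars: int = 2000) -> bool:
    """Length Check: is the response length within bounds?"""
    return min_chars <= len(output) <= max_chars

def matches_pattern(output: str, pattern: str) -> bool:
    """Regex Pattern: does the output match the expected pattern?"""
    return re.search(pattern, output) is not None
```

Because these checks are deterministic and cheap, they are well suited to running on every production response rather than a sample.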

Custom Evaluators

Write your own evaluation functions for domain-specific scoring.
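As a sketch of what a domain-specific evaluator could look like, here is a hypothetical one for a customer-support refund policy. The interface (a run dict in, a score dict out) is an assumption for illustration, not the actual TuringPulse plugin API:

```python
# Hypothetical custom-evaluator interface: the run-dict-in / score-dict-out
# shape is an assumption for illustration, not the TuringPulse plugin API.

def refund_policy_evaluator(run: dict) -> dict:
    """Domain-specific check: a refund answer must state the 30-day window."""
    output = run.get("output", "")
    passed = "30 days" in output or "30-day" in output
    return {"metric": "refund_policy", "score": 1.0 if passed else 0.0}
```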

Running Evaluations

Via UI

  1. Navigate to Analysis → Evaluations
  2. Select runs to evaluate (or use filters)
  3. Choose evaluation metrics
  4. Click Run Evaluation
  5. View results in the evaluation dashboard

Via CLI

Terminal
# Run evaluations via the TuringPulse CLI
tp evals run --config eval-config-id --workflow customer-support

# Or trigger via the REST API
# POST https://api.turingpulse.ai/v1/evals/evaluate
# {
#   "workflow_id": "customer-support",
#   "metrics": ["relevance", "helpfulness", "json_valid"],
#   "time_range": "24h",
#   "sample_size": 100
# }

Viewing Results

Evaluation results appear on the Evaluations page with:

  • Score Distribution - Histogram of scores
  • Trend Charts - Scores over time
  • Run Details - Click to see individual run scores
  • Comparison View - Compare different configurations

Configuring Evaluators

Configure evaluation settings in Analysis → Evaluations → Config:

  • LLM Provider - Which LLM to use for judging
  • Default Metrics - Metrics to run automatically
  • Sample Rate - Percentage of runs to evaluate
  • Custom Prompts - Define custom evaluation criteria
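One way a Sample Rate setting like the one above is commonly applied is deterministic hashing, so the same run is always in or out of the sample regardless of when the check runs. This is a sketch of that idea, not how TuringPulse implements it:

```python
# Illustrative sketch of applying a sample rate deterministically:
# hash the run ID into [0, 1) and compare against the configured rate.
import hashlib

def should_evaluate(run_id: str, sample_rate: float) -> bool:
    """Evaluate roughly `sample_rate` of runs, stable per run ID."""
    digest = hashlib.sha256(run_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

# sample_rate=1.0 evaluates every run; 0.0 evaluates none.
```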

Next Steps