Anomaly Detection Rules

Automatically detect outliers and unusual patterns in your AI agent behavior.

What are Anomalies?

Anomalies are individual data points or patterns that deviate significantly from expected behavior. Unlike drift (which is a gradual change), anomalies are point-in-time outliers that may indicate:

  • Errors - Unexpected failures or exceptions
  • Attacks - Prompt injection or abuse attempts
  • Edge Cases - Unusual inputs the agent struggles with
  • Resource Issues - Rate limits, timeouts, out-of-memory (OOM) errors
  • Data Quality - Malformed or unexpected inputs

Creating Anomaly Rules via UI

Step 1: Navigate to Anomalies

Go to Controls → Anomalies in the sidebar.

Step 2: Create New Rule

Click Create Rule and configure:

  • Name - Descriptive name (e.g., "Latency Spike Detector")
  • Workflow - Select specific workflow or "All Workflows"
  • Metric - Metric to monitor for anomalies
  • Detection Method - Z-Score, IQR, or Isolation Forest
  • Threshold - Anomaly score threshold
  • Minimum Samples - Data points needed before detection

Step 3: Configure Alerts

  • Severity - Warning or Critical
  • Auto-create Incident - Toggle for automatic incidents
  • Alert Channels - Select notification channels

Creating Anomaly Rules via API

create_anomaly_rule.py
import requests

# Create an anomaly detection rule
response = requests.post(
    "https://api.turingpulse.ai/api/v1/config/anomaly-rules",
    headers={"Authorization": "Bearer sk_live_..."},
    json={
        "name": "Latency Spike Detector",
        "workflow_id": "customer-support",
        "metric": "latency_ms",
        "method": "zscore",           # Detection method
        "threshold": 3.0,             # Z-score threshold
        "window": "1h",               # Lookback window
        "min_samples": 50,            # Minimum samples
        "severity": "warning",
        "auto_create_incident": False,
        "alert_channels": ["slack://alerts"],
        "enabled": True,
    }
)
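
If the request succeeds, the endpoint presumably returns the created rule as JSON. A minimal sanity check using standard requests idioms (the exact response schema is not documented here):

response.raise_for_status()   # surface 4xx/5xx errors instead of failing silently
print(response.json())        # assumed to echo the created rule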

Detection Methods

Z-Score (zscore)

Measures how many standard deviations a value is from the mean. Best for normally distributed metrics.

  • Threshold 2.0 - ~5% of data flagged (lenient)
  • Threshold 3.0 - ~0.3% of data flagged (standard)
  • Threshold 4.0 - ~0.01% of data flagged (strict)
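
To make these thresholds concrete, here is a minimal self-contained sketch of z-score flagging. It is illustrative only, not TuringPulse's internal implementation:

zscore_sketch.py
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

latencies = [120, 115, 130, 118, 2400, 125, 119]
# A single large spike also inflates the stdev, so a lenient threshold
# is needed to catch it in a tiny sample like this one
print(zscore_outliers(latencies, threshold=2.0))  # [2400]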

Interquartile Range (iqr)

Uses quartiles to identify outliers, making it more robust than z-score on skewed or otherwise non-normal distributions.

  • Threshold 1.5 - Standard outlier detection
  • Threshold 3.0 - Extreme outliers only
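
A comparable sketch of IQR flagging (again illustrative, using only the standard library):

iqr_sketch.py
import statistics

def iqr_outliers(values, threshold=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], where k = threshold."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - threshold * iqr, q3 + threshold * iqr
    return [v for v in values if v < low or v > high]

latencies = [120, 115, 130, 118, 2400, 125, 119]
# Unlike the z-score version, the default threshold catches the spike,
# because quartiles are not dragged around by the outlier itself
print(iqr_outliers(latencies))  # [2400]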

Isolation Forest (isolation_forest)

Machine learning-based detection. Best for complex, multi-dimensional anomalies.

  • Threshold 0.1 - ~10% flagged as anomalies
  • Threshold 0.05 - ~5% flagged
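
For intuition, here is a sketch using scikit-learn's IsolationForest, where contamination plays the role of the threshold. This illustrates the general technique, not the platform's implementation:

isolation_forest_sketch.py
import numpy as np
from sklearn.ensemble import IsolationForest

# One row per run, one column per metric; multiple columns are what
# make the method multi-dimensional
X = np.array([[120, 800], [115, 790], [130, 820], [118, 805],
              [2400, 4100], [125, 812], [119, 798]])  # [latency_ms, tokens]

model = IsolationForest(contamination=0.1, random_state=42)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
print(X[labels == -1])         # the latency/token spike is isolated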

Method            Best For               Pros                   Cons
zscore            Normal distributions   Simple, interpretable  Assumes normality
iqr               Skewed distributions   Robust to outliers     Less sensitive
isolation_forest  Complex patterns       Multi-dimensional      Less interpretable

Anomaly Rule Configuration

Option                Type   Description
name                  str    Human-readable name
workflow_id           str    Workflow to monitor, or "*" for all
metric                str    Metric to monitor (latency_ms, tokens, cost, etc.)
method                str    zscore, iqr, or isolation_forest
threshold             float  Detection threshold (method-specific)
window                str    Lookback window for baseline (e.g., "1h", "24h")
min_samples           int    Minimum samples before detection
severity              str    warning or critical
auto_create_incident  bool   Create an incident on detection
alert_channels        list   Notification channels to alert
enabled               bool   Whether the rule is active
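
Putting these options together, a hypothetical rule that watches token usage across every workflow with IQR detection might look like this (values are illustrative):

all_workflows_rule = {
    "name": "Token Usage Outliers",
    "workflow_id": "*",               # "*" monitors all workflows
    "metric": "tokens",
    "method": "iqr",
    "threshold": 1.5,                 # standard IQR outlier factor
    "window": "24h",                  # more stable daily baseline
    "min_samples": 100,
    "severity": "critical",
    "auto_create_incident": True,
    "alert_channels": ["slack://alerts"],
    "enabled": True,
}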

Viewing Anomalies

When anomalies are detected:

  1. Anomaly events appear in Operations → Overview → Anomalies tab
  2. Each anomaly shows the metric value, expected range, and severity
  3. Click on an anomaly to see the affected run details
  4. Notifications are sent to configured alert channels

Anomaly Event Details

  • Metric - Which metric triggered the anomaly
  • Value - The anomalous value
  • Expected Range - Normal range based on baseline
  • Anomaly Score - How anomalous (higher = more unusual)
  • Run ID - Link to the affected run
  • Timestamp - When the anomaly occurred
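
As a rough illustration only (the exact event schema is not shown here), a detected anomaly might carry fields like:

anomaly_event = {
    "metric": "latency_ms",
    "value": 4200,                        # the anomalous value
    "expected_range": [80, 950],          # normal range from the baseline window
    "anomaly_score": 4.7,                 # higher = more unusual
    "run_id": "run_abc123",               # hypothetical run identifier
    "timestamp": "2024-01-15T09:32:00Z",
}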

Anomaly Clustering

TuringPulse automatically clusters related anomalies:

  • Time-based - Anomalies occurring close together
  • Metric-based - Same metric across workflows
  • Cause-based - Similar root causes

💡 Incident Creation: When multiple anomalies are clustered, a single incident is created to avoid alert fatigue.

Best Practices

  • Start with Z-Score - Simple and effective for most metrics. Switch to IQR or Isolation Forest if you see too many false positives.
  • Set appropriate windows - Use 1h for real-time detection, 24h for more stable baselines.
  • Tune thresholds - Start lenient and tighten based on false positive rates.
  • Monitor multiple metrics - Create rules for latency, tokens, errors, and custom metrics.

Next Steps