Incidents

Track, investigate, and resolve production issues systematically.

What are Incidents?

Incidents are production issues that require attention. They can be created automatically from alerts or manually by team members.

Incident Sources

  • KPI Alerts - Threshold breaches with auto_create_incident enabled
  • Drift Events - Significant behavior changes
  • Anomaly Clusters - Groups of related anomalies
  • Error Spikes - Sudden increase in failures
  • Manual - Created by team members

Incident Lifecycle

  1. Open - New incident, needs investigation
  2. Investigating - Team is looking into it
  3. Identified - Root cause found
  4. Fixing - Solution in progress
  5. Resolved - Issue fixed
  6. Closed - Post-mortem complete
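The lifecycle above can be sketched as a small state machine. This is a minimal illustration, not the product's implementation: the Status enum and next_status helper are hypothetical names, and the sketch assumes incidents move strictly forward one stage at a time, which the stage list suggests but does not state.

```python
from enum import Enum


class Status(Enum):
    """Lifecycle stages, numbered in the documented order."""
    OPEN = 1
    INVESTIGATING = 2
    IDENTIFIED = 3
    FIXING = 4
    RESOLVED = 5
    CLOSED = 6


def next_status(current: Status) -> Status:
    """Return the stage that follows `current`.

    Assumption: stages are never skipped or reverted.
    """
    if current is Status.CLOSED:
        raise ValueError("Closed is the final stage")
    return Status(current.value + 1)


if __name__ == "__main__":
    stage = Status.OPEN
    while stage is not Status.CLOSED:
        following = next_status(stage)
        print(f"{stage.name} -> {following.name}")
        stage = following
```

If your process allows reopening an incident or skipping a stage, replace the linear rule with an explicit table of allowed transitions.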

Viewing Incidents

Navigate to Operations → Incidents to see:

  • Active Incidents - Open issues requiring attention
  • Recent Incidents - Recently resolved issues
  • All Incidents - Full history with filters

Incident List

Each incident in the list shows:

  • Title and description
  • Severity (Critical, High, Medium, Low)
  • Status and assignee
  • Created time and duration
  • Affected workflows
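For teams that mirror incident data in their own tooling, the list fields above map naturally onto a simple record. The sketch below is illustrative only: the Incident class, its field names, and the active_incidents helper are hypothetical, and it assumes "active" means any status before Resolved.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional

SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]
# Assumption: every stage before Resolved counts as active.
ACTIVE_STATUSES = {"Open", "Investigating", "Identified", "Fixing"}


@dataclass
class Incident:
    title: str
    description: str
    severity: str
    status: str
    created_at: datetime
    assignee: Optional[str] = None
    affected_workflows: List[str] = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    def duration(self, now: datetime) -> timedelta:
        """Time from creation until resolution, or until `now` if unresolved."""
        end = self.resolved_at or now
        return end - self.created_at


def active_incidents(incidents: List[Incident]) -> List[Incident]:
    """Return active incidents, most severe first, then oldest first."""
    active = [i for i in incidents if i.status in ACTIVE_STATUSES]
    return sorted(
        active,
        key=lambda i: (SEVERITY_ORDER.index(i.severity), i.created_at),
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        Incident("Slow checkout", "Latency above target", "Medium", "Open",
                 now - timedelta(hours=3), affected_workflows=["checkout"]),
        Incident("API outage", "All requests failing", "Critical", "Fixing",
                 now - timedelta(minutes=20), assignee="alex"),
    ]
    for incident in active_incidents(sample):
        print(incident.severity, incident.title, incident.duration(now))
```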

Incident Detail

Click on an incident to see:

  • Timeline - Event history and updates
  • Related Alerts - Alerts that triggered this incident
  • Affected Runs - Runs impacted by the issue
  • Root Cause Analysis - Automated RCA results
  • Notes - Team comments and findings

Managing Incidents

The following actions are available on an incident:

  • Assign - Assign to a team member
  • Update Status - Move through lifecycle stages
  • Add Notes - Document findings
  • Link Runs - Associate affected runs
  • Resolve - Mark as fixed

Severity Levels

Level     Description                  Response Time
Critical  Service down, major impact   Immediate
High      Significant degradation      Within 1 hour
Medium    Partial impact               Within 4 hours
Low       Minor issue                  Within 24 hours
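The response times above can be turned into a deadline check. This is a rough sketch under stated assumptions: RESPONSE_WINDOWS, response_deadline, and is_response_overdue are hypothetical names, "Immediate" is modeled as a zero-length window, and the clock is assumed to start when the incident is created.

```python
from datetime import datetime, timedelta, timezone

# Response windows taken from the severity table.
# Assumption: "Immediate" is treated as a zero-length window.
RESPONSE_WINDOWS = {
    "Critical": timedelta(0),
    "High": timedelta(hours=1),
    "Medium": timedelta(hours=4),
    "Low": timedelta(hours=24),
}


def response_deadline(severity: str, created_at: datetime) -> datetime:
    """Return the latest time by which a response is expected."""
    return created_at + RESPONSE_WINDOWS[severity]


def is_response_overdue(severity: str, created_at: datetime, now: datetime) -> bool:
    """True if `now` is past the response deadline for this severity."""
    return now > response_deadline(severity, created_at)


if __name__ == "__main__":
    created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    check_time = created + timedelta(hours=2)
    for level in RESPONSE_WINDOWS:
        deadline = response_deadline(level, created)
        overdue = is_response_overdue(level, created, check_time)
        print(f"{level}: respond by {deadline.isoformat()}, overdue={overdue}")
```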

Next Steps