Incidents

Track, investigate, and resolve production issues systematically.

What are Incidents?

Incidents are production issues that require attention. They can be created automatically from alerts or manually by team members.

Incident Sources

KPI Alerts - Threshold breaches with auto_create_incident enabled
Drift Events - Significant behavior changes
Anomaly Clusters - Groups of related anomalies
Error Spikes - Sudden increases in failures
Manual - Created by team members
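As an illustration of the KPI Alerts source, here is a minimal sketch of the auto-creation rule. Only the auto_create_incident flag comes from this page; the KpiAlert class, its other fields, the should_open_incident helper, and the "value above threshold" meaning of a breach are assumptions made for the example, not the product's API.

```python
from dataclasses import dataclass


@dataclass
class KpiAlert:
    """Hypothetical alert definition; only auto_create_incident is named in the docs."""
    name: str
    threshold: float
    auto_create_incident: bool = False


def should_open_incident(alert: KpiAlert, value: float) -> bool:
    """Return True when a breach should open an incident automatically."""
    # Assumption: a breach means the observed value exceeds the threshold.
    breached = value > alert.threshold
    return breached and alert.auto_create_incident


alert = KpiAlert("latency_p95_ms", threshold=500.0, auto_create_incident=True)
print(should_open_incident(alert, 620.0))  # True
print(should_open_incident(alert, 480.0))  # False
```

With the flag left disabled, a breach would presumably still raise the alert but not open an incident.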

Incident Lifecycle

An incident moves through the following stages:

Open - New incident, needs investigation
Investigating - Team is looking into it
Identified - Root cause found
Fixing - Solution in progress
Resolved - Issue fixed
Closed - Post-mortem complete
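The lifecycle can be modeled as a small state machine. The sketch below assumes stages advance strictly in the listed order, one step at a time; this page does not say whether stages can be skipped or an incident reopened, so treat that rule, and the Status, next_status, and can_transition names, as hypothetical.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "Open"
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    FIXING = "Fixing"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Enum members iterate in definition order, which matches the lifecycle order.
LIFECYCLE = list(Status)


def next_status(current: Status) -> Optional[Status]:
    """Return the stage after `current`, or None if `current` is the last stage."""
    index = LIFECYCLE.index(current)
    return LIFECYCLE[index + 1] if index + 1 < len(LIFECYCLE) else None


def can_transition(current: Status, target: Status) -> bool:
    """Assumption: only a move to the immediately following stage is allowed."""
    return next_status(current) is target


print(next_status(Status.OPEN))                        # Status.INVESTIGATING
print(can_transition(Status.OPEN, Status.RESOLVED))    # False
```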

Viewing Incidents

Navigate to Operations → Incidents to see:

Active Incidents - Open issues requiring attention
Recent Incidents - Recently resolved issues
All Incidents - Full history with filters

Incident List

Each incident in the list shows:

Title and description
Severity (Critical, High, Medium, Low)
Status and assignee
Created time and duration
Affected workflows
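For readers who think in data structures, the list columns above could be represented roughly as follows. This is a sketch, not the product's schema: the IncidentRow class, its field names, the resolved_at field, and the rule that duration runs from creation until resolution (or until now for an unresolved incident) are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IncidentRow:
    """Hypothetical shape of one row in the incident list."""
    title: str
    description: str
    severity: str                      # "Critical", "High", "Medium", or "Low"
    status: str
    created_at: datetime
    assignee: Optional[str] = None
    resolved_at: Optional[datetime] = None   # assumed field, not named in the docs
    affected_workflows: List[str] = field(default_factory=list)

    def duration(self, now: datetime) -> timedelta:
        """Time from creation to resolution, or to `now` if still unresolved."""
        end = self.resolved_at or now
        return end - self.created_at


row = IncidentRow(
    title="Checkout failures",
    description="Error rate above baseline",
    severity="High",
    status="Investigating",
    created_at=datetime(2024, 1, 1, 9, 0),
)
print(row.duration(now=datetime(2024, 1, 1, 10, 30)))  # 1:30:00
```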

Incident Detail

Click on an incident to see:

Timeline - Event history and updates
Related Alerts - Alerts that triggered this incident
Affected Runs - Runs impacted by the issue
Root Cause Analysis - Automated RCA results
Notes - Team comments and findings

Managing Incidents

The following actions are available on an incident:

Assign - Give a team member ownership of the incident
Update Status - Move through lifecycle stages
Add Notes - Document findings
Link Runs - Associate affected runs
Resolve - Mark as fixed

Severity Levels

Level	Description	Response Time
Critical	Service down, major impact	Immediate
High	Significant degradation	Within 1 hour
Medium	Partial impact	Within 4 hours
Low	Minor issue	Within 24 hours
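The response times in the table translate directly into a respond-by deadline. The sketch below treats "Immediate" as a zero-length window and measures the window from the incident's creation time; both choices, along with the RESPONSE_TARGETS and respond_by names, are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Response windows taken from the severity table.
RESPONSE_TARGETS = {
    "Critical": timedelta(0),        # "Immediate", modeled here as no delay
    "High": timedelta(hours=1),
    "Medium": timedelta(hours=4),
    "Low": timedelta(hours=24),
}


def respond_by(severity: str, created_at: datetime) -> datetime:
    """Return the latest time by which the incident should get a response."""
    return created_at + RESPONSE_TARGETS[severity]


created = datetime(2024, 1, 1, 9, 0)
print(respond_by("Medium", created))  # 2024-01-01 13:00:00
print(respond_by("Low", created))     # 2024-01-02 09:00:00
```

An unknown severity raises KeyError, which seems preferable to silently choosing a default window.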

Next Steps