Incidents
Track, investigate, and resolve production issues systematically.
What are Incidents?
Incidents are production issues that require attention. They can be created automatically from alerts and other detected events, or manually by team members.
Incident Sources
- KPI Alerts - Threshold breaches with auto_create_incident enabled
- Drift Events - Significant behavior changes
- Anomaly Clusters - Groups of related anomalies
- Error Spikes - Sudden increase in failures
- Manual - Created by team members
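As a rough illustration of the first source, the sketch below models when a KPI alert would open an incident. Only the `auto_create_incident` flag comes from this page. The `IncidentSource` enum, the `KpiAlert` class, its `value` and `threshold` fields, and the rule that a breach means the value exceeds the threshold are all assumptions made for the example, not the product's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class IncidentSource(Enum):
    """The five incident sources listed above (enum values are hypothetical)."""
    KPI_ALERT = "kpi_alert"
    DRIFT_EVENT = "drift_event"
    ANOMALY_CLUSTER = "anomaly_cluster"
    ERROR_SPIKE = "error_spike"
    MANUAL = "manual"


@dataclass
class KpiAlert:
    """Hypothetical alert shape; only auto_create_incident is named in the docs."""
    name: str
    value: float
    threshold: float
    auto_create_incident: bool = False

    def breached(self) -> bool:
        # Assumption: a breach means the observed value exceeds the threshold.
        return self.value > self.threshold


def should_create_incident(alert: KpiAlert) -> bool:
    """An incident is opened only if the flag is enabled and the threshold is breached."""
    return alert.auto_create_incident and alert.breached()
```

Under these assumptions, an alert that breaches its threshold without `auto_create_incident` enabled would still fire as an alert but would not open an incident.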
Incident Lifecycle
- Open - New incident, needs investigation
- Investigating - Team is looking into it
- Identified - Root cause found
- Fixing - Solution in progress
- Resolved - Issue fixed
- Closed - Post-mortem complete
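The lifecycle above can be read as a simple state machine. The following is a minimal sketch that assumes incidents move forward one stage at a time in the listed order. The product may well allow skipping stages or reopening an incident, and the `Status`, `next_status`, and `can_transition` names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    """Lifecycle stages in the order given above."""
    OPEN = "Open"
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    FIXING = "Fixing"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Enum members iterate in definition order, which matches the lifecycle order.
LIFECYCLE = list(Status)


def next_status(current: Status) -> Optional[Status]:
    """Return the stage that follows `current`, or None if it is the last stage."""
    idx = LIFECYCLE.index(current)
    return LIFECYCLE[idx + 1] if idx + 1 < len(LIFECYCLE) else None


def can_transition(current: Status, target: Status) -> bool:
    """Assumption: only a single forward step is a valid transition."""
    return next_status(current) is target
```

With this strict ordering, moving from Fixing to Resolved is allowed, while jumping from Open straight to Resolved is rejected.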
Viewing Incidents
Navigate to Operations → Incidents to see:
- Active Incidents - Open issues requiring attention
- Recent Incidents - Recently resolved issues
- All Incidents - Full history with filters
Incident List
Each incident in the list shows:
- Title and description
- Severity (Critical, High, Medium, Low)
- Status and assignee
- Created time and duration
- Affected workflows
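The fields listed above could be represented as a record like the one sketched here. The `IncidentRow` class and its field names are hypothetical. The sketch also assumes that duration is measured from creation to resolution, or to the current time while the incident is still open; the page does not define how duration is calculated.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class IncidentRow:
    """Hypothetical record holding the fields shown in the incident list."""
    title: str
    description: str
    severity: str            # "Critical", "High", "Medium", or "Low"
    status: str
    created_at: datetime
    assignee: Optional[str] = None
    resolved_at: Optional[datetime] = None
    affected_workflows: List[str] = field(default_factory=list)

    def duration(self, now: Optional[datetime] = None) -> timedelta:
        # Assumption: duration runs from creation to resolution,
        # or to `now` (default: current UTC time) if still unresolved.
        end = self.resolved_at or now or datetime.now(timezone.utc)
        return end - self.created_at
```

Timestamps are assumed to be timezone-aware so that subtraction stays well defined.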
Incident Detail
Click on an incident to see:
- Timeline - Event history and updates
- Related Alerts - Alerts that triggered this incident
- Affected Runs - Runs impacted by the issue
- Root Cause Analysis - Automated RCA results
- Notes - Team comments and findings
Managing Incidents
The following actions are available on an incident:
- Assign - Assign to a team member
- Update Status - Move through lifecycle stages
- Add Notes - Document findings
- Link Runs - Associate affected runs
- Resolve - Mark as fixed
Severity Levels
| Level | Description | Response Time |
|---|---|---|
| Critical | Service down, major impact | Immediate |
| High | Significant degradation | Within 1 hour |
| Medium | Partial impact | Within 4 hours |
| Low | Minor issue | Within 24 hours |
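The response times in the table can be turned into concrete deadlines. The sketch below assumes that each window is measured from the incident's creation time and that "Immediate" means no delay at all. The `RESPONSE_TARGETS`, `respond_by`, and `is_overdue` names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Response windows from the severity table; "Immediate" is modeled as zero delay.
RESPONSE_TARGETS = {
    "Critical": timedelta(0),
    "High": timedelta(hours=1),
    "Medium": timedelta(hours=4),
    "Low": timedelta(hours=24),
}


def respond_by(severity: str, created_at: datetime) -> datetime:
    """Latest time by which a first response is expected."""
    return created_at + RESPONSE_TARGETS[severity]


def is_overdue(severity: str, created_at: datetime, now: datetime) -> bool:
    """True if `now` is past the response deadline for this severity."""
    return now > respond_by(severity, created_at)


if __name__ == "__main__":
    created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(respond_by("Medium", created).isoformat())
```

A check like `is_overdue` could drive escalation or reminders, though whether the product does this automatically is not stated here.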