Question 1

What does the investigation workflow look like?

Accepted Answer

Start from an alert or KPI spike, open the linked trace, and walk the DAG from user input through each LLM call, tool call, and retriever step. Inspect the prompts, outputs, and metadata at the span where latency or errors diverge from expected behavior. Attach the relevant eval outcomes to the incident so they are available for the postmortem.
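To make the "walk the DAG and find the divergent span" step concrete, here is a minimal sketch in Python. It is not the product's API: the `Span` fields, the per-span `baseline_ms`, and the 2x latency threshold are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative only."""
    span_id: str
    kind: str                 # e.g. "input", "llm", "tool", "retriever"
    latency_ms: float
    baseline_ms: float        # assumed typical latency for this step
    error: Optional[str] = None
    children: list[Span] = field(default_factory=list)


def walk(root: Span) -> Iterator[Span]:
    """Depth-first walk from the user-input span; shared children are visited once."""
    seen: set[str] = set()
    stack = [root]
    while stack:
        span = stack.pop()
        if span.span_id in seen:
            continue
        seen.add(span.span_id)
        yield span
        # Reverse so children are visited in their original order.
        stack.extend(reversed(span.children))


def divergent_spans(root: Span, ratio: float = 2.0) -> list[Span]:
    """Spans that errored or ran slower than `ratio` times their baseline."""
    return [
        s for s in walk(root)
        if s.error is not None or s.latency_ms > ratio * s.baseline_ms
    ]


if __name__ == "__main__":
    trace = Span("s1", "input", 5, 5, children=[
        Span("s2", "retriever", 40, 35),
        Span("s3", "llm", 4200, 900, children=[
            Span("s4", "tool", 30, 25, error="timeout"),
        ]),
    ])
    for s in divergent_spans(trace):
        print(s.span_id, s.kind, s.latency_ms, s.error)
```

The spans this returns are the ones whose prompts, outputs, and metadata are worth opening first.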

Question 2

How does alert deduplication work?

Accepted Answer

Related firing conditions are grouped within configurable time windows so that similar anomalies do not page repeatedly. Dedup respects severity: critical-path alerts can break through the grouping when correlation would otherwise hide a widespread outage.
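The grouping behavior can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the `Alert` fields, the 300-second default window, and the break-through rule (a critical alert pages if its group has so far only paged at a lower severity) are all hypothetical.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Alert:
    key: str        # assumed grouping key, e.g. workflow plus firing condition
    severity: str   # "info", "warning", or "critical"
    ts: float       # event time in seconds


class Deduplicator:
    """Suppress repeat pages per key within a time window (illustrative only)."""

    def __init__(self, window_s: float = 300.0) -> None:
        self.window_s = window_s
        # key -> (window start time, highest severity rank already paged)
        self._open: dict[str, tuple[float, int]] = {}

    def should_page(self, alert: Alert) -> bool:
        rank = SEVERITY_RANK[alert.severity]
        group = self._open.get(alert.key)

        # No open window for this key, or the window has expired: page and open one.
        if group is None or alert.ts - group[0] >= self.window_s:
            self._open[alert.key] = (alert.ts, rank)
            return True

        start, paged_rank = group
        # Assumed break-through rule: a critical alert pages even inside the
        # window if the group has only paged at a lower severity so far.
        if alert.severity == "critical" and rank > paged_rank:
            self._open[alert.key] = (start, rank)
            return True

        return False
```

With a 300-second window, a warning at t=0 pages, a second warning at t=60 is suppressed, and a critical alert at t=120 on the same key still pages.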

Question 3

How are escalation and correlation handled?

Accepted Answer

Escalation policies chain channels and timeouts: Slack first, then Teams or a webhook (e.g. PagerDuty) if the alert goes unacknowledged. Correlation links incidents that share a workflow, deployment version, or upstream dependency signal.
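As a rough sketch of both ideas, the Python below models an escalation policy as an ordered list of channel/timeout steps and treats two incidents as correlated when they share any of the three signals named above. The type names, field names, and timeout values are assumptions for illustration; no real Slack, Teams, or PagerDuty API is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    channel: str       # e.g. "slack", "teams", "webhook"
    timeout_s: float   # how long to wait for an ack before escalating


def channels_notified(policy: list[Step], acked_after_s: Optional[float]) -> list[str]:
    """Channels reached before the alert is acknowledged (None = never acked)."""
    notified: list[str] = []
    elapsed = 0.0
    for step in policy:
        # Stop escalating once an ack arrived before this step would fire.
        if acked_after_s is not None and acked_after_s < elapsed:
            break
        notified.append(step.channel)
        elapsed += step.timeout_s
    return notified


@dataclass
class Incident:
    incident_id: str
    workflow: str
    deploy_version: str
    upstream: frozenset   # names of upstream dependencies showing anomalies


def correlated(a: Incident, b: Incident) -> bool:
    """True if the incidents share a workflow, deploy version, or upstream signal."""
    return (
        a.workflow == b.workflow
        or a.deploy_version == b.deploy_version
        or bool(a.upstream & b.upstream)
    )


if __name__ == "__main__":
    policy = [Step("slack", 300), Step("teams", 300), Step("webhook", 600)]
    print(channels_notified(policy, acked_after_s=120))   # acked during the Slack step
    print(channels_notified(policy, acked_after_s=None))  # never acknowledged
```

Keeping the policy as plain data makes the channel order and timeouts easy to review and change without touching the escalation logic.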

Incident Response for AI Agents

Root Cause Analysis

Alert Routing

Incident Correlation

Trace-Level Investigation Built for Agents

Multi-Channel Alerts: Slack, Teams, Email, Webhook

Incident Correlation and Escalation

Frequently Asked Questions

Respond to Agent Incidents with Confidence