Incident Response
When an incident fires, the first question is always “what changed?” OpsTrails' AI-powered investigation lets your AI assistant answer that question instantly by querying the operational timeline.
The Pattern
- Alert fires — PagerDuty, Opsgenie, or your monitoring system triggers an incident notification
- Engineer asks AI — Instead of manually checking CI/CD logs and dashboards, the engineer asks their AI assistant “what changed in the last 2 hours?”
- OpsTrails returns context — The AI queries the timeline and returns recent events with metric impact analysis
- Faster resolution — The engineer has the information they need in seconds, not minutes, cutting MTTR significantly
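Under the hood, “what changed in the last 2 hours?” reduces to a time-window query over the event timeline. A minimal sketch in Python — the in-memory timeline and field names here are illustrative, not the real OpsTrails API:

```python
from datetime import datetime, timedelta, timezone

# Illustrative in-memory timeline; real events come from the OpsTrails API / MCP tools.
TIMELINE = [
    {"type": "deployment", "source": "api-service", "version": "v2.1.0",
     "at": datetime(2024, 5, 1, 14, 30, tzinfo=timezone.utc)},
    {"type": "config-change", "source": "terraform/production-infra",
     "at": datetime(2024, 5, 1, 14, 15, tzinfo=timezone.utc)},
    {"type": "deployment", "source": "api-service", "version": "v2.0.9",
     "at": datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)},
]

def what_changed(events, now, window=timedelta(hours=2)):
    """Return events inside the window, newest first."""
    cutoff = now - window
    return sorted((e for e in events if e["at"] >= cutoff),
                  key=lambda e: e["at"], reverse=True)

now = datetime(2024, 5, 1, 15, 15, tzinfo=timezone.utc)
recent = what_changed(TIMELINE, now)  # only the two events from the past 2 hours
```

The older morning deploy falls outside the window, which is exactly why a scoped timeline query beats scrolling through full CI/CD logs during an incident.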
AI Conversation Example
Engineer:
“Production error rate is spiking. What changed in the last 2 hours?”
AI (via OpsTrails MCP):
I found 2 events in the last 2 hours for the production environment:
1. deployment of api-service v2.1.0 at 14:30 UTC (45 minutes ago)
2. config-change from terraform/production-infra at 14:15 UTC
Checking metrics around the api-service deployment: error rate increased from 0.1% to 2.3% in the 30 minutes after deploy. This deployment is the likely cause.
Key Questions AI Can Answer
- “What changed in the last hour?”
- “Were there any CRITICAL severity events today?”
- “Show me all events for the api-service source”
- “Did error rates change after the 2pm deployment?”
- “What was the last rollback?”
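Each of these questions maps to a small set of filters on a timeline query. A hypothetical sketch of that mapping — the parameter names are illustrative assumptions, not the actual MCP tool schema:

```python
def question_to_filters(question):
    """Translate a natural-language question into illustrative query filters."""
    q = question.lower()
    filters = {}
    if "critical" in q:
        filters["severity"] = "CRITICAL"      # severity filter
    if "rollback" in q:
        filters["type"] = "rollback"          # event-type filter
    if "last hour" in q:
        filters["since"] = "1h"               # time-window filter
    if "api-service" in q:
        filters["source"] = "api-service"     # source filter
    return filters
```

In practice the AI assistant performs this translation itself; the point is that every question above is answerable from a few structured filters over the same timeline.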
Real-World Scenario Walkthrough
Here's a detailed, step-by-step walkthrough of how OpsTrails accelerates incident response in practice:
- 3:12 PM — Alert fires. PagerDuty pages the on-call engineer: “Error rate above 5% for production API.”
- 3:13 PM — Engineer opens AI assistant. Instead of logging into three different dashboards, the engineer opens Claude (connected to OpsTrails via MCP) and asks: “Production error rate is spiking. What changed in the last 4 hours?”
- 3:13 PM — OpsTrails returns 2 events. The AI calls query_events and finds: (1) a deployment of api-service v2.5.1 at 2:45 PM and (2) a config-change from Terraform at 1:30 PM.
- 3:14 PM — AI checks impact metrics. The AI automatically calls get_metrics_around_event for each event. The api-service deployment shows error_rate jumping from 0.12% to 5.8%. The config-change shows no metric impact.
- 3:14 PM — Root cause identified. The AI reports: “The api-service v2.5.1 deployment at 2:45 PM is the likely cause. Error rate increased 48x immediately after.”
- 3:15 PM — Engineer rolls back. With a clear root cause in under 3 minutes, the engineer triggers a rollback to v2.5.0. Total MTTR: 3 minutes vs. the typical 15-30 minutes of manual investigation.
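The “48x” figure in the walkthrough is just a before/after ratio of the metric around the event, which is easy to verify:

```python
def impact_ratio(before_pct, after_pct):
    """How many times the metric grew in the window after the event."""
    return after_pct / before_pct

# Error rate before/after the api-service v2.5.1 deploy (values from the walkthrough).
ratio = impact_ratio(0.12, 5.8)
print(f"Error rate increased {ratio:.0f}x")  # roughly 48x
```

A ratio this large against a flat baseline for the other event is what lets the AI single out the deployment as the likely cause.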
Setting Up for Fast Incident Response
To get the most out of OpsTrails during incidents, set up these integrations first:
- CI/CD integration — Connect your deployment pipelines so every deploy is automatically recorded. This is the most critical integration — without deployment events, AI can't answer “what changed?”
- Analytics providers — Connect at least one analytics provider (Sentry, Datadog, etc.) to enable before/after metric comparisons. See Connecting Providers.
- Severity levels — Use LOW for routine deploys, MAJOR for significant releases, and CRITICAL for emergency changes. This helps AI prioritize events during investigation.
- Subjects for environments — Always set the subject field to the target environment (production, staging, etc.) so AI can filter events by environment when investigating.
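The severity and subject conventions are easiest to enforce at the point where events are created. A hedged sketch — the payload shape and field names are assumptions for illustration; consult the actual event schema:

```python
SEVERITIES = {"LOW", "MAJOR", "CRITICAL"}

def make_event(event_type, source, subject, severity="LOW", **details):
    """Build an event payload that always carries a severity and a subject.

    Payload shape is illustrative, not the canonical OpsTrails schema.
    """
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(SEVERITIES)}")
    if not subject:
        raise ValueError("subject (target environment) is required")
    return {"type": event_type, "source": source, "subject": subject,
            "severity": severity, "details": details}

event = make_event("deployment", "api-service", "production",
                   severity="MAJOR", version="v2.5.1")
```

Wrapping event creation in one helper like this keeps every pipeline emitting consistently tagged events, so filters like “CRITICAL events in production” always work during an incident.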
Connecting to Your Incident Workflow
Beyond querying events during incidents, you can also record incident events themselves on the timeline. Use the API or webhooks to create events with type: incident when incidents are opened or resolved in your incident management system. This creates a complete timeline that shows both the causes (deployments, config changes) and the effects (incidents).
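For example, a webhook handler in your incident management system might translate open/resolve notifications into timeline events like this — the field names, source value, and incident ID are hypothetical:

```python
import json

def incident_to_event(action, incident_id, environment, severity="CRITICAL"):
    """Map an incident open/resolve notification to an illustrative timeline event."""
    assert action in ("opened", "resolved")
    return {
        "type": "incident",
        "source": "pagerduty",      # wherever the incident originated (assumption)
        "subject": environment,     # lets AI filter incidents by environment
        "severity": severity,
        "details": {"action": action, "incident_id": incident_id},
    }

# Serialized body you would POST to the events API (endpoint not shown here).
body = json.dumps(incident_to_event("opened", "INC-4821", "production"))
```

Recording both the open and the resolve gives the timeline bracketing events, so a later query shows the deployment, the incident it triggered, and when it ended, all in one place.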
For metric-powered impact analysis, see Connecting Providers. For details on MCP tools available during investigation, see the MCP Tools Reference.
Best Practices
- Connect analytics providers — Impact analysis is most powerful when metrics are connected. See Connecting Providers.
- Use severity for incidents — Mark incident events with MAJOR or CRITICAL severity to make them easy to find.
- Track all change types — Don't just track deployments. Config changes, database migrations, and infrastructure updates are often the cause of incidents.
- Pair with deployment tracking — The Deployment Tracking pattern ensures every deploy is recorded, giving AI the data it needs during incidents
- Track releases for full context — Use the Release Management pattern to know exactly which versions are running in each environment