Incident Response

When an incident fires, the first question is always "what changed?". OpsTrails' AI-powered investigation lets your AI assistant answer that question instantly by querying the operational timeline.

The Pattern

  1. Alert fires — PagerDuty, Opsgenie, or your monitoring system triggers an incident notification
  2. Engineer asks AI — Instead of manually checking CI/CD logs and dashboards, the engineer asks their AI assistant “what changed in the last 2 hours?”
  3. OpsTrails returns context — The AI queries the timeline and returns recent events with metric impact analysis
  4. Faster resolution — The engineer has the information they need in seconds, not minutes, cutting MTTR significantly

AI Conversation Example

Engineer:

“Production error rate is spiking. What changed in the last 2 hours?”

AI (via OpsTrails MCP):

I found 2 events in the last 2 hours for the production environment:

  1. deployment of api-service v2.1.0 at 14:30 UTC (45 minutes ago)
  2. config-change from terraform/production-infra at 14:15 UTC

Checking metrics around the api-service deployment: error rate increased from 0.1% to 2.3% in the 30 minutes after deploy. This deployment is the likely cause.
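The exchange above maps onto two MCP tool calls. Here is a minimal sketch of the request and response shapes; the tool names (query_events, get_metrics_around_event) appear in the MCP Tools Reference, but the parameter and field names below are illustrative assumptions, not the actual OpsTrails schema:

```python
# Hypothetical shapes for the two MCP tool calls behind the answer above.
# Tool names come from the MCP Tools Reference; parameter names are assumptions.
query_request = {
    "tool": "query_events",
    "arguments": {"environment": "production", "since": "2h"},
}

# The timeline returns the two events the AI reported:
query_response = [
    {"id": "evt_1", "type": "deployment", "service": "api-service",
     "version": "v2.1.0", "timestamp": "14:30 UTC"},
    {"id": "evt_2", "type": "config-change",
     "source": "terraform/production-infra", "timestamp": "14:15 UTC"},
]

# For each candidate event, the assistant then requests surrounding metrics:
impact_request = {
    "tool": "get_metrics_around_event",
    "arguments": {"event_id": "evt_1", "window_minutes": 30},
}
```

The AI compares the metric windows before and after each event and surfaces the one with a clear impact, which is how it singled out the api-service deployment.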

Key Questions AI Can Answer

Real-World Scenario Walkthrough

Here's a detailed step-by-step walkthrough of how OpsTrails accelerates incident response in practice:

  1. 3:12 PM — Alert fires. PagerDuty pages the on-call engineer: “Error rate above 5% for production API.”
  2. 3:13 PM — Engineer opens AI assistant. Instead of logging into three different dashboards, the engineer opens Claude (connected to OpsTrails via MCP) and asks: “Production error rate is spiking. What changed in the last 4 hours?”
  3. 3:13 PM — OpsTrails returns 2 events. The AI calls query_events and finds: (1) a deployment of api-service v2.5.1 at 2:45 PM and (2) a config-change from Terraform at 1:30 PM.
  4. 3:14 PM — AI checks impact metrics. The AI automatically calls get_metrics_around_event for each event. The api-service deployment shows error_rate jumping from 0.12% to 5.8%. The config-change shows no metric impact.
  5. 3:14 PM — Root cause identified. The AI reports: “The api-service v2.5.1 deployment at 2:45 PM is the likely cause. Error rate increased 48x immediately after.”
  6. 3:15 PM — Engineer rolls back. With a clear root cause in under 3 minutes, the engineer triggers a rollback to v2.5.0. Total MTTR: 3 minutes vs. the typical 15-30 minutes of manual investigation.
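The "48x" figure in step 5 is simply the ratio of the post-deploy error rate to the pre-deploy baseline:

```python
# Error rates from the walkthrough above, in percent.
before = 0.12  # error rate before the api-service v2.5.1 deploy
after = 5.8    # error rate in the window after the deploy

increase = after / before
print(f"Error rate increased {increase:.0f}x")  # roughly 48x
```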

Setting Up for Fast Incident Response

To get the most out of OpsTrails during incidents, set up these integrations first:

Connecting to Your Incident Workflow

Beyond querying events during incidents, you can also record incident events themselves on the timeline. Use the API or webhooks to create events with type: incident when incidents are opened or resolved in your incident management system. This creates a complete timeline that shows both the causes (deployments, config changes) and the effects (incidents).
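A minimal sketch of recording an incident event from your incident management system's webhook handler. Only the `type: incident` convention comes from the docs above; the endpoint URL, field names, and helper functions here are hypothetical, so check the API reference for the actual schema:

```python
import json
import urllib.request

# Placeholder endpoint and token -- substitute your real OpsTrails values.
OPSTRAILS_API = "https://api.opstrails.example/v1/events"
API_TOKEN = "YOUR_API_TOKEN"

def build_incident_event(title: str, status: str, environment: str) -> dict:
    """Build an incident event payload for the timeline.

    Field names other than `type` are illustrative assumptions.
    """
    return {
        "type": "incident",   # from the docs: incident events use type: incident
        "title": title,       # e.g. "Error rate above 5% for production API"
        "status": status,     # e.g. "opened" or "resolved"
        "environment": environment,
    }

def send_event(payload: dict) -> None:
    """POST the event to the timeline (fails against the placeholder URL)."""
    req = urllib.request.Request(
        OPSTRAILS_API,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()

# Record both the opening and the resolution, so the timeline shows
# causes (deployments, config changes) and effects (incidents) side by side.
opened = build_incident_event(
    "Error rate above 5% for production API", "opened", "production"
)
```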

For metric-powered impact analysis, see Connecting Providers. For details on MCP tools available during investigation, see the MCP Tools Reference.

Best Practices