Incident Response
When an incident fires, the first question is always “what changed?” OpsTrails lets an AI assistant answer that question right away by querying the operational timeline.
The Pattern
- Alert fires — PagerDuty, Opsgenie, or your monitoring system triggers an incident notification
- Engineer asks AI — Instead of manually checking CI/CD logs and dashboards, the engineer asks their AI assistant “what changed in the last 2 hours?”
- OpsTrails returns context — The AI queries the timeline and returns recent events with metric impact analysis
- Faster resolution — The engineer gets the relevant changes in seconds instead of piecing them together from logs and dashboards, which shortens time to resolution (MTTR)
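The query step in this pattern amounts to a time-window filter over recorded events. The following is a minimal sketch of that idea in Python, not the OpsTrails API: the `Event` record, its field names, the `events_since` helper, and the sample data are all hypothetical and only illustrate the kind of lookup the assistant performs on the engineer's behalf.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical timeline record; field names are assumptions."""
    kind: str          # e.g. "deployment", "config-change"
    source: str        # e.g. "api-service"
    environment: str   # e.g. "production"
    timestamp: datetime


def events_since(events, now, window, environment):
    """Return events in `environment` from the last `window`, newest first."""
    cutoff = now - window
    recent = [
        e for e in events
        if e.environment == environment and cutoff <= e.timestamp <= now
    ]
    return sorted(recent, key=lambda e: e.timestamp, reverse=True)


def utc(hour, minute=0):
    """Build a UTC timestamp on an arbitrary sample date."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Invented sample data: two recent production events, one old, one in staging.
now = utc(15, 15)
timeline = [
    Event("deployment", "api-service", "production", utc(9, 0)),
    Event("deployment", "api-service", "staging", utc(14, 0)),
    Event("config-change", "terraform/production-infra", "production", utc(14, 15)),
    Event("deployment", "api-service", "production", utc(14, 30)),
]

recent = events_since(timeline, now, timedelta(hours=2), "production")
for e in recent:
    print(f"{e.kind} from {e.source} at {e.timestamp:%H:%M} UTC")
```

The morning deployment and the staging deployment are filtered out, leaving only the two production changes inside the two-hour window.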
AI Conversation Example
Engineer:
“Production error rate is spiking. What changed in the last 2 hours?”
AI (via OpsTrails MCP):
I found 2 events in the last 2 hours for the production environment:
1. deployment of api-service v2.1.0 at 14:30 UTC (45 minutes ago)
2. config-change from terraform/production-infra at 14:15 UTC
Checking metrics around the api-service deployment: error rate increased from 0.1% to 2.3% in the 30 minutes after deploy. This deployment is the likely cause.
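The impact check in the reply above boils down to comparing a metric over a window before an event with the same window after it. Here is a minimal sketch of that comparison, assuming error-rate samples are available as (timestamp, value) pairs. The `before_after` function, the 30-minute window, and the sample values are illustrative assumptions, not OpsTrails internals.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def before_after(samples, event_time, window):
    """Average a metric over `window` before and after `event_time`.

    `samples` is an iterable of (timestamp, value) pairs. Returns
    (before_avg, after_avg); either is None if its window has no samples.
    """
    before = [v for t, v in samples if event_time - window <= t < event_time]
    after = [v for t, v in samples if event_time <= t < event_time + window]
    return (
        mean(before) if before else None,
        mean(after) if after else None,
    )


# Arbitrary sample date; only the time of day matters for the illustration.
deploy = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)

# Invented error-rate samples (percent), one every 10 minutes around the deploy.
samples = [
    (deploy + timedelta(minutes=offset), rate)
    for offset, rate in [
        (-30, 0.1), (-20, 0.1), (-10, 0.1),
        (0, 1.9), (10, 2.4), (20, 2.6),
    ]
]

before_avg, after_avg = before_after(samples, deploy, timedelta(minutes=30))
print(f"error rate: {before_avg:.1f}% -> {after_avg:.1f}%")
```

A jump like this shortly after a deployment makes that deployment a likely cause rather than a proven one, so other changes in the same window are still worth ruling out.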
Key Questions AI Can Answer
- “What changed in the last hour?”
- “Were there any CRITICAL severity events today?”
- “Show me all events for the api-service source”
- “Did error rates change after the 2pm deployment?”
- “What was the last rollback?”
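Most of these questions correspond to a simple filter over event fields such as severity, source, and type. The sketch below shows three of them under assumed field names. The `Event` record is hypothetical and does not reflect the OpsTrails schema, and the `"LOW"` severity value is a placeholder chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical timeline record; field names are assumptions."""
    kind: str
    source: str
    severity: str      # "LOW" is a placeholder value, not a documented level
    timestamp: datetime


def utc(day, hour, minute=0):
    """Build a UTC timestamp in an arbitrary sample month."""
    return datetime(2024, 1, day, hour, minute, tzinfo=timezone.utc)


# Invented sample data spanning two days.
timeline = [
    Event("deployment", "api-service", "CRITICAL", utc(1, 16)),
    Event("deployment", "api-service", "LOW", utc(2, 9)),
    Event("rollback", "api-service", "MAJOR", utc(2, 10, 30)),
    Event("config-change", "terraform/production-infra", "LOW", utc(2, 14, 15)),
    Event("deployment", "api-service", "CRITICAL", utc(2, 14, 30)),
]
today = utc(2, 15).date()

# "Were there any CRITICAL severity events today?"
critical_today = [
    e for e in timeline
    if e.severity == "CRITICAL" and e.timestamp.date() == today
]

# "Show me all events for the api-service source"
api_events = [e for e in timeline if e.source == "api-service"]

# "What was the last rollback?" (None if no rollback was recorded)
last_rollback = max(
    (e for e in timeline if e.kind == "rollback"),
    key=lambda e: e.timestamp,
    default=None,
)
```

In practice the assistant would translate the natural-language question into filters like these, which is why consistent severity and source values on recorded events matter.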
Best Practices
- Connect analytics providers — Impact analysis is most powerful when metrics are connected. See Connecting Providers.
- Use severity for incidents — Mark incident events with MAJOR or CRITICAL severity to make them easy to find.
- Track all change types — Don't just track deployments. Config changes, database migrations, and infrastructure updates are often the cause of incidents.