
How AI Runs Our War Rooms

When a P1 hits, an AI agent is already reviewing deployments, scanning logs, and opening a war room before the first human joins. Here's how we built an incident management system where agents do the heavy lifting — from detection to post-mortem.

Jan Bubik
Head of Application Support
April 23, 2026 · 8 min read

When a P1 hits at 3 AM, nobody wants to spend the first ten minutes figuring out what changed. We built a system where they don't have to.

The Old Way Was Too Slow

Incident response at Rohlik used to look like incident response everywhere: an alert fires, someone acknowledges it, a Slack thread starts, people trickle in, someone eventually checks the deploy log, someone else opens Datadog, and thirty minutes later you're still orienting. The median time from alert to "we know what changed" was longer than the median time to fix it. That's backwards.

The problem wasn't people. The problem was that every incident started with the same twenty minutes of manual reconnaissance — the same Datadog queries, the same deploy history checks, the same "who shipped something in the last hour?" questions. Humans are good at judgment calls. They're bad at repetitive context assembly under pressure. So we automated the context assembly.


What Happens When an Incident Starts

Today, our incident lifecycle is managed through incident.io, deeply integrated with Slack, Datadog, and our deployment infrastructure. When an incident is declared — whether by a human or triggered automatically from an alert — several things happen simultaneously, with no human intervention required.

The War Room Opens Instantly

The moment an incident is created, incident.io automatically provisions a dedicated Slack channel. This isn't a thread buried in #general — it's a purpose-built war room with a structured name, pinned context, and the right people already invited based on the affected service.

Every incident at Rohlik gets its own dedicated war room from second zero.

An AI Agent Starts Investigating Immediately

This is where it gets interesting. The moment the incident is created, an AI investigation agent spins up automatically. It doesn't wait for a human to ask questions. It starts working immediately, pulling from every telemetry source we have:

  • Datadog Logs — production microservice logs across all our services
  • Datadog APM — distributed traces across HTTP, database, queue, and gRPC spans
  • Datadog Metrics — infrastructure and APM metrics covering HTTP services, GKE containers, Cloud SQL, RabbitMQ, JVM health
  • Change Events — code deployments and configuration changes, filterable by component, environment, and pull request

The agent correlates all of this automatically. It checks: what deployed in the last hour? Are error rates spiking on a specific service? Is there a pattern in the traces? Are any database metrics anomalous?
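To make the correlation step concrete, here's a minimal sketch of that logic in Python. The data shapes, helper names, and sample values are ours for illustration; the real agent works against the Datadog APIs, not in-memory lists.

```python
from datetime import datetime, timedelta

def recent_deploys(changes, now, window=timedelta(hours=1)):
    """Change events (service, deployed_at) within the lookback window."""
    return [c for c in changes if now - c["deployed_at"] <= window]

def spiking_services(error_rates, threshold=0.05):
    """Services whose current error rate exceeds the threshold."""
    return {svc for svc, rate in error_rates.items() if rate > threshold}

def suspects(changes, error_rates, now):
    """Deploys in the last hour on services that are currently spiking."""
    spiking = spiking_services(error_rates)
    return [c for c in recent_deploys(changes, now) if c["service"] in spiking]

# Illustrative data: one fresh deploy on a spiking service, one stale deploy.
now = datetime(2026, 4, 23, 3, 0)
changes = [
    {"service": "checkout", "deployed_at": now - timedelta(minutes=20)},
    {"service": "search",   "deployed_at": now - timedelta(hours=3)},
]
error_rates = {"checkout": 0.12, "search": 0.01}
print(suspects(changes, error_rates, now))  # flags the checkout deploy
```

Against the sample data, the 20-minute-old checkout deploy is the only suspect: it's the intersection of "changed recently" and "erroring now," which is exactly the question every responder used to answer by hand.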

Within minutes — not the thirty it used to take — the war room has a working hypothesis. By the time the first human joins the channel, the AI has already narrowed the investigation to a specific service, a specific deployment, or a specific infrastructure change. The responder's job shifts from finding the problem to confirming it and deciding what to do about it.


The Severity and Status Model

Not every incident is a P1. Our severity model is calibrated to match business impact:

  • P1 — Critical: full outage or data breach, very high customer impact. Immediate response: war room, all hands.
  • P2 — High: significant impact; workarounds may exist. Immediate response; non-critical cases handled within working hours.
  • P3 — Medium: low impact; most customers unaffected. Handled within working hours.

Every incident moves through a structured lifecycle: Triage → Investigating → Fixing → Monitoring → Closed — or into the post-incident flow: Documenting → Reviewing.
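As a sketch, the lifecycle can be modeled as a small transition table. The statuses below come straight from our flow; the specific allowed edges are an illustrative reading of it.

```python
# Allowed status transitions; post-incident flow hangs off Closed.
TRANSITIONS = {
    "Triage":        {"Investigating"},
    "Investigating": {"Fixing"},
    "Fixing":        {"Monitoring"},
    "Monitoring":    {"Closed"},
    "Closed":        {"Documenting"},  # optional post-incident flow
    "Documenting":   {"Reviewing"},
    "Reviewing":     set(),
}

def can_transition(current, target):
    """True if the lifecycle permits moving from current to target."""
    return target in TRANSITIONS.get(current, set())
```

Encoding the lifecycle as data rather than prose is what lets tooling enforce it: an incident can't skip from Fixing straight to Closed without passing through Monitoring.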

The AI agent generates real-time status updates throughout. These aren't just timestamps — they're narrative summaries of what's known, what's changed, and what's being done. Responders joining the channel late can read the update thread and be fully oriented in under a minute.


Post-Incident: Where AI Really Shines

Resolving the incident is only half the story. The other half — the part most teams skip or do badly — is learning from it. This is where our agentic approach has the biggest impact.

AI-Generated Post-Mortems

When an incident moves to the Documenting status, the AI doesn't stop working. It generates a structured post-mortem automatically, pulling from:

  • The full timeline of status updates
  • All messages in the war room channel
  • The telemetry data it queried during the investigation
  • The root cause hypothesis and resolution steps

The output is a draft post-mortem that's 80% complete before the PostMortem Owner even opens it. The PostMortem Owner — a dedicated incident role we assign during response — shifts from writing the post-mortem to reviewing it.
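A rough sketch of that assembly step, with illustrative section names and input shape — the real generator works from incident.io data, not a dict:

```python
# Rough sketch: assemble a draft post-mortem from the sources listed above.
# Section names and the input shape are illustrative, not our real schema.
def draft_postmortem(incident):
    lines = [f"# Post-mortem: {incident['title']}", "", "## Timeline"]
    lines += [f"- {t['at']}: {t['update']}" for t in incident["timeline"]]
    lines += ["", "## Root cause (AI hypothesis, pending owner review)",
              incident["hypothesis"], "", "## Resolution"]
    lines += [f"- {step}" for step in incident["resolution_steps"]]
    return "\n".join(lines)

incident = {
    "title": "Order creation failures after a checkout deploy",
    "timeline": [
        {"at": "03:02", "update": "Error rate spike on checkout"},
        {"at": "03:09", "update": "Suspect deploy identified, rollback started"},
    ],
    "hypothesis": "A checkout deploy changed price field handling.",
    "resolution_steps": ["Rolled back checkout to the previous release"],
}
draft = draft_postmortem(incident)  # a mostly-complete starting point for review
```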

Automated Debrief Scheduling

Post-mortems aren't just documents — they're meetings. For every P1 and significant P2, a debrief is automatically scheduled with the incident participants. The agenda is pre-populated from the AI-generated post-mortem. The PostMortem Owner drives the session, but the preparation is done.

Follow-Ups That Actually Get Done

The biggest failure mode of traditional incident management isn't the response — it's the follow-through. Teams write action items in a post-mortem document, and six months later half of them are forgotten.

We solved this by making follow-ups first-class objects in incident.io, automatically synced to Linear (our project management tool). Every follow-up has:

  • An owner — assigned to a specific engineer, not a team
  • A priority — Urgent, High, Medium, or Low
  • A status — tracked through completion
  • A link back to the incident — so the context is never lost

The AI agent proposes follow-ups based on the root cause analysis. If a deployment caused the incident, it might suggest: "Add pre-deployment validation for price field handling." If a missing alert delayed detection, it might suggest: "Create Datadog monitor for order creation failure rate."

These aren't vague "improve monitoring" action items. They're specific, actionable, and routed to the team that owns the affected service.
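In code terms, a follow-up looks roughly like this. The field names mirror the list above; the class itself and the sample values are illustrative, since the real objects live in incident.io and sync to Linear.

```python
from dataclasses import dataclass
from enum import Enum

class Priority(Enum):
    URGENT = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4

@dataclass
class FollowUp:
    title: str
    owner: str              # a specific engineer, not a team
    priority: Priority
    status: str = "todo"    # tracked through completion
    incident_url: str = ""  # link back so the context is never lost

# Hypothetical example of an AI-proposed follow-up.
fu = FollowUp(
    title="Create Datadog monitor for order creation failure rate",
    owner="jane.doe",
    priority=Priority.HIGH,
    incident_url="https://example.invalid/incidents/1234",
)
```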

SLO and CUJ Improvements

Follow-ups don't just fix the immediate problem. The AI agent also identifies patterns across incidents that suggest systemic improvements:

  • SLO gaps — if an incident reveals that a service doesn't have adequate SLOs, a follow-up is created to define them
  • CUJ (Critical User Journey) coverage — if a user-facing flow failed without detection, a follow-up is created to instrument it
  • Monitoring gaps — if the incident was detected by a human before an alert fired, a follow-up is created to close the detection gap

These systemic improvements are automatically routed to the responsible team. The Data Engineering team gets follow-ups about pipeline monitoring. The Platform team gets follow-ups about infrastructure alerting. The cell that owns the affected service gets follow-ups about application-level SLOs.
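The routing logic is conceptually simple. A sketch, using example category and team names rather than our real org chart:

```python
# Illustrative routing table: systemic follow-up category -> owning team.
ROUTES = {
    "pipeline-monitoring": "Data Engineering",
    "infrastructure-alerting": "Platform",
}

def route(category, service_owner):
    """Systemic categories go to the responsible platform team;
    everything else falls through to the cell that owns the service."""
    return ROUTES.get(category, service_owner)
```

The fallthrough is the important design choice: anything not claimed by a central team defaults to the owning cell, so no follow-up is ever left unrouted.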


What We Learned

Agents are best at the boring parts. The highest-value thing the AI does isn't making brilliant diagnostic leaps — it's doing the twenty minutes of rote context assembly that every incident used to start with. Checking deploy logs, querying metrics, correlating timelines. This is exactly the kind of work humans hate doing under pressure and agents do well.

Post-mortems improve when the barrier drops. When writing a post-mortem means starting from a pre-filled draft instead of a blank page, teams actually write them. Every P1 and significant P2 gets a proper post-mortem now. That wasn't true before.

Follow-ups need to be tracked in the same system teams already use. We tried tracking follow-ups in a wiki. Nobody looked at the wiki. When we moved them to Linear — where engineers already manage their sprint work — completion rates went up dramatically. The AI-to-Linear sync made this seamless.

Escalation paths must be owned by teams, not by a central ops function. Terraform-managed escalation paths mean every cell owns their on-call routing. When a team's composition changes, they update their escalation path in code. No tickets to an ops team. No stale pager lists.

The war room is the product. Everything else — the AI investigation, the escalation routing, the post-mortem generation — exists to make the war room effective. The war room is where humans make judgment calls. The system's job is to give them the best possible context for those calls.


What's Next

We're pushing toward fully autonomous triage for P3 incidents — the AI agent resolves and documents without waking anyone up. For P1s and P2s, humans will always be in the loop for judgment calls, but the goal is that by the time a human joins the war room, the investigation is already 80% complete.

The incident management system is also feeding data back into our deployment pipeline. Incidents caused by deployments are automatically correlated with the specific PR and release. Over time, this builds a risk model: which services, which types of changes, and which times of day carry the highest incident probability. That model will eventually gate deployments — not blocking them, but flagging high-risk deploys for additional review before they ship.
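The risk model is still being built, but its shape looks something like this. The features are the real ones named above (historical incident rate, change type, timing); the weights and threshold are invented for illustration.

```python
RISKY_HOURS = set(range(16, 20))  # e.g. hours around peak ordering traffic

def deploy_risk(service_incident_rate, touches_config, hour):
    """Toy score in [0, 1] from per-service history, change type, timing."""
    score = service_incident_rate            # base: past incidents per deploy
    score += 0.2 if touches_config else 0.0  # config changes weighted riskier
    score += 0.15 if hour in RISKY_HOURS else 0.0
    return min(score, 1.0)

def needs_review(score, threshold=0.5):
    """Flag, don't block: high-risk deploys get extra eyes before shipping."""
    return score >= threshold
```

A config change on an incident-prone service during peak hours crosses the threshold and gets flagged for review; a routine deploy on a quiet service mid-morning sails through.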

Incident management used to be a human process with some tooling. Now it's an automated system with human oversight. The humans are still essential — they make the calls that matter. But they're no longer doing the work that doesn't.


If you're building something similar, or if you think we're doing it wrong — reach out. We're always happy to compare notes.

#incident-management #ai-agents #sre #incident-response #post-mortem