Tech Blog - Rohlik Group

The job nobody has time for

Running an AI agent in production means somebody on the team spends Monday morning reading transcripts. Skim the bad conversations, guess at patterns, pick a fix, hope it doesn't regress five other things. By the time a real insight surfaces, a week of signal is already stale.

We built a loop that does this overnight. A handful of agents read yesterday's conversations, cluster failures, root-cause the non-obvious ones, and open merge requests with prompt and flow fixes by morning. Each MR links to an eval run showing the change doesn't regress anything else. An engineer reviews and approves. The fixes ship before lunch.

Maia is our AI assistant handling voice, phone, and chat for Rohlik customers across CZ, DE, AT, HU, and RO. The loop runs every night.

What the loop actually does

Maia talks to a lot of customers. Every conversation where we have recording permission gets post-processed and anonymised: transcripts stored with PII redacted, metadata tagged, CSAT attached where we have it, tool calls logged. That data is the basis for everything the loop does.

The loop has four AI-driven stages, followed by three steps that close the cycle with an engineer in the loop:

1. Triage       - read every bad conversation, analyse what went wrong
2. Root-cause   - dig into code and flows for non-obvious failures
3. Patterns     - cluster across conversations, compare day over day
4. Proposals    - turn findings into ranked, actionable fixes
5. Wait for engineers to approve
6. Implement the fixes
7. The next day, the AI agent knows about the fixes and can focus its investigation on how well they worked

Stage 1: Triage every bad conversation

Every 24 hours we pull the conversations that went badly: low CSAT, customer frustration signals, broken tool calls. For each of those, an agent reads the full transcript, the tool calls, and the context the model was working with, then writes a focused analysis: what the customer actually wanted, what Maia did, where the two diverged, and what the proximate failure was.
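The selection step can be sketched roughly as below. This is an illustration, not the production code: the `Conversation` and `ToolCall` shapes, the `CSAT_THRESHOLD` value, and the idea that frustration arrives as a pre-computed count are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str
    ok: bool  # False if the call errored or timed out (hypothetical field)


@dataclass
class Conversation:
    id: str
    csat: Optional[int] = None       # None when the customer left no rating
    frustration_signals: int = 0     # assumed to come from an upstream tagger
    tool_calls: list[ToolCall] = field(default_factory=list)


CSAT_THRESHOLD = 2  # assumption: ratings at or below this count as "low"


def triage_reasons(conv: Conversation) -> list[str]:
    """Return the reasons a conversation counts as 'bad' (empty list if none)."""
    reasons = []
    if conv.csat is not None and conv.csat <= CSAT_THRESHOLD:
        reasons.append("low_csat")
    if conv.frustration_signals > 0:
        reasons.append("frustration")
    if any(not call.ok for call in conv.tool_calls):
        reasons.append("broken_tool_call")
    return reasons


def select_for_triage(conversations: list[Conversation]) -> dict[str, list[str]]:
    """Map conversation id -> reasons, keeping only flagged conversations."""
    selected = {}
    for conv in conversations:
        reasons = triage_reasons(conv)
        if reasons:
            selected[conv.id] = reasons
    return selected
```

Keeping the reasons alongside each id means the triage agent can be told why a conversation was flagged, not only that it was.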

Stage 2: Root-cause with a deep-thinking agent

For the worst cases, or anything Stage 1 flags as non-obvious, a second agent takes over. This one has access to the source code and runs on the Claude Agent SDK harness. It goes through the prompts, reads flow definitions, greps through tool implementations, fetches additional context.

The output is qualitatively different from Stage 1. Instead of "Maia gave wrong info about the delivery window," you get a detailed analysis that references concrete files and lines of code.

It reads like what an engineer would write after 30 minutes of digging, except that we can run it on every flagged conversation.

A lot of it is noise - issues caused by factors we cannot reliably control (background noise, latency glitches, STT failures, technical limitations, and so on) - but some of it is relevant, and that part pays for the rest.

Stage 3: Find the patterns

A single bad conversation is noise. Ten with the same shape is a pattern. A third agent zooms out, clusters today's findings by failure mode, and compares against yesterday and the days before. Is this failure mode new? Growing? Always there and we never noticed? Did a prompt change two days ago regress something evals didn't catch?
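The day-over-day comparison can be illustrated with a small sketch. It assumes the clustering has already assigned each finding a failure-mode label, so each day reduces to a count per label. The labels, the `growth_factor` threshold, and the three trend names are hypothetical choices for the example.

```python
from collections import Counter


def classify_trends(
    today: Counter,
    history: list[Counter],
    growth_factor: float = 1.5,  # assumed threshold for "growing"
) -> dict[str, str]:
    """Label each of today's failure modes as new, growing, or persistent."""
    trends = {}
    for mode, count in today.items():
        past = [day.get(mode, 0) for day in history]
        if not any(past):
            # Never seen in the look-back window (or no history at all).
            trends[mode] = "new"
            continue
        baseline = sum(past) / len(past)  # mean daily count over the window
        if count >= growth_factor * baseline:
            trends[mode] = "growing"
        else:
            trends[mode] = "persistent"
    return trends
```

A mode that comes back as "new" shortly after a prompt change is a natural candidate for the regression question raised above.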

Stage 4: Proposals ranked by ROI

The final stage turns findings into specific proposals:

Change line X of prompt Y to clarify Z.
Add a flow for refund requests outside working hours.
Tighten this tool's description so the model stops calling it when the user means something else.
Add an eval case for the timezone scenario we just root-caused.

Each proposal carries an estimated impact (how many conversations this week would have gone better if it had already shipped) and an estimated effort required. The list is sorted quick-win first: small, safe, high-leverage edits at the top; strategic bets that need real human judgment at the bottom.
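One way to express the quick-win-first ordering is sketched below. The `Proposal` fields, the impact-per-effort score, and the `needs_human_judgment` flag are assumptions for illustration. The post does not spell out the exact ranking formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    title: str
    impacted_conversations: int   # estimated conversations that would have gone better
    effort_hours: float           # estimated effort (hypothetical unit)
    needs_human_judgment: bool = False  # strategic bets sink to the bottom


MIN_EFFORT_HOURS = 0.25  # floor so near-zero estimates do not dominate the score


def rank_quick_wins_first(proposals: list[Proposal]) -> list[Proposal]:
    """Sort by: routine edits before strategic bets, then impact per hour, descending."""
    def sort_key(p: Proposal):
        score = p.impacted_conversations / max(p.effort_hours, MIN_EFFORT_HOURS)
        return (p.needs_human_judgment, -score)

    return sorted(proposals, key=sort_key)
```

With this key, a half-hour prompt edit that would have helped 40 conversations outranks an eight-hour new flow that would have helped 60, which matches the "small, safe, high-leverage edits at the top" ordering.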

Closing the loop

A report plus ranked proposals isn't enough - the human is still the bottleneck between insight and fix. A fourth agent picks up the top-ranked proposals and implements them. It opens a merge request with the fix, adds or updates eval cases and other tests, and writes a description linking back to the conversations that motivated the change.

An engineer reviews the MR like any other. Is the wording right? Does the new eval actually test the thing? Is the reasoning sound? Approve or reject.

If approved, it ships. Every change runs through the automated eval suite before merge, which catches the obvious failure mode where an AI "fix" silently breaks five other things. The evals are what make this safe.
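The regression gate can be sketched as a comparison of two eval runs: one on the main branch and one on the MR branch. The dictionary shape (eval case name mapped to pass or fail) and the function names are assumptions. A real eval suite will report richer results than a boolean.

```python
def find_regressions(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
) -> list[str]:
    """Return eval cases that passed on the baseline but not on the candidate.

    A case missing from the candidate run is treated as a regression,
    so an MR cannot pass the gate by silently dropping a test.
    """
    return sorted(
        case
        for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )


def may_merge(baseline: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """Allow the merge only if nothing that used to pass now fails."""
    return not find_regressions(baseline, candidate)
```

New eval cases added by the MR appear only in the candidate run and do not block the merge. Cases that were already failing on the baseline are ignored by the gate rather than held against the change.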

Why it works, and where it doesn't

Four narrow jobs, not one model doing everything. Each agent has a specific input, a specific output and fresh context.

Observability and eval infrastructure came first. Without trustworthy conversation data, CSAT signals, source-code access, and a real eval suite, every stage would produce garbage.

The environment was made agent-friendly. Prompts live in git, not inline in Python. Repos carry READMEs and design docs an agent must read. Internal systems sit behind MCP. CI treats an agent's MR like any other. Agents have access to the same systems as our engineers. That's what lets Stage 4 produce MRs an engineer can approve rather than rewrite. For the occasional fix that spans multiple repositories, the agent hands off to Devin, which handles cross-repo work well.

Where it doesn't work: the loop is good at small, local fixes. It won't redesign a subsystem or rethink a flow from scratch. But it can flag recurring patterns that hint a redesign is overdue. Bigger changes stay with engineers.

AI is not great at writing prompts. On its own, a model proposing a prompt change tends to over-specify - it writes instructions tailored so closely to the one failing conversation that the prompt bloats, drifts off-topic, and starts to contradict itself. We had to build skills that run as subagents on top of whatever Stage 4 first produces: they simplify the instructions, trim excess scope, and use well-shaped examples as references rather than rigid rules. The best prompt changes usually come out noticeably shorter than what the first-pass agent wrote.

What's next

The loop evolves with the models running it. Each new model generation sharpens triage, root-cause, and prompt-writing - which means more of the work moves from human review toward auto-approval.

The safest changes (new eval cases, tool description tweaks) already ship without a reviewer. As the agents get better at judging risk and the eval suite covers more cases, that set grows. Fewer MRs need an engineer reading them, and the loop ships faster without losing the safety net.
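An auto-approval policy of this kind can be sketched as an allow-list of change categories combined with the eval result. The category names and the `MergeRequest` shape are hypothetical. Only the two example categories come from the paragraph above.

```python
from dataclasses import dataclass

# Assumed category labels for the two change types named above.
AUTO_APPROVE_CATEGORIES = frozenset({"eval_case", "tool_description"})


@dataclass(frozen=True)
class MergeRequest:
    category: str       # e.g. "eval_case", "tool_description", "prompt", "flow"
    evals_passed: bool  # result of the automated eval suite for this MR


def needs_human_review(
    mr: MergeRequest,
    auto_approve: frozenset = AUTO_APPROVE_CATEGORIES,
) -> bool:
    """An MR skips review only if its category is allow-listed AND evals passed."""
    return not (mr.category in auto_approve and mr.evals_passed)
```

Widening the set is then a one-line change to the allow-list, which makes the decision to trust a new category explicit and reviewable.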

The hard part isn't capability - it's design. Auto-approval only works if regressions get caught before they reach customers. Every time we widen the auto-approve set, the evals for that category have to be strong enough that we'd trust them without a human reading each diff. Get that wrong and the cost is customer experience, not just engineer time.