
Context as Infrastructure: How We Stopped Burying Business Logic in System Prompts

We were building AI agents for real marketing workflows when we noticed a pattern: every production failure traced back to a system prompt, not the model.

Danila Kossygin
AI Engineer
April 20, 2026 · 8 min read

The Monolith We Kept Editing

There's a moment every team building LLM agents eventually hits: the system prompt becomes a place nobody wants to touch. Hundreds of lines, business rules mixed with formatting instructions, workflow procedures sandwiched between constraint declarations. You make a small change to the recipe creation flow, redeploy, and something in the product discovery workflow quietly breaks. You have no idea why — all the logic lives in the same text.

We hit this point while building agents for our marketing workflows at Rohlik. Our initial setup was a single function that assembled the system prompt: it pulled together role definitions, workflow rules, business constraints, and procedural guides — concatenated into one string delivered to the model at the start of every run. It worked for a while.

The problems showed up at the seams. When a workflow failed, the failure was unattributable. Was it the model? A tool call gone wrong? A constraint buried in paragraph four of the prompt? There was no way to isolate the variable. Business logic encoded in natural language has no interface contract — you can't version it meaningfully, you can't test it in isolation, and as the number of supported workflows grows, the prompt grows with it into a single blob that requires deep familiarity to modify safely.

The deeper problem: it wasn't just business logic that got mixed up. Tools, persistent corrections, behavioral rules, workflow procedures — all context, all treated the same way, all poured into one undifferentiated string. That's what we needed to fix.

Context Has More Parts Than You Think

Context isn't a prompt. It's a set of managed layers, each with a different owner, a different change rate, and a different mechanism for keeping it current. Once we separated them, the system became tractable.

The identity layer. The system prompt is a policy document — about 27 lines in our production setup. It defines non-negotiable behavioral rules: delegation boundaries, market-specific invariants, recovery behavior. It doesn't carry business logic or workflow procedures. It changes on release cycles, and nobody edits it casually. That's intentional — it's the frame everything else is interpreted through.
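For illustration, a policy-style identity prompt might read something like the sketch below. Every rule here is hypothetical; the actual 27-line production prompt is not reproduced in this post.

```
You are a marketing operations agent. These rules are non-negotiable:

- Delegation: recipe work belongs to the chef agent; never edit recipes yourself.
- Market invariants: never alter prices, availability, or market-specific legal copy.
- Recovery: if a tool call fails, retry once, then report the failure; never invent a result.
- Procedures live in skill files. Consult them before acting; they never override these rules.
```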

The skills layer. Procedural knowledge lives as files. Each workflow procedure is a standalone markdown file with a structured metadata header declaring its name, description, and scope. The recipe creation procedure is a file. The product discovery checklist is a file. They live in version control, have a clear owner, and can be edited without touching agent code.

skills/
  recipe-editor/SKILL.md      # how to create and edit recipes
  product-discovery/SKILL.md  # how to find and link products
  microsite-editor/SKILL.md   # how to manage landing pages

Here's what one of those files looks like:

---
name: recipe-editor
description: Use for recipe creation, editing, and publishing workflows.
agents: [chef]
---

# Recipe Editor

## Before any write operation
1. Call `recipe_get` to fetch the current state
2. Check author permissions before modifying

## Creating a new recipe
1. Collect all required fields: title, ingredients, steps, metadata
2. Call `recipe_create` with the complete payload
3. Confirm creation and return the recipe URL

The agent doesn't execute this — it reads it as part of its context. It's the difference between embedding business logic into a language model and handing it a manual to consult. Change rate: days, not releases. Owner: the domain team.
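To make the mechanism concrete, here is a minimal sketch of how a skills middleware might parse those metadata headers and select files for an agent's context. The directory layout matches the tree above, but `parse_skill` and `load_skills` are illustrative names, not the production implementation:

```python
from pathlib import Path

SKILLS_DIR = Path("skills")

def parse_skill(text: str) -> dict:
    """Split a SKILL.md file into its metadata header and body."""
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return {"meta": meta, "body": body.strip()}

def load_skills(agent: str) -> list[dict]:
    """Collect every skill whose `agents` field names this agent.

    Substring matching on the raw `[chef]` value is a simplification;
    a real loader would parse the list properly.
    """
    skills = []
    for path in SKILLS_DIR.glob("*/SKILL.md"):
        skill = parse_skill(path.read_text())
        if agent in skill["meta"].get("agents", ""):
            skills.append(skill)
    return skills
```

The point of the sketch is the boundary: the agent code never contains a procedure; it only knows how to find and read the files that do.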

The memory layer. Agents accumulate useful knowledge across sessions — a team's formatting preferences, a market's constraints, a recurring edge case that tripped up a workflow last month. Memory stores this separately from the thread so it survives across invocations. The agent can write to it; the infrastructure reads it back at the start of each run. It's not session state, which is ephemeral — it's a persistent record that follows the agent across time.
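In sketch form, that contract could be as simple as a JSON file per agent: the infrastructure reads it at the start of each run, the agent appends to it. The class below is illustrative; the real store's backend and schema aren't described here.

```python
import json
from pathlib import Path

class AgentMemory:
    """Persistent notes that survive across runs, unlike ephemeral session state."""

    def __init__(self, agent: str, root: Path = Path("memory")):
        self.path = root / f"{agent}.json"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def read(self) -> list[str]:
        """Loaded by the infrastructure at the start of each run."""
        if self.path.exists():
            return json.loads(self.path.read_text())
        return []

    def append(self, note: str) -> None:
        """The agent writes durable observations here."""
        notes = self.read()
        notes.append(note)
        self.path.write_text(json.dumps(notes, indent=2))
```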

The tools layer. MCP tool contracts are a context layer too. The tool's name, description, and input schema aren't just plumbing — they're instructions the model reads before deciding what to call. Naming conventions encode intent at a glance: `recipe_get` signals a read-only exploration; `microsite_replace` signals a mutation. Tool descriptions carry operational guidance: when to use the tool, what ordering constraints exist, what makes a particular call risky. Output formatters compress high-volume results before the model sees them — a dense product record becomes a readable summary — so the model can reason about results rather than scan them.
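As a sketch of that idea, a tool contract and an output formatter might look like this. The field names, description text, and formatter are illustrative, not the actual MCP server definitions:

```python
def product_summary(record: dict) -> str:
    """Compress a dense product record into one line the model can scan."""
    stock = "in stock" if record["stock"] > 0 else "out of stock"
    return f"{record['id']}: {record['name']}, {record['price']} {record['currency']}, {stock}"

# Hypothetical contract: the name signals a read-only call, and the
# description carries the ordering constraint the model must respect.
RECIPE_GET = {
    "name": "recipe_get",
    "description": (
        "Fetch the current state of a recipe. Call this before any write "
        "operation; never assume a previously fetched state is still current."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {"recipe_id": {"type": "string"}},
        "required": ["recipe_id"],
    },
}
```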

This is context engineering too, just at the tool boundary. Vague tool descriptions, naming conventions that don't signal intent, outputs that dump raw data into the context window: these are context engineering failures with the same consequences as a bloated system prompt.

There's also a governance boundary between tools and skills worth naming. MCP tools are system capabilities — stable, admin-reviewed; agents can invoke them but cannot define them. Skills are team-authored procedures — fast revision cycle, agents can propose new ones. One track absorbs volatility from business procedure changes; the other preserves the operational contract with external systems.

How the Layers Come Together

Prompt assembly isn't arbitrary. Before every run, the layers arrive in a specific order:

# Context assembly order
1. Identity          — system prompt policy (always present)
2. Skills usage      — "consult your skill files before acting"
3. Skills authoring  — "you may propose new skills if needed"
4. Memory            — what persisted from prior sessions
5. Self-verification — integrity check instruction (always last)

The ordering is intentional. Models pay disproportionate attention to content at the beginning and end of the context window. Identity comes first — it frames everything that follows. Self-verification comes last — it stays salient right before the model produces output.

Each layer only appears when its corresponding middleware is active. A simple read-only run doesn't inject memory or skills authoring — no dead weight, no cognitive overhead. The context is assembled to match the task.

The before and after in abstract terms:

# Before: prompt builder returns a monolith
def build_prompt(workflow):
    return base_rules + workflow_rules + constraints + procedures

# After: layered assembly; each layer appears only when its middleware is active
def build_prompt(active_middleware):
    parts = [identity]                      # always first
    if "skills" in active_middleware:
        parts += [skills_usage, skills_authoring]
    if "memory" in active_middleware:
        parts.append(memory)
    parts.append(self_verification)         # always last
    return "\n\n".join(parts)
# Skill file contents are loaded separately by middleware at runtime

The V0 to V3 Journey

We validated this through a deliberate ablation: four versions, same model, same tools, same 54 real marketing tasks drawn from production. Each version changed exactly one thing about how context was assembled.

V0 — all logic inline, one static prompt builder. Works for demos, breaks under real workflows with any meaningful complexity.

V1 — static rule fragments added per workflow. Still concatenation, just larger. The smell got worse before it got better.

V2 — introduced retrieval guidance: the agent could fetch context dynamically via tools. Meaningfully better, but the core logic still lived in prompt strings.

V3 — skills middleware loads external procedure files at runtime. The prompt becomes a policy document. The skills carry procedure. All four context layers active.
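In sketch form, the harness held everything constant except the context-assembly strategy. The `run_agent` callable, the version builders, and the task format below are hypothetical stand-ins:

```python
# Hypothetical ablation harness: same model, same tools, same task set;
# only the context-assembly strategy differs between versions.
def run_ablation(tasks, versions, run_agent):
    """Return the pass rate of each context version on a fixed task set."""
    results = {}
    for name, build_context in versions.items():
        passed = sum(run_agent(context=build_context(task), task=task)
                     for task in tasks)
        results[name] = passed / len(tasks)
    return results
```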

Reliability improved across all versions, but more importantly, failures became attributable. When something went wrong in recipe creation, it traced to the recipe skill — not "somewhere in the 400-line prompt."

What We Learned

Skills are a unit of deployment. When a workflow changes — new API contract, updated validation rule, different checklist — you edit a skill file and ship it. No agent code changes. No prompt archaeology. The change has a clear boundary and a clear owner.

Seams make debugging tractable. With monolithic prompts, failures are opaque. With modular context layers, failures have a surface area. You can read a diff between skill versions the same way you'd read a code diff — and reason about what changed and why.

The model didn't change — our engineering did. V0 and V3 run on the same LLM. Every reliability gain came from how context was structured and delivered, not from prompt tricks or model upgrades. This reframes the core question of agent quality: less "find a better model," more "build better infrastructure around it."

Skills belong to the team, not the codebase. Because a skill is just a markdown file, the barrier to contributing is low. When we handed this to the marketing team, they started writing and updating skills themselves — adjusting checklists, adding workflow steps, refining procedures — without filing a ticket or waiting for an engineer. The agent's capabilities expanded through collaboration, not through code. That's a fundamentally different model from how most teams think about AI tooling.

Tools are context too. The discipline that helped with skills — clear naming, scoped descriptions, compressed outputs — applies equally at the tool boundary, and skipping it there carries the same downstream cost as a bloated prompt does elsewhere. The tool boundary is not a free pass.

Layer separation makes ownership tractable. When context is one blob, everyone owns it and nobody does. When it's four layers, each has a clear steward: platform engineers own the identity layer, domain teams own skills, the agent manages its memory, and API contract owners govern the tool layer. Ownership follows structure.

The infrastructure metaphor isn't aesthetic — it's the correct abstraction. Context has lifecycle, versioning, scope, and consumers. The moment you start treating it accordingly, building reliable agents starts to feel a lot more like building reliable software.

#ai-agents #llm #infrastructure #context-engineering