← Back to Writing

Why AI Agents Break in Production: Failure Modes I've Hit and How I Debug Them

My AI agents fail in predictable ways: context collapse, prompt drift, tool misuse, and silent delegation loops. Here's each failure mode, what caused it, and the debugging steps I use now.

Incident-room trace showing a dropped packet, dead node, and split execution path.

Why AI Agents Break in Production: Failure Modes I've Hit and How I Debug Them

AI agents work great in demos. Ask a question, get an answer, maybe call a tool, return a result. The flow looks clean.

Then you put agents into a real system that runs every day, handles different domains, and delegates between specialists. Things break. Not randomly. In patterns.

This post catalogs the failure modes I've hit running a multi-agent org in production, and the debugging approach I've settled on. If you're building agents that do real work, you'll hit these too.

Failure mode 1: Context collapse

The most common production failure. The agent starts a task, picks up context from its session, and somewhere around turn 8 or 9 the context window gets noisy enough that the agent loses the thread.

Symptoms:

  • The agent repeats a step it already completed
  • It asks for information it was given 3 turns ago
  • It starts solving a different (but related) problem than the one assigned
  • The output quality drops abruptly, not gradually

What causes it: The context window fills with earlier turns, tool outputs, and intermediate reasoning. At some point, the signal-to-noise ratio flips. The model starts weighting recent noise over earlier instructions.

How I debug it:

  1. Pull the full conversation log from the session
  2. Count the tokens (rough character count divided by 4 works for estimation)
  3. Find where the agent's first correct reference to the task drops off
  4. Check if a tool response or intermediate output inflated the context beyond what the system instruction can override

The fix is usually one of three things:

  • Move state out of context and into files. Write task state to a file after each step. Tell the agent to read the file, not rely on session memory. This is the single most effective change I've made.
  • Split the task into subtasks. If a single task needs 10+ turns, it's too big for one session. Delegate to a subagent with a clean context and a narrow scope.
  • Summarize earlier context. Before the context gets too long, have the agent write a 5-line summary of what's been done and what's left, then use that summary as the anchor for the next phase.

The agent memory post covers the memory stack I use to prevent this. The short version: files are more reliable than context windows.

Failure mode 2: Prompt drift

Over multiple sessions, the agent's behavior slowly shifts away from what its system prompt specifies.

Symptoms:

  • An agent that was concise starts returning verbose responses
  • Domain boundaries blur. The finance agent starts giving career advice
  • The agent's tone changes. Formal becomes casual, or vice versa
  • An agent stops following its escalation rules

What causes it: The model adapts to recent interactions. If the last few sessions all had the agent stretching beyond its scope, the model learns that pattern. Prompt drift is the model doing what it was trained to do (adapt to context) in a direction you didn't intend.

How I debug it:

  1. Compare the agent's last 5 session outputs against its system prompt
  2. Identify the first session where the drift appeared
  3. Check what inputs triggered the drift. Usually a task that sat at the boundary of the agent's scope, which the model handled instead of escalating

The fix:

  • Tighten the system prompt's boundary language. "Handle finance tasks" drifts. "Handle finance tasks only. Route career questions to Vector, content requests to Quill" does not.
  • Add a scope check step. Before the agent starts a task, have it confirm: "Is this in my scope? If not, escalate." This adds one extra turn but catches drift early.
  • Periodic prompt resets. If an agent has been running for many sessions, re-paste the full system prompt. Some agent frameworks do this automatically. In my setup, I do it manually when I notice drift.

Failure mode 3: Tool misuse

The agent calls a tool incorrectly, calls the wrong tool, or calls a tool when it should just reason.

Symptoms:

  • API errors from malformed parameters
  • The agent calls a search tool for information it already has in context
  • It calls the wrong tool for the task (email tool instead of file-write tool)
  • It retries a failed tool call with the same parameters, expecting a different result

What causes it: The model's tool-use behavior depends on how tools are described in the prompt. If the tool descriptions are vague, the model guesses. If the descriptions are too long, the model skips reading them. If the tool names are ambiguous (e.g., "send" for both email and messages), the model picks the wrong one.

How I debug it:

  1. Pull the tool call and its parameters from the session log
  2. Check the tool definition the agent was given. Did it match the call?
  3. Check if the agent had enough information to call the tool correctly
  4. Check if the tool description was clear enough to distinguish it from other tools

The fix:

  • Shorten tool names and descriptions. "web_search" is better than "search_the_internet_for_information". One line of description beats three.
  • Add parameter examples. Show the model what a valid call looks like. This cuts the error rate more than any other single change.
  • Separate ambiguous tools. If two tools do similar things, rename them to make the difference obvious. "search_web" vs "search_vault" is clearer than "search" vs "search_local".
  • Validate before executing. Run the tool parameters through a validation step before calling the actual API. If parameters are wrong, return a descriptive error, not a generic one.

Failure mode 4: Silent delegation loops

The orchestrator delegates to a specialist. The specialist delegates back to the orchestrator. Neither one completes the task.

Symptoms:

  • A task ping-pongs between agents with no progress
  • The final output is generic ("I've asked the relevant agent to handle this") instead of specific
  • The task shows as "in progress" indefinitely

What causes it: Ambiguous task descriptions, overlapping scopes, or an orchestrator that delegates instead of deciding.

How I debug it:

  1. Trace the delegation chain from the session logs
  2. Find where the loop starts. Which agent sent the task back, and why?
  3. Check if the original task description was specific enough for either agent to own it
  4. Check if the agents' scopes overlap on this type of task

The fix:

  • Make delegation one-directional. The orchestrator delegates down. Specialists don't delegate back up. If a specialist can't handle the task, it returns a status saying so. The orchestrator decides the next step, not the specialist.
  • Add an ownership rule. "If two agents could handle this, the one whose scope is narrower takes it." Narrow scope wins because narrower agents are usually more specific.
  • Set a max delegation depth. If a task has been delegated more than 2 times, the orchestrator handles it directly. This prevents infinite chains.
  • Write clearer task descriptions. "Handle the finance stuff" causes loops. "Reconcile this month's Stripe payouts against the ledger" gives a clear owner.

Failure mode 5: Rate limit and cost blowouts

The agent makes too many API calls, hits rate limits, or burns through tokens faster than expected.

Symptoms:

  • API 429 errors in the middle of a task
  • A single session costs 10x what the average session costs
  • The agent retries rapidly, making the rate limit worse

What causes it: The agent enters a retry loop. A tool call fails, the agent retries. The retry also fails. The agent retries with slight variations. Each retry costs tokens and API calls. Without a backoff or budget cap, this spirals.

How I debug it:

  1. Pull the token usage for the session
  2. Count tool calls. Look for the same endpoint called 5+ times
  3. Check if the agent changed its parameters between retries, or just repeated the same call

The fix:

  • Add retry limits. 3 retries max per tool call. After that, return an error and let the agent reason about alternatives.
  • Add exponential backoff. The agent should wait longer between retries, not shorter.
  • Set a session token budget. If a single session exceeds 2x the average, stop it. This is a blunt tool. It prevents runaway costs.
  • Log token usage per task type. This tells you which kinds of tasks are expensive, so you can redesign them.

My cost breakdown post has specific numbers for what a multi-agent org costs. The short version: most sessions are cheap. The expensive ones are almost always retry loops or tasks that should have been split.

How I debug agents now

After hitting these patterns enough times, I built a debugging sequence I run through every time an agent fails:

  1. Check the session log. Read the full conversation. Most failures are visible in the log.
  2. Check the token count. If the context is near the limit, context collapse is the likely cause.
  3. Check the tool calls. Look for retries, wrong tools, or malformed parameters.
  4. Check the delegation chain. If the task was handed off, trace where it went and whether it came back.
  5. Check the system prompt. Read it fresh. Is the scope clear? Is the boundary language tight? If you're skimming it, the model might be skimming it too.

This sequence catches about 90% of failures. The remaining 10% are usually model-level behavior issues (the model generates bad output for unclear reasons). Those get fixed by model upgrades, not by prompt engineering.

What production readiness actually means

Most "production-ready agent" checklists focus on infrastructure: rate limiting, error handling, logging.

Those matter. But the failures that actually hurt are the ones in the agent's reasoning. Context collapse. Prompt drift. Delegation loops. Tool misuse.

These are not infrastructure problems. They are design problems. The model will do what the prompt and the context tell it to do. When the prompt is vague or the context is noisy, the model will fill in the gaps with its training data. That's when things break.

The most reliable agent system I've built is the one where:

  • each agent has a narrow, explicit scope
  • state lives in files, not in context
  • tool descriptions are short and precise
  • delegation flows one direction
  • there are retry limits and token budgets
  • the debug sequence takes 5 minutes, not 50

That system isn't fancy. It works because the failure surface is small, and when something breaks, I know where to look.

Read next: AI Agent Workflows That Actually Ship and What It Actually Costs to Run AI Agents.

Some links on this site may be affiliate links. I only recommend tools I use. If you click through and make a purchase, I may earn a small commission at no extra cost to you.