Monitoring AI Agents in Production: What to Watch
Agent observability has to start at the task layer. I watch latency, uptime, error rate, CPU, memory, and request count too, but they don't tell me whether an agent completed useful work.
An agent can return 200s all day and still fail the job. A content agent like Quill can write a draft, call the right tools, and mark the task completed. Then the tool trace shows bloated context, the human reviewer rejects an invented customer example, and the task stays review-blocked instead of released. The run succeeded. The work did not.
That is the monitoring gap. A normal service fails when it crashes, times out, or returns bad data. An agent fails when it misunderstands the job, takes an unsafe action, loops through tools, loses context, delegates to the wrong worker, or produces output that looks fine until a human reads it.
This is for teams running agents that write, review, reconcile, update, or publish real work. If an agent can change a file, send a message, open a pull request, update a record, or spend money, request metrics are not enough.
Start with the unit that matters: the task
I don't start with requests. I start with tasks.
A request is a model call. A task is the thing the system was supposed to accomplish.
Write the draft. Reconcile the ledger. Review the pull request. Summarize the incident. Update the vault note. Those are the units that matter to the user.
For every task, I want a row with status plus lifecycle events:
- task id
- agent name
- assignee role
- start time and end time
- final status
- human-visible summary
- changed files or external side effects
- model used
- token cost
- tool call count
- retry count
- parent task id if it was delegated
This turns monitoring from request accounting into operational history. I can answer the questions that matter: Which agents finish their work? Which tasks time out? Which task types get expensive? Which roles require human correction?
The task record needs more than success or failure. My minimum set is: completed, approved, blocked, cancelled, crashed, timed out, rejected, and released. Completed means the agent produced a deliverable. Approved means a human accepted it. Released means the deliverable left the system: sent, published, merged, charged, or written back to a production record. Those are not the same state, and mixing them is how bad content, bad code, or bad financial actions slip through.
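A minimal sketch of keeping those states apart, in Python. The class names, the `transition` helper, and the rule that a release must follow an approval are my illustration of the boundary, not a fixed schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TaskStatus(str, Enum):
    COMPLETED = "completed"    # the agent produced a deliverable
    APPROVED = "approved"      # a human accepted it
    RELEASED = "released"      # the deliverable left the system
    BLOCKED = "blocked"
    CANCELLED = "cancelled"
    CRASHED = "crashed"
    TIMED_OUT = "timed_out"
    REJECTED = "rejected"


@dataclass
class TaskRecord:
    task_id: str
    agent: str
    status: TaskStatus
    parent_task_id: Optional[str] = None
    history: list = field(default_factory=list)

    def transition(self, new_status: TaskStatus) -> None:
        # Assumed rule: nothing is released unless it was approved first.
        if new_status is TaskStatus.RELEASED and self.status is not TaskStatus.APPROVED:
            raise ValueError("cannot release a task that was not approved")
        self.history.append(self.status)
        self.status = new_status
```

With this shape, a task that is only completed cannot jump straight to released; the attempt raises instead of silently shipping.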
Watch outcomes before internals
The first dashboard should be boring:
- completion rate by agent
- timeout rate by task type
- blocked rate by task type
- median task duration
- p95 task duration
- human rejection rate
- rework rate
These numbers tell me whether the system works before I ask why.
A high timeout rate usually means the task is too large for one run, not that the model is slow. A high blocked rate means the task descriptions lack enough context, or the agent's escalation rules are too cautious. A high rework rate means the agent can finish the task mechanically but misses the quality bar.
The human rejection rate is the one I care about most. Agents are good at producing plausible artifacts. Monitoring has to catch the gap between "the run completed" and "the output was accepted." If humans keep correcting the same agent, the agent is not healthy.
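Here is a rough way to compute two of those numbers from task rows. The input shape (an `agent` name, a `completed` flag, and an optional `human_action`) is an assumption for the sketch; adapt it to whatever your task log actually records.

```python
from collections import defaultdict


def outcome_rates(tasks):
    """Completion and human rejection rates per agent.

    tasks: iterable of dicts with 'agent', 'completed' (bool), and an
    optional 'human_action' of 'approved' or 'rejected'.
    """
    counts = defaultdict(lambda: {"tasks": 0, "completed": 0, "reviewed": 0, "rejected": 0})
    for t in tasks:
        c = counts[t["agent"]]
        c["tasks"] += 1
        if t["completed"]:
            c["completed"] += 1
        if t.get("human_action") in ("approved", "rejected"):
            c["reviewed"] += 1
            if t["human_action"] == "rejected":
                c["rejected"] += 1
    report = {}
    for agent, c in counts.items():
        report[agent] = {
            "completion_rate": c["completed"] / c["tasks"],
            # Rejection rate is measured over reviewed tasks only.
            "rejection_rate": c["rejected"] / c["reviewed"] if c["reviewed"] else None,
        }
    return report
```

Dividing rejections by reviewed tasks rather than all tasks is a deliberate choice here: an agent whose output is rarely reviewed should not look healthy by default.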
Tool calls are where agent failures become visible
Tool traces are the closest thing agents have to a syscall log. When an agent goes wrong, the tool sequence usually shows it first.
I track:
- tools called per task
- arguments passed to each tool
- tool latency
- tool error type
- retry count per tool
- repeated calls with identical parameters
- side-effecting calls
Repeated calls with identical parameters are the smell. If an agent calls the same tool or endpoint five times with the same arguments, it is not debugging. It is looping. The fix is usually a retry cap, a clearer error message, or a tool contract that returns structured failure data the model can reason about.
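One way to detect that loop from the tool log, sketched in Python. The event fields (`task_id`, `tool`, `args`) and the default threshold of five are assumptions taken from the example above, not a standard.

```python
import json
from collections import Counter


def find_loops(tool_events, threshold=5):
    """Return (task_id, tool, canonical_args) keys seen at least `threshold` times."""
    counts = Counter()
    for e in tool_events:
        # Sorting keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
        canonical_args = json.dumps(e.get("args", {}), sort_keys=True)
        counts[(e["task_id"], e["tool"], canonical_args)] += 1
    return [key for key, n in counts.items() if n >= threshold]
```

In practice I would run this over redacted arguments, so the comparison works on the stored shape rather than on raw secrets.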
Tool choice also exposes prompt problems. If the content agent reaches for git before writing the file, the prompt order is wrong. If the finance agent calls a web search tool before reading the ledger, the default lookup path is wrong. If two tools have similar names and the agent keeps picking the wrong one, the tool surface is bad.
Tool arguments need redaction before they hit logs. Store enough shape to debug the call, but strip secrets, credentials, raw customer data, long prompt bodies, and anything the agent read from private files unless the monitoring store is approved for that data class.
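A minimal redaction pass might look like the following. The key list and the length limit are placeholders I picked for illustration; key-based matching will not catch customer data sitting under an innocent field name, so treat this as a floor, not a guarantee.

```python
# Hypothetical deny-list and size limit; tune both for your own data classes.
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "authorization"}
MAX_STRING_LEN = 200


def redact(value, key=None):
    """Recursively strip sensitive values and long bodies, keeping the call's shape."""
    if isinstance(key, str) and key.lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str) and len(value) > MAX_STRING_LEN:
        # Keep the length so the log still shows how large the payload was.
        return f"[TRUNCATED str len={len(value)}]"
    return value
```

The point of keeping keys and lengths is that the log still shows what kind of call was made and how big it was, which is usually enough to debug it.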
I want every side-effecting tool call logged separately. File writes, git pushes, messages, database writes, ticket updates, and publishes need an audit trail. A read-only mistake can leak context, burn quota, or poison the next turn. A write mistake changes the world. Both need logs. Writes need stricter gates.
Financial actions need their own approval class. Trades, transfers, payments, invoices, payroll, and bank changes should never hide inside generic "external writes." A task can recommend, draft, or reconcile, but execution should require a separately logged approval event tied to the exact amount, account, counterparty, and timestamp.
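A sketch of that binding check, assuming the approval event carries fields like `amount_usd`, `account`, `counterparty`, and `scope_id`. The function name and the field list are illustrative; a real gate would also check the timestamp window and who approved.

```python
# Fields that must match exactly between the proposed action and the approval.
BOUND_FIELDS = ("amount_usd", "account", "counterparty", "scope_id")


def approval_covers(action, approval):
    """True only if the approval event covers exactly this financial action."""
    if approval.get("event_type") != "approval_event":
        return False
    # A field missing on either side is treated as a mismatch, never as a wildcard.
    return all(
        f in action and f in approval and action[f] == approval[f]
        for f in BOUND_FIELDS
    )
```

For real money I would store amounts as integer minor units or `Decimal` rather than floats; the float comparison here only keeps the sketch short.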
Token spend is a symptom, not the disease
Token metrics matter, but not because I care about the bill first. Token spikes tell me the agent is struggling.
I track input tokens, output tokens, total cost, and cost per completed task. Then I compare by task type. A blog draft and a pull request review should not have the same cost profile. If they do, one of them is loading too much context or using the wrong model tier.
The useful alert is not "this run cost $0.42." The useful alert is "this task type usually costs $0.04 and this run cost $0.42." That points to context bloat, retry loops, or delegation loops.
I also watch baseline drift. A single run that costs 3x more than normal is obvious. A task type that moves from $0.04 to $0.07 to $0.11 over two weeks is quieter and usually more dangerous. That is slow context bloat becoming the new normal.
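Both checks fit in a few lines. This is a sketch under assumptions: the baseline is the median of past runs for the same task type, the spike factor is 3, and drift compares the most recent window of runs against the window before it. The window size is arbitrary.

```python
from statistics import median


def is_spike(cost, history, factor=3.0):
    """Flag a run that costs more than `factor` times the median of past runs."""
    return bool(history) and cost > factor * median(history)


def drift_ratio(costs, window=20):
    """Median of the latest `window` runs divided by the median of the window before it.

    Returns None when there is not enough history to compare.
    """
    if len(costs) < 2 * window:
        return None
    previous = median(costs[-2 * window:-window])
    recent = median(costs[-window:])
    return recent / previous if previous else None
```

A drift ratio that stays above 1 for several windows in a row is the quiet case: no single run trips the spike check, but the baseline itself is moving.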
I track budget burn by phase too:
- context loading
- model reasoning
- tool result ingestion
- final writing
- review pass
Phase markers do not need to be fancy. Log a phase name before large context loads, model calls, tool ingestion, and review passes, then attach token and cost deltas to that phase.
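As a sketch, a small context manager is enough. The `PhaseLog` class, the `phase_event` event type, and the `add_usage` call are hypothetical names for this example; the caller is assumed to report token and cost usage as it happens.

```python
import json
import time
from contextlib import contextmanager


class PhaseLog:
    def __init__(self, task_id, sink):
        self.task_id = task_id
        self.sink = sink        # any object with write(), such as an open file
        self.tokens = 0         # running totals, updated by the caller
        self.cost_usd = 0.0

    def add_usage(self, tokens, cost_usd):
        self.tokens += tokens
        self.cost_usd += cost_usd

    @contextmanager
    def phase(self, name):
        start_tokens, start_cost = self.tokens, self.cost_usd
        start = time.monotonic()
        try:
            yield
        finally:
            # Emit one JSON line with the deltas accumulated during this phase.
            self.sink.write(json.dumps({
                "event_type": "phase_event",
                "task_id": self.task_id,
                "phase": name,
                "tokens": self.tokens - start_tokens,
                "cost_usd": round(self.cost_usd - start_cost, 6),
                "duration_ms": int((time.monotonic() - start) * 1000),
            }) + "\n")
```

Usage is a `with log.phase("context_loading"):` block around the work, with `log.add_usage(...)` called after each model response. The event is written even if the phase raises, so failed phases still show their spend.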
Most cost waste hides in context loading. Agents pull full files when they need 40 lines. They paste long tool results into the next turn. They carry stale conversation history because nobody wrote durable state to a file. When cost jumps, I inspect the input side before blaming the model.
This connects directly to model routing. I don't want frontier models handling heartbeat pings or low-stakes summaries. I want the expensive model only where the reasoning demand justifies it. The cost breakdown post covers the routing side in more detail.
Context pressure needs its own meter
Context collapse is the production agent failure I hit most often. The agent starts with the right objective, accumulates tool output, then loses the constraints that made the task specific.
So I want context pressure logged per turn:
- estimated input tokens
- percentage of context window used
- number of files loaded
- largest tool output included
- system prompt size
- user task size
- persistent memory size
The exact token count does not need to be perfect. A rough estimate is enough to catch runaway runs. If the context window is 80% full and the agent is still halfway through the task, it needs to write state out and start a narrower pass.
The important part is recording what filled the window. "Context at 120k tokens" is less useful than "context at 120k tokens because one tool returned a giant log dump." One tells me there was pressure. The other tells me what to fix.
My preferred pattern is blunt: large tool outputs get summarized before they enter the next agent turn. Task state gets written to disk. The agent reads the state file, not the whole conversation. I covered the failure mode in the production debugging post, but the monitoring rule is simple: if context pressure rises faster than task progress, the run is headed toward failure.
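A rough per-turn meter could look like this. The token estimates per source and the `progress_pct` input are assumptions: how you estimate tokens and how you measure task progress depend on your agent framework, and the 80% threshold is only the figure used above.

```python
def context_pressure(parts, window_tokens):
    """Summarize what is filling the window.

    parts: mapping of source name -> estimated tokens in the current turn.
    """
    used = sum(parts.values())
    largest = max(parts, key=parts.get) if parts else None
    return {
        "input_tokens": used,
        "pct_used": round(100 * used / window_tokens, 1),
        "largest_source": largest,   # the "because" half of the log line
    }


def needs_checkpoint(pct_used, progress_pct, threshold=80.0):
    """Assumed rule: checkpoint when the window is nearly full and progress lags pressure."""
    return pct_used >= threshold and progress_pct < pct_used
```

Logging `largest_source` alongside the percentage is what turns "context at 120k tokens" into something I can act on.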
Delegation depth catches organizational bugs
Multi-agent systems fail in ways single agents don't. The most annoying failure is the delegation loop.
An orchestrator routes to a specialist. The specialist decides another agent owns it. The task moves again. Everyone acts reasonable. Nothing ships.
I track:
- parent task id
- child task ids
- delegation depth
- handoff reason
- assignee changes
- time spent waiting on dependencies
- number of agents touched
Depth greater than two is a warning. Depth greater than three is usually a design bug. Either the task is too vague, the roles overlap, or the orchestrator is avoiding a decision.
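Depth falls out of the parent task ids. A minimal sketch, assuming a `parent_of` mapping from child task id to parent task id built from the task log; the level names are made up for the example.

```python
def delegation_depth(task_id, parent_of):
    """Count handoffs between the root task and task_id."""
    depth, seen = 0, {task_id}
    while parent_of.get(task_id) is not None:
        task_id = parent_of[task_id]
        if task_id in seen:
            # A cycle means the task was delegated back to an earlier owner.
            raise ValueError("cycle in delegation chain")
        seen.add(task_id)
        depth += 1
    return depth


def depth_level(depth):
    """Map depth to the thresholds above: >2 warns, >3 is likely a design bug."""
    if depth > 3:
        return "design_bug"
    if depth > 2:
        return "warning"
    return "ok"
```

The cycle check matters more than it looks: a true delegation loop never terminates the walk, so it has to be detected rather than counted.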
Handoff summaries matter here. A child agent should not receive "handle this." It should receive the task, the acceptance criteria, the relevant files, the allowed side effects, and the expected output shape. Monitoring should tell me when handoffs are thin. Thin handoffs create expensive failures because every downstream agent spends tokens reconstructing missing context.
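A thin-handoff check can be as simple as validating the payload before the child task starts. The field names below are my own labels for the five items listed above, not an established format.

```python
# Hypothetical field names for the five things a handoff should carry.
REQUIRED_HANDOFF_FIELDS = (
    "task",
    "acceptance_criteria",
    "relevant_files",
    "allowed_side_effects",
    "expected_output",
)


def missing_handoff_fields(handoff):
    """Return required fields that are absent, None, or an empty string."""
    return [
        f for f in REQUIRED_HANDOFF_FIELDS
        # An empty list is allowed: "no side effects permitted" is a valid answer.
        if f not in handoff or handoff[f] is None or handoff[f] == ""
    ]
```

Logging the returned list on every handoff gives a cheap thinness metric per orchestrator without judging the content of each field.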
The org chart post explains why I split agents by role. Monitoring is how I prove the split works. If every task touches every agent, I don't have an org. I have a slow chatbot with job titles.
Human interrupts are signal
Humans interrupt agents for a reason. They correct scope. They reject output. They clarify missing context. They stop unsafe actions. They ask for a different format.
I log those interrupts as first-class events:
- clarification requested
- clarification answered
- human correction
- human rejection
- human approval
- approval without execution
- manual override
- publish approval
- financial approval
The goal is not to remove humans from the system. The goal is to know where humans are adding judgment and where agents are wasting their time.
If an agent asks for clarification on 60% of tasks, the task intake is weak. If humans reject 30% of drafts from the same content agent, the voice guide is not encoded tightly enough. If manual overrides cluster around publishing, the system needs a stronger review gate, not looser permissions.
Approval events are especially important for content, code, and finance. A draft is not a publication. A patch is not a merge. An approved invoice is not a payment. The monitoring system should preserve that boundary.
Quality needs delayed measurement
Some failures don't show up at task completion. A blog post can pass review and still underperform. A support response can satisfy the immediate ticket and still create confusion later. A code change can pass tests and still cause an incident next week.
For agent work, I want delayed quality signals tied back to the original task:
- content performance after 7 and 30 days
- search impressions and click-through rate
- backlinks earned
- support reopen rate
- bug reports linked to agent-written code
- human edits made after completion
- incident references
This is where the task id matters. If the task id disappears after completion, delayed quality never connects back to the agent run. You get vibes instead of learning.
I don't need a perfect attribution model. I need enough linkage to see patterns. If agent-written docs keep getting edited for missing caveats, the doc agent needs better review prompts. If one model tier produces twice the rejection rate, the router needs to know that.
Alerts should be sparse and operational
Most alerting for agents should not page anyone. Agent systems produce weird intermediate behavior. A model can generate a bad paragraph and recover. A tool call can fail once and succeed on retry. Alerting on every oddity creates noise.
I only want alerts for conditions that require action:
- side-effecting tool failed after retry cap
- task cost exceeded 3x the baseline for its task type
- task baseline drifted upward over a rolling window
- context pressure crossed a threshold with no checkpoint written
- delegation depth crossed three
- side effect attempted without the required approval
- approval request pending past SLA for publish, merge, payment, or external write
- repeated rejection pattern over a rolling window
- scheduled agent failed to run
Everything else goes to the log. Dashboards are for pattern detection. Alerts are for intervention.
The minimum viable monitoring stack
If I were starting from zero, I would not install a full observability platform first. I would write structured JSON lines for every task and every tool call.
One task event:
{"event_type":"task_event","task_id":"t_123","agent":"quill","type":"blog_post","status":"completed","duration_ms":184000,"model":"fast-default","input_tokens":42000,"output_tokens":3800,"tool_calls":9,"cost_usd":0.018}
One tool event:
{"event_type":"tool_event","task_id":"t_123","agent":"quill","tool":"write_file","side_effect":true,"status":"ok","latency_ms":120,"retry":0}
One human event:
{"event_type":"human_event","task_id":"t_123","agent":"quill","action":"rejected","reviewer":"harbor","at":"2026-05-07T11:24:00Z","note":"invented customer example; keep task review-blocked"}
One approval event:
{"event_type":"approval_event","task_id":"t_123","agent":"ledger","action":"approved_for_payment","approver":"will","amount_usd":420.00,"currency":"USD","account":"operating","counterparty":"vendor_456","scope_id":"invoice_789","at":"2026-05-07T11:24:00Z"}
Put those in files first. Then query them with SQLite or DuckDB. Only add a dashboard when the questions outgrow the log.
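Here is a minimal sketch of the SQLite route using only the Python standard library. The table layout and function names are assumptions for the example; the raw line is kept next to the extracted columns so nothing is lost if the schema changes later.

```python
import json
import sqlite3


def load_events(lines, conn):
    """Load JSON lines into one flat table, keeping the raw line for later."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "event_type TEXT, task_id TEXT, agent TEXT, status TEXT, cost_usd REAL, raw TEXT)"
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue
        e = json.loads(line)
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
            (e.get("event_type"), e.get("task_id"), e.get("agent"),
             e.get("status"), e.get("cost_usd"), line),
        )
    conn.commit()


def cost_per_agent(conn):
    """Task count and total cost per agent, from task events only."""
    rows = conn.execute(
        "SELECT agent, COUNT(*), SUM(cost_usd) FROM events "
        "WHERE event_type = 'task_event' GROUP BY agent ORDER BY agent"
    )
    return rows.fetchall()
```

To try it against a log file, open the file and pass the handle as `lines`, with `sqlite3.connect(":memory:")` or a path on disk as `conn`.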
The distinction is simple: human_event records judgment, correction, rejection, or clarification. approval_event records a control boundary that permits release, merge, publish, payment, or another side effect. Bind approvals to the exact artifact: invoice id, pull request, file path, diff id, publish URL, or artifact hash.
The copy-paste mental model:
- task_event: what work was attempted, who owned it, what status it reached, and what it cost
- tool_event: what the agent touched, how long it took, whether it retried, and whether it had side effects
- human_event: where a person corrected, rejected, clarified, or accepted the output
- approval_event: which gate allowed work to leave the system, with the approver and exact scope
That is enough to reconstruct the run without buying a platform first.
The mistake is buying monitoring before deciding what health means. Use your existing observability stack if it is already instrumented, but instrument the work layer before shopping for another dashboard. For agents, health means the system completes useful tasks with bounded cost, bounded risk, and enough traceability to debug failures.
What I watch now
At Mimir Works, I monitor the agent system the same way I monitor the business: did useful work leave the queue? Quill, our content agent, can write a draft, but the task is not done until the draft is submitted for review and later approved for publishing. Harbor reviews delivery quality. Forge reviews buyer fit. The log keeps those states separate. That boundary has already caught the exact mistake I want monitoring to catch: treating a completed run as shipped work.
My current scorecard is simple:
- Did the task complete?
- Did a human accept the output?
- How much did it cost compared to the task baseline?
- Did the agent call the right tools in the right order?
- Did context pressure stay under control?
- Did delegation stay shallow?
- Were side effects reviewed before they left the system?
- Did delayed quality hold up?
That tells me more than a latency chart ever will.
Agents are not just services that call models. They are work systems. Monitor the work.
If your agents complete runs but still drop work, start this week with one JSONL stream for task events, tool events, human events, and approval events. Then pipe it into your existing observability stack when the questions outgrow the log.
This is the kind of control layer I build into production agent systems before I trust them with real work.
Read next: Why AI Agents Break in Production and What It Costs to Run AI Agents.
