Systems·May 22, 2026·10 min

AI Agent Handoffs Need Receipts

AI agent handoffs fail when the next worker has to trust a summary without proof. Receipts turn multi-agent work into an audit trail: artifacts, commits, task IDs, tests, screenshots, and blockers.

Most agent handoffs are too polite.

One worker says "done". The next worker receives a short summary and a pile of implied trust. If the output is wrong, stale, half-written, untested, or parked in the wrong directory, the next worker has to rediscover the truth from scratch.

That is not a handoff. That is a rumor with a task ID attached.

A useful handoff leaves receipts. It names the artifact, points to the commit, records the validation result, links the task, shows unresolved blockers, and gives the next operator enough evidence to decide whether to continue, review, roll back, or stop.

This matters more as agent systems move from demos into delivery pipelines. A single agent in one chat thread gets away with fuzzy continuity. Humans plus agents across repos, task boards, approvals, git branches, databases, and release gates do not. If each crossing loses evidence, the system gets slower every time it tries to move faster.

I treat receipts as the minimum standard for cross-agent work. The NIST AI Risk Management Framework organizes AI risk work into four functions: Govern, Map, Measure, and Manage. Those functions only become operational when an agent handoff records who did what, what evidence exists, and where the next check should happen. The OWASP AI Agent Security Cheat Sheet makes the same point from the security side: tool access, human approval, logging, memory, and output validation are control surfaces, not vibes.

A receipt is not a longer summary

A summary says what happened.

A receipt proves where to look.

That distinction sounds small until something breaks. "Updated the draft and ran checks" is a summary. It gives the next worker no handle. Which draft? Which checks? Did the checks pass? Were failures from the changed file or from old repo damage? Was the draft committed, staged, or only present in a local workspace?

A receipt answers those questions without a meeting.

A good handoff includes:

the task ID that owns the work
the artifact path or external URL
the commit hash or branch name when code or content changed
the validation command and exact result
screenshots or logs when visual state matters
the remaining blocker, if any
the next owner or decision boundary
the timestamp for the handoff

This is not bureaucracy. It is compression. The next worker gets a small packet of evidence instead of a vague narrative.
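As a rough sketch, that packet can be a small typed record. The Python below is illustrative only: the `Receipt` class, its field names, and the choice of which three fields are mandatory are assumptions made for the example, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Receipt:
    """Hypothetical handoff record; names are illustrative, not a standard."""

    task_id: str                       # the task that owns the work
    artifact: str                      # path or external URL
    next_owner: str                    # next owner or decision boundary
    validation: dict[str, str] = field(default_factory=dict)  # command -> exact result
    commit: Optional[str] = None       # hash or branch, when files changed
    evidence: list[str] = field(default_factory=list)         # screenshots, logs
    blocker: Optional[str] = None      # remaining blocker, if any
    handed_off_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        # Assumption for this sketch: without these three handles the record
        # is only a summary, so refuse to build it.
        for name in ("task_id", "artifact", "next_owner"):
            if not getattr(self, name).strip():
                raise ValueError(f"receipt is missing {name}")


receipt = Receipt(
    task_id="t_12345678",
    artifact="content/blog/example-post.mdx",
    next_owner="reviewer",
    validation={"git diff --check": "passed"},
    blocker="needs approval before publish",
)
```

Failing at construction time is a design choice: the worker finds out the receipt is incomplete while the context is still fresh, not when the next worker goes looking.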

Why AI agent handoffs fail

Agent handoffs fail for boring reasons.

The artifact exists, but not where the next worker expects it. The commit exists, but the branch was never pushed. The check passed, but only after ignoring a warning that matters. The screenshot shows the old UI because the server cache was stale. The task was marked complete, but the work still needs approval before release.

These are not model intelligence problems. They are operating system problems.

I see five common failure modes.

1. The artifact is unnamed

The worker says the draft, patch, report, or export is complete, but does not name the path.

The next worker searches the repo, finds three possible files, and guesses. That guess then becomes the new state of the system. One bad path contaminates the next step.

The fix is simple: every handoff names the artifact path exactly.

Bad:

"Drafted the post."

Good:

"Draft saved at content/blog/ai-agent-handoffs-need-receipts.mdx."

The path is the receipt. Without it, the summary is not operational.
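A named path is also checkable. As a minimal sketch, the receiving worker can confirm the artifact is where the receipt says before touching anything else. `find_artifact` is a hypothetical helper, and the throwaway directory below only stands in for a real workspace so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_artifact(workspace: Path, artifact: str) -> Path:
    """Resolve a receipt's artifact path inside a workspace, or fail loudly."""
    root = workspace.resolve()
    candidate = (root / artifact).resolve()
    # Reject paths that climb out of the workspace (for example "../other").
    if root != candidate and root not in candidate.parents:
        raise ValueError(f"{artifact} is outside {root}")
    if not candidate.is_file():
        raise FileNotFoundError(f"receipt names {artifact}, but it is not in {root}")
    return candidate


# Demo in a temporary directory standing in for the real workspace.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    post = workspace / "content" / "blog" / "example-post.mdx"
    post.parent.mkdir(parents=True)
    post.write_text("draft: true\n")

    found = find_artifact(workspace, "content/blog/example-post.mdx")
    print(found.name)
```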

2. The state is not durable

A worker edits a file, runs a local check, and stops. The file exists only in that worker's workspace. The next worker is spawned in a different environment and sees nothing.

This is common in multi-agent code and content systems because every worker has a different lifecycle. Some workspaces persist. Some are scratch. Some are worktrees. Some are shared directories with unrelated dirty files.

Durability has to be explicit.

For repo work, I want one of these states in the handoff:

committed and pushed to a named branch
committed locally with the exact commit hash and a stated reason it was not pushed
uncommitted in a named workspace because review is required before commit
blocked before write, with no artifact created

Anything else leaves the next worker guessing which copy of reality is real.
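One way to keep those four states honest is to tie each to the facts it has to carry. This is a sketch under assumptions: the `Durability` enum and `durability_gaps` function are hypothetical names, and requiring a reason for the blocked-before-write state is my addition for the example.

```python
from enum import Enum
from typing import Optional


class Durability(Enum):
    PUSHED = "committed and pushed to a named branch"
    LOCAL_COMMIT = "committed locally, not pushed"
    UNCOMMITTED = "uncommitted in a named workspace"
    BLOCKED_BEFORE_WRITE = "blocked before write, no artifact"


def durability_gaps(
    state: Durability,
    branch: Optional[str] = None,
    commit: Optional[str] = None,
    workspace: Optional[str] = None,
    reason: Optional[str] = None,
) -> list[str]:
    """List what a handoff still has to state for the claimed durability."""
    needed = {
        Durability.PUSHED: {"branch": branch},
        Durability.LOCAL_COMMIT: {"commit": commit, "reason": reason},
        Durability.UNCOMMITTED: {"workspace": workspace, "reason": reason},
        Durability.BLOCKED_BEFORE_WRITE: {"reason": reason},
    }[state]
    return [name for name, value in needed.items() if not value]


# A local commit with a hash but no explanation is still incomplete.
print(durability_gaps(Durability.LOCAL_COMMIT, commit="9f3a21c"))
```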

3. Validation is described instead of recorded

"Tests pass" is better than nothing, but it is still weak.

The receipt needs the command and the result.

Good:

git diff --check: passed
npm run build: failed in existing Gatsby schema generation, no error references changed MDX
pytest tests/test_router.py: 18 passed in 4.2s

This lets the next worker distinguish content risk from repo risk. It also prevents laundering old failures. If a build was already broken before the edit, say that. If the new file broke the build, say that too.

I care less about a perfect green check than about honest provenance. A red check with a clear cause is safer than a green claim with no command behind it.
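Recording instead of describing can be as small as a wrapper around the command. The sketch below uses only the Python standard library; `run_check` and the shape of the returned dictionary are assumptions for illustration, and the `note` field is where a worker would say that a failure predates the change.

```python
import shlex
import subprocess
import sys


def run_check(command: list[str], note: str = "", timeout: float = 300.0) -> dict:
    """Run one validation command and record what actually happened."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    lines = (proc.stdout + proc.stderr).strip().splitlines()
    return {
        "command": shlex.join(command),
        "exit_code": proc.returncode,
        "passed": proc.returncode == 0,
        "tail": lines[-5:],   # last lines of output, kept as evidence
        "note": note,         # e.g. "failure predates this change"
    }


# Portable demo with a stand-in command. Inside a repo this might be
# run_check(["git", "diff", "--check"]).
result = run_check([sys.executable, "-c", "print('18 passed')"])
print(result["command"], "->", "passed" if result["passed"] else "failed")
```

The exit code and output tail go into the receipt as they are. The worker's interpretation goes in `note`, so the next reader can tell the evidence from the opinion.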

4. The blocker is hidden inside the prose

Blocked work should be visible at the task layer, not buried in the last paragraph of a summary.

If a draft needs legal approval, mark it blocked. If a deployment needs credentials, mark it blocked. If a reviewer found a release-stopping issue, mark it blocked. The blocker has to live in the task status or board field, not only in the handoff text. Do not complete the task and hope the next worker reads carefully.

A blocker receipt has three parts:

what decision is needed
who owns the decision
what work is safe to do while waiting

Example:

"Blocked on human approval to publish. Safe next work: copy edit, link audit, thumbnail generation. Unsafe next work: flipping draft: false or pushing a release commit."

That line prevents a well-meaning agent from crossing the approval boundary.
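The same blocker can be made machine-checkable. In this minimal sketch the `Blocker` class is hypothetical, the owner is a placeholder because the example above does not name one, and default-deny for unlisted actions is my assumption about the safest reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blocker:
    decision_needed: str
    decision_owner: str
    safe_while_waiting: frozenset
    unsafe_until_resolved: frozenset

    def allows(self, action: str) -> bool:
        """Default deny: only actions listed as safe may run while blocked."""
        return action in self.safe_while_waiting


publish_gate = Blocker(
    decision_needed="approval to publish",
    decision_owner="human approver",  # placeholder; name the real owner
    safe_while_waiting=frozenset({"copy edit", "link audit", "thumbnail generation"}),
    unsafe_until_resolved=frozenset({"flip draft: false", "push release commit"}),
)

print(publish_gate.allows("link audit"))
print(publish_gate.allows("send newsletter"))  # unlisted, so it waits too
```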

5. The handoff mixes completion with release

This is the most expensive mistake.

An agent completes a deliverable. A human has not approved it. Another system sees "done" and ships it.

The fix is separate states. In my own systems, completed, reviewed, approved, published, and released are different words. They mean different things. A draft can be complete and not publishable. A patch can be merged and not released. A report can be generated and not sent.
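A small state machine is one way to keep those words from collapsing into "done". This is a sketch, not a description of my production setup: the strict linear order and the rule that approval needs a named approver are assumptions chosen to make the gate explicit.

```python
from enum import IntEnum
from typing import Optional


class WorkState(IntEnum):
    COMPLETED = 1
    REVIEWED = 2
    APPROVED = 3
    PUBLISHED = 4
    RELEASED = 5


def advance(
    current: WorkState, target: WorkState, approver: Optional[str] = None
) -> WorkState:
    """Move work exactly one state forward; never skip a gate."""
    if target != current + 1:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    if target is WorkState.APPROVED and not approver:
        raise PermissionError("APPROVED requires a named approver")
    return target


state = advance(WorkState.COMPLETED, WorkState.REVIEWED)
print(state.name)
```

With this shape, a system that sees COMPLETED cannot jump to PUBLISHED. It has to pass through review and a named approval first.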

This is why task records matter. I made the same point in "Monitoring AI Agents in Production: What to Watch", where the advice was to monitor the task outcome, not just the run outcome. The run can succeed while the work is still waiting at the approval gate.

Receipts make that gate visible.

The minimum receipt format

Here is the handoff shape I use when one worker passes work to another.

Task: t_12345678
Handoff time: 2026-05-14 10:57 UTC
Owner: Quill
Status: draft-ready, review-blocked
Kanban handle: task t_12345678, comment thread updated with receipt
Artifact: content/blog/example-post.mdx
Workspace: /workspace/site
Commit: 9f3a21c on main
Validation:
  - git diff --check: passed
  - npm run build: failed, existing Gatsby image schema issue, changed MDX not referenced
Evidence:
  - screenshot: /tmp/example-post-preview.png
  - live URL: not available because draft:true
Blocker:
  - needs A2A consensus before publish
Approval boundary:
  - no publish, send, merge, payment, customer message, or production write without explicit approval
Next safe action:
  - review content, patch low-risk edits, keep draft:true
Next unsafe action:
  - publishing or announcing as live

That is enough for another agent to continue without asking me to replay the run.

The format does not need to be fancy. It needs stable required fields. A markdown comment, kanban handoff, issue comment, pull request body, database row, or log entry works. In a kanban-driven system, the board is the canonical coordination record. Chat memory and private agent memory are supporting context, not source of truth.
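Stable fields also mean the receipt can be read by a program. The parser below is a minimal sketch for the plain `Key: value` layout with indented `- item` lines shown above. The function names and the `REQUIRED` tuple are assumptions; pick the required set that fits your own board.

```python
# Assumed required set for this sketch; adjust to your own handoff standard.
REQUIRED = ("Task", "Status", "Artifact", "Validation", "Next safe action")


def parse_receipt(text: str) -> dict[str, list[str]]:
    """Parse 'Key: value' lines and indented '- item' lines into lists."""
    fields: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            if current is None:
                raise ValueError(f"list item before any field: {line!r}")
            fields[current].append(line[2:].strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'Key: value' line: {line!r}")
        current = key.strip()
        fields[current] = [value.strip()] if value.strip() else []
    return fields


def missing_fields(fields: dict[str, list[str]]) -> list[str]:
    """Return required field names that are absent or empty."""
    return [name for name in REQUIRED if not fields.get(name)]


SAMPLE = """\
Task: t_12345678
Status: draft-ready, review-blocked
Artifact: content/blog/example-post.mdx
Validation:
  - git diff --check: passed
  - npm run build: failed, existing Gatsby image schema issue
Blocker:
  - needs A2A consensus before publish
"""

fields = parse_receipt(SAMPLE)
print(missing_fields(fields))
```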

Observability systems care about correlation for the same reason. The OpenTelemetry logs specification ties logs back to traces, metrics, source attribution, and distributed context. AI agent handoffs need that shape: correlated evidence, not one giant transcript.

A decision framework for receipts

Not every task needs the same receipt depth. A five-minute note edit does not need a full incident packet. A production deployment does.

I use this framework.

Low-risk handoff

Use for internal notes, small drafts, or reversible copy edits.

Required receipt:

artifact path
short summary
validation if any check was run
blocker if the task is not complete

Medium-risk handoff

Use for blog drafts, code patches, data exports, support replies, or anything another worker will review.

Required receipt:

artifact path
task ID
commit or branch
validation command and result
external systems touched, if any
internal links or external sources touched
explicit next action
explicit publish, send, merge, or release boundary

High-risk handoff

Use for money movement, customer communication, production deploys, database migrations, security changes, legal copy, or irreversible external writes.

High-risk work needs approval before execution, not only a receipt after the fact. A receipt for an unauthorized payment, billing change, customer commitment, secret rotation, or legal claim is just evidence that the control failed. The approval evidence should also be inspectable by the reviewing operator, not trapped in a private chat summary.

Required receipt:

all medium-risk fields
reviewer or approver identity and authority
approval timestamp
approved action
approved destination, account, counterparty, customer, or system
approved amount or scope, if applicable
approval expiration or validity window
evidence link to the approval
rollback or mitigation plan
screenshots or logs
intended external side effect, recorded before the action
exact external side effect, recorded after the action
confirmation ID, transaction ID, ticket ID, deploy SHA, or message URL
what was explicitly not done
secrets or credentials explicitly excluded from the handoff
human approval state
scoped credentials or least-privilege limit used for the action
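The three tiers translate into a lookup table. The sketch below is an approximation: the snake_case keys are names I made up for the list items, and the conditional items ("if any", "if applicable") are left out because they are not always required.

```python
LOW = {"artifact_path", "summary"}
MEDIUM = {
    "artifact_path", "task_id", "commit_or_branch", "validation",
    "next_action", "release_boundary",
}
HIGH = MEDIUM | {
    "approver_identity", "approval_timestamp", "approved_action",
    "approved_destination", "approval_expiry", "approval_evidence_link",
    "rollback_plan", "screenshots_or_logs", "intended_side_effect",
    "performed_side_effect", "confirmation_id", "not_done",
    "secrets_excluded", "human_approval_state", "credential_scope",
}
REQUIRED_BY_TIER = {"low": LOW, "medium": MEDIUM, "high": HIGH}


def missing_for(tier: str, receipt: dict) -> list[str]:
    """Return the required fields a receipt lacks for its risk tier, sorted."""
    return sorted(name for name in REQUIRED_BY_TIER[tier] if not receipt.get(name))


patch_receipt = {
    "artifact_path": "content/blog/example-post.mdx",
    "task_id": "t_12345678",
    "commit_or_branch": "9f3a21c",
    "validation": {"git diff --check": "passed"},
    "next_action": "review content",
    "release_boundary": "no publish without approval",
}

# Complete for a medium-risk handoff, far from complete for a high-risk one.
print(missing_for("medium", patch_receipt))
print(len(missing_for("high", patch_receipt)))
```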

For high-risk work, the unsafe actions should be named plainly: sending a customer email, initiating a payment, changing billing, rotating production secrets, publishing legal or security claims, migrating production data, or writing to a third-party system that cannot be rolled back cleanly.

The receipt should show both sides of the boundary:

Before action:
  - intended side effect: send renewal email to Customer A
  - approval: Will, 2026-05-14 10:42 UTC, approved only the renewal reminder copy
  - scope: one customer, no discount, no billing change
After action:
  - performed side effect: email sent to customer-a@example.com
  - confirmation: message URL in helpdesk ticket 4821
  - not done: no invoice change, no discount promise, no account update
  - mitigation: named owner sends a follow-up correction email if the customer reports a mismatch

The higher the blast radius, the less I trust prose. I want handles, evidence, and state transitions.
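The "before action" half can be checked in code before anything is sent. This is a sketch under stated assumptions: `Approval` and `scope_violations` are hypothetical names, matching on exact strings is a simplification, and the expiry time is invented because the example receipt does not state a validity window.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Approval:
    approver: str
    action: str
    target: str
    expires_at: datetime


def scope_violations(
    approval: Approval, action: str, target: str, now: datetime
) -> list[str]:
    """Compare an intended side effect with what was actually approved."""
    problems = []
    if action != approval.action:
        problems.append(f"action '{action}' was not approved")
    if target != approval.target:
        problems.append(f"target '{target}' was not approved")
    if now >= approval.expires_at:
        problems.append("approval has expired")
    return problems


approval = Approval(
    approver="Will",
    action="send renewal reminder",
    target="Customer A",
    # Hypothetical validity window; the receipt above does not state one.
    expires_at=datetime(2026, 5, 15, 10, 42, tzinfo=timezone.utc),
)
now = datetime(2026, 5, 14, 11, 0, tzinfo=timezone.utc)

print(scope_violations(approval, "send renewal reminder", "Customer A", now))
print(scope_violations(approval, "change billing", "Customer A", now))
```

An empty list means the intended action sits inside the approved scope. Anything else should stop the run and go back to the approver.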

What this looks like in kanban

Kanban is not just a list of tasks. Used correctly, it is a receipt ledger and the authoritative state surface for the work. The board or task record should be updated before the handoff counts as complete.

Each task has an ID, assignee, status, parent dependencies, comments, run history, and a completion handoff. That gives the system an audit trail without forcing every worker to keep all context in memory.

A bad kanban completion summary repeats the problem:

Done. Draft updated and checks run.

A good kanban completion summary is short enough for a dashboard and specific enough for routing:

{
  "summary": "Drafted agent handoff post at content/blog/ai-agent-handoffs-need-receipts.mdx; body word count recorded in validation metadata; internal links to monitoring, orchestration, agent-org, and services pages; build passed.",
  "metadata": {
    "changed_files": ["content/blog/ai-agent-handoffs-need-receipts.mdx"],
    "internal_links": [
      "/blog/monitoring-ai-agents-in-production/",
      "/blog/top-7-multi-agent-orchestration-patterns/",
      "/blog/how-my-agent-org-evolved/",
      "/services/"
    ],
    "validation": {
      "anti-slop grep": "passed, false positives only",
      "em dash check": "passed, 0 found",
      "bold section check": "passed, 0 found",
      "git diff --check": "passed",
      "npm run build": "passed"
    },
    "publish_blockers": ["draft:true pending approval"]
  }
}

The next worker does not need the whole transcript. They need the receipt.
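The difference between the two summaries can be tested mechanically. The checker below is a rough heuristic, not a complete validator: `routing_gaps` is a hypothetical function, the three rules are my assumptions about what routing needs, and the "good" payload is a trimmed version of the one above.

```python
import json


def routing_gaps(payload: dict) -> list[str]:
    """Flag what a completion payload lacks before it can be routed on."""
    meta = payload.get("metadata", {})
    summary = payload.get("summary", "")
    gaps = []
    files = meta.get("changed_files", [])
    if not files:
        gaps.append("no changed_files listed")
    elif not any(path in summary for path in files):
        gaps.append("summary does not name any changed file")
    if not meta.get("validation"):
        gaps.append("no validation results recorded")
    if "publish_blockers" not in meta:
        gaps.append("publish_blockers not stated (use [] for none)")
    return gaps


bad = json.loads('{"summary": "Done. Draft updated and checks run."}')
good = json.loads("""
{
  "summary": "Drafted agent handoff post at content/blog/ai-agent-handoffs-need-receipts.mdx; build passed.",
  "metadata": {
    "changed_files": ["content/blog/ai-agent-handoffs-need-receipts.mdx"],
    "validation": {"git diff --check": "passed", "npm run build": "passed"},
    "publish_blockers": ["draft:true pending approval"]
  }
}
""")

print(routing_gaps(bad))
print(routing_gaps(good))
```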

This is also why I like the blackboard pattern in "Top 7 Multi-Agent Orchestration Patterns". Shared state works when workers write durable facts into a common surface. It fails when the shared surface turns into an unstructured transcript dump.

Receipts are how the blackboard stays useful.

Receipts beat memory

Agent memory helps, but it is the wrong place for task truth.

Memory is for stable facts: preferences, conventions, durable system knowledge. Task receipts belong near the work. The file path belongs with the task. The test result belongs with the commit. The blocker belongs in the board. The screenshot belongs in the review thread.

When task truth lives only in memory, the system drifts. A future run retrieves a stale fact and treats it as current. A different worker lacks the memory entry entirely. A human reviewer cannot inspect the evidence without asking the agent to explain itself.

Durable handoffs should survive the agent that wrote them.

This is the same lesson behind role clarity in "How My AI Agent Org Evolved as the Work Got Real". Ownership gets easier when responsibilities are explicit. Receipts make the ownership visible after the context window disappears.

The checklist I use before handing off

Before I hand work to another agent or human, I check this list.

Did I name the artifact path?
Did I record the task ID?
Did I say whether the work is draft, review-ready, approved, published, or released?
Did I include the commit hash or branch when files changed?
Did I record every validation command I ran?
Did I distinguish new failures from pre-existing repo failures?
Did I include screenshots, logs, or URLs when visual or external state matters?
Did I state unresolved blockers in one clear sentence?
Did I name the next safe action?
Did I name the action that must not happen yet?
Did I say what was explicitly not done?

If a handoff lacks those answers, the next worker is not receiving the work. They are receiving a scavenger hunt.

Receipts slow down the right part

The objection is predictable: receipts add overhead.

They do. So does writing commit messages, naming tests, and keeping task IDs. The point is not to remove overhead. The point is to move it from failure recovery into normal operation.

A receipt costs thirty seconds when the context is fresh. Reconstructing missing state costs twenty minutes later, and it usually happens under pressure.

Agent systems fail quietly when every worker optimizes for finishing its own run. They get reliable when each worker leaves enough evidence for the next one.

Use the receipt format above in the next handoff you run. This is especially important before giving agents write access to repos, customer systems, billing, production data, or outbound comms. If your agent workflows need auditable handoffs, Mimir Works can help design the operating loop through AI workflow automation services.

That is the standard I want: no handoff without proof.
