Most teams evaluate agents with demos and vibes. They ask the agent to do one impressive thing, watch it work once, then argue about whether it is ready for real work.
That is not evaluation. That is theater.
An agent is dependable only when it survives a task suite that looks like production: messy inputs, missing context, tool errors, review gates, budget limits, and work that can be rejected by a human. The score that matters is not whether the answer sounded smart. The score that matters is whether the work shipped without unsafe actions, hidden rework, or a reviewer cleaning up the mess.
I run Mimir Works like an operating system for agent labor. Every serious task leaves receipts: board state, tool calls, files changed, review notes, commits, blocks, and outcomes. Without that measurement layer, I would be guessing.
Public benchmarks are useful for orientation. SWE-bench tests GitHub issue fixes. GAIA tests assistant work across tools and reasoning. AgentBench and tau-bench push closer to task execution. I read those as signals, not replacements for my own workflow suite.
The question I care about is narrower: can this agent do my work, under my constraints, with evidence I can audit?
Start with the work, not the model
The worst agent evaluation starts with a model comparison.
Model A gets 91. Model B gets 88. Model C is cheaper. None of that tells me whether the agent can close the work loop inside my system.
I start with a task inventory.
A task is a unit of work a person would recognize as done:
- triage five support tickets and draft replies
- reconcile a monthly transaction export against receipts
- review a pull request for security and regression risk
- write a blog post from a brief, then route it through review
- monitor a workflow and escalate only when a threshold is crossed
- update a CRM field after verifying the source record
Each task needs a known input, expected artifact, acceptance criteria, allowed tools, forbidden actions, and review path. If I cannot write those down, I am still designing the job.
The tool contract belongs in that task spec. I want preconditions, permission tier, input schema, side effects, idempotency, retry behavior, rollback path, and audit log requirement written before the run. "Can use the CRM" is not a tool contract. "Can read customer records, cannot write fields, must log record IDs read, and must block if two records conflict" is closer.
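One way to keep that contract from staying informal is to write it as a typed record and refuse to run a task whose spec has gaps. The sketch below is illustrative only: `ToolContract`, `TaskSpec`, `spec_gaps`, the `"read"`/`"write"` tiers, and the `crm.read_customer` tool name are all assumptions made for the example, not an existing schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolContract:
    """One tool's contract, written before the run (field names are illustrative)."""
    name: str
    permission_tier: str             # "read" or "write" in this sketch
    input_schema: dict[str, str]     # parameter name -> expected type
    preconditions: tuple[str, ...]
    side_effects: tuple[str, ...]    # empty for a read-only tool
    idempotent: bool
    max_retries: int
    rollback: Optional[str]          # None when there is nothing to undo
    audit_fields: tuple[str, ...]    # what the audit log must record


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    input_ref: str
    expected_artifact: str
    acceptance_criteria: tuple[str, ...]
    allowed_tools: tuple[ToolContract, ...]
    forbidden_actions: tuple[str, ...]
    review_path: str


def spec_gaps(spec: TaskSpec) -> list[str]:
    """Return the reasons a spec is not ready to run; an empty list means ready."""
    gaps = []
    if not spec.acceptance_criteria:
        gaps.append("no acceptance criteria")
    if not spec.review_path:
        gaps.append("no review path")
    for tool in spec.allowed_tools:
        if not tool.audit_fields:
            gaps.append(f"{tool.name}: no audit log requirement")
        if tool.permission_tier == "read" and tool.side_effects:
            gaps.append(f"{tool.name}: read tier declares side effects")
        if tool.permission_tier == "write" and tool.rollback is None:
            gaps.append(f"{tool.name}: write tier has no rollback path")
    return gaps


# The read-only CRM contract from the paragraph above, written out.
crm_read = ToolContract(
    name="crm.read_customer",
    permission_tier="read",
    input_schema={"customer_id": "str"},
    preconditions=("customer ID is present in the task input",),
    side_effects=(),
    idempotent=True,
    max_retries=2,
    rollback=None,
    audit_fields=("record_ids_read",),
)
```

A spec that returns a non-empty list from `spec_gaps` is still a job design, not a task.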
This matters because agents rarely fail as abstract chatbots. They fail at boundaries. They call the wrong tool. They skip a required file. They assume approval. They produce a plausible artifact with one missing constraint. A task suite exposes those failures. A chat benchmark hides them.
Build a task suite with real variance
A good suite has 20 to 50 tasks for one workflow. Five is enough for a smoke test. It is not enough for trust.
I split the suite into four buckets:
- Happy path tasks. Clean inputs, clear instructions, expected output.
- Messy normal tasks. Missing fields, ambiguous labels, noisy context, stale notes.
- Tool failure tasks. Timeouts, validation errors, empty search results, rate limits.
- Policy boundary tasks. External writes, money movement, publishing, customer contact, private data.
The point is not to trick the agent. The point is to find the shape of the work it can handle without a person babysitting it.
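Bucket coverage and suite size can be checked mechanically before any scoring happens. This is a minimal sketch, assuming each task is tagged with exactly one bucket; the bucket identifiers and the `suite_coverage` function are hypothetical names, and the default of 20 comes from the lower bound mentioned above.

```python
from collections import Counter

# Hypothetical identifiers for the four buckets described above.
BUCKETS = ("happy_path", "messy_normal", "tool_failure", "policy_boundary")


def suite_coverage(task_buckets: dict[str, str], min_tasks: int = 20) -> list[str]:
    """Return coverage problems for a suite given a task_id -> bucket mapping."""
    problems = []
    unknown = sorted({b for b in task_buckets.values() if b not in BUCKETS})
    if unknown:
        problems.append(f"unknown buckets: {', '.join(unknown)}")
    counts = Counter(task_buckets.values())
    for bucket in BUCKETS:
        if counts[bucket] == 0:
            problems.append(f"no tasks in bucket: {bucket}")
    if len(task_buckets) < min_tasks:
        problems.append(f"only {len(task_buckets)} tasks: smoke test only")
    return problems
```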
For a content agent, the happy path is a complete post brief. The messy normal case is a brief with two competing angles and an old draft in the repo. The tool failure case is a build that fails because of a pre-existing TypeScript issue. The policy boundary case is a draft that looks ready but still needs review before publication.
That last case is important. Agents love finishing. Evaluation has to reward stopping at the right boundary. A blocked task can be a successful outcome when the next step requires human approval.
Write acceptance criteria before the run
I do not score agents against whatever I wish they had done after the fact. I write the acceptance criteria first.
For each task, I want criteria in this shape:
- Output exists in the expected location.
- Output follows the required schema or format.
- Required sources were read or cited.
- Forbidden words, claims, or actions are absent.
- Side effects are logged.
- The agent used the required review gate.
- The final summary includes enough evidence to audit the run.
A blog task can be accepted if the MDX file exists, frontmatter is valid, the piece hits the target length, two internal links resolve, anti-slop checks pass, the draft stays unpublished during review, and the work log records the status.
A finance task can be accepted if the reconciliation table balances, unmatched rows are listed with dollar amounts, source files are named, no balancing plug was invented, no payment was initiated, and the agent escalated anything above its authority.
A code review task can be accepted if findings include file paths, severity, reproduction notes, and no invented line numbers. If the agent says "looks good" without inspecting the diff, it fails even if the pull request happens to be safe.
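Criteria written before the run can be expressed as named checks over whatever the harness records afterwards. The sketch below encodes the blog example; the record keys (`artifact_path`, `frontmatter_valid`, and so on) and the `evaluate`/`accepted` helpers are assumptions for illustration, and a real harness would populate them from its own checks.

```python
from typing import Callable

RunRecord = dict                      # hypothetical: facts the harness collected after a run
Check = Callable[[RunRecord], bool]

# Named checks for the blog task described above.
BLOG_CRITERIA: dict[str, Check] = {
    "artifact exists": lambda r: bool(r.get("artifact_path")),
    "frontmatter is valid": lambda r: r.get("frontmatter_valid") is True,
    "meets target length": lambda r: "target_words" in r
    and r.get("word_count", 0) >= r["target_words"],
    "two internal links resolve": lambda r: r.get("resolved_internal_links", 0) >= 2,
    "anti-slop checks pass": lambda r: r.get("anti_slop_passed") is True,
    "draft stays unpublished": lambda r: r.get("published") is False,
    "work log records status": lambda r: bool(r.get("work_log_status")),
}


def evaluate(record: RunRecord, criteria: dict[str, Check]) -> dict[str, bool]:
    """Run every named check and keep the per-criterion result for the reviewer."""
    return {name: check(record) for name, check in criteria.items()}


def accepted(results: dict[str, bool]) -> bool:
    """A task is accepted only when every criterion passes."""
    return all(results.values())
```

Keeping the per-criterion results, rather than a single boolean, is what lets a reviewer see which criterion failed without rerunning the task.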
Acceptance criteria turn agent evaluation from taste into inspection. OpenAI Evals is a useful reference for reusable tests, but the hard part is still defining tasks and criteria that match your workflow.
Score outcomes in layers
I use a layered score because one number hides too much.
The first layer is binary: did the agent complete the task without violating a hard rule?
Hard rules are things like:
- no external publish without review and the required release gate
- no customer message or other customer contact without approval
- no trade, transfer, payment, refund, payroll, tax filing, customer credit, account change, or contract change without explicit approval
- no write to production records from a read-only task
- no invented source, quote, number, or file path
- no secret or private data in logs
A hard-rule violation is a fail, even if the artifact looks good.
The second layer is acceptance: did the deliverable meet the criteria?
The third layer is effort: how much rework did the human need to do? The reviewer assigns this score. The agent does not grade its own homework.
I use this simple scale:
| Score | Meaning | Human action |
| --- | --- | --- |
| 5 | Accepted as-is | Review only |
| 4 | Minor edits | Small copy, formatting, or labeling fix |
| 3 | Usable with rework | Human changes a section, reruns a check, or corrects assumptions |
| 2 | Mostly wrong | Human salvages fragments |
| 1 | Unsafe or unusable | Throw it away |
The rework score is more honest than a pass rate. Agents often produce something salvageable. If every task scores a 3, the demo looks fine and the operation is losing time. The human is doing the real work after the agent declares victory.
The ROI calculation starts after reviewer cleanup, not before it. A cheap run that burns 20 minutes of senior review time is not cheap.
I track acceptance rate and rework rate separately. A task that gets accepted after 30 minutes of cleanup is not the same as a task accepted as-is.
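The three layers can be kept as separate fields so that no single number hides the others. This is a sketch under assumptions: `LayeredScore` and its verdict strings are hypothetical names, and the ordering (hard rules first, then acceptance, then rework) follows the layers described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayeredScore:
    hard_rule_violations: tuple[str, ...]   # layer 1: any entry is an automatic fail
    criteria_met: bool                      # layer 2: acceptance criteria result
    rework: int                             # layer 3: 1 to 5, assigned by the reviewer

    def __post_init__(self) -> None:
        if not 1 <= self.rework <= 5:
            raise ValueError("rework score must be between 1 and 5")

    @property
    def verdict(self) -> str:
        # A hard-rule violation fails the task even if the artifact looks good.
        if self.hard_rule_violations:
            return "fail (hard rule)"
        if not self.criteria_met:
            return "not accepted"
        # Accepted as-is and accepted after cleanup are reported separately.
        return "accepted as-is" if self.rework == 5 else "accepted with rework"
```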
A copyable scorecard row looks like this:
| Task | Hard-rule result | Acceptance result | Rework score | Failure tag | Receipt |
| --- | --- | --- | --- | --- | --- |
| Support triage with stale CRM note | Pass | Fail | 3 | Context miss | Draft replies, CRM record IDs read, reviewer note, blocked send action |
That row tells me more than "agent passed 70 percent." It shows safety, acceptance, rework, failure type, and proof.
Measure the handoff, not just the artifact
An agent that writes a decent artifact but leaves no receipts is not production-ready.
I expect every run to leave a handoff:
- what changed
- where the artifact lives
- which checks ran
- what failed
- what remains blocked
- which decisions were made
- which side effects happened
- which policy or scorecard version was used
This is why task-level monitoring matters. I covered the operating metrics in Monitoring AI Agents in Production: What to Watch. Evaluation uses the same substrate, but with a tighter question: can this agent produce auditable work under known conditions?
Receipts catch hidden failures. If an agent says it ran tests but the shell history shows no test command, that is a failure. If it says the draft is ready but the file is still missing frontmatter, that is a failure. If it completes a task that should have blocked for approval, that is a failure.
The artifact is only one output. The handoff is part of the work.
A task is not complete until the reviewer can verify the artifact, checks, and side effects from receipts without rerunning the whole job.
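Checking claims against receipts can be as simple as comparing what the summary says ran with what the command log shows. The sketch below assumes the harness captures a list of executed commands and that each claim can be tied to a substring expected in that log; `unverified_claims` and both input shapes are hypothetical.

```python
def unverified_claims(claimed_checks: dict[str, str], command_log: list[str]) -> list[str]:
    """Return claims with no matching command in the log.

    claimed_checks maps a claim label (e.g. "ran tests") to a substring
    that should appear in at least one logged command.
    """
    return [
        claim
        for claim, needle in claimed_checks.items()
        if not any(needle in command for command in command_log)
    ]


# Example: the agent claims two checks, but only the build appears in the log.
claims = {"ran tests": "pytest", "built site": "npm run build"}
log = ["git status", "npm run build"]
print(unverified_claims(claims, log))  # prints ['ran tests']
```

Substring matching is deliberately crude; it catches the "said it ran tests, never did" case and nothing subtler.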
For money movement and other financial actions, the approval receipt needs amount, currency, counterparty, source system, account, payment rail or order type, approver identity and authority, timestamp, source evidence, duplicate-check status, and final verification result. If any field is missing, the agent does not act.
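The "any field missing, do not act" rule is easy to enforce in code. A minimal sketch, assuming the approval receipt arrives as a flat dictionary; the key names are illustrative spellings of the fields listed above, not a real schema.

```python
# Illustrative key names for the approval fields listed above.
REQUIRED_APPROVAL_FIELDS = (
    "amount", "currency", "counterparty", "source_system", "account",
    "rail_or_order_type", "approver_identity", "approver_authority",
    "timestamp", "source_evidence", "duplicate_check_status",
    "final_verification_result",
)


def missing_approval_fields(receipt: dict) -> list[str]:
    """List required fields that are absent, None, or empty strings."""
    return [f for f in REQUIRED_APPROVAL_FIELDS if receipt.get(f) in (None, "")]


def may_act(receipt: dict) -> bool:
    """The agent acts only when every required field is present."""
    return not missing_approval_fields(receipt)
```

The check only proves the fields exist. Whether the approver actually holds the stated authority still has to be verified against the source system.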
Track failure modes by type
A flat fail count does not help me fix the system. I need a taxonomy.
These are the categories I use:
- Instruction miss. The agent ignored or misunderstood a stated requirement.
- Context miss. The agent failed to read available context or used stale context.
- Tool misuse. Wrong tool, wrong parameters, repeated call, or skipped required tool.
- Verification gap. The agent claimed success without checking the artifact.
- Boundary failure. The agent acted where it should have blocked or escalated.
- Fabrication. Invented source, number, quote, file path, or result.
- Format failure. Output exists but violates schema, frontmatter, JSON, markdown, or API contract.
- Cost failure. The task technically succeeds but burns far more tokens, time, or tool calls than the baseline.
- Coordination failure. Wrong assignee, bad handoff, missing dependency, or duplicate work.
- State contamination. Stale memory, previous run artifacts, or hidden session state leaks into the next task.
The category tells me what to change.
Instruction misses usually need better task specs or prompt wording. Context misses need smaller context loads, explicit required files, or state moved into durable artifacts. Tool misuse needs narrower contracts, dry-run modes, schema validation, permission tiers, and recovery rules. Boundary failures need stronger gates.
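A fixed set of tags makes the taxonomy countable across runs. The sketch below uses an `Enum` so reviewers cannot invent new categories mid-run; the `FailureTag` and `tally` names are assumptions for illustration.

```python
from collections import Counter
from enum import Enum


class FailureTag(Enum):
    INSTRUCTION_MISS = "instruction miss"
    CONTEXT_MISS = "context miss"
    TOOL_MISUSE = "tool misuse"
    VERIFICATION_GAP = "verification gap"
    BOUNDARY_FAILURE = "boundary failure"
    FABRICATION = "fabrication"
    FORMAT_FAILURE = "format failure"
    COST_FAILURE = "cost failure"
    COORDINATION_FAILURE = "coordination failure"
    STATE_CONTAMINATION = "state contamination"


def tally(tags: list[FailureTag]) -> list[tuple[str, int]]:
    """Count failures by category, most frequent first."""
    return [(tag.value, count) for tag, count in Counter(tags).most_common()]
```

The category at the top of the tally is the first thing to change: task spec, context loading, tool contract, or gate.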
The production failure mode post has more examples of how these break in live systems: Why AI Agents Break in Production. The evaluation version is the same idea with a scorecard attached.
Include negative tasks
Every serious suite needs tasks where the correct answer is no.
Do not publish this draft because review is missing. Do not send this email because the customer name conflicts with the CRM. Do not reconcile this row because two receipts match the same transaction. Do not call the write API because the task is read-only. Do not summarize this document because it contains private data outside the allowed scope.
Negative tasks test restraint. They also reveal whether the agent optimizes for completion over judgment.
I want to see agents block cleanly. A good block says what decision is needed and why. A bad block says "stuck." A worse agent plows ahead and creates cleanup work.
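Negative tasks can be graded on two things: whether the agent acted, and whether its block names the decision and the reason. This is a sketch with hypothetical names (`NegativeTaskResult`, `grade_negative_task`) and three illustrative grade labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NegativeTaskResult:
    acted: bool            # True if the agent performed the action it should have refused
    decision_needed: str   # what the agent says a human must decide
    reason: str            # why the agent stopped


def grade_negative_task(result: NegativeTaskResult) -> str:
    """Grade a task where the correct answer is no."""
    if result.acted:
        return "boundary failure"
    # A block that does not name both the decision and the reason is just "stuck".
    if not result.decision_needed.strip() or not result.reason.strip():
        return "bad block"
    return "clean block"
```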
The best agents are not the ones that always act. They are the ones that know when action would corrupt the workflow.
Run the same suite more than once
Single-run evaluation is weak. Agent behavior varies with prompt, model, context length, tool availability, and state left over from earlier failures.
I run suites in batches and compare:
- first-pass acceptance rate
- accepted as-is rate
- average rework score
- hard-rule violation count
- timeout count
- blocked count
- tool error recovery rate
- cost per accepted task
- review minutes per accepted task
- approval latency for human decisions
- median and p95 duration
Then I rerun the same suite after changes. New prompt, same tasks. New tool contract, same tasks. New model, same tasks. Fresh session, same tasks.
This prevents the common mistake of changing five things at once and declaring victory. If acceptance improves but cost doubles and boundary failures remain, I have moved the failure, not fixed it. If the new agent improves average score but regresses on hard-rule cases, it does not ship.
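Several of the batch metrics above reduce to a few lines of arithmetic over per-run results. The sketch below is one possible shape, not a reference implementation: `RunResult`, `summarize`, and `blocks_release` are hypothetical names, p95 uses the nearest-rank method, and only a subset of the listed metrics is computed.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    accepted: bool             # reviewer accepted the deliverable
    rework: int                # reviewer's 1 to 5 rework score
    hard_rule_violation: bool
    cost_usd: float
    duration_s: float


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Batch metrics for one pass over the suite. Expects at least one run."""
    clean = [r for r in runs if r.accepted and not r.hard_rule_violation]
    durations = sorted(r.duration_s for r in runs)
    p95_index = max(0, math.ceil(0.95 * len(durations)) - 1)   # nearest-rank p95
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "acceptance_rate": len(clean) / len(runs),
        "as_is_rate": sum(r.rework == 5 for r in clean) / len(runs),
        "mean_rework": statistics.mean(r.rework for r in runs),
        "hard_rule_violations": sum(r.hard_rule_violation for r in runs),
        # Cost is divided by accepted tasks, so failed runs make accepted ones pricier.
        "cost_per_accepted": total_cost / len(clean) if clean else math.inf,
        "median_duration_s": statistics.median(durations),
        "p95_duration_s": durations[p95_index],
    }


def blocks_release(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Reasons the candidate should not replace the baseline."""
    reasons = []
    if candidate["hard_rule_violations"] > baseline["hard_rule_violations"]:
        reasons.append("regressed on hard-rule cases")
    if candidate["acceptance_rate"] < baseline["acceptance_rate"]:
        reasons.append("acceptance rate dropped")
    return reasons
```

Running `summarize` on the same suite before and after a single change, then passing both summaries to `blocks_release`, keeps the comparison to one variable at a time.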
For multi-agent systems, I also track delegation depth and handoff quality. A manager that routes work to the wrong specialist makes every downstream worker look worse. The patterns in Top 7 Multi-Agent Orchestration Patterns matter because each creates different evaluation risks.
A practical evaluation checklist
Before I let an agent near production work, I want this checklist complete:
- The workflow is described as tasks, not prompts.
- Each task has input, expected output, allowed tools, forbidden actions, and acceptance criteria.
- Tool contracts define permissions, side effects, retries, rollback behavior, and audit logs.
- The suite includes happy path, messy normal, tool failure, and policy boundary cases.
- The suite has 20 to 50 tasks for serious scoring; five is enough only for a smoke test.
- Negative tasks are included.
- Hard-rule failures are defined before the run.
- Human reviewers score rework on a 1 to 5 scale.
- The owner, reviewer, approval threshold, and escalation path are defined.
- Baseline comparison exists: human-only, previous agent, or current workflow.
- Go/no-go thresholds are explicit: zero hard-rule violations, target first-pass acceptance, p95 duration, and review minutes per accepted task.
- Failure modes are tagged by category.
- The agent leaves receipts: files, logs, checks, summaries, side effects.
- Cost and duration are measured per accepted task, not per model call.
- The same suite can be rerun after prompt, tool, or model changes.
- A release gate exists for actions that change the world: money movement, customer contact, external publishing, contract changes, and production writes.
Use this as a pre-production launch gate, not just a retrospective scorecard. In week one, pick one workflow, write five smoke-test tasks, run them with a named owner and reviewer, then expand to 20 to 50 tasks before production access.
If any line is missing, I treat the evaluation as provisional.
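The go/no-go line in the checklist can be turned into an explicit gate. This is a minimal sketch: the metric and threshold key names are assumptions, and the threshold values in the example are placeholders rather than recommendations.

```python
def go_no_go(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return blocking reasons; an empty list means go."""
    reasons = []
    # Zero hard-rule violations is not a tunable threshold.
    if metrics["hard_rule_violations"] > 0:
        reasons.append("hard-rule violations must be zero")
    if metrics["first_pass_acceptance"] < thresholds["min_first_pass_acceptance"]:
        reasons.append("first-pass acceptance below target")
    if metrics["p95_duration_s"] > thresholds["max_p95_duration_s"]:
        reasons.append("p95 duration above limit")
    if metrics["review_min_per_accepted"] > thresholds["max_review_min_per_accepted"]:
        reasons.append("review minutes per accepted task above limit")
    return reasons


# Placeholder thresholds for illustration only.
example_thresholds = {
    "min_first_pass_acceptance": 0.80,
    "max_p95_duration_s": 600,
    "max_review_min_per_accepted": 5,
}
```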
What good looks like
A dependable agent does not need a perfect score. It needs a known operating envelope. That is also the point of the Mimir Works services model: design the workflow, run it with receipts, and keep the release gate visible.
I want to know: in this illustrative run, this agent handles 82 percent of normal support triage as-is, blocks correctly on payment questions, fails mostly on missing CRM fields, costs $0.06 per accepted task, and needs human rework on 11 percent of runs. That is useful. I can decide whether to deploy it, constrain it, or improve the data it reads.
If you need this evaluation layer built around your workflows, Mimir Works can build the task suite, scorecard, review gate, and monitoring loop around the work your team already does.
I do not trust: this agent seems pretty good.
The whole point of evaluation is to replace vibes with operating facts. Tasks. Criteria. Scores. Rework. Failure categories. Receipts.
That is how I decide whether an agent is ready. Not by asking if it sounded intelligent. By checking whether the work survived the system.
Read next: Monitoring AI Agents in Production: What to Watch and Why AI Agents Break in Production.
