
Agent Runtime Config Migrations Need Rollback Plans

Agent runtime config migrations fail quietly when they rewrite files without dry runs, diffs, backups, validation, and a verified rollback path.

Runtime config migrations scare me more than model failures.

A bad model answer is visible in the transcript. A bad config migration can sit in a dotfile, break the next worker at boot, and look like a model, tool, or credential problem until somebody opens the generated file by hand.

That is exactly the class of failure behind Hermes issue #26250: hermes codex-runtime migrate produced an invalid ~/.codex/config.toml. The generated file contained duplicate TOML keys, placed a global permission value inside an existing MCP server table, and embedded a stale, process-specific HERMES_HOME value that pointed the config at the wrong runtime home. Codex refused to load it. The operator had to repair the file manually.

PR #26249, later superseded by the merged fix in #33452, points at a related update-path failure from another angle. This was not a config migration bug: hermes update checked one branch state, returned early, and skipped the upstream sync path fork users needed. The command looked finished because the local condition was true. The runtime state was still stale.

Together, these are not isolated bugs. They are the AI agent config migration problem: config writes are side effects, but they often get treated like setup chores. They need the same discipline as database migrations, deploy scripts, and release gates.

If an agent runtime can rewrite the files that decide which tools load, which credentials are visible, which permissions apply, and which process boots next, it needs a rollback plan before the first write.

Config is part of the runtime, not paperwork

Most agent stacks still treat config like documentation with syntax.

It is not. Config is executable policy.

A TOML file can grant shell access. A YAML block can start MCP servers. An environment file can expose the production token instead of a sandbox token. A profile file can choose the model for a task and the tools the model can call.

That means a config migration is a runtime mutation. It can change behavior without changing application code.

The failure mode is worse because config sits in the boot path. If the migration corrupts a normal data file, the agent may still start and report the bad artifact. If the migration corrupts runtime config, the next worker may never boot far enough to explain what happened.

I want agent systems to treat config writes as first-class release events. Not huge ceremonies. This is small-team process: dry run, diff, backup, validation, rollback, and a clear stop condition.

The quiet failure pattern

The dangerous migration is the one that mostly works.

It finds the file. It writes a managed block. It prints success. The command exits zero. Then a later runtime reads the file differently than the migration renderer expected.

That gap creates quiet failures.

In the Hermes Codex migration bug, the managed block was only safe if no TOML table scope was open above it. Real user config can already contain [mcp_servers.name] tables, and TOML scope rules do not reset because a comment says the next section is managed. A table stays open until the next table header or the end of the file, so if the renderer appends bare keys below one, a generated default_permissions line lands inside the previous table.

That is not an exotic edge case. It is what happens when migration code assumes a user's file shape instead of parsing and preserving it.

The other bug, duplicate generated keys, is the same story. A migration that appends or regenerates without proving uniqueness can create a file that no spec-compliant TOML parser accepts. The third bug, leaking a stale HERMES_HOME into generated config, turns a portable runtime file into a machine-specific artifact.

None of these require malice or a dramatic crash. They are ordinary config migration mistakes. They break agents because agents boot from config.

Dry run should be the default mode

A runtime migration should be able to show me the exact write before it writes.

I want a command shape like this:

hermes codex-runtime migrate --dry-run --diff

This is a contract pattern, not a claim about the current Hermes command surface. The point is the preview: the runtime should expose the exact plan it will apply.

The dry run should answer five questions:

  1. Which file will change?
  2. Which managed block will be added, replaced, or removed?
  3. Which user-owned lines will stay untouched?
  4. Which values come from the current runtime environment?
  5. Which validation checks will run after the write?

That output needs to be machine-readable enough for agents and readable enough for humans. A terminal diff is good. A JSON plan is better if another tool will inspect it before approval.

The important part is that dry run and apply use the same planner. One trap is a preview path that renders one thing while the mutation path writes another. That creates a fake preview. The write path should execute the already validated plan.
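One way to keep preview and apply on the same planner is to build a single immutable plan object and let both paths read from it. The sketch below is an assumption about shape, not Hermes code: MigrationPlan, build_plan, preview, and text_to_write are hypothetical names, and the model and sandbox keys are placeholder config.

```python
import difflib
import hashlib
from dataclasses import dataclass


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class MigrationPlan:
    """Everything the preview shows and the apply step writes (hypothetical shape)."""
    target_path: str
    before_hash: str
    after_hash: str
    after_text: str
    diff: str


def build_plan(target_path: str, before_text: str, after_text: str) -> MigrationPlan:
    diff = "".join(
        difflib.unified_diff(
            before_text.splitlines(keepends=True),
            after_text.splitlines(keepends=True),
            fromfile=f"{target_path} (before)",
            tofile=f"{target_path} (after)",
        )
    )
    return MigrationPlan(
        target_path=target_path,
        before_hash=sha256_text(before_text),
        after_hash=sha256_text(after_text),
        after_text=after_text,
        diff=diff,
    )


def preview(plan: MigrationPlan) -> str:
    """Dry run: show the diff that was computed once, at plan time."""
    return plan.diff


def text_to_write(plan: MigrationPlan) -> str:
    """Apply: return the planned text as-is; nothing is rendered a second time."""
    return plan.after_text


plan = build_plan("config.toml", 'model = "a"\n', 'model = "a"\nsandbox = "on"\n')
print(preview(plan))
```

Because the plan is frozen and apply never calls the renderer again, the reviewed diff and the written bytes cannot drift apart inside one run.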

For an agent runtime, dry run is not just user experience. It is a safety contract. A worker can run the plan, attach the diff to a task, ask for review if the change crosses a boundary, then apply exactly what was reviewed.

A diff is the approval payload

Config approvals should bind to the actual diff, because intent is too vague on its own.

A generic approval like "migrate Codex config" is too loose. The migration might write a permission value, add an MCP server, change an environment path, or delete a user block. Those are different risk levels.

The approval payload should include:

  • target path
  • plan or migration id
  • before hash
  • after hash
  • unified diff
  • generated section owner
  • environment-derived values, redacted when they contain secrets
  • approver or run actor
  • approval timestamp
  • risk class: permissions, credentials, tool access, boot path, or cosmetic
  • expected blast radius and affected runtimes
  • validation commands
  • rollback path

That sounds like more work than a single yes. It is less work than debugging a broken worker that cannot start.

This is the same line I use for content, code, and finance agents: approval attaches to the payload, not the vague intent. A reviewed draft does not clear a rewritten draft. A reviewed migration diff does not clear a different diff.

If the file changed between dry run and apply, the migration should stop and ask for a new plan. That prevents one worker from approving a diff against stale state while another worker edits the same config.

During apply, the migration should take a file lock or equivalent guard so two runtimes cannot rewrite the same config at once.
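A rough sketch of both guards together: a lock file created with O_EXCL, and a before-hash comparison that refuses to apply a plan built against stale state. The names StalePlanError, exclusive_lock, and apply_if_unchanged are hypothetical. The lock-file approach is one portable option among several and does not clean up after a crashed process, and the plain write here stands in for the atomic temp-file-and-rename write a real apply would use.

```python
import hashlib
import os
from contextlib import contextmanager
from pathlib import Path


class StalePlanError(RuntimeError):
    """The target changed after the plan was approved."""


def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


@contextmanager
def exclusive_lock(target: Path):
    lock_path = target.with_name(target.name + ".lock")
    # O_EXCL makes creation fail with FileExistsError if the lock already exists.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def apply_if_unchanged(target: Path, approved_before_hash: str, new_text: str) -> None:
    """Write new_text only if the file still matches the hash that was reviewed."""
    with exclusive_lock(target):
        if file_hash(target) != approved_before_hash:
            raise StalePlanError(f"{target} changed since the plan was built; re-plan")
        target.write_text(new_text, encoding="utf-8")
```

The hash check runs inside the lock on purpose. Checking first and locking second would leave a window where another runtime could still edit the file.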

Backups need names, not vibes

Every runtime migration that changes an existing config should create a backup before it writes. If it creates a new file, it should record that there was no previous state to restore.

The backup should be boring and searchable:

~/.codex/config.toml.hermes-backup.2026-05-15T09-00-40Z

It should record:

  • original path
  • timestamp
  • command name
  • tool version
  • before hash
  • after hash if the write succeeded
  • task or run id when available

The backup should not live only in the agent transcript. It should live next to the file or in a known recovery directory. If the runtime cannot boot, the operator should still know where to look.

Backups also need retention rules. Keeping the last 10 per file is often enough. Keeping none because "git has it" is wrong for home directory dotfiles. Many runtime configs are not in git, and even when they are, local secrets and generated paths usually are not.

Backups must preserve or tighten the original file permissions, especially when the config may contain tokens, credential paths, or environment values. A rollback artifact should not turn a private config into a broadly readable secret dump.

I also want a restore command:

hermes codex-runtime rollback --to ~/.codex/config.toml.hermes-backup.2026-05-15T09-00-40Z

A backup without a restore path is an artifact, not a rollback plan.
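As a sketch of the pair, assuming the backup naming shown above: backup_config and restore_config are hypothetical helpers, not existing Hermes functions. The backup keeps the owner's permission bits and drops group and other access, which is one way to read "preserve or tighten".

```python
import os
import shutil
import stat
from datetime import datetime, timezone
from pathlib import Path


def backup_config(target: Path) -> Path:
    """Copy target to a timestamped sibling file and return the backup path."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    backup = target.with_name(f"{target.name}.hermes-backup.{stamp}")
    if backup.exists():
        # Two backups in the same second would collide; refuse rather than overwrite.
        raise FileExistsError(backup)
    shutil.copy2(target, backup)  # copies content plus mode bits and timestamps
    mode = stat.S_IMODE(backup.stat().st_mode)
    os.chmod(backup, mode & 0o700)  # keep owner bits, strip group/other access
    return backup


def restore_config(backup: Path, target: Path) -> None:
    """Put the backup back in place through a temporary copy and a rename."""
    tmp = target.with_name(target.name + ".restore-tmp")
    shutil.copy2(backup, tmp)
    os.replace(tmp, target)  # rename within the same directory
```

The restore copies instead of moving, so the backup survives a rollback and can be used again if the second attempt also fails.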

Validate with the real parser

String checks are not enough.

If the runtime emits TOML, parse it with the same TOML rules the consumer uses. Same for YAML and JSON. If it writes a shell environment file, run the safest available syntax check. If it writes an MCP server block, validate the server names, required fields, and environment references.

For config that another command consumes, the best validation is often the consumer itself in no-start or check mode, when that mode exists. If the consumer does not expose a check command, the migration should still run parser validation and structural checks before it reports success.

My minimum post-write checks are:

  • file parses
  • generated block appears exactly once
  • user-owned blocks remain present
  • no duplicate keys, duplicate table headers, or duplicate managed sections in any generated scope
  • environment-derived paths are either intentional or rejected
  • required commands and paths exist
  • runtime can load config without starting a long-lived worker

The success message should not say "migration complete" until those checks pass.

Managed blocks need hard boundaries

Generated config should live inside a clearly owned boundary.

Comments help humans, but parsers ignore them. A migration cannot rely on a comment to reset syntax state. It has to understand the file format.

For TOML, that means the renderer must know table scope. For YAML, it must know indentation and anchors. For JSON, it has to preserve ordering where humans care and reject duplicate keys before writing. For shell files, it has to quote values safely and avoid executing user content during validation.

A managed block also needs an owner and a version:

# BEGIN hermes-agent managed block: codex-runtime v2
# source: hermes codex-runtime migrate
# rollback: hermes codex-runtime rollback --latest
# END hermes-agent managed block: codex-runtime v2

The migration should replace only that region unless the user explicitly asks for a full-file rewrite. If no managed block exists, it should insert one at a format-safe location. If more than one exists, it should stop. Duplicate managed blocks are a recovery task, not a normal apply.
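Those three cases (replace, insert, stop) can be sketched as one function. This assumes the managed block holds only bare top-level keys, which is why the top of the file counts as a format-safe insert location for TOML: keys placed before the first table header stay top-level. replace_managed_block and DuplicateManagedBlock are hypothetical names.

```python
BEGIN_PREFIX = "# BEGIN hermes-agent managed block"
END_PREFIX = "# END hermes-agent managed block"


class DuplicateManagedBlock(RuntimeError):
    """More than one managed block exists; this is a recovery task, not an apply."""


def replace_managed_block(text: str, new_block: str) -> str:
    """Return text with the single managed block replaced, or inserted at the top.

    new_block must include its own BEGIN and END marker lines.
    """
    lines = text.splitlines(keepends=True)
    begins = [i for i, line in enumerate(lines) if line.startswith(BEGIN_PREFIX)]
    ends = [i for i, line in enumerate(lines) if line.startswith(END_PREFIX)]

    if len(begins) > 1 or len(ends) > 1:
        raise DuplicateManagedBlock("found more than one managed block; stopping")
    if len(begins) != len(ends) or (begins and begins[0] > ends[0]):
        raise ValueError("managed block markers are unbalanced")

    if not new_block.endswith("\n"):
        new_block += "\n"

    if not begins:
        # No block yet: insert above all user content so no user table is open.
        separator = "\n" if text else ""
        return new_block + separator + text

    start, end = begins[0], ends[0]
    # Everything outside the marker lines is user-owned and is copied verbatim.
    return "".join(lines[:start]) + new_block + "".join(lines[end + 1:])
```

If the block ever needs its own tables, the top-of-file assumption no longer holds, because user keys below it would fall into the block's last table. At that point the renderer needs a real TOML-aware writer rather than line splicing.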

The more I run agents, the more I prefer narrow ownership. The runtime owns its block. The user owns the rest. The migration proves it stayed in its lane.

Rollback has to be verified before the next worker boots

The hard part is not writing rollback code. It is making rollback part of the migration flow.

After a config migration, the system should know three things before it starts the next worker:

  1. The new config loads.
  2. The previous config has a known restore command and, where supported, a dry-run check proving the backup is usable.
  3. The restore path has been recorded outside the model context.

That last point matters because, if the next worker fails to boot, you cannot depend on the failed worker to explain rollback. The recovery instruction has to exist in a file, log, or backup manifest the operator can read directly.

In a multi-agent system, I would also write a task event:

{
  "event": "runtime_config_migrated",
  "path": "~/.codex/config.toml",
  "task_id": "t_...",
  "run_id": "r_...",
  "actor": "codex-runtime-migrate",
  "before_hash": "...",
  "after_hash": "...",
  "backup": "~/.codex/config.toml.hermes-backup.2026-05-15T09-00-40Z",
  "validation": "passed",
  "rollback_owner": "operator"
}

That is a receipt. It lets a later agent decide whether to continue, roll back, or block. The rollback_owner should be a named role or actor class, not a vague hope that somebody will notice. It also keeps completed, approved, and released from collapsing into one sloppy state. I wrote about that split in Monitoring AI Agents in Production: What to Watch. A runtime migration needs the same separation.
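A small sketch of how a receipt like that could gate the next boot, assuming events are appended to a JSON Lines file the operator can read without the agent runtime. record_event, may_boot, and the log location are hypothetical; the event fields follow the example above.

```python
import json
from pathlib import Path


def record_event(log_path: Path, event: dict) -> None:
    """Append one event as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")


def may_boot(log_path: Path, config_path: str) -> bool:
    """Allow boot unless the latest migration event for config_path failed validation."""
    latest = None
    if log_path.exists():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            event = json.loads(line)
            if (
                event.get("event") == "runtime_config_migrated"
                and event.get("path") == config_path
            ):
                latest = event  # later lines override earlier ones
    return latest is None or latest.get("validation") == "passed"
```

The gate reads only the file, never the model context, so it still works when the worker that ran the migration is gone.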

The migration checklist I want

Here is the checklist I would require before an agent runtime migration writes config:

  1. Preflight

    • Resolve the target path.
    • Confirm the file exists or declare that a new file will be created.
    • Compute a before hash.
    • If a prior plan exists, detect whether the file has changed since that plan was built.
    • Name the control owner for rollback decisions.
  2. Parse

    • Parse the current file with a real parser.
    • Identify user-owned sections and managed sections.
    • Reject duplicate managed blocks.
    • Reject syntax errors before mutation unless the command is explicitly a repair command.
  3. Plan

    • Build a proposed after state in memory.
    • Assign a plan id that apply can bind to.
    • Produce a unified diff.
    • List environment-derived values without exposing raw secrets.
    • List permission changes.
  4. Backup

    • Write a timestamped backup before apply.
    • Write a manifest with before hash, command, version, and task id.
    • Preserve or tighten file permissions.
    • Verify the backup can be read.
  5. Apply

    • Write atomically through a temporary file and rename.
    • Hold a file lock or equivalent concurrency guard.
    • Preserve file permissions where safe.
    • Touch only the managed block unless full rewrite was requested.
  6. Validate

    • Parse the after file.
    • Run consumer config check if available.
    • Verify generated block count is one.
    • Verify user-owned sections survived.
  7. Rollback

    • Record the restore command.
    • Test restore against the backup manifest in dry-run mode if supported.
    • Verify the rollback manifest or log can be read without the agent runtime.
    • Record the stop state outside model context.
    • Restore automatically only when the policy allows it. Otherwise stop and emit recovery instructions.
    • Stop the next worker from booting if validation fails.

This is not heavy process. It is the minimum standard for code that edits the files agents depend on to exist.
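The Apply step's "temporary file and rename" can be sketched in a few lines. atomic_write is a hypothetical helper; the temp file is created in the target's directory so the final rename stays on one filesystem, and a new file keeps the owner-only mode that mkstemp gives it.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(target: Path, text: str) -> None:
    """Write text to target so readers see either the old file or the new one."""
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        if target.exists():
            # Carry over the existing permission bits to the replacement file.
            os.chmod(tmp_name, target.stat().st_mode & 0o777)
        os.replace(tmp_name, target)  # replaces the target in one step
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A crash before os.replace leaves the original config untouched, which is the property the next worker's boot depends on.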

The rule I use now

If a migration can break the next agent before it starts, it needs a rollback plan.

That applies to Codex config, MCP server config, model provider config, gateway routing, profile toolsets, cron definitions, environment files, and anything else in the boot path.

The agent ecosystem is moving toward more automatic setup: one command to install tools, migrate runtimes, register servers, update profiles, and wire credentials. Good. I want less hand-editing. But automatic setup without recoverability is just faster breakage.

The fix is not to avoid migrations. The fix is to make migrations inspectable and reversible.

Dry-run the write. Diff the payload. Back up the old file, or record that no old file existed. Validate with the real parser. Record the rollback command. Do not boot the next worker until the config passes.

That is the difference between an agent runtime that can upgrade itself and one that quietly bricks the operator at 9 a.m.

If your agent runtime writes config automatically, audit the migration path before you enable it. The useful question is simple: can you prove the exact diff, restore the old file, and prevent the next worker from booting into bad state? If not, Mimir Works can audit the migration and rollback path before you let agents mutate production config.

Read next: Why AI Agent Setups Fail Within 48 Hours and Monitoring AI Agents in Production: What to Watch.
