Run It Like Production: Run AI Agents That Don't Lie, Leak, or Overspend

In ~8 minutes you'll be able to:

Settle any config question the only way that works: read the running box, not the doc.
Dispatch a job through the five-step sequence that has produced zero double-spawns.
Name the three habits that keep a pipeline honest overnight: persist first, watch cheap, one gateway.

The whole lesson in one block:

The answer, up front

Production discipline for agents is five habits. Trust the live config over any document, because configs flap and docs are snapshots. Change exactly one thing at a time, with a backup. Hand every worker an enumerable worklist with an explicit done-criterion. Dispatch through a fixed five-step sequence with a dry-run before any tokens move. And keep the cheap safeguards on: persist before you enhance, a zero-token watchdog, one chat surface for the whole fleet. None of it is clever. All of it was paid for.

01 Telemetry beats theory

Start with the meta-rule, because every other habit in this lesson depends on it. Never trust a document's stated model, role, or value over the running configuration. Configs flap. On this fleet, the bulk extractor has switched between a local model and a hosted one within the same hour.^[1] A doc is a snapshot. The running box is the truth. Before asserting any current state, read the live config file, the live process, the server logs.^[2]

Here is the story that turned this from a preference into a rule. A confident theory about local-model memory tuning said the fix was to turn a cache optimization off and cap the context window hard. The verdict got written down. Then live process inspection contradicted it: the optimization actually lowered memory use, and the context stayed pinned right where it was loaded. The verdict was corrected. Then it was re-corrected, again by the live process logs. The confident theory was wrong twice.^[3] Nothing was wrong with the reasoning. The reasoning just wasn't connected to the machine.

If you remember nothing else

The running box is the truth.

A document describes the system as it was. The live config, the live process, and the server logs describe the system as it is. When a tuning claim and the live telemetry disagree, the telemetry wins. Every time.

This is the meta-rule. Every other habit in this lesson is a way of staying connected to the running system instead of a story about it.
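One way to make the meta-rule mechanical is a lookup that answers only from the live file and reports whether the document's claim was stale. This is a minimal sketch, assuming the live config is a JSON file; the function name `current_value`, the file layout, and the key names are hypothetical and not taken from the fleet manual.

```python
import json
from pathlib import Path


def current_value(key: str, doc_claim: str, live_config_path: Path) -> tuple[str, bool]:
    """Return (live value, True if the document's claim was stale).

    The answer always comes from the live config file. The document's
    claim is only compared against it, never trusted over it.
    """
    live = json.loads(live_config_path.read_text())
    live_value = live[key]  # a KeyError means the key is not in the running config at all
    return live_value, live_value != doc_claim
```

The design choice is that the doc's value never appears in the return position. A stale claim is surfaced as a flag, so the caller can fix the document after acting on the truth.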

02 One change at a time, with a backup

A single careless edit to one config file took the entire fleet down for hours.^[4] One file. One edit. Hours of outage. The rule that edit left behind has three steps and no exceptions: back up before editing, change exactly one thing, verify it. Then move on.

The reason is recovery math, not caution. When two changes go in together and the box breaks, you don't know which change broke it. Now you're debugging instead of restoring. With a backup and a single change, recovery is a file copy and the cause is never in question. This is the most boring rule in the course, and it is non-negotiable.
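The three steps fit in one small function. The sketch below assumes a JSON config file and a caller-supplied `verify` check; `change_one_setting`, the `.bak` suffix, and the verification callback are all illustrative choices, not the fleet's actual tooling.

```python
import json
import shutil
from pathlib import Path
from typing import Any, Callable


def change_one_setting(
    config_path: Path,
    key: str,
    new_value: Any,
    verify: Callable[[dict], bool],
) -> bool:
    """Back up, change exactly one key, verify, and restore on failure."""
    backup_path = config_path.with_name(config_path.name + ".bak")
    shutil.copy2(config_path, backup_path)            # 1. back up before editing

    config = json.loads(config_path.read_text())
    config[key] = new_value                            # 2. change exactly one thing
    config_path.write_text(json.dumps(config, indent=2))

    if verify(json.loads(config_path.read_text())):    # 3. verify it
        return True
    shutil.copy2(backup_path, config_path)             # recovery is a file copy
    return False
```

Because the function accepts a single key, a second change needs a second call, with its own backup and its own verification. If the check fails, the cause is never in question.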

03 Worklists, not verbs

Never dispatch a worker with a vague verb. "Verify the outputs" reads like an instruction. It is actually an invitation for a capable model to substitute its own interpretation of the job, and capable models accept that invitation. The fix is a scoped, enumerable worklist plus an explicit done-criterion: this list of items, one fresh verdict per claim. The enumerating query is the authority on what is in scope, never a hard-coded count.

The receipt: handed a scoped brief, the fleet's verifier worked through about 520 claims across 23 targets in roughly 15 minutes.^[5] The same setup with a vague brief had previously drifted into a different job entirely. The difference was never the model. It was the brief.
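A scoped brief can be represented as data rather than prose. The sketch below is one possible shape, assuming the worklist comes from a callable that stands in for the enumerating query; the `Brief` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Brief:
    """A scoped brief: the enumerating query is the authority on scope."""
    enumerate_items: Callable[[], list[str]]
    verdicts: dict[str, str] = field(default_factory=dict)

    def remaining(self) -> list[str]:
        # Re-run the query every time; never cache or hard-code a count.
        return [item for item in self.enumerate_items() if item not in self.verdicts]

    def record(self, item: str, verdict: str) -> None:
        # A verdict on something outside the worklist is a different job.
        if item not in self.enumerate_items():
            raise ValueError(f"{item!r} is outside the worklist")
        self.verdicts[item] = verdict

    def done(self) -> bool:
        # Done-criterion: every enumerated item has a verdict.
        return not self.remaining()
```

If the underlying list grows while the worker is running, `done()` turns false again on its own, which is the point of trusting the query over a count.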

04 The dispatch sequence

Every job dispatch on this fleet runs the same five steps, in the same order.^[6] Each step exists because skipping it has a known price:

Archive stale. Clear out any old job for the same unit of work, so nothing wakes up later and does it twice.
Create the job: scoped, idempotency-keyed, one retry. A retry maps back to the same job instead of spawning a sibling.
Dry-run. Ask the dispatcher what it would fire, and expect exactly your job. A mis-routed or extra job dies here, before it spends.
Dispatch one. One job, not the queue. A local model server serializes whatever you throw at it.
Watch. A dispatched job is not a finished job.

The receipt: zero double-spawns, and mis-dispatches die at the dry-run, before the tokens are spent.^[6] The sequence costs a moment of care per dispatch. The next section lets you skip steps and pay the price safely.

Each step exists because skipping it has a known price; the dry-run kills a bad dispatch before it spends. (AgentOps fleet manual, P6)
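The five steps can be sketched against a toy in-memory dispatcher. Everything here is an assumption made for illustration: the `Job` and `Dispatcher` classes, the state names, and the rule that a dry-run must return exactly one key. The fleet's real dispatcher is not shown in this lesson.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    key: str                 # idempotency key: one per unit of work
    unit: str
    max_retries: int = 1     # "one retry", carried as data only in this sketch
    state: str = "pending"   # pending -> running -> done, or archived


@dataclass
class Dispatcher:
    jobs: dict[str, Job] = field(default_factory=dict)

    def archive_stale(self, unit: str) -> None:
        # Step 1: nothing old for this unit may wake up later.
        for job in self.jobs.values():
            if job.unit == unit and job.state == "pending":
                job.state = "archived"

    def create(self, key: str, unit: str) -> Job:
        # Step 2: a repeated create with the same key returns the same job.
        existing = self.jobs.get(key)
        if existing is None or existing.state == "archived":
            self.jobs[key] = Job(key=key, unit=unit)
        return self.jobs[key]

    def dry_run(self) -> list[str]:
        # Step 3: report what would fire, without firing anything.
        return [j.key for j in self.jobs.values() if j.state == "pending"]

    def dispatch_one(self, key: str) -> None:
        # Step 4: one job, not the queue.
        self.jobs[key].state = "running"


def dispatch(d: Dispatcher, key: str, unit: str) -> Job:
    d.archive_stale(unit)
    job = d.create(key, unit)
    would_fire = d.dry_run()
    if would_fire != [key]:
        raise RuntimeError(f"dry-run expected [{key!r}], got {would_fire}")
    d.dispatch_one(key)
    return job  # Step 5: the caller still has to watch until the job is done
```

In this sketch a stray pending job from anywhere in the queue makes the dry-run fail, so the dispatch stops with nothing running. A second `dispatch` call for a job that is already running fails the same way, which is how the toy version avoids a double-spawn.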

05 Try it: the Dispatch Stepper

Walk the sequence yourself. Run each step to see what it guarantees, or skip it to see exactly what goes wrong. Steps resolve in order, the same way a real dispatch does.

Five steps, in order. "Run it" shows what the step buys you. "Skip it" shows what it costs. You can't touch step 3 before resolving step 2, just like the real sequence.

Archive stale: clear old jobs for the same unit of work
Create the job: scoped, idempotency-keyed, one retry
Dry-run: expect exactly your job, nothing else
Dispatch one: one job, not the whole queue
Watch: a dispatched job is not a finished job

06 Three habits that hold at 2am

Three more patterns, grouped because they share a job: they keep the pipeline honest while nobody is looking.

Persist before you enhance. Write and deliver the real result before running any optional enrichment stage. If the enhancement crashes or times out, the record is already safe. Before this rule, a late-stage crash could strand a delivery in a half-done pending state. After it, the pipeline stopped stranding records at all.^[7]
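The ordering is the whole pattern, so it is worth seeing in code. A minimal sketch, assuming `persist` and `enhance` are callables supplied by the pipeline; both names and the `deliver` wrapper are hypothetical.

```python
from typing import Callable


def deliver(
    record: dict,
    persist: Callable[[dict], None],
    enhance: Callable[[dict], dict],
) -> dict:
    """Persist the real result first; enrichment is optional and may fail."""
    persist(record)                 # the record is safe from this point on
    try:
        enriched = enhance(record)  # optional stage: may crash or time out
    except Exception:
        return record               # the plain record was already stored; nothing is stranded
    persist(enriched)               # store the enriched version as well
    return enriched
```

A failed enhancement costs the enrichment and nothing else. There is no pending state for a record to get stuck in, because the first write happens before anything optional runs.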

The zero-token watchdog. A tiny shell script on a five-minute timer checks liveness and flow, auto-heals one known failure, and messages the operator on red. No model. No tokens. It is the cheapest possible monitor for the cheapest possible failure, and it has already earned its keep twice: it auto-healed a silent gateway outage, and it caught a real worker re-block.^[8]
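The fleet's watchdog is a shell script. The sketch below restates only its decision logic in Python so it can be read and tested in isolation. It assumes the one auto-healed failure is a gateway outage fixed by a restart; the four callables and the "green", "healed", and "red" labels are illustrative names, not the script's real interface.

```python
from typing import Callable


def watchdog_tick(
    gateway_alive: Callable[[], bool],
    jobs_flowing: Callable[[], bool],
    restart_gateway: Callable[[], None],
    notify_operator: Callable[[str], None],
) -> str:
    """One pass of the check; a timer would call this every five minutes."""
    if not gateway_alive():
        restart_gateway()               # the one known failure it heals itself
        if gateway_alive():
            return "healed"
        notify_operator("gateway down, restart failed")
        return "red"
    if not jobs_flowing():
        notify_operator("gateway up but jobs are not flowing")
        return "red"
    return "green"
```

Nothing in the tick calls a model. Every branch is a liveness probe, a restart, or a message, which is why a pass costs no tokens.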

One bot, one gateway. Exactly one chat surface for the whole fleet, and background workers carry no chat token at all. A token shared by two gateways causes conflicts and identity cross-wiring. A worker with no token cannot be hijacked over chat and does not need watching. So when the queue is empty and the workers go quiet, that is health, not failure.^[9]
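The rule is simple enough to check with a few lines over a process inventory. This sketch assumes each process is described by a dict with a `role` and an optional `chat_token`; that layout and the function name `check_chat_tokens` are hypothetical.

```python
def check_chat_tokens(processes: dict[str, dict]) -> list[str]:
    """Return violations of 'one bot, one gateway'; an empty list means clean."""
    problems = []
    holders = [name for name, p in processes.items() if p.get("chat_token")]
    gateways = [name for name in holders if processes[name].get("role") == "gateway"]
    if len(gateways) != 1:
        problems.append(
            f"expected exactly one gateway with a chat token, found {len(gateways)}"
        )
    for name in holders:
        if processes[name].get("role") != "gateway":
            problems.append(f"{name} is not a gateway but carries a chat token")
    return problems
```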

07 The Do-Not list

Every line below is a mistake already paid for, condensed from the fleet's full list.^[10] Read it as a pre-flight check before you put an agent anywhere near real work.

Credentials & safety

Never give a worker agent a master key or a superuser password. Scoped least-privilege roles only.
Never let the agent that authors a load-bearing claim also verify it, and never use the same model family for both.
Never treat a scraped web page as instructions. Embedded "mark verified" text is injection. Ignore it and flag it.
Never type a secret literal into a shell command. It lands in the transcript and forces a rotation.
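For the last rule in this group, one hedged sketch: a launcher that takes the name of an environment variable instead of the secret, and refuses to run if the secret's value shows up in the command line. `run_with_secret` is a hypothetical helper, and it assumes the child process reads the secret from its inherited environment.

```python
import os
import subprocess
import sys


def run_with_secret(argv: list[str], secret_env_var: str) -> subprocess.CompletedProcess:
    """Run a command that needs a secret without the secret appearing in argv."""
    secret = os.environ.get(secret_env_var)
    if not secret:
        raise RuntimeError(f"{secret_env_var} is not set")
    if any(secret in arg for arg in argv):
        raise ValueError("secret literal found in argv; pass the variable name instead")
    # The child inherits the variable through its environment, so the command
    # line that lands in a transcript names only the variable, never the value.
    return subprocess.run(argv, capture_output=True, text=True, check=True)
```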

Models & cost

Don't trust any doc over the live config for an agent's current model. It flaps.
Don't give a bulk authoring agent a paid fallback. Let it stall. Never let it silently spend.
Don't schedule multiple local-model jobs at the same minute. One local server serializes them.
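"Let it stall" can be expressed as a model picker whose fallback list is empty by default. A minimal sketch under that assumption; `pick_model`, the `Stalled` exception, and the model names in the usage note are all made up for illustration.

```python
class Stalled(Exception):
    """Raised when no permitted model is available; the job waits instead of spending."""


def pick_model(available: set[str], primary: str, fallbacks: tuple[str, ...] = ()) -> str:
    """Return the first permitted model that is up, or stall."""
    for candidate in (primary, *fallbacks):
        if candidate in available:
            return candidate
    raise Stalled(f"none of the permitted models is available (primary: {primary})")
```

With the default empty tuple, `pick_model({"hosted-paid"}, "local-small")` raises `Stalled` even though a paid model is up. Spending on it would take an explicit fallback entry that somebody chose to add.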

Jobs & workers

Don't hand a worker a vague brief. Give it an enumerable worklist and a done-criterion.
Don't assume a worker shares your shell environment or paths. Pass absolute paths and source its own environment.
Don't unblock a partially-claimed job while a newer replacement is already running. They will duplicate the work.
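The environment rule in this group translates directly into how a worker is launched. The sketch below assumes the worker is a Python script started as a subprocess; `run_worker` and the `WORKER_MODE` variable in the usage note are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def run_worker(script: Path, workdir: Path, env: dict[str, str]) -> subprocess.CompletedProcess:
    """Launch a worker with absolute paths and an explicitly supplied environment."""
    for p in (script, workdir):
        if not p.is_absolute():
            raise ValueError(f"{p} is relative; the worker may not share your working directory")
    # The worker sees exactly the variables in env. Nothing comes from the
    # caller's shell unless the caller put it in this dict.
    return subprocess.run(
        [sys.executable, str(script)],
        cwd=workdir,
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
```

A call such as `run_worker(Path("/opt/fleet/worker.py"), Path("/opt/fleet"), {"WORKER_MODE": "batch"})` makes every dependency on the environment visible at the call site. On some platforms the interpreter needs a few system variables to start, so in practice the dict may have to include those as well.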

One last operational pattern, because a manual like this grows faster than anyone can hold in their head. The fleet's knowledge base is router-first: a tiny index of about 30 lines maps the task at hand to exactly one self-contained recipe card, and that card links only the few deep references that actually apply, by exact section.^[11] A resuming operator reads the router, opens one card, follows it. Nobody reads the library front to back, and nobody needs to.

Two rules keep it honest. A card never grows into a second copy of the deep reference; it links the reference by exact section instead. And a new failure or win is filed once, in its canonical home, then cited from wherever it applies. The knowledge compounds. The reading cost per task does not.
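A router of this kind is small enough to sketch as a lookup table. The entries, card paths, and substring matching below are invented for illustration; the only property taken from the text is that a task resolves to exactly one card.

```python
# Task phrase -> exactly one card. All phrases and paths are hypothetical.
ROUTER = {
    "dispatch a job": "cards/dispatch.md",
    "edit a config": "cards/config-change.md",
    "worker is idle": "cards/idle-worker.md",
}


def route(task: str) -> str:
    """Return the single card for a task, or fail loudly if it does not resolve to one."""
    matches = [card for phrase, card in ROUTER.items() if phrase in task.lower()]
    if len(matches) != 1:
        raise LookupError(
            f"{task!r} matched {len(matches)} cards; the router must map to exactly one"
        )
    return matches[0]
```

Failing on zero or multiple matches keeps the router honest: an unrouted task is a gap to file once in its canonical home, not a reason to start reading the library front to back.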

08 Check your understanding

3 quick checks

Q1. A doc says the agent runs model X; the live config says Y. Which wins?

A doc is a snapshot of a machine that may no longer exist. Configs flap, sometimes within the same hour. Read the running box before asserting anything.^[1]

Q2. What does the dry-run exist to catch?

The dry-run asks the dispatcher what it would fire and expects exactly your job. A bad dispatch dies there, before the tokens are gone.^[6]

Q3. Why does the watchdog cost almost nothing to run?

A five-minute shell timer checks liveness and flow without ever touching a model. That is why it can run constantly without itself becoming an expensive point of failure.^[8]

09 The course in five lines

Why Agents Fail: agents lie, leak, and overspend, and every fix moves a safety property from a prompt (a request) into the substrate (a law).
The Verification Wall: the author of a claim never approves it, the verifier is a different model family, and the wall lives in database grants.
The Money Lesson: fail closed. Empty fallbacks, governors that refuse to dispatch blind, and the cheapest model that clears the bar.
The Hostile Web: a scraped page is data, never instructions, and least privilege caps what an injected agent can ever do.
Run It Like Production: trust the running box, change one thing at a time, scope every brief, dispatch through the sequence, and watch with a monitor too cheap to break.

Sources

[1] AgentOps fleet manual, The Six Rules, Rule 1: never trust a document's stated model, role, or value over the running configuration; the bulk extractor has flapped between a local model and a hosted one within the same hour.
[2] AgentOps fleet manual, Proven Patterns P13: before asserting any model, role, or state value, read the running source, the live config, the live process, the server logs, not a document and not recalled training data.
[3] AgentOps fleet manual, Proven Patterns P12 and The Do-Not List ("Models & cost"): a confident local-model memory-tuning theory was twice wrong; live process inspection showed the cache optimization lowered memory and kept context pinned where it was loaded. The running box settled it.
[4] AgentOps fleet manual, The Six Rules, Rule 6: a single careless edit to one config file took the entire fleet down for hours; back up before editing, change exactly one thing, verify it.
[5] AgentOps fleet manual, Proven Patterns P5: a scoped, enumerable worklist plus an explicit done-criterion; about 520 claims across 23 targets verified in roughly 15 minutes.
[6] AgentOps fleet manual, Proven Patterns P6: archive stale, create the job (scoped, idempotency-keyed, one retry), dry-run, dispatch one, watch; zero double-spawns, with mis-dispatches caught at the dry-run rather than after the tokens are gone.
[7] AgentOps fleet manual, Proven Patterns P9: persist before you enhance; a delivery pipeline stopped stranding records in a pending state after a late-stage crash.
[8] AgentOps fleet manual, Proven Patterns P7: a shell-only watchdog on a five-minute timer, no model, no tokens; it auto-healed a silent gateway outage and caught a real worker re-block.
[9] AgentOps fleet manual, Proven Patterns P14: one bot, one gateway; background workers carry no chat token; workers are invisible and idle when the queue is empty, which is healthy, not broken.
[10] AgentOps fleet manual, The Do-Not List: each line is a mistake already paid for, grouped across credentials and safety, models and cost, data and migrations, jobs and workers, gateways, and process hygiene.
[11] AgentOps fleet manual, Scenario Routing: router-first progressive disclosure; a roughly 30-line router maps a task to exactly one self-contained card, and the card links deep references by exact section. The knowledge compounds; the reading cost per task does not.

end of lesson 5