the checklist pattern for agentic runs

a pattern for writing prompts that survive long agentic runs: numbered steps, explicit verification, and a single ‘done’ criterion. nothing else i’ve tried works as well.

after three months of running long agentic tasks, i’ve converged on one prompt shape: the numbered checklist. it survives 4-hour runs better than any other format i’ve tried. here’s the shape and why each piece matters.

the shape

# goal: <one sentence, present tense>

## context (optional)
<3-5 lines of background only the agent couldn't infer>

## steps
1. <verb> <object> — <success criterion>
2. <verb> <object> — <success criterion>
3. ...

## done when
<a single, verifiable condition>
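if you generate these prompts from code, the shape maps onto a small data type. below is a minimal typescript sketch that renders text in the same layout as the template above. `Checklist`, `Step`, and `renderChecklist` are hypothetical names. the goal, step, and done-when strings are illustrative values, not taken from a real run. the context section is emitted only when you supply lines for it.

```typescript
// hypothetical types mirroring the prompt template
interface Step {
  action: string; // "<verb> <object>"
  criterion: string; // "<success criterion>"
}

interface Checklist {
  goal: string; // one sentence, present tense
  context?: string[]; // optional: 3-5 lines the agent couldn't infer
  steps: Step[];
  doneWhen: string; // a single, verifiable condition
}

// render a checklist into the prompt shape; "\u2014" is the em dash the template uses
function renderChecklist(c: Checklist): string {
  const lines: string[] = [`# goal: ${c.goal}`, ""];
  if (c.context !== undefined && c.context.length > 0) {
    lines.push("## context", ...c.context, "");
  }
  lines.push("## steps");
  c.steps.forEach((s, i) => {
    lines.push(`${i + 1}. ${s.action} \u2014 ${s.criterion}`);
  });
  lines.push("", "## done when", c.doneWhen);
  return lines.join("\n");
}

// illustrative values only
const rendered = renderChecklist({
  goal: "the chat API enforces a per-client rate limit",
  steps: [
    { action: "implement the rate limiter", criterion: "run pnpm test rate-limit, all green" },
  ],
  doneWhen: "the full pnpm test suite is green",
});
console.log(rendered);
```

storing each step as an `{ action, criterion }` pair makes the criterion rule structural: you can't build a step without a criterion field, though nothing stops that field from being empty.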

three rules:

every step starts with a verb. “implement the rate limiter,” not “the rate limiter.” noun-only steps read like headings, and headings get summarized away under load.
every step has its own success criterion. “run pnpm test rate-limit, all green” beats “verify it works.”
there’s one ‘done when’ at the bottom. without it, the agent ships the first thing that compiles. with it, the agent has a finish line and stops trying to add more.
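the three rules are mechanical enough to lint before a run starts. here is a rough sketch that assumes each step separates its action and criterion with the template's spaced em dash. `lintChecklist` is an illustrative name. the verb allowlist is a hypothetical stand-in; a real check would need a much larger list or a part-of-speech tagger. the linter only checks structure, so it can't tell a real criterion from “verify it works.”

```typescript
// hypothetical allowlist; a real linter would need far more verbs or a part-of-speech tagger
const VERBS = new Set(["add", "fix", "implement", "reduce", "refactor", "remove", "run", "update", "write"]);
// the template separates action and criterion with a spaced em dash
const SEP = " \u2014 ";

// returns rule violations; an empty list means the structure looks right
function lintChecklist(text: string): string[] {
  const problems: string[] = [];
  const lines = text.split("\n");
  for (const line of lines) {
    const m = /^(\d+)\.\s+(.*)$/.exec(line);
    if (m === null) continue; // not a numbered step
    const num = m[1];
    const body = m[2];
    const sepAt = body.indexOf(SEP);
    const action = sepAt === -1 ? body : body.slice(0, sepAt);
    const criterion = sepAt === -1 ? "" : body.slice(sepAt + SEP.length).trim();
    // rule 1: every step starts with a verb
    const firstWord = action.trim().split(/\s+/)[0].toLowerCase();
    if (!VERBS.has(firstWord)) problems.push(`step ${num}: does not start with a known verb`);
    // rule 2: every step has its own success criterion
    if (criterion === "") problems.push(`step ${num}: no success criterion`);
  }
  // rule 3: exactly one "done when", followed by a condition
  const doneAt: number[] = [];
  lines.forEach((l, i) => {
    if (l.trim() === "## done when") doneAt.push(i);
  });
  if (doneAt.length !== 1) {
    problems.push(`expected exactly one "## done when", found ${doneAt.length}`);
  } else if (doneAt[0] + 1 >= lines.length || lines[doneAt[0] + 1].trim() === "") {
    problems.push(`"## done when" has no condition`);
  }
  return problems;
}

const good = [
  "## steps",
  `1. implement the rate limiter${SEP}run pnpm test rate-limit, all green`,
  "",
  "## done when",
  "the full pnpm test suite is green",
].join("\n");
const bad = ["## steps", "1. the rate limiter", "2. implement the rate limiter"].join("\n");

console.log(lintChecklist(good)); // []
console.log(lintChecklist(bad));
```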

why it works

agents under load (long runs, deep context) fail in the same places humans do: they forget intermediate goals, they paraphrase instructions until the meaning drifts, and they substitute proxies for the real metric. the checklist shape resists all three:

numbered steps survive summarization better than prose
success criteria per step give the agent a self-check between steps
“done when” is a single anchor the agent can return to no matter how far it has drifted
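if you drive the checklist from your own harness, you can also enforce the per-step self-check and the final anchor from outside the agent. here is a minimal sketch under that assumption. `runChecklist`, `CheckedStep`, and the `doIt` callback are hypothetical, and the callback below just flips flags instead of calling a real agent. the loop stops at the first step whose check fails rather than building on it. it reports `done` only if the single ‘done when’ condition holds after every step has passed.

```typescript
// hypothetical harness types; not a real agent API
interface CheckedStep {
  action: string;
  check: () => boolean; // the step's success criterion, made executable
}

interface RunResult {
  completed: number; // steps whose check passed
  done: boolean; // whether the "done when" condition held at the end
}

function runChecklist(
  steps: CheckedStep[],
  doIt: (action: string) => void,
  doneWhen: () => boolean,
): RunResult {
  for (let i = 0; i < steps.length; i++) {
    doIt(steps[i].action);
    // self-check between steps: stop instead of building on a failed step
    if (!steps[i].check()) return { completed: i, done: false };
  }
  // the single anchor: passing every step is necessary but not sufficient
  return { completed: steps.length, done: doneWhen() };
}

// simulated work: the callback flips flags instead of calling an agent
const workState = { limiterWritten: false, testsGreen: false };
const runResult = runChecklist(
  [
    { action: "implement the rate limiter", check: () => workState.limiterWritten },
    { action: "run pnpm test rate-limit", check: () => workState.testsGreen },
  ],
  (action) => {
    if (action.startsWith("implement")) workState.limiterWritten = true;
    if (action.startsWith("run")) workState.testsGreen = true;
  },
  () => workState.limiterWritten && workState.testsGreen,
);
console.log(runResult); // { completed: 2, done: true }
```

`completed` and `done` are kept separate on purpose: passing every step is not the same as reaching the finish line.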

the failure mode it doesn’t fix

the checklist won’t save you from a vague step. “improve performance” with success criterion “it’s faster” will still produce nonsense. you have to write a real step — “reduce p95 latency on /api/chat below 200ms, measured with pnpm bench” — for the pattern to do its job.
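the difference between the two versions is whether the criterion can be evaluated at all. the concrete step reduces to a percentile and a threshold. here is a minimal sketch that assumes latency samples in milliseconds have already been collected by whatever benchmark the step names. `percentile` and `p95Below` are hypothetical helpers, and the samples are made up. the percentile uses the nearest-rank definition; other definitions interpolate and can give slightly different values.

```typescript
// nearest-rank percentile: the smallest sample with at least p% of samples at or below it
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// the step's success criterion as a predicate
function p95Below(samplesMs: number[], thresholdMs: number): boolean {
  return percentile(samplesMs, 95) < thresholdMs;
}

// made-up samples: 18 fast requests and a slow tail of two
const latencyMs: number[] = [...new Array<number>(18).fill(100), 210, 250];
console.log(percentile(latencyMs, 50)); // 100
console.log(percentile(latencyMs, 95)); // 210
console.log(p95Below(latencyMs, 200)); // false
```

with these made-up samples the median is 100ms, but p95 is 210ms, so the step fails. a criterion phrased as “it’s faster” gives the agent nothing that would catch that.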