i let cursor agent run for 4 hours and it shipped a feature

1,200 lines of code, three dead-ends, one saved afternoon. but only after i rewrote the prompt. a real debrief on 4 hours of unattended cursor agent.

i wanted to know what happens when you stop hand-holding. so i gave cursor’s background agent a six-line spec, hit run, and walked away. four hours later it had merged a working feature, hit three dead-ends, and burned about $4.20 in tokens. here’s what i learned.

the only prompt format that consistently worked

first: paragraphs got summarized into nothing. bullet lists got partly executed. but a numbered checklist with explicit verification steps — that, the agent could follow:

# goal: add per-team rate limiting to /api/chat

1. read src/middleware/rate-limit.ts — understand the existing per-user limiter
2. extend it to accept a scope arg of "user" | "team"
3. add a migration for rate_limit_buckets.team_id
4. wire /api/chat to use scope: "team" for paid plans
5. verify by running pnpm test rate-limit — all green before opening PR
6. open PR titled "feat(rate-limit): per-team buckets"

the verification step in line 5 is the single most important line. without it the agent will ship code that compiles but fails its own tests. with it, the agent loops on the test runner until tests pass — which, in practice, is the only signal that matters.

the silent-disambiguation failure mode

the second insight was darker: the model is allergic to ambiguity, but it doesn’t tell you that.

when step 3 hit a column-naming collision in the migration, the agent silently picked a slightly different name — team_bucket_id instead of team_id — and propagated it through three files. the tests passed. the code review caught it. just barely.

i now write prompts with an explicit “if a name collides, stop and ask” line. it doesn’t always work — the model has a strong prior toward shipping over asking — but it cuts the rate of silent renames roughly in half on my repo.

the new shape of the workday

by hour three i had a working PR, two file-level review comments to write, and a strong opinion: the future isn’t “AI writes the code.” it’s “you write the spec, the AI proposes the diff, you write the review.” the bottleneck shifted from typing to reading. which, honestly? feels good.

what i’m still figuring out:

how to write specs that survive context truncation in long runs
whether to let the agent open the PR itself or stage it for me
the right cost ceiling per task before i kill the run

i’ll write each of those up. for now: yes, four hours unattended is real, and yes, it’s better than three hours of me typing — but only if the spec is honest about what “done” looks like.

cost includes 2 failed runs i killed at hour 1.
tested on cursor 0.45.x with the claude 3.7 sonnet thinking model.
full PR diff on github.