A previous build of the same app shipped 8 phases with green ticks, 61 tests passing, screenshots in every chunk report, done markers everywhere.
First launch was a black screen.
The contract had drifted three phases earlier.
The same context that introduced the drift was the one validating the work.
Self-audit can't catch this.
The degraded session that produced the bug is the same one evaluating it.
So I stopped letting one model do all three jobs.
The split
Brain holds the spec, the plan, the briefs, the verification judgment. One consistent context across the whole project.
Hands execute one chunk at a time against a tight contract. No memory between chunks. Cheap to dispatch, ruthless about scope.
Eyes approve the visuals.
That's me.
Refusing to auto-approve based on a commit message.
Shipped a Mac app in three days on this loop.
Two phases tagged clean. 123 tests green. Atmospherics visually approved on first build. Zero rebuilds.
The five artifacts
- Spec. What the thing is. Architecture, decisions, and two inventories most specs miss: a craft inventory (the prototype details that aren't in the original brief, every glow and curve) and a staleness inventory (paths and tools that drifted since prior work).
- Plan. How to build it. Every task is two to five minutes of work, lists files-to-touch, steps, tests, and the exact commit message. TDD baked in.
- Briefs. Dispatchable units of plan. One per chunk, five to eight tasks each, 60 to 95 lines.
- Prompts. Paste-ready, no fill-in-the-blanks. Continuation prompts and self-contained fresh-session prompts.
- Handoff. End-of-session continuity doc. Where things are, the next prompt to paste, decisions made, open items. The next day's session reads it like a contract.
The dispatch loop
Per chunk: paste prompt, executor runs (5 to 30 min), executor reports, planner verifies with fresh context, approve or correct.
Verification is non-negotiable. Git log, file system, test count, snapshot record-flag, launch and look. The executor sometimes claims victory prematurely. The verification step catches it.
Slice mode
Auto-compaction crashed mid-chunk twice. Pattern: chunks of 3 to 4 tasks were stable. 5+ tasks crashed about half the time near task 5 to 7.
Fix: per-prompt scope is the rate limiter, not the chunk size. Briefs stay milestone-shaped. Prompts dispatch slices. The dispatch overhead of one extra continue prompt is much smaller than recovering from a compaction crash.
Plan corrections in flight
The most valuable signal in the whole build was the executor stopping and surfacing a divergence between spec and reality. Four times in two phases. Each one was a real bug in the contract.
Token drift caught against canonical CSS. A property-wrapper pattern that didn't compile. A polling test that needed a deterministic timer. A snapshot regression preempted before it broke.
Don't paper over divergences. Treat them as signal.
The takeaway
Self-audit is a trap. The session that wrote the bug is the worst session to evaluate it. Verification has to come from a context that didn't produce the work.
Three roles, three contexts, one product. Brain. Hands. Eyes.
Full breakdown. The five artifacts, the dispatch loop, plan corrections in flight, visual gates, test discipline, anti-patterns, and the day-one setup for a new app, all live here:
//A<3