New benchmark out of Meta FAIR, Stanford, and Harvard called ProgramBench.
The setup: you get a compiled executable plus its docs. Source code stripped. Rebuild the program from scratch in any language you want. Tests check input/output behavior against the original binary.
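If you want a feel for that test contract, here's a minimal sketch of my reading of it, not the official harness: pipe the same input through the reference binary and your rebuild, then diff exit code and stdout. The binary paths and the test case are made up.

```python
# Minimal sketch of the behavioral check as I read it: run the same input
# through the reference binary and the rebuild, diff exit code and stdout.
# Binary paths and the test case below are hypothetical, not from the repo.
import subprocess

def run(binary: str, args: list[str], stdin: bytes) -> tuple[int, bytes]:
    proc = subprocess.run([binary, *args], input=stdin,
                          capture_output=True, timeout=30)
    return proc.returncode, proc.stdout

def behaviors_match(reference: str, candidate: str,
                    args: list[str], stdin: bytes) -> bool:
    # stderr is left out of the diff; it's usually too noisy to match exactly
    return run(reference, args, stdin) == run(candidate, args, stdin)

if __name__ == "__main__":
    args, stdin = ["--count"], b"hello\nworld\n"  # hypothetical test case
    print(behaviors_match("./orig", "./mine", args, stdin))
```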
200 tasks, from small CLI tools up to FFmpeg, SQLite, and the PHP interpreter.
Results across 9 models:
Zero tasks fully solved. Opus 4.7 did best, clearing the 95%-of-tests bar on just 3% of tasks. GPT 5.4, Gemini 3.1 Pro, and Haiku 4.5 never cleared it.
The interesting part is section 5. Even the model solutions that "worked" looked nothing like the human reference. Median 1,173 lines vs 3,068 in the original. Flat directories. Fewer functions, each one longer. GPT 5.4 wrote 96% of its final code in a single turn on most tasks and never modified existing files on roughly 40% of runs.
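If you want to poke at the same shape metrics on your own model output, here's a rough approximation in Python. The paper's exact counting rules aren't reproduced here, and the directory name is hypothetical.

```python
# Rough approximation of the section-5 shape metrics (total lines, function
# count, mean function length) for a Python codebase. Not the paper's exact
# methodology; the directory name is made up.
import ast
from pathlib import Path

def shape_metrics(root: str) -> dict:
    total_lines = 0
    func_lengths = []
    for path in Path(root).rglob("*.py"):
        source = path.read_text(encoding="utf-8", errors="replace")
        total_lines += len(source.splitlines())
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files the model left unparseable
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                func_lengths.append(node.end_lineno - node.lineno + 1)
    return {
        "lines": total_lines,
        "functions": len(func_lengths),
        "mean_func_len": sum(func_lengths) / max(len(func_lengths), 1),
    }

print(shape_metrics("model_solution/"))  # hypothetical path to one solution
```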
Why it matters for us:
The benchmark separates writing code from designing software. Models can produce syntax all day. They cannot yet decompose a real system into coherent modules, pick the right abstractions, or organize a codebase the way a working engineer would.
That gap is what computational orchestration points at. It is also where the durable value lives.
Try it:
Pick an easier task from the repo (the paper flags nnn, fzf, gron, and jq as more tractable). Run it against Claude or your model of choice. Watch where you and the model split. Note the design decisions you make that the model never even raises.
Post your runs, plus any attempts at a harness that gets the model closer to solving a task (a bare-bones starting point is sketched below). Wins, failures, weird outputs, all of it.
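Here's the loop I'd start from: ask the model for files, write them to disk, run the tests, feed failures back. To be clear, query_model(), write_files(), and the "programbench test" CLI call are all placeholders, not the repo's real interface; wire in your own client and the actual runner.

```python
# Bare-bones harness loop, all interfaces hypothetical: query_model() and
# the "programbench test" invocation are placeholders for your model client
# and the benchmark's real test runner.
import subprocess
from pathlib import Path

def query_model(prompt: str) -> dict[str, str]:
    """Placeholder: call your model, return {relative_path: file_contents}."""
    raise NotImplementedError

def write_files(solution_dir: str, files: dict[str, str]) -> None:
    for rel, contents in files.items():
        dest = Path(solution_dir) / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(contents)

def run_tests(task: str, solution_dir: str) -> tuple[int, str]:
    """Hypothetical test-runner call; substitute the repo's real one."""
    proc = subprocess.run(
        ["programbench", "test", task, "--solution", solution_dir],
        capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def harness(task: str, solution_dir: str, max_turns: int = 10) -> bool:
    prompt = f"Rebuild {task} from its docs alone. Return file paths mapped to contents."
    for _ in range(max_turns):
        write_files(solution_dir, query_model(prompt))
        status, log = run_tests(task, solution_dir)
        if status == 0:
            return True
        # Push the failing log back and nudge the model to edit in place,
        # since section 5 says models rarely modify existing files unprompted.
        prompt = "Tests failed. Modify your existing files.\n" + log[-4000:]
    return False
```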
I'm building something on top of this right now. More soon.