PaperBench: AIs Replicating Research, by OpenAI

Apr 4 (edited) • 💬 General

"""

We introduce PaperBench (pdf), a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments.

We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.

"""

https://github.com/openai/mle-bench

Once AIs are doing their own research and that research is used to improve them, the fast flywheel to the Singularity has started.

So far, scaffolded Claude 3.5 Sonnet (the latest version) was the best of the models tested. Gemini 2.5 Pro was not tested.
