Apr 4 (edited) • 💬 General
PaperBench: AIs Replicating Research, by OpenAI
"""
We introduce PaperBench (pdf), a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments.
.
.
.
We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.
"""
Once AIs are doing their own research, and that research is used to improve them, the fast flywheel to the Singularity has started.
Right now, scaffolded Claude 3.5 Sonnet (the latest version) was the best of the models tested. Gemini 2.5 Pro was not tested.
Anaxareian Aia