Deep research benchmark
Finally, someone is benchmarking the deep research products out there.
# A Benchmark for Deep Research Agents? 👀
**DeepResearch Bench** is a new benchmark of 100 PhD-level research tasks across 22 distinct fields to systematically assess the report generation quality and citation accuracy of deep research agents.
## Benchmark Creation:
1. Collected over 96,000 raw user queries from an actual web search chatbot.
2. Filtered to 44,019 *deep research tasks* (multiple search steps, information synthesis, and a detailed report).
3. Classified into 22 distinct topic domains (e.g., Finance, Science & Tech, Health).
4. Over 100 PhD-level domain experts then contributed to crafting the 100 final benchmark tasks.
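The curation funnel above (raw queries → deep-research filter → domain buckets) can be sketched in a few lines. This is an illustrative assumption, not the authors' actual pipeline: the field names (`search_steps`, `needs_synthesis`, `wants_report`, `domain`) and the filter heuristic are hypothetical stand-ins for whatever classifier the paper used.

```python
# Hypothetical sketch of the query-curation funnel described above.
# Field names and the filtering heuristic are illustrative assumptions,
# not the benchmark authors' actual implementation.

DOMAINS = {"finance", "science & tech", "health"}  # 3 of the 22 domains

def is_deep_research_task(query: dict) -> bool:
    """A deep-research task needs multi-step search, synthesis, and a report."""
    return (
        query["search_steps"] > 1
        and query["needs_synthesis"]
        and query["wants_report"]
    )

def curate(raw_queries: list[dict]) -> dict[str, list[dict]]:
    """Filter raw chatbot queries, then bucket the survivors by topic domain."""
    buckets: dict[str, list[dict]] = {d: [] for d in DOMAINS}
    for q in raw_queries:
        if is_deep_research_task(q) and q["domain"] in buckets:
            buckets[q["domain"]].append(q)
    return buckets
```

In the real pipeline this filter reduced ~96,000 raw queries to 44,019 candidate tasks before experts distilled the final 100.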
## Insights:
- 🥇 **Gemini-2.5-Pro Deep Research** achieved the best overall report quality (RACE score).
- 🔍 **Perplexity** had the highest citation accuracy, but not the highest report quality.
- ⚖️ A trade-off exists between generating many citations (Gemini) and ensuring their accuracy (Perplexity).
- 🏁 **RACE evaluation** demonstrated a correlation with human expert judgments.
- 🏭 **FACT** automates the measurement of factual abundance and citation accuracy.
- 🤗 The benchmark dataset and evaluation scripts are **open source**.
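The citation trade-off above falls out of two simple quantities: how many cited statements a report makes, and what fraction of them the cited sources actually support. A minimal sketch, assuming a FACT-style metric where each (statement, URL) pair gets a support verdict — the `Citation` type and the pre-computed `supported` flag are illustrative stand-ins for the paper's LLM-as-judge step:

```python
# Illustrative sketch of a FACT-style citation metric: given
# (statement, cited URL) pairs with a support verdict, report the
# fraction supported (accuracy) and the count supported (abundance).
# The support verdict is assumed pre-computed by a judge model.
from dataclasses import dataclass

@dataclass
class Citation:
    statement: str   # claim extracted from the report
    url: str         # source the report cites for it
    supported: bool  # verdict from a (stubbed) support judge

def citation_accuracy(citations: list[Citation]) -> float:
    """Fraction of cited statements actually supported by their source."""
    if not citations:
        return 0.0
    return sum(c.supported for c in citations) / len(citations)

def effective_citations(citations: list[Citation]) -> int:
    """Count of supported citations — the 'abundance' side of the trade-off."""
    return sum(c.supported for c in citations)
```

Under this framing, a Gemini-style report maximizes `effective_citations` while a Perplexity-style report maximizes `citation_accuracy`.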
Ray Merlin