Deep research benchmark
Finally, someone is benchmarking the deep research products out there.
# A Benchmark for Deep Research Agents? 👀
**DeepResearch Bench** is a new benchmark of 100 PhD-level research tasks across 22 distinct fields to systematically assess the report generation quality and citation accuracy of deep research agents.
## Benchmark Creation:
1. Collected over 96,000 raw user queries from an actual web search chatbot.
2. Filtered to 44,019 *deep research tasks* (multiple search steps, information synthesis, and a detailed report).
3. Classified into 22 distinct topic domains (e.g., Finance, Science & Tech, Health).
4. More than 100 PhD-level domain experts crafted the 100 final benchmark tasks.
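To picture what the pipeline above produces, here is a minimal sketch of a task record and a per-domain selector. The schema (field names, example prompts) is an assumption for illustration, not the benchmark's actual format:

```python
from dataclasses import dataclass

@dataclass
class ResearchTask:
    # Hypothetical schema for illustration; not DeepResearch Bench's actual format.
    task_id: int
    domain: str   # one of the 22 topic domains, e.g. "Finance"
    prompt: str   # the PhD-level research question

def by_domain(tasks: list[ResearchTask], domain: str) -> list[ResearchTask]:
    """Select tasks for per-domain evaluation."""
    return [t for t in tasks if t.domain == domain]

tasks = [
    ResearchTask(1, "Finance", "Compare long-run effects of ..."),
    ResearchTask(2, "Health", "Survey the evidence on ..."),
    ResearchTask(3, "Finance", "Assess systemic risk from ..."),
]
print(len(by_domain(tasks, "Finance")))  # → 2
```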
## Insights:
- 🥇 **Gemini-2.5-Pro Deep Research** achieved the best overall report quality (RACE score).
- 🔍 **Perplexity** had the highest citation accuracy, but not the highest report quality.
- ⚖️ A trade-off exists between generating many citations (Gemini) and ensuring their accuracy (Perplexity).
- 🏁 **RACE evaluation** correlates with human expert judgments.
- 🏭 **FACT** automates the measurement of factual abundance and citation trustworthiness.
- 🤗 The benchmark dataset and evaluation scripts are **open source**.
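The citation trade-off in the insights above can be made concrete with a FACT-style calculation. The function and the numbers below are purely illustrative assumptions, not the benchmark's actual metric or results:

```python
def fact_scores(supported: int, total: int) -> tuple[float, int]:
    """Return (citation accuracy, effective citations).

    accuracy: the fraction of statement-citation pairs where the cited
    source actually supports the claim.
    effective citations: how many supported citations the report contains,
    which rewards abundance only when the citations check out.
    """
    accuracy = supported / total if total else 0.0
    return accuracy, supported

# Illustrative numbers only: one agent cites heavily with moderate
# accuracy, another cites sparingly but precisely.
heavy_acc, heavy_eff = fact_scores(supported=90, total=120)   # accuracy 0.75
sparse_acc, sparse_eff = fact_scores(supported=40, total=44)  # accuracy ≈ 0.91
assert sparse_acc > heavy_acc and heavy_eff > sparse_eff      # the trade-off
```

Under a scheme like this, neither agent dominates: one wins on accuracy, the other on effective citation count, which mirrors the Gemini/Perplexity split reported above.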
Ray Merlin