Finally, someone is benchmarking the deep research products out there.
# A Benchmark for Deep Research Agents? 👀
**DeepResearch Bench** is a new benchmark of 100 PhD-level research tasks across 22 distinct fields, designed to systematically assess the report generation quality and citation accuracy of deep research agents.
## Benchmark Creation:
1. Collected over 96,000 raw user queries from an actual web search chatbot.
2. Filtered these down to 44,019 *deep research tasks*, i.e., queries requiring multiple search steps, information synthesis, and a detailed report.
3. Classified the tasks into 22 distinct topic domains (e.g., Finance, Science & Tech, Health); a minimal pipeline sketch follows this list.
4. Over 100 PhD-level domain experts contributed to creating the 100 final benchmark tasks.
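
For intuition, here is a minimal, hypothetical Python sketch of the filtering and classification steps above. The real benchmark relies on LLM-based judging and 22 curated domains; the `Query` class, keyword heuristics, and domain labels below are stand-in assumptions, not the authors' pipeline.

```python
# Hypothetical sketch of the filtering/classification pipeline; the real
# benchmark used LLM-based judging, these keyword heuristics are stand-ins.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Query:
    text: str
    domain: Optional[str] = None  # one of the 22 topic domains, once classified

def is_deep_research_task(q: Query) -> bool:
    """Stand-in check: does the query call for multi-step search,
    synthesis, and a long-form report?"""
    needs_synthesis = any(k in q.text.lower()
                          for k in ("compare", "analyze", "survey", "report", "trends"))
    return needs_synthesis and len(q.text.split()) > 8

def classify_domain(q: Query) -> str:
    """Stand-in topic classifier (the real pipeline assigned 22 domains)."""
    text = q.text.lower()
    if any(k in text for k in ("market", "revenue", "stock")):
        return "Finance & Business"
    if any(k in text for k in ("clinical", "disease", "drug")):
        return "Health"
    return "Science & Tech"

raw_queries = [
    Query("Compare the 2020-2024 revenue trends of the top EV makers and write a report"),
    Query("What's the weather today?"),
]
deep_tasks = [q for q in raw_queries if is_deep_research_task(q)]
for q in deep_tasks:
    q.domain = classify_domain(q)
print(f"{len(deep_tasks)} deep research task(s), domains: {[q.domain for q in deep_tasks]}")
```
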
## Insights:
- 🥇 **Gemini-2.5-Pro Deep Research** achieved the best overall report quality (RACE score).
- 🔍 **Perplexity** had the highest citation accuracy, but not the highest report quality.
- ⚖️ A trade-off exists between generating many citations (Gemini) and ensuring their accuracy (Perplexity).
- 🏁 The **RACE** evaluation framework demonstrated a strong correlation with human expert judgments of report quality.
- 🏭 **FACT** automates the measurement of factual abundance and citation accuracy (a rough scoring sketch follows this list).
- 🤗 The benchmark dataset and evaluation scripts are **open source**.
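
To make the two metrics concrete, here is a minimal sketch of what a RACE-like report score and a FACT-like citation check could look like. The criterion names, weights, and reference-relative formula are illustrative assumptions; the official prompts and scoring code live in the released scripts.

```python
# Illustrative sketch only: the formulas, criteria, and weights below are
# assumptions, not the official DeepResearch Bench scoring code.
from dataclasses import dataclass

@dataclass
class CitedStatement:
    claim: str
    url: str
    supported: bool  # whether the cited page actually backs the claim

def race_style_score(target: dict, reference: dict, weights: dict) -> float:
    """Criterion-weighted report quality, scored relative to a reference report."""
    total = sum(weights.values())
    return sum(w * target[c] / (target[c] + reference[c])
               for c, w in weights.items()) / total

def citation_accuracy(statements: list) -> float:
    """Share of cited claims that are actually supported by their sources."""
    return sum(s.supported for s in statements) / len(statements)

weights = {"comprehensiveness": 0.3, "depth": 0.3, "instruction_following": 0.2, "readability": 0.2}
target = {"comprehensiveness": 8.0, "depth": 7.0, "instruction_following": 9.0, "readability": 8.5}
reference = {"comprehensiveness": 7.5, "depth": 8.0, "instruction_following": 8.5, "readability": 8.0}
print(round(race_style_score(target, reference, weights), 3))  # ~0.501

cited = [
    CitedStatement("Claim A", "https://example.com/a", supported=True),
    CitedStatement("Claim B", "https://example.com/b", supported=False),
]
print(citation_accuracy(cited))  # 0.5
```
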