Finally, someone is benchmarking the deep research products out there.
# A Benchmark for Deep Research Agents? 👀
**DeepResearch Bench** is a new benchmark of 100 PhD-level research tasks across 22 distinct fields, designed to systematically assess the report generation quality and citation accuracy of deep research agents.
## Benchmark Creation:
1. Collected over 96,000 raw user queries from an actual web search chatbot.
2. Filtered these down to 44,019 *deep research tasks*, i.e., queries requiring multiple search steps, information synthesis, and a detailed report.
3. Classified the tasks into 22 distinct topic domains (e.g., Finance, Science & Tech, Health); a minimal pipeline sketch follows this list.
4. Over 100 PhD-level domain experts contributed to creating the 100 final benchmark tasks.
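
For intuition, here is a minimal, hypothetical Python sketch of the filtering and classification steps above. The real benchmark relies on LLM-based judging and 22 curated domains; the `Query` class, keyword heuristics, and domain labels below are stand-in assumptions, not the authors' pipeline.

```python
# Hypothetical sketch of the filtering/classification pipeline; the real
# benchmark used LLM-based judging, these keyword heuristics are stand-ins.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Query:
    text: str
    domain: Optional[str] = None  # one of the 22 topic domains, once classified

def is_deep_research_task(q: Query) -> bool:
    """Stand-in check: does the query call for multi-step search,
    synthesis, and a long-form report?"""
    needs_synthesis = any(k in q.text.lower()
                          for k in ("compare", "analyze", "survey", "report", "trends"))
    return needs_synthesis and len(q.text.split()) > 8

def classify_domain(q: Query) -> str:
    """Stand-in topic classifier (the real pipeline assigned 22 domains)."""
    text = q.text.lower()
    if any(k in text for k in ("market", "revenue", "stock")):
        return "Finance & Business"
    if any(k in text for k in ("clinical", "disease", "drug")):
        return "Health"
    return "Science & Tech"

raw_queries = [
    Query("Compare the 2020-2024 revenue trends of the top EV makers and write a report"),
    Query("What's the weather today?"),
]
deep_tasks = [q for q in raw_queries if is_deep_research_task(q)]
for q in deep_tasks:
    q.domain = classify_domain(q)
print(f"{len(deep_tasks)} deep research task(s), domains: {[q.domain for q in deep_tasks]}")
```
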
## Insights:
- 🥇 **Gemini-2.5-Pro Deep Research** achieved the best overall report quality (RACE score).
- 🔍 **Perplexity** had the highest citation accuracy, but not the highest report quality.
- ⚖️ A trade-off exists between generating many citations (Gemini) and ensuring their accuracy (Perplexity).
- 🏁 The **RACE** evaluation framework demonstrated a strong correlation with human expert judgments of report quality.
- 🏭 **FACT** automates the measurement of factual abundance and citation accuracy (a rough scoring sketch follows this list).
- 🤗 The benchmark dataset and evaluation scripts are **open source**.
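
To make the two metrics concrete, here is a minimal sketch of what a RACE-like report score and a FACT-like citation check could look like. The criterion names, weights, and reference-relative formula are illustrative assumptions; the official prompts and scoring code live in the released scripts.

```python
# Illustrative sketch only: the formulas, criteria, and weights below are
# assumptions, not the official DeepResearch Bench scoring code.
from dataclasses import dataclass

@dataclass
class CitedStatement:
    claim: str
    url: str
    supported: bool  # whether the cited page actually backs the claim

def race_style_score(target: dict, reference: dict, weights: dict) -> float:
    """Criterion-weighted report quality, scored relative to a reference report."""
    total = sum(weights.values())
    return sum(w * target[c] / (target[c] + reference[c])
               for c, w in weights.items()) / total

def citation_accuracy(statements: list) -> float:
    """Share of cited claims that are actually supported by their sources."""
    return sum(s.supported for s in statements) / len(statements)

weights = {"comprehensiveness": 0.3, "depth": 0.3, "instruction_following": 0.2, "readability": 0.2}
target = {"comprehensiveness": 8.0, "depth": 7.0, "instruction_following": 9.0, "readability": 8.5}
reference = {"comprehensiveness": 7.5, "depth": 8.0, "instruction_following": 8.5, "readability": 8.0}
print(round(race_style_score(target, reference, weights), 3))  # ~0.501

cited = [
    CitedStatement("Claim A", "https://example.com/a", supported=True),
    CitedStatement("Claim B", "https://example.com/b", supported=False),
]
print(citation_accuracy(cited))  # 0.5
```
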