Recommending a New Research Benchmark to Better Showcase Agent Capabilities #41
Description
Problem Statement
With the rapid development of autonomous research agents, systems capable of multi-step reasoning, coding, and tool use are becoming increasingly impressive.
However, there is still a lack of benchmarks that can rigorously evaluate whether these agents can truly perform end-to-end scientific research.
Most existing benchmarks focus on:
- knowledge recall,
- reasoning tasks, or
- code generation.
None of these capture the full research workflow, from understanding raw data, through analysis, to producing paper-level conclusions.
As a result, it is still unclear:
- whether current agents can genuinely reproduce scientific findings,
- how different research agents compare under a unified setting,
- and what gaps remain between current systems and real-world research capability.
Proposed Solution
We would like to suggest trying ResearchClawBench, a benchmark designed specifically to evaluate autonomous research agents.
It introduces a two-stage evaluation framework:
- Stage 1 — Autonomous Research
  The agent is given raw datasets, task instructions, and references, and must independently perform data analysis, coding, visualization, and report writing.
- Stage 2 — Paper-level Evaluation
  The generated report is compared against a real published paper using expert-designed checklists (rubrics) and an LLM-based judge.
The scoring is calibrated such that:
- ~50 corresponds to reproducing the original paper (Re-Discovery)
- higher scores indicate surpassing the original work (New Discovery)
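The calibration described above could be sketched as a simple weighted rubric aggregation. This is a minimal illustration only: the `RubricItem` structure, the two-rubric split, and the 50-point anchors are our assumptions, not ResearchClawBench's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One expert-designed checklist item (hypothetical structure)."""
    description: str
    weight: float     # relative importance of this item
    satisfied: float  # judge-assigned degree of satisfaction, 0.0 to 1.0

def score_report(rediscovery_items, discovery_items):
    """Toy calibration: fully satisfying the re-discovery rubric yields ~50;
    satisfying novel-discovery items adds points beyond 50."""
    def weighted(items):
        total = sum(i.weight for i in items)
        return sum(i.weight * i.satisfied for i in items) / total if total else 0.0
    return 50.0 * weighted(rediscovery_items) + 50.0 * weighted(discovery_items)

# A report that fully reproduces the paper but adds nothing new scores ~50.
base = [RubricItem("reproduces main result", 1.0, 1.0)]
novel = [RubricItem("goes beyond the original analysis", 1.0, 0.0)]
print(score_report(base, novel))  # 50.0
```

Under this sketch, any score above 50 necessarily comes from the discovery rubric, which matches the Re-Discovery / New Discovery distinction above.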
The benchmark includes:
- 40 tasks across 10 scientific domains
- real datasets and reproducible setups
- fine-grained evaluation grounded in expert annotations
- support for multiple agents and easy integration of custom systems
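For the custom-system integration point, a harness would presumably expect agents to implement a common interface. The adapter below is purely hypothetical: we have not inspected the repository's API, and `ResearchAgent`, `run_task`, and their signatures are illustrative names, not the benchmark's real ones.

```python
from abc import ABC, abstractmethod

class ResearchAgent(ABC):
    """Hypothetical adapter interface a benchmark harness might expect."""

    @abstractmethod
    def run_task(self, datasets: dict, instructions: str, references: list) -> str:
        """Consume raw data, task instructions, and references;
        return a full research report as text."""

class MyAgent(ResearchAgent):
    """Stub showing where a custom system would plug in."""

    def run_task(self, datasets, instructions, references):
        # A real system would perform analysis, coding, and
        # visualization here before writing the report.
        return f"Report for task: {instructions}"

agent = MyAgent()
report = agent.run_task({"train.csv": b""}, "Analyze trends in X", [])
print(report)  # Report for task: Analyze trends in X
```

The point of such an interface is that Stage 1 inputs (datasets, instructions, references) map directly onto the method's parameters, so any agent satisfying it could be evaluated under the same two-stage setting.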
We believe this setup may provide a more direct way to evaluate and demonstrate the research capabilities of agents.
If relevant, it could be interesting to see how your system performs under such a benchmark.
Links:
- https://internscience.github.io/ResearchClawBench-Home/
- https://github.com/InternScience/ResearchClawBench
ResearchClawBench.mp4
Alternatives Considered
No response
Feature Area
AI / Chat / Agent
Additional Context
No response