
Recommending a New Research Benchmark to Better Showcase Agent Capabilities #41

@black-yt

Problem Statement

With the rapid development of autonomous research agents, systems capable of multi-step reasoning, coding, and tool use are becoming increasingly impressive.

However, there is still a lack of benchmarks that can rigorously evaluate whether these agents can truly perform end-to-end scientific research.

Most existing benchmarks focus on:

  • knowledge recall,
  • reasoning tasks,
  • or code generation,

but they do not capture the full research workflow, from understanding raw data, through analysis, to producing paper-level conclusions.

As a result, it is still unclear:

  • whether current agents can genuinely reproduce scientific findings,
  • how different research agents compare under a unified setting,
  • and what gaps remain between current systems and real-world research capability.

Proposed Solution

We would like to suggest trying ResearchClawBench, a benchmark specifically designed for evaluating autonomous research agents.

It introduces a two-stage evaluation framework:

  • Stage 1 — Autonomous Research
    The agent is given raw datasets, task instructions, and references, and must independently perform data analysis, coding, visualization, and report writing.

  • Stage 2 — Paper-level Evaluation
    The generated report is compared against a real published paper using expert-designed checklists (rubrics) and an LLM-based judge.

The scoring is calibrated such that:

  • ~50 corresponds to reproducing the original paper (Re-Discovery)
  • higher scores indicate surpassing the original work (New Discovery)
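
To make the calibration above concrete, here is a minimal sketch of how a checklist-plus-LLM-judge scorer with this 0-100 scale could be wired up. All names (`RubricItem`, `score_report`, the `judge` callable) and the linear weighting are our own illustrative assumptions, not ResearchClawBench's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    question: str                 # expert-written yes/no check, e.g. "Does the report recover finding X?"
    weight: float = 1.0
    new_discovery: bool = False   # True if satisfying it means going beyond the original paper

def score_report(report: str,
                 rubric: List[RubricItem],
                 judge: Callable[[str, str], bool]) -> float:
    """Combine per-item judge verdicts into the calibrated 0-100 score:
    satisfying all re-discovery items gives ~50, new-discovery items add points above that."""
    redisc = [r for r in rubric if not r.new_discovery]
    newdisc = [r for r in rubric if r.new_discovery]

    def band(items: List[RubricItem]) -> float:
        total = sum(r.weight for r in items)
        if total == 0:
            return 0.0
        hit = sum(r.weight for r in items if judge(report, r.question))
        return 50.0 * hit / total

    # 0..50 for reproducing the paper (Re-Discovery), plus 0..50 for surpassing it (New Discovery).
    return band(redisc) + band(newdisc)

if __name__ == "__main__":
    # Stand-in judge: in the real benchmark this would be an LLM call that reads the
    # generated report and answers the checklist question; here it always says "yes".
    always_yes = lambda report, question: True
    rubric = [
        RubricItem("Does the report reproduce the paper's main quantitative result?"),
        RubricItem("Does the report include the key ablation analysis?"),
        RubricItem("Does the report identify a finding absent from the paper?", new_discovery=True),
    ]
    print(score_report("...generated report text...", rubric, always_yes))  # -> 100.0
```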

The benchmark includes:

  • 40 tasks across 10 scientific domains
  • real datasets and reproducible setups
  • fine-grained evaluation grounded in expert annotations
  • support for multiple agents and easy integration of custom systems
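
Regarding the last bullet (integration of custom systems), the issue does not specify the actual interface, so the following is only a hypothetical sketch of the thin adapter a benchmark harness of this kind typically expects; `ResearchAgent`, `run_task`, and the file layout are invented for illustration:

```python
from abc import ABC, abstractmethod
from pathlib import Path

class ResearchAgent(ABC):
    """Hypothetical adapter a custom agent stack could implement to be driven by the harness."""

    @abstractmethod
    def run_task(self, data_dir: Path, instructions: str,
                 references: list[str], out_dir: Path) -> Path:
        """Given raw data, task instructions, and reference material, run the full
        Stage-1 workflow (analysis, coding, visualization) and return the path to
        the written report inside out_dir."""

class MyAgent(ResearchAgent):
    def run_task(self, data_dir, instructions, references, out_dir):
        # Call into your own agent stack here; this stub just writes a placeholder report.
        report = out_dir / "report.md"
        report.write_text(f"# Report\n\nInstructions received:\n{instructions}\n")
        return report
```

Under this assumption, the harness would call `run_task` once per benchmark task and pass the returned report to the Stage-2 rubric judge.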

We believe this setup may provide a more direct way to evaluate and demonstrate the research capabilities of agents.
If relevant, it could be interesting to see how your system performs on such a benchmark.

Links:

ResearchClawBench.mp4

Alternatives Considered

No response

Feature Area

AI / Chat / Agent

Additional Context

No response
