WebCode Benchmark

Search evals for coding agents. See the accompanying blog post for background.

Evals

| Eval | What it measures | Runner |
| --- | --- | --- |
| Contents | Extraction fidelity against golden markdown (250 URLs) | `python -m evals.contents` |
| Highlights | In-document retrieval: given a URL + query, surface the relevant section (250 queries) | `python -m evals.highlights` |
| RAG | Full-web retrieval + synthesis on long-context code docs (307 queries) | `python -m evals.rag` |
| E2E | Sandboxed coding tasks requiring web search (33 tasks, dataset only) | `python -m evals.e2e --info` |

Quick Start

cd webcode-benchmark
uv sync

export EXA_API_KEY="..."
export OPENAI_API_KEY="..."

python -m evals.contents --searchers exa tavily parallel --limit 20
python -m evals.highlights --searchers exa tavily parallel --limit 20
python -m evals.rag --searchers exa brave perplexity --limit 20
python -m evals.e2e --info

Searchers

| Provider | Contents | Highlights | RAG |
| --- | --- | --- | --- |
| Exa | ✓ | ✓ | ✓ |
| Tavily | ✓ | ✓ | ✓ |
| Parallel | ✓ | ✓ | ✓ |
| Claude | ✓ | ✓ | — |
| Brave | — | — | ✓ |
| Perplexity | — | — | ✓ |
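
The support matrix above can also be written down as data, which is handy for checking a `--searchers` list before starting a run. The following is a minimal sketch, not part of the repository: `SUPPORT` and `unsupported` are hypothetical names, and the lowercase `claude` identifier is an assumption (only `exa`, `tavily`, `parallel`, `brave`, and `perplexity` appear in the Quick Start commands).

```python
# Hypothetical encoding of the Searchers table: provider -> evals it supports.
# The "claude" spelling is assumed; the other names match the Quick Start commands.
SUPPORT = {
    "exa": {"contents", "highlights", "rag"},
    "tavily": {"contents", "highlights", "rag"},
    "parallel": {"contents", "highlights", "rag"},
    "claude": {"contents", "highlights"},
    "brave": {"rag"},
    "perplexity": {"rag"},
}


def unsupported(eval_name, searchers):
    """Return the searchers from the list that the table marks as unavailable for eval_name."""
    return [name for name in searchers if eval_name not in SUPPORT.get(name.lower(), set())]


if __name__ == "__main__":
    # Brave is listed for RAG only, so it is flagged for the contents eval.
    print(unsupported("contents", ["exa", "brave"]))
```

Unknown provider names are treated as unsupported, so a typo is reported rather than silently accepted.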

Datasets

All datasets are JSONL files in data/:

| Dataset | Rows | Schema |
| --- | --- | --- |
| contents | 250 | `{id, url, title, tags}` |
| highlights | 250 | `{id, query, expected_answer, citation_url, citation_excerpt}` |
| rag | 307 | `{id, query, expected_answer, source_url, citation_excerpt}` |
| e2e | 33 | `{id, slug, repo, repo_url, release_tag, task_description, test_patch, metadata}` |

Note: Some URLs have been excluded from the contents and highlights datasets due to licensing restrictions.
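
As a quick sanity check on a downloaded dataset, the schemas listed above can be compared against the rows actually present in a file. This is a minimal sketch under stated assumptions: the field names are copied from the Schema column, while `SCHEMAS`, `parse_jsonl`, `missing_fields`, and `check_file` are hypothetical helpers, and the exact file names under `data/` are not stated here, so the path is left to the caller.

```python
import json
from pathlib import Path

# Field names copied from the Schema column of the Datasets table.
SCHEMAS = {
    "contents": {"id", "url", "title", "tags"},
    "highlights": {"id", "query", "expected_answer", "citation_url", "citation_excerpt"},
    "rag": {"id", "query", "expected_answer", "source_url", "citation_excerpt"},
    "e2e": {"id", "slug", "repo", "repo_url", "release_tag",
            "task_description", "test_patch", "metadata"},
}


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def missing_fields(rows, dataset):
    """Map each incomplete row (by id, or by position if it has no id) to its missing keys."""
    expected = SCHEMAS[dataset]
    problems = {}
    for index, row in enumerate(rows):
        missing = expected - set(row)
        if missing:
            problems[row.get("id", f"row {index}")] = sorted(missing)
    return problems


def check_file(path, dataset):
    """Read a JSONL file (path supplied by the caller) and report rows missing schema fields."""
    with Path(path).open(encoding="utf-8") as handle:
        return missing_fields(parse_jsonl(handle), dataset)
```

An empty dict from `check_file` means every row carries all the fields listed for that dataset.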

Golden markdown (contents eval)

The contents dataset contains URLs only; the golden markdown is not included for licensing reasons. To run the contents eval, you need to generate data/contents/golden_markdown.jsonl yourself. Each row should have the shape {id, expected_markdown}.

We built the golden references using the following pipeline:

1. Render each URL in a cloud browser (e.g. Browserbase) with full JS execution, lazy loading, and dynamic rendering
2. Capture full-page screenshots and extract the HTML DOM
3. Feed screenshots + DOM into a multimodal language model to produce markdown faithful to the rendered page

See the blog post for more details on this approach.
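
Whatever pipeline produces the markdown, the final step is writing rows in the `{id, expected_markdown}` shape to `data/contents/golden_markdown.jsonl`. The sketch below covers only that step and assumes the markdown has already been generated; `golden_rows`, `dumps_jsonl`, `uncovered_ids`, and `write_golden` are hypothetical helper names, and the ids are assumed to match the `id` field of the contents dataset.

```python
import json
from pathlib import Path


def golden_rows(markdown_by_id):
    """Turn {id: markdown} into rows with the {id, expected_markdown} shape."""
    return [{"id": row_id, "expected_markdown": markdown}
            for row_id, markdown in markdown_by_id.items()]


def dumps_jsonl(rows):
    """Serialize rows as JSONL: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def uncovered_ids(content_ids, rows):
    """Return ids from the contents dataset that have no golden markdown row yet."""
    covered = {row["id"] for row in rows}
    return [content_id for content_id in content_ids if content_id not in covered]


def write_golden(markdown_by_id, path="data/contents/golden_markdown.jsonl"):
    """Write the golden references to the path named in this README."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(dumps_jsonl(golden_rows(markdown_by_id)), encoding="utf-8")
```

Running `uncovered_ids` against the ids in the contents dataset before an eval shows which URLs still lack a golden reference.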

Output

Pass --output results.json to save per-query scores. Results are structured as:

{
  "exa": [{"id": "contents_001", "completeness": 0.9, "accuracy": 0.95, ...}],
  "tavily": [...]
}
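
Because the saved results map each provider to a list of per-query score dicts, a per-provider average takes only a few lines. This is a minimal sketch, not the repository's own reporting code: `summarize` is a hypothetical helper, and `completeness` and `accuracy` are the only metric names taken from the example above, so other evals may need a different `metrics` argument.

```python
import json
from statistics import mean


def summarize(results, metrics=("completeness", "accuracy")):
    """Average each metric per provider, skipping rows that lack the metric."""
    summary = {}
    for provider, rows in results.items():
        summary[provider] = {}
        for metric in metrics:
            values = [row[metric] for row in rows if metric in row]
            if values:  # a provider with no scored rows gets no entry for this metric
                summary[provider][metric] = mean(values)
    return summary


def summarize_file(path="results.json"):
    """Load a file written with --output and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize(json.load(handle))
```

Calling `summarize_file("results.json")` after a run returns a dict such as `{"exa": {"completeness": ..., "accuracy": ...}, ...}`, with one entry per provider in the file.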
