Curator Evals

A library for evaluating LLMs and reward models on post-training data quality, starting with code correctness, using the Curator Evals Bench dataset.


🎉 What's New

  • [2025.10.20] Added Coherence evaluation metric using preference_ranking_agreement, with new input/output formats (coherence_llm_judge, collinear_llama3_judge).
  • [2025.10.16] Introduced the preference_ranking_agreement() function, a new metric for evaluating alignment between model-generated preference scores and human-annotated rankings.
  • [2025.10.02] Added Math Correctness evaluation metric with support for accuracy, precision, recall, F1, and new prompt options (llama_math_correctness_prompt, phi_math_correctness_prompt).
  • [2025.09.04] Added support for Together.ai hosted models with asynchronous generation and built-in rate limiting for efficient concurrent requests.
  • [2025.08.23] Improved OpenAI integration with asynchronous generation, concurrent request handling, and reasoning support for GPT-5 and o-series models.
  • [2025.08.19] Added vLLM integration with chat template support, asynchronous generation, and concurrent request handling for efficient completions.

Features

  • Task-Specific Evaluations – Evaluate models on code correctness, math correctness, and coherence tasks using the Curator Evals Bench dataset.
  • Flexible Model Support – Works with LLMs served via vLLM, Hugging Face, Together.ai, and OpenAI.
  • Detailed Metrics – Provides accuracy, coherence scores, complexity ratings, and component breakdowns.
  • Command-Line and Python API – Run quick CLI commands or integrate programmatically into your workflow.

Setup

conda create -n collinear python=3.11 -y
conda activate collinear
git clone https://github.com/collinear-ai/curator-evals.git
cd curator-evals
pip install uv
uv pip install -e .

Basic Example

Run a vLLM server in one terminal.

python -u \
    -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2.5-Coder-3B-Instruct
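
Before starting the benchmark, you can confirm the server is up by querying its OpenAI-compatible /v1/models endpoint. A minimal sketch using the requests package, with the host and port matching the server command above:

import requests

# Query the vLLM server's OpenAI-compatible model listing.
resp = requests.get("http://localhost:8000/v1/models", timeout=5)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
# The list should include "Qwen/Qwen2.5-Coder-3B-Instruct" once the server is ready.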

Start the code correctness benchmark in a second terminal.

curator-evals --task code_correctness \
  --model Qwen/Qwen2.5-Coder-3B-Instruct \
  --model-type llm \
  --use-server \
  --server-url http://localhost:8000 \
  --input-format code_correctness_prompt \
  --output-format collinear_code_qwen_judge

You can find more examples in the configs folder.

Code Correctness Leaderboard

Evaluated using --task code_correctness with --input-format code_correctness_prompt and --output-format collinear_code_qwen_judge.

Rank  Model                      Accuracy (%)
1     Qwen2.5-Coder-7B-Instruct  76.88
2     Seed-Coder-8B-Instruct     71.27
3     gpt-4o                     63.74
4     DeepSeek-R1-0528-Qwen3-8B  63.67
5     Qwen3-8B                   60.59
6     Qwen2.5-Coder-3B-Instruct  46.77

Math Correctness Leaderboard

Evaluated using --task math_correctness with --input-format phi_math_correctness_prompt and --output-format collinear_phi_judge.

Rank  Model                      Accuracy (%)  Precision  Recall  F1
1     Qwen3-8B                   93.95         0.968      0.970   0.969
2     Qwen2.5-Coder-7B-Instruct  93.90         0.969      0.968   0.968
3     gemma-3-12b-it             93.75         0.968      0.967   0.968
4     Seed-Coder-8B-Instruct     87.20         0.967      0.898   0.931
5     Qwen2.5-Coder-3B-Instruct  86.30         0.966      0.889   0.926
6     DeepSeek-R1-0528-Qwen3-8B  76.00         0.967      0.779   0.863
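
For context on the columns above, precision, recall, and F1 treat "correct" as the positive class of a binary judgment. The toy labels below are illustrative only, not benchmark data:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = solution judged correct, 0 = judged incorrect.
ground_truth   = [1, 1, 0, 1, 0, 1]  # human-annotated correctness labels
judge_verdicts = [1, 1, 1, 1, 0, 0]  # model-as-judge predictions

print(accuracy_score(ground_truth, judge_verdicts))   # 4/6 ≈ 0.667
print(precision_score(ground_truth, judge_verdicts))  # 3/4 = 0.75
print(recall_score(ground_truth, judge_verdicts))     # 3/4 = 0.75
print(f1_score(ground_truth, judge_verdicts))         # 0.75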

Coherence Evaluation Leaderboard

Evaluated using --task coherence with --input-format coherence_llm_judge and --output-format collinear_llama3_judge.

Rank  Model                      Preference Ranking Agreement
1     gemma-3-12b-it             0.8189
2     Qwen2.5-Coder-3B-Instruct  0.8052
3     Qwen2.5-Coder-7B-Instruct  0.7866
4     DeepSeek-R1-0528-Qwen3-8B  0.7754
5     Qwen3-8B                   0.7593
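
The Preference Ranking Agreement values above measure how often a judge's preference scores order candidate responses the same way human annotators do. The sketch below shows one common way such pairwise agreement can be computed; it is an illustration, not the library's preference_ranking_agreement implementation:

from itertools import combinations

def pairwise_agreement(model_scores, human_ranks):
    """Fraction of response pairs ordered the same way by model scores and human ranks.

    Illustrative only. Assumes higher score = preferred by the model and
    lower rank = preferred by the human annotator; ties are not handled.
    """
    pairs = list(combinations(range(len(model_scores)), 2))
    agree = sum(
        (model_scores[i] > model_scores[j]) == (human_ranks[i] < human_ranks[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Toy example: three candidate responses to one prompt.
print(pairwise_agreement(model_scores=[0.9, 0.4, 0.7], human_ranks=[1, 3, 2]))  # 1.0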

Benchmarking Details

The evaluation dataset is hosted on the Hugging Face Hub at collinear-ai/curator_evals_bench. Each task corresponds to a subset of the dataset, with separate splits for the different source datasets.
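
The benchmark data can be pulled directly with the datasets library. The sketch below assumes a per-task config name (code_correctness here), which may differ from the actual subset names on the Hub:

from datasets import load_dataset

# Load one task subset of the benchmark. The config name "code_correctness" is an
# assumption -- check the dataset card on the Hub for the exact subset names.
bench = load_dataset("collinear-ai/curator_evals_bench", "code_correctness")
print(bench)  # shows the available splits (one per source dataset)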

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you find Curator Evals useful, please cite us:

@misc{curator-evals,
  author       = {Mackey, Tsach and Shafique, Muhammad Ali and Kumar, Anand},
  title        = {Curator Evals: A Benchmark for High-quality Post-training Data Curation},
  year         = {2025},
  month        = {Sep},
  howpublished = {\url{https://github.com/collinear-ai/curator-evals}}
}
