Curator Evals

A library for evaluating LLMs and reward models on post-training data quality, starting with code correctness, using the Curator Evals Bench dataset.


🎉 What's New

  • [2025.10.20] Added Coherence evaluation metric using preference_ranking_agreement, with new input/output formats (coherence_llm_judge, collinear_llama3_judge).
  • [2025.10.16] Introduced the preference_ranking_agreement() function, a new metric for evaluating alignment between model-generated preference scores and human-annotated rankings.
  • [2025.10.02] Added Math Correctness evaluation metric with support for accuracy, precision, recall, F1, and new prompt options (llama_math_correctness_prompt, phi_math_correctness_prompt).
  • [2025.09.04] Added support for Together.ai hosted models with asynchronous generation and built-in rate limiting for efficient concurrent requests.
  • [2025.08.23] Improved OpenAI integration with asynchronous generation, concurrent request handling, and reasoning support for GPT-5 and o-series models.
  • [2025.08.19] Added vLLM integration with chat template support, asynchronous generation, and concurrent request handling for efficient completions.

Features

  • Task-Specific Evaluations – Evaluate models on code correctness, math correctness, and coherence tasks using the Curator Evals Bench dataset.
  • Flexible Model Support – Works with LLMs served via vLLM, Hugging Face, Together.ai, and OpenAI.
  • Detailed Metrics – Provides accuracy, coherence scores, complexity ratings, and component breakdowns.
  • Command-Line and Python API – Run quick CLI commands or integrate programmatically into your workflow.

Setup

conda create -n collinear python=3.11 -y
conda activate collinear
git clone https://github.com/collinear-ai/curator-evals.git
cd curator-evals
pip install uv
uv pip install -e .

Basic Example

Run a vLLM server in one terminal.

python -u \
    -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2.5-Coder-3B-Instruct
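
Before starting the benchmark, you can confirm the server is up by querying its OpenAI-compatible /v1/models endpoint. A minimal sketch using the requests package, with the host and port matching the server command above:

import requests

# Query the vLLM server's OpenAI-compatible model listing.
resp = requests.get("http://localhost:8000/v1/models", timeout=5)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
# The list should include "Qwen/Qwen2.5-Coder-3B-Instruct" once the server is ready.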

Start the code correctness benchmark in a second terminal.

curator-evals --task code_correctness \
  --model Qwen/Qwen2.5-Coder-3B-Instruct \
  --model-type llm \
  --use-server \
  --server-url http://localhost:8000 \
  --input-format code_correctness_prompt \
  --output-format collinear_code_qwen_judge

You can find more examples in the configs folder.

Code Correctness Leaderboard

Evaluated using --task code_correctness with --input-format code_correctness_prompt and --output-format collinear_code_qwen_judge.

Rank  Model                      Accuracy (%)
1     Qwen2.5-Coder-7B-Instruct  76.88
2     Seed-Coder-8B-Instruct     71.27
3     gpt-4o                     63.74
4     DeepSeek-R1-0528-Qwen3-8B  63.67
5     Qwen3-8B                   60.59
6     Qwen2.5-Coder-3B-Instruct  46.77

Math Correctness Leaderboard

Evaluated using --task math_correctness with --input-format phi_math_correctness_prompt and --output-format collinear_phi_judge.

Rank  Model                      Accuracy (%)  Precision  Recall  F1
1     Qwen3-8B                   93.95         0.968      0.970   0.969
2     Qwen2.5-Coder-7B-Instruct  93.90         0.969      0.968   0.968
3     gemma-3-12b-it             93.75         0.968      0.967   0.968
4     Seed-Coder-8B-Instruct     87.20         0.967      0.898   0.931
5     Qwen2.5-Coder-3B-Instruct  86.30         0.966      0.889   0.926
6     DeepSeek-R1-0528-Qwen3-8B  76.00         0.967      0.779   0.863
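
For context on the columns above, precision, recall, and F1 treat "correct" as the positive class of a binary judgment. The toy labels below are illustrative only, not benchmark data:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = solution judged correct, 0 = judged incorrect.
ground_truth   = [1, 1, 0, 1, 0, 1]  # human-annotated correctness labels
judge_verdicts = [1, 1, 1, 1, 0, 0]  # model-as-judge predictions

print(accuracy_score(ground_truth, judge_verdicts))   # 4/6 ≈ 0.667
print(precision_score(ground_truth, judge_verdicts))  # 3/4 = 0.75
print(recall_score(ground_truth, judge_verdicts))     # 3/4 = 0.75
print(f1_score(ground_truth, judge_verdicts))         # 0.75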

Coherence Evaluation Leaderboard

Evaluated using --task coherence with --input-format coherence_llm_judge and --output-format collinear_llama3_judge.

Rank  Model                      Preference Ranking Agreement
1     gemma-3-12b-it             0.8189
2     Qwen2.5-Coder-3B-Instruct  0.8052
3     Qwen2.5-Coder-7B-Instruct  0.7866
4     DeepSeek-R1-0528-Qwen3-8B  0.7754
5     Qwen3-8B                   0.7593
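
The Preference Ranking Agreement values above measure how often a judge's preference scores order candidate responses the same way human annotators do. The sketch below shows one common way such pairwise agreement can be computed; it is an illustration, not the library's preference_ranking_agreement implementation:

from itertools import combinations

def pairwise_agreement(model_scores, human_ranks):
    """Fraction of response pairs ordered the same way by model scores and human ranks.

    Illustrative only. Assumes higher score = preferred by the model and
    lower rank = preferred by the human annotator; ties are not handled.
    """
    pairs = list(combinations(range(len(model_scores)), 2))
    agree = sum(
        (model_scores[i] > model_scores[j]) == (human_ranks[i] < human_ranks[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Toy example: three candidate responses to one prompt.
print(pairwise_agreement(model_scores=[0.9, 0.4, 0.7], human_ranks=[1, 3, 2]))  # 1.0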

Benchmarking Details

The evaluation dataset is hosted on the Hugging Face Hub at collinear-ai/curator_evals_bench. Each task corresponds to a subset of the dataset, with separate splits for the different source datasets.
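
The benchmark data can be pulled directly with the datasets library. The sketch below assumes a per-task config name (code_correctness here), which may differ from the actual subset names on the Hub:

from datasets import load_dataset

# Load one task subset of the benchmark. The config name "code_correctness" is an
# assumption -- check the dataset card on the Hub for the exact subset names.
bench = load_dataset("collinear-ai/curator_evals_bench", "code_correctness")
print(bench)  # shows the available splits (one per source dataset)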

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you find Curator Evals useful, please cite us:

@misc{curator-evals,
  author       = {Mackey, Tsach and Shafique, Muhammad Ali and Kumar, Anand},
  title        = {Curator Evals: A Benchmark for High-quality Post-training Data Curation},
  year         = {2025},
  month        = {Sep},
  howpublished = {\url{https://github.com/collinear-ai/curator-evals}}
}
