A library for evaluating language models on code correctness, math correctness, and coherence tasks using the Curator Eval Bench dataset.
- [2025.10.20] Added the Coherence evaluation metric using `preference_ranking_agreement`, with new input/output formats (`coherence_llm_judge`, `collinear_llama3_judge`).
- [2025.10.16] Introduced the `preference_ranking_agreement()` function, a new metric for evaluating alignment between model-generated preference scores and human-annotated rankings; a rough sketch follows this list.
- [2025.10.02] Added the Math Correctness evaluation metric with support for accuracy, precision, recall, and F1, plus new prompt options (`llama_math_correctness_prompt`, `phi_math_correctness_prompt`).
- [2025.09.04] Added support for Together.ai-hosted models with asynchronous generation and built-in rate limiting for efficient concurrent requests.
- [2025.08.23] Improved OpenAI integration with asynchronous generation, concurrent request handling, and reasoning support for GPT-5 and o-series models.
- [2025.08.19] Added vLLM integration with chat template support, asynchronous generation, and concurrent request handling for efficient completions.
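The exact `preference_ranking_agreement()` implementation lives in the repo; as an illustrative sketch only (not the library's code), a pairwise agreement metric of this kind can be defined as the fraction of non-tied item pairs on which the model's scores and the human ranking order the items the same way:

```python
# Illustrative sketch of a pairwise preference-ranking-agreement metric.
# NOT the library's actual preference_ranking_agreement() implementation.
from itertools import combinations

def preference_ranking_agreement_sketch(model_scores, human_ranks):
    """Fraction of item pairs where model scores and human ranks agree.

    model_scores: list of floats, higher = preferred by the model.
    human_ranks:  list of ints, lower = preferred by the annotators.
    """
    agree, total = 0, 0
    for i, j in combinations(range(len(model_scores)), 2):
        if human_ranks[i] == human_ranks[j]:
            continue  # skip pairs tied in the human ranking
        total += 1
        human_prefers_i = human_ranks[i] < human_ranks[j]
        model_prefers_i = model_scores[i] > model_scores[j]
        if human_prefers_i == model_prefers_i:
            agree += 1
    return agree / total if total else 0.0
```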
- Task-Specific Evaluations – Evaluate models on code correctness, math correctness, and coherence tasks using the Curator Eval Bench dataset.
- Flexible Model Support – Works with LLMs hosted on Hugging Face, Together.ai, and OpenAI, including locally served vLLM models.
- Detailed Metrics – Provides accuracy, coherence scores, complexity ratings, and component breakdowns.
- Command-Line and Python API – Run quick CLI commands or drive runs programmatically from your workflow (see the Python sketch after the quick start below).
```bash
conda create -n collinear python=3.11 -y
conda activate collinear
git clone https://github.com/collinear-ai/curator-evals.git
cd curator-evals
pip install uv
uv pip install -e .
```

Run the vLLM server in one terminal:
```bash
python -u \
    -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2.5-Coder-3B-Instruct
```
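Before launching the benchmark, you can optionally confirm the server is reachable. This assumes the default host and port from the command above; the `/v1/models` route is part of vLLM's OpenAI-compatible API:

```python
# Optional sanity check: list the models served by the vLLM server
# started above (assumes port 8000 from that command).
import json
from urllib.request import urlopen

with urlopen("http://localhost:8000/v1/models") as resp:
    payload = json.load(resp)
print([m["id"] for m in payload["data"]])  # should include the served model
```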
Start the code correctness benchmark in a second terminal:

```bash
curator-evals --task code_correctness \
    --model Qwen/Qwen2.5-Coder-3B-Instruct \
    --model-type llm \
    --use-server \
    --server-url http://localhost:8000 \
    --input-format code_correctness_prompt \
    --output-format collinear_code_qwen_judge
```

You can find more examples in the configs folder.
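The same run can also be scripted from Python. Curator Evals' native Python entry points are not documented here, so this sketch simply shells out to the CLI using the flags from the quick start above:

```python
# Illustrative only: drives the curator-evals CLI via subprocess; the
# library's native Python API may offer a more direct route.
import subprocess

def run_eval(task: str, model: str, server_url: str,
             input_format: str, output_format: str) -> None:
    """Run one evaluation through the curator-evals CLI."""
    subprocess.run(
        [
            "curator-evals",
            "--task", task,
            "--model", model,
            "--model-type", "llm",
            "--use-server",
            "--server-url", server_url,
            "--input-format", input_format,
            "--output-format", output_format,
        ],
        check=True,  # raise CalledProcessError on a non-zero exit
    )

run_eval(
    task="code_correctness",
    model="Qwen/Qwen2.5-Coder-3B-Instruct",
    server_url="http://localhost:8000",
    input_format="code_correctness_prompt",
    output_format="collinear_code_qwen_judge",
)
```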
Evaluated using `--task code_correctness` with `--input-format code_correctness_prompt` and `--output-format collinear_code_qwen_judge`.
| Rank | Model | Accuracy (%) |
|---|---|---|
| 1 | Qwen2.5-Coder-7B-Instruct | 76.88 |
| 2 | Seed-Coder-8B-Instruct | 71.27 |
| 3 | gpt-4o | 63.74 |
| 4 | DeepSeek-R1-0528-Qwen3-8B | 63.67 |
| 5 | Qwen3-8B | 60.59 |
| 6 | Qwen2.5-Coder-3B-Instruct | 46.77 |
Evaluated using `--task math_correctness` with `--input-format phi_math_correctness_prompt` and `--output-format collinear_phi_judge`.
| Rank | Model | Accuracy (%) | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | Qwen3-8B | 93.95 | 0.968 | 0.970 | 0.969 |
| 2 | Qwen2.5-Coder-7B-Instruct | 93.90 | 0.969 | 0.968 | 0.968 |
| 3 | gemma-3-12b-it | 93.75 | 0.968 | 0.967 | 0.968 |
| 4 | Seed-Coder-8B-Instruct | 87.20 | 0.967 | 0.898 | 0.931 |
| 5 | Qwen2.5-Coder-3B-Instruct | 86.30 | 0.966 | 0.889 | 0.926 |
| 6 | DeepSeek-R1-0528-Qwen3-8B | 76.00 | 0.967 | 0.779 | 0.863 |
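For reference, the Precision, Recall, and F1 columns above are the standard binary-classification quantities computed over correct/incorrect verdicts. A minimal illustrative sketch (not the library's code):

```python
# Standard binary precision/recall/F1, as reported in the table above.
# Illustrative only; preds and golds are 0/1 correctness labels.
def binary_prf(preds: list[int], golds: list[int]) -> tuple[float, float, float]:
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, golds))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, golds))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```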
Evaluated using `--task coherence` with `--input-format coherence_llm_judge` and `--output-format collinear_llama3_judge`.
| Rank | Model | Preference Ranking Agreement |
|---|---|---|
| 1 | gemma-3-12b-it | 0.8189 |
| 2 | Qwen2.5-Coder-3B-Instruct | 0.8052 |
| 3 | Qwen2.5-Coder-7B-Instruct | 0.7866 |
| 4 | DeepSeek-R1-0528-Qwen3-8B | 0.7754 |
| 5 | Qwen3-8B | 0.7593 |
The evaluation dataset is hosted on the Hugging Face Hub at `collinear-ai/curator_evals_bench`. Each task is a subset of the dataset, with separate splits for the different dataset sources.
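The dataset can be browsed directly with the `datasets` library. The subset name below is an assumption (mirroring the `--task` values above); check the dataset card for the exact configuration names:

```python
# Hypothetical subset name: assumed to match the --task values above.
from datasets import load_dataset

bench = load_dataset("collinear-ai/curator_evals_bench", "code_correctness")
print(bench)  # shows the splits available for this task
```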
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
If you find Curator Evals useful, do not forget to cite us!
```bibtex
@misc{curator-evals,
  author = {Mackey, Tsach and Shafique, Muhammad Ali and Kumar, Anand},
  title = {Curator Evals: A Benchmark for High-quality Post-training Data Curation},
  year = {2025},
  month = {Sep},
  howpublished = {\url{https://github.com/collinear-ai/curator-evals}}
}
```