OpenJury makes it easy to benchmark language models against each other while giving you complete control over the evaluation process. Whether you're comparing proprietary models or testing your own fine-tuned creations, OpenJury lets you choose your judge.
🎯 Flexible Benchmarking – Evaluate models on Alpaca-Eval, Arena-Hard, m-Arena-Hard and others
🔄 Swappable Judges – Switch between self-hosted judges (vLLM) and remote judges (OpenAI, Together AI, OpenRouter)
🌍 Multilingual Support – Test models across multiple languages with m-Arena-Hard
🛠️ Provider Agnostic – Works with any model available in LangChain
Here is how OpenJury compares to other evaluation libraries:
| Framework | MT-Bench | AlpacaEval | Arena-Hard | m-Arena-Hard | Tuned judge configuration | vLLM judge support |
|---|---|---|---|---|---|---|
| FastChat | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AlpacaEval | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Arena-Hard-Auto | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Lighteval | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Evalchemy | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| OpenJury | 🔜 | ✅ | ✅ | ✅ | ✅ | ✅ |
This comparison was compiled in October 2025. If some libraries have since implemented missing features, please open an issue or send a PR and we will be happy to update the table.
```bash
git clone https://github.com/OpenEuroLLM/OpenJury
cd OpenJury
uv sync
uv sync --extra vllm  # Optional: install vLLM support
```

Compare two models head-to-head:
```bash
python openjury/generate_and_evaluate.py \
    --dataset alpaca-eval \
    --model_A gpt4_1106_preview \
    --model_B VLLM/utter-project/EuroLLM-9B \
    --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
    --n_instructions 10
```

What happens here?
- Uses the completions already available for `gpt4_1106_preview` in the Alpaca-Eval dataset
- Generates completions for `model_B` with vLLM, if they are not already cached
- Compares the two models using `deepseek-chat-v3.1`, which is the cheapest option available on OpenRouter
It will then display the results of the battles:
```
============================================================
🏆 MODEL BATTLE RESULTS 🏆
📊 Dataset: alpaca-eval
🤖 Competitors: Model A: gpt4_1106_preview vs Model B: VLLM/utter-project/EuroLLM-9B
⚖️ Judge: OpenRouter/deepseek/deepseek-chat-v3.1
📈 Results Summary:
Total Battles: 10
Win Rate (A): 30.0%
✅ Wins: 3
❌ Losses: 6
🤝 Ties: 1
============================================================
```

Models are specified using the format: `{LangChain Backend}/{Model Path}`
Examples:

```
Together/meta-llama/Llama-3.3-70B-Instruct-Turbo
ChatOpenAI/gpt-4o
LlamaCpp/jwiggerthale_Llama-3.2-3B-Q8_0-GGUF_llama-3.2-3b-q8_0.gguf
VLLM/utter-project/EuroLLM-9B
OpenRouter/deepseek/deepseek-chat-v3.1
```
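Note that only the first `/` separates the backend from the model path, so the model path itself may contain slashes. A minimal shell sketch of the split (illustrative only, not OpenJury's internal parsing):

```bash
# Split a model spec at the first "/" (illustrative; not OpenJury's actual parser)
spec="OpenRouter/deepseek/deepseek-chat-v3.1"
backend="${spec%%/*}"    # -> OpenRouter (LangChain backend)
model_path="${spec#*/}"  # -> deepseek/deepseek-chat-v3.1 (model path)
echo "$backend | $model_path"
```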
For instance, to run everything locally with vLLM:

```bash
python openjury/generate_and_evaluate.py \
    --dataset alpaca-eval \
    --model_A VLLM/Qwen/Qwen2.5-0.5B-Instruct \
    --model_B VLLM/Qwen/Qwen2.5-1.5B-Instruct \
    --judge_model VLLM/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 \
    --n_instructions 10
```

Note: Ensure you have the required LangChain dependencies installed for your chosen provider. If you use a remote endpoint, you will need to set your credentials.
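Remote credentials are typically read from environment variables. The variable names below are the usual defaults for these LangChain integrations; verify them for your provider:

```bash
# Typical API key environment variables (names depend on the LangChain integration)
export OPENAI_API_KEY="sk-..."    # ChatOpenAI
export TOGETHER_API_KEY="..."     # Together
export OPENROUTER_API_KEY="..."   # OpenRouter
```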
| Dataset | Description |
|---|---|
| `alpaca-eval` | General instruction-following benchmark |
| `arena-hard` | More challenging evaluation suite |
| `m-arena-hard` | Translated version of Arena-Hard in 23 languages |
| `m-arena-hard-{lang}` | Language-specific variants (e.g., `ar`, `cs`, `de`) |
| `m-arena-hard-EU` | All EU languages combined |
| `fluency-{lang}` | Fluency evaluation for pretrained models (`finnish`, `french`, `german`, `spanish`, `swedish`) |
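For example, to run a battle on the German variant of m-Arena-Hard, pass the language-suffixed dataset name to the same script (the models and judge below are just the ones from the examples above):

```bash
# Evaluate on the German split of m-Arena-Hard
python openjury/generate_and_evaluate.py \
    --dataset m-arena-hard-de \
    --model_A VLLM/utter-project/EuroLLM-9B \
    --model_B VLLM/Qwen/Qwen2.5-1.5B-Instruct \
    --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
    --n_instructions 10
```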
Pre-download all datasets before running jobs:
```bash
python -c "from openjury.utils import download_all; download_all()"  # Download all datasets (optional)
```

Datasets are stored in:
- `$OPENJURY_EVAL_DATA` (if set)
- `~/openjury-eval-data/` (default)
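For example, to keep the datasets in a custom location (the path below is just an illustration):

```bash
# Store evaluation datasets somewhere other than the default
export OPENJURY_EVAL_DATA=/data/openjury-eval-data
```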
We welcome contributions! Whether it's bug fixes, new features, or additional benchmark support, feel free to open an issue or submit a pull request.
If you use this work in your research, please cite the following paper.
```bibtex
@inproceedings{
salinas2025tuning,
title={Tuning {LLM} Judge Design Decisions for 1/1000 of the Cost},
author={David Salinas and Omar Swelam and Frank Hutter},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=cve4NOiyVp}
}
```

The judge configurations were tuned in this paper, and much of its code is reused in this package.