Toolkit for evaluating and comparing LLM outputs. Supports automated metrics, code execution testing, and RLHF-style pairwise comparison.
Built this while working on AI model evaluation tasks - needed something lighter than lm-harness that I could customize quickly.
- Text metrics - exact match, fuzzy match, ROUGE-L, keyword checking
- Code evaluation - run generated code against test cases with timeout handling
- Pairwise comparison - compare two models head-to-head (like RLHF preference collection)
- LLM-as-judge - plug in any LLM to judge response quality (see the sketch at the end of this README)
- Dataset loading - JSONL format evaluation sets (see the loading sketch right after this list)
- Aggregation - pass rates, averages, per-metric breakdowns
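The toolkit's own JSONL loader isn't shown in this README, so here is a minimal stand-alone sketch of reading an evaluation set from disk into the list-of-dicts format used in the examples below. The `load_jsonl` helper and the `eval_set.jsonl` filename are illustrative, and the `prompt`/`response`/`expected` keys are assumed to match the quick-start records:

```python
import json
from pathlib import Path

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line into the record format used by the examples below."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():  # skip blank lines
            records.append(json.loads(line))
    return records

# dataset = load_jsonl("eval_set.jsonl")
```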
Score responses with the built-in text metrics:

```python
from llm_eval import Evaluator, exact_match, fuzzy_match, rouge_l

ev = Evaluator()
ev.add_metric("exact", exact_match)
ev.add_metric("fuzzy", fuzzy_match)
ev.add_metric("rouge", rouge_l)

dataset = [
    {"prompt": "Capital of France?", "response": "Paris", "expected": "Paris"},
    {"prompt": "2+2?", "response": "4", "expected": "4"},
]

summary = ev.evaluate(dataset)
print(f"Pass rate: {summary.pass_rate:.1%}")
print(f"Avg scores: {summary.avg_scores}")
```

Run generated code against test cases, with a per-case timeout:

```python
from llm_eval import code_correctness
result = code_correctness(
    code="x = int(input()); print(x * 2)",
    test_cases=[
        {"input": "5", "expected_output": "10"},
        {"input": "3", "expected_output": "6"},
    ],
    timeout=5.0,
)
print(f"Pass rate: {result['pass_rate']:.0%}")
```

Compare two models head-to-head on the same prompts:

```python
from llm_eval import compare_responses, fuzzy_match
results, win_rate = compare_responses(
    prompts=["Explain gravity", "What is DNA?"],
    responses_a=["Gravity is...", "DNA is..."],
    responses_b=["Force that...", "Molecule that..."],
    metric_fns={"fuzzy": fuzzy_match},
    expected=["Gravity is a force...", "DNA is a molecule..."],
)
print(f"Model A wins: {win_rate.model_a_rate:.0%}")
print(f"Model B wins: {win_rate.model_b_rate:.0%}")
```

Run the test suite:

```bash
pytest -v
```
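The LLM-as-judge feature isn't demonstrated above, and its exact API isn't shown in this README. One way to plug a judge in without relying on that API is to wrap an LLM call as an ordinary metric. The sketch below assumes metric functions take `(response, expected)` and return a score in [0, 1], and uses a placeholder `call_llm` function standing in for whatever client you use:

```python
from llm_eval import Evaluator

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your own client (OpenAI, Anthropic, a local model, ...).
    raise NotImplementedError

def llm_judge(response: str, expected: str) -> float:
    # Ask the judge model to grade the candidate against the reference on a 0-10 scale.
    verdict = call_llm(
        f"Reference answer:\n{expected}\n\nCandidate answer:\n{response}\n\n"
        "Rate the candidate from 0 (wrong) to 10 (perfect). Reply with the number only."
    )
    try:
        score = float(verdict.strip()) / 10.0
    except ValueError:
        return 0.0  # unparseable verdicts count as a miss
    return min(max(score, 0.0), 1.0)

ev = Evaluator()
ev.add_metric("judge", llm_judge)
```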