
# llm-eval

Toolkit for evaluating and comparing LLM outputs. Supports automated metrics, code execution testing, and RLHF-style pairwise comparison.

I built this while working on AI model evaluation tasks; I needed something lighter than lm-harness that I could customize quickly.

## Features

- Text metrics - exact match, fuzzy match, ROUGE-L, keyword checking
- Code evaluation - run generated code against test cases with timeout handling
- Pairwise comparison - compare two models head-to-head (like RLHF preference collection)
- LLM-as-judge - plug in any LLM to judge response quality
- Dataset loading - JSONL-format evaluation sets
- Aggregation - pass rates, averages, per-metric breakdowns
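All the text metrics share one simple contract. As a rough sketch (this is an assumption about the interface, not the library's actual implementation): a metric is a callable that takes a response string and an expected string and returns a score in [0, 1].

```python
from difflib import SequenceMatcher

def exact_match(response: str, expected: str) -> float:
    """1.0 only when the strings match after trimming whitespace and case."""
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0

def fuzzy_match(response: str, expected: str) -> float:
    """Character-level similarity ratio in [0, 1] via difflib."""
    return SequenceMatcher(None, response.lower(), expected.lower()).ratio()
```

Any callable with this shape can be aggregated uniformly, which is what makes per-metric breakdowns straightforward.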

## Quick start

```python
from llm_eval import Evaluator, exact_match, fuzzy_match, rouge_l

ev = Evaluator()
ev.add_metric("exact", exact_match)
ev.add_metric("fuzzy", fuzzy_match)
ev.add_metric("rouge", rouge_l)

dataset = [
    {"prompt": "Capital of France?", "response": "Paris", "expected": "Paris"},
    {"prompt": "2+2?", "response": "4", "expected": "4"},
]

summary = ev.evaluate(dataset)
print(f"Pass rate: {summary.pass_rate:.1%}")
print(f"Avg scores: {summary.avg_scores}")
```

## Code evaluation

```python
from llm_eval import code_correctness

result = code_correctness(
    code="x = int(input()); print(x * 2)",
    test_cases=[
        {"input": "5", "expected_output": "10"},
        {"input": "3", "expected_output": "6"},
    ],
    timeout=5.0,
)
print(f"Pass rate: {result['pass_rate']:.0%}")
```
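Mechanically, running generated code with timeout handling mostly comes down to a fresh interpreter process plus a hard deadline. This is a hedged sketch of that mechanism, not `code_correctness` itself (the helper name `run_case` is made up here):

```python
import subprocess
import sys

def run_case(code: str, stdin_text: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run `code` in a fresh Python process; return (exited cleanly, stdout)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, ""  # treat a hang as a failed test case
    return proc.returncode == 0, proc.stdout.strip()
```

Comparing captured stdout against `expected_output` per test case, then dividing passes by total, yields the pass rate shown above. A subprocess is process isolation only, not a sandbox, so don't run truly untrusted code this way.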

## Model comparison (RLHF-style)

```python
from llm_eval import compare_responses, fuzzy_match

results, win_rate = compare_responses(
    prompts=["Explain gravity", "What is DNA?"],
    responses_a=["Gravity is...", "DNA is..."],
    responses_b=["Force that...", "Molecule that..."],
    metric_fns={"fuzzy": fuzzy_match},
    expected=["Gravity is a force...", "DNA is a molecule..."],
)

print(f"Model A wins: {win_rate.model_a_rate:.0%}")
print(f"Model B wins: {win_rate.model_b_rate:.0%}")
```
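One plausible way the win rates could be computed (an assumption about the internals, not something the source confirms): score each model's response per prompt with the supplied metrics, average those scores, and count which model scores higher on each prompt.

```python
def tally_win_rates(scores_a: list[float], scores_b: list[float]) -> tuple[float, float]:
    """Per-prompt mean metric scores in; fraction of head-to-head wins out."""
    n = len(scores_a)
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    return wins_a / n, wins_b / n  # ties count toward neither model
```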

## Tests

```shell
pytest -v
```
