Lightweight, configurable benchmark harness for LLM providers. Measure accuracy (EM/F1/MC), latency/throughput, and resource timelines (CPU/RAM, optional GPU).
This repository is intentionally small and pluggable — add adapters for your provider and drop in any JSONL dataset.
- Create and activate a Python virtualenv. The project supports Python 3.10+.
- (Optional) install dev deps for testing and plotting:

```bash
python -m pip install -r requirements-dev.txt
# Optional GPU support: python -m pip install -e .[gpu]
```

- Run the example harness (uses an in-repo `MockProvider`):
```bash
python - <<'PY'
import asyncio, sys
sys.path.insert(0, '')  # make sure the repo root (current directory) is importable
from benches.harness import run_bench
asyncio.run(run_bench('bench_config.yaml'))
PY
```

Output files will be written to the `reports/` prefix declared in `bench_config.yaml` (JSONL per-sample results, a CSV summary, a resources timeline and a compact Markdown report).
Key fields:
- `provider`: select `kind: mock | ollama | openai` and provider-specific connection options.
- `io.dataset_path`: path to the JSONL dataset.
- `io.output_prefix`: prefix for the output artifacts in `reports/`.
- `prompt.system` and `prompt.template`: system message and per-sample template using `{input}` and other fields from the dataset.
- `load.concurrency` and `load.batch_size`: concurrency/batch settings.
- `limits.max_samples`: limit the number of samples for fast experiments.
- `metrics.normalization`: optional normalization (e.g., `lower_strip`) applied to accuracy metrics.
An example config is included as `bench_config.yaml`.
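The sketch below builds a placeholder config programmatically and runs it. The keys mirror the list above, but the values and paths are made up, so check them against the shipped `bench_config.yaml` (assumes PyYAML is installed):

```python
import asyncio

import yaml  # PyYAML

from benches.harness import run_bench

# Placeholder config mirroring the fields documented above; values are illustrative.
config = {
    "provider": {"kind": "mock"},
    "io": {"dataset_path": "data/toy_qa.jsonl", "output_prefix": "reports/toy"},
    "prompt": {
        "system": "Answer concisely.",
        "template": "Question: {input}\nAnswer:",
    },
    "load": {"concurrency": 4, "batch_size": 1},
    "limits": {"max_samples": 50},
    "metrics": {"normalization": "lower_strip"},
}

with open("bench_config.toy.yaml", "w") as fh:
    yaml.safe_dump(config, fh)

asyncio.run(run_bench("bench_config.toy.yaml"))
```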
- Free‑text QA JSONL (one object per line):
{"id":"1","input":"Capital of France?","target":"Paris"}- Multiple choice JSONL:
{"id":"1","input":"Capital of France?","choices":["Paris","Lyon"],"answer":"Paris"}Implement a Provider with two async methods:
Implement a `Provider` with two async methods:
- `generate(prompt, system=None, options=None) -> dict` — returns at least `output` and may also provide `latency_s`, `ttft_s`, `prompt_eval_count`, `eval_count`.
- `tokenize(text) -> int` — optional but helpful for token counts.
Included adapters:
- `OllamaProvider` (calls `/api/generate` and `/api/tokenize`)
- `OpenAIStyleProvider` (calls `/v1/chat/completions`)
- `MockProvider` (local, for testing and CI)
Add your provider implementation to `benches/providers.py` and register it in `_load_provider` in `benches/harness.py`.
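As a rough illustration only (check `benches/providers.py` for the exact interface the harness expects), a custom provider can be as small as:

```python
import asyncio
import time


class EchoProvider:
    """Toy provider that echoes the prompt back; a starting template for real adapters."""

    async def generate(self, prompt, system=None, options=None):
        start = time.perf_counter()
        await asyncio.sleep(0)  # stand-in for a real network call
        return {
            "output": f"echo: {prompt}",
            "latency_s": time.perf_counter() - start,
            # Optional extras the harness accepts: "ttft_s",
            # "prompt_eval_count", "eval_count".
        }

    async def tokenize(self, text):
        # Crude whitespace count; swap in your provider's tokenizer.
        return len(text.split())
```

From there, point `_load_provider` at the class (presumably under a new `kind` value) so it can be selected from the config.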
- Exact Match (EM), token-level F1 and multiple-choice accuracy are implemented in `benches/metrics.py`.
- BLEU and ROUGE-L are optional; they require `sacrebleu` and `rouge-score` respectively.
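For reference, the usual definitions look roughly like this; the shipped implementations in `benches/metrics.py` may normalize or tokenize differently:

```python
from collections import Counter


def exact_match(prediction: str, target: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == target.strip().lower())


def token_f1(prediction: str, target: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    tgt_tokens = target.lower().split()
    common = Counter(pred_tokens) & Counter(tgt_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(tgt_tokens)
    return 2 * precision * recall / (precision + recall)
```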
`benches/monitor.py` samples process CPU/RAM (via `psutil`) and optionally GPU stats via NVML.
- GPU sampling is optional; set the environment variable `LLM_BENCH_SKIP_GPU=1` to skip it (CI sets this variable by default).
- GPU support is available via the optional package extra `gpu` (the recommended package is `nvidia-ml-py`; `pynvml` is supported as a fallback).

Install the GPU extra locally with:

```bash
python -m pip install -e .[gpu]
```

Note: GitHub-hosted runners do not provide GPUs; the CI workflow sets `LLM_BENCH_SKIP_GPU=1`.
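Conceptually the sampler works like the sketch below (illustrative only; the column names, cadence and NVML handling in `benches/monitor.py` will differ):

```python
import csv
import os
import time

import psutil

SKIP_GPU = os.environ.get("LLM_BENCH_SKIP_GPU") == "1"


def sample_resources(path: str, interval_s: float = 1.0, duration_s: float = 10.0) -> None:
    """Write CPU/RAM (and, if available, GPU) samples to a CSV timeline."""
    proc = psutil.Process()
    gpu = None
    if not SKIP_GPU:
        try:
            import pynvml  # provided by nvidia-ml-py or pynvml
            pynvml.nvmlInit()
            gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
        except Exception:
            gpu = None  # no NVML / no GPU: fall back to CPU-only sampling

    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["t_s", "cpu_percent", "rss_mb", "gpu_util_percent"])
        start = time.time()
        while time.time() - start < duration_s:
            gpu_util = pynvml.nvmlDeviceGetUtilizationRates(gpu).gpu if gpu else ""
            writer.writerow([
                round(time.time() - start, 2),
                proc.cpu_percent(interval=None),
                proc.memory_info().rss / 1e6,
                gpu_util,
            ])
            time.sleep(interval_s)
```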
Output artifacts:
- `*.jsonl` — per-sample detailed results
- `*_summary.csv` — single-row summary (latency percentiles, accuracy means, token counts)
- `*_resources.csv` — timeline of CPU/RAM/(optional) GPU samples
- `*_report.md` — compact human-readable report
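To eyeball results without the plotting helper, something like this works; the `reports/mock_run` prefix is a placeholder for whatever your `io.output_prefix` produced:

```python
import pandas as pd

# Adjust the prefix to match io.output_prefix from your config.
prefix = "reports/mock_run"

summary = pd.read_csv(f"{prefix}_summary.csv")
resources = pd.read_csv(f"{prefix}_resources.csv")

print(summary.T)         # single-row summary, transposed for readability
print(resources.head())  # first few CPU/RAM/GPU samples
```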
- Unit and integration tests live in `tests/`.
- Run tests locally with `pytest` or `make test`.
- CI (`.github/workflows/ci.yml`) runs the tests and sets `LLM_BENCH_SKIP_GPU=1`, so GPU sampling is skipped on GitHub runners.
- `examples/run_mock.py` — programmatic example that runs the harness against the `MockProvider`.
- `benches/plot.py` — helper to plot the `*_resources.csv` timeline (requires `matplotlib` + `pandas`).
- Add a provider: implement `Provider.generate()` and `tokenize()`, and register it in `_load_provider`.
- Add a metric: implement it in `benches/metrics.py` and wire it into `benches/harness.py`.
- Throughput sweeps: write a wrapper that modifies the concurrency/batch settings in `bench_config.yaml` and re-runs the harness to gather scaling data (see the sketch below).
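A minimal sweep wrapper might look like the following sketch (assumes PyYAML and the config layout described earlier; the concurrency values and output prefixes are arbitrary):

```python
import asyncio

import yaml  # PyYAML

from benches.harness import run_bench


async def sweep(concurrencies=(1, 2, 4, 8)):
    """Re-run the harness at several concurrency levels, one set of reports per level."""
    with open("bench_config.yaml") as fh:
        base = yaml.safe_load(fh)

    for conc in concurrencies:
        cfg = dict(base)
        cfg["load"] = dict(base["load"], concurrency=conc)
        cfg["io"] = dict(base["io"], output_prefix=f"reports/sweep_c{conc}")

        path = f"bench_config.c{conc}.yaml"
        with open(path, "w") as fh:
            yaml.safe_dump(cfg, fh)

        await run_bench(path)


asyncio.run(sweep())
```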
MIT — do what you want, but please share interesting improvements.
GPU (optional):
The harness can sample NVIDIA GPU stats via NVML. This is optional — GitHub Actions runners don't have GPUs, and CI skips GPU sampling by default.
To install the optional GPU dependency locally:
```bash
python -m pip install -e .[gpu]
# or: pip install nvidia-ml-py
```

CI note: the provided GitHub Actions workflow sets `LLM_BENCH_SKIP_GPU=1`, so GPU sampling is disabled in CI.