LLM Bench — Accuracy • Speed • Memory

Lightweight, configurable benchmark harness for LLM providers. Measure accuracy (EM/F1/MC), latency/throughput, and resource timelines (CPU/RAM, optional GPU).

This repository is intentionally small and pluggable — add adapters for your provider and drop in any JSONL dataset.

Quick start (local, no external model)

  1. Create and activate a Python virtualenv. The project supports Python 3.10+.

  2. (Optional) install dev deps for testing and plotting:

python -m pip install -r requirements-dev.txt
# Optional GPU support: python -m pip install -e .[gpu]

  3. Run the example harness (uses the in-repo MockProvider):
python - <<'PY'
import asyncio, sys
sys.path.insert(0, '.')  # make the in-repo benches package importable
from benches.harness import run_bench
asyncio.run(run_bench('bench_config.yaml'))
PY

Output files are written under the reports/ prefix declared in bench_config.yaml (per-sample JSONL, a CSV summary, a resource timeline, and a compact Markdown report).

Configuration (bench_config.yaml)

Key fields:

  • provider: select kind: mock | ollama | openai and provider-specific connection options.
  • io.dataset_path: path to JSONL dataset.
  • io.output_prefix: prefix for the output artifacts in reports/.
  • prompt.system and prompt.template: system message and per-sample template using {input} and other fields from the dataset.
  • load.concurrency and load.batch_size: concurrency/batch settings.
  • limits.max_samples: limit number of samples for fast experiments.
  • metrics.normalization: optional normalization (e.g., lower_strip) applied to accuracy metrics.

An example config is included as bench_config.yaml.
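
For orientation, a minimal sketch of what such a config might look like, using the fields listed above (values and exact layout are illustrative; treat the bundled bench_config.yaml as authoritative):

provider:
  kind: mock                 # mock | ollama | openai
io:
  dataset_path: data/qa.jsonl
  output_prefix: reports/mock
prompt:
  system: "You are a concise QA assistant."
  template: "Question: {input}\nAnswer:"
load:
  concurrency: 4
  batch_size: 1
limits:
  max_samples: 50
metrics:
  normalization: lower_strip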

Dataset formats

  • Free‑text QA JSONL (one object per line):
{"id":"1","input":"Capital of France?","target":"Paris"}
  • Multiple choice JSONL:
{"id":"1","input":"Capital of France?","choices":["Paris","Lyon"],"answer":"Paris"}
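
If your data lives in another format, a small script along these lines (illustrative only; the field names match the free-text format above) can produce a compatible JSONL file:

import json

samples = [("Capital of France?", "Paris"), ("Largest planet?", "Jupiter")]

with open("data/qa.jsonl", "w", encoding="utf-8") as f:
    for i, (question, answer) in enumerate(samples, start=1):
        f.write(json.dumps({"id": str(i), "input": question, "target": answer}) + "\n")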

Providers

Implement a Provider with two async methods:

  • generate(prompt, system=None, options=None) -> dict — returns at least output and may provide latency_s, ttft_s, prompt_eval_count, eval_count.
  • tokenize(text) -> int — optional but helpful for token counts.
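
A minimal sketch of a custom provider following that interface (the method names and return keys come from the list above; the class name and echo behaviour are placeholders for a real backend call):

import time
from typing import Any, Optional

class EchoProvider:
    async def generate(self, prompt: str, system: Optional[str] = None,
                       options: Optional[dict] = None) -> dict[str, Any]:
        start = time.perf_counter()
        output = f"echo: {prompt}"  # replace with a real API call
        return {
            "output": output,
            "latency_s": time.perf_counter() - start,
            # optionally also: "ttft_s", "prompt_eval_count", "eval_count"
        }

    async def tokenize(self, text: str) -> int:
        # crude whitespace approximation; use your backend's tokenizer if available
        return len(text.split())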

Included adapters:

  • OllamaProvider (calls /api/generate and /api/tokenize)
  • OpenAIStyleProvider (calls /v1/chat/completions)
  • MockProvider (local, for testing and CI)

Add your provider implementation to benches/providers.py and register it in _load_provider in benches/harness.py.

Metrics

  • Exact Match (EM), token-F1, multiple-choice accuracy implemented in benches/metrics.py.
  • BLEU and ROUGE-L are optional; they require sacrebleu and rouge-score respectively.
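
For reference, EM and token-F1 are typically computed along these lines (an illustrative sketch, not the benches/metrics.py implementation; normalize mirrors the lower_strip option):

def normalize(text: str) -> str:
    return text.lower().strip()

def exact_match(prediction: str, target: str) -> float:
    return float(normalize(prediction) == normalize(target))

def token_f1(prediction: str, target: str) -> float:
    pred = normalize(prediction).split()
    gold = normalize(target).split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if not pred or not gold or common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)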

Resource monitoring

benches/monitor.py samples process CPU/RAM (via psutil) and optionally GPU stats via NVML.

  • GPU sampling is optional; set the environment variable LLM_BENCH_SKIP_GPU=1 to disable it (CI sets this by default).
  • GPU support is provided by the optional package extra gpu (the recommended package is nvidia-ml-py; falling back to pynvml is also supported).

Install the GPU extra locally with:

python -m pip install -e .[gpu]

Note: GitHub-hosted runners do not provide GPUs; the CI workflow sets LLM_BENCH_SKIP_GPU=1.
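
For context, the sampling loop looks roughly like this (an illustrative sketch; the actual interval, columns, and NVML handling in benches/monitor.py may differ):

import os, time, psutil

def sample_resources(interval_s: float = 0.5, duration_s: float = 10.0):
    proc = psutil.Process()
    gpu = None
    if os.environ.get("LLM_BENCH_SKIP_GPU") != "1":
        try:
            import pynvml  # provided by nvidia-ml-py (or the older pynvml package)
            pynvml.nvmlInit()
            gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
        except Exception:
            gpu = None  # no NVML available: record CPU/RAM only
    deadline = time.time() + duration_s
    while time.time() < deadline:
        row = {
            "t": time.time(),
            "cpu_percent": proc.cpu_percent(interval=None),
            "rss_mb": proc.memory_info().rss / 1e6,
        }
        if gpu is not None:
            row["gpu_mem_mb"] = pynvml.nvmlDeviceGetMemoryInfo(gpu).used / 1e6
        print(row)
        time.sleep(interval_s)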

Outputs

  • *.jsonl — per-sample detailed results
  • *_summary.csv — single-row summary (latency percentiles, accuracy means, tokens)
  • *_resources.csv — timeline of CPU/RAM/(optional)GPU samples
  • *_report.md — compact human report
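
To inspect the artifacts programmatically (assumes pandas; reports/mock is a placeholder for whatever io.output_prefix you configured):

import pandas as pd

summary = pd.read_csv("reports/mock_summary.csv")
print(summary.T)  # latency percentiles, accuracy means, token counts

resources = pd.read_csv("reports/mock_resources.csv")
print(resources.head())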

Tests & CI

  • Unit and integration tests live in tests/.
  • Run tests locally with pytest or make test.
  • CI (.github/workflows/ci.yml) runs tests and sets LLM_BENCH_SKIP_GPU=1 so GPU sampling is skipped on GitHub runners.

Examples

  • examples/run_mock.py — programmatic example that runs the harness against the MockProvider.
  • benches/plot.py — helper to plot the *_resources.csv timeline (requires matplotlib and pandas).

Extending

  • Add a provider: implement Provider.generate() and tokenize(), and register it in _load_provider.
  • Add a metric: implement in benches/metrics.py and wire into benches/harness.py.
  • Throughput sweeps: write a wrapper that modifies bench_config.yaml concurrency/batch settings and re-runs the harness to gather scaling data.
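
A hedged sketch of such a sweep wrapper (the field names follow the configuration section above; the YAML rewriting and output prefixes are illustrative):

import asyncio
import yaml
from benches.harness import run_bench

for concurrency in (1, 2, 4, 8):
    # load the base config and override the load settings for this run
    with open("bench_config.yaml") as f:
        cfg = yaml.safe_load(f)
    cfg["load"]["concurrency"] = concurrency
    cfg["io"]["output_prefix"] = f"reports/sweep_c{concurrency}"
    with open("bench_config_sweep.yaml", "w") as f:
        yaml.safe_dump(cfg, f)
    asyncio.run(run_bench("bench_config_sweep.yaml"))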

License

MIT — do what you want, but please share interesting improvements.
