rwandantechy/EdgeAI_SmallOpen_LLMs

EdgeAI Small Open LLMs Benchmark

This project is a friendly, reproducible home for testing small open-source LLMs on edge hardware. Every run is scripted so teammates (or reviewers) can see exactly what happened, compare notes, and feel confident repeating the experiment on their own machines.

What’s included

  • A ready-to-run suite covering deepseek-r1:1.5b, llama3.2:1b, gemma2:2b, phi3:3.8b (feel free to extend the list)
  • Deterministic inference settings baked in (temperature=0, top_p=1.0, max_tokens=256, fixed seed)
  • Automatic cache clears and Ollama restarts so tests start fresh
  • Single-thread runs to keep CPU usage consistent across devices
  • Auto-grading with clear scoring plus sanitized system metadata for context
  • Export and plotting scripts so you can move smoothly from raw JSON to spreadsheets or visuals
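
The deterministic settings above map directly onto Ollama's generation options. As an illustration (not the runner's actual code), here is the kind of payload a script could send to Ollama's `/api/generate` endpoint to get repeatable outputs; the seed value is arbitrary and only needs to stay fixed across runs:

```python
def build_deterministic_request(model: str, prompt: str, seed: int = 42) -> dict:
    """Return a /api/generate payload with deterministic decoding settings."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one complete response instead of a token stream
        "options": {
            "temperature": 0,    # greedy decoding
            "top_p": 1.0,        # no nucleus truncation
            "num_predict": 256,  # Ollama's name for the max_tokens cap
            "seed": seed,        # fixed seed for repeatability
        },
    }
```

With `temperature=0` the sampler is effectively greedy, so the fixed seed mostly guards the remaining tie-breaking paths; together they make reruns comparable across machines.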

Model snapshots

  • deepseek-r1:1.5b — 1.5B-parameter reasoning model tuned for concise step-by-step answers on modest hardware.
  • llama3.2:1b — Meta’s 1B-parameter general model optimized for low-latency edge inference with balanced quality.
  • gemma2:2b — Google’s 2B successor to Gemma with stronger multilingual coverage and stable generation at small scale.
  • phi3:3.8b — Microsoft’s 3.8B compact transformer focused on code and reasoning tasks while staying edge-friendly.

Folder Structure

EdgeAI_SmallOpen_LLMs/
├── benchmark_data/questions.json         # Benchmark questions and scoring rules
├── benchmark_runner.py                  # Main benchmarking script
├── export_results_csv.py                # Export all results to CSV
├── plot_benchmark_scores.py             # Visualize model scores
├── requirements.txt                     # Python dependencies
├── results/                             # Model outputs and plots
│   └── <model_name>/
│       └── <timestamp>/                 # One benchmark session
│           ├── run_01.json              # Individual run (status, scores, resources)
│           └── run_02.json              # Additional runs share same timestamp
└── README.md                            # Project documentation
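
Because every run lives at `results/<model>/<timestamp>/run_<index>.json`, downstream tools can discover all runs with a single glob. A minimal sketch (illustrative, not the project's actual code) of walking this tree:

```python
import json
from pathlib import Path

def iter_runs(results_dir: str = "results"):
    """Yield (model, timestamp, run_data) for every run_*.json in the tree."""
    for run_file in sorted(Path(results_dir).glob("*/*/run_*.json")):
        model, timestamp = run_file.parts[-3], run_file.parts[-2]
        with open(run_file) as f:
            yield model, timestamp, json.load(f)
```

The export and plotting scripts can build on exactly this layout, which is why the directory structure doubles as the project's data schema.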

Quick start

  1. Install the tools
    python3 -m venv .venv        # create once
    source .venv/bin/activate    # activate the workspace venv
    pip install -r requirements.txt
    This installs everything the scripts need (psutil, matplotlib, etc.) inside .venv. The helper script automatically uses the active environment; set PYTHON_BIN=$PWD/.venv/bin/python if you want to run it from outside the venv.
  2. Grab at least one model
    ollama pull llama3.2:1b
    Repeat for deepseek-r1:1.5b, gemma2:2b, phi3:3.8b, or any other Ollama model you want to test.
  3. Run the benchmark
    python benchmark_runner.py --model llama3.2:1b
    Results land under results/llama3.2:1b/<timestamp>/run_01.json with scores, responses, and system metadata. Rerun the command as many times as you like; each pass creates a new timestamped folder.
  4. Look at the outputs
    • Open the JSON from step 3 to inspect a single run.
    • python export_results_csv.py collects every run into results/benchmark_results.csv.
    • python plot_benchmark_scores.py creates results/benchmark_scores.png for a quick comparison chart.
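
Conceptually, the CSV export in step 4 flattens every run JSON into one spreadsheet row. A hedged sketch of that aggregation; the field names here ("model", "timestamp", "score") are illustrative assumptions, not necessarily the script's actual schema:

```python
import csv
import json
from pathlib import Path

def export_csv(results_dir: str, out_path: str) -> int:
    """Collect every run_*.json under results_dir into a single CSV file."""
    rows = []
    for run_file in sorted(Path(results_dir).glob("*/*/run_*.json")):
        with open(run_file) as f:
            data = json.load(f)
        rows.append({
            "model": run_file.parts[-3],      # model name from the folder path
            "timestamp": run_file.parts[-2],  # session timestamp from the path
            "score": data.get("score"),       # assumed top-level field
        })
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "timestamp", "score"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```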

Helpful extras (optional)

  • Reproducible cold start and log capture:
     python scripts/full_repro_benchmark.py --model llama3.2:1b --output run.log
    Clears Ollama caches, enforces single-thread inference, and writes outputs to the same results folder. After a cache purge, Ollama re-downloads weights the next time it serves a prompt, so make sure the Ollama app/server stays running while the helper executes.
  • Fresh slate before a rerun:
    python benchmark_runner.py --model llama3.2:1b --purge-results
    Removes the existing results/llama3.2:1b directory before writing new data.

Quick reproducibility helper

  • When you want a fresh, cold-cache run without thinking about the prep steps, use the helper:
    python scripts/full_repro_benchmark.py --model llama3.2:1b --output llama_runs.log
    The script clears cached weights, enforces single-thread inference, and then runs benchmark_runner.py for you while leaving the Ollama daemon untouched.
  • Prefer to keep the cache? Pass --no-cache-clear.
  • Want a clean slate for that model’s outputs? Add --purge-results to clear results/<model> before the run, or delete the folder manually afterward.
  • The helper behaves the same on macOS and Raspberry Pi OS as long as Ollama and Python are installed and the Ollama server is running.
  • Each invocation writes outputs to results/<model>/<timestamp>/run_<index>.json and streams a copy of stdout to the optional log file so you can revisit the transcript later. If the log shows an ERROR: message (for example, a missing model or an offline download), resolve the issue and rerun.

Every run_<index>.json includes a short snapshot of the host (OS, machine type, major OS/Python versions, Ollama version, CPU and RAM capacity) so collaborators can see where the run happened without revealing anything sensitive.
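
As a sketch of what such a sanitized snapshot can look like, the function below collects coarse, non-identifying host facts using only the standard library (the project itself lists psutil, which additionally exposes RAM capacity). The keys are illustrative, not the runner's exact schema:

```python
import os
import platform

def host_snapshot() -> dict:
    """Return coarse, non-identifying metadata about the current host."""
    return {
        "os": platform.system(),         # e.g. "Linux" or "Darwin"
        "machine": platform.machine(),   # e.g. "arm64" or "x86_64"
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),     # logical CPU count
    }
```

Nothing here identifies a user or machine uniquely, which is what makes the snapshot safe to commit alongside benchmark results.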

Requirements

  • Python 3.8+
  • Ollama installed and models pulled (see https://ollama.com/)
  • macOS or Linux recommended

Contributing

  • Keep the code modular, readable, and kind to the next person who opens it
  • Suggest new tasks, models, or visualizations through issues or PRs
  • Stick to objective, reproducible methods so results stay comparable across machines

License

MIT

About

A curated repository for benchmarking and evaluating small open-source language models (LLMs) optimized for edge computing environments (CPU-only or low-VRAM devices). The project focuses on reproducible, interpretable, and standardized evaluation of LLM capabilities across multiple domains.
