This project is a friendly, reproducible home for testing small open-source LLMs on edge hardware. Every run is scripted so teammates (or reviewers) can see exactly what happened, compare notes, and feel confident repeating the experiment on their own machines.
- A ready-to-run suite covering deepseek-r1:1.5b, llama3.2:1b, gemma2:2b, phi3:3.8b (feel free to extend the list)
- Deterministic inference settings baked in (temperature=0, top_p=1.0, max_tokens=256, fixed seed)
- Automatic cache clears and Ollama restarts so tests start fresh
- Single-thread runs to keep CPU usage consistent across devices
- Auto-grading with clear scoring plus sanitized system metadata for context
- Export and plotting scripts so you can move smoothly from raw JSON to spreadsheets or visuals
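The deterministic settings above map directly onto Ollama's `options` object. A minimal sketch of how a run might be issued against a local Ollama server (the endpoint and option names follow Ollama's HTTP API; the seed value of 42 is illustrative, not necessarily what the runner uses):

```python
# Sketch: passing the deterministic settings listed above to Ollama's
# /api/generate endpoint. Only the stdlib is used so the snippet stays
# self-contained; the real runner may use the ollama Python client instead.
import json
import urllib.request


def build_request(model: str, prompt: str) -> dict:
    """Assemble a deterministic generation request (temperature=0, fixed seed)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0,    # greedy decoding
            "top_p": 1.0,        # no nucleus truncation
            "num_predict": 256,  # Ollama's name for max_tokens
            "seed": 42,          # fixed seed for repeatability (illustrative)
            "num_thread": 1,     # single-thread CPU inference
        },
    }


def run_prompt(model: str, prompt: str) -> str:
    """POST the request to a local Ollama server (assumes it is running)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `temperature=0` and a fixed seed, repeated calls on the same hardware should return the same text, which is what makes cross-machine score comparison meaningful.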
- deepseek-r1:1.5b — 1.5B-parameter reasoning model tuned for concise step-by-step answers on modest hardware.
- llama3.2:1b — Meta’s 1B-parameter general model optimized for low-latency edge inference with balanced quality.
- gemma2:2b — Google’s 2B successor to Gemma with stronger multilingual coverage and stable generation at small scale.
- phi3:3.8b — Microsoft’s 3.8B compact transformer focused on code and reasoning tasks while staying edge-friendly.
```
EdgeAI_SmallOpen_LLMs/
├── benchmark_data/questions.json   # Benchmark questions and scoring rules
├── benchmark_runner.py             # Main benchmarking script
├── export_results_csv.py           # Export all results to CSV
├── plot_benchmark_scores.py        # Visualize model scores
├── requirements.txt                # Python dependencies
├── results/                        # Model outputs and plots
│   └── <model_name>/
│       └── <timestamp>/            # One benchmark session
│           ├── run_01.json         # Individual run (status, scores, resources)
│           └── run_02.json         # Additional runs share the same timestamp
└── README.md                       # Project documentation
```
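Given this layout, collecting every run file is a two-level glob. A small sketch (paths relative to the repo root):

```python
# Sketch: gather all run_*.json files under results/<model_name>/<timestamp>/,
# matching the directory layout shown above.
from pathlib import Path
from typing import List


def list_runs(results_dir: str = "results") -> List[Path]:
    """Return every run file, sorted so sessions and run indices stay in order."""
    return sorted(Path(results_dir).glob("*/*/run_*.json"))
```

This is essentially what the export script has to do before it can flatten runs into a CSV.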
- Install the tools

  ```
  python3 -m venv .venv          # create once
  source .venv/bin/activate      # activate the workspace venv
  pip install -r requirements.txt
  ```

  Installs everything the scripts need (psutil, matplotlib, etc.) inside `.venv`. The helper script automatically uses the active environment; set `PYTHON_BIN=$PWD/.venv/bin/python` if you ever want to stay outside the venv.

- Grab at least one model

  ```
  ollama pull llama3.2:1b
  ```

  Repeat for deepseek-r1:1.5b, gemma2:2b, phi3:3.8b, or any other Ollama model you want to test.
- Run the benchmark

  ```
  python benchmark_runner.py --model llama3.2:1b
  ```

  Results land under `results/llama3.2:1b/<timestamp>/run_01.json` with scores, responses, and system metadata. Rerun the command as many times as you like; each pass creates a new timestamped folder.

- Look at the outputs

  - Open the JSON from step 3 to inspect a single run.
  - `python export_results_csv.py` collects every run into `results/benchmark_results.csv`.
  - `python plot_benchmark_scores.py` creates `results/benchmark_scores.png` for a quick comparison chart.
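For a quick look at one run without opening a spreadsheet, a few lines of Python suffice. Note the top-level keys used here (`model`, `scores`) are illustrative; check your own `run_01.json` for the exact schema the runner emits:

```python
# Sketch: summarize a single run file. Key names are assumptions --
# inspect a real run_01.json to confirm the schema before relying on this.
import json
from pathlib import Path


def summarize_run(path: str) -> str:
    """Return a one-line summary: model name and average score."""
    run = json.loads(Path(path).read_text())
    scores = run.get("scores", {})
    avg = sum(scores.values()) / len(scores) if scores else 0.0
    return f"{run.get('model', '?')}: avg score {avg:.2f} over {len(scores)} tasks"
```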
- Reproducible cold start and log capture:

  ```
  python scripts/full_repro_benchmark.py --model llama3.2:1b --output run.log
  ```

  Clears Ollama caches, enforces single-thread inference, and writes outputs to the same results folder. After a cache purge, Ollama re-downloads weights the next time it serves a prompt, so make sure the Ollama app/server stays running while the helper executes.

- Fresh slate before a rerun:

  ```
  python benchmark_runner.py --model llama3.2:1b --purge-results
  ```

  Removes the existing `results/llama3.2:1b` directory before writing new data.
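Conceptually, `--purge-results` is just a guarded directory delete. The flag itself lives in `benchmark_runner.py`; this standalone sketch only illustrates the behavior:

```python
# Sketch of what --purge-results does conceptually: remove results/<model>
# before a fresh run. Illustration only -- the real logic is in
# benchmark_runner.py.
import shutil
from pathlib import Path


def purge_results(model: str, results_dir: str = "results") -> bool:
    """Delete results/<model> if it exists; return True when something was removed."""
    target = Path(results_dir) / model
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```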
- When you want a fresh, cold-cache run without thinking about the prep steps, use the helper:

  ```
  python scripts/full_repro_benchmark.py --model llama3.2:1b --output llama_runs.log
  ```

  The script clears cached weights, enforces single-thread inference, and then runs `benchmark_runner.py` for you while leaving the Ollama daemon untouched. Prefer to keep the cache? Pass `--no-cache-clear`. Want a clean slate for that model's outputs? Add `--purge-results` to clear `results/<model>` before the run, or delete the folder manually afterward. Re-run the helper whenever you need another pass.

  The helper works the same on macOS and Raspberry Pi OS as long as Ollama and Python are installed and already running. Each invocation writes outputs to `results/<model>/<timestamp>/run_<index>.json` and streams a copy of stdout to the optional log file so you can revisit the transcript later. If the log shows an `ERROR:` message (for example, missing model or offline download), rerun after resolving it.
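Since the helper streams stdout to the log file, a quick scan for `ERROR:` lines is an easy post-run sanity check. A minimal sketch (the log filename is whatever you passed to `--output`):

```python
# Sketch: scan a helper log (e.g. llama_runs.log) for ERROR: lines so a
# failed pass is visible before digging into the JSON outputs.
from pathlib import Path
from typing import List


def find_errors(log_path: str) -> List[str]:
    """Return every line in the log that contains an ERROR: marker."""
    return [
        line for line in Path(log_path).read_text().splitlines()
        if "ERROR:" in line
    ]
```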
Every run_<index>.json includes a short snapshot of the host (OS, machine type, major OS/Python versions, Ollama version, CPU and RAM capacity) so collaborators can see where the run happened without revealing anything sensitive.
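The snapshot can be gathered with the standard library plus psutil. A stdlib-only sketch of the idea (the real runner also records the Ollama version and RAM capacity, omitted here to keep the snippet self-contained):

```python
# Sketch: the kind of sanitized host snapshot each run file records.
# Stdlib only; the real runner also captures Ollama version and RAM (psutil).
import os
import platform
import sys


def host_snapshot() -> dict:
    """Collect coarse, non-sensitive facts about the machine."""
    return {
        "os": platform.system(),        # e.g. "Darwin", "Linux"
        "os_release": platform.release(),
        "machine": platform.machine(),  # e.g. "arm64", "x86_64"
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "cpu_count": os.cpu_count(),
    }
```

Nothing here identifies a user or network, which is what keeps the metadata safe to share alongside results.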
- Python 3.8+
- Ollama installed and models pulled (see https://ollama.com/)
- macOS or Linux recommended
- Keep the code modular, readable, and kind to the next person who opens it
- Suggest new tasks, models, or visualizations through issues or PRs
- Stick to objective, reproducible methods so results stay comparable across machines
MIT