This project benchmarks local (edge) inference solutions for running Large Language Models (LLMs), alongside a cloud API baseline:
- Ollama - Running models locally via Ollama API
- WebLLM - Running models in the browser using WebLLM with WebGPU acceleration
- Llama.cpp - Running models locally via a llama.cpp server (llama-server)
- OpenAI - Running models via OpenAI API (cloud-based)
The repository is organized as follows:

```
edge_inference_benchmarks/
├── benchmark/                    # All benchmark-related code and data
│   ├── benchmark_runner.py       # Main benchmark orchestration script
│   ├── compare_results.py        # Script to compare benchmark results
│   ├── tests/                    # Test data directory
│   │   └── simple_benchmark.csv  # Simple benchmark test cases
│   ├── results/                  # Directory for benchmark results (gitignored)
│   └── comparison_results/       # Directory for comparison charts (gitignored)
├── requirements.txt              # Project dependencies
├── ollama/                       # Ollama implementation
│   ├── run_benchmark.py          # Ollama-specific benchmark code
│   └── requirements.txt          # Ollama-specific dependencies
├── openai/                       # OpenAI implementation
│   ├── run_benchmark.py          # OpenAI-specific benchmark code
│   ├── requirements.txt          # OpenAI-specific dependencies
│   └── .env                      # Environment file with OpenAI API key
├── webllm/                       # WebLLM implementation with WebGPU acceleration
│   ├── run_benchmark.py          # WebLLM-specific benchmark code
│   ├── requirements.txt          # WebLLM Python bridge dependencies
│   └── web/                      # Browser-based WebLLM app
│       ├── index.html            # HTML page for WebLLM benchmark
│       ├── js/                   # JavaScript code
│       │   └── index.js          # Main WebLLM benchmark logic
│       ├── package.json          # NPM dependencies
│       └── webpack.config.js     # Webpack configuration
└── llamacpp/                     # Llama.cpp implementation
    ├── run_benchmark.py          # Llama.cpp-specific benchmark code
    └── requirements.txt          # Llama.cpp-specific dependencies
```
Prerequisites:

- Python 3.8+
- Ollama installed locally (for Ollama benchmarks)
- Web browser with WebGPU support (for WebLLM benchmarks)
- OpenAI API key (for OpenAI benchmarks)
- Node.js and npm (for WebLLM benchmarks)
- Chrome or Chromium browser (for WebLLM benchmarks)
To get started:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/edge_inference_benchmarks.git
  cd edge_inference_benchmarks
  ```

- Install the project dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- For Ollama benchmarks, install Ollama:

  ```bash
  # For macOS/Linux:
  curl -fsSL https://ollama.com/install.sh | sh
  # For Windows: Download from https://ollama.com/download
  ```

- For OpenAI benchmarks, make sure your API key is in the openai/.env file:

  ```
  OPEN_AI_KEY=your-api-key-here
  ```
- For WebLLM benchmarks, install Node.js and npm if not already installed:

  ```bash
  # For macOS with Homebrew
  brew install node

  # For Ubuntu/Debian
  curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
  sudo apt-get install -y nodejs
  ```
To run the Ollama benchmark, start the Ollama service:

```bash
ollama serve
```

Then run the benchmark:

```bash
python benchmark/benchmark_runner.py --implementation ollama
```

You can specify a particular model by setting the OLLAMA_MODEL environment variable:

```bash
OLLAMA_MODEL=llama2:7b python benchmark/benchmark_runner.py --implementation ollama
```
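For reference, the sketch below shows the kind of request the Ollama implementation makes against the local Ollama HTTP API and how latency and tokens-per-second can be derived from the response. The endpoint and response fields (response, eval_count, eval_duration) follow Ollama's /api/generate API; the function itself is an illustration, not the project's run_benchmark.py.

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def run_prompt(model: str, prompt: str, max_tokens: int = 128, temperature: float = 0.0):
    """Send one prompt to the local Ollama server and return text plus simple metrics."""
    start = time.time()
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": max_tokens, "temperature": temperature},
        },
        timeout=300,
    )
    resp.raise_for_status()
    data = resp.json()
    latency = time.time() - start
    # eval_count (generated tokens) over eval_duration (nanoseconds) gives tokens/second
    tokens_per_s = data.get("eval_count", 0) / (data.get("eval_duration", 1) / 1e9)
    return data["response"], latency, tokens_per_s

if __name__ == "__main__":
    text, latency, tps = run_prompt("llama2:7b", "Reply with one word: is the sky blue?")
    print(f"{latency:.2f}s, {tps:.1f} tok/s -> {text[:80]}")
```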
For the Llama.cpp benchmark, there are two options.

Option 1: Let the benchmark script start the llama.cpp server:

```bash
python benchmark/benchmark_runner.py --implementation llamacpp
```

The benchmark will automatically start a llama.cpp server with the default model (bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF). You can specify a different model by setting the LLAMACPP_MODEL environment variable:

```bash
LLAMACPP_MODEL=/path/to/your/model.gguf python benchmark/benchmark_runner.py --implementation llamacpp
```

Option 2: Start the server manually:

```bash
# Start the server manually
llama-server -m /path/to/your/model.gguf --host 0.0.0.0 --port 8080

# Run the benchmark
python benchmark/benchmark_runner.py --implementation llamacpp
```
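The automatic startup in Option 1 can be pictured roughly as follows. This is an illustrative sketch, not the project's actual startup code; it assumes the llama-server binary is on your PATH and polls the server's /health endpoint, which reports ready once the model has loaded.

```python
import subprocess
import time
import requests

def start_llamacpp_server(model_path: str, port: int = 8080) -> subprocess.Popen:
    """Launch llama-server and block until its /health endpoint reports ready."""
    proc = subprocess.Popen(
        ["llama-server", "-m", model_path, "--host", "0.0.0.0", "--port", str(port)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    for _ in range(120):  # wait up to ~2 minutes for the model to load
        try:
            if requests.get(f"http://localhost:{port}/health", timeout=1).status_code == 200:
                return proc  # server is up; the benchmark can start sending requests
        except requests.ConnectionError:
            pass
        time.sleep(1)
    proc.terminate()
    raise RuntimeError("llama-server did not become healthy in time")
```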
Run the benchmark using OpenAI's API:

```bash
python benchmark/benchmark_runner.py --implementation openai
```

You can specify a particular model by setting the environment variable:

```bash
OPENAI_MODEL=gpt-4 python benchmark/benchmark_runner.py --implementation openai
```
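Per test case, the OpenAI implementation boils down to a chat completion call. Below is a minimal sketch using the official openai Python client; loading the key with python-dotenv and the exact call parameters are assumptions for illustration, not the project's run_benchmark.py (the key name OPEN_AI_KEY matches the .env step above).

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv
from openai import OpenAI       # pip install openai

# Load the API key from openai/.env, which defines OPEN_AI_KEY (see the setup steps above).
load_dotenv("openai/.env")
client = OpenAI(api_key=os.getenv("OPEN_AI_KEY"))

# OPENAI_MODEL mirrors the environment variable used by the benchmark runner.
model = os.getenv("OPENAI_MODEL", "gpt-4")

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Reply with one word: is the sky blue?"}],
    max_tokens=50,
    temperature=0.0,
)
print(response.choices[0].message.content)
```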
Run the benchmark using WebLLM in a browser with WebGPU acceleration:

```bash
python benchmark/benchmark_runner.py --implementation webllm
```

You can specify a particular model by setting the environment variable:

```bash
WEBLLM_MODEL=Llama-3.1-8B-Instruct-q4f32_1-MLC python benchmark/benchmark_runner.py --implementation webllm
```
You can also specify a different test file:

```bash
python benchmark/benchmark_runner.py --implementation ollama --test-file tests/custom_benchmark.csv
```

Results will be saved as JSON files in the benchmark/results directory. You can specify an output file using the --output parameter:

```bash
python benchmark/benchmark_runner.py --implementation ollama --output my_benchmark_results.json
```

You can compare results from different implementations using the comparison script:

```bash
python benchmark/compare_results.py
```

The script will automatically look for result files in the benchmark/results directory. You can also pass specific result files and choose a different output directory for the comparison charts:

```bash
python benchmark/compare_results.py my_benchmark_results_1.json my_benchmark_results_2.json --output-dir my_comparison
```

This will generate:
- A grouped bar chart for accuracy by test and implementation
- A bar chart for average latency by implementation
- A bar chart for average tokens per second by implementation
- A summary JSON file with the key metrics
All charts will use distinct colors for each implementation for better visual comparison.
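To illustrate the kind of grouped bar chart the comparison script produces, here is a short matplotlib sketch with made-up accuracy numbers; the real values come from the result JSON files, and this is not compare_results.py itself.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical accuracy values purely for illustration.
tests = ["sentiment_1", "summary_1", "reasoning_1"]
accuracy = {
    "ollama":   [1.0, 0.0, 1.0],
    "llamacpp": [1.0, 1.0, 0.0],
    "openai":   [1.0, 1.0, 1.0],
}

x = np.arange(len(tests))
width = 0.8 / len(accuracy)
for i, (impl, scores) in enumerate(accuracy.items()):
    # One distinct color per implementation (matplotlib cycles colors automatically).
    plt.bar(x + i * width, scores, width, label=impl)

plt.xticks(x + width * (len(accuracy) - 1) / 2, tests)
plt.ylabel("Accuracy")
plt.title("Accuracy by test and implementation")
plt.legend()
plt.savefig("accuracy_by_test.png", dpi=150)
```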
To add new test cases, you can:
- Edit an existing test file like benchmark/tests/simple_benchmark.csv
- Create a new test file in the benchmark/tests directory following the same format
The CSV format includes these columns:
- id: Unique identifier for the test
- prompt: The text prompt to send to the model
- max_tokens: Maximum number of tokens to generate
- temperature: Temperature parameter for generation (0.0-1.0)
- expected_class: Category of the expected response
- notes: Additional information about the test
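For example, a row in a test file could look like this (hypothetical values, shown only to illustrate the format; see benchmark/tests/simple_benchmark.csv for the actual test cases):

```csv
id,prompt,max_tokens,temperature,expected_class,notes
sentiment_1,"Classify the sentiment of this review as positive or negative: I love this product!",50,0.0,positive,Basic sentiment classification test
```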
This project is licensed under the MIT License - see the LICENSE file for details.