A simple Python wrapper for running llama.cpp server binaries directly, bypassing outdated Python bindings.
llama-cpp-python often lags behind the official llama.cpp C++ implementation by weeks or months. New model architectures (such as Qwen3-VL or Gemma3) are already supported in llama.cpp but fail in the Python bindings with:

```
unknown model architecture: 'qwen3vl'
```
Use the official llama.cpp server binary directly from Python. This wrapper provides:
- OpenAI-compatible API endpoints
- Simple Python interface
- Full control over server lifecycle
- Support for ANY model architecture llama.cpp supports
First, build the llama.cpp server binary:

```bash
# Run the build script
./scripts/build_llama_cpp.sh

# Or manually:
git clone https://github.com/ggerganov/llama.cpp /tmp/llama-cpp-standalone
cd /tmp/llama-cpp-standalone
mkdir build && cd build

# With CUDA (recommended)
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j

# Without CUDA (CPU only)
cmake .. -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j
```
Then use it from Python:

```python
from llama_cpp_standalone import LlamaCppServer

# Start server
server = LlamaCppServer("/path/to/llama-server")
server.start(
    model_path="/path/to/model.gguf",
    port=8080,
    n_gpu_layers=-1,  # Full GPU offload
    n_ctx=4096
)
# Use with OpenAI Python client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
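# Streaming works against the same OpenAI-compatible endpoint. This is a hedged
# sketch: stream=True is the standard OpenAI client option and llama.cpp's server
# streams chat completions; nothing below is specific to this wrapper.
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()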
# Stop when done
server.stop()
```

- ✅ Works with ANY model architecture llama.cpp supports
- ✅ OpenAI-compatible API (drop-in replacement)
- ✅ Full CUDA/Metal/OpenCL support
- ✅ Multimodal support (vision models with mmproj)
- ✅ Context management and health checks
- ✅ Automatic process lifecycle management (see the sketch below)
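Since `start()` launches llama-server as a separate process, it is worth guaranteeing that `stop()` runs even when something in between fails. A minimal sketch, using only the `start()`/`stop()` calls shown above (the `try`/`finally` wrapping is our own pattern, not a wrapper feature):

```python
from llama_cpp_standalone import LlamaCppServer

server = LlamaCppServer("/path/to/llama-server")
try:
    server.start(model_path="/path/to/model.gguf", port=8080, n_gpu_layers=-1)
    # ... talk to http://localhost:8080/v1 with the OpenAI client as in the quick start ...
finally:
    server.stop()  # make sure the llama-server process is terminated
```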
Vision models take an mmproj projector alongside the GGUF model:

```python
server.start(
    model_path="/path/to/qwen-vl.gguf",
    mmproj_path="/path/to/mmproj.gguf",  # Vision projector
    port=8080,
    n_gpu_layers=-1
)
```

To pin the server to specific GPUs, set CUDA_VISIBLE_DEVICES before starting it:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
server.start(
    model_path="/path/to/model.gguf",
    port=8080
)
```

Any flag accepted by the llama-server binary can be forwarded through extra_args:

```python
server.start(
    model_path="/path/to/model.gguf",
    port=8080,
    extra_args=["--rope-freq-base", "10000", "--rope-freq-scale", "0.5"]
)
```

Requirements:

- Python 3.8+
- requests library (`pip install requests`)
- Built llama.cpp server binary
| Feature | llama-cpp-python | This Wrapper |
|---|---|---|
| Model support | Lags behind C++ | Same day as llama.cpp |
| Installation | Pip (large download) | Just copy files |
| GPU support | Version dependent | Full llama.cpp support |
| Architecture | Python bindings | Direct binary |
| Updates | Wait for PyPI | Build llama.cpp anytime |
Troubleshooting:

- Server won't start: check the binary path (`which llama-server`), test it manually (`/path/to/llama-server --help`), check that the model path exists, and verify the CUDA setup (if using GPU); if the server starts but requests still fail, see the readiness-check sketch below
- Unknown model architecture: rebuild llama.cpp from the latest master and verify the model file is valid GGUF format
- Port already in use: pass a different port to `start()`, or kill the existing server (`lsof -ti:8080 | xargs kill`)
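If the server process starts but the first requests fail, it may simply still be loading the model. A hedged readiness check, assuming llama.cpp's default `/health` route (200 once the model is loaded) and the `requests` dependency listed above; `wait_until_ready` is an illustrative helper, not part of the wrapper:

```python
import time
import requests

def wait_until_ready(url="http://localhost:8080/health", timeout=120):
    """Poll the server's health endpoint until it reports ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # llama.cpp's server typically answers 503 while loading and 200 when ready
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # process not accepting connections yet
        time.sleep(1)
    return False
```

Call it right after `server.start()` and before sending the first completion request.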
We welcome contributions! Please:
- Test with different model architectures
- Add support for new llama.cpp features
- Improve documentation
- Share your use cases
MIT License - See LICENSE file
Created by Gregor Koch (@cronos3k)
- llama.cpp - The amazing C++ implementation
- Community members who identified the Python bindings lag issue
The llama.cpp project moves fast, and Python bindings take time to catch up. This wrapper bridges the gap, letting you use cutting-edge features immediately while keeping a clean Python interface.
Perfect for:
- Testing new model architectures
- Production deployments needing stability
- Projects requiring specific llama.cpp versions
- Anyone frustrated with binding version mismatches