A simple Python wrapper for running llama.cpp server binaries directly, bypassing outdated Python bindings.
llama-cpp-python often lags behind the official llama.cpp C++ implementation by weeks or months. New model architectures (such as Qwen3-VL or Gemma3) are already supported in llama.cpp but fail in the Python bindings with:

```
unknown model architecture: 'qwen3vl'
```
Use the official llama.cpp server binary directly from Python. This wrapper provides:
- OpenAI-compatible API endpoints
- Simple Python interface
- Full control over server lifecycle
- Support for ANY model architecture llama.cpp supports
First, build the llama.cpp server binary:

```bash
# Run the build script
./scripts/build_llama_cpp.sh

# Or manually:
git clone https://github.com/ggerganov/llama.cpp /tmp/llama-cpp-standalone
cd /tmp/llama-cpp-standalone
mkdir build && cd build

# With CUDA (recommended)
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j

# Without CUDA (CPU only)
cmake .. -DBUILD_SHARED_LIBS=ON
cmake --build . --config Release -j
```
Then use it from Python:

```python
from llama_cpp_standalone import LlamaCppServer

# Start server
server = LlamaCppServer("/path/to/llama-server")
server.start(
    model_path="/path/to/model.gguf",
    port=8080,
    n_gpu_layers=-1,  # Full GPU offload
    n_ctx=4096
)
# Use with OpenAI Python client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
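# Streaming works against the same OpenAI-compatible endpoint. This is a hedged
# sketch: stream=True is the standard OpenAI client option and llama.cpp's server
# streams chat completions; nothing below is specific to this wrapper.
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()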
# Stop when done
server.stop()
```

- ✅ Works with ANY model architecture llama.cpp supports
- ✅ OpenAI-compatible API (drop-in replacement)
- ✅ Full CUDA/Metal/OpenCL support
- ✅ Multimodal support (vision models with mmproj)
- ✅ Context management and health checks
- ✅ Automatic process lifecycle management (see the sketch below)
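Since `start()` launches llama-server as a separate process, it is worth guaranteeing that `stop()` runs even when something in between fails. A minimal sketch, using only the `start()`/`stop()` calls shown above (the `try`/`finally` wrapping is our own pattern, not a wrapper feature):

```python
from llama_cpp_standalone import LlamaCppServer

server = LlamaCppServer("/path/to/llama-server")
try:
    server.start(model_path="/path/to/model.gguf", port=8080, n_gpu_layers=-1)
    # ... talk to http://localhost:8080/v1 with the OpenAI client as in the quick start ...
finally:
    server.stop()  # make sure the llama-server process is terminated
```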
Vision models take an mmproj projector alongside the GGUF model:

```python
server.start(
    model_path="/path/to/qwen-vl.gguf",
    mmproj_path="/path/to/mmproj.gguf",  # Vision projector
    port=8080,
    n_gpu_layers=-1
)
```

To pin the server to specific GPUs, set CUDA_VISIBLE_DEVICES before starting it:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
server.start(
    model_path="/path/to/model.gguf",
    port=8080
)
```

Any flag accepted by the llama-server binary can be forwarded through extra_args:

```python
server.start(
    model_path="/path/to/model.gguf",
    port=8080,
    extra_args=["--rope-freq-base", "10000", "--rope-freq-scale", "0.5"]
)
```

Requirements:

- Python 3.8+
- requests library (`pip install requests`)
- Built llama.cpp server binary
| Feature | llama-cpp-python | This Wrapper |
|---|---|---|
| Model support | Lags behind C++ | Same day as llama.cpp |
| Installation | Pip (large download) | Just copy files |
| GPU support | Version dependent | Full llama.cpp support |
| Architecture | Python bindings | Direct binary |
| Updates | Wait for PyPI | Build llama.cpp anytime |
Troubleshooting:

- Server won't start: check the binary path (`which llama-server`), test it manually (`/path/to/llama-server --help`), check that the model path exists, and verify the CUDA setup (if using GPU); if the server starts but requests still fail, see the readiness-check sketch below
- Unknown model architecture: rebuild llama.cpp from the latest master and verify the model file is valid GGUF format
- Port already in use: pass a different port to `start()`, or kill the existing server (`lsof -ti:8080 | xargs kill`)
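If the server process starts but the first requests fail, it may simply still be loading the model. A hedged readiness check, assuming llama.cpp's default `/health` route (200 once the model is loaded) and the `requests` dependency listed above; `wait_until_ready` is an illustrative helper, not part of the wrapper:

```python
import time
import requests

def wait_until_ready(url="http://localhost:8080/health", timeout=120):
    """Poll the server's health endpoint until it reports ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # llama.cpp's server typically answers 503 while loading and 200 when ready
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # process not accepting connections yet
        time.sleep(1)
    return False
```

Call it right after `server.start()` and before sending the first completion request.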
We welcome contributions! Please:
- Test with different model architectures
- Add support for new llama.cpp features
- Improve documentation
- Share your use cases
MIT License - See LICENSE file
Created by Gregor Koch (@cronos3k)
- llama.cpp - The amazing C++ implementation
- Community members who identified the Python bindings lag issue
The llama.cpp project moves fast, and Python bindings take time to catch up. This wrapper bridges the gap, letting you use cutting-edge features immediately while keeping a clean Python interface.
Perfect for:
- Testing new model architectures
- Production deployments needing stability
- Projects requiring specific llama.cpp versions
- Anyone frustrated with binding version mismatches