Releases: boneylizard/llama-cpp-python-cu128-gemma3

llama_cpp_python-0.3.9-cp312-cp312-win_amd64-qwen3_cuda124.whl

07 Jun 03:54
7e2af6d

llama-cpp-python Custom Build for Python 3.12 & CUDA 12.8

A community-provided, up-to-date wheel for high-performance LLM inference on Windows, now supporting Qwen3.

This release provides a custom-built .whl file for llama-cpp-python with CUDA acceleration, compiled to bring modern model support to Python 3.12 environments on Windows (x64) with NVIDIA CUDA 12.8.

It was created to address the gap left by slow or inactive official releases, especially for users who need support for recent Python versions and new GGUF models like Qwen3.

Highlights

  • Python 3.12 Ready — Built natively for cp312.
  • CUDA Acceleration — built with the CUDA 12.8 toolchain; compatible with modern NVIDIA drivers.
  • Multi-GPU Support — verified on a dual-GPU system (RTX 3090 + RTX 4060 Ti) with full layer offloading and GPU splitting.
  • Latest GGUF Compatibility — Built with llama.cpp release b5602, supporting models like Qwen3.
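The GPU splitting noted above can be steered with the `tensor_split` parameter of `llama_cpp.Llama`, which takes per-device proportions. A minimal sketch that derives those proportions from VRAM sizes (the 24 GB / 16 GB figures match the RTX 3090 + 4060 Ti setup above; your ideal ratio may differ):

```python
def vram_proportions(vram_gb):
    """Split model layers across GPUs in proportion to their VRAM."""
    total = sum(vram_gb)
    return [v / total for v in vram_gb]

# RTX 3090 (24 GB) + RTX 4060 Ti (16 GB), as in the verified setup
split = vram_proportions([24, 16])
```

The resulting list can then be passed as `Llama(model_path=..., n_gpu_layers=-1, tensor_split=split)`.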

Build & Verification

  • OS: Windows 11 (x64)
  • Build Tools: Visual Studio 2022 ("Desktop development with C++" workload)
  • CUDA Toolkit: 12.8
  • Python: 3.12.10
  • llama.dll: From llama.cpp release b5602

Test Case

The wheel was successfully used to load the Qwen3-4B-Q5_K_M.gguf model with:

  • n_gpu_layers = -1

All 37 model layers were offloaded and automatically distributed across both GPUs.
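The offload result can also be confirmed programmatically by scanning the loader's verbose output for llama.cpp's `offloaded N/M layers to GPU` message. A minimal sketch (the sample line is illustrative, not captured output):

```python
import re

def offloaded_layers(log_text):
    """Return (offloaded, total) from llama.cpp loader output, or None."""
    m = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_text)
    return (int(m.group(1)), int(m.group(2))) if m else None

sample = "llm_load_tensors: offloaded 37/37 layers to GPU"
counts = offloaded_layers(sample)  # (37, 37) for a full offload
```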

Why This Exists

Many “official” repos rely on a single maintainer and fall behind on updates. This wheel was built to fill that gap.

It is based on the principle that transparency and community validation matter more than an official label. Anyone can reproduce or audit this build process.

Installation

  1. Download the .whl file from this release.

  2. In a Python 3.12 virtual environment, run:

    pip install llama_cpp_python-[version]-cp312-cp312-win_amd64.whl

Credits

Maintained by

Bernard Peter Fitzgerald

llama-cpp-python 0.3.8 (CUDA 12.8, Gemma 3 Support) — Windows x64 Prebuilt Wheel

27 Apr 06:37
1073f58

Summary

This release provides a prebuilt .whl for llama-cpp-python version 0.3.8, compiled for Windows 10/11 (x64) with CUDA 12.8 acceleration enabled.

It includes full Gemma 3 model support (1B, 4B, 12B, 27B) and is based on llama.cpp release b5192 (April 26, 2025).

Highlights

  • Prebuilt for Windows x64: ready to install using pip.
  • Built against CUDA 12.8 for full GPU acceleration.
  • Verified multi-GPU offloading with Google's Gemma 3 open-weight models.
  • No manual Visual Studio or CMake compilation required.
  • Optimized for high-performance local LLM inference.

Installation

Download the .whl file below.

In a Python 3.11 virtual environment (recommended):

pip install llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl
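Before installing, you can confirm the wheel matches your interpreter by inspecting its filename tags; standard wheel names follow the five-component form distribution-version-pythontag-abitag-platform. A minimal sketch:

```python
def wheel_tags(filename):
    """Split a wheel filename into its five standard components:
    (distribution, version, python tag, ABI tag, platform tag)."""
    stem = filename.removesuffix(".whl")
    return tuple(stem.split("-"))

# The Gemma 3 wheel from this release:
tags = wheel_tags("llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl")
# tags[2] == "cp311": this wheel installs only on CPython 3.11
```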

System Requirements:

  • Windows 10 or 11 (64-bit)
  • NVIDIA GPU with CUDA 12.8 compatible drivers
  • Python 3.11 (the wheel is tagged cp311 and will not install on other versions)

Acknowledgments:

Built by Bernard Peter Fitzgerald (@boneylizard). Based on abetlen/llama-cpp-python and ggml-org/llama.cpp.

License: MIT

llama_cpp_python-0.3.16-cp312-cp312-win_amd64

24 Aug 00:52
7e2af6d

RTX 5000 Series–Ready llama-cpp-python Wheel (Python 3.12, Windows)

Status: ✅ CONFIRMED WORKING — No more “invalid resource handle” errors
Wheel: llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
License: MIT (same as upstream llama-cpp-python)

Platform: Windows 10/11 x64
Python: 3.12
CUDA: 12.8 (optimized for Blackwell)


🚀 Performance (Verified on RTX 5090)

  • ~64 tokens/sec on Mistral Small 24B (5-bit quant)
  • Full GPU offload (n_gpu_layers = -1) working as expected
  • ~1.83× faster than RTX 3090 in the same setup (35 tok/s → 64 tok/s)
  • 32 GB VRAM fully utilized (no kernel crashes)

Notes: numbers vary with quant, context, and params; these are representative.


🔧 Why This Works

The wheel forces cuBLAS instead of ggml’s custom CUDA kernels.
On RTX 5090 (Blackwell, sm_120), ggml’s custom kernels can trigger:
“CUDA error: invalid resource handle”.

cuBLAS is stable on 5090 and avoids those kernel issues.

Key CMake flags used:
-DGGML_CUDA=ON
-DGGML_CUDA_FORCE_CUBLAS=1 # Use cuBLAS instead of custom kernels
-DGGML_CUDA_NO_PINNED=1 # Avoid pinned memory issues with GDDR7
-DGGML_CUDA_F16=0 # Disable problematic FP16 code paths
-DCMAKE_CUDA_ARCHITECTURES=all-major # Ensure sm_120 is included
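For reference, these flags assemble into the single CMAKE_ARGS string that pip's build backend consumes when compiling llama-cpp-python from source (the same string appears in the build section below). A minimal sketch:

```python
# Assemble the CMake flags listed above into one CMAKE_ARGS string.
flags = {
    "GGML_CUDA": "ON",                        # enable the CUDA backend
    "GGML_CUDA_FORCE_CUBLAS": "1",            # cuBLAS instead of custom kernels
    "GGML_CUDA_NO_PINNED": "1",               # avoid pinned-memory issues
    "GGML_CUDA_F16": "0",                     # disable FP16 code paths
    "CMAKE_CUDA_ARCHITECTURES": "all-major",  # include sm_120 (Blackwell)
}
cmake_args = " ".join(f"-D{k}={v}" for k, v in flags.items())
```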


📋 Requirements

  • NVIDIA RTX 5090 (or other Blackwell GPU)
  • NVIDIA drivers 570.86.10+
  • CUDA Toolkit 12.8
  • Python 3.12
  • Windows 10/11 x64
  • Microsoft Visual C++ Redistributable 2015–2022

🛠️ Installation

  1. Download the wheel:
    llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl

  2. Install:
    pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl


✅ Quick Verification

from llama_cpp import Llama

# Full GPU offload on 5090
llm = Llama(
    model_path="your_model.gguf",
    n_gpu_layers=-1,   # full GPU
    n_ctx=2048,
    verbose=True
)

out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])

What to look for in stdout:

  • CUDA device assignment lines (e.g., using CUDA:0)
  • VRAM allocations without any “invalid resource handle” errors

🏗️ Build It Yourself (Advanced)

Prereqs: CUDA 12.8, Visual Studio Build Tools 2022 (with C++), Python 3.12

mkdir C:\wheels
cd C:\wheels

set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major

pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose

Build time: ~10 minutes on a modern CPU
Wheel size: ~231 MB (larger due to cuBLAS inclusion)


🐛 Troubleshooting

“Invalid resource handle” errors

  • This wheel is built specifically to avoid these errors. If you still see them, verify:
    • CUDA 12.8 is installed
    • Latest NVIDIA drivers are installed
    • No other CUDA apps are interfering

CPU fallback

  • If GPU isn’t detected, check nvidia-smi and ensure CUDA_VISIBLE_DEVICES isn’t set.
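The CPU-fallback checks above can be scripted; a minimal sketch (the function name and messages are illustrative, not part of llama-cpp-python):

```python
import os
import shutil

def gpu_env_diagnostics(env=None):
    """Return likely reasons the GPU is not being detected."""
    env = os.environ if env is None else env
    problems = []
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is not None and visible.strip() in ("", "-1"):
        problems.append("CUDA_VISIBLE_DEVICES hides all GPUs")
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not on PATH (driver may be missing)")
    return problems
```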

🙏 Credits

Built using the open-source llama-cpp-python project by abetlen and the llama.cpp project by ggml-org.
This wheel provides RTX 5090 compatibility by configuring cuBLAS fallback; it is not an official upstream release.

  • For issues with this specific wheel: open an issue here (this repo/thread).
  • For general llama-cpp-python issues: use the official repository.

Finally — RTX 5000 series owners can use their flagship GPU for local LLM inference without crashes! 🎉