Local LLM inference on AMD GPUs and Apple Silicon — no ROCm, no MLX, one binary.
35B parameter model running locally — Zig + Vulkan/Metal, no ROCm, no MLX
| Platform | GPU | Backend | Status |
|---|---|---|---|
| Linux | AMD RDNA4 (RX 9070, AI PRO R9700) | Vulkan | Primary — hand-tuned shaders |
| Linux | AMD RDNA3 (RX 7900 XTX, etc.) | Vulkan | Supported |
| macOS | Apple Silicon (M1, M2, M3, M4, M5) | Metal | Supported — native MSL shaders |
Works the same on Linux (AMD GPU) and macOS (Apple Silicon):
git clone https://github.com/zolotukhin/zinc.git
cd zinc
zig build -Doptimize=ReleaseFast
# On RDNA4 Linux, enable cooperative matrix
export RADV_PERFTEST=coop_matrix # skip on macOS
# Verify GPU, shaders, and runtime
./zig-out/bin/zinc --check
# See which models fit this machine
./zig-out/bin/zinc model list
# Download a model
./zig-out/bin/zinc model pull llama31-8b-q4k-m
# Run a prompt (--chat applies the model's chat template for instruct models)
./zig-out/bin/zinc --model-id llama31-8b-q4k-m --prompt "Hello" --chat
# Or open the chat UI in your browser
./zig-out/bin/zinc chat
The server exposes the built-in chat UI at `/` and an OpenAI-compatible API at `/v1`.
- Single-stream CLI inference on the validated Qwen3.5 models listed below
- OpenAI-compatible `/v1` API with streaming
- Built-in browser chat UI with thinking mode support
- Managed model workflow: `list`, `pull`, `use`, `active`, `rm`
- `zinc chat` — start server and open browser in one command
- AMD path: RDNA4-tuned Vulkan shaders (wave64, cooperative matrix, fused ops)
- Apple Silicon path: native Metal shaders (MSL, zero-copy mmap, simdgroup ops)
- Auto-detection: ZINC picks the right backend (Vulkan or Metal) at build time
- Continuous batching and multi-tenant serving are still roadmap work
- The supported-model list is intentionally narrow
- Apple Silicon performance tuning is ongoing (RDNA4 path is more mature)
Consumer GPUs have the hardware for fast LLM inference — bandwidth, compute, VRAM — but the software doesn't use it:
- AMD RDNA3/RDNA4: ROCm doesn't support them. vLLM requires ROCm. llama.cpp's Vulkan path has no RDNA-specific tuning. These $500–1500 cards sit idle.
- Apple Silicon: MLX and llama.cpp Metal work, but leave performance on the table. No engine is built from scratch around Metal's strengths (unified memory, simdgroup ops, zero-copy mmap).
ZINC builds an inference engine tuned for the hardware you actually have.
Hand-tuned shaders for each platform. On AMD: wave64, cooperative matrix, architecture-aware tiling via Vulkan compute. On Apple Silicon: native MSL kernels with simdgroup reductions, zero-copy model loading, and Metal pipeline tuning. Not a generic backend that happens to run — built to extract real performance from each GPU.
One binary, no driver stack. No ROCm, no CUDA, no Python. Build with Zig, point at a GGUF, run inference. The right backend (Vulkan or Metal) is selected automatically at build time.
Drop-in compatible. OpenAI-compatible API, built-in chat UI, managed model catalog. Point your existing client at it and it works.
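To make "drop-in" concrete, here is a minimal sketch of the request body an OpenAI-style client POSTs to a local ZINC server. The model id comes from this README's catalog; the endpoint path follows the standard OpenAI chat-completions convention, and the port is illustrative:

```python
import json

# Sketch of an OpenAI-style chat request aimed at a local ZINC server
# (POST to http://localhost:8080/v1/chat/completions once the server is up).
# The model id is from the managed catalog; the rest is the standard
# OpenAI chat-completions shape, which is what "drop-in compatible" means.
payload = {
    "model": "llama31-8b-q4k-m",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}
body = json.dumps(payload)
print(body)
```

Any OpenAI SDK can send this by setting its `base_url` to `http://localhost:8080/v1`; see the Serving HTTP API docs for actual response shapes.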
The table below lists the exact GGUFs ZINC currently supports end-to-end, not a broader wishlist.
| Model | GGUF | AMD RDNA4 | Apple Silicon |
|---|---|---|---|
| Llama 3.1 8B Instruct | Q4_K_M | — | ~10 tok/s |
| Qwen3 8B | Q4_K_M | — | ~8 tok/s |
| Qwen3.5 2B | Q4_K_M | ~27 tok/s | ~17 tok/s |
| Qwen3.5 35B-A3B UD | Q4_K_XL | ~38 tok/s | needs 24+ GB unified |
- AMD: Radeon AI PRO R9700 (RDNA4, 32 GB), `RADV_PERFTEST=coop_matrix`
- Apple Silicon: tested on M1 Max 32 GB
- All numbers: single-stream CLI decode, `zig build -Doptimize=ReleaseFast`
- Latest validation: 2026-04-02
- Use `zinc model list --json` for machine-readable model metadata
Quantization formats: Q4_K, Q5_K, Q6_K, Q8_0, F16
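As a rough fit check, weight footprint scales with parameter count times bits per weight. The bits-per-weight values below are approximate GGUF block averages (an assumption for illustration, not ZINC's exact accounting), and KV cache plus activations come on top:

```python
# Approximate weight footprint: params * bits_per_weight / 8 bytes.
# Bits-per-weight are rough GGUF block averages (assumption, not ZINC's
# exact numbers); KV cache and activations need additional VRAM on top.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.56, "Q8_0": 8.5, "F16": 16.0}

def weight_gib(params: float, quant: str) -> float:
    """Approximate weight-only footprint in GiB."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 2**30

# A 35B model at Q4_K is around 18 GiB of weights alone, which is why the
# 35B entry above needs a 24+ GB unified-memory Mac or a 32 GB card.
print(f"{weight_gib(35e9, 'Q4_K'):.1f} GiB")
```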
| Tool | Install |
|---|---|
| Zig 0.15.2+ | ziglang.org/download |
| Vulkan loader + tools | `apt install libvulkan-dev vulkan-tools` (Linux) or `brew install vulkan-loader vulkan-headers` (macOS) |
| glslc (Linux only) | `apt install glslc` |
| Bun (tests and docs site) | `curl -fsSL https://bun.sh/install \| bash` |
Important: On Linux with RDNA4, newer glslc releases can cause a large regression. Use the system package version.
git clone https://github.com/zolotukhin/zinc.git
cd zinc
# Build the CLI and server
# macOS: shaders are skipped
# Linux: shaders are compiled automatically
zig build -Doptimize=ReleaseFast
The binary is placed in `zig-out/bin/zinc`. Compiled SPIR-V shaders go to `zig-out/share/zinc/shaders/`.
Use `ReleaseFast` for any performance measurement or server deployment. Plain `zig build` is not a fair throughput baseline.
Before your first prompt, run --check. The target state is a clean READY [OK] run with no warnings.
# General machine + Vulkan + shader preflight
./zig-out/bin/zinc --check
# Recommended on RDNA4 before measuring performance
export RADV_PERFTEST=coop_matrix
./zig-out/bin/zinc --check
# Check one exact GGUF file
./zig-out/bin/zinc --check -m /path/to/model.gguf
# Check one managed catalog model by id
./zig-out/bin/zinc --check --model-id qwen35-35b-a3b-q4k-xl
`--check` verifies:
- host environment and RDNA4-specific shell hints
- compiled shader assets
- Vulkan device discovery and the selected GPU
- GGUF metadata when you pass `-m /path/to/model.gguf`
- managed-model compatibility when you pass `--model-id <id>`
- estimated single-GPU VRAM fit for the current runtime
If --check reports warnings, treat them as setup work to finish before judging runtime behavior. For the full walkthrough, see Running ZINC and Hardware requirements.
The README keeps the supported-model table narrow on purpose and leaves the full managed-model workflow to the docs.
Use the website docs for model selection, cache management, and API details.
./zig-out/bin/zinc -m /path/to/model.gguf --prompt "The capital of France is"
Start the server — no `--prompt` flag means server mode:
./zig-out/bin/zinc -m /path/to/model.gguf -p 8080
Then open http://localhost:8080/ in your browser for the built-in chat interface.
ZINC exposes an OpenAI-compatible API at /v1.
For the actual request examples and SDK usage, use the website docs instead of the README:
- Running ZINC for CLI, server mode, and first-run examples
- Serving HTTP API for `curl`, OpenAI SDK examples, endpoint behavior, and response shapes
The built-in chat UI is served at `/`, the API is under `/v1`, and the health endpoint is `/health`.
For building, testing, debugging, benchmarking, graph export, and contributing — see the Development Guide (web version).
Quick start:
zig build -Doptimize=ReleaseFast # build
zig build test # run all tests
./zig-out/bin/zinc --check # verify GPU/runtime setup
See also: CONTRIBUTING.md · Code of Conduct
All numbers below were measured on AMD Radeon AI PRO R9700 (RDNA4, 32 GB, 576 GB/s) on a clean RDNA4 node using `RADV_PERFTEST=coop_matrix` and `zig build -Doptimize=ReleaseFast`.
| Path | Shape | Result |
|---|---|---|
| Qwen3.5-35B-A3B-UD CLI plain decode | `--prompt "The capital of France is"`; 128 generated tokens | 37.95 tok/s, 26.3 ms/tok |
| Qwen3.5-2B-Q4_K_M CLI plain decode | `--prompt "The capital of France is"`; 128 generated tokens | 26.71 tok/s, 37.4 ms/tok |
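The two result columns are reciprocals of each other; a quick consistency check on the table's numbers:

```python
# ms/tok is the reciprocal of tok/s: ms_per_tok = 1000 / tok_per_s.
def ms_per_tok(tok_per_s: float) -> float:
    return 1000.0 / tok_per_s

# Both benchmark rows agree with this within rounding.
print(f"{ms_per_tok(37.95):.1f} ms/tok")  # 35B row
print(f"{ms_per_tok(26.71):.1f} ms/tok")  # 2B row
```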
For reference, the current llama.cpp baseline on the same node and 35B model is about 107 tok/s decode.
- The clean ReleaseFast decode path is approaching 40 tok/s on the 35B model.
- The 2B model is currently slower than the 35B MoE model on this node, which means today's bottleneck is not just "smaller model = faster"; kernel shape, architecture mix, and decode-path efficiency matter more than parameter count alone.
At 33.58 tok/s, the modeled full-token decode bandwidth is about 112.5 GB/s, or 19.5% of the card's 576 GB/s peak.
That is not a contradiction. Single-stream decode is not a pure DRAM-streaming workload. The remaining headroom is dominated by serialized medium/small kernels and graph depth, not by large host-side stalls. If the goal is to drive memory bandwidth materially higher than this, the next lever is concurrent decode / batching, not expecting one stream to saturate all DRAM bandwidth on its own.
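The 19.5% figure decomposes as follows (numbers copied from the text above; a sanity-check sketch, not ZINC's internal model):

```python
# Modeled bytes moved per decoded token, and the implied fraction of the
# R9700's 576 GB/s peak DRAM bandwidth (figures taken from the text).
tok_per_s = 33.58
modeled_gb_per_s = 112.5
peak_gb_per_s = 576.0

gb_per_token = modeled_gb_per_s / tok_per_s      # ~3.35 GB touched per token
utilization = modeled_gb_per_s / peak_gb_per_s   # ~19.5% of peak

print(f"{gb_per_token:.2f} GB/token, {utilization:.1%} of peak")
```

Batching raises utilization by amortizing that per-token weight traffic across several concurrent streams, which is why concurrency rather than single-stream tuning is the next lever.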
The older March 27–29 optimization logs in .zinc_optimize/ were useful for correctness and early performance work, but many of the old 7–16 tok/s figures came from debug-heavy or non-ReleaseFast builds. The snapshot above is the current clean baseline to compare against.
| Component | Status |
|---|---|
| Vulkan infrastructure | Done |
| GGUF parser + model loader | Done |
| GPU detection (RDNA3/4) | Done |
| Native BPE tokenizer (from GGUF) | Done |
| GLSL compute shaders (16) | Done |
| Compute graph + architecture builders | Done |
| Forward pass (decode loop) | Working — 33.58 tok/s clean CLI on Qwen3.5-35B-A3B-UD |
| GPU SSM shaders + cmd batching | Done — clean ReleaseFast path is above 30 tok/s |
| HTTP server + OpenAI API | Done — 35B raw API ~33.5 tok/s, 2B raw API ~21.9 tok/s, reasoning chat still slower |
| Continuous batching | Phase 4 |
| TurboQuant KV compression | Phase 5 |
Validated on AMD Radeon AI PRO R9700 (RDNA4): Vulkan 1.3 init, GGUF parsing, 21 GB model loaded to VRAM, 723-node MoE graph built, coherent inference output verified against CPU reference.
The next push is from "raw decode above 30" to "reasoning workloads above 30 and better aggregate GPU utilization":
- Close the chat/reasoning gap — benchmark longer chat prompts, template overhead, stop behavior, and TTFT so `/v1/chat/completions` tracks closer to the raw decode path.
- Make profiling representative — `--profile` is still too intrusive in `ReleaseFast`, so it is not yet the right leaderboard tool for apples-to-apples throughput claims.
- Reduce hot-path descriptor churn — reuse bindings and trim per-token Vulkan setup in the decode loop.
- Tune the actual hot shapes — focus on medium/small decode kernels, not just the vocab projection.
- Increase aggregate throughput with batching — if the goal is to drive bandwidth utilization much higher, concurrency is the right lever.
MIT
