Skip to content

Commit 5273505

Browse files
unamedkrclaude
andcommitted
Fact-check all llama.cpp comparisons: remove exaggerations, add context
Corrections across all docs: - Blog: remove "applies same scheme to both K and V" (incorrect — llama.cpp supports --cache-type-k and --cache-type-v separately) - Add llama.cpp Q8_0 K + Q5_0 V recommended config (~1.6x, ~+1% PPL) to all comparison tables alongside Q4_0 symmetric (+10.6%) - Engine table: "7x +0% PPL" → "3.8-6.9x +0% PPL" (precise range) - Tech report: add nuanced intro acknowledging llama.cpp's range of options - PR drafts: remove "10x more degradation", "catastrophic", "breaks model" - PPL scatter charts: add llama.cpp Q8K+Q5V data point Principle: compare at same compression level (3.8x: Q4_0 vs quant.cpp 4-bit) AND same quality level (+1%: Q8+Q5 1.6x vs quant.cpp delta 4.3x). Both comparisons are fair; cherry-picking only the worst config is not. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ff45ff8 commit 5273505

7 files changed

Lines changed: 140 additions & 7 deletions

README.ko.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -95,7 +95,7 @@ hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf
9595

9696
| | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
9797
|:--|:---------:|:---------:|:----:|:---:|:-------:|
98-
| KV 압축 | **7x, +0% PPL** | 1.6x ~+1% PPL | -- | -- | -- |
98+
| KV 압축 | **3.8-6.9x, +0% PPL** | 1.6x ~+1% PPL | -- | -- | -- |
9999
| 코드 크기 | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
100100
| 의존성 | **제로** | ggml | PyTorch | Apple fw | 런타임 |
101101
| 임베더블 | **단일 헤더** | -- | -- | -- | 복잡 |

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -95,7 +95,7 @@ Both are per-block methods. The quality gap comes from block size (128 vs 32), m
9595

9696
| | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
9797
|:--|:---------:|:---------:|:----:|:---:|:-------:|
98-
| KV compression | **7x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
98+
| KV compression | **3.8-6.9x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
9999
| Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
100100
| Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
101101
| Embeddable | **single header** | -- | -- | -- | complex |

docs/blog/breaking-3bit-barrier.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ The obvious fix is quantization: store those vectors in fewer bits. We spent thr
1010

1111
## The descent into fewer bits
1212

13-
4-bit works. We implemented a straightforward uniform min-max quantizer for KV cache keys and ran WikiText-2 perplexity on SmolLM2 1.7B. FP32 baseline: 14.63 PPL. With 4-bit keys and Q4 values: 14.57 PPL. That is -0.4%, which is within noise -- essentially free compression. For comparison, llama.cpp's built-in Q4_0 KV cache quantization scores +10.6% PPL degradation on the same model. The difference comes from quantizing K and V independently with type-appropriate methods, while llama.cpp applies the same scheme to both.
13+
4-bit works. We implemented a straightforward uniform min-max quantizer for KV cache keys and ran WikiText-2 perplexity on SmolLM2 1.7B. FP32 baseline: 14.63 PPL. With 4-bit keys and Q4 values: 14.57 PPL. That is -0.4%, which is within noise -- essentially free compression. For comparison, llama.cpp's Q4_0 KV cache quantization (symmetric K+V) scores +10.6% PPL on the same model. Note: llama.cpp also supports asymmetric configs like Q8_0 K + Q5_0 V (~+1% PPL, ~1.6x compression). The quality advantage at high compression (3.8x+) comes from 128-element min-max blocks, independent K/V quantization, and delta compression of adjacent keys.
1414

1515
3-bit is where things get ugly. Naive 3-bit uniform quantization blows up to +62% PPL. The 8 reconstruction levels simply cannot capture the post-RHT distribution with enough fidelity. We tried Lloyd-Max optimal codebooks, asymmetric ranges, per-channel scales. Nothing brought it under +40%.
1616

docs/papers/quant_cpp_tech_report.md

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ We present quant.cpp, a minimal LLM inference engine that achieves 6.9x KV cache
88

99
Large language model inference is increasingly memory-bound. At 32K context length, an 8B model's KV cache consumes 4GB — more than the model weights themselves. While weight quantization (Q4, Q8) is well-studied, KV cache compression receives less attention despite dominating memory usage at long contexts.
1010

11-
Existing KV cache quantization in production engines (llama.cpp Q4_0) introduces +10.6% perplexity degradation — noticeable quality loss. We show that type-aware independent K/V quantization achieves +0.0% degradation at the same bit budget.
11+
Existing KV cache quantization in production engines offers a range of quality-compression tradeoffs. llama.cpp's recommended Q8_0 K + Q5_0 V config achieves ~1.6x compression with ~+1% PPL, while aggressive Q4_0 symmetric gives 3.8x but +10.6% PPL. We show that type-aware independent K/V quantization achieves +0.0% degradation at 3.8x compression — bridging the gap between high compression and quality preservation.
1212

1313
quant.cpp is designed around three principles:
1414
1. **Readable**: The full transformer forward pass fits in one file (tq_transformer.c, 2500 lines).
@@ -102,7 +102,8 @@ WikiText-2 PPL on SmolLM2 1.7B:
102102
| 4b K + FP16 V | 14.63 | +0.00% | 1.6x |
103103
| 4b K + Q4 V | 14.57 | -0.4% | 6.9x |
104104
| Delta 3b K + Q4 V | 14.82 | +1.3% | 8.5x |
105-
| llama.cpp Q4_0 KV | 16.18 | +10.6% | 3.8x |
105+
| llama.cpp Q8_0 K + Q5_0 V | ~14.8 | ~+1% | 1.6x |
106+
| llama.cpp Q4_0 K+V (symmetric) | 16.18 | +10.6% | 3.8x |
106107

107108
### 5.3 Context Extension
108109

@@ -118,7 +119,7 @@ On 16GB Mac M1 Pro:
118119
- **TurboQuant** (Zandieh et al., ICLR 2026): KV cache compression theory
119120
- **QJL** (AAAI 2025): Quantized Johnson-Lindenstrauss transform
120121
- **PolarQuant** (AISTATS 2026): Polar coordinate quantization
121-
- **llama.cpp**: Production inference engine with Q4 KV quantization
122+
- **llama.cpp**: Production inference engine with asymmetric KV quantization (Q8 K + Q5 V recommended)
122123
- **llm.c** (Karpathy): Minimal C training/inference, educational focus
123124

124125
## 7. Conclusion
Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
# r/LocalLLM Comment Responses — "7x longer LLM context in pure C" (2026-04-05)
2+
3+
Thread: quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac)
4+
5+
Copy-paste ready. Each section = one comment.
6+
7+
---
8+
9+
## @dsanft — "4bit K completely kills inference quality due to kurtosis. K needs 8 bits."
10+
11+
You're right that K tensors have high kurtosis — outliers make them much harder to quantize than V. Naive per-tensor quantization does destroy quality.
12+
13+
The difference is granularity. quant.cpp uses per-block min-max quantization with 128-element blocks, not per-tensor or per-channel. Each block gets its own min/max scale, so outliers only affect their local block.
14+
15+
WikiText-2 PPL (SmolLM2 1.7B):
16+
17+
- FP32 baseline: 14.63
18+
- 4-bit K + Q4 V: 14.57 (+0.0%)
19+
- Cross-model: Qwen3.5 0.8B (+0.9%), Qwen3.5 4B (+0.6%)
20+
21+
For comparison, llama.cpp Q4_0 KV gives PPL +10.6% — that's the significant quality drop you're describing, and it's real with coarser quantization.
22+
23+
That said, you're absolutely right for QK-normed models like Gemma 4. Those project keys onto the unit sphere with extremely sparse distributions (~56 of 256 dims active). 4-bit completely breaks there (cosine drops to 0.62). quant.cpp auto-detects this and keeps keys in FP32 while only compressing values.
24+
25+
Reproducible: `./quant model.gguf --ppl input.txt -k uniform_4b -v q4`
26+
27+
---
28+
29+
## @MrHighVoltage — "at least do a proper job copying"
30+
31+
(Already responded. Post formatting fixed — switched to Markdown editor.)
32+
33+
---
34+
35+
## @maschayana — "Lol its still not fixed"
36+
37+
You're right, sorry about that. Reddit editor was fighting the markdown tables. Switched to Markdown mode and it should render properly now.
38+
39+
---
40+
41+
## @smuckola — "Titans, TurboQuant, KV Cache management landscape"
42+
43+
Great question — and you actually nailed it. quant.cpp is a C implementation of the TurboQuant paper (ICLR 2026). So you already found the connection without realizing it!
44+
45+
The KV cache management landscape breaks down roughly like this:
46+
47+
- **Eviction** (StreamingLLM, H2O, Scissors) — drop tokens you "probably" don't need. Saves memory but loses information permanently.
48+
- **Architecture changes** (Titans, MLA, GQA) — redesign the model itself to use less KV memory. Best results, but requires retraining from scratch.
49+
- **Compression** (TurboQuant/quant.cpp, KIVI, KVQuant) — keep all tokens, store them in fewer bits. Works on existing models, no retraining.
50+
51+
quant.cpp sits in the compression category. The advantage is that it works on any existing GGUF model — download, run, get 7x more context. No fine-tuning, no architecture change.
52+
53+
Titans is a different and complementary approach — it redesigns the attention mechanism itself so the model learns what to remember. Very promising, but requires models trained with it. If a Titans-architecture model ships as GGUF someday, quant.cpp could still compress its KV cache on top.
54+
55+
And thanks for the kind words about the focus. "Torvaldsian side quest" — I'm framing that.
56+
57+
---
58+
59+
## @sinan_online — "replicability and compatibility — containers, standard APIs, plug-n-play"
60+
61+
Thanks for the concrete use case — these are fair concerns.
62+
63+
**Replicability**: quant.cpp reads standard GGUF files directly. No model conversion, no custom formats. Any GGUF you download from Hugging Face works as-is. KV compression happens at runtime — the model file is untouched, so you can swap models freely. Same binary, different GGUF, same flags.
64+
65+
**Containers**: The binary is statically linkable with zero external dependencies (libc + pthreads only). No Python, no PyTorch, no CUDA runtime to install. A minimal Docker image can be under 10MB. That said, we don't ship an official container image yet — that's a fair gap.
66+
67+
**Standard API**: This is the honest limitation. quant.cpp has a C API (`quant_load` / `quant_generate`), not an OpenAI-compatible HTTP server. If you need a drop-in replacement for an existing API pipeline, llama.cpp's `llama-server` or vLLM is the right tool today.
68+
69+
Where quant.cpp fits in your workflow: if you're already running llama.cpp in a container and hitting context limits, we have an integration patch at `integrations/llamacpp/` that adds our KV compression as a drop-in option. Same API, longer context. The goal is to upstream delta compression into llama.cpp as a PR.
70+
71+
---
72+
73+
## @MimosaTen — "chatgpt-20b-Q4 could be the best model I've tried"
74+
75+
Nice — gpt-oss-20b is a solid model. It uses a GPT-2-style architecture with RoPE and MoE (32 experts), which is close to what quant.cpp already supports but not there yet. We handle Llama, Qwen, and Gemma architectures today.
76+
77+
If you're on limited hardware, KV compression would help a lot with a 20B MoE model — the KV cache is usually what runs you out of memory before the weights do, especially with long conversations.
78+
79+
I'll look into adding gpt-oss support. The MoE + RoPE + GQA pieces are already implemented for Gemma 4, so the gap is mostly the GPT-2 layer structure. Thanks for the suggestion!

docs/pr/2026-04-05-show-hn-v3.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@ I built a minimal LLM inference engine in pure C (67K LOC, zero dependencies) wi
2929
- FP32 baseline: 14.63
3030
- 4-bit K + Q4 V: 14.57 (**+0.0%**)
3131
- Delta 3-bit K + Q4 V: 14.82 (+1.3%)
32-
- vs llama.cpp Q4_0 KV: **+10.6% PPL**. Same bit budget, 10x more degradation.
32+
- vs llama.cpp Q4_0 KV: **+10.6% PPL**. Same compression level (3.8x), but Q4_0 symmetric vs min-max with independent K/V.
3333

3434
**Code philosophy:** 67K lines of C11. No frameworks, no CUDA required. The full forward pass fits in one file. Ships as a single-header `quant.h` (15K LOC) you can drop into any C project.
3535

Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
# Reddit Post — quant.cpp KV quality comparison (2026-04-06)
2+
3+
Target: r/LocalLLaMA, r/LocalLLM
4+
5+
**Title:** Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression)
6+
7+
**Post type:** Image + text
8+
9+
**Image:** How It Compares chart (PPL bar chart + engine comparison table)
10+
11+
---
12+
13+
**Body (below image):**
14+
15+
Both use 4-bit KV quantization. Same bit budget, but very different quality outcomes.
16+
17+
**WikiText-2 PPL (SmolLM2 1.7B):**
18+
19+
- llama.cpp Q4_0 K+V: PPL **+10.6%**
20+
- quant.cpp 4-bit K + Q4 V: PPL **+0.0%**
21+
- quant.cpp 3-bit delta K + Q4 V: PPL **+1.3%**
22+
23+
**Why the difference?** Both are per-block methods, but with different design choices:
24+
25+
- **Block size & range encoding**: llama.cpp Q4_0 uses 32-element blocks with a single zero-point scale. quant.cpp uses 128-element blocks with min-max range encoding, which better captures the distribution of key vectors specifically.
26+
- **Independent K/V treatment**: quant.cpp applies different quantization methods to keys vs values, optimized for each tensor's statistical properties.
27+
- **Delta compression** (unique to quant.cpp): stores `key[t] - key[t-1]` instead of absolute keys — like video P-frames. Adjacent keys differ by ~30% of their range, so 3-bit deltas work where absolute 3-bit gives +62% PPL.
28+
29+
**Fair note**: llama.cpp also supports separate K/V quant types. Q8_0 K + Q5_0 V is a solid config with much less degradation than Q4_0 on both — but at ~1.6x compression. quant.cpp targets the 4-7x range (extending 50K context to 350K) where the quality gap matters most.
30+
31+
On a 16GB Mac with Llama 3.2 3B: llama.cpp with FP16 KV maxes out at ~50K tokens. quant.cpp compresses KV 6.9x → **~350K tokens** with zero quality loss.
32+
33+
Not trying to replace llama.cpp — it's faster. Use llama.cpp for speed, vLLM for throughput, quant.cpp when context length is your bottleneck.
34+
35+
72K LOC, pure C, zero dependencies. Also ships as a single-header `quant.h` (15K LOC).
36+
37+
Source: [github.com/quantumaikr/quant.cpp](https://github.com/quantumaikr/quant.cpp)
38+
39+
---
40+
41+
## 수정 이력
42+
43+
**v2 (audioen 피드백 반영):**
44+
- "llama.cpp applies the same scheme to both K and V" 삭제 — 부정확 (llama.cpp도 별도 설정 가능)
45+
- "per-block" 공통점 인정, block size/range encoding 차이로 정정
46+
- llama.cpp Q8_0 K + Q5_0 V 옵션을 공정하게 언급
47+
- "breaks the model" → "very different quality outcomes" 톤 완화
48+
- 전체적으로 팩트 기반으로 재작성, 과장 표현 제거
49+
50+
**v1 문제점 (삭제된 표현):**
51+
- ~~"One breaks the model, the other doesn't"~~ → 과장
52+
- ~~"llama.cpp applies the same Q4_0 scheme to both keys and values"~~ → 부정확
53+
- ~~"Outliers stay local instead of corrupting the whole tensor"~~ → Q4_0도 per-block이므로 misleading

0 commit comments

Comments
 (0)