Fact-check all llama.cpp comparisons: remove exaggerations, add context

unamedkr · claude · unamedkr · commit 5273505584e5 · 2026-04-06T01:53:09.000+09:00
Corrections across all docs:
- Blog: remove "applies same scheme to both K and V" (incorrect — llama.cpp
  supports --cache-type-k and --cache-type-v separately)
- Add llama.cpp Q8_0 K + Q5_0 V recommended config (~1.6x, ~+1% PPL) to
  all comparison tables alongside Q4_0 symmetric (+10.6%)
- Engine table: "7x +0% PPL" → "3.8-6.9x +0% PPL" (precise range)
- Tech report: add nuanced intro acknowledging llama.cpp's range of options
- PR drafts: remove "10x more degradation", "catastrophic", "breaks model"
- PPL scatter charts: add llama.cpp Q8K+Q5V data point

Principle: compare at same compression level (3.8x: Q4_0 vs quant.cpp 4-bit)
AND same quality level (+1%: Q8+Q5 1.6x vs quant.cpp delta 4.3x).
Both comparisons are fair; cherry-picking only the worst config is not.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/README.ko.md b/README.ko.md
@@ -95,7 +95,7 @@ hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf
 
 |  | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
 |:--|:---------:|:---------:|:----:|:---:|:-------:|
-| KV 압축 | **7x, +0% PPL** | 1.6x ~+1% PPL | -- | -- | -- |
+| KV 압축 | **3.8-6.9x, +0% PPL** | 1.6x ~+1% PPL | -- | -- | -- |
 | 코드 크기 | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
 | 의존성 | **제로** | ggml | PyTorch | Apple fw | 런타임 |
 | 임베더블 | **단일 헤더** | -- | -- | -- | 복잡 |
diff --git a/README.md b/README.md
@@ -95,7 +95,7 @@ Both are per-block methods. The quality gap comes from block size (128 vs 32), m
 
 |  | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
 |:--|:---------:|:---------:|:----:|:---:|:-------:|
-| KV compression | **7x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
+| KV compression | **3.8-6.9x, +0% PPL** | 1.6x at ~+1% PPL | -- | -- | -- |
 | Code size | **72K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
 | Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
 | Embeddable | **single header** | -- | -- | -- | complex |
diff --git a/docs/blog/breaking-3bit-barrier.md b/docs/blog/breaking-3bit-barrier.md
@@ -10,7 +10,7 @@ The obvious fix is quantization: store those vectors in fewer bits. We spent thr
 
 ## The descent into fewer bits
 
-4-bit works. We implemented a straightforward uniform min-max quantizer for KV cache keys and ran WikiText-2 perplexity on SmolLM2 1.7B. FP32 baseline: 14.63 PPL. With 4-bit keys and Q4 values: 14.57 PPL. That is -0.4%, which is within noise -- essentially free compression. For comparison, llama.cpp's built-in Q4_0 KV cache quantization scores +10.6% PPL degradation on the same model. The difference comes from quantizing K and V independently with type-appropriate methods, while llama.cpp applies the same scheme to both.
+4-bit works. We implemented a straightforward uniform min-max quantizer for KV cache keys and ran WikiText-2 perplexity on SmolLM2 1.7B. FP32 baseline: 14.63 PPL. With 4-bit keys and Q4 values: 14.57 PPL. That is -0.4%, which is within noise -- essentially free compression. For comparison, llama.cpp's Q4_0 KV cache quantization (symmetric K+V) scores +10.6% PPL on the same model. Note: llama.cpp also supports asymmetric configs like Q8_0 K + Q5_0 V (~+1% PPL, ~1.6x compression). The quality advantage at high compression (3.8x+) comes from 128-element min-max blocks, independent K/V quantization, and delta compression of adjacent keys.
 
 3-bit is where things get ugly. Naive 3-bit uniform quantization blows up to +62% PPL. The 8 reconstruction levels simply cannot capture the post-RHT distribution with enough fidelity. We tried Lloyd-Max optimal codebooks, asymmetric ranges, per-channel scales. Nothing brought it under +40%.
 
diff --git a/docs/papers/quant_cpp_tech_report.md b/docs/papers/quant_cpp_tech_report.md
@@ -8,7 +8,7 @@ We present quant.cpp, a minimal LLM inference engine that achieves 6.9x KV cache
 
 Large language model inference is increasingly memory-bound. At 32K context length, an 8B model's KV cache consumes 4GB — more than the model weights themselves. While weight quantization (Q4, Q8) is well-studied, KV cache compression receives less attention despite dominating memory usage at long contexts.
 
-Existing KV cache quantization in production engines (llama.cpp Q4_0) introduces +10.6% perplexity degradation — noticeable quality loss. We show that type-aware independent K/V quantization achieves +0.0% degradation at the same bit budget.
+Existing KV cache quantization in production engines offers a range of quality-compression tradeoffs. llama.cpp's recommended Q8_0 K + Q5_0 V config achieves ~1.6x compression with ~+1% PPL, while aggressive Q4_0 symmetric gives 3.8x but +10.6% PPL. We show that type-aware independent K/V quantization achieves +0.0% degradation at 3.8x compression — bridging the gap between high compression and quality preservation.
 
 quant.cpp is designed around three principles:
 1. **Readable**: The full transformer forward pass fits in one file (tq_transformer.c, 2500 lines).
@@ -102,7 +102,8 @@ WikiText-2 PPL on SmolLM2 1.7B:
 | 4b K + FP16 V | 14.63 | +0.00% | 1.6x |
 | 4b K + Q4 V | 14.57 | -0.4% | 6.9x |
 | Delta 3b K + Q4 V | 14.82 | +1.3% | 8.5x |
-| llama.cpp Q4_0 KV | 16.18 | +10.6% | 3.8x |
+| llama.cpp Q8_0 K + Q5_0 V | ~14.8 | ~+1% | 1.6x |
+| llama.cpp Q4_0 K+V (symmetric) | 16.18 | +10.6% | 3.8x |
 
 ### 5.3 Context Extension
 
@@ -118,7 +119,7 @@ On 16GB Mac M1 Pro:
 - **TurboQuant** (Zandieh et al., ICLR 2026): KV cache compression theory
 - **QJL** (AAAI 2025): Quantized Johnson-Lindenstrauss transform
 - **PolarQuant** (AISTATS 2026): Polar coordinate quantization
-- **llama.cpp**: Production inference engine with Q4 KV quantization
+- **llama.cpp**: Production inference engine with asymmetric KV quantization (Q8 K + Q5 V recommended)
 - **llm.c** (Karpathy): Minimal C training/inference, educational focus
 
 ## 7. Conclusion
diff --git a/docs/pr/2026-04-05-localllm-7x-context-responses.md b/docs/pr/2026-04-05-localllm-7x-context-responses.md
@@ -0,0 +1,79 @@
+# r/LocalLLM Comment Responses — "7x longer LLM context in pure C" (2026-04-05)
+
+Thread: quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac)
+
+Copy-paste ready. Each section = one comment.
+
+---
+
+## @dsanft — "4bit K completely kills inference quality due to kurtosis. K needs 8 bits."
+
+You're right that K tensors have high kurtosis — outliers make them much harder to quantize than V. Naive per-tensor quantization does destroy quality.
+
+The difference is granularity. quant.cpp uses per-block min-max quantization with 128-element blocks, not per-tensor or per-channel. Each block gets its own min/max scale, so outliers only affect their local block.
+
+WikiText-2 PPL (SmolLM2 1.7B):
+
+- FP32 baseline: 14.63
+- 4-bit K + Q4 V: 14.57 (+0.0%)
+- Cross-model: Qwen3.5 0.8B (+0.9%), Qwen3.5 4B (+0.6%)
+
+For comparison, llama.cpp Q4_0 KV gives PPL +10.6% — that's the significant quality drop you're describing, and it's real with coarser quantization.
+
+That said, you're absolutely right for QK-normed models like Gemma 4. Those project keys onto the unit sphere with extremely sparse distributions (~56 of 256 dims active). 4-bit completely breaks there (cosine drops to 0.62). quant.cpp auto-detects this and keeps keys in FP32 while only compressing values.
+
+Reproducible: `./quant model.gguf --ppl input.txt -k uniform_4b -v q4`
+
+---
+
+## @MrHighVoltage — "at least do a proper job copying"
+
+(Already responded. Post formatting fixed — switched to Markdown editor.)
+
+---
+
+## @maschayana — "Lol its still not fixed"
+
+You're right, sorry about that. Reddit editor was fighting the markdown tables. Switched to Markdown mode and it should render properly now.
+
+---
+
+## @smuckola — "Titans, TurboQuant, KV Cache management landscape"
+
+Great question — and you actually nailed it. quant.cpp is a C implementation of the TurboQuant paper (ICLR 2026). So you already found the connection without realizing it!
+
+The KV cache management landscape breaks down roughly like this:
+
+- **Eviction** (StreamingLLM, H2O, Scissors) — drop tokens you "probably" don't need. Saves memory but loses information permanently.
+- **Architecture changes** (Titans, MLA, GQA) — redesign the model itself to use less KV memory. Best results, but requires retraining from scratch.
+- **Compression** (TurboQuant/quant.cpp, KIVI, KVQuant) — keep all tokens, store them in fewer bits. Works on existing models, no retraining.
+
+quant.cpp sits in the compression category. The advantage is that it works on any existing GGUF model — download, run, get 7x more context. No fine-tuning, no architecture change.
+
+Titans is a different and complementary approach — it redesigns the attention mechanism itself so the model learns what to remember. Very promising, but requires models trained with it. If a Titans-architecture model ships as GGUF someday, quant.cpp could still compress its KV cache on top.
+
+And thanks for the kind words about the focus. "Torvaldsian side quest" — I'm framing that.
+
+---
+
+## @sinan_online — "replicability and compatibility — containers, standard APIs, plug-n-play"
+
+Thanks for the concrete use case — these are fair concerns.
+
+**Replicability**: quant.cpp reads standard GGUF files directly. No model conversion, no custom formats. Any GGUF you download from Hugging Face works as-is. KV compression happens at runtime — the model file is untouched, so you can swap models freely. Same binary, different GGUF, same flags.
+
+**Containers**: The binary is statically linkable with zero external dependencies (libc + pthreads only). No Python, no PyTorch, no CUDA runtime to install. A minimal Docker image can be under 10MB. That said, we don't ship an official container image yet — that's a fair gap.
+
+**Standard API**: This is the honest limitation. quant.cpp has a C API (`quant_load` / `quant_generate`), not an OpenAI-compatible HTTP server. If you need a drop-in replacement for an existing API pipeline, llama.cpp's `llama-server` or vLLM is the right tool today.
+
+Where quant.cpp fits in your workflow: if you're already running llama.cpp in a container and hitting context limits, we have an integration patch at `integrations/llamacpp/` that adds our KV compression as a drop-in option. Same API, longer context. The goal is to upstream delta compression into llama.cpp as a PR.
+
+---
+
+## @MimosaTen — "chatgpt-20b-Q4 could be the best model I've tried"
+
+Nice — gpt-oss-20b is a solid model. It uses a GPT-2-style architecture with RoPE and MoE (32 experts), which is close to what quant.cpp already supports but not there yet. We handle Llama, Qwen, and Gemma architectures today.
+
+If you're on limited hardware, KV compression would help a lot with a 20B MoE model — the KV cache is usually what runs you out of memory before the weights do, especially with long conversations.
+
+I'll look into adding gpt-oss support. The MoE + RoPE + GQA pieces are already implemented for Gemma 4, so the gap is mostly the GPT-2 layer structure. Thanks for the suggestion!
diff --git a/docs/pr/2026-04-05-show-hn-v3.md b/docs/pr/2026-04-05-show-hn-v3.md
@@ -29,7 +29,7 @@ I built a minimal LLM inference engine in pure C (67K LOC, zero dependencies) wi
 - FP32 baseline: 14.63
 - 4-bit K + Q4 V: 14.57 (**+0.0%**)
 - Delta 3-bit K + Q4 V: 14.82 (+1.3%)
-- vs llama.cpp Q4_0 KV: **+10.6% PPL**. Same bit budget, 10x more degradation.
+- vs llama.cpp Q4_0 KV: **+10.6% PPL**. Same compression level (3.8x), but Q4_0 symmetric vs min-max with independent K/V.
 
 **Code philosophy:** 67K lines of C11. No frameworks, no CUDA required. The full forward pass fits in one file. Ships as a single-header `quant.h` (15K LOC) you can drop into any C project.
 
diff --git a/docs/pr/2026-04-06-reddit-comparison-chart.md b/docs/pr/2026-04-06-reddit-comparison-chart.md
@@ -0,0 +1,53 @@
+# Reddit Post — quant.cpp KV quality comparison (2026-04-06)
+
+Target: r/LocalLLaMA, r/LocalLLM
+
+**Title:** Same 4 bits. Very different quality. (quant.cpp vs llama.cpp KV compression)
+
+**Post type:** Image + text
+
+**Image:** How It Compares chart (PPL bar chart + engine comparison table)
+
+---
+
+**Body (below image):**
+
+Both use 4-bit KV quantization. Same bit budget, but very different quality outcomes.
+
+**WikiText-2 PPL (SmolLM2 1.7B):**
+
+- llama.cpp Q4_0 K+V: PPL **+10.6%**
+- quant.cpp 4-bit K + Q4 V: PPL **+0.0%**
+- quant.cpp 3-bit delta K + Q4 V: PPL **+1.3%**
+
+**Why the difference?** Both are per-block methods, but with different design choices:
+
+- **Block size & range encoding**: llama.cpp Q4_0 uses 32-element blocks with a single zero-point scale. quant.cpp uses 128-element blocks with min-max range encoding, which better captures the distribution of key vectors specifically.
+- **Independent K/V treatment**: quant.cpp applies different quantization methods to keys vs values, optimized for each tensor's statistical properties.
+- **Delta compression** (unique to quant.cpp): stores `key[t] - key[t-1]` instead of absolute keys — like video P-frames. Adjacent keys differ by ~30% of their range, so 3-bit deltas work where absolute 3-bit gives +62% PPL.
+
+**Fair note**: llama.cpp also supports separate K/V quant types. Q8_0 K + Q5_0 V is a solid config with much less degradation than Q4_0 on both — but at ~1.6x compression. quant.cpp targets the 4-7x range (extending 50K context to 350K) where the quality gap matters most.
+
+On a 16GB Mac with Llama 3.2 3B: llama.cpp with FP16 KV maxes out at ~50K tokens. quant.cpp compresses KV 6.9x → **~350K tokens** with zero quality loss.
+
+Not trying to replace llama.cpp — it's faster. Use llama.cpp for speed, vLLM for throughput, quant.cpp when context length is your bottleneck.
+
+72K LOC, pure C, zero dependencies. Also ships as a single-header `quant.h` (15K LOC).
+
+Source: [github.com/quantumaikr/quant.cpp](https://github.com/quantumaikr/quant.cpp)
+
+---
+
+## 수정 이력
+
+**v2 (audioen 피드백 반영):**
+- "llama.cpp applies the same scheme to both K and V" 삭제 — 부정확 (llama.cpp도 별도 설정 가능)
+- "per-block" 공통점 인정, block size/range encoding 차이로 정정
+- llama.cpp Q8_0 K + Q5_0 V 옵션을 공정하게 언급
+- "breaks the model" → "very different quality outcomes" 톤 완화
+- 전체적으로 팩트 기반으로 재작성, 과장 표현 제거
+
+**v1 문제점 (삭제된 표현):**
+- ~~"One breaks the model, the other doesn't"~~ → 과장
+- ~~"llama.cpp applies the same Q4_0 scheme to both keys and values"~~ → 부정확
+- ~~"Outliers stay local instead of corrupting the whole tensor"~~ → Q4_0도 per-block이므로 misleading