A Rust implementation of TurboQuant — Google Research's near-lossless KV-cache compression algorithm for large language models (to be presented at ICLR 2026).
TurboQuant combines two novel techniques — PolarQuant and QJL (Quantized Johnson-Lindenstrauss) — to compress the key-value cache of transformer models by 3–7× with zero measurable accuracy loss and no retraining required.
FP16 KV cache (128-dim, 128K context, 96 heads): ~6.2 GB
TurboQuant 4-bit: ~1.9 GB (3.4×)
TurboQuant 2-bit: ~0.9 GB (6.8×)
Modern large language models maintain a KV cache (Key-Value cache) during inference: a temporary store of the key and value vectors produced by each attention head for every token in the context window. This cache allows the model to avoid recomputing past token representations at every decoding step.
The problem is memory. For a model with:
L layers, H attention heads, head dimension d, context length N
The KV cache consumes:
memory = 2 × L × H × d × N × sizeof(float16)
For a GPT-4-class model at 128K context, this exceeds 6 GB — a hard wall that prevents long-context inference on consumer hardware and raises cloud inference costs substantially.
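As a sanity check on the formula, a small sketch (the layer count here is a placeholder argument; the comparison tables later in this README count heads × head_dim only):

```rust
// Plugging numbers into the formula above.
fn kv_cache_bytes(layers: u64, heads: u64, head_dim: u64, context_len: u64) -> u64 {
    2 * layers * heads * head_dim * context_len * 2 // K and V tensors, 2 bytes per f16
}

fn main() {
    // 96 heads × 128 dims at a 128K context (heads × head_dim only, no layer factor):
    let bytes = kv_cache_bytes(1, 96, 128, 128_000);
    println!("{:.2} GB", bytes as f64 / 1e9); // → 6.29 GB
}
```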
Standard quantization (INT8, INT4) works well for model weights, which are static and can be calibrated offline. The KV cache is different:
- Vectors are generated online, one token at a time
- Each vector has a different scale and distribution
- Most methods store a normalization constant per block — adding memory overhead that partially cancels the compression gain
- Aggressive compression introduces quantization bias that accumulates over long contexts, causing hallucinations
TurboQuant is a two-stage pipeline: PolarQuant handles the bulk compression, and QJL eliminates the residual bias.
Core insight: A random orthogonal rotation makes every vector "look the same" statistically, enabling a single universal quantizer with no per-vector calibration.
Step-by-step:
1. Store the global L2 norm of the input vector (one f32, 4 bytes).
2. Normalize the vector to the unit sphere.
3. Apply a random orthogonal rotation R ∈ ℝ^{d×d}: v' = R · v_unit. After rotation, each coordinate of v' follows a known marginal distribution (approximately Beta or Gaussian, depending on head dimension). This is the Johnson-Lindenstrauss effect applied to the data rather than the query.
4. Convert to polar coordinates in pairs (v'[2i], v'[2i+1]):

   ```
   r_i = sqrt(v'[2i]² + v'[2i+1]²)   # radius
   θ_i = atan2(v'[2i+1], v'[2i])     # angle ∈ [-π, π]
   ```

5. Quantize using pre-computed optimal grids (no runtime calibration):
   - Angles: uniform grid on [-π, π] — valid because the rotation makes angles approximately Uniform(−π, π)
   - Radii: Lloyd-Max quantizer for the Rayleigh distribution with σ = 1/√d — valid because radii follow a known 2D chi distribution after rotation
6. Store: global_norm (f32) + packed angle codes + packed radius codes.
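The encode path can be sketched as follows. This is an illustrative sketch, not the crate's implementation: the random rotation is assumed to be already applied to the input, and the radius grid is a plain uniform grid on [0, r_max] rather than a true Lloyd-Max grid for the Rayleigh distribution.

```rust
use std::f32::consts::PI;

// Simplified PolarQuant encode (rotation assumed applied; uniform radius grid).
struct Encoded {
    global_norm: f32,
    angle_codes: Vec<u8>,
    radius_codes: Vec<u8>,
}

fn encode(v: &[f32], angle_bits: u32, radius_bits: u32) -> Encoded {
    // Steps 1-2: store the global norm, normalize to the unit sphere (d must be even).
    let global_norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let unit: Vec<f32> = v.iter().map(|x| x / global_norm).collect();

    let angle_levels = (1u32 << angle_bits) as f32;
    let radius_levels = (1u32 << radius_bits) as f32;
    let r_max = (4.0 / v.len() as f32).sqrt(); // radii concentrate around sqrt(2/d)

    let (mut angle_codes, mut radius_codes) = (Vec::new(), Vec::new());
    for pair in unit.chunks(2) {
        // Step 4: polar coordinates per coordinate pair.
        let r = (pair[0] * pair[0] + pair[1] * pair[1]).sqrt();
        let theta = pair[1].atan2(pair[0]); // ∈ [-π, π]
        // Step 5: uniform angle grid; simplified uniform radius grid.
        let a = ((theta + PI) / (2.0 * PI) * angle_levels).min(angle_levels - 1.0);
        let rc = (r / r_max * radius_levels).min(radius_levels - 1.0);
        angle_codes.push(a as u8);
        radius_codes.push(rc as u8);
    }
    Encoded { global_norm, angle_codes, radius_codes }
}

fn main() {
    let v = [1.0, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, -1.0];
    let e = encode(&v, 4, 4);
    println!("norm={} angles={:?} radii={:?}", e.global_norm, e.angle_codes, e.radius_codes);
}
```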
The key advantage over traditional methods: no per-block scale factors. Traditional INT4 quantization stores one float16 scale per 32 or 128 values. PolarQuant stores exactly one f32 per vector, regardless of dimension.
Memory for a single d=128 vector:
| Format | Bytes | Notes |
|---|---|---|
| FP32 | 512 | baseline |
| FP16 | 256 | standard inference |
| PolarQuant 4-bit | 68 | 4+32+32: norm + angles + radii |
| PolarQuant 3-bit | 52 | 4+24+24 |
| PolarQuant 2-bit | 36 | 4+16+16 |
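The byte counts in the table follow directly from this layout. A quick check, assuming the two code streams are packed independently with ⌈·⌉ rounding:

```rust
// Bytes for one PolarQuant-encoded vector: one f32 norm plus two packed
// code streams (d/2 angle codes and d/2 radius codes at `bits` bits each).
fn polarquant_bytes(d: usize, bits: usize) -> usize {
    let codes_per_stream = d / 2; // one angle + one radius per coordinate pair
    4 + 2 * ((codes_per_stream * bits + 7) / 8)
}

fn main() {
    for bits in [4, 3, 2] {
        println!("{}-bit: {} bytes", bits, polarquant_bytes(128, bits));
    }
    // → 4-bit: 68 bytes
    // → 3-bit: 52 bytes
    // → 2-bit: 36 bytes
}
```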
PolarQuant introduces a small reconstruction error. Left uncorrected, this error creates a systematic bias in attention score computation that compounds over long contexts.
QJL corrects this bias using just 1 additional bit per dimension.
Encoding the residual:
```
residual  = original_vector - PolarQuant_reconstruction(original_vector)
sign_bits = sign(S · (residual / ‖residual‖))   # m sign bits
```
Where S ∈ ℝ^{m×d} is a fixed random Gaussian matrix (shared, not stored per vector). Only the m sign bits and the scalar ‖residual‖ are stored per vector.
Estimating the dot product correction at query time:
From the Johnson-Lindenstrauss lemma, for a random Gaussian vector sᵢ (a row of S):

E[sign(sᵢ · residual) · (sᵢ · query)] = √(2/π) · ⟨query, residual⟩ / ‖residual‖

Therefore:

⟨query, residual⟩ ≈ √(π/2) · (‖residual‖ / m) · Σᵢ sign_bitsᵢ · (S · query)ᵢ
The query is kept in full precision; only the keys are compressed. This unbiased estimator eliminates the systematic error from PolarQuant at the cost of m/8 extra bytes per vector (typically d/4 projections = d/32 bytes).
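The estimator can be sanity-checked numerically. The sketch below is standalone and hypothetical (toy xorshift/Box-Muller helpers, not the crate's `qjl` module); it uses the √(π/2) scaling appropriate for Gaussian projection rows:

```rust
use std::f64::consts::PI;

// Toy standard-normal sampler: xorshift64 uniforms fed into Box-Muller.
fn gauss(state: &mut u64) -> f64 {
    let mut uniform = || {
        *state ^= *state << 13;
        *state ^= *state >> 7;
        *state ^= *state << 17;
        (*state >> 11) as f64 / (1u64 << 53) as f64
    };
    let (u1, u2) = (uniform().max(1e-12), uniform());
    (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
}

// Returns (true ⟨query, residual⟩, sign-bit estimate with m projections).
fn jl_estimate(m: usize) -> (f64, f64) {
    let residual = [0.3, -0.7, 0.2, 0.5, -0.1, 0.4, 0.0, -0.2];
    let query = [1.0, 0.5, -0.3, 0.2, 0.7, -0.6, 0.1, 0.4];
    let true_dot: f64 = residual.iter().zip(&query).map(|(a, b)| a * b).sum();
    let res_norm: f64 = residual.iter().map(|x| x * x).sum::<f64>().sqrt();

    let mut state = 0x9E3779B97F4A7C15u64;
    let mut acc = 0.0;
    for _ in 0..m {
        let s: Vec<f64> = (0..residual.len()).map(|_| gauss(&mut state)).collect();
        let s_dot_r: f64 = s.iter().zip(&residual).map(|(a, b)| a * b).sum();
        let s_dot_q: f64 = s.iter().zip(&query).map(|(a, b)| a * b).sum();
        acc += s_dot_r.signum() * s_dot_q; // only the sign bit of s·residual is kept
    }
    // ⟨query, residual⟩ ≈ sqrt(π/2) · (‖residual‖ / m) · Σᵢ signᵢ · (S·query)ᵢ
    (true_dot, (PI / 2.0).sqrt() * (res_norm / m as f64) * acc)
}

fn main() {
    let (true_dot, estimate) = jl_estimate(200_000);
    println!("true={:.3} estimate={:.3}", true_dot, estimate);
    assert!((estimate - true_dot).abs() < 0.1 * true_dot.abs());
}
```

With many projections the Monte-Carlo error shrinks as O(1/√m), so the estimate lands within a few percent of the exact dot product.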
Full TurboQuant dot product:
⟨query, key⟩ ≈ PolarQuant_dot(query, key_pq) + QJL_correction(query, key_qjl)
TurboQuant operates near the information-theoretic lower bound for vector quantization distortion. The authors prove that the combination of random rotation + polar quantization achieves near-optimal rate-distortion tradeoff for Gaussian-distributed data, and the QJL stage makes the dot product estimator unbiased.
Benchmarks from this implementation (dim=128, seed=42):
| Mode | Bytes (FP16→TQ) | Compression | Dot product error |
|---|---|---|---|
| 2-bit | 256 → 44 | 5.8× | ~5% (with QJL) |
| 3-bit | 256 → 60 | 4.3× | ~3% |
| 4-bit | 256 → 76 | 3.4× | ~1% |
| K | Recall |
|---|---|
| Top-1 | 100% |
| Top-5 | 100% |
| Top-10 | 90% |
| Model | Context | FP16 | TurboQuant 4-bit | Savings |
|---|---|---|---|---|
| Llama-3-8B (32h×128d) | 8K | 134 MB | ~39 MB | 3.4× |
| Llama-3-70B (64h×128d) | 32K | 1.07 GB | ~318 MB | 3.4× |
| GPT-4 style (96h×128d) | 128K | 6.29 GB | ~1.87 GB | 3.4× |
Add to your Cargo.toml:
```toml
[dependencies]
turboquant = { path = "path/to/turboquant" }
```

Dependencies: `rand = "0.8"` and `rand_distr = "0.4"` (pulled in automatically).
```rust
use turboquant::TurboQuant;

// Create a TurboQuant instance for head_dim=128, 4-bit compression
let tq = TurboQuant::new_4bit(128, /*seed=*/ 42);

// When a new token is generated, encode its key vector once:
let key: Vec<f32> = vec![/* head_dim floats */];
let encoded_key = tq.encode(&key);

// Compute attention score (replaces q · k / sqrt(d)):
let query: Vec<f32> = vec![/* head_dim floats */];
let score = tq.dot_product(&query, &encoded_key);

// Memory used:
println!("{} bytes (vs {} FP16)", encoded_key.bytes(), 128 * 2);
```

```rust
use turboquant::KvCache;

let head_dim = 128;
let n_heads = 32;
let max_seq = 8192;
let mut cache = KvCache::new(head_dim, n_heads, max_seq);

// For each new token (keys and values are [n_heads][head_dim]):
cache.append(&keys, &values);

// Compute multi-head attention output ([n_heads][head_dim]):
let output = cache.attend(&queries);

// Print memory stats:
cache.print_stats();
// → KV Cache [32 heads × 128 dim × 512 tokens]
//   FP16:       16777216 bytes (16384.0 KB)
//   TurboQuant:  4915200 bytes ( 4800.0 KB)
//   Compression: 3.41x
```

```rust
use turboquant::PolarQuant;

let pq = PolarQuant::new(
    128, // dim
    4,   // angle_bits
    4,   // radius_bits
    42,  // seed
);

let v: Vec<f32> = vec![/* ... */];
let encoded = pq.encode(&v);
let decoded = pq.decode(&encoded);                  // approximate reconstruction
let dot = pq.dot_product_decoded(&query, &encoded); // fast path

let stats = pq.compression_stats();
println!("{:.1}x compression", stats.compression_ratio_fp16);
```

```rust
// 4-bit: best accuracy, ~3.4x compression
let tq = TurboQuant::new_4bit(dim, seed);

// 3-bit: balanced, ~4.3x compression
let tq = TurboQuant::new_3bit(dim, seed);

// Custom: full control
let tq = TurboQuant::new(
    dim,
    2,       // angle_bits
    2,       // radius_bits
    dim / 4, // n_qjl_projections
    seed,
);

println!("{}", tq.stats());
// → TurboQuant(dim=128) | FP16: 256B → Compressed: 44B (PQ:36B + QJL:8B) | 5.8x compression
```

```
turboquant/
├── Cargo.toml
├── README.md
├── LICENSE
├── src/
│   ├── lib.rs          Public API and re-exports
│   ├── rotation.rs     Random orthogonal matrix (Gram-Schmidt QR)
│   ├── bits.rs         Bit-packing for 1–8 bit quantization codes
│   ├── polar_quant.rs  PolarQuant encoder / decoder
│   ├── qjl.rs          QJL sign-bit compressor + dot product estimator
│   ├── turboquant.rs   TurboQuant (PolarQuant + QJL), attention_scores()
│   └── kv_cache.rs     Multi-head KV cache for transformer inference
└── examples/
    └── demo.rs         Full benchmark: compression, recall, memory savings
```
| Module | Responsibility |
|---|---|
| `rotation` | Generates a fixed random orthogonal matrix R via Gram-Schmidt. Provides `forward(v) = R·v` and `inverse(v) = Rᵀ·v`. Used by PolarQuant to uniformize vector statistics before quantization. |
| `bits` | Bit-packing: stores N codes of b bits each in ⌈N·b/8⌉ bytes. Handles arbitrary bit widths 1–8. Used by PolarQuant to store angle and radius codes compactly. |
| `polar_quant` | Core PolarQuant algorithm: rotation → polar conversion → quantization with pre-computed grids. Also provides a fast dot-product path that avoids full decoding. |
| `qjl` | QJL residual compression. Projects the residual error onto m random vectors and stores 1 sign bit each. Estimates ⟨query, residual⟩ at inference time via the JL unbiased estimator. |
| `turboquant` | Orchestrates the full pipeline: `encode` = PolarQuant + compute residual + QJL. `dot_product` = PolarQuant dot + QJL correction. Exposes `attention_scores()` for direct use in attention layers. |
| `kv_cache` | Manages a compressed KV cache for all heads across all sequence positions. Provides `append(keys, values)` and `attend(queries)` matching the transformer attention interface. |
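For illustration, the packing scheme described for `bits` can be sketched like this (LSB-first bit order is an assumption; the crate may order bits differently):

```rust
// Pack N codes of `bits` bits each into ceil(N*bits/8) bytes, LSB-first.
fn pack(codes: &[u8], bits: usize) -> Vec<u8> {
    let mut out = vec![0u8; (codes.len() * bits + 7) / 8];
    for (i, &c) in codes.iter().enumerate() {
        for k in 0..bits {
            if (c >> k) & 1 == 1 {
                let pos = i * bits + k;
                out[pos / 8] |= 1 << (pos % 8);
            }
        }
    }
    out
}

// Recover the first `n` codes from a packed buffer.
fn unpack(packed: &[u8], bits: usize, n: usize) -> Vec<u8> {
    (0..n)
        .map(|i| {
            (0..bits).fold(0u8, |c, k| {
                let pos = i * bits + k;
                c | ((packed[pos / 8] >> (pos % 8)) & 1) << k
            })
        })
        .collect()
}

fn main() {
    let codes = [5u8, 1, 7, 2, 6]; // five 3-bit codes → ⌈15/8⌉ = 2 bytes
    let packed = pack(&codes, 3);
    assert_eq!(packed.len(), 2);
    assert_eq!(unpack(&packed, 3, codes.len()), codes);
    println!("{} codes in {} bytes", codes.len(), packed.len());
}
```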
```shell
# Run all 12 unit tests
cargo test

# Run the full benchmark demo
cargo run --example demo
```

Expected demo output:

```
▶ Step 1: PolarQuant (rotation + polar quantization)
  2-bit | bytes: 256 → 36 (7.1x) | recon_err: 0.414 | dot_err: 0.195
  3-bit | bytes: 256 → 52 (4.9x) | recon_err: 0.238 | dot_err: 0.067
  4-bit | bytes: 256 → 68 (3.8x) | recon_err: 0.127 | dot_err: 0.021

▶ Step 2: TurboQuant = PolarQuant + QJL residual correction
  2-bit | 44 bytes | dot_err: PolarQuant=0.1948 TurboQuant=0.0496 (QJL improvement: 3.9x)

▶ Step 3: Top-K recall (100 keys, dim=128)
  Top-1 recall:  1/1  = 100%
  Top-5 recall:  5/5  = 100%
  Top-10 recall: 9/10 = 90%

▶ Step 4: KV Cache memory savings
  GPT-4 style: FP16: 6291 MB → TurboQuant: 1867 MB (3.4x)
```
This is a research implementation faithful to the published algorithm. A production deployment would additionally want:
- SIMD acceleration — the rotation matrix multiply is the hot path; AVX2/NEON intrinsics would give 4–8× speedup
- Randomized Hadamard Transform — replaces the O(d²) dense rotation with an O(d log d) structured transform; same theoretical guarantees, much faster
- INT4 kernel integration — fused dequantize + attention kernel (as in FlashAttention) to avoid materializing FP32 vectors
- Sliding window / token eviction — the current KvCache uses a simple FIFO eviction policy; real systems use more sophisticated strategies
- Per-layer tuning — different transformer layers benefit from different bit widths; early layers typically tolerate more compression
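To make the Hadamard point concrete, here is a minimal standalone sketch (not part of this crate): a fixed random ±1 diagonal followed by an in-place fast Walsh-Hadamard transform yields an O(d log d) orthogonal rotation once scaled by 1/√d.

```rust
// In-place fast Walsh-Hadamard transform, normalized to be orthonormal.
fn fwht(v: &mut [f32]) {
    let d = v.len();
    assert!(d.is_power_of_two());
    let mut h = 1;
    while h < d {
        for i in (0..d).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (d as f32).sqrt();
    v.iter_mut().for_each(|x| *x *= scale);
}

fn main() {
    let signs = [1.0f32, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0]; // fixed random ±1 diagonal
    let mut v = [0.5f32, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0];
    let norm_before: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();

    for (x, s) in v.iter_mut().zip(&signs) {
        *x *= s; // randomized diagonal
    }
    fwht(&mut v); // O(d log d) structured rotation

    let norm_after: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Orthogonal transforms preserve the L2 norm:
    assert!((norm_before - norm_after).abs() < 1e-4);
    println!("norm preserved: {:.4} ≈ {:.4}", norm_before, norm_after);
}
```

Because the normalized transform is its own inverse, decoding applies the same diagonal and transform in reverse order.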
- TurboQuant — Google Research blog post, March 2026: research.google/blog/turboquant
- ICLR 2026 paper (to appear): TurboQuant: Redefining AI Efficiency with Extreme Compression
- AISTATS 2026 paper (to appear): PolarQuant
- Johnson, W. B. & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
- Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
MIT License — see LICENSE.
This is an independent implementation based on the publicly available research description. It is not affiliated with or endorsed by Google Research.