turboquant

A Rust implementation of TurboQuant — Google Research's near-lossless KV-cache compression algorithm for large language models (to be presented at ICLR 2026).

TurboQuant combines two novel techniques — PolarQuant and QJL (Quantized Johnson-Lindenstrauss) — to compress the key-value cache of transformer models by 3–7× with zero measurable accuracy loss and no retraining required.

FP16 KV cache (128-dim, 128K context, 96 heads):  ~6.2 GB
TurboQuant 4-bit:                                  ~1.9 GB  (3.4×)
TurboQuant 2-bit:                                  ~0.9 GB  (6.8×)

Background

The KV Cache Memory Problem

Modern large language models maintain a KV cache (Key-Value cache) during inference: a temporary store of the key and value vectors produced by each attention head for every token in the context window. This cache allows the model to avoid recomputing past token representations at every decoding step.

The problem is memory. For a model with:

  • L layers, H attention heads, head dimension d, context length N

The KV cache consumes:

memory = 2 × L × H × d × N × sizeof(float16)

For a GPT-4-class model at 128K context, this exceeds 6 GB — a hard wall that prevents long-context inference on consumer hardware and raises cloud inference costs substantially.
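
As a sanity check, the formula can be evaluated directly. A minimal sketch (note that the per-configuration sizes quoted in this README, e.g. ~6.29 GB for 96 heads × 128 dim × 128K tokens, match the formula with the layer count taken as 1):

```rust
/// KV cache size in bytes: 2 tensors (K and V) × layers × heads ×
/// head_dim × tokens × 2 bytes per f16 element.
fn kv_cache_bytes(layers: u64, heads: u64, head_dim: u64, tokens: u64) -> u64 {
    2 * layers * heads * head_dim * tokens * 2
}
```

For example, `kv_cache_bytes(1, 96, 128, 128_000)` gives 6,291,456,000 bytes ≈ 6.29 GB, the GPT-4-style figure quoted below.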

Why Traditional Quantization Falls Short

Standard quantization (INT8, INT4) works well for model weights, which are static and can be calibrated offline. The KV cache is different:

  • Vectors are generated online, one token at a time
  • Each vector has a different scale and distribution
  • Most methods store a normalization constant per block — adding memory overhead that partially cancels the compression gain
  • Aggressive compression introduces quantization bias that accumulates over long contexts, causing hallucinations

Algorithm

TurboQuant is a two-stage pipeline: PolarQuant handles the bulk compression, and QJL eliminates the residual bias.

Stage 1 — PolarQuant

Core insight: A random orthogonal rotation makes every vector "look the same" statistically, enabling a single universal quantizer with no per-vector calibration.

Step-by-step:

  1. Store the global L2 norm of the input vector (one f32, 4 bytes).

  2. Normalize the vector to the unit sphere.

  3. Apply a random orthogonal rotation R ∈ ℝ^{d×d}:

    v' = R · v_unit
    

    After rotation, each coordinate of v' follows a known marginal distribution (approximately Beta or Gaussian depending on head dimension). This is the Johnson-Lindenstrauss effect applied to the data rather than the query.

  4. Convert to polar coordinates in pairs (v'[2i], v'[2i+1]):

    r_i = sqrt(v'[2i]² + v'[2i+1]²)      # radius
    θ_i = atan2(v'[2i+1], v'[2i])         # angle ∈ [-π, π]
    
  5. Quantize using pre-computed optimal grids (no runtime calibration):

    • Angles: Uniform grid on [-π, π] — valid because the rotation makes angles approximately Uniform(−π, π)
    • Radii: Lloyd-Max quantizer for the Rayleigh distribution with σ = 1/√d — valid because radii follow a known 2D chi distribution after rotation
  6. Store: global_norm (f32) + packed angle codes + packed radius codes.
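
The steps above can be sketched compactly. This is an illustrative simplification, not this crate's polar_quant module: the random rotation is assumed to have been applied already, bit-packing is skipped, and a uniform radius grid stands in for the Lloyd-Max/Rayleigh grid.

```rust
use std::f32::consts::PI;

/// Simplified PolarQuant encoder (steps 1-6 above, minus rotation and packing).
fn polar_encode(v: &[f32], angle_bits: u32, radius_bits: u32) -> (f32, Vec<u8>, Vec<u8>) {
    // Step 1: store the global L2 norm (the only per-vector float kept).
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Step 2: normalize to the unit sphere.
    let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();

    let angle_levels = (1u32 << angle_bits) as f32;
    let radius_levels = (1u32 << radius_bits) as f32;
    // After rotation, pair radii concentrate around sqrt(2/d); clamp at 2/sqrt(d).
    let r_max = 2.0 / (v.len() as f32).sqrt();

    let mut angle_codes = Vec::new();
    let mut radius_codes = Vec::new();
    // Step 4: convert coordinate pairs to polar form.
    for pair in unit.chunks_exact(2) {
        let r = (pair[0] * pair[0] + pair[1] * pair[1]).sqrt();
        let theta = pair[1].atan2(pair[0]); // angle in [-pi, pi]
        // Step 5: uniform angle grid on [-pi, pi] ...
        angle_codes.push((((theta + PI) / (2.0 * PI)) * angle_levels).min(angle_levels - 1.0) as u8);
        // ... and a (simplified) uniform radius grid on [0, r_max].
        radius_codes.push(((r / r_max) * radius_levels).min(radius_levels - 1.0) as u8);
    }
    // Step 6: a real encoder would now bit-pack the code arrays.
    (norm, angle_codes, radius_codes)
}
```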

The key advantage over traditional methods: no per-block scale factors. Traditional INT4 quantization stores one float16 scale per 32 or 128 values. PolarQuant stores exactly one f32 per vector, regardless of dimension.

Memory for a single d=128 vector:

Format            Bytes  Notes
FP32              512    baseline
FP16              256    standard inference
PolarQuant 4-bit  68     4 + 32 + 32 (norm + angles + radii)
PolarQuant 3-bit  52     4 + 24 + 24
PolarQuant 2-bit  36     4 + 16 + 16
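
The sizes in this table follow from a small calculation: a 4-byte norm plus two packed code arrays of d/2 codes each (this helper is illustrative, not part of the crate's API):

```rust
/// Encoded PolarQuant size in bytes for an even dimension d at a given bit width.
fn polarquant_bytes(d: usize, bits: usize) -> usize {
    let codes_per_array = d / 2;                   // one angle + one radius code per pair
    let packed = (codes_per_array * bits + 7) / 8; // ceil(codes * bits / 8)
    4 + 2 * packed                                 // f32 norm + angles + radii
}
```

For d = 128 this yields 68, 52, and 36 bytes at 4, 3, and 2 bits respectively, matching the table.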

Stage 2 — QJL (Quantized Johnson-Lindenstrauss)

PolarQuant introduces a small reconstruction error. Left uncorrected, this error creates a systematic bias in attention score computation that compounds over long contexts.

QJL corrects this bias using just 1 additional bit per dimension.

Encoding the residual:

residual = original_vector - PolarQuant_reconstruction(original_vector)
sign_bits = sign(S · (residual / ‖residual‖))   # m sign bits

Where S ∈ ℝ^{m×d} is a fixed random Gaussian matrix (shared, not stored per vector). Only the m sign bits and the scalar ‖residual‖ are stored per vector.

Estimating the dot product correction at query time:

From the Johnson-Lindenstrauss lemma, for random unit vector sᵢ:

E[sign(sᵢ · residual) · (sᵢ · query)] = (2/π) · ⟨query, residual⟩ / ‖residual‖

Therefore:

⟨query, residual⟩ ≈ (π/2) · (‖residual‖ / m) · Σᵢ sign_bitsᵢ · (S · query)ᵢ

The query is kept in full precision; only the keys are compressed. This unbiased estimator eliminates the systematic error from PolarQuant at the cost of m/8 extra bytes per vector (typically d/4 projections = d/32 bytes).
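
The encode/correct pair implied by the estimator above can be sketched as follows. This is a simplified stand-in for the crate's qjl module: the Gaussian matrix S is passed in explicitly here, whereas in practice one fixed seeded matrix is shared by all vectors.

```rust
/// Encode: keep only ||residual|| and one sign bit per random projection.
fn qjl_encode(residual: &[f32], s: &[Vec<f32>]) -> (f32, Vec<bool>) {
    let norm = residual.iter().map(|x| x * x).sum::<f32>().sqrt();
    let signs = s
        .iter()
        .map(|row| row.iter().zip(residual).map(|(a, b)| a * b).sum::<f32>() >= 0.0)
        .collect();
    (norm, signs)
}

/// Estimate <query, residual> from the stored sign bits, per the formula above:
/// (pi/2) * (||residual|| / m) * sum_i sign_i * (S . query)_i
fn qjl_correction(query: &[f32], s: &[Vec<f32>], norm: f32, signs: &[bool]) -> f32 {
    let m = signs.len() as f32;
    let sum: f32 = s
        .iter()
        .zip(signs)
        .map(|(row, &bit)| {
            let proj: f32 = row.iter().zip(query).map(|(a, b)| a * b).sum();
            if bit { proj } else { -proj }
        })
        .sum();
    (std::f32::consts::PI / 2.0) * (norm / m) * sum
}
```

With a realistic m (say d/4 rows of S) the correction concentrates around the true residual dot product; the two-row case is only illustrative.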

Full TurboQuant dot product:

⟨query, key⟩ ≈ PolarQuant_dot(query, key_pq) + QJL_correction(query, key_qjl)

Theoretical Guarantees

TurboQuant operates near the information-theoretic lower bound for vector quantization distortion. The authors prove that the combination of random rotation + polar quantization achieves near-optimal rate-distortion tradeoff for Gaussian-distributed data, and the QJL stage makes the dot product estimator unbiased.


Results

Benchmarks from this implementation (dim=128, seed=42):

Compression ratio vs accuracy

Mode   Bytes (FP16 → TQ)  Compression  Dot product error
2-bit  256 → 44           5.8×         ~5% (with QJL)
3-bit  256 → 60           4.3×         ~3%
4-bit  256 → 76           3.4×         ~1%

Top-K recall (100 keys, dim=128)

K       Recall
Top-1   100%
Top-5   100%
Top-10  90%

KV cache memory at scale

Model                   Context  FP16     TurboQuant 4-bit  Savings
Llama-3-8B (32h×128d)   8K       134 MB   ~39 MB            3.4×
Llama-3-70B (64h×128d)  32K      1.07 GB  ~318 MB           3.4×
GPT-4 style (96h×128d)  128K     6.29 GB  ~1.87 GB          3.4×

Installation

Add to your Cargo.toml:

[dependencies]
turboquant = { path = "path/to/turboquant" }

Dependencies: rand = "0.8", rand_distr = "0.4" (pulled automatically).


Usage

Basic encoding and dot product

use turboquant::TurboQuant;

// Create a TurboQuant instance for head_dim=128, 4-bit compression
let tq = TurboQuant::new_4bit(128, /*seed=*/ 42);

// When a new token is generated, encode its key vector once:
let key: Vec<f32> = vec![/* head_dim floats */];
let encoded_key = tq.encode(&key);

// Compute attention score (replaces q · k / sqrt(d)):
let query: Vec<f32> = vec![/* head_dim floats */];
let score = tq.dot_product(&query, &encoded_key);

// Memory used:
println!("{} bytes (vs {} FP16)", encoded_key.bytes(), 128 * 2);

KV Cache for transformer inference

use turboquant::KvCache;

let head_dim  = 128;
let n_heads   = 32;
let max_seq   = 8192;

let mut cache = KvCache::new(head_dim, n_heads, max_seq);

// For each new token (keys and values are [n_heads][head_dim]):
cache.append(&keys, &values);

// Compute multi-head attention output ([n_heads][head_dim]):
let output = cache.attend(&queries);

// Print memory stats:
cache.print_stats();
// → KV Cache [32 heads × 128 dim × 512 tokens]
//     FP16:       16777216 bytes (16384.0 KB)
//     TurboQuant:  4915200 bytes ( 4800.0 KB)
//     Compression: 3.41x

PolarQuant standalone

use turboquant::PolarQuant;

let pq = PolarQuant::new(
    128,  // dim
    4,    // angle_bits
    4,    // radius_bits
    42,   // seed
);

let v: Vec<f32> = vec![/* ... */];
let encoded  = pq.encode(&v);
let decoded  = pq.decode(&encoded);      // approximate reconstruction
let dot      = pq.dot_product_decoded(&query, &encoded);  // fast path

let stats = pq.compression_stats();
println!("{:.1}x compression", stats.compression_ratio_fp16);

Compression modes

// 4-bit: best accuracy, ~3.4x compression
let tq = TurboQuant::new_4bit(dim, seed);

// 3-bit: balanced, ~4.3x compression
let tq = TurboQuant::new_3bit(dim, seed);

// Custom: full control
let tq = TurboQuant::new(
    dim,
    2,        // angle_bits
    2,        // radius_bits
    dim / 4,  // n_qjl_projections
    seed,
);

println!("{}", tq.stats());
// → TurboQuant(dim=128) | FP16: 256B → Compressed: 76B (PQ:68B + QJL:8B) | 3.4x compression

Project Structure

turboquant/
├── Cargo.toml
├── README.md
├── LICENSE
├── src/
│   ├── lib.rs            Public API and re-exports
│   ├── rotation.rs       Random orthogonal matrix (Gram-Schmidt QR)
│   ├── bits.rs           Bit-packing for 1–8 bit quantization codes
│   ├── polar_quant.rs    PolarQuant encoder / decoder
│   ├── qjl.rs            QJL sign-bit compressor + dot product estimator
│   ├── turboquant.rs     TurboQuant (PolarQuant + QJL), attention_scores()
│   └── kv_cache.rs       Multi-head KV cache for transformer inference
└── examples/
    └── demo.rs           Full benchmark: compression, recall, memory savings

Module responsibilities

rotation     — Generates a fixed random orthogonal matrix R via Gram-Schmidt QR. Provides forward(v) = R·v and inverse(v) = Rᵀ·v. Used by PolarQuant to uniformize vector statistics before quantization.
bits         — Bit-packing: stores N codes of b bits each in ⌈N·b/8⌉ bytes. Handles arbitrary bit widths 1–8. Used by PolarQuant to store angle and radius codes compactly.
polar_quant  — Core PolarQuant algorithm: rotation → polar conversion → quantization with pre-computed grids. Also provides a fast dot-product path that avoids full decoding.
qjl          — QJL residual compression: projects the residual error onto m random vectors and stores 1 sign bit each. Estimates ⟨query, residual⟩ at inference time with the JL unbiased estimator.
turboquant   — Orchestrates the full pipeline: encode = PolarQuant + residual + QJL; dot_product = PolarQuant dot + QJL correction. Exposes attention_scores() for direct use in attention layers.
kv_cache     — Manages a compressed KV cache for all heads across all sequence positions. Provides append(keys, values) and attend(queries) matching the transformer attention interface.
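
A sketch of what the bits module's packing scheme might look like, for orientation only (LSB-first bit order is an arbitrary choice here; the crate's actual layout may differ):

```rust
/// Pack N codes of `bits` bits each into ceil(N*bits/8) bytes, LSB-first.
fn pack(codes: &[u8], bits: usize) -> Vec<u8> {
    let mut out = vec![0u8; (codes.len() * bits + 7) / 8];
    for (i, &code) in codes.iter().enumerate() {
        for b in 0..bits {
            if (code >> b) & 1 == 1 {
                let pos = i * bits + b; // absolute bit position in the output
                out[pos / 8] |= 1u8 << (pos % 8);
            }
        }
    }
    out
}

/// Recover n codes of `bits` bits each from the packed byte array.
fn unpack(packed: &[u8], bits: usize, n: usize) -> Vec<u8> {
    (0..n)
        .map(|i| {
            (0..bits).fold(0u8, |acc, b| {
                let pos = i * bits + b;
                acc | (((packed[pos / 8] >> (pos % 8)) & 1) << b)
            })
        })
        .collect()
}
```

Three 3-bit codes, for instance, occupy ⌈9/8⌉ = 2 bytes and round-trip exactly.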

Running the tests and demo

# Run all 12 unit tests
cargo test

# Run the full benchmark demo
cargo run --example demo

Expected demo output:

▶ Step 1: PolarQuant (rotation + polar quantization)
  2-bit | bytes: 256 → 36 (7.1x) | recon_err: 0.414 | dot_err: 0.195
  3-bit | bytes: 256 → 52 (4.9x) | recon_err: 0.238 | dot_err: 0.067
  4-bit | bytes: 256 → 68 (3.8x) | recon_err: 0.127 | dot_err: 0.021

▶ Step 2: TurboQuant = PolarQuant + QJL residual correction
  2-bit | 44 bytes | dot_err: PolarQuant=0.1948  TurboQuant=0.0496  (QJL improvement: 3.9x)

▶ Step 3: Top-K recall (100 keys, dim=128)
  Top-1  recall: 1/1 = 100%
  Top-5  recall: 5/5 = 100%
  Top-10 recall: 9/10 = 90%

▶ Step 4: KV Cache memory savings
  GPT-4 style: FP16: 6291 MB  →  TurboQuant: 1867 MB  (3.4x)

Limitations and known gaps

This is a research implementation faithful to the published algorithm. A production deployment would additionally want:

  • SIMD acceleration — the rotation matrix multiply is the hot path; AVX2/NEON intrinsics would give 4–8× speedup
  • Randomized Hadamard Transform — replaces the O(d²) dense rotation with an O(d log d) structured transform; same theoretical guarantees, much faster
  • INT4 kernel integration — fused dequantize + attention kernel (as in FlashAttention) to avoid materializing FP32 vectors
  • Sliding window / token eviction — the current KvCache uses a simple FIFO eviction policy; real systems use more sophisticated strategies
  • Per-layer tuning — different transformer layers benefit from different bit widths; early layers typically tolerate more compression
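
For reference, the O(d log d) structured transform mentioned above is built around the fast Walsh-Hadamard transform; the randomized variant additionally applies a random ±1 diagonal before it. A self-contained sketch of the core butterfly (not part of this crate):

```rust
/// In-place fast Walsh-Hadamard transform, O(d log d) for power-of-two d,
/// normalized so the transform is orthonormal (applying it twice is the identity).
fn fwht(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;     // butterfly: sum ...
                v[j + h] = a - b; // ... and difference
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}
```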

References

  • TurboQuant — Google Research blog post, March 2026: research.google/blog/turboquant
  • ICLR 2026 paper (to appear): TurboQuant: Redefining AI Efficiency with Extreme Compression
  • AISTATS 2026 paper (to appear): PolarQuant
  • Johnson, W. B. & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
  • Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.

License

MIT License — see LICENSE.

This is an independent implementation based on the publicly available research description. It is not affiliated with or endorsed by Google Research.
