turboquant

A Rust implementation of TurboQuant — Google Research's near-lossless KV-cache compression algorithm for large language models (to be presented at ICLR 2026).

TurboQuant combines two novel techniques — PolarQuant and QJL (Quantized Johnson-Lindenstrauss) — to compress the key-value cache of transformer models by 3–7× with zero measurable accuracy loss and no retraining required.

FP16 KV cache (128-dim, 128K context, 96 heads):  ~6.2 GB
TurboQuant 4-bit:                                  ~1.9 GB  (3.4×)
TurboQuant 2-bit:                                  ~0.9 GB  (6.8×)

Background

The KV Cache Memory Problem

Modern large language models maintain a KV cache (Key-Value cache) during inference: a temporary store of the key and value vectors produced by each attention head for every token in the context window. This cache allows the model to avoid recomputing past token representations at every decoding step.

The problem is memory. For a model with:

  • L layers, H attention heads, head dimension d, context length N

The KV cache consumes:

memory = 2 × L × H × d × N × sizeof(float16)

For a GPT-4-class model at 128K context, this exceeds 6 GB — a hard wall that prevents long-context inference on consumer hardware and raises cloud inference costs substantially.
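
As a sanity check, the formula can be evaluated directly. A minimal sketch (note that the per-configuration sizes quoted in this README, e.g. ~6.29 GB for 96 heads × 128 dim × 128K tokens, match the formula with the layer count taken as 1):

```rust
/// KV cache size in bytes: 2 tensors (K and V) × layers × heads ×
/// head_dim × tokens × 2 bytes per f16 element.
fn kv_cache_bytes(layers: u64, heads: u64, head_dim: u64, tokens: u64) -> u64 {
    2 * layers * heads * head_dim * tokens * 2
}
```

For example, `kv_cache_bytes(1, 96, 128, 128_000)` gives 6,291,456,000 bytes ≈ 6.29 GB, the GPT-4-style figure quoted below.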

Why Traditional Quantization Falls Short

Standard quantization (INT8, INT4) works well for model weights, which are static and can be calibrated offline. The KV cache is different:

  • Vectors are generated online, one token at a time
  • Each vector has a different scale and distribution
  • Most methods store a normalization constant per block — adding memory overhead that partially cancels the compression gain
  • Aggressive compression introduces quantization bias that accumulates over long contexts, causing hallucinations

Algorithm

TurboQuant is a two-stage pipeline: PolarQuant handles the bulk compression, and QJL eliminates the residual bias.

Stage 1 — PolarQuant

Core insight: A random orthogonal rotation makes every vector "look the same" statistically, enabling a single universal quantizer with no per-vector calibration.

Step-by-step:

  1. Store the global L2 norm of the input vector (one f32, 4 bytes).

  2. Normalize the vector to the unit sphere.

  3. Apply a random orthogonal rotation R ∈ ℝ^{d×d}:

    v' = R · v_unit
    

    After rotation, each coordinate of v' follows a known marginal distribution (approximately Beta or Gaussian depending on head dimension). This is the Johnson-Lindenstrauss effect applied to the data rather than the query.

  4. Convert to polar coordinates in pairs (v'[2i], v'[2i+1]):

    r_i = sqrt(v'[2i]² + v'[2i+1]²)      # radius
    θ_i = atan2(v'[2i+1], v'[2i])         # angle ∈ [-π, π]
    
  5. Quantize using pre-computed optimal grids (no runtime calibration):

    • Angles: Uniform grid on [-π, π] — valid because the rotation makes angles approximately Uniform(−π, π)
    • Radii: Lloyd-Max quantizer for the Rayleigh distribution with σ = 1/√d — valid because radii follow a known 2D chi distribution after rotation
  6. Store: global_norm (f32) + packed angle codes + packed radius codes.
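
The steps above can be sketched compactly. This is an illustrative simplification, not this crate's polar_quant module: the random rotation is assumed to have been applied already, bit-packing is skipped, and a uniform radius grid stands in for the Lloyd-Max/Rayleigh grid.

```rust
use std::f32::consts::PI;

/// Simplified PolarQuant encoder (steps 1-6 above, minus rotation and packing).
fn polar_encode(v: &[f32], angle_bits: u32, radius_bits: u32) -> (f32, Vec<u8>, Vec<u8>) {
    // Step 1: store the global L2 norm (the only per-vector float kept).
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Step 2: normalize to the unit sphere.
    let unit: Vec<f32> = v.iter().map(|x| x / norm).collect();

    let angle_levels = (1u32 << angle_bits) as f32;
    let radius_levels = (1u32 << radius_bits) as f32;
    // After rotation, pair radii concentrate around sqrt(2/d); clamp at 2/sqrt(d).
    let r_max = 2.0 / (v.len() as f32).sqrt();

    let mut angle_codes = Vec::new();
    let mut radius_codes = Vec::new();
    // Step 4: convert coordinate pairs to polar form.
    for pair in unit.chunks_exact(2) {
        let r = (pair[0] * pair[0] + pair[1] * pair[1]).sqrt();
        let theta = pair[1].atan2(pair[0]); // angle in [-pi, pi]
        // Step 5: uniform angle grid on [-pi, pi] ...
        angle_codes.push((((theta + PI) / (2.0 * PI)) * angle_levels).min(angle_levels - 1.0) as u8);
        // ... and a (simplified) uniform radius grid on [0, r_max].
        radius_codes.push(((r / r_max) * radius_levels).min(radius_levels - 1.0) as u8);
    }
    // Step 6: a real encoder would now bit-pack the code arrays.
    (norm, angle_codes, radius_codes)
}
```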

The key advantage over traditional methods: no per-block scale factors. Traditional INT4 quantization stores one float16 scale per 32 or 128 values. PolarQuant stores exactly one f32 per vector, regardless of dimension.

Memory for a single d=128 vector:

Format            Bytes  Notes
FP32              512    baseline
FP16              256    standard inference
PolarQuant 4-bit  68     4 + 32 + 32 (norm + angles + radii)
PolarQuant 3-bit  52     4 + 24 + 24
PolarQuant 2-bit  36     4 + 16 + 16
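
The sizes in this table follow from a small calculation: a 4-byte norm plus two packed code arrays of d/2 codes each (this helper is illustrative, not part of the crate's API):

```rust
/// Encoded PolarQuant size in bytes for an even dimension d at a given bit width.
fn polarquant_bytes(d: usize, bits: usize) -> usize {
    let codes_per_array = d / 2;                   // one angle + one radius code per pair
    let packed = (codes_per_array * bits + 7) / 8; // ceil(codes * bits / 8)
    4 + 2 * packed                                 // f32 norm + angles + radii
}
```

For d = 128 this yields 68, 52, and 36 bytes at 4, 3, and 2 bits respectively, matching the table.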

Stage 2 — QJL (Quantized Johnson-Lindenstrauss)

PolarQuant introduces a small reconstruction error. Left uncorrected, this error creates a systematic bias in attention score computation that compounds over long contexts.

QJL corrects this bias using just 1 additional bit per dimension.

Encoding the residual:

residual = original_vector - PolarQuant_reconstruction(original_vector)
sign_bits = sign(S · (residual / ‖residual‖))   # m sign bits

Where S ∈ ℝ^{m×d} is a fixed random Gaussian matrix (shared, not stored per vector). Only the m sign bits and the scalar ‖residual‖ are stored per vector.

Estimating the dot product correction at query time:

From the Johnson-Lindenstrauss lemma, for random unit vector sᵢ:

E[sign(sᵢ · residual) · (sᵢ · query)] = (2/π) · ⟨query, residual⟩ / ‖residual‖

Therefore:

⟨query, residual⟩ ≈ (π/2) · (‖residual‖ / m) · Σᵢ sign_bitsᵢ · (S · query)ᵢ

The query is kept in full precision; only the keys are compressed. This unbiased estimator eliminates the systematic error from PolarQuant at the cost of m/8 extra bytes per vector (typically d/4 projections = d/32 bytes).
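
The encode/correct pair implied by the estimator above can be sketched as follows. This is a simplified stand-in for the crate's qjl module: the Gaussian matrix S is passed in explicitly here, whereas in practice one fixed seeded matrix is shared by all vectors.

```rust
/// Encode: keep only ||residual|| and one sign bit per random projection.
fn qjl_encode(residual: &[f32], s: &[Vec<f32>]) -> (f32, Vec<bool>) {
    let norm = residual.iter().map(|x| x * x).sum::<f32>().sqrt();
    let signs = s
        .iter()
        .map(|row| row.iter().zip(residual).map(|(a, b)| a * b).sum::<f32>() >= 0.0)
        .collect();
    (norm, signs)
}

/// Estimate <query, residual> from the stored sign bits, per the formula above:
/// (pi/2) * (||residual|| / m) * sum_i sign_i * (S . query)_i
fn qjl_correction(query: &[f32], s: &[Vec<f32>], norm: f32, signs: &[bool]) -> f32 {
    let m = signs.len() as f32;
    let sum: f32 = s
        .iter()
        .zip(signs)
        .map(|(row, &bit)| {
            let proj: f32 = row.iter().zip(query).map(|(a, b)| a * b).sum();
            if bit { proj } else { -proj }
        })
        .sum();
    (std::f32::consts::PI / 2.0) * (norm / m) * sum
}
```

With a realistic m (say d/4 rows of S) the correction concentrates around the true residual dot product; the two-row case is only illustrative.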

Full TurboQuant dot product:

⟨query, key⟩ ≈ PolarQuant_dot(query, key_pq) + QJL_correction(query, key_qjl)

Theoretical Guarantees

TurboQuant operates near the information-theoretic lower bound for vector quantization distortion. The authors prove that the combination of random rotation + polar quantization achieves near-optimal rate-distortion tradeoff for Gaussian-distributed data, and the QJL stage makes the dot product estimator unbiased.


Results

Benchmarks from this implementation (dim=128, seed=42):

Compression ratio vs accuracy

Mode   Bytes (FP16 → TQ)  Compression  Dot product error
2-bit  256 → 44           5.8×         ~5% (with QJL)
3-bit  256 → 60           4.3×         ~3%
4-bit  256 → 76           3.4×         ~1%

Top-K recall (100 keys, dim=128)

K       Recall
Top-1   100%
Top-5   100%
Top-10  90%

KV cache memory at scale

Model                   Context  FP16     TurboQuant 4-bit  Savings
Llama-3-8B (32h×128d)   8K       134 MB   ~39 MB            3.4×
Llama-3-70B (64h×128d)  32K      1.07 GB  ~318 MB           3.4×
GPT-4 style (96h×128d)  128K     6.29 GB  ~1.87 GB          3.4×

Installation

Add to your Cargo.toml:

[dependencies]
turboquant = { path = "path/to/turboquant" }

Dependencies: rand = "0.8", rand_distr = "0.4" (pulled automatically).


Usage

Basic encoding and dot product

use turboquant::TurboQuant;

// Create a TurboQuant instance for head_dim=128, 4-bit compression
let tq = TurboQuant::new_4bit(128, /*seed=*/ 42);

// When a new token is generated, encode its key vector once:
let key: Vec<f32> = vec![/* head_dim floats */];
let encoded_key = tq.encode(&key);

// Compute attention score (replaces q · k / sqrt(d)):
let query: Vec<f32> = vec![/* head_dim floats */];
let score = tq.dot_product(&query, &encoded_key);

// Memory used:
println!("{} bytes (vs {} FP16)", encoded_key.bytes(), 128 * 2);

KV Cache for transformer inference

use turboquant::KvCache;

let head_dim  = 128;
let n_heads   = 32;
let max_seq   = 8192;

let mut cache = KvCache::new(head_dim, n_heads, max_seq);

// For each new token (keys and values are [n_heads][head_dim]):
cache.append(&keys, &values);

// Compute multi-head attention output ([n_heads][head_dim]):
let output = cache.attend(&queries);

// Print memory stats:
cache.print_stats();
// → KV Cache [32 heads × 128 dim × 512 tokens]
//     FP16:       16777216 bytes (16384.0 KB)
//     TurboQuant:  4915200 bytes ( 4800.0 KB)
//     Compression: 3.41x

PolarQuant standalone

use turboquant::PolarQuant;

let pq = PolarQuant::new(
    128,  // dim
    4,    // angle_bits
    4,    // radius_bits
    42,   // seed
);

let v: Vec<f32> = vec![/* ... */];
let encoded  = pq.encode(&v);
let decoded  = pq.decode(&encoded);      // approximate reconstruction
let dot      = pq.dot_product_decoded(&query, &encoded);  // fast path

let stats = pq.compression_stats();
println!("{:.1}x compression", stats.compression_ratio_fp16);

Compression modes

// 4-bit: best accuracy, ~3.4x compression
let tq = TurboQuant::new_4bit(dim, seed);

// 3-bit: balanced, ~4.3x compression
let tq = TurboQuant::new_3bit(dim, seed);

// Custom: full control
let tq = TurboQuant::new(
    dim,
    2,        // angle_bits
    2,        // radius_bits
    dim / 4,  // n_qjl_projections
    seed,
);

println!("{}", tq.stats());
// → TurboQuant(dim=128) | FP16: 256B → Compressed: 76B (PQ:68B + QJL:8B) | 3.4x compression

Project Structure

turboquant/
├── Cargo.toml
├── README.md
├── LICENSE
├── src/
│   ├── lib.rs            Public API and re-exports
│   ├── rotation.rs       Random orthogonal matrix (Gram-Schmidt QR)
│   ├── bits.rs           Bit-packing for 1–8 bit quantization codes
│   ├── polar_quant.rs    PolarQuant encoder / decoder
│   ├── qjl.rs            QJL sign-bit compressor + dot product estimator
│   ├── turboquant.rs     TurboQuant (PolarQuant + QJL), attention_scores()
│   └── kv_cache.rs       Multi-head KV cache for transformer inference
└── examples/
    └── demo.rs           Full benchmark: compression, recall, memory savings

Module responsibilities

rotation     — Generates a fixed random orthogonal matrix R via Gram-Schmidt QR. Provides forward(v) = R·v and inverse(v) = Rᵀ·v. Used by PolarQuant to uniformize vector statistics before quantization.
bits         — Bit-packing: stores N codes of b bits each in ⌈N·b/8⌉ bytes. Handles arbitrary bit widths 1–8. Used by PolarQuant to store angle and radius codes compactly.
polar_quant  — Core PolarQuant algorithm: rotation → polar conversion → quantization with pre-computed grids. Also provides a fast dot-product path that avoids full decoding.
qjl          — QJL residual compression: projects the residual error onto m random vectors and stores 1 sign bit each. Estimates ⟨query, residual⟩ at inference time with the JL unbiased estimator.
turboquant   — Orchestrates the full pipeline: encode = PolarQuant + residual + QJL; dot_product = PolarQuant dot + QJL correction. Exposes attention_scores() for direct use in attention layers.
kv_cache     — Manages a compressed KV cache for all heads across all sequence positions. Provides append(keys, values) and attend(queries) matching the transformer attention interface.
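
A sketch of what the bits module's packing scheme might look like, for orientation only (LSB-first bit order is an arbitrary choice here; the crate's actual layout may differ):

```rust
/// Pack N codes of `bits` bits each into ceil(N*bits/8) bytes, LSB-first.
fn pack(codes: &[u8], bits: usize) -> Vec<u8> {
    let mut out = vec![0u8; (codes.len() * bits + 7) / 8];
    for (i, &code) in codes.iter().enumerate() {
        for b in 0..bits {
            if (code >> b) & 1 == 1 {
                let pos = i * bits + b; // absolute bit position in the output
                out[pos / 8] |= 1u8 << (pos % 8);
            }
        }
    }
    out
}

/// Recover n codes of `bits` bits each from the packed byte array.
fn unpack(packed: &[u8], bits: usize, n: usize) -> Vec<u8> {
    (0..n)
        .map(|i| {
            (0..bits).fold(0u8, |acc, b| {
                let pos = i * bits + b;
                acc | (((packed[pos / 8] >> (pos % 8)) & 1) << b)
            })
        })
        .collect()
}
```

Three 3-bit codes, for instance, occupy ⌈9/8⌉ = 2 bytes and round-trip exactly.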

Running the tests and demo

# Run all 12 unit tests
cargo test

# Run the full benchmark demo
cargo run --example demo

Expected demo output:

▶ Step 1: PolarQuant (rotation + polar quantization)
  2-bit | bytes: 256 → 36 (7.1x) | recon_err: 0.414 | dot_err: 0.195
  3-bit | bytes: 256 → 52 (4.9x) | recon_err: 0.238 | dot_err: 0.067
  4-bit | bytes: 256 → 68 (3.8x) | recon_err: 0.127 | dot_err: 0.021

▶ Step 2: TurboQuant = PolarQuant + QJL residual correction
  2-bit | 44 bytes | dot_err: PolarQuant=0.1948  TurboQuant=0.0496  (QJL improvement: 3.9x)

▶ Step 3: Top-K recall (100 keys, dim=128)
  Top-1  recall: 1/1 = 100%
  Top-5  recall: 5/5 = 100%
  Top-10 recall: 9/10 = 90%

▶ Step 4: KV Cache memory savings
  GPT-4 style: FP16: 6291 MB  →  TurboQuant: 1867 MB  (3.4x)

Limitations and known gaps

This is a research implementation faithful to the published algorithm. A production deployment would additionally want:

  • SIMD acceleration — the rotation matrix multiply is the hot path; AVX2/NEON intrinsics would give 4–8× speedup
  • Randomized Hadamard Transform — replaces the O(d²) dense rotation with an O(d log d) structured transform; same theoretical guarantees, much faster
  • INT4 kernel integration — fused dequantize + attention kernel (as in FlashAttention) to avoid materializing FP32 vectors
  • Sliding window / token eviction — the current KvCache uses a simple FIFO eviction policy; real systems use more sophisticated strategies
  • Per-layer tuning — different transformer layers benefit from different bit widths; early layers typically tolerate more compression
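
For reference, the O(d log d) structured transform mentioned above is built around the fast Walsh-Hadamard transform; the randomized variant additionally applies a random ±1 diagonal before it. A self-contained sketch of the core butterfly (not part of this crate):

```rust
/// In-place fast Walsh-Hadamard transform, O(d log d) for power-of-two d,
/// normalized so the transform is orthonormal (applying it twice is the identity).
fn fwht(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;     // butterfly: sum ...
                v[j + h] = a - b; // ... and difference
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}
```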

References

  • TurboQuant — Google Research blog post, March 2026: research.google/blog/turboquant
  • ICLR 2026 paper (to appear): TurboQuant: Redefining AI Efficiency with Extreme Compression
  • AISTATS 2026 paper (to appear): PolarQuant
  • Johnson, W. B. & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
  • Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.

License

MIT License — see LICENSE.

This is an independent implementation based on the publicly available research description. It is not affiliated with or endorsed by Google Research.
