This document describes the experimental methodology used to evaluate TurboQuant across different language models.
Definition: Compression ratio = uncompressed size / compressed size
Calculation:
Keys: (bits-1) × d + 1 × d + 16 + 16 bits per vector
Values: bits × d + 16 bits per vector
FP16 baseline: 16 bits per value
Interpretation:
- 3.0x = 1 MB KV cache becomes 333 KB
- Higher is better (more compression)
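The bit accounting above can be captured in a small helper. This is a sketch of the formulas as stated (keys get the (bits-1)-bit MSE stage plus a 1-bit QJL residual and two FP16 scalars; values get bits-per-dim plus one scalar); the head dimension d = 128 in the example is chosen only for illustration:

```python
def kv_compression_ratio(bits: int, d: int) -> float:
    """Compression ratio vs. an FP16 KV cache, per key/value vector pair.

    Keys:   (bits-1)*d MSE-quantized bits + 1*d QJL sign bits + two FP16 scalars
    Values: bits*d MSE-quantized bits + one FP16 scalar
    """
    key_bits = (bits - 1) * d + 1 * d + 16 + 16
    value_bits = bits * d + 16
    fp16_bits = 2 * 16 * d  # one key vector + one value vector at 16 bits/value
    return fp16_bits / (key_bits + value_bits)

# Example: 3-bit quantization at head dimension 128 (illustrative shape)
print(f"{kv_compression_ratio(3, 128):.2f}x")  # → 5.02x
```

Note that at d = 128 this reproduces the ~5.0x ratio shown for 3 bits in the sample output later in this document.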
Definition: Cosine similarity between the real and estimated attention score distributions
Calculation:
cos(real_scores, estimated_scores) = <real, estimated> / (||real|| × ||estimated||)
Interpretation:
- 1.0 = Perfect match
- 0.99 = Nearly identical distributions
- Used as primary quality metric (captures overall attention pattern)
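The calculation above is a plain cosine over flattened score vectors; a minimal stand-alone sketch:

```python
import math

def score_cosine(real, est):
    """cos(real, est) = <real, est> / (||real|| * ||est||)."""
    dot = sum(r * e for r, e in zip(real, est))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(real) * norm(est))

# Identical score vectors give a perfect match
print(round(score_cosine([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]), 6))  # → 1.0
```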
Definition: % of attention heads where the most-attended token is the same
Calculation:
matches = count(argmax(real_scores) == argmax(estimated_scores))
pct = 100 × matches / n_heads
Interpretation:
- 100% = All heads attend to same token
- 80% = 4/5 heads correct
- Important for next-token prediction accuracy
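A direct transcription of the top-1 match calculation, assuming scores are given per head as one score per cached token:

```python
def top1_match_pct(real_scores, est_scores):
    """% of heads where the most-attended token (argmax) agrees.

    Each argument is a list of per-head attention-score lists.
    """
    argmax = lambda xs: max(range(len(xs)), key=xs.__getitem__)
    matches = sum(argmax(r) == argmax(e) for r, e in zip(real_scores, est_scores))
    return 100.0 * matches / len(real_scores)

real = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
est  = [[0.2, 0.6, 0.2], [0.2, 0.5, 0.3]]
print(top1_match_pct(real, est))  # head 0 agrees (token 1), head 1 does not → 50.0
```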
Definition: % of heads where the real top-1 token appears in estimated top-5
Calculation:
matches = count(argmax(real) ∈ topk(estimated, k=5))
pct = 100 × matches / n_heads
Interpretation:
- More lenient than top-1
- 95% = Mostly correct rankings
- Shows if error is "close" or completely wrong
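The top-5 variant differs only in checking membership of the real top-1 in the estimated top-k; a sketch (the example uses k=2 to keep it small, the metric above uses the default k=5):

```python
def topk_match_pct(real_scores, est_scores, k=5):
    """% of heads where the real top-1 token appears in the estimated top-k."""
    matches = 0
    for r, e in zip(real_scores, est_scores):
        real_top1 = max(range(len(r)), key=r.__getitem__)
        est_topk = sorted(range(len(e)), key=e.__getitem__, reverse=True)[:k]
        matches += real_top1 in est_topk
    return 100.0 * matches / len(real_scores)

# One head: real top-1 is token 0; the estimate ranks it second, so top-2 still counts it
print(topk_match_pct([[0.5, 0.2, 0.1, 0.1, 0.1]],
                     [[0.3, 0.4, 0.1, 0.1, 0.1]], k=2))  # → 100.0
```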
- Synthetic Validation (`experiments/1_paper_reproduction/`)
  - Random unit vectors
  - Verifies theoretical bounds
  - No model involved
- Interactive Comparison (`experiments/2_multi_model_evaluation/interactive_with_real_kv.py`)
  - Real-time prompt-based testing
  - User enters custom prompts
  - Shows actual KV cache compression analysis
  - Measures memory, speed, and attention accuracy
  - No pre-built contexts needed
- Single Forward Pass (`experiments/2_multi_model_evaluation/`)
  - Load pre-trained model
  - One forward pass on long context
  - Capture full KV cache
  - Compress with TurboQuant
  - Compare attention scores
- Generation Benchmark (`experiments/2_multi_model_evaluation/benchmark_generation.py`)
  - Simulate full generation loop
  - Track KV cache growth across tokens
  - Measure actual memory savings during generation
  - Compare speed with/without compression
- Variable Context Lengths
  - 2K, 4K, 8K tokens
  - Tests compression stability across lengths
  - Reveals any context-dependent effects
Criteria:
- Open-source (reproducible)
- Varying sizes (3B to 13B)
- Different architectures (LLaMA, Phi, Mistral, Qwen)
- HuggingFace available
Selected Models:
- Qwen2.5-3B-Instruct (3.5GB, smallest, easiest)
- Microsoft Phi-2 (2.7GB, highly optimized)
- Meta LLaMA-2-7B (13GB, popular baseline)
- Mistral-7B-Instruct (13GB, modern architecture)
- Bit-widths: 2, 3, 4 bits
- Stage 1: (bits-1) for MSE quantization
- Stage 2: 1 bit for QJL residual correction
- Seed: Fixed per layer for reproducibility
For each model × bits configuration:
- Load Model: HuggingFace with 4-bit quantization (model weights only)
- Prepare Input:
  - Build long prompt with hidden "needle" fact
  - Filler text repeated to reach target length
  - Tokenize and pad to exact length
- Forward Pass: Single pass, capture KV cache
- Compression:
  - Create compressor for each layer
  - Compress keys (TurboQuantV2) and values (TurboQuantMSE)
- Scoring:
  - Compute real attention scores: Q @ K^T
  - Compute estimated scores: asymmetric estimator
  - Compare using metrics above
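The scoring step can be sketched end-to-end. This is a toy illustration, not the actual implementation: `quantize_dequantize` is a deliberately simple per-vector uniform quantizer standing in for the real TurboQuantV2/TurboQuantMSE codecs, and the asymmetric estimation is mirrored by scoring a full-precision query against dequantized keys:

```python
import math
import random

def quantize_dequantize(vec, bits):
    """Toy per-vector uniform quantizer -- a stand-in for the actual
    TurboQuant codecs, which use a different scheme."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / ((1 << bits) - 1) or 1.0
    return [lo + round((x - lo) / step) * step for x in vec]

def attention_scores(q, keys):
    """Unnormalized attention scores Q @ K^T for one query over all cached keys."""
    return [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

random.seed(0)  # fixed seed, mirroring the per-layer seeding described above
d, n = 16, 8
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

real = attention_scores(q, keys)
# Asymmetric estimator: uncompressed query against dequantized (compressed) keys
est = attention_scores(q, [quantize_dequantize(k, bits=3) for k in keys])

norm = lambda v: math.sqrt(sum(x * x for x in v))
cos = sum(r * e for r, e in zip(real, est)) / (norm(real) * norm(est))
print(f"cosine similarity: {cos:.4f}")
```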
Each evaluation outputs:
```json
{
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "bits": 3,
  "by_context": {
    "2048": {
      "seq_len": 2048,
      "compression_ratio": 5.0,
      "cosine_similarity": 0.9945,
      "top1_match_pct": 86.0,
      "top5_match_pct": 94.0
    }
  }
}
```

Results aggregated across:
- All layers (36-80 layers per model)
- All heads (8-96 heads per model)
- All contexts (2-8K tokens)
Shows average metrics across contexts for each model × bits combination.
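Aggregation over contexts can be sketched as a plain unweighted mean per metric (whether the actual scripts weight contexts or layers differently is not specified here, so this is an assumption; the 4096 numbers in the example are invented purely for illustration):

```python
def aggregate(result):
    """Unweighted mean of each metric across the contexts in one result file."""
    entries = list(result["by_context"].values())
    metrics = ("compression_ratio", "cosine_similarity",
               "top1_match_pct", "top5_match_pct")
    return {m: sum(e[m] for e in entries) / len(entries) for m in metrics}

# The 2048 entry matches the example output above; the 4096 entry is illustrative only
result = {
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "bits": 3,
    "by_context": {
        "2048": {"seq_len": 2048, "compression_ratio": 5.0,
                 "cosine_similarity": 0.9945, "top1_match_pct": 86.0,
                 "top5_match_pct": 94.0},
        "4096": {"seq_len": 4096, "compression_ratio": 5.0,
                 "cosine_similarity": 0.9930, "top1_match_pct": 84.0,
                 "top5_match_pct": 93.0},
    },
}
print(aggregate(result))
```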
- Compression vs Bits: Compression ratio curves
- Accuracy vs Bits: Cosine similarity vs quantization bits
- Memory Usage: Absolute memory savings across models
- Python 3.10+
- PyTorch 2.0+ with CUDA
- HuggingFace transformers, bitsandbytes
- 12GB+ GPU VRAM
```bash
python evaluate_model.py \
  --model Qwen/Qwen2.5-3B-Instruct \
  --bits 3 \
  --contexts 2048 4096 8192 \
  --output qwen/results_3bit.json
```

The new `interactive_with_real_kv.py` tool addresses several limitations:
- Real KV Cache Compression: Actually compresses KV cache and analyzes impact
- Custom Prompts: Users can test any prompt, not just pre-built contexts
- Real-time Feedback: Immediate compression metrics (memory, speed, accuracy)
- Actual Generation: Not just a single forward pass; full sequences can be analyzed
Example improvement:
- Original limitation: "only evaluates one token's attention"
- New tool: analyzes the actual KV cache at each generation step
Interactive web interface for side-by-side comparison:
- Dual Generation: Generates text with both the original and compressed KV cache
  - Same seed for deterministic output
  - Shows actual output differences due to compression
- Attention Analysis: Displays how many heads might pick different tokens
  - "Heads at Risk" metric: % of heads where the compressed and original top-1 tokens differ
  - Indicates the likelihood of a different generation path
- Visual Metrics:
  - Side-by-side text comparison
  - KV cache memory metrics
  - Compression ratio and speed
  - Attention quality (cosine similarity, top-1/top-5 match)
- Model/Bitwidth Selection: Easy testing across different models and quantization levels
Features:
- User-friendly web interface
- No command-line knowledge needed
- Real-time analysis and results
- Model and quantization selection via dropdown
The benchmark_generation.py tool simulates real generation:
- Tracks KV cache growth during generation
- Measures memory savings across full sequence
- Shows compression overhead in generation loop
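KV cache growth during generation can be modeled directly from the per-vector bit accounting at the top of this document. A sketch; the LLaMA-2-7B-like shape (32 layers, 32 KV heads, head_dim 128) is assumed for illustration and is not taken from the benchmark itself:

```python
def kv_cache_bits(n_tokens, n_layers, n_heads, d, bits):
    """Total KV-cache size in bits after n_tokens, FP16 vs. compressed.

    Per head and token: key = (bits-1)*d + d + 32 bits, value = bits*d + 16 bits,
    matching the per-vector accounting at the top of this document.
    """
    per_token_fp16 = n_layers * n_heads * 2 * 16 * d  # keys + values at FP16
    per_token_comp = n_layers * n_heads * (
        ((bits - 1) * d + d + 32) + (bits * d + 16))
    return n_tokens * per_token_fp16, n_tokens * per_token_comp

# Illustrative LLaMA-2-7B-like shape: 32 layers, 32 heads, head_dim 128, 3 bits
for n_tokens in (512, 1024, 2048):
    fp16, comp = kv_cache_bits(n_tokens, 32, 32, 128, bits=3)
    print(f"{n_tokens:5d} tokens: {fp16 / 8 / 2**20:7.1f} MiB -> {comp / 8 / 2**20:6.1f} MiB")
```

The ratio between the two columns stays constant (~5x at 3 bits), but the absolute savings grow linearly with the number of generated tokens, which is why the benchmark tracks growth across the full sequence.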
- Custom Compression in Generation (CLI tool): The interactive tool doesn't apply compression during generation (it generates normally, then analyzes compression)
  - CLI limitation: Analyzes compression impact on attention scores post-hoc
  - Streamlit solution: Generates with a deterministic seed to show actual output differences
  - Workaround: `benchmark_generation.py` simulates compression impact on memory
- Streamlit Web UI Limitations:
  - Single prompt test per run (not an interactive loop like the CLI)
  - Requires seed-based reproducibility (limited variance analysis)
  - Long prompts (2K+ tokens) significantly increase generation time
- General Limitations:
  - Long Context: Limited to 8K tokens in a single forward pass due to GPU memory
  - Different Hardware: Results depend on the GPU (VRAM, compute capability)
  - Fixed Codebooks: Seed-fixed for reproducibility; may not represent variance across different random seeds
- Full Generative Sampling: Apply compression during generation loop (not post-hoc analysis)
- Variance Analysis: Multiple random seeds to estimate output variance and stability
- Longer Contexts: Support 16K+ tokens with efficient quantization strategies
- Latency Benchmarking: Direct inference speed comparison with compression overhead
- Method Comparison: Compare with other compression methods (KIVI, PolarQuant, etc.)
- Batch Generation: Test batch inference with compression
- Deployment Options: Streamlit Cloud, HuggingFace Spaces, Docker containers
- Interactive Sessions: Extended multi-turn conversation analysis in Streamlit