This document describes the experimental methodology used to evaluate TurboQuant across different language models.
Definition: Compression ratio = uncompressed size / compressed size
Calculation:
Keys: (bits-1) × d + 1 × d + 16 + 16 bits per vector
Values: bits × d + 16 bits per vector
FP16 baseline: 16 bits per value
Interpretation:
- 3.0x = 1 MB KV cache becomes 333 KB
- Higher is better (more compression)
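The bit accounting above can be captured in a small helper. This is a sketch of the formulas as stated (keys get the (bits-1)-bit MSE stage plus a 1-bit QJL residual and two FP16 scalars; values get bits-per-dim plus one scalar); the head dimension d = 128 in the example is chosen only for illustration:

```python
def kv_compression_ratio(bits: int, d: int) -> float:
    """Compression ratio vs. an FP16 KV cache, per key/value vector pair.

    Keys:   (bits-1)*d MSE-quantized bits + 1*d QJL sign bits + two FP16 scalars
    Values: bits*d MSE-quantized bits + one FP16 scalar
    """
    key_bits = (bits - 1) * d + 1 * d + 16 + 16
    value_bits = bits * d + 16
    fp16_bits = 2 * 16 * d  # one key vector + one value vector at 16 bits/value
    return fp16_bits / (key_bits + value_bits)

# Example: 3-bit quantization at head dimension 128 (illustrative shape)
print(f"{kv_compression_ratio(3, 128):.2f}x")  # → 5.02x
```

Note that at d = 128 this reproduces the ~5.0x ratio shown for 3 bits in the sample output later in this document.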
Definition: Cosine similarity between the real and estimated attention score distributions
Calculation:
cos(real_scores, estimated_scores) = <real, estimated> / (||real|| × ||estimated||)
Interpretation:
- 1.0 = Perfect match
- 0.99 = Nearly identical distributions
- Used as primary quality metric (captures overall attention pattern)
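The calculation above is a plain cosine over flattened score vectors; a minimal stand-alone sketch:

```python
import math

def score_cosine(real, est):
    """cos(real, est) = <real, est> / (||real|| * ||est||)."""
    dot = sum(r * e for r, e in zip(real, est))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(real) * norm(est))

# Identical score vectors give a perfect match
print(round(score_cosine([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]), 6))  # → 1.0
```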
Definition: % of attention heads where the most-attended token is the same
Calculation:
matches = count(argmax(real_scores) == argmax(estimated_scores))
pct = 100 × matches / n_heads
Interpretation:
- 100% = All heads attend to same token
- 80% = 4/5 heads correct
- Important for next-token prediction accuracy
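A direct transcription of the top-1 match calculation, assuming scores are given per head as one score per cached token:

```python
def top1_match_pct(real_scores, est_scores):
    """% of heads where the most-attended token (argmax) agrees.

    Each argument is a list of per-head attention-score lists.
    """
    argmax = lambda xs: max(range(len(xs)), key=xs.__getitem__)
    matches = sum(argmax(r) == argmax(e) for r, e in zip(real_scores, est_scores))
    return 100.0 * matches / len(real_scores)

real = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
est  = [[0.2, 0.6, 0.2], [0.2, 0.5, 0.3]]
print(top1_match_pct(real, est))  # head 0 agrees (token 1), head 1 does not → 50.0
```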
Definition: % of heads where the real top-1 token appears in estimated top-5
Calculation:
matches = count(argmax(real) ∈ topk(estimated, k=5))
pct = 100 × matches / n_heads
Interpretation:
- More lenient than top-1
- 95% = Mostly correct rankings
- Shows if error is "close" or completely wrong
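The top-5 variant differs only in checking membership of the real top-1 in the estimated top-k; a sketch (the example uses k=2 to keep it small, the metric above uses the default k=5):

```python
def topk_match_pct(real_scores, est_scores, k=5):
    """% of heads where the real top-1 token appears in the estimated top-k."""
    matches = 0
    for r, e in zip(real_scores, est_scores):
        real_top1 = max(range(len(r)), key=r.__getitem__)
        est_topk = sorted(range(len(e)), key=e.__getitem__, reverse=True)[:k]
        matches += real_top1 in est_topk
    return 100.0 * matches / len(real_scores)

# One head: real top-1 is token 0; the estimate ranks it second, so top-2 still counts it
print(topk_match_pct([[0.5, 0.2, 0.1, 0.1, 0.1]],
                     [[0.3, 0.4, 0.1, 0.1, 0.1]], k=2))  # → 100.0
```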
- Synthetic Validation (`experiments/1_paper_reproduction/`)
  - Random unit vectors
  - Verifies theoretical bounds
  - No model involved
- Interactive Comparison (`experiments/2_multi_model_evaluation/interactive_with_real_kv.py`)
  - Real-time prompt-based testing
  - User enters custom prompts
  - Shows actual KV cache compression analysis
  - Measures memory, speed, and attention accuracy
  - No pre-built contexts needed
- Single Forward Pass (`experiments/2_multi_model_evaluation/`)
  - Load pre-trained model
  - One forward pass on long context
  - Capture full KV cache
  - Compress with TurboQuant
  - Compare attention scores
- Generation Benchmark (`experiments/2_multi_model_evaluation/benchmark_generation.py`)
  - Simulate full generation loop
  - Track KV cache growth across tokens
  - Measure actual memory savings during generation
  - Compare speed with/without compression
- Variable Context Lengths
  - 2K, 4K, 8K tokens
  - Tests compression stability across lengths
  - Reveals any context-dependent effects
Criteria:
- Open-source (reproducible)
- Varying sizes (3B to 13B)
- Different architectures (LLaMA, Phi, Mistral, Qwen)
- HuggingFace available
Selected Models:
- Qwen2.5-3B-Instruct (3.5GB, smallest, easiest)
- Microsoft Phi-2 (2.7GB, highly optimized)
- Meta LLaMA-2-7B (13GB, popular baseline)
- Mistral-7B-Instruct (13GB, modern architecture)
- Bit-widths: 2, 3, 4 bits
- Stage 1: (bits-1) for MSE quantization
- Stage 2: 1 bit for QJL residual correction
- Seed: Fixed per layer for reproducibility
For each model × bits configuration:
- Load Model: HuggingFace with 4-bit quantization (model weights only)
- Prepare Input:
  - Build long prompt with hidden "needle" fact
  - Filler text repeated to reach target length
  - Tokenize and pad to exact length
- Forward Pass: Single pass, capture KV cache
- Compression:
  - Create compressor for each layer
  - Compress keys (TurboQuantV2) and values (TurboQuantMSE)
- Scoring:
  - Compute real attention scores: Q @ K^T
  - Compute estimated scores: asymmetric estimator
  - Compare using metrics above
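The scoring step can be sketched end-to-end. This is a toy illustration, not the actual implementation: `quantize_dequantize` is a deliberately simple per-vector uniform quantizer standing in for the real TurboQuantV2/TurboQuantMSE codecs, and the asymmetric estimation is mirrored by scoring a full-precision query against dequantized keys:

```python
import math
import random

def quantize_dequantize(vec, bits):
    """Toy per-vector uniform quantizer -- a stand-in for the actual
    TurboQuant codecs, which use a different scheme."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / ((1 << bits) - 1) or 1.0
    return [lo + round((x - lo) / step) * step for x in vec]

def attention_scores(q, keys):
    """Unnormalized attention scores Q @ K^T for one query over all cached keys."""
    return [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

random.seed(0)  # fixed seed, mirroring the per-layer seeding described above
d, n = 16, 8
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

real = attention_scores(q, keys)
# Asymmetric estimator: uncompressed query against dequantized (compressed) keys
est = attention_scores(q, [quantize_dequantize(k, bits=3) for k in keys])

norm = lambda v: math.sqrt(sum(x * x for x in v))
cos = sum(r * e for r, e in zip(real, est)) / (norm(real) * norm(est))
print(f"cosine similarity: {cos:.4f}")
```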
Each evaluation outputs:
```json
{
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "bits": 3,
  "by_context": {
    "2048": {
      "seq_len": 2048,
      "compression_ratio": 5.0,
      "cosine_similarity": 0.9945,
      "top1_match_pct": 86.0,
      "top5_match_pct": 94.0
    }
  }
}
```

Results aggregated across:
- All layers (36-80 layers per model)
- All heads (8-96 heads per model)
- All contexts (2-8K tokens)
Shows average metrics across contexts for each model × bits combination.
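Aggregation over contexts can be sketched as a plain unweighted mean per metric (whether the actual scripts weight contexts or layers differently is not specified here, so this is an assumption; the 4096 numbers in the example are invented purely for illustration):

```python
def aggregate(result):
    """Unweighted mean of each metric across the contexts in one result file."""
    entries = list(result["by_context"].values())
    metrics = ("compression_ratio", "cosine_similarity",
               "top1_match_pct", "top5_match_pct")
    return {m: sum(e[m] for e in entries) / len(entries) for m in metrics}

# The 2048 entry matches the example output above; the 4096 entry is illustrative only
result = {
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "bits": 3,
    "by_context": {
        "2048": {"seq_len": 2048, "compression_ratio": 5.0,
                 "cosine_similarity": 0.9945, "top1_match_pct": 86.0,
                 "top5_match_pct": 94.0},
        "4096": {"seq_len": 4096, "compression_ratio": 5.0,
                 "cosine_similarity": 0.9930, "top1_match_pct": 84.0,
                 "top5_match_pct": 93.0},
    },
}
print(aggregate(result))
```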
- Compression vs Bits: Compression ratio curves
- Accuracy vs Bits: Cosine similarity vs quantization bits
- Memory Usage: Absolute memory savings across models
- Python 3.10+
- PyTorch 2.0+ with CUDA
- HuggingFace transformers, bitsandbytes
- 12GB+ GPU VRAM
```bash
python evaluate_model.py \
  --model Qwen/Qwen2.5-3B-Instruct \
  --bits 3 \
  --contexts 2048 4096 8192 \
  --output qwen/results_3bit.json
```

The new `interactive_with_real_kv.py` tool addresses several limitations:
- Real KV Cache Compression: Actually compresses KV cache and analyzes impact
- Custom Prompts: Users can test any prompt, not just pre-built contexts
- Real-time Feedback: Immediate compression metrics (memory, speed, accuracy)
- Actual Generation: Not just a single forward pass; full sequences can be analyzed
Example improvement:
- Original limitation: "only evaluates one token's attention"
- New tool: analyzes the actual KV cache at each generation step
Interactive web interface for side-by-side comparison:
- Dual Generation: Generates text with both the original and compressed KV cache
  - Same seed for deterministic output
  - Shows actual output differences due to compression
- Attention Analysis: Displays how many heads might pick different tokens
  - "Heads at Risk" metric: % of heads where the compressed and original top-1 tokens differ
  - Indicates the likelihood of a different generation path
- Visual Metrics:
  - Side-by-side text comparison
  - KV cache memory metrics
  - Compression ratio and speed
  - Attention quality (cosine similarity, top-1/top-5 match)
- Model/Bitwidth Selection: Easy testing across different models and quantization levels
Features:
- User-friendly web interface
- No command-line knowledge needed
- Real-time analysis and results
- Model and quantization selection via dropdown
The benchmark_generation.py tool simulates real generation:
- Tracks KV cache growth during generation
- Measures memory savings across full sequence
- Shows compression overhead in generation loop
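KV cache growth during generation can be modeled directly from the per-vector bit accounting at the top of this document. A sketch; the LLaMA-2-7B-like shape (32 layers, 32 KV heads, head_dim 128) is assumed for illustration and is not taken from the benchmark itself:

```python
def kv_cache_bits(n_tokens, n_layers, n_heads, d, bits):
    """Total KV-cache size in bits after n_tokens, FP16 vs. compressed.

    Per head and token: key = (bits-1)*d + d + 32 bits, value = bits*d + 16 bits,
    matching the per-vector accounting at the top of this document.
    """
    per_token_fp16 = n_layers * n_heads * 2 * 16 * d  # keys + values at FP16
    per_token_comp = n_layers * n_heads * (
        ((bits - 1) * d + d + 32) + (bits * d + 16))
    return n_tokens * per_token_fp16, n_tokens * per_token_comp

# Illustrative LLaMA-2-7B-like shape: 32 layers, 32 heads, head_dim 128, 3 bits
for n_tokens in (512, 1024, 2048):
    fp16, comp = kv_cache_bits(n_tokens, 32, 32, 128, bits=3)
    print(f"{n_tokens:5d} tokens: {fp16 / 8 / 2**20:7.1f} MiB -> {comp / 8 / 2**20:6.1f} MiB")
```

The ratio between the two columns stays constant (~5x at 3 bits), but the absolute savings grow linearly with the number of generated tokens, which is why the benchmark tracks growth across the full sequence.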
- Custom Compression in Generation (CLI tool): The interactive tool doesn't apply compression during generation (it generates normally, then analyzes compression)
  - CLI limitation: Analyzes compression impact on attention scores post-hoc
  - Streamlit solution: Generates with a deterministic seed to show actual output differences
  - Workaround: `benchmark_generation.py` simulates compression impact on memory
- Streamlit Web UI Limitations:
  - Single prompt test per run (not an interactive loop like the CLI)
  - Requires seed-based reproducibility (limited variance analysis)
  - Long prompts (2K+ tokens) significantly increase generation time
- General Limitations:
  - Long Context: Limited to 8K tokens in a single forward pass due to GPU memory
  - Different Hardware: Results depend on the GPU (VRAM, compute capability)
  - Fixed Codebooks: Seed-fixed for reproducibility; may not represent variance across different random seeds
- Full Generative Sampling: Apply compression during generation loop (not post-hoc analysis)
- Variance Analysis: Multiple random seeds to estimate output variance and stability
- Longer Contexts: Support 16K+ tokens with efficient quantization strategies
- Latency Benchmarking: Direct inference speed comparison with compression overhead
- Method Comparison: Compare with other compression methods (KIVI, PolarQuant, etc.)
- Batch Generation: Test batch inference with compression
- Deployment Options: Streamlit Cloud, HuggingFace Spaces, Docker containers
- Interactive Sessions: Extended multi-turn conversation analysis in Streamlit