Fast compressed approximate nearest-neighbor search. Pure NumPy. No heavy dependencies.
hadamax implements the TurboQuant compression pipeline — randomized Hadamard transform followed by optimal Gaussian scalar quantization (Lloyd-Max) — as a self-contained Python library for embedding vector search. It achieves 8–12× compression with >0.92 recall@10 against float32 brute-force, using only NumPy.
pip install hadamax
import numpy as np
from hadamax import HadaMaxIndex
# Build index
idx = HadaMaxIndex(dim=384, bits=4) # 4-bit, ~8x compression
idx.add_batch(ids=list(range(N)), vectors=embeddings)
# Query
results = idx.search(query_vector, k=10) # [(id, score), ...]
# Persist
idx.save("my_index.hdmx")
idx2 = HadaMaxIndex.load("my_index.hdmx") # atomic save, v1/v2 compatibleModern embedding models produce float32 vectors of dimension d ∈ {384, 768, 1536}.
Storing N vectors requires 4·N·d bytes; brute-force search costs O(N·d) per query.
For N = 1M, d = 384: 1.5 GB RAM, with inner products dominating inference time.
Product Quantization (PQ) splits vectors into M sub-vectors and quantizes each independently. It is effective but requires training a K-means codebook per dataset. Random Binary Quantization (RaBitQ, 1-bit) is fast but coarse.
TurboQuant (Zandieh et al., ICLR 2026, arXiv:2504.19874) achieves near-optimal distortion at b bits per coordinate without training codebooks, by first rotating the space with a randomized Hadamard transform to make coordinates approximately Gaussian, then quantizing each coordinate independently with the optimal scalar quantizer for N(0,1).
Given a raw embedding v ∈ ℝᵈ, compute the unit vector v̂ = v / ‖v‖ and store
‖v‖ separately (float32, 4 bytes per vector).
Pad v̂ to the next power of 2 (d' = 2^⌈log₂ d⌉), then apply:
x = (1/√d') · H · D · v̂
where:
D = diag(σ₁, …, σ_d')— diagonal matrix of i.i.d. ±1 random signs (seed-deterministic)H— unnormalized Walsh-Hadamard matrix (butterfly pattern)
By the Johnson-Lindenstrauss lemma, each coordinate xᵢ ≈ N(0, 1/d').
After rescaling x̃ = x · √d', the coordinates are approximately N(0,1)
regardless of the original distribution of v.
Complexity: O(d log d) — no matrix multiplication, no codebook training.
The optimal scalar quantizer for N(0,1) at b bits partitions ℝ into 2^b intervals
and assigns each the conditional mean as reconstruction value.
These boundaries and centroids are precomputed and hardcoded in hadamax._codebooks
(no scipy required at runtime):
| bits | levels | distortion (MSE) | bytes/coord (disk) |
|---|---|---|---|
| 2 | 4 | 0.1175 | 0.25 |
| 3 | 8 | 0.0311 | 0.375 |
| 4 | 16 | 0.0077 | 0.50 |
The quantized vector is stored as a uint8 index matrix, bit-packed to b/8
bytes per coordinate on disk.
At search time the query q is rotated (not quantized) and the approximate
cosine similarity is computed as:
score(q, v) = (1/d') · Σᵢ centroid[idx_qᵢ] · centroid[idx_vᵢ]
This is a single float16 matrix–vector product against the cached centroid expansions.
The MSE quantizer introduces a small systematic downward bias. The use_prod=True mode
corrects this using a Quantized Johnson-Lindenstrauss (QJL) residual:
Build time (per stored vector):
- Quantize at
(b-1)bits MSE, compute residualr = x̃ - x̃_MSE - Store
sign(S·r)as a 1-bit vector (int8 ±1 in practice), whereS ∈ ℝ^(d'×d')is a fixed random Gaussian matrix - Store
‖r‖ / √d'(one float32 per vector)
Query time (correction term):
correctionᵢ = √(π/2) / d' · ‖rᵢ‖ · dot(S·q̂, sign(S·rᵢ))
final_scoreᵢ = mse_scoreᵢ + correctionᵢ
This follows from Lemma 4 of Zandieh et al. (2025):
E[sign(S·r)] = √(2/π) · S·r / ‖S·r‖, giving an unbiased estimate of ⟨r, q̂⟩.
When to use use_prod=True:
- When you need accurate inner product magnitudes (KV-cache, attention approximation)
- Not recommended for pure ranking/NNS — the added QJL variance degrades recall@k relative to MSE-only at equal total bits
For N vectors of dimension d = 384 (BGE-small):
| Backend | Bytes/vector (disk) | Ratio vs float32 |
|---|---|---|
| float32 | 1 536 | 1.0× |
| 4-bit hadamax | 192 + 4 | 7.9× |
| 3-bit hadamax | 144 + 4 | 10.4× |
| 2-bit hadamax | 96 + 4 | 15.4× |
| int8 (naïve) | 384 + 4 | 3.9× |
The 4-byte overhead is the per-vector norm. In RAM, indices are stored as uint8 (~3× vs float32); bit-packing applies on disk (~8× vs float32 at 4-bit).
Measured on synthetic unit-sphere vectors (d=384, N=10 000, 100 queries).
Baseline: exact cosine float32 brute-force.
| bits | recall@1 | recall@10 | recall@50 |
|---|---|---|---|
| 2 | 0.72 | 0.83 | 0.91 |
| 3 | 0.81 | 0.91 | 0.96 |
| 4 | 0.86 | 0.93 | 0.95 |
Recall improves with clustered (real-world) data. On BGE-small-en embeddings from mixed document corpora, 4-bit achieves recall@10 ≈ 0.95.
Note on published results: The TurboQuant paper (Zandieh et al., 2025) reports recall up to 0.99, measured against HNSW graph navigation (not brute-force float32), on GloVe
d=200data, using recall@1 with largek_probe. These conditions differ from the above; both results are correct under their respective definitions.
Offset Size Field
──────────────────────────────────────────────────
0 4 B magic: "HDMX"
4 4 B version: uint32 (1 or 2)
8 4 B dim: uint32 — original embedding dimension
12 4 B bits: uint32 — total bits (2, 3, or 4)
16 4 B seed: uint32 — rotation seed
20 4 B n: uint32 — number of stored vectors
24 4 B flags: uint32 — bit-0: use_prod [v2 only]
──────────────────────────────────────────────────
28 4 B packed_len: uint32
32 * indices: bit-packed uint8 MSE indices
n×4 B norms: float32 per-vector original norms
[prod only]
n×d' B qjl_signs: int8 sign(S·r) per vector
n×4 B rnorms: float32 ‖r‖/√d per vector
──────────────────────────────────────────────────
n×(2+L) ids: uint16-length-prefixed UTF-8 strings
Saves are atomic on POSIX: writes to .hdmx.tmp then os.replace().
Backward compatible: v1 files (mse-only) load correctly in any version.
| Parameter | Type | Default | Description |
|---|---|---|---|
dim |
int | — | Embedding dimension |
bits |
int | 4 | Bits per coordinate: 2, 3, or 4 |
seed |
int | 0 | Rotation seed — must be consistent across build and query |
use_prod |
bool | False | Enable QJL unbiased estimator (requires bits ≥ 3) |
idx.add(id, vector) # Add one vector
idx.add_batch(ids, vectors) # Add N vectors (~50x faster than loop)
idx.delete(id) -> bool # Remove by id, O(1) lookup
idx.search(query, k=10) -> list # [(id, score), ...] descending
idx.save(path) # Atomic binary save to .hdmx
HadaMaxIndex.load(path) # Load from .hdmx file
idx.stats() -> dict # Compression / memory diagnostics
len(idx) # Number of stored vectors
repr(idx) # HadaMaxIndex(dim=384, bits=4, mode=mse, n=1000)hadamax implements the core compression pipeline from:
Zandieh, A., Daliri, M., Hadian, A., & Mirrokni, V. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874
The same algorithm was published concurrently as "PolarQuant" at AISTATS 2026.
Both names were already taken on PyPI; hadamax = Hadamard + Lloyd-Max,
named after its two core operations.
Key contributions of this implementation over the reference:
- No scipy — codebooks hardcoded, numpy is the only runtime dependency
- Batch WHT — single O(n·d·log d) call for bulk inserts (~50x faster than loop)
- Float16 cache — centroid expansions in half precision, ~2x faster matmul
- O(1) delete —
_id_to_posdict + position compaction - Atomic saves —
.hdmx.tmp→os.replace()pattern - Versioned format — v1/v2 both loadable, forward-compatible flags field
pip install hadamaxRequirements: Python >= 3.10, NumPy >= 1.24. No other runtime dependencies.
For development:
git clone https://github.com/stffns/hadamax
cd hadamax
pip install -e .
pytest tests/ -vMIT © 2025 Jayson Steffens.
The TurboQuant algorithm is described in arXiv:2504.19874 by Zandieh et al. (Google Research / ICLR 2026). This package is an independent implementation.