perf: cache-friendly approximate scoring via transposed centroid layout #26

dilpreet92 wants to merge 1 commit into lightonai:main
Conversation
Transpose the query-centroid score matrix from [num_tokens × num_centroids] to [num_centroids × num_tokens] before approximate scoring, making each centroid's scores contiguous in memory for sequential access. Replace row-major random-access scoring and sparse HashMap-based scoring with a single SIMD-friendly implementation using slice-based zip iterators. Add software prefetching on x86_64 to hide L3 latency. Replace the O(N log N) full sort with an O(N) select_nth_unstable partial sort.

Benchmark on c6a.4xlarge with 1.18B embeddings, top_k=4096:

- Avg latency: 7.4s → 2.0s (3.7x faster)
- QPS: 0.1 → 0.5 (5x improvement)
- Memory unchanged (~9 GB); results mathematically identical.
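The transpose-then-scan pattern described above can be sketched as follows. This is a minimal illustration with plain `Vec<f32>` in place of the crate's `Array2`; the function and parameter names (`transpose`, `approximate_score`, `codes`) are illustrative, not the PR's actual API:

```rust
// Transpose a row-major [num_tokens x num_centroids] score matrix into
// [num_centroids x num_tokens], so each centroid's per-token scores are
// one contiguous slice.
fn transpose(scores: &[f32], num_tokens: usize, num_centroids: usize) -> Vec<f32> {
    let mut t = vec![0.0f32; scores.len()];
    for tok in 0..num_tokens {
        for c in 0..num_centroids {
            // row-major source index -> transposed destination index
            t[c * num_tokens + tok] = scores[tok * num_centroids + c];
        }
    }
    t
}

// MaxSim-style accumulation over a document's centroid codes: each code's
// scores are now a contiguous slice, so the zip over slices is sequential,
// branch-light, and easy for the compiler to auto-vectorize.
fn approximate_score(transposed: &[f32], num_tokens: usize, codes: &[u32]) -> f32 {
    let mut maxima = vec![f32::NEG_INFINITY; num_tokens];
    for &code in codes {
        let row = &transposed[code as usize * num_tokens..][..num_tokens];
        for (m, &s) in maxima.iter_mut().zip(row) {
            if s > *m {
                *m = s;
            }
        }
    }
    // MaxSim: sum over query tokens of the per-token maxima
    maxima.iter().sum()
}
```

The key property is that the inner loop touches `num_tokens` consecutive floats per code instead of `num_tokens` values strided `num_centroids` apart.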
Hi @dilpreet92, this is a very cool MR, I'm benchmarking it. It also updates some core code of the algorithm, so I'll be careful with the review. I'm not a huge fan of the transposition and need to check the actual memory-usage impact further; I'd welcome any chart or deeper analysis beyond what's in the current MR :)
Thanks for benchmarking it @raphaelsty. Here's the memory breakdown: the original code already allocates query_centroid_scores as a [num_tokens × num_centroids] Array2 (e.g. 128 × 65K × 4 bytes = 32 MB). The transpose creates a second copy of the same size, so peak per-query memory temporarily goes from ~32 MB to ~64 MB during scoring.
If the temporary doubling is a concern, I can refactor to an in-place transpose that consumes the original matrix, keeping peak memory identical to before. Let me know if you'd prefer that approach. I'm already running this on my production system :)
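For reference, an in-place rectangular transpose along the lines discussed above can be done with cycle-following. This is a hypothetical sketch, not the PR's code; the permutation `dest = (src * rows) mod (n - 1)` is the standard row-major transposition mapping, and the `visited` buffer here uses one byte per element (a bitset would cut that to one bit, versus a full second 32 MB copy for the out-of-place version):

```rust
// Cycle-following in-place transpose: permutes a row-major [rows x cols]
// buffer into row-major [cols x rows] layout without a second full copy.
fn transpose_in_place(buf: &mut [f32], rows: usize, cols: usize) {
    let n = rows * cols;
    if n <= 1 {
        return;
    }
    // Tracks which indices have already been placed; the last index (n - 1)
    // is a fixed point of the permutation and needs no move.
    let mut visited = vec![false; n];
    for start in 0..n - 1 {
        if visited[start] {
            continue;
        }
        // Walk the permutation cycle starting at `start`, carrying the
        // displaced value forward until the cycle closes.
        let mut src = start;
        let mut val = buf[src];
        loop {
            visited[src] = true;
            let dest = (src * rows) % (n - 1);
            let displaced = buf[dest];
            buf[dest] = val;
            val = displaced;
            src = dest;
            if src == start {
                break;
            }
        }
    }
}
```

The trade-off is more index arithmetic and cache-unfriendly access during the transpose itself, which is why the out-of-place copy is often preferred when the temporary doubling is acceptable.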
Summary
- Transpose the score matrix from [num_tokens × num_centroids] to [num_centroids × num_tokens] before approximate scoring, making each centroid's scores contiguous in memory
- Replace row-major random-access scoring (approximate_score_mmap) and sparse HashMap-based scoring (approximate_score_sparse) with a single SIMD-friendly implementation using slice-based zip iterators
- Replace the O(N log N) full sort with an O(N) select_nth_unstable partial sort, then sort only the small top-k set

Benchmark
Setup: c6a.4xlarge (AMD EPYC, 16 vCPU, 32 GB RAM), 1.18B embeddings, 94 GB index, top_k=4096, 20 sequential queries
Memory usage unchanged (~9 GB). Results are mathematically identical — the transposed layout computes the same MaxSim scores with better cache locality.
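The partial-sort change from the summary can be sketched like this (illustrative types; the PR's actual candidate representation may differ):

```rust
// Top-k selection without sorting all N candidates: select_nth_unstable_by
// partitions in O(N) average time so the k highest-scoring pairs occupy the
// front of the slice, then only those k elements are sorted.
fn top_k(mut scored: Vec<(u32, f32)>, k: usize) -> Vec<(u32, f32)> {
    let k = k.min(scored.len());
    if k == 0 {
        return Vec::new();
    }
    // Comparator is reversed (b vs a) so larger scores sort first;
    // f32::total_cmp gives the total order that floats' PartialOrd lacks.
    scored.select_nth_unstable_by(k - 1, |a, b| b.1.total_cmp(&a.1));
    let mut top: Vec<_> = scored[..k].to_vec();
    top.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    top
}
```

With N in the millions and top_k=4096, sorting only the selected prefix replaces an O(N log N) pass with O(N + k log k), which matches the latency win reported above.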
Why this works
The original scoring reads num_tokens (typically 128) scattered values per centroid code from a large row-major matrix. With ~65K centroids × 128 tokens × 4 bytes = 32 MB, this thrashes the L2 cache on every code lookup.

The transposed layout makes each centroid's 128 scores contiguous (512 bytes = 8 cache lines), enabling:

- sequential memory access per centroid
- SIMD max reductions (vmaxps on AVX2/AVX-512)

Test plan
- Existing tests pass (cargo test)
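For reviewers, the x86_64 software-prefetch pattern mentioned in the description typically looks like the sketch below. This is illustrative only: the prefetch distance, hint level, and function names are assumptions, not taken from the PR:

```rust
// Score centroids while prefetching the slice for a code a few iterations
// ahead, hiding L3 latency behind the current iteration's work. On
// non-x86_64 targets the prefetch statement is compiled out.
fn sum_with_prefetch(transposed: &[f32], num_tokens: usize, codes: &[u32]) -> f32 {
    const PREFETCH_AHEAD: usize = 8; // tuning assumption, not from the PR
    let mut total = 0.0f32;
    for (i, &code) in codes.iter().enumerate() {
        #[cfg(target_arch = "x86_64")]
        if let Some(&future) = codes.get(i + PREFETCH_AHEAD) {
            unsafe {
                use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
                let ptr = transposed.as_ptr().add(future as usize * num_tokens);
                // Hint the CPU to pull the future centroid's slice into cache.
                _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8);
            }
        }
        let row = &transposed[code as usize * num_tokens..][..num_tokens];
        total += row.iter().sum::<f32>();
    }
    total
}
```

Prefetching is a no-op semantically, so results are unchanged; it only helps when the code list is long enough that the prefetched line is still resident when its iteration arrives.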