perf: cache-friendly approximate scoring via transposed centroid layout #26

Open

dilpreet92 wants to merge 1 commit into lightonai:main from dilpreet92:perf/transposed-centroid-scoring

Conversation

@dilpreet92

Summary

  • Transpose the query-centroid score matrix from [num_tokens × num_centroids] to [num_centroids × num_tokens] before approximate scoring, making each centroid's scores contiguous in memory
  • Replace row-major random-access scoring (approximate_score_mmap) and sparse HashMap-based scoring (approximate_score_sparse) with a single SIMD-friendly implementation using slice-based zip iterators
  • Add software prefetching on x86_64 to hide L3 latency during scoring
  • Replace O(N log N) full sort of approximate scores with O(N) select_nth_unstable partial sort, then sort only the small top-k set
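
The partial-sort step in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code; `top_k_indices` is a hypothetical helper assuming the approximate scores sit in a flat slice:

```rust
// Illustrative sketch of O(N) top-k selection via select_nth_unstable_by:
// partition so the top_k largest scores occupy the front of the index
// array, then sort only that small prefix instead of all N entries.
fn top_k_indices(scores: &[f32], top_k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    if top_k < idx.len() {
        // O(N) partition: idx[..top_k] now holds the top_k largest scores
        // (in arbitrary order), per the descending comparator.
        idx.select_nth_unstable_by(top_k, |&a, &b| {
            scores[b].partial_cmp(&scores[a]).unwrap()
        });
        idx.truncate(top_k);
    }
    // Sort only the small retained set: O(top_k log top_k).
    idx.sort_unstable_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx
}

fn main() {
    let scores = [0.1_f32, 0.9, 0.5, 0.7];
    println!("{:?}", top_k_indices(&scores, 2)); // prints [1, 3]
}
```

With top_k = 4096 and N in the millions of candidate documents, the sort cost drops from O(N log N) to O(N + top_k log top_k).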

Benchmark

Setup: c6a.4xlarge (AMD EPYC, 16 vCPU, 32 GB RAM), 1.18B embeddings, 94 GB index, top_k=4096, 20 sequential queries

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Wall time | 148.30s | 40.40s | 3.7x |
| QPS | 0.1 | 0.5 | 5x |
| Avg latency | 7.414s | 2.019s | 3.7x |
| p50 latency | 7.111s | 2.325s | 3.1x |

Memory usage unchanged (~9 GB). Results are mathematically identical — the transposed layout computes the same MaxSim scores with better cache locality.

Why this works

The original scoring reads num_tokens (typically 128) scattered values per centroid code from a large row-major matrix. With ~65K centroids × 128 tokens × 4 bytes = 32 MB, this thrashes L2 cache on every code lookup.

The transposed layout makes each centroid's 128 scores contiguous (512 bytes = 8 cache lines), enabling:

  1. Sequential reads instead of strided random access
  2. LLVM auto-vectorization of the inner loop (vmaxps on AVX2/AVX-512)
  3. Software prefetching to pipeline L3 loads
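
The inner loop enabled by points 1 and 2 can be sketched like this (a hypothetical helper, assuming per-token running maxima for one document and one centroid's contiguous row of token scores; the software prefetching from point 3, e.g. `std::arch::x86_64::_mm_prefetch` behind a `cfg(target_arch = "x86_64")` guard, is omitted to keep the sketch portable):

```rust
// Sketch of the contiguous inner loop after transposition: one centroid's
// token scores are a single slice, so updating the running maxima is a
// straight zip over two contiguous slices. (Names are illustrative, not
// the PR's actual identifiers.)
fn accumulate_maxsim(running_max: &mut [f32], centroid_scores: &[f32]) {
    // Sequential loads, no gathers; a shape LLVM can auto-vectorize
    // into packed max instructions (vmaxps on AVX2/AVX-512).
    for (m, &s) in running_max.iter_mut().zip(centroid_scores) {
        *m = m.max(s);
    }
}

fn main() {
    let mut running = vec![0.0_f32, 0.5, 0.2];
    accumulate_maxsim(&mut running, &[0.3, 0.2, 0.9]);
    println!("{:?}", running); // prints [0.3, 0.5, 0.9]
}
```

In the original layout the same update required one strided load per token from a 32 MB matrix; here all 128 scores arrive in 8 consecutive cache lines.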

Test plan

  • Existing unit tests pass (cargo test)
  • Verified identical search results before and after on production index (1.18B embeddings)
  • Benchmarked on c6a.4xlarge

  Transpose the query-centroid score matrix from [num_tokens × num_centroids]
  to [num_centroids × num_tokens] before approximate scoring, making each
  centroid's scores contiguous in memory for sequential access.

  Replace row-major random-access scoring and sparse HashMap-based scoring
  with a single SIMD-friendly implementation using slice-based zip iterators.
  Add software prefetching on x86_64 to hide L3 latency. Replace O(N log N)
  full sort with O(N) select_nth_unstable partial sort.

  Benchmark on c6a.4xlarge with 1.18B embeddings, top_k=4096:
  - Avg latency: 7.4s → 2.0s (3.7x faster)
  - QPS: 0.1 → 0.5 (5x improvement)
  - Memory unchanged (~9 GB), results mathematically identical.
@raphaelsty
Collaborator

raphaelsty commented Mar 10, 2026

Hi @dilpreet92, this is a very cool MR, I'm benching it. It also updates some core code of the algorithm, so I'll be careful with the review. I'm not a huge fan of the transposition; I need to check the actual memory-usage impact further, and would welcome any chart or deeper analysis beyond the current MR :)

@raphaelsty raphaelsty added the enhancement New feature or request label Mar 10, 2026
@dilpreet92
Author

dilpreet92 commented Mar 11, 2026

Thanks for benching it, @raphaelsty!

Here's the memory breakdown.

Per-query transient memory during approximate scoring: the original code already allocates query_centroid_scores as a [num_tokens × num_centroids] Array2 (e.g. 128 × 65K × 4 = 32 MB). The transpose creates a second copy of the same size, so peak per-query memory goes from ~32 MB to ~64 MB temporarily during scoring. This is freed after each query completes.

For context:

  • The extra 32-64 MB is transient (lives only during approximate scoring, ~100-200ms)
  • It's <1% of the 9 GB resident working set
  • Steady-state memory is unchanged, no new long-lived allocations

If the temporary doubling is a concern, I can refactor to an in-place transpose that consumes the original matrix, keeping peak memory identical to before. Let me know if you'd prefer that approach.
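
For reference, a rectangular transpose can be done in place by cycle-following on the flat row-major buffer. This is only an illustrative sketch (a hypothetical `transpose_in_place` on a flat `&mut [f32]`, not the PR's code; the `visited` bitmap costs one byte per element here, and a real version would use a bitset):

```rust
// Illustrative cycle-following in-place transpose of a rows x cols
// row-major matrix, avoiding a second full-size score buffer.
fn transpose_in_place(data: &mut [f32], rows: usize, cols: usize) {
    let n = rows * cols;
    assert_eq!(data.len(), n);
    if n < 2 {
        return;
    }
    // Element at flat index p moves to (p * rows) % (n - 1);
    // indices 0 and n - 1 are fixed points.
    let mut visited = vec![false; n];
    for start in 1..n - 1 {
        if visited[start] {
            continue;
        }
        // Walk the permutation cycle starting at `start`, carrying the
        // displaced value forward until the cycle closes.
        let mut value = data[start];
        let mut current = start;
        loop {
            let next = (current * rows) % (n - 1);
            std::mem::swap(&mut data[next], &mut value);
            visited[next] = true;
            current = next;
            if current == start {
                break;
            }
        }
    }
}

fn main() {
    // 2x3 row-major -> 3x2 row-major.
    let mut m = vec![1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    transpose_in_place(&mut m, 2, 3);
    println!("{:?}", m); // prints [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
}
```

The trade-off is extra passes over memory and the auxiliary visited map, versus the simple out-of-place copy's temporary 2x footprint.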

I am already running this on my production system :)
