perf: cache-friendly approximate scoring via transposed centroid layout #26

Open

dilpreet92 wants to merge 1 commit into lightonai:main from dilpreet92:perf/transposed-centroid-scoring

Conversation

@dilpreet92

Summary

  • Transpose the query-centroid score matrix from [num_tokens × num_centroids] to [num_centroids × num_tokens] before approximate scoring, making each centroid's scores contiguous in memory
  • Replace row-major random-access scoring (approximate_score_mmap) and sparse HashMap-based scoring (approximate_score_sparse) with a single SIMD-friendly implementation using slice-based zip iterators
  • Add software prefetching on x86_64 to hide L3 latency during scoring
  • Replace O(N log N) full sort of approximate scores with O(N) select_nth_unstable partial sort, then sort only the small top-k set
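
The partial-sort step in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code; `top_k_indices` is a hypothetical helper assuming the approximate scores sit in a flat slice:

```rust
// Illustrative sketch of O(N) top-k selection via select_nth_unstable_by:
// partition so the top_k largest scores occupy the front of the index
// array, then sort only that small prefix instead of all N entries.
fn top_k_indices(scores: &[f32], top_k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    if top_k < idx.len() {
        // O(N) partition: idx[..top_k] now holds the top_k largest scores
        // (in arbitrary order), per the descending comparator.
        idx.select_nth_unstable_by(top_k, |&a, &b| {
            scores[b].partial_cmp(&scores[a]).unwrap()
        });
        idx.truncate(top_k);
    }
    // Sort only the small retained set: O(top_k log top_k).
    idx.sort_unstable_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx
}

fn main() {
    let scores = [0.1_f32, 0.9, 0.5, 0.7];
    println!("{:?}", top_k_indices(&scores, 2)); // prints [1, 3]
}
```

With top_k = 4096 and N in the millions of candidate documents, the sort cost drops from O(N log N) to O(N + top_k log top_k).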

Benchmark

Setup: c6a.4xlarge (AMD EPYC, 16 vCPU, 32 GB RAM), 1.18B embeddings, 94 GB index, top_k=4096, 20 sequential queries

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Wall time | 148.30s | 40.40s | 3.7x |
| QPS | 0.1 | 0.5 | 5x |
| Avg latency | 7.414s | 2.019s | 3.7x |
| p50 latency | 7.111s | 2.325s | 3.1x |

Memory usage unchanged (~9 GB). Results are mathematically identical — the transposed layout computes the same MaxSim scores with better cache locality.

Why this works

The original scoring reads num_tokens (typically 128) scattered values per centroid code from a large row-major matrix. With ~65K centroids × 128 tokens × 4 bytes = 32 MB, this thrashes L2 cache on every code lookup.

The transposed layout makes each centroid's 128 scores contiguous (512 bytes = 8 cache lines), enabling:

  1. Sequential reads instead of strided random access
  2. LLVM auto-vectorization of the inner loop (vmaxps on AVX2/AVX-512)
  3. Software prefetching to pipeline L3 loads
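
The inner loop enabled by points 1 and 2 can be sketched like this (a hypothetical helper, assuming per-token running maxima for one document and one centroid's contiguous row of token scores; the software prefetching from point 3, e.g. `std::arch::x86_64::_mm_prefetch` behind a `cfg(target_arch = "x86_64")` guard, is omitted to keep the sketch portable):

```rust
// Sketch of the contiguous inner loop after transposition: one centroid's
// token scores are a single slice, so updating the running maxima is a
// straight zip over two contiguous slices. (Names are illustrative, not
// the PR's actual identifiers.)
fn accumulate_maxsim(running_max: &mut [f32], centroid_scores: &[f32]) {
    // Sequential loads, no gathers; a shape LLVM can auto-vectorize
    // into packed max instructions (vmaxps on AVX2/AVX-512).
    for (m, &s) in running_max.iter_mut().zip(centroid_scores) {
        *m = m.max(s);
    }
}

fn main() {
    let mut running = vec![0.0_f32, 0.5, 0.2];
    accumulate_maxsim(&mut running, &[0.3, 0.2, 0.9]);
    println!("{:?}", running); // prints [0.3, 0.5, 0.9]
}
```

In the original layout the same update required one strided load per token from a 32 MB matrix; here all 128 scores arrive in 8 consecutive cache lines.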

Test plan

  • Existing unit tests pass (cargo test)
  • Verified identical search results before and after on production index (1.18B embeddings)
  • Benchmarked on c6a.4xlarge

  Transpose the query-centroid score matrix from [num_tokens × num_centroids]
  to [num_centroids × num_tokens] before approximate scoring, making each
  centroid's scores contiguous in memory for sequential access.

  Replace row-major random-access scoring and sparse HashMap-based scoring
  with a single SIMD-friendly implementation using slice-based zip iterators.
  Add software prefetching on x86_64 to hide L3 latency. Replace O(N log N)
  full sort with O(N) select_nth_unstable partial sort.

  Benchmark on c6a.4xlarge with 1.18B embeddings, top_k=4096:
  - Avg latency: 7.4s → 2.0s (3.7x faster)
  - QPS: 0.1 → 0.5 (5x improvement)
  - Memory unchanged (~9 GB), results mathematically identical.
@raphaelsty
Collaborator

raphaelsty commented Mar 10, 2026

Hi @dilpreet92, this is a very cool MR, I'm benching it. It also updates some core code of the algorithm, so I'll be careful with the review. I'm not a huge fan of the transposition; I need to check the actual memory-usage impact further, and would welcome any chart or deeper analysis beyond the current MR :)

@raphaelsty raphaelsty added the enhancement New feature or request label Mar 10, 2026
@dilpreet92
Author

dilpreet92 commented Mar 11, 2026

Thanks for benching it, @raphaelsty!

Here's the memory breakdown.

Per-query transient memory during approximate scoring: the original code already allocates query_centroid_scores as a [num_tokens × num_centroids] Array2 (e.g. 128 × 65K × 4 = 32 MB). The transpose creates a second copy of the same size, so peak per-query memory goes from ~32 MB to ~64 MB temporarily during scoring. This is freed after each query completes.

For context:

  • The extra 32-64 MB is transient (lives only during approximate scoring, ~100-200ms)
  • It's <1% of the 9 GB resident working set
  • Steady-state memory is unchanged, no new long-lived allocations

If the temporary doubling is a concern, I can refactor to an in-place transpose that consumes the original matrix, keeping peak memory identical to before. Let me know if you'd prefer that approach.
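
For reference, a rectangular transpose can be done in place by cycle-following on the flat row-major buffer. This is only an illustrative sketch (a hypothetical `transpose_in_place` on a flat `&mut [f32]`, not the PR's code; the `visited` bitmap costs one byte per element here, and a real version would use a bitset):

```rust
// Illustrative cycle-following in-place transpose of a rows x cols
// row-major matrix, avoiding a second full-size score buffer.
fn transpose_in_place(data: &mut [f32], rows: usize, cols: usize) {
    let n = rows * cols;
    assert_eq!(data.len(), n);
    if n < 2 {
        return;
    }
    // Element at flat index p moves to (p * rows) % (n - 1);
    // indices 0 and n - 1 are fixed points.
    let mut visited = vec![false; n];
    for start in 1..n - 1 {
        if visited[start] {
            continue;
        }
        // Walk the permutation cycle starting at `start`, carrying the
        // displaced value forward until the cycle closes.
        let mut value = data[start];
        let mut current = start;
        loop {
            let next = (current * rows) % (n - 1);
            std::mem::swap(&mut data[next], &mut value);
            visited[next] = true;
            current = next;
            if current == start {
                break;
            }
        }
    }
}

fn main() {
    // 2x3 row-major -> 3x2 row-major.
    let mut m = vec![1.0_f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    transpose_in_place(&mut m, 2, 3);
    println!("{:?}", m); // prints [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
}
```

The trade-off is extra passes over memory and the auxiliary visited map, versus the simple out-of-place copy's temporary 2x footprint.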

I am already running this on my production system :)
