@marma01 marma01 commented Dec 15, 2025

This patch introduces an AArch64 vectorized optimization for the 4‑bit dequantization path in EmbeddingSpMDMAutovec.cc. The previous implementation processes one byte at a time (each byte contains two nibbles). The new path uses AArch64 vector intrinsics to process 16 bytes per iteration, yielding 32 FP32 outputs.

EmbeddingSpMDMNBitBenchmark (4-bit cases) shows up to a 47% improvement on AArch64.

Test environment: AWS r8g.4xlarge (Graviton4) instance; compiler: gcc 14.3.0

Example benchmark results:
Before:

bit_rate 4 batch size 10 num rows 4000000 emb dim 256 avg length 100
32 bit indices lengths_sum 898
out type fp32 SLS cache not flushed prefetch off b/w 2.15088 GB/s effective b/w 3.12855 GB/s time 5.51E-05
out type fp32 SLS cache flushed prefetch off b/w 1.97542 GB/s effective b/w 2.87335 GB/s time 6.00E-05
out type fp32 SLW(WEIGHTED) cache not flushed prefetch off b/w 2.14381 GB/s effective b/w 3.11827 GB/s time 5.53E-05
out type fp32 SLW(WEIGHTED) cache flushed prefetch off b/w 1.9467 GB/s effective b/w 2.83157 GB/s time 6.09E-05

After:

bit_rate 4 batch size 10 num rows 4000000 emb dim 256 avg length 100
32 bit indices lengths_sum 898
out type fp32 SLS cache not flushed prefetch off b/w 3.18089 GB/s effective b/w 4.62675 GB/s time 3.73E-05
out type fp32 SLS cache flushed prefetch off b/w 2.75099 GB/s effective b/w 4.00144 GB/s time 4.31E-05
out type fp32 SLW(WEIGHTED) cache not flushed prefetch off b/w 3.15288 GB/s effective b/w 4.58601 GB/s time 3.76E-05
out type fp32 SLW(WEIGHTED) cache flushed prefetch off b/w 2.67149 GB/s effective b/w 3.8858 GB/s time 4.44E-05

@meta-cla meta-cla bot added the cla signed label Dec 15, 2025

meta-codesync bot commented Dec 23, 2025

@q10 has imported this pull request. If you are a Meta employee, you can view this in D89737989.
