Optimizing 4-bit dequant to FP32 on AArch64 using vectorized intrinsics in EmbeddingSpMDMAutovec #5224

marma01 · 2025-12-15T02:07:27Z

This patch adds an AArch64 vectorized path for 4‑bit dequantization in EmbeddingSpMDMAutovec.cc. The previous implementation processes one byte at a time, where each byte holds two 4‑bit values (nibbles). The new path uses AArch64 vector intrinsics to process 16 bytes per iteration, producing 32 FP32 outputs.
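For readers who have not opened the diff, here is a minimal sketch of the idea described above. It is not the code from this patch. The function name `dequant4_to_fp32_sketch`, the low-nibble-first output order, and passing `scale` and `bias` as plain floats are all assumptions made for illustration. On non-AArch64 targets only the scalar loop is compiled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

#if defined(__aarch64__)
// Widen 16 uint8 lanes to FP32, apply out = value * scale + bias, store 16 floats.
static inline void widen_scale_store16(uint8x16_t q, float32x4_t vscale,
                                       float32x4_t vbias, float* out) {
  const uint16x8_t w_lo = vmovl_u8(vget_low_u8(q));
  const uint16x8_t w_hi = vmovl_u8(vget_high_u8(q));
  const uint32x4_t d[4] = {
      vmovl_u16(vget_low_u16(w_lo)), vmovl_u16(vget_high_u16(w_lo)),
      vmovl_u16(vget_low_u16(w_hi)), vmovl_u16(vget_high_u16(w_hi))};
  for (int k = 0; k < 4; ++k) {
    // vfmaq_f32(a, b, c) computes a + b * c.
    vst1q_f32(out + 4 * k, vfmaq_f32(vbias, vcvtq_f32_u32(d[k]), vscale));
  }
}
#endif

// Hypothetical helper: dequantizes nbytes packed bytes into 2 * nbytes floats.
// Assumption: the low nibble of each byte is the earlier element.
void dequant4_to_fp32_sketch(const std::uint8_t* in, std::size_t nbytes,
                             float scale, float bias, float* out) {
  assert(nbytes == 0 || (in != nullptr && out != nullptr));
  std::size_t i = 0;
#if defined(__aarch64__)
  const float32x4_t vscale = vdupq_n_f32(scale);
  const float32x4_t vbias = vdupq_n_f32(bias);
  const uint8x16_t mask = vdupq_n_u8(0x0F);
  for (; i + 16 <= nbytes; i += 16) {
    const uint8x16_t v = vld1q_u8(in + i);    // 16 bytes = 32 nibbles
    const uint8x16_t lo = vandq_u8(v, mask);  // low nibbles
    const uint8x16_t hi = vshrq_n_u8(v, 4);   // high nibbles
    // Interleave so the order is lo0, hi0, lo1, hi1, ...
    widen_scale_store16(vzip1q_u8(lo, hi), vscale, vbias, out + 2 * i);
    widen_scale_store16(vzip2q_u8(lo, hi), vscale, vbias, out + 2 * i + 16);
  }
#endif
  // Scalar tail (and the whole loop on non-AArch64 targets): one byte at a time.
  for (; i < nbytes; ++i) {
    out[2 * i] = static_cast<float>(in[i] & 0x0F) * scale + bias;
    out[2 * i + 1] = static_cast<float>(in[i] >> 4) * scale + bias;
  }
}
```

The vector loop uses a fused multiply-add, so for arbitrary `scale` and `bias` its results may differ from the scalar loop in the last bit unless the compiler also contracts the scalar expression.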

EmbeddingSpMDMNBitBenchmark (4-bit cases) shows up to 47% improvement on AArch64.

Test environment: AWS r8g.4xlarge (Graviton4) instance; compiler: gcc 14.3.0

Example benchmark results:
Before:

bit_rate 4 batch size 10 num rows 4000000 emb dim 256 avg length 100
32 bit indices lengths_sum 898
out type fp32 SLS cache not flushed prefetch off b/w 2.15088 GB/s effective b/w 3.12855 GB/s time 5.51E-05
out type fp32 SLS cache flushed prefetch off b/w 1.97542 GB/s effective b/w 2.87335 GB/s time 6.00E-05
out type fp32 SLW(WEIGHTED) cache not flushed prefetch off b/w 2.14381 GB/s effective b/w 3.11827 GB/s time 5.53E-05
out type fp32 SLW(WEIGHTED) cache flushed prefetch off b/w 1.9467 GB/s effective b/w 2.83157 GB/s time 6.09E-05

After:

bit_rate 4 batch size 10 num rows 4000000 emb dim 256 avg length 100
32 bit indices lengths_sum 898
out type fp32 SLS cache not flushed prefetch off b/w 3.18089 GB/s effective b/w 4.62675 GB/s time 3.73E-05
out type fp32 SLS cache flushed prefetch off b/w 2.75099 GB/s effective b/w 4.00144 GB/s time 4.31E-05
out type fp32 SLW(WEIGHTED) cache not flushed prefetch off b/w 3.15288 GB/s effective b/w 4.58601 GB/s time 3.76E-05
out type fp32 SLW(WEIGHTED) cache flushed prefetch off b/w 2.67149 GB/s effective b/w 3.8858 GB/s time 4.44E-05
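As a quick cross-check of the numbers above, the following sketch computes the relative bandwidth gain for each of the four cases. The b/w values are copied from the Before and After listings. The names `gain_percent`, `BenchCase`, and `print_gains` are illustrative and do not come from the benchmark. By this calculation the four cases come out at roughly 37% to 48%.

```cpp
#include <cstdio>

// Percentage gain of `after` over `before` for a higher-is-better metric.
constexpr double gain_percent(double before, double after) {
  return (after / before - 1.0) * 100.0;
}

struct BenchCase {
  const char* name;
  double before_gbps;  // "b/w" column from the Before listing
  double after_gbps;   // "b/w" column from the After listing
};

constexpr BenchCase kCases[] = {
    {"SLS, cache not flushed", 2.15088, 3.18089},
    {"SLS, cache flushed", 1.97542, 2.75099},
    {"SLW(WEIGHTED), cache not flushed", 2.14381, 3.15288},
    {"SLW(WEIGHTED), cache flushed", 1.9467, 2.67149},
};

// Prints one line per case with its bandwidth gain in percent.
void print_gains() {
  for (const BenchCase& c : kCases) {
    std::printf("%-34s %.1f%%\n", c.name,
                gain_percent(c.before_gbps, c.after_gbps));
  }
}
```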

meta-codesync · 2025-12-23T19:59:45Z

@q10 has imported this pull request. If you are a Meta employee, you can view this in D89737989.

Optimizing 4-bit dequant to FP32 on AArch64 using vectorized intrinsics

0f83009

meta-cla bot added the cla signed label Dec 15, 2025
