Optimizing 4-bit dequant to FP32 on AArch64 using vectorized intrinsics in EmbeddingSpMDMAutovec #5224
This patch introduces an AArch64 vectorized optimization for the 4-bit dequantization path in EmbeddingSpMDMAutovec.cc. The previous implementation processes one byte at a time (each byte contains two nibbles). The new path uses AArch64 vector intrinsics to process 16 bytes per iteration, yielding 32 FP32 outputs.

EmbeddingSpMDMNBitBenchmark (4-bit cases) shows up to 47% improvement on AArch64.

Test environment: AWS r8g.4xlarge (Graviton4) instance; compiler: gcc 14.3.0
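For readers unfamiliar with the technique, the difference between the byte-at-a-time path and a 16-bytes-per-iteration path can be illustrated roughly as follows. This is a minimal sketch, not the code from this patch: the function names `dequant4_scalar` and `dequant4_vec` are hypothetical, the per-row `scale` and `bias` are passed as plain `float` arguments, and it assumes the low nibble of each byte holds the even-indexed element. The NEON branch compiles only on AArch64; other targets fall through to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Scalar reference: one byte (two nibbles) per iteration.
// Assumption: the low nibble is the even-indexed element.
inline void dequant4_scalar(const std::uint8_t* in, std::size_t n_bytes,
                            float scale, float bias, float* out) {
  assert(n_bytes == 0 || (in != nullptr && out != nullptr));
  for (std::size_t i = 0; i < n_bytes; ++i) {
    out[2 * i]     = scale * static_cast<float>(in[i] & 0x0F) + bias;
    out[2 * i + 1] = scale * static_cast<float>(in[i] >> 4) + bias;
  }
}

// Hypothetical vectorized variant: 16 packed bytes -> 32 floats per iteration.
inline void dequant4_vec(const std::uint8_t* in, std::size_t n_bytes,
                         float scale, float bias, float* out) {
  std::size_t i = 0;
#if defined(__aarch64__)
  const float32x4_t vscale = vdupq_n_f32(scale);
  const float32x4_t vbias = vdupq_n_f32(bias);
  const uint8x16_t mask = vdupq_n_u8(0x0F);
  for (; i + 16 <= n_bytes; i += 16) {
    const uint8x16_t packed = vld1q_u8(in + i);
    const uint8x16_t lo = vandq_u8(packed, mask);  // even-indexed elements
    const uint8x16_t hi = vshrq_n_u8(packed, 4);   // odd-indexed elements
    // Interleave back into output order: lo0, hi0, lo1, hi1, ...
    const uint8x16_t q[2] = {vzip1q_u8(lo, hi), vzip2q_u8(lo, hi)};
    float* dst = out + 2 * i;
    for (int h = 0; h < 2; ++h) {
      // Widen u8 -> u16 -> u32, convert to float, then apply scale and bias.
      const uint16x8_t w[2] = {vmovl_u8(vget_low_u8(q[h])),
                               vmovl_u8(vget_high_u8(q[h]))};
      for (int k = 0; k < 2; ++k) {
        const float32x4_t f0 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(w[k])));
        const float32x4_t f1 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(w[k])));
        vst1q_f32(dst, vfmaq_f32(vbias, f0, vscale));      // bias + f0 * scale
        vst1q_f32(dst + 4, vfmaq_f32(vbias, f1, vscale));  // bias + f1 * scale
        dst += 8;
      }
    }
  }
#endif
  // Remaining bytes (or every byte on non-AArch64 targets) use the scalar loop.
  dequant4_scalar(in + i, n_bytes - i, scale, bias, out + 2 * i);
}
```

Note that the fused multiply-add in the vector branch rounds once, whereas the scalar loop may round twice, so results can differ in the last bit for scale and bias values that are not exactly representable.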
Example benchmark results:
Before:
bit_rate 4 batch size 10 num rows 4000000 emb dim 256 avg length 100
32 bit indices lengths_sum 898
out type fp32 SLS cache not flushed prefetch off b/w 2.15088 GB/s effective b/w 3.12855 GB/s time 5.51E-05
out type fp32 SLS cache flushed prefetch off b/w 1.97542 GB/s effective b/w 2.87335 GB/s time 6.00E-05
out type fp32 SLW(WEIGHTED) cache not flushed prefetch off b/w 2.14381 GB/s effective b/w 3.11827 GB/s time 5.53E-05
out type fp32 SLW(WEIGHTED) cache flushed prefetch off b/w 1.9467 GB/s effective b/w 2.83157 GB/s time 6.09E-05
After:
bit_rate 4 batch size 10 num rows 4000000 emb dim 256 avg length 100
32 bit indices lengths_sum 898
out type fp32 SLS cache not flushed prefetch off b/w 3.18089 GB/s effective b/w 4.62675 GB/s time 3.73E-05
out type fp32 SLS cache flushed prefetch off b/w 2.75099 GB/s effective b/w 4.00144 GB/s time 4.31E-05
out type fp32 SLW(WEIGHTED) cache not flushed prefetch off b/w 3.15288 GB/s effective b/w 4.58601 GB/s time 3.76E-05
out type fp32 SLW(WEIGHTED) cache flushed prefetch off b/w 2.67149 GB/s effective b/w 3.8858 GB/s time 4.44E-05