Implement runtime detection of ARM SVE and SVE2 CPU capabilities, similar to the existing BMI2 runtime detection for x86-64.

Changes:
- Add ARM CPU feature detection in lib/common/cpu.h using platform-specific APIs (getauxval on Linux/Android, disabled on macOS/Windows)
- Add DYNAMIC_SVE and DYNAMIC_SVE2 macros in portability_macros.h
- Add SVE2_TARGET_ATTRIBUTE for selective function compilation
- Add sve2 field to compression context (ZSTD_CCtx)
- Update histogram functions to support dynamic SVE2 dispatch
- Explicitly disable SVE/SVE2 on Apple platforms (not supported)

Platform support:
- Linux/Android aarch64: Full runtime detection via getauxval()
- Apple platforms: Disabled (Apple Silicon doesn't support SVE/SVE2)
- Windows on ARM: Placeholder (API not yet available)

Benefits:
- Enables SVE2 optimizations on capable hardware without requiring build-time flags
- Zero overhead on non-SVE2 systems
- Expected 2-3x speedup in histogram counting on SVE2-capable CPUs (AWS Graviton4, Ampere AmpereOne)

Note: Currently only SVE2 optimizations exist. CPUs with SVE but not SVE2 (e.g., Fujitsu A64FX) could benefit from future SVE-only implementations.
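The Linux/Android detection path described above can be sketched roughly as follows. This is an illustrative sketch, not the code from the PR: the `example_` names are hypothetical, and the fallback `HWCAP_SVE`, `HWCAP2_SVE2` and `AT_HWCAP2` values are assumed to match the Linux arm64 uapi headers. On any other platform the function simply reports both features as absent.

```c
/* Hypothetical sketch: query SVE/SVE2 support via getauxval() on
 * Linux/Android aarch64; report "not available" everywhere else. */
#if defined(__aarch64__) && (defined(__linux__) || defined(__ANDROID__))
#  include <sys/auxv.h>              /* getauxval, AT_HWCAP */
#  ifndef AT_HWCAP2
#    define AT_HWCAP2 26             /* assumed value from the ELF auxv ABI */
#  endif
#  ifndef HWCAP_SVE
#    define HWCAP_SVE (1UL << 22)    /* assumed value from arm64 <asm/hwcap.h> */
#  endif
#  ifndef HWCAP2_SVE2
#    define HWCAP2_SVE2 (1UL << 1)   /* assumed value from arm64 <asm/hwcap.h> */
#  endif
#  define EXAMPLE_HAVE_AUXV 1
#else
#  define EXAMPLE_HAVE_AUXV 0
#endif

typedef struct {
    int sve;   /* 1 if the kernel reports SVE, else 0 */
    int sve2;  /* 1 if the kernel reports SVE2, else 0 */
} example_arm_features;

static example_arm_features example_detect_arm_features(void)
{
    example_arm_features f = { 0, 0 };
#if EXAMPLE_HAVE_AUXV
    /* SVE is advertised in AT_HWCAP, SVE2 in AT_HWCAP2. */
    unsigned long const hwcap  = getauxval(AT_HWCAP);
    unsigned long const hwcap2 = getauxval(AT_HWCAP2);
    f.sve  = (hwcap  & HWCAP_SVE)   != 0;
    f.sve2 = (hwcap2 & HWCAP2_SVE2) != 0;
#endif
    return f;
}
```

A caller would typically run this once and cache the result, since the auxiliary vector does not change during the lifetime of the process.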
lib/common/cpu.h
#elif defined(_WIN32)
/* Windows on ARM - use IsProcessorFeaturePresent() */
/* Note: As of 2024, Windows on ARM doesn't expose SVE/SVE2 through this API */
Thanks @Andarwinux for pointing out the mingw-w64 header! I've updated the code in commit 1bf764f to use IsProcessorFeaturePresent() with PF_ARM_SVE_INSTRUCTIONS_AVAILABLE (46) and PF_ARM_SVE2_INSTRUCTIONS_AVAILABLE (47) for proper Windows ARM SVE/SVE2 detection.
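For readers following along, the Windows check mentioned in this comment might look roughly like the sketch below. It is not taken from commit 1bf764f: the `example_` function names are hypothetical, and the fallback definitions of the two constants (46 and 47, as quoted above) are only there for SDK headers that do not yet define them. Outside Windows on ARM64 both functions return 0.

```c
/* Hypothetical sketch: SVE/SVE2 detection on Windows on ARM64 through
 * IsProcessorFeaturePresent(); always 0 on other platforms. */
#if defined(_WIN32) && (defined(_M_ARM64) || defined(__aarch64__))
#  include <windows.h>
#  ifndef PF_ARM_SVE_INSTRUCTIONS_AVAILABLE
#    define PF_ARM_SVE_INSTRUCTIONS_AVAILABLE 46   /* value quoted in the comment */
#  endif
#  ifndef PF_ARM_SVE2_INSTRUCTIONS_AVAILABLE
#    define PF_ARM_SVE2_INSTRUCTIONS_AVAILABLE 47  /* value quoted in the comment */
#  endif
#  define EXAMPLE_WIN_ARM64 1
#else
#  define EXAMPLE_WIN_ARM64 0
#endif

static int example_win_has_sve(void)
{
#if EXAMPLE_WIN_ARM64
    return IsProcessorFeaturePresent(PF_ARM_SVE_INSTRUCTIONS_AVAILABLE) != 0;
#else
    return 0;  /* not Windows on ARM64: report the feature as absent */
#endif
}

static int example_win_has_sve2(void)
{
#if EXAMPLE_WIN_ARM64
    return IsProcessorFeaturePresent(PF_ARM_SVE2_INSTRUCTIONS_AVAILABLE) != 0;
#else
    return 0;  /* not Windows on ARM64: report the feature as absent */
#endif
}
```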
@Cyan4973 you might be interested as you previously worked on related SVE2 PRs
@iksaif Generally seems reasonable, and we'll consider it for the next release, thanks for the PR! Do you have benchmarks for the performance increase this brings on different ARM CPUs? We'll need to see improvements in benchmarks to merge this.
I think https://github.com/facebook/zstd/pulls?q=is%3Apr+SVE2+is%3Aclosed already has the details, but would you like me to show more? (Which particular CPU?)
Motivation
Enable deployment of a single zstd binary across heterogeneous ARM fleets with varying CPU capabilities. This is particularly important for cloud deployments where applications run across multiple instance types:
Currently, to leverage SVE2 optimizations, you must compile with -march=neoverse-v2 or similar flags, which produces binaries that won't run on older processors. This forces users to either:
This PR implements runtime CPU feature detection, similar to the existing BMI2 support on x86-64, allowing a single binary compiled for Neoverse N1 baseline (-mcpu=neoverse-n1) to automatically use SVE2 optimizations when available.
Changes
This PR adds runtime ARM SVE2 detection infrastructure:
Core Infrastructure
- lib/common/cpu.h: Platform-specific detection via getauxval() on Linux/Android, IsProcessorFeaturePresent() on Windows
- lib/common/portability_macros.h: DYNAMIC_SVE2 macro to enable runtime dispatch
- lib/common/compiler.h: SVE2_TARGET_ATTRIBUTE for selective function compilation
- lib/compress/zstd_compress.c: Detect SVE2 once per compression context
Platform Support
- Linux/Android: getauxval()
- Windows: IsProcessorFeaturePresent()
Recommended Flags
Benchmarking on Graviton 4:
Overhead
Zero overhead on non-SVE2 systems:
Related
This follows the same pattern as the existing x86-64 BMI2 runtime detection, extending it to ARM architectures.
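To illustrate the dispatch pattern, here is a minimal sketch of a histogram routine that picks a variant from a flag stored in the context. Everything prefixed with `example_` is hypothetical and stands in for the PR's actual names, the `target("+sve2")` attribute spelling is an assumption about GCC/Clang on aarch64, and the "SVE2" variant is a placeholder that reuses the scalar loop rather than real SVE2 intrinsics.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: GCC/Clang on aarch64 accept target("+sve2") on a function. */
#if defined(__aarch64__) && defined(__GNUC__)
#  define EXAMPLE_SVE2_TARGET __attribute__((target("+sve2")))
#  define EXAMPLE_DYNAMIC_SVE2 1
#else
#  define EXAMPLE_SVE2_TARGET
#  define EXAMPLE_DYNAMIC_SVE2 0
#endif

/* Hypothetical context: the flag is filled in once from CPU detection. */
typedef struct { int sve2; } example_ctx;

/* Portable baseline: count occurrences of each byte value. */
static void example_hist_scalar(unsigned count[256],
                                const unsigned char* src, size_t n)
{
    size_t i;
    memset(count, 0, 256 * sizeof(count[0]));
    for (i = 0; i < n; i++) count[src[i]]++;
}

#if EXAMPLE_DYNAMIC_SVE2
/* Only this function is compiled with SVE2 enabled, so the rest of the
 * binary stays runnable on CPUs without it. Placeholder body: a real
 * implementation would use SVE2 intrinsics here. */
EXAMPLE_SVE2_TARGET
static void example_hist_sve2(unsigned count[256],
                              const unsigned char* src, size_t n)
{
    example_hist_scalar(count, src, n);
}
#endif

/* Dispatcher: one predictable branch on the cached flag. */
static void example_hist(const example_ctx* ctx, unsigned count[256],
                         const unsigned char* src, size_t n)
{
#if EXAMPLE_DYNAMIC_SVE2
    if (ctx->sve2) { example_hist_sve2(count, src, n); return; }
#else
    (void)ctx;  /* dynamic dispatch compiled out on this platform */
#endif
    example_hist_scalar(count, src, n);
}
```

Because the flag is read from the context rather than re-detected on each call, the baseline path pays only for a single branch, and when dynamic dispatch is compiled out the branch disappears entirely.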
See also:
#4440
#4429
#4418
#4414
#4413
#4411