Skip to content

enable MEM_FORCE_MEMORY_ACCESS=2 for RISC-V targets with zicclsm#4596

Open
Polaris-911 wants to merge 1 commit intofacebook:devfrom
Polaris-911:new-feature
Open

enable MEM_FORCE_MEMORY_ACCESS=2 for RISC-V targets with zicclsm#4596
Polaris-911 wants to merge 1 commit intofacebook:devfrom
Polaris-911:new-feature

Conversation

@Polaris-911
Copy link
Contributor

Summary

This PR updates lib/common/mem.h to select MEM_FORCE_MEMORY_ACCESS=2 for RISC-V builds when both __riscv and __riscv_zicclsm are defined.

Motivation

Zicclsm indicates support for misaligned loads/stores in main memory.
For these targets, using method 2 enables direct unaligned memory access in mem.h.
References
GCC Zicclsm
RVA20U64 specification

Changes

  • Updated the default MEM_FORCE_MEMORY_ACCESS selection logic:
    • MEM_FORCE_MEMORY_ACCESS=2 when defined(__riscv) && defined(__riscv_zicclsm)
    • Keep existing GCC fallback: MEM_FORCE_MEMORY_ACCESS=1 for other GCC targets
  • No changes to behavior when MEM_FORCE_MEMORY_ACCESS is explicitly set by the build system.

Validation

  • Verified the updated preprocessor branch compiles cleanly in mem.h.
  • No linter issues reported for the modified file.

Benchmark Data (Compression Speed)

Method zstd -1 Speed vs Method 0
Method 2 74.2 MB/s +74%
Method 1 59.2 MB/s +39%
Method 0 42.5 MB/s -

Data screenshot

image

@terrelln
Copy link
Contributor

terrelln commented Mar 2, 2026

This looks good to me. Ultimately, it would be better if the compiler was able to handle optimizing memcpy() to use unaligned loads & stores on RISC-V with zicclsm. Once they can do that we can turn this off.

CC @Cyan4973 in case you have an opinion, before I merge

@terrelln
Copy link
Contributor

terrelln commented Mar 2, 2026

@Polaris-911 I don't see the same improvement to decompression speed as we see in PR #4584.

Do you know what function the difference in decompression speed is coming from between these two PRs? Other than these functions, the decompressor really shouldn't be doing any unaligned accesses, so I don't know where the difference would be coming from.

@Polaris-911
Copy link
Contributor Author

@terrelln Thanks for pointing that out! I suspect the difference in decompression speed might be related to ZSTD_wildcopy rather than the MEM_read/MEM_write macros.

While the decompressor doesn't directly use MEM_read* for data manipulation, ZSTD_execSequence relies heavily on ZSTD_wildcopy (which eventually calls __builtin_memcpy via ZSTD_copy8/ZSTD_copy16) to copy both literals and matches.

@Polaris-911
Copy link
Contributor Author

@terrelln @Cyan4973 Hi , I submitted a similar PR( #4524) before, and at that time @Cyan4973 was concerned about undefined behavior (UB).

I completely understand @Cyan4973's concerns about Undefined Behavior and that the compiler should handle memcpy optimizations natively (Method 0). However, given the current state of RISC-V compilers, they still fall short in optimizing memcpy for unaligned accesses, leaving a significant performance gap on hardware that supports it (like with zicclsm).

Would you be open to accepting either this MEM_FORCE_MEMORY_ACCESS=2 approach or explicitly adding -mno-strict-align to the RISC-V build flags(#4584) as a temporary workaround? We can absolutely revert this once the upstream compilers improve their memcpy optimizations for RISC-V. This would allow users to benefit from the immediate performance gains in the meantime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants