Non temporal data transfers #1170
Some generic thoughts:
On arm64, there's no support for non-temporal loads (https://developer.arm.com/documentation/100048/0100/level-1-memory-system/memory-prefetching/non-temporal-loads). The corresponding instructions do exist (LDNP/STNP), but I failed to find the related intrinsics. There seems to be something equivalent in RISC-V (see riscv-non-isa/riscv-c-api-doc#47). I couldn't find anything for WebAssembly or Power. So that's quite a niche, but I'm fine with adding those.
Cheers, PS: SSE2 adds APIs for non-temporal stores of 32/64-bit scalars. I am not sure they fit within xsimd, though.
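For readers unfamiliar with the x86 side, here is a minimal sketch of the vector and scalar non-temporal stores discussed above. It is an illustration only, not code from this PR: the `sketch` namespace and `stream_fill` are hypothetical names, and the non-x86 branch is a plain fallback added so the snippet compiles everywhere. `_mm_stream_si128`, `_mm_stream_si32` and `_mm_sfence` are the standard SSE/SSE2 intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2 intrinsics (also pulls in _mm_sfence)
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

namespace sketch // hypothetical namespace, not part of xsimd
{
    // Fill n 32-bit integers with `value`, hinting that the destination
    // should not be kept in cache. `dst` must be 16-byte aligned.
    inline void stream_fill(std::int32_t* dst, std::int32_t value, std::size_t n)
    {
        assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
#if SKETCH_HAS_SSE2
        const __m128i v = _mm_set1_epi32(value);
        std::size_t i = 0;
        // Vector body: one 128-bit non-temporal store (MOVNTDQ) per 4 elements.
        for (; i + 4 <= n; i += 4)
        {
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
        }
        // Scalar tail: 32-bit non-temporal store (MOVNTI).
        for (; i < n; ++i)
        {
            _mm_stream_si32(reinterpret_cast<int*>(dst + i), value);
        }
        // Non-temporal stores are weakly ordered; make them visible before returning.
        _mm_sfence();
#else
        // Assumption: no streaming-store intrinsic available, use regular stores.
        for (std::size_t i = 0; i < n; ++i)
        {
            dst[i] = value;
        }
#endif
    }
}
```

The scalar tail is where the 32-bit scalar API would come in; whether xsimd should expose it at all is the open question raised above.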
8162072 to 66fd323
2. Adding xsimd::fence as a wrapper around std atomic for cache coherence
3. Adding tests
66fd323 to 6d15ab0
I finally had time to go back to this. I also started implementing things using inline asm for the missing intrinsics. What do you think? Cheers, PS: since I do not know Arm well and used QEMU, I asked Claude for a review before pushing.
6d15ab0 to 39c0790
Implement store_stream and load_stream for neon64 using inline asm with LDNP/STNP instructions, providing non-temporal cache hints on AArch64. Covers float, double, and integral types. Guarded behind __GNUC__ so MSVC ARM64 falls back to aligned load/store.

Also remove xsimd::fence (std::atomic wrapper) and its test, which were unrelated additions from a prior commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
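As a rough illustration of the approach this commit message describes, the sketch below copies floats through LDNP/STNP using GCC-style inline asm. It is not the code from this commit: `sketch::stream_copy` is a hypothetical name, the pairing of two Q registers per instruction is one possible choice, and the `memcpy` branch is an assumed fallback for compilers or targets without GNU inline asm on AArch64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__aarch64__) && defined(__GNUC__)
#include <arm_neon.h>
#define SKETCH_HAS_LDNP_STNP 1
#else
#define SKETCH_HAS_LDNP_STNP 0
#endif

namespace sketch // hypothetical namespace, not part of xsimd
{
    // Copy n floats (n must be a multiple of 8) with non-temporal hints.
    // Both pointers are expected to be 16-byte aligned.
    inline void stream_copy(float* dst, const float* src, std::size_t n)
    {
        assert(n % 8 == 0);
#if SKETCH_HAS_LDNP_STNP
        for (std::size_t i = 0; i < n; i += 8)
        {
            float32x4_t lo, hi;
            // LDNP loads a pair of 128-bit registers (32 bytes) with a
            // non-temporal hint; %q prints the operand as a Q register.
            __asm__ volatile("ldnp %q0, %q1, [%2]"
                             : "=w"(lo), "=w"(hi)
                             : "r"(src + i)
                             : "memory");
            // STNP stores the pair back with the same hint.
            __asm__ volatile("stnp %q0, %q1, [%2]"
                             :
                             : "w"(lo), "w"(hi), "r"(dst + i)
                             : "memory");
        }
#else
        // Assumption: no GNU inline asm on AArch64 here, use a regular copy.
        std::memcpy(dst, src, n * sizeof(float));
#endif
    }
}
```

The `"memory"` clobber keeps the compiler from reordering surrounding accesses across the asm statements. It does not add any hardware ordering beyond what the instructions themselves provide.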
39c0790 to 47dee36
The problem on MinGW seems unrelated (vtable size). This seems to fix it for me:
diff --git a/test/test_utils.hpp b/test/test_utils.hpp
index 0db01e4..faa322a 100644
--- a/test/test_utils.hpp
+++ b/test/test_utils.hpp
@@ -386,28 +386,47 @@ namespace detail
b.store_unaligned(dst.data() + i);
}
+ // Non-template context scope to avoid per-instantiation vtable issues with MinGW GCC.
+ // INFO() creates a ContextScope<Lambda> with a unique vtable per template instantiation.
+ // This concrete class has a single vtable definition shared across all instantiations.
+ struct StringContextScope : doctest::detail::ContextScopeBase
+ {
+ std::string msg_;
+ explicit StringContextScope(std::string msg)
+ : msg_(std::move(msg))
+ {
+ }
+ void stringify(std::ostream* os) const override { *os << msg_; }
+ };
+
+ template <class T>
+ StringContextScope make_context_info(const char* name, const T& val)
+ {
+ return StringContextScope(std::string(name) + ":" + doctest::toString(val).c_str());
+ }
+
}
-#define CHECK_BATCH_EQ(b1, b2) \
- do \
- { \
- INFO(#b1 ":", b1); \
- INFO(#b2 ":", b2); \
- CHECK_UNARY(::detail::expect_batch_near(b1, b2)); \
+#define CHECK_BATCH_EQ(b1, b2) \
+ do \
+ { \
+ auto _ctx1 = ::detail::make_context_info(#b1, b1); \
+ auto _ctx2 = ::detail::make_context_info(#b2, b2); \
+ CHECK_UNARY(::detail::expect_batch_near(b1, b2)); \
} while (0)
-#define CHECK_SCALAR_EQ(s1, s2) \
- do \
- { \
- INFO(#s1 ":", s1); \
- INFO(#s2 ":", s2); \
- CHECK_UNARY(::detail::expect_scalar_near(s1, s2)); \
+#define CHECK_SCALAR_EQ(s1, s2) \
+ do \
+ { \
+ auto _ctx1 = ::detail::make_context_info(#s1, s1); \
+ auto _ctx2 = ::detail::make_context_info(#s2, s2); \
+ CHECK_UNARY(::detail::expect_scalar_near(s1, s2)); \
} while (0)
-#define CHECK_VECTOR_EQ(v1, v2) \
- do \
- { \
- INFO(#v1 ":", v1); \
- INFO(#v2 ":", v2); \
- CHECK_UNARY(::detail::expect_vector_near(v1, v2)); \
+#define CHECK_VECTOR_EQ(v1, v2) \
+ do \
+ { \
+ auto _ctx1 = ::detail::make_context_info(#v1, v1); \
+ auto _ctx2 = ::detail::make_context_info(#v2, v2); \
+ CHECK_UNARY(::detail::expect_vector_near(v1, v2)); \
} while (0)
/***********************
Draft because I need to double-check the API levels (i.e. that I am not using AVX2 functions in AVX, and so on). I just wanted some feedback while I do the finishing touches.
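To make the API-level concern concrete: on x86, the 256-bit non-temporal load `_mm256_stream_load_si256` requires AVX2, while the 256-bit non-temporal stores (`_mm256_stream_ps`, `_mm256_stream_pd`, `_mm256_stream_si256`) only require AVX. A minimal sketch of guarding on the instruction set level follows. The function name `load8_stream` and the `sketch` namespace are hypothetical, and the compile-time macros stand in for whatever dispatch mechanism the PR actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__AVX__)
#include <immintrin.h>
#endif

namespace sketch // hypothetical namespace, not part of xsimd
{
    // Load 8 x int32 from a 32-byte aligned source into `out` (any alignment).
    inline void load8_stream(std::int32_t* out, const std::int32_t* src)
    {
        assert(reinterpret_cast<std::uintptr_t>(src) % 32 == 0);
#if defined(__AVX2__)
        // AVX2 level: the non-temporal 256-bit load (VMOVNTDQA) is available.
        const __m256i v = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(src));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), v);
#elif defined(__AVX__)
        // AVX level only: no 256-bit non-temporal load, use a regular aligned load.
        const __m256i v = _mm256_load_si256(reinterpret_cast<const __m256i*>(src));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), v);
#else
        // Assumption: neither AVX nor AVX2 enabled at compile time.
        std::memcpy(out, src, 8 * sizeof(std::int32_t));
#endif
    }
}
```

Building the same file with `-mavx` and then with `-mavx2` is a cheap way to confirm that each branch only uses intrinsics available at its level.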