Non temporal data transfers #1170

Open
DiamonDinoia wants to merge 2 commits into xtensor-stack:master from DiamonDinoia:feat/stream-api

@DiamonDinoia
Contributor

DiamonDinoia commented Sep 24, 2025

  1. Adding stream API for non temporal data transfers
  2. Adding xsimd::fence as a wrapper around std atomic for cache coherence
  3. Adding tests
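
For readers unfamiliar with non-temporal transfers, here is a minimal sketch of what a streaming store looks like at the intrinsic level on x86. It is not the PR's implementation: `copy_stream` is a hypothetical helper, it assumes 16-byte-aligned pointers and a length that is a multiple of 4, and it falls back to a plain copy when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <immintrin.h> // _mm_load_ps, _mm_stream_ps, _mm_sfence
#endif

// Hypothetical helper (not xsimd API): copy n floats using streaming stores.
// Assumption: src and dst are 16-byte aligned and n is a multiple of 4.
inline void copy_stream(const float* src, float* dst, std::size_t n)
{
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
#if defined(__SSE2__)
    for (std::size_t i = 0; i < n; i += 4)
    {
        const __m128 v = _mm_load_ps(src + i); // regular aligned load
        _mm_stream_ps(dst + i, v);             // store with a non-temporal hint
    }
    _mm_sfence(); // order the streaming stores before any later stores
#else
    // Fallback for targets without SSE2: ordinary element-wise copy.
    for (std::size_t i = 0; i < n; ++i)
    {
        dst[i] = src[i];
    }
#endif
}
```

The instruction-set level matters when choosing intrinsics: `_mm_stream_ps` is available from SSE, `_mm256_stream_ps` requires AVX, and the 256-bit streaming load `_mm256_stream_load_si256` requires AVX2.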

This is a draft because I need to double-check the API levels (i.e., that I am not using AVX2 functions in AVX, and so on). I just wanted some feedback while I add the finishing touches.

@serge-sans-paille
Contributor

Some generic thoughts:

  • I'm unsure the fence belongs in xsimd, but I like being proven wrong. Could you show us a code example that uses it?
  • load_stream or stream_load or streaming_load?
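
One situation where a fence matters, sketched under assumptions: a producer fills a buffer with streaming stores and then publishes a flag that a consumer polls. On x86 streaming stores are weakly ordered, so this sketch issues `_mm_sfence` before the release store of the flag. All names below (`g_buffer`, `g_ready`, `producer`, `consumer`) are hypothetical and not xsimd API, and the non-x86 branch simply uses plain stores.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#if defined(__SSE2__)
#include <immintrin.h> // _mm_set1_ps, _mm_stream_ps, _mm_sfence
#endif

// Hypothetical shared state for the sketch.
constexpr std::size_t g_size = 1024;
alignas(16) static float g_buffer[g_size];
static std::atomic<bool> g_ready { false };

// Fill the buffer, then publish it through the flag.
inline void producer(float value)
{
    assert(reinterpret_cast<std::uintptr_t>(g_buffer) % 16 == 0);
#if defined(__SSE2__)
    const __m128 v = _mm_set1_ps(value);
    for (std::size_t i = 0; i < g_size; i += 4)
    {
        _mm_stream_ps(g_buffer + i, v); // weakly ordered streaming store
    }
    _mm_sfence(); // make the streaming stores globally visible before the flag
#else
    for (std::size_t i = 0; i < g_size; ++i)
    {
        g_buffer[i] = value;
    }
#endif
    g_ready.store(true, std::memory_order_release);
}

// Wait for the flag, then read the buffer.
inline float consumer()
{
    while (!g_ready.load(std::memory_order_acquire))
    {
        std::this_thread::yield();
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < g_size; ++i)
    {
        sum += g_buffer[i];
    }
    return sum;
}
```

In a real program `producer` and `consumer` would run on different threads; without the fence, the consumer could in principle observe the flag before the streamed data.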

On arm64, there's no support for non-temporal loads (https://developer.arm.com/documentation/100048/0100/level-1-memory-system/memory-prefetching/non-temporal-loads). The corresponding instructions do exist (LDNP/STNP), but I failed to find the related intrinsics.

There seems to be something equivalent in RISC-V (see riscv-non-isa/riscv-c-api-doc#47).

I couldn't find anything for WebAssembly or Power. So this is quite a niche feature, but I'm fine with adding it.

@DiamonDinoia
Contributor Author

DiamonDinoia commented Sep 24, 2025

  1. I went for load_stream and store_stream so that it is consistent with [load|store]_[un]aligned... (Also load_non_temporal was too long and load_nta is not clear).
  2. I added fence for convenience. I have no strong feelings about it; we can always consider adding it later. On x86, I was recently made aware that it is not needed in a single-core application, and in parallel applications <atomic> is likely to be included anyway.
  3. About ARM and RISC-V, what about making our own intrinsics by wrapping inline assembly? Sadly, I do not know ARM well enough to promise that I can help.
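
To make the inline-assembly idea concrete, here is a rough sketch of wrapping the AArch64 STNP instruction in a function. It is an illustration under assumptions, not the code from this PR: `store_stream4` is a hypothetical name, the asm path assumes a GCC-compatible compiler targeting little-endian AArch64, and every other target falls back to `std::memcpy`.

```cpp
#include <cassert>
#include <cstring>
#if defined(__aarch64__) && defined(__GNUC__)
#include <arm_neon.h> // float32x4_t, vld1q_f32, vget_low_f32, vget_high_f32
#endif

// Hypothetical helper (not xsimd API): store four floats with a
// non-temporal hint where the target allows it.
inline void store_stream4(float* dst, const float* src)
{
    assert(dst != nullptr && src != nullptr);
#if defined(__aarch64__) && defined(__GNUC__)
    const float32x4_t v = vld1q_f32(src);
    // STNP stores a pair of registers, so the 128-bit vector is split
    // into its two 64-bit halves; %d prints the D-register name.
    __asm__ volatile("stnp %d1, %d2, [%0]"
                     :
                     : "r"(dst), "w"(vget_low_f32(v)), "w"(vget_high_f32(v))
                     : "memory");
#else
    // Fallback: ordinary copy, no cache hint.
    std::memcpy(dst, src, 4 * sizeof(float));
#endif
}
```

The `"memory"` clobber tells the compiler that the asm statement writes memory it cannot see, which keeps later reads of `dst` from being reordered before the store.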

Cheers,
Marco

PS: SSE2 adds APIs for non-temporal stores of 32/64-bit scalars. I am not sure they fit within xsimd, though.

@DiamonDinoia DiamonDinoia marked this pull request as ready for review September 24, 2025 20:59
@DiamonDinoia
Contributor Author

DiamonDinoia commented Mar 3, 2026

Hi @serge-sans-paille,

I finally had time to go back to this. I also started implementing things using inline asm for missing intrinsics. What do you think?

Cheers,
Marco

PS: Since I do not know ARM well and tested under QEMU, I asked Claude for a review before pushing.

Implement store_stream and load_stream for neon64 using inline asm
with LDNP/STNP instructions, providing non-temporal cache hints on
AArch64. Covers float, double, and integral types. Guarded behind
__GNUC__ so MSVC ARM64 falls back to aligned load/store.

Also remove xsimd::fence (std::atomic wrapper) and its test, which
were unrelated additions from a prior commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@DiamonDinoia
Contributor Author

The problem on MinGW seems unrelated (vtable size). This seems to fix it for me:

diff --git a/test/test_utils.hpp b/test/test_utils.hpp
index 0db01e4..faa322a 100644
--- a/test/test_utils.hpp
+++ b/test/test_utils.hpp
@@ -386,28 +386,47 @@ namespace detail
         b.store_unaligned(dst.data() + i);
     }
 
+    // Non-template context scope to avoid per-instantiation vtable issues with MinGW GCC.
+    // INFO() creates a ContextScope<Lambda> with a unique vtable per template instantiation.
+    // This concrete class has a single vtable definition shared across all instantiations.
+    struct StringContextScope : doctest::detail::ContextScopeBase
+    {
+        std::string msg_;
+        explicit StringContextScope(std::string msg)
+            : msg_(std::move(msg))
+        {
+        }
+        void stringify(std::ostream* os) const override { *os << msg_; }
+    };
+
+    template <class T>
+    StringContextScope make_context_info(const char* name, const T& val)
+    {
+        return StringContextScope(std::string(name) + ":" + doctest::toString(val).c_str());
+    }
+
 }
 
-#define CHECK_BATCH_EQ(b1, b2)                            \
-    do                                                    \
-    {                                                     \
-        INFO(#b1 ":", b1);                                \
-        INFO(#b2 ":", b2);                                \
-        CHECK_UNARY(::detail::expect_batch_near(b1, b2)); \
+#define CHECK_BATCH_EQ(b1, b2)                                 \
+    do                                                         \
+    {                                                          \
+        auto _ctx1 = ::detail::make_context_info(#b1, b1);    \
+        auto _ctx2 = ::detail::make_context_info(#b2, b2);    \
+        CHECK_UNARY(::detail::expect_batch_near(b1, b2));      \
     } while (0)
-#define CHECK_SCALAR_EQ(s1, s2)                            \
-    do                                                     \
-    {                                                      \
-        INFO(#s1 ":", s1);                                 \
-        INFO(#s2 ":", s2);                                 \
-        CHECK_UNARY(::detail::expect_scalar_near(s1, s2)); \
+#define CHECK_SCALAR_EQ(s1, s2)                                \
+    do                                                         \
+    {                                                          \
+        auto _ctx1 = ::detail::make_context_info(#s1, s1);    \
+        auto _ctx2 = ::detail::make_context_info(#s2, s2);    \
+        CHECK_UNARY(::detail::expect_scalar_near(s1, s2));     \
     } while (0)
-#define CHECK_VECTOR_EQ(v1, v2)                            \
-    do                                                     \
-    {                                                      \
-        INFO(#v1 ":", v1);                                 \
-        INFO(#v2 ":", v2);                                 \
-        CHECK_UNARY(::detail::expect_vector_near(v1, v2)); \
+#define CHECK_VECTOR_EQ(v1, v2)                                \
+    do                                                         \
+    {                                                          \
+        auto _ctx1 = ::detail::make_context_info(#v1, v1);    \
+        auto _ctx2 = ::detail::make_context_info(#v2, v2);    \
+        CHECK_UNARY(::detail::expect_vector_near(v1, v2));     \
     } while (0)
 
 /***********************
