Add shift by multiple constant by AntoinePrv · Pull Request #1220 · xtensor-stack/xsimd

AntoinePrv · 2025-11-21T16:38:16Z

@serge-sans-paille @JohanMabille this idea works, but I cannot figure out how to refactor bitwise_lshift_as_twice_larger into a separate header.

The issue:

- xsimd_sse2.hpp needs utils/shifts.hpp for bitwise_lshift_as_twice_larger,
- but bitwise_lshift_as_twice_larger needs definitions from xsimd_sse2.hpp,
- and likewise for avx2.

One option is to forward declare all the needed function overloads from sse2 and avx2, and the arch types, in utils/shifts.hpp.
Perhaps it should use the functions from xsimd_api instead? (That would also be better if there is a better implementation further down in the inheritance tree.)

Close #1218

AntoinePrv

Also, we need to add the proper forwarding to the dynamic (runtime) shift for all other architectures, to avoid ending up with the common implementation (unless that's all there is).

AntoinePrv · 2025-11-21T16:47:27Z

include/xsimd/arch/utils/shifts.hpp

I named this directory utils because common seems to be used mostly for implementing the common architecture.

AntoinePrv · 2025-11-21T16:48:11Z

include/xsimd/arch/utils/shifts.hpp

+
+            template <class T, class T2, class A, class R, T... Vs>
+            XSIMD_INLINE batch<T, A> bitwise_lshift_as_twice_larger(
+                batch<T, A> const& self, batch_constant<T, A, Vs...>, R req) noexcept


Probably should remove R req here

serge-sans-paille · 2025-11-21T20:15:20Z

@AntoinePrv unless there's a huge need on your side, I won't schedule that one for the release. Let's just add masked load / store support for other architectures so that we can release, then begin a new commit wave :-)

AntoinePrv · 2025-11-24T08:43:45Z

@serge-sans-paille I had hoped I could do it quickly before the weekend but it seems it will take longer.
Let's release indeed, I'll add the fix in Arrow with a TODO for cleaning it up in a future release.

AntoinePrv · 2026-02-16T17:58:34Z

include/xsimd/types/xsimd_api.hpp

+    namespace detail
+    {
+        // Detection for kernel overloads accepting ``batch_constant`` in ``bitwise_lshift``
+        // directly (or in a parent register function).
+        // The ``batch_constant`` overload is a rare but useful optimization.
+        // Running the detection here is less error prone than to add a fallback to all
+        // architectures.
+
+        template <class A, class B, class C, class = void>
+        struct has_bitwise_lshift_batch_const : std::false_type
+        {
+        };
+
+        template <class A, class B, class C>
+        struct has_bitwise_lshift_batch_const<A, B, C,
+                                              void_t<decltype(kernel::bitwise_lshift<A>(std::declval<B>(), std::declval<C>(), A {}))>>
+            : std::true_type
+        {
+        };


We run the detection of an optimization in the xsimd_api to avoid adding many error-prone overloads.
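
For readers unfamiliar with the pattern, here is a minimal standalone sketch of the same detection idea, outside of xsimd. Everything in it is a hypothetical stand-in (arch_a, arch_b, shift_constant, kernel::lshift, has_constant_lshift are not xsimd names): one architecture provides an overload taking a compile-time constant, the other does not, and the front-end function picks the specialised overload only when it is detected.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for architectures and a compile-time shift amount.
struct arch_a {};
struct arch_b {};
template <int V> struct shift_constant {};

// C++11-compatible void_t.
template <class... Ts> struct make_void { using type = void; };
template <class... Ts> using void_t = typename make_void<Ts...>::type;

namespace kernel
{
    // Runtime overloads: every architecture has one.
    inline int lshift(int x, int n, arch_a) { return x << n; }
    inline int lshift(int x, int n, arch_b) { return x << n; }
    // Compile-time overload: only arch_b provides it.
    template <int V>
    int lshift(int x, shift_constant<V>, arch_b) { return x * (1 << V); }
}

// Detection: true only if kernel::lshift(int, C, A) is a valid call.
template <class A, class C, class = void>
struct has_constant_lshift : std::false_type {};

template <class A, class C>
struct has_constant_lshift<A, C,
                           void_t<decltype(kernel::lshift(std::declval<int>(), std::declval<C>(), A {}))>>
    : std::true_type {};

namespace detail
{
    // Specialised overload detected: forward the constant as is.
    template <class A, int V>
    int lshift(int x, shift_constant<V> c, A, std::true_type) { return kernel::lshift(x, c, A {}); }
    // Nothing detected: fall back to the runtime overload.
    template <class A, int V>
    int lshift(int x, shift_constant<V>, A, std::false_type) { return kernel::lshift(x, V, A {}); }
}

// Single entry point, so no architecture needs a hand-written fallback.
template <class A, int V>
int lshift(int x, shift_constant<V> c)
{
    return detail::lshift(x, c, A {}, has_constant_lshift<A, shift_constant<V>> {});
}

static_assert(!has_constant_lshift<arch_a, shift_constant<2>>::value, "arch_a: runtime overload only");
static_assert(has_constant_lshift<arch_b, shift_constant<2>>::value, "arch_b: compile-time overload found");

inline void example_detection()
{
    assert(lshift<arch_a>(3, shift_constant<2> {}) == 12); // runtime fallback
    assert(lshift<arch_b>(3, shift_constant<2> {}) == 12); // specialised overload
}
```

The point of the sketch is only the shape of the dispatch: the fallback lives in one place, and an architecture opts in simply by defining the extra overload.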

AntoinePrv · 2026-02-16T18:03:46Z

I've made a fresh take, and added the fallback to runtime API in xsimd_api.hpp since this is an edge-case API.

This seems much improved over the runtime version. E.g. with GCC 15, SSE2, and uint8_t:

Compile time (new)

        .cfi_startproc
        movdqa  (%rdi), %xmm1
        movdqa  .LC0(%rip), %xmm0
        pcmpeqd %xmm2, %xmm2
        movdqa  %xmm2, %xmm3
        psllw   $8, %xmm2
        pmullw  %xmm1, %xmm0
        psrlw   $8, %xmm3
        pand    %xmm2, %xmm1
        pmullw  .LC3(%rip), %xmm1
        pand    %xmm3, %xmm0
        por     %xmm1, %xmm0
        ret
        .cfi_endproc
.LFE6862:
        .size   _Z1fRKN5xsimd5batchIhNS_4sse2EEE, .-_Z1fRKN5xsimd5batchIhNS_4sse2EEE
        .section        .rodata.cst16,"aM",@progbits,16
        .align 16
.LC0:
        .value  2
        .value  8
        .value  16
        .value  4
        .value  2
        .value  2
        .value  2
        .value  2
        .align 16
.LC3:
        .value  4
        .value  16
        .value  8
        .value  2
        .value  2
        .value  2
        .value  2
        .value  2

Run time (before)

        .cfi_startproc
        pushq   %r15
        .cfi_def_cfa_offset 16
        .cfi_offset 15, -16
        pushq   %r14
        .cfi_def_cfa_offset 24
        .cfi_offset 14, -24
        pushq   %r13
        .cfi_def_cfa_offset 32
        .cfi_offset 13, -32
        pushq   %r12
        .cfi_def_cfa_offset 40
        .cfi_offset 12, -40
        pushq   %rbp
        .cfi_def_cfa_offset 48
        .cfi_offset 6, -48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        .cfi_offset 3, -56
        movzbl  15(%rdi), %eax
        movzbl  13(%rdi), %edx
        movzbl  12(%rdi), %ecx
        movzbl  8(%rdi), %r15d
        movzbl  7(%rdi), %r12d
        movzbl  11(%rdi), %esi
        movzbl  10(%rdi), %r8d
        movb    %al, -12(%rsp)
        movzbl  9(%rdi), %r13d
        movzbl  14(%rdi), %eax
        movb    %dl, -10(%rsp)
        xorl    %edx, %edx
        movzbl  6(%rdi), %r9d
        movzbl  3(%rdi), %ebx
        movb    %cl, -9(%rsp)
        xorl    %ecx, %ecx
        movzbl  5(%rdi), %r10d
        movzbl  4(%rdi), %r11d
        addl    %r13d, %r13d
        addl    %r8d, %r8d
        movzbl  2(%rdi), %ebp
        movzbl  1(%rdi), %r14d
        movb    %al, -11(%rsp)
        sall    $4, %ebx
        movzbl  (%rdi), %edi
        movzbl  %bl, %ebx
        sall    $4, %r11d
        movzbl  %r8b, %r8d
        sall    $2, %r14d
        sall    $3, %ebp
        movzbl  %r11b, %r11d
        salq    $24, %rbx
        movb    %dil, %dl
        movzbl  %bpl, %ebp
        salq    $32, %r11
        salq    $16, %rbp
        addb    %dl, %dl
        sall    $3, %r10d
        movq    %rdx, -40(%rsp)
        movq    %r14, %rdx
        movzbl  %r10b, %r10d
        sall    $2, %r9d
        movq    %rcx, -32(%rsp)
        salq    $40, %r10
        movzbl  %r9b, %r9d
        salq    $16, %r8
        movq    -40(%rsp), %rcx
        salq    $48, %r9
        movb    %dl, %ch
        movq    -32(%rsp), %rdx
        movq    %rcx, -40(%rsp)
        movq    -40(%rsp), %rdi
        movq    %rdx, -32(%rsp)
        movq    -32(%rsp), %rdx
        andq    $-16711681, %rdi
        orq     %rbp, %rdi
        movq    %rdx, -32(%rsp)
        movq    -32(%rsp), %rdx
        movabsq $-4278190081, %rbp
        movq    %rdi, -40(%rsp)
        movq    -40(%rsp), %rdi
        movq    %rdx, -32(%rsp)
        movq    -32(%rsp), %rdx
        andq    %rbp, %rdi
        orq     %rbx, %rdi
        movq    %rdx, -32(%rsp)
        movq    -32(%rsp), %rdx
        movabsq $-1095216660481, %rbx
        movq    %rdi, -40(%rsp)
        movq    -40(%rsp), %rdi
        movq    %rdx, -32(%rsp)
        movq    -32(%rsp), %rdx
        andq    %rbx, %rdi
        orq     %r11, %rdi
        movq    %rdx, -32(%rsp)
        movq    -32(%rsp), %rdx
        movabsq $-280375465082881, %r11
        movq    %rdi, -40(%rsp)
        movq    -40(%rsp), %rdi
        movq    %rdx, -32(%rsp)
        movq    -32(%rsp), %rcx
        andq    %r11, %rdi
        orq     %r10, %rdi
        movb    %r15b, %cl
        movabsq $-71776119061217281, %r10
        movq    %rdi, -40(%rsp)
        movq    -40(%rsp), %rdi
        addb    %cl, %cl
        andq    %r10, %rdi
        orq     %r9, %rdi
        leal    (%r12,%r12), %r9d
        movq    %rdi, -40(%rsp)
        movq    -40(%rsp), %r12
        salq    $56, %r9
        movabsq $72057594037927935, %rdi
        andq    %rdi, %r12
        orq     %r9, %r12
        movq    %r12, -40(%rsp)
        movq    -40(%rsp), %rdx
        movq    %rdx, -40(%rsp)
        movq    %r13, %rdx
        movq    %rcx, -32(%rsp)
        movq    -32(%rsp), %rcx
        movzbl  -11(%rsp), %eax
        movb    %dl, %ch
        movzbl  -10(%rsp), %edx
        movq    %rcx, -32(%rsp)
        movq    -32(%rsp), %r9
        movzbl  -9(%rsp), %ecx
        andq    $-16711681, %r9
        orq     %r9, %r8
        addl    %esi, %esi
        addl    %ecx, %ecx
        addl    %edx, %edx
        movq    %r8, -32(%rsp)
        movq    -32(%rsp), %r8
        movzbl  %sil, %esi
        addl    %eax, %eax
        salq    $24, %rsi
        movzbl  %cl, %ecx
        movzbl  %dl, %edx
        movzbl  %al, %eax
        salq    $32, %rcx
        andq    %rbp, %r8
        salq    $40, %rdx
        orq     %r8, %rsi
        salq    $48, %rax
        movq    %rsi, -32(%rsp)
        movq    -32(%rsp), %rsi
        andq    %rbx, %rsi
        orq     %rsi, %rcx
        movq    %rcx, -32(%rsp)
        movq    -32(%rsp), %rcx
        andq    %r11, %rcx
        orq     %rcx, %rdx
        movq    %rdx, -32(%rsp)
        movq    -32(%rsp), %rdx
        andq    %r10, %rdx
        orq     %rdx, %rax
        movq    %rax, -32(%rsp)
        movzbl  -12(%rsp), %eax
        leal    (%rax,%rax), %edx
        movq    -32(%rsp), %rax
        salq    $56, %rdx
        andq    %rdi, %rax
        orq     %rdx, %rax
        movq    %rax, -32(%rsp)
        movdqa  -40(%rsp), %xmm0
        popq    %rbx
        .cfi_def_cfa_offset 48
        popq    %rbp
        .cfi_def_cfa_offset 40
        popq    %r12
        .cfi_def_cfa_offset 32
        popq    %r13
        .cfi_def_cfa_offset 24
        popq    %r14
        .cfi_def_cfa_offset 16
        popq    %r15
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc

AntoinePrv · 2026-02-24T09:48:59Z

@serge-sans-paille this one is ready too

serge-sans-paille · 2026-02-27T16:10:41Z

I'm a bit puzzled by the gain here: you're actually not adding anything except the possibility to pass a batch of constants, which is then dispatched to either the original code that takes a constant, or the original code that takes a batch, right?

AntoinePrv · 2026-02-28T09:58:39Z

@serge-sans-paille There are three things in this PR:

1. A new API for lshift taking a batch_constant as input.
2. Two cases (SSE2 with 2-byte data and SSE4.1 with 4-byte data) with a new implementation that uses multiply as a fallback for the missing left shift.
3. A utility function that can implement left shift using left shift on data twice as large.

They all play out nicely together, for instance with SSE2 and uint8_t data in #1220 (comment):

Call bitwise_lshift<uint8_t>
- Use bitwise_lshift_as_twice_larger<uint16_t>
  - There, use the batch_constant to do some computation at compile time
    - Call into bitwise_lshift<uint16_t>
      - There, use the batch_constant to generate the "multiply" quantities at compile time
    - Call into bitwise_lshift<uint16_t> (same)
      - ...

Result after the compiler does its job: with 2 multiplies, 2 ANDs, 1 OR (and a few other things) you get a lshift for uint8_t.

bitwise_lshift_as_twice_larger actually does not care what the underlying shift implementation is.
In the AVX2 case of this PR, there is _mm256_sllv_epi32 available for 4-byte data (bitwise_lshift<uint32_t>). We use it as a target to generate:

bitwise_lshift<uint16_t>, using 2 _mm256_sllv_epi32, 2 AND, and 1 OR
bitwise_lshift<uint8_t>, recursively using 4 _mm256_sllv_epi32, 6 AND, and 3 OR
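
To make the "twice as large" step concrete, here is a scalar sketch of one level of it: two bytes packed in a 16-bit lane are shifted by independent amounts using only 16-bit shifts. It is an illustration of the idea under my own naming (shl16, lshift_bytes_via_u16 and lshift_bytes_reference are hypothetical helpers), not the code of bitwise_lshift_as_twice_larger.

```cpp
#include <cassert>
#include <cstdint>

// Emulates a left shift on one 16-bit lane: bits shifted past bit 15 are dropped.
constexpr std::uint16_t shl16(std::uint16_t v, unsigned s)
{
    return static_cast<std::uint16_t>(static_cast<std::uint32_t>(v) << s);
}

// Shift the low and high byte of a 16-bit lane by independent amounts (0..7)
// using only 16-bit shifts: 2 shifts, 2 ANDs and 1 OR.
constexpr std::uint16_t lshift_bytes_via_u16(std::uint16_t w, unsigned s_lo, unsigned s_hi)
{
    return static_cast<std::uint16_t>(
        // Low byte: shift the whole lane, then drop what spilled into the high byte.
        (shl16(w, s_lo) & 0x00FFu)
        // High byte: clear the low byte first so nothing leaks upward, then shift.
        | shl16(static_cast<std::uint16_t>(w & 0xFF00u), s_hi));
}

// Reference: shift each byte on its own and repack.
constexpr std::uint16_t lshift_bytes_reference(std::uint16_t w, unsigned s_lo, unsigned s_hi)
{
    return static_cast<std::uint16_t>(
        static_cast<std::uint8_t>((w & 0xFFu) << s_lo)
        | (static_cast<std::uint8_t>((w >> 8) << s_hi) << 8));
}

static_assert(lshift_bytes_via_u16(0xA5C3, 3, 1) == 0x4A18, "0xC3 << 3 -> 0x18, 0xA5 << 1 -> 0x4A");

// Exhaustive comparison against the per-byte reference.
inline void check_against_reference()
{
    for (unsigned w = 0; w <= 0xFFFFu; ++w)
        for (unsigned s_lo = 0; s_lo < 8; ++s_lo)
            for (unsigned s_hi = 0; s_hi < 8; ++s_hi)
                assert(lshift_bytes_via_u16(static_cast<std::uint16_t>(w), s_lo, s_hi)
                       == lshift_bytes_reference(static_cast<std::uint16_t>(w), s_lo, s_hi));
}
```

Applying the same step twice (bytes inside 16-bit lanes, themselves inside 32-bit lanes) costs 2×2 shifts, 2×2+2 ANDs and 2×1+1 ORs, which is where the 4 / 6 / 3 count above comes from.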

From the code perspective of xsimd, these look like very niche cases. But in practice, this unlocks bitwise_lshift for many sizes in the SSE family. A common operation on a common architecture!

Another nice cheat happening here is that to generate the "multiply" quantities as a fallback for lshift, we use... lshift, but at compile time 😅
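
A scalar sketch of that trick, assuming 16-bit lanes and shift amounts below 16: the per-lane factors 1 << s are produced by a shift at compile time, and the runtime work is a plain multiply. shift_constants and lshift_by_multiply are hypothetical names used for illustration only; shift_constants stands in for a batch_constant.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a batch_constant of per-lane shift amounts (each < 16).
template <unsigned... Shifts>
struct shift_constants {};

// x[i] << Shifts[i] computed as x[i] * (1 << Shifts[i]) on 16-bit lanes.
template <unsigned... Shifts>
std::array<std::uint16_t, sizeof...(Shifts)>
lshift_by_multiply(std::array<std::uint16_t, sizeof...(Shifts)> const& x, shift_constants<Shifts...>)
{
    // The multiply factors are built with a shift, but at compile time.
    static constexpr std::uint16_t factors[] = { static_cast<std::uint16_t>(1u << Shifts)... };

    std::array<std::uint16_t, sizeof...(Shifts)> out {};
    for (std::size_t i = 0; i < sizeof...(Shifts); ++i)
    {
        // Keep only the low 16 bits of the product, like a lane-wise low multiply.
        out[i] = static_cast<std::uint16_t>(static_cast<std::uint32_t>(x[i]) * factors[i]);
    }
    return out;
}

inline void example_multiply()
{
    std::array<std::uint16_t, 4> x = { { 1, 3, 0x8001, 0xFFFF } };
    auto r = lshift_by_multiply(x, shift_constants<0, 4, 1, 15> {});
    assert(r[0] == 1);      // 1 << 0
    assert(r[1] == 48);     // 3 << 4
    assert(r[2] == 0x0002); // top bit of 0x8001 is shifted out
    assert(r[3] == 0x8000); // only the lowest bit of 0xFFFF survives
}
```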

serge-sans-paille · 2026-03-03T16:52:25Z

I'm all in for the second and third effect, but still puzzled by the first case :-)

serge-sans-paille · 2026-03-03T17:00:49Z

include/xsimd/types/xsimd_api.hpp

    {
        detail::static_check_supported_config<T, A>();
-        return kernel::bitwise_lshift<shift, A>(x, A {});
+        using has_batch_const_impl = detail::has_bitwise_lshift_batch_const<A, decltype(x), decltype(shift)>;


It seems to me that the dispatch here should be: "if all Values... are the same, dispatch to the overload that takes a single parameter, otherwise dispatch to the generic overload". Wouldn't that simplify the whole implementation?
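
A rough sketch of what such a dispatch could look like, to make the suggestion concrete. The names (all_equal, shift_pack, path, choose, dispatch) are hypothetical and the two branches only report which path would be taken; this is not the PR's code.

```cpp
#include <cassert>
#include <type_traits>

// True when every value in the pack is equal (also true for packs of size 0 or 1).
template <class T, T... Vs>
struct all_equal : std::true_type {};

template <class T, T V0, T V1, T... Vs>
struct all_equal<T, V0, V1, Vs...>
    : std::integral_constant<bool, (V0 == V1) && all_equal<T, V1, Vs...>::value> {};

// Hypothetical stand-in for a batch_constant of shift amounts.
template <unsigned... Vs>
struct shift_pack {};

enum class path { single_shift, per_lane };

// All lanes share one amount: the single-parameter overload would be used.
template <unsigned... Vs>
constexpr path choose(shift_pack<Vs...>, std::true_type) { return path::single_shift; }

// Amounts differ: the generic per-lane overload would be used.
template <unsigned... Vs>
constexpr path choose(shift_pack<Vs...>, std::false_type) { return path::per_lane; }

template <unsigned... Vs>
constexpr path dispatch(shift_pack<Vs...> c)
{
    return choose(c, all_equal<unsigned, Vs...> {});
}

static_assert(all_equal<unsigned, 3, 3, 3, 3>::value, "uniform amounts");
static_assert(!all_equal<unsigned, 1, 2, 3, 4>::value, "mixed amounts");

inline void example_dispatch()
{
    assert(dispatch(shift_pack<3, 3, 3, 3> {}) == path::single_shift);
    assert(dispatch(shift_pack<1, 2, 3, 4> {}) == path::per_lane);
}
```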

AntoinePrv force-pushed the shift-var branch 2 times, most recently from de97676 to 1cd12f9 Compare November 26, 2025 16:06

AntoinePrv force-pushed the shift-var branch from c85add7 to fb76fd7 Compare November 28, 2025 16:05

AntoinePrv force-pushed the shift-var branch from caaab2a to adead38 Compare February 16, 2026 17:32

AntoinePrv marked this pull request as ready for review February 16, 2026 17:59

AntoinePrv force-pushed the shift-var branch 6 times, most recently from bdfd07f to ceb6683 Compare February 17, 2026 15:57

AntoinePrv added 6 commits February 18, 2026 17:05

Add bitwise-shift batch constant api

85fcf7b

Add x86 optimizations

991b087

Fix merge

116829c

Add single shift optimization

923059e

Strenghen tests

8cc985d

Enable SSE2 fallback for signed integers

cde846f

AntoinePrv force-pushed the shift-var branch from 2dcfeee to cde846f Compare February 18, 2026 16:05

AntoinePrv wants to merge 6 commits into xtensor-stack:master from AntoinePrv:shift-var
