
Conversation

@AntoinePrv (Contributor) commented Oct 21, 2025

Rationale for this change

  • Simplify the use of unpack
  • Reduce code spread for unpacking integers

What changes are included in this PR?

  • epilog: unpack now extracts exactly the required number of values -> change the return type to void.
  • prolog: unpack can now handle non-byte-aligned data -> include bit_offset in the input parameters (sketched below).
  • Include prolog/epilog cases in the unpack tests.
  • Simplify a roundtrip test from packed -> unpacked -> packed to unpacked -> packed -> unpacked.

Decoder benchmarks should remain the same (tested on Linux x86-64).
I have not benchmarked the unpack functions themselves, but I don't believe that is relevant since they now do more work.
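
For context, the prolog/epilog boil down to a scalar path that can start at an arbitrary bit offset. A minimal sketch of that idea (not the PR's actual kernel; the name is illustrative, and it assumes a little-endian machine, 1 <= bit_width <= 32, bit_offset < 8, and 8 readable bytes past the last packed value):

#include <cstdint>
#include <cstring>

void UnpackExactSketch(const uint8_t* in, uint32_t* out, int count,
                       int bit_width, int bit_offset) {
  const uint64_t mask = (uint64_t{1} << bit_width) - 1;
  for (int i = 0; i < count; ++i) {
    const int bit_start = bit_offset + i * bit_width;
    uint64_t word;
    // Load the 8 bytes containing the value, then shift and mask it out.
    std::memcpy(&word, in + bit_start / 8, sizeof(word));
    out[i] = static_cast<uint32_t>((word >> (bit_start % 8)) & mask);
  }
}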

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions (bot)

⚠️ GitHub issue #47895 has been automatically assigned in GitHub to PR creator.

@AntoinePrv AntoinePrv changed the title GH-47895: [C++][Parquet] Add prolog and eiplog in unpack GH-47895: [C++][Parquet] Add prolog and epilog in unpack Oct 21, 2025
@AntoinePrv (Contributor, Author)

@pitrou this is ready for review (waiting for CI to finish here).

With this we could also investigate removing the BitReader from the BitPackedRunDecoder, reducing the general complexity seen by the compiler (number of member variables, pointer and offset bookkeeping, ...).

@pitrou (Member) commented Oct 21, 2025

There's a sanitizer failure that needs fixing here:
https://github.com/apache/arrow/actions/runs/18688262707/job/53286800716?pr=47896#step:7:8798

(I suppose it happens when length == 0...)
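
If so, a minimal guard along these lines (a sketch; the parameter name is assumed from context) would avoid any pointer arithmetic on an empty input:

if (length == 0) {
  return;  // nothing to unpack, do not touch in/out at all
}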

@pitrou (Member) commented Oct 21, 2025

With this we could also investigate removing the BitReader from the BitPackedRunDecoder, reducing the general complexity seen by the compiler (number of member variables, pointer and offset bookkeeping, ...).

Perhaps that can even be done in this PR? It doesn't sound very complicated...

@pitrou (Member) commented Oct 21, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit 600696c. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit 600696c.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

@pitrou (Member) commented Oct 22, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit 287f136. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 287f136.

There were 7 benchmark results indicating a performance regression.

The full Conbench report has more details.

@pitrou (Member) left a comment

Some comments on the implementation. I haven't looked at the bpacking tests.

const int spread = byte_end - byte_start + 1;
max = spread > max ? spread : max;
start += width;
} while (start % 8 != bit_offset);
@pitrou (Member):

Note that this will be an infinite loop if bit_offset >= 8 (hence the DCHECK suggestion below)
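
For instance, at the top of the function (using the same ARROW_DCHECK macro family that appears elsewhere in this PR):

ARROW_DCHECK_LT(bit_offset, 8);  // an offset of 8 or more never re-aligns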

@AntoinePrv (Contributor, Author):

Indeed, though that function is never used at runtime, only at compile time.

@github-actions bot added the "Awaiting committer review" label and removed the "Awaiting review" label Oct 22, 2025
AntoinePrv and others added 2 commits October 23, 2025 09:38
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@AntoinePrv (Contributor, Author)

@pitrou removing the MaxSpread constexpr logic did not perform well: up to -20% on decoding benchmarks.

For reference, that MaxSpread metric is central to the shuffle SIMD algorithm I'm working on (see the sketch after this list):

  • If small: we can "pack" multiple values in the shuffle and reuse it with multiple rshifts
  • If very large: we have to do something radically different for packed values that spread over >8 bytes
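
A self-contained reconstruction of that metric, following the loop quoted in the review above (the exact signature in the PR may differ):

// Maximum number of bytes a single `width`-bit packed value can straddle
// over one alignment period. Requires bit_offset < 8 (see the infinite-loop
// note above) and width > 0.
constexpr int MaxSpread(int width, int bit_offset = 0) {
  int max = 0;
  int start = bit_offset;
  do {
    const int byte_start = start / 8;
    const int byte_end = (start + width - 1) / 8;
    const int spread = byte_end - byte_start + 1;
    max = spread > max ? spread : max;
    start += width;
  } while (start % 8 != bit_offset);
  return max;
}

static_assert(MaxSpread(3) == 2);   // 3-bit values touch at most 2 bytes
static_assert(MaxSpread(10) == 2);  // start offsets stay even, so 2 bytes max
static_assert(MaxSpread(57) == 8);  // wide values can straddle a full 8 bytes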

@pitrou (Member) commented Oct 23, 2025

Ah, sorry! Let's just restore it then :)

@AntoinePrv (Contributor, Author)

I never pushed it; this was from a local benchmark.

// Easy case to handle, simply setting memory to zero.
return unpack_null(in, out, batch_size);
} else {
// In case of misalignment, we need to run the prolog until aligned.
@pitrou (Member):

As a TODO, if batch_size is large enough, we can perhaps rewind to the last byte-aligned packed value and SIMD-unpack kValuesUnpacked into a local buffer, instead of going through unpack_exact.

(this seems lower-priority than SIMD shuffling, though)
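
A rough sketch of that TODO (names like UnpackAlignedFull and kValuesUnpacked are stand-ins for the real SIMD kernel and its block size, and the run is assumed to have started byte-aligned so the rewind below terminates):

#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int kValuesUnpacked = 32;  // assumed SIMD block size
// Assumed byte-aligned SIMD kernel producing kValuesUnpacked values at once.
void UnpackAlignedFull(const uint8_t* in, uint32_t* out, int bit_width);

void UnpackPrologViaSimd(const uint8_t* in, uint32_t* out, int num_values,
                         int bit_width, int bit_offset) {
  // Smallest rewind (in values) landing on a byte-aligned value start, i.e.
  // the smallest r with (r * bit_width) % 8 == bit_offset.
  int rewind = 0;
  while ((rewind * bit_width) % 8 != bit_offset) ++rewind;
  assert(rewind + num_values <= kValuesUnpacked);

  // (rewind * bit_width - bit_offset) is a whole number of bytes by construction.
  const uint8_t* aligned_in = in - (rewind * bit_width - bit_offset) / 8;
  uint32_t scratch[kValuesUnpacked];
  UnpackAlignedFull(aligned_in, scratch, bit_width);
  std::memcpy(out, scratch + rewind, num_values * sizeof(uint32_t));
}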

ARROW_DCHECK_GE(batch_size, 0);
ARROW_COMPILER_ASSUME(batch_size < kValuesUnpacked);
ARROW_COMPILER_ASSUME(batch_size >= 0);
unpack_exact<kPackedBitWidth, false>(in, out, batch_size, /* bit_offset= */ 0);
@pitrou (Member):

Similarly, if there's enough padding at the end of the input, we could SIMD-unpack a full kValuesUnpacked into a local buffer.

switch (num_bits) {
  case 0:
-   return unpack_null(in, out, batch_size);
+   return unpack_width<0, Unpacker>(in, out, batch_size, bit_offset);
@pitrou (Member):

Ok, macros are not pretty, but we could have a macro here to minimize diffs when changing these function signatures :-)

Such as:

#define CASE_UNPACK_WIDTH(_width) \
        return unpack_width<_width, Unpacker>(in, out, batch_size, bit_offset)

  if constexpr (std::is_same_v<UnpackedUint, bool>) {
    switch (num_bits) {
      case 0:
        CASE_UNPACK_WIDTH(0);
  // etc.

#undef CASE_UNPACK_WIDTH

As you prefer, though.

if constexpr (std::is_same_v<Uint, bool>) {
random_is_valid(num_values, 0.5, &out, kSeed);
} else {
const uint64_t max = (uint64_t{1} << (static_cast<uint64_t>(bit_width) - 1)) - 1;
@pitrou (Member):

Shouldn't this be

Suggested change:
- const uint64_t max = (uint64_t{1} << (static_cast<uint64_t>(bit_width) - 1)) - 1;
+ const uint64_t max = (uint64_t{1} << static_cast<uint64_t>(bit_width)) - 1;

(e.g. you want 2**10 - 1 for 10-bit packing, not 2**9 - 1)

Comment on lines +101 to 103
if (!written) {
throw std::runtime_error("Cannot write more values");
}
@pitrou (Member):

Let's avoid exceptions; you could, for example, have this function return a Result<std::vector<uint8_t>>.
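
For instance, a sketch of that shape, assuming the helper wraps bit_util::BitWriter (header paths and the helper's real signature may differ):

#include <cstdint>
#include <vector>

#include "arrow/result.h"
#include "arrow/util/bit_stream_utils.h"
#include "arrow/util/bit_util.h"

arrow::Result<std::vector<uint8_t>> PackValues(const std::vector<uint64_t>& values,
                                               int bit_width) {
  std::vector<uint8_t> packed(
      arrow::bit_util::BytesForBits(static_cast<int64_t>(values.size()) * bit_width));
  arrow::bit_util::BitWriter writer(packed.data(), static_cast<int>(packed.size()));
  for (const uint64_t v : values) {
    if (!writer.PutValue(v, bit_width)) {
      return arrow::Status::Invalid("Cannot write more values");
    }
  }
  writer.Flush();
  return packed;
}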
