[SPARK-55802][SQL] Fix integer overflow when computing Arrow batch bytes by viirya · Pull Request #54584 · apache/spark

viirya · 2026-03-03T00:28:52Z

What changes were proposed in this pull request?

ArrowWriter.sizeInBytes() and SliceBytesArrowOutputProcessorImpl.getBatchBytes() both accumulated per-column buffer sizes (each an Int) into an Int accumulator. When the total exceeds Int.MaxValue (about 2 GB), the sum silently wraps negative. The byte-limit checks controlled by spark.sql.execution.arrow.maxBytesPerBatch and spark.sql.execution.arrow.maxBytesPerOutputBatch then behave incorrectly and can allow oversized batches through.

Fix by changing both accumulators and return types to Long.
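
The wraparound described above can be reproduced with a minimal sketch. This is not the Spark code: the class and method names below (`BatchBytesOverflow`, `sumAsInt`, `sumAsLong`, `exceedsLimit`) are hypothetical, the limit check is only an assumed shape of such a check, and the example is written in Java because JVM `int`/`long` arithmetic matches Scala's `Int`/`Long`.

```java
// Illustrative only: hypothetical names, not the actual Spark implementation.
class BatchBytesOverflow {

    // Pre-fix pattern: per-column sizes (int) summed into an int accumulator.
    static int sumAsInt(int[] columnSizes) {
        int total = 0;
        for (int size : columnSizes) {
            total += size; // wraps silently once the sum passes Integer.MAX_VALUE
        }
        return total;
    }

    // Post-fix pattern: the same sizes summed into a long accumulator.
    static long sumAsLong(int[] columnSizes) {
        long total = 0L;
        for (int size : columnSizes) {
            total += size; // each int is widened to long before the addition
        }
        return total;
    }

    // Assumed shape of a byte-limit check: a non-positive limit disables it.
    static boolean exceedsLimit(long batchBytes, long maxBytesPerBatch) {
        return maxBytesPerBatch > 0 && batchBytes >= maxBytesPerBatch;
    }

    public static void main(String[] args) {
        // Two columns of 1.5 GB each: 3,000,000,000 bytes in total.
        int[] columnSizes = {1_500_000_000, 1_500_000_000};
        long limit = 64L * 1024 * 1024; // hypothetical 64 MiB limit

        System.out.println(sumAsInt(columnSizes));  // -1294967296
        System.out.println(sumAsLong(columnSizes)); // 3000000000

        // The wrapped (negative) size never reaches the limit, so the check is skipped.
        System.out.println(exceedsLimit(sumAsInt(columnSizes), limit));  // false
        System.out.println(exceedsLimit(sumAsLong(columnSizes), limit)); // true
    }
}
```

With the `int` accumulator, a 3,000,000,000-byte batch is reported as negative, so any comparison of the form "size >= limit" fails and the batch is treated as under the limit. Widening the accumulator and the return type to `long` keeps the comparison meaningful.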

Why are the changes needed?

Fix possible overflow when calculating Arrow batch bytes.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6 noreply@anthropic.com

viirya · 2026-03-03T00:35:08Z

This issue is reported by @sunchao.

sunchao · 2026-03-03T00:41:56Z

Thanks @viirya !

dongjoon-hyun

+1, LGTM.

`ArrowWriter.sizeInBytes()` and `SliceBytesArrowOutputProcessorImpl.getBatchBytes()` both accumulated per-column buffer sizes (each an `Int`) into an `Int` accumulator. When the total exceeds 2 GB the sum silently wraps negative, causing the byte-limit checks controlled by `spark.sql.execution.arrow.maxBytesPerBatch` and `spark.sql.execution.arrow.maxBytesPerOutputBatch` to behave incorrectly and potentially allow oversized batches through.

Fix by changing both accumulators and return types to `Long`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

### What changes were proposed in this pull request?

`ArrowWriter.sizeInBytes()` and `SliceBytesArrowOutputProcessorImpl.getBatchBytes()` both accumulated per-column buffer sizes (each an `Int`) into an `Int` accumulator. When the total exceeds 2 GB the sum silently wraps negative, causing the byte-limit checks controlled by `spark.sql.execution.arrow.maxBytesPerBatch` and `spark.sql.execution.arrow.maxBytesPerOutputBatch` to behave incorrectly and potentially allow oversized batches through.

Fix by changing both accumulators and return types to `Long`.

### Why are the changes needed?

Fix possible overflow when calculating Arrow batch bytes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Closes #54584 from viirya/fix-arrow-batch-bytes-overflow.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit df195ac)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

viirya · 2026-03-04T16:31:28Z

Merged to master/4.1.

Manually backported to 4.0 in #54624 because of some conflicts.

Thanks @HyukjinKwon @sunchao @zhengruifeng @dongjoon-hyun @yaooqinn @Yicong-Huang

HyukjinKwon approved these changes Mar 3, 2026

sunchao approved these changes Mar 3, 2026

zhengruifeng approved these changes Mar 3, 2026

dongjoon-hyun approved these changes Mar 3, 2026

viirya force-pushed the fix-arrow-batch-bytes-overflow branch from 6a9089f to edca3d5 on March 3, 2026 03:10

yaooqinn approved these changes Mar 3, 2026

viirya force-pushed the fix-arrow-batch-bytes-overflow branch 4 times, most recently from 8208dfb to 36e6738 on March 3, 2026 21:42

Yicong-Huang approved these changes Mar 3, 2026

viirya force-pushed the fix-arrow-batch-bytes-overflow branch from 36e6738 to 0de3d67 on March 4, 2026 02:48

viirya force-pushed the fix-arrow-batch-bytes-overflow branch from 0de3d67 to 076746f on March 4, 2026 07:11

viirya closed this in df195ac Mar 4, 2026

viirya wants to merge 1 commit into apache:master from viirya:fix-arrow-batch-bytes-overflow
