Skip to content

Conversation

@dahbka-lis
Copy link
Contributor

@dahbka-lis dahbka-lis commented Oct 14, 2025

Rationale for this change

There is a bug for creating union types with empty type_codes. If fields.size() == 128 (kMaxTypeCode + 1) and type_codes is empty, static_cast<int8_t> returns -128 and internal::Iota generates an empty vector of type codes, but the expected vector is [0, 1, 2, ..., 127], where 127 is kMaxTypeCode.

What changes are included in this PR?

  • Added a new internal::Iota function to generate vectors of size = length with values from start.
  • Changed internal::Iota call from old parameters to new for creating dense_union and sparse_union types.
  • Implemented a new test to detect this error.

Are these changes tested?

Yes, there is a new test.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix".
(b) a bug that caused incorrect or invalid data to be produced

If fields.size() == 128 (kMaxTypeCode + 1), internal::Iota returns an empty vector that cannot validate here:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/type.cc#L1232

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@dahbka-lis dahbka-lis changed the title [C++] Fix creating union types without type_codes for fields.size() == 128 MINOR: [C++] Fix creating union types without type_codes for fields.size() == 128 Oct 14, 2025
@dahbka-lis
Copy link
Contributor Author

dahbka-lis commented Oct 18, 2025

Hi, @kou! Do you know someone who can review this PR? It looks like a bug rather than a "feature"

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!
This doesn't fit our definition of MINOR, could you please open an issue for this change? See our definition for MINOR:
https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#minor-fixes
Thanks!

@dahbka-lis dahbka-lis changed the title MINOR: [C++] Fix creating union types without type_codes for fields.size() == 128 GH-47859: [C++] Fix creating union types without type_codes for fields.size() == 128 Oct 18, 2025
@github-actions
Copy link

⚠️ GitHub issue #47859 has been automatically assigned in GitHub to PR creator.

@dahbka-lis
Copy link
Contributor Author

Hi, @kou! Can you take a look?

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It seems that we can't create a union type that has fields.size() == 128:

https://arrow.apache.org/docs/format/Columnar.html#dense-union

A union with more than 127 possible types can be modeled as a union of unions.

Comment on lines 48 to 53
template <typename T>
std::vector<T> Iota(T start, size_t length) {
std::vector<T> result(length);
std::iota(result.begin(), result.end(), start);
return result;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this in

std::vector<T> result(static_cast<size_t>(stop - start));
std::iota(result.begin(), result.end(), start);
return result;
}
?

Copy link
Contributor Author

@dahbka-lis dahbka-lis Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this in

The stop border not included, so for fields.size() == 128, Iota<int8_t>(0, 128) is called and 128 is converted to -128.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to confirm whether we can do the following:

diff --git a/cpp/src/arrow/util/range.h b/cpp/src/arrow/util/range.h
index 2055328798..7ccf3d154d 100644
--- a/cpp/src/arrow/util/range.h
+++ b/cpp/src/arrow/util/range.h
@@ -33,9 +33,7 @@ std::vector<T> Iota(T start, T stop) {
   if (start > stop) {
     return {};
   }
-  std::vector<T> result(static_cast<size_t>(stop - start));
-  std::iota(result.begin(), result.end(), start);
-  return result;
+  return Iota(start, static_cast<size_t>(stop - start));
 }
 
 /// Create a vector containing the values from 0 up to length

@dahbka-lis
Copy link
Contributor Author

Hmm. It seems that we can't create a union type that has fields.size() == 128:

There are no restrictions on creating a union type that has fields.size() == 128. Supported type codes range from 0 to 127 (128 different codes): https://github.com/apache/arrow/blob/main/cpp/src/arrow/type.cc#L1246

Could there be other limitations? Current designs allow this if you build a vector of type codes yourself.

@dahbka-lis dahbka-lis requested a review from kou November 11, 2025 11:47
@kou
Copy link
Member

kou commented Nov 11, 2025

Supported type codes range from 0 to 127 (128 different codes): https://github.com/apache/arrow/blob/main/cpp/src/arrow/type.cc#L1246

This check is for each type code value check. It's not a check for the number of type codes.

@bkietz You added the following checks in #4781:

https://github.com/apache/arrow/pull/4781/files#diff-a7012482623c4234c4fade3342b1180de1885484825790d88c1d55b672c5fe79R78-R79

https://github.com/apache/arrow/pull/4781/files#diff-a7012482623c4234c4fade3342b1180de1885484825790d88c1d55b672c5fe79R110-R113

The former change limits the number of type codes to 127 but the latter change limits the number of type codes to 128.

Do you have any opinion for this report?

@pitrou You may have an opinion for this report because you introduced UnionType::kMaxTypeId in #5892.

@pitrou
Copy link
Member

pitrou commented Nov 12, 2025

kMaxTypeId is 127 because it's the maximum type id that can be represented, but the maximum number of type ids is 128.

@dahbka-lis
Copy link
Contributor Author

kMaxTypeId is 127 because it's the maximum type id that can be represented, but the maximum number of type ids is 128.

@pitrou If the maximum number of type ids is 128, it is correct to create a union type with field.size() == 128, right?
If it is not, let me extend validation rules to check this situation

@pitrou
Copy link
Member

pitrou commented Nov 15, 2025

@pitrou If the maximum number of type ids is 128, it is correct to create a union type with field.size() == 128, right?

Yes, it should be.

@kou
Copy link
Member

kou commented Nov 17, 2025

https://arrow.apache.org/docs/format/Columnar.html#dense-union

A union with more than 127 possible types can be modeled as a union of unions.

Should we update the "127" in the document to "128" too?

@dahbka-lis
Copy link
Contributor Author

Should we update the "127" in the document to "128" too?

Done, is it alright?

@dahbka-lis
Copy link
Contributor Author

@kou is there anything to do to complete this PR?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Nov 21, 2025
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix this error?

https://github.com/apache/arrow/actions/runs/19468399277/job/56026880992?pr=47815#step:11:1701

FAILED: [code=2] src/arrow/CMakeFiles/arrow-type-test.dir/Unity/unity_0_cxx.cxx.obj 
"C:\Program Files\Git\usr\bin\ccache.exe" C:\PROGRA~1\MICROS~2\2022\ENTERP~1\VC\Tools\MSVC\1444~1.352\bin\Hostx64\x64\cl.exe  /nologo /TP -DARROW_HAVE_AVX2 -DARROW_HAVE_BMI2 -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_WITH_BENCHMARKS_REFERENCE -DARROW_WITH_TIMING_TESTS -D_CRT_SECURE_NO_WARNINGS -D__SSE2__ -D__SSE4_1__ -D__SSE4_2__ -ID:\a\arrow\arrow\build\cpp\src -ID:\a\arrow\arrow\cpp\src -ID:\a\arrow\arrow\cpp\src\generated -external:ID:\a\arrow\arrow\build\cpp\_deps\googletest-src\googlemock\include -external:ID:\a\arrow\arrow\build\cpp\_deps\googletest-src\googletest\include -external:ID:\a\arrow\arrow\cpp\thirdparty\flatbuffers\include -external:ID:\a\arrow\arrow\build\cpp\_deps\googletest-src\googletest -external:ID:\a\arrow\arrow\build\cpp\_deps\googletest-src\googlemock -external:W0 /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING /W3 /EHsc /wd5105 /bigobj /utf-8 /W3 /wd4365 /wd4267 /wd4838 /wd4800 /wd4996 /wd4065 /arch:AVX2  /Ob0 /Od /RTC1 /WX -std:c++17 -MDd -Zi /showIncludes /Fosrc\arrow\CMakeFiles\arrow-type-test.dir\Unity\unity_0_cxx.cxx.obj /Fdsrc\arrow\CMakeFiles\arrow-type-test.dir\ /FS -c D:\a\arrow\arrow\build\cpp\src\arrow\CMakeFiles\arrow-type-test.dir\Unity\unity_0_cxx.cxx
D:/a/arrow/arrow/cpp/src/arrow/type_test.cc(2201): error C2039: 'iota': is not a member of 'std'
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.44.35207\include\unordered_set(23): note: see declaration of 'std'
D:/a/arrow/arrow/cpp/src/arrow/type_test.cc(2201): error C3861: 'iota': identifier not found

This?

diff --git a/cpp/src/arrow/type_test.cc b/cpp/src/arrow/type_test.cc
index cb17f1ac3f..4c796ceee7 100644
--- a/cpp/src/arrow/type_test.cc
+++ b/cpp/src/arrow/type_test.cc
@@ -22,6 +22,7 @@
 #include <cstdint>
 #include <functional>
 #include <memory>
+#include <numeric>
 #include <string>
 #include <unordered_set>
 #include <vector>

Comment on lines 48 to 53
template <typename T>
std::vector<T> Iota(T start, size_t length) {
std::vector<T> result(length);
std::iota(result.begin(), result.end(), start);
return result;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to confirm whether we can do the following:

diff --git a/cpp/src/arrow/util/range.h b/cpp/src/arrow/util/range.h
index 2055328798..7ccf3d154d 100644
--- a/cpp/src/arrow/util/range.h
+++ b/cpp/src/arrow/util/range.h
@@ -33,9 +33,7 @@ std::vector<T> Iota(T start, T stop) {
   if (start > stop) {
     return {};
   }
-  std::vector<T> result(static_cast<size_t>(stop - start));
-  std::iota(result.begin(), result.end(), start);
-  return result;
+  return Iota(start, static_cast<size_t>(stop - start));
 }
 
 /// Create a vector containing the values from 0 up to length

@dahbka-lis
Copy link
Contributor Author

Could you fix this error?

https://github.com/apache/arrow/actions/runs/19468399277/job/56026880992?pr=47815#step:11:1701

FIxed, can you approve workflows?

@dahbka-lis dahbka-lis requested a review from kou November 22, 2025 20:40
@kou
Copy link
Member

kou commented Nov 22, 2025

Could you rebase on main to confirm whether the CI failures are unrelated to this change?

@dahbka-lis
Copy link
Contributor Author

Could you rebase on main to confirm whether the CI failures are unrelated to this change?

OK, can you approve workflows?

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Failures are unrelated: #48230

@kou kou merged commit 55587ef into apache:main Nov 23, 2025
37 of 45 checks passed
@kou kou removed the awaiting merge Awaiting merge label Nov 23, 2025
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 55587ef.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants