Conversation

@Yicong-Huang (Contributor) commented Jan 7, 2026

What changes were proposed in this pull request?

Add tests for PyArrow's pa.array type inference behavior. These tests monitor upstream PyArrow behavior to ensure PySpark's assumptions remain valid across versions.

The tests cover type inference across input categories:

  1. Nullable data - with None values
  2. Plain Python instances - list, tuple, dict (struct)
  3. Pandas instances - numpy-backed Series, nullable extension types, ArrowDtype
  4. NumPy array - all numeric dtypes, datetime64, timedelta64
  5. Nested types - list of list, list of struct, struct of struct, struct of list
  6. Explicit type specification - large_list, fixed_size_list, map_, large_string, large_binary

Types tested include:

Category | Types Covered
--- | ---
Primitive | int8/16/32/64, uint8/16/32/64, float16/32/64, bool, string, binary
Temporal | date32, timestamp (s/ms/us/ns), time64, duration (s/ms/us/ns)
Decimal | decimal128
Nested | list_, struct, map_ (explicit only)
Large variants | large_list, large_string, large_binary (explicit only)

Pandas extension types tested:

  • Nullable types: pd.Int8Dtype() ... pd.Int64Dtype(), pd.UInt8Dtype() ... pd.UInt64Dtype(), pd.Float32Dtype(), pd.Float64Dtype(), pd.BooleanDtype(), pd.StringDtype()
  • PyArrow-backed: pd.ArrowDtype(pa.int64()), pd.ArrowDtype(pa.float64()), pd.ArrowDtype(pa.large_string()), etc.

Why are the changes needed?

This is part of SPARK-54936 to monitor behavior changes from upstream dependencies. By testing PyArrow's type inference behavior, we can detect breaking changes when upgrading PyArrow versions.

Does this PR introduce any user-facing change?

No. This PR only adds tests.

How was this patch tested?

New unit tests added:

python -m pytest python/pyspark/tests/upstream/pyarrow/test_pyarrow_type_inference.py -v

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot commented Jan 7, 2026

JIRA Issue Information

=== Sub-task SPARK-54938 ===
Summary: Add tests for pa.array type inference
Assignee: Yicong Huang
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@HyukjinKwon (Member) commented:

cc @zhengruifeng

class PyArrowTypeInferenceTests(unittest.TestCase):
"""Test PyArrow's type inference behavior for pa.array."""

def test_nullable_data(self):
Contributor:
@Yicong-Huang shall we check the inference with a for loop like

for value, expected_type in [
   ([None], pa.null()),
   ...
]:
   ...

Contributor Author:

I'd prefer explicit checks over grouping inputs and expected values in a for loop:

  • the current version is easier to read; with a for loop, if anything breaks, it is harder to find the failing pair.
  • some of the cases need special comments, and commenting inside a list of pairs would be harder to understand.

Contributor:

it would be harder to find the failed pair

assertions support a custom error message, e.g. self.assertEqual(a, b, f"{a} {b}")

some of the cases need special comments

the comments can be interleaved with the code:

for value, expected_type in [
   # comments
   ([None], pa.null()),
   # comments
   ...
]:

Contributor Author:

Those feel more like workarounds than clear benefits. Is there a concrete advantage to switching to a for loop in this case?

Contributor:

to make the code concise, so that developers can review more cases on one screen

self.assertEqual(a.type, pa.date32())

# ArrowDtype - timestamp with timezone
tz = "UTC"
Contributor:

For the timezone case, let's also test a non-UTC timezone.

Contributor Author:

Sure, will add it.

self.assertEqual(a.type, pa.date32())

# Datetime Series (pandas Timestamp, numpy datetime64[ns])
s = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
Contributor:

Let's also test datetime.datetime and pd.Timestamp with timezone.

Contributor Author:

added!

dtype=pd.ArrowDtype(pa_type),
)
a = pa.array(s)
self.assertEqual(a.type, pa_type)
Contributor:

shall we also test nested pandas series?

Contributor Author:

added some cases

#
# Key input types tested:
# 1. nullable data (with None values)
# 2. plain Python instances (list, tuple, array)
Contributor:

let's also consider dict

pa.array([{}]).type
pa.array([{"a":3,"b":None,"c":"s"}]).type
pa.array([{"a":3,"b":"s"}, {"x":5, "a":1.0}]).type

Contributor Author:

added

@zhengruifeng (Contributor) left a comment:

Thanks so much, it is much cleaner.

Inspired by https://github.com/apache/spark/pull/53727/changes, I think we also need to test the following cases:

  1. string: non-English values
  2. integral: min/max values, and overflow
  3. floats: nan, -inf, inf
  4. time: Unix epoch, min/max values

import pyarrow as pa

for name, data, expected_type in cases:
with self.subTest(name=name):
Contributor:

subTest will generate a lot of log output in the file; I feel we can just remove it.

And the name field is not necessary.

import pyarrow as pa

cases = [
# fmt: off
Contributor:

I feel enabling the formatter won't hurt readability too much if we remove the name field.

@Yicong-Huang force-pushed the SPARK-54938/test/add-tests-for-pa-array-type-inference branch from efbc505 to 905b616 on January 9, 2026 at 21:34
@Yicong-Huang force-pushed the SPARK-54938/test/add-tests-for-pa-array-type-inference branch from 905b616 to f12b2a0 on January 9, 2026 at 21:37