Conversation

@Yicong-Huang (Contributor) commented Jan 7, 2026

What changes were proposed in this pull request?

Add tests for PyArrow's pa.array type inference behavior. These tests monitor upstream PyArrow behavior to ensure PySpark's assumptions remain valid across versions.

The tests cover type inference across input categories:

  1. Nullable data - with None values
  2. Plain Python instances - list, tuple, dict (struct)
  3. Pandas instances - numpy-backed Series, nullable extension types, ArrowDtype
  4. NumPy array - all numeric dtypes, datetime64, timedelta64
  5. Nested types - list of list, list of struct, struct of struct, struct of list
  6. Explicit type specification - large_list, fixed_size_list, map_, large_string, large_binary

Types tested include:

Category | Types Covered
--- | ---
Primitive | int8/16/32/64, uint8/16/32/64, float16/32/64, bool, string, binary
Temporal | date32, timestamp (s/ms/us/ns), time64, duration (s/ms/us/ns)
Decimal | decimal128
Nested | list_, struct, map_ (explicit only)
Large variants | large_list, large_string, large_binary (explicit only)

Pandas extension types tested:

  • Nullable types: pd.Int8Dtype() ... pd.Int64Dtype(), pd.UInt8Dtype() ... pd.UInt64Dtype(), pd.Float32Dtype(), pd.Float64Dtype(), pd.BooleanDtype(), pd.StringDtype()
  • PyArrow-backed: pd.ArrowDtype(pa.int64()), pd.ArrowDtype(pa.float64()), pd.ArrowDtype(pa.large_string()), etc.

Why are the changes needed?

This is part of SPARK-54936 to monitor behavior changes from upstream dependencies. By testing PyArrow's type inference behavior, we can detect breaking changes when upgrading PyArrow versions.

Does this PR introduce any user-facing change?

No. This PR only adds tests.

How was this patch tested?

New unit tests added:

python -m pytest python/pyspark/tests/upstream/pyarrow/test_pyarrow_type_inference.py -v

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot commented Jan 7, 2026

JIRA Issue Information

=== Sub-task SPARK-54938 ===
Summary: Add tests for pa.array type inference
Assignee: Yicong Huang
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@HyukjinKwon (Member) commented:

cc @zhengruifeng

class PyArrowTypeInferenceTests(unittest.TestCase):
"""Test PyArrow's type inference behavior for pa.array."""

def test_nullable_data(self):
Contributor:
@Yicong-Huang shall we check the inference with a for loop like

for value, expected_type in [
   ([None], pa.null()),
   ...
]:
   ...

Contributor Author:

I'd prefer explicit checks over grouping inputs and expected values in a for loop:

  • the current version is easier to read; with a for loop, if anything breaks, it is harder to find the failing pair.
  • some of the cases need special comments, and commenting inside a list of pairs would be harder to understand.

Contributor:

it would be harder to find the failed pair

assertions support a custom error message, e.g. self.assertEqual(a, b, f"{a} {b}")

some of the cases need special comments

the comments can be interleaved with the code:

for value, expected_type in [
   # comments
   ([None], pa.null()),
   # comments
   ...
]:

Contributor Author:

Those feel more like workarounds than clear benefits. Is there a concrete advantage to switching to a for loop in this case?

Contributor:

to make the code concise, so that developers can review more cases on one screen

self.assertEqual(a.type, pa.date32())

# ArrowDtype - timestamp with timezone
tz = "UTC"
Contributor:

For the timezone case, let's also test a non-UTC timezone.

Contributor Author:

Sure, will add it.

self.assertEqual(a.type, pa.date32())

# Datetime Series (pandas Timestamp, numpy datetime64[ns])
s = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
Contributor:

Let's also test datetime.datetime and pd.Timestamp with timezone.

Contributor Author:

added!

dtype=pd.ArrowDtype(pa_type),
)
a = pa.array(s)
self.assertEqual(a.type, pa_type)
Contributor:

shall we also test nested pandas series?

Contributor Author:

added some cases

#
# Key input types tested:
# 1. nullable data (with None values)
# 2. plain Python instances (list, tuple, array)
Contributor:

let's also consider dict

pa.array([{}]).type
pa.array([{"a":3,"b":None,"c":"s"}]).type
pa.array([{"a":3,"b":"s"}, {"x":5, "a":1.0}]).type

Contributor Author:

added

@zhengruifeng (Contributor) left a comment:

Thanks so much, it is much cleaner.

Inspired by https://github.com/apache/spark/pull/53727/changes, I think we also need to test the following cases:

  1. string: non-English values
  2. integral: min/max values, and overflow
  3. floats: nan, -inf, inf
  4. time: Unix epoch, min/max values

import pyarrow as pa

for name, data, expected_type in cases:
with self.subTest(name=name):
Contributor:

subTest will generate a lot of log output in the file; I feel we can just remove it.

And the name field is not necessary.

import pyarrow as pa

cases = [
# fmt: off
Contributor:

I feel enabling the formatter won't hurt readability too much if we remove the name field.

@Yicong-Huang force-pushed the SPARK-54938/test/add-tests-for-pa-array-type-inference branch from efbc505 to 905b616 on January 9, 2026 at 21:34
@Yicong-Huang force-pushed the SPARK-54938/test/add-tests-for-pa-array-type-inference branch from 905b616 to f12b2a0 on January 9, 2026 at 21:37