[SPARK-54938][PYTHON][TESTS] Add tests for pa.array type inference #53718
Conversation
JIRA Issue Information: Sub-task SPARK-54938. (This comment was automatically generated by GitHub Actions.)
    class PyArrowTypeInferenceTests(unittest.TestCase):
        """Test PyArrow's type inference behavior for pa.array."""

        def test_nullable_data(self):
@Yicong-Huang shall we check the inference with a for loop like
for value, expected_type in [
([None], pa.null())
...
]:
...
I'd prefer explicit checks over grouping inputs and expected values in a for loop:
- the current version is easier to read; with a for loop, if anything breaks, it would be harder to find the failing pair.
- some of the cases need special comments, and commenting inside a list of pairs would also be harder to understand.
> it would be harder to find the failed pair

assertions support error messages, like self.assertEqual(a, b, f"{a} {b}")

> some of the cases need special comments

the comments can be interleaved with the code:
for value, expected_type in [
# comments
([None], pa.null()),
# comments
...
]:
Those feel more like workarounds than clear benefits. Is there a concrete advantage to switching to a for loop in this case?
To make the code concise, so that developers can review more cases on one screen.
    self.assertEqual(a.type, pa.date32())

    # ArrowDtype - timestamp with timezone
    tz = "UTC"
For the timezone case, let's also test a non-UTC timezone.
Sure, will add it.
    self.assertEqual(a.type, pa.date32())

    # Datetime Series (pandas Timestamp, numpy datetime64[ns])
    s = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
let's also test datetime.datetime and pd.Timestamp with a timezone
added!
        dtype=pd.ArrowDtype(pa_type),
    )
    a = pa.array(s)
    self.assertEqual(a.type, pa_type)
shall we also test nested pandas series?
added some cases
    #
    # Key input types tested:
    # 1. nullable data (with None values)
    # 2. plain Python instances (list, tuple, array)
let's also consider dict
pa.array([{}]).type
pa.array([{"a":3,"b":None,"c":"s"}]).type
pa.array([{"a":3,"b":"s"}, {"x":5, "a":1.0}]).type
added
zhengruifeng
left a comment
thanks so much, it is much cleaner.

Inspired by https://github.com/apache/spark/pull/53727/changes, I think we also need to test the following cases:
1. string: non-English values
2. integral: min/max values, and making them overflow
3. floats: nan, -inf, inf
4. time: Unix epoch, min/max values
    import pyarrow as pa

    for name, data, expected_type in cases:
        with self.subTest(name=name):
subTest will generate a lot of log output in the file; I feel we can just remove it. The name field is not necessary either.
    import pyarrow as pa

    cases = [
        # fmt: off
I feel enabling fmt won't hurt readability too much if we remove the name field.
What changes were proposed in this pull request?
Add tests for PyArrow's pa.array type inference behavior. These tests monitor upstream PyArrow behavior to ensure PySpark's assumptions remain valid across versions.
The tests cover type inference across input categories:
- nullable data (None values)
- plain Python containers: list, tuple, dict (struct)
Types tested include:
Pandas extension types tested:
- pd.Int8Dtype()...pd.Int64Dtype(), pd.UInt8Dtype()...pd.UInt64Dtype(), pd.Float32Dtype(), pd.Float64Dtype(), pd.BooleanDtype(), pd.StringDtype()
- pd.ArrowDtype(pa.int64()), pd.ArrowDtype(pa.float64()), pd.ArrowDtype(pa.large_string()), etc.
Why are the changes needed?
This is part of SPARK-54936 to monitor behavior changes from upstream dependencies. By testing PyArrow's type inference behavior, we can detect breaking changes when upgrading PyArrow versions.
Does this PR introduce any user-facing change?
No. This PR only adds tests.
How was this patch tested?
New unit tests added:
Was this patch authored or co-authored using generative AI tooling?
No.