[SPARK-55788][PYTHON] Support ExtensionDType for integers in Pandas UDF #54568
zhengruifeng wants to merge 9 commits into apache:master from
Conversation
| @@ -1,11 +1,11 @@ | |||
| Test Case Spark Type Spark Value Python Type Python Value | |||
| 0 byte_values tinyint [-128, 127, 0] ['int', 'int', 'int'] ['-128', '127', '0'] | |||
| 0 byte_values tinyint [-128, 127, 0] ['int8', 'int8', 'int8'] ['-128', '127', '0'] | |||
for arrow-optimized Python UDFs with legacy pandas conversion, the input int is changed to np.int8. Is this acceptable? @ueshin
otherwise, we can exclude it from this PR and always convert to int/float
python/pyspark/sql/tests/test_udf.py
Outdated
| def my_udf(x): | ||
| logger = PySparkLogger.getLogger("PySparkLogger") | ||
| logger.warning("PySparkLogger test", x=x) | ||
| logger.warning("PySparkLogger test", x=str(x)) |
this test fails https://github.com/zhengruifeng/spark/actions/runs/22566466865/job/65363804340
! Row(level='ERROR', msg='Traceback (most recent call last):', context={'func_name': 'my_udf'}, logger='stderr')
! Row(level='ERROR', msg='Traceback (most recent call last):', context={'func_name': 'my_udf'}, logger='stderr')
! Row(level='ERROR', msg='TypeError: Object of type int64 is not JSON serializable', context={'func_name': 'my_udf'}, logger='stderr')
! Row(level='ERROR', msg='TypeError: Object of type int64 is not JSON serializable', context={'func_name': 'my_udf'}, logger='stderr')
I guess we need some fix in udf logging
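The failure above reproduces with plain json: the stdlib encoder rejects NumPy scalars. A small sketch of the error and two possible fixes (casting before logging, as the `str(x)` change in the test does, or passing a `default` encoder):

```python
import json
import numpy as np

x = np.int64(7)

# Reproduces the logged failure: np.int64 is not JSON serializable
try:
    json.dumps({"x": x})
except TypeError as e:
    print(e)  # Object of type int64 is not JSON serializable

# Fix 1: cast to a native Python type before logging
print(json.dumps({"x": int(x)}))        # {"x": 7}

# Fix 2: give the encoder a fallback for unknown types
print(json.dumps({"x": x}, default=str))
```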
| runner_conf.timezone, | ||
| runner_conf.safecheck, | ||
| runner_conf.assign_cols_by_name, | ||
| runner_conf.prefer_int_ext_dtype, |
@Yicong-Huang shall we enforce keyword arguments in the serializers' init? I feel it is error-prone since they take too many arguments
That's a good point. We should enforce keyword arguments on serializers. Also, the majority of serializers will eventually be removed after the refactoring, and those parameters won't need to be passed to the remaining serializers.
I think we will stay at the intermediate step for a while before the refactoring is done
ok will raise a PR to change it
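The keyword-only enforcement being proposed can be sketched with a bare `*` in the signature; the class name `PandasSerializer` here is illustrative, not the actual Spark class, and the parameter names mirror the snippet above.

```python
class PandasSerializer:
    # Everything after the bare `*` must be passed by keyword
    def __init__(self, *, timezone, safecheck, assign_cols_by_name,
                 prefer_int_ext_dtype=False):
        self.timezone = timezone
        self.safecheck = safecheck
        self.assign_cols_by_name = assign_cols_by_name
        self.prefer_int_ext_dtype = prefer_int_ext_dtype

# Positional construction now fails fast instead of silently
# misassigning a long run of look-alike arguments
try:
    PandasSerializer("UTC", True, False, True)
except TypeError:
    print("positional args rejected")

ok = PandasSerializer(timezone="UTC", safecheck=True,
                      assign_cols_by_name=False, prefer_int_ext_dtype=True)
print(ok.timezone)  # UTC
```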
defaults to false
| schema: Optional["StructType"] = None, | ||
| struct_in_pandas: str = "dict", | ||
| ndarray_as_list: bool = False, | ||
| prefer_int_ext_dtype: bool = True, |
What changes were proposed in this pull request?
Always use ExtensionDType for integers in Pandas UDF
Why are the changes needed?
The current dtype for integers is not predictable: it depends on the nullability of the current batch
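The unpredictability can be shown with plain pandas: the same integer column lands on a different dtype depending on whether the batch happens to contain a null, while the nullable extension dtype stays integral either way.

```python
import pandas as pd

# Without nulls the column infers int64; one null silently flips it to float64
print(pd.Series([1, 2, 3]).dtype)        # int64
print(pd.Series([1, None, 3]).dtype)     # float64

# The nullable extension dtype keeps integers integral regardless of nulls
print(pd.Series([1, 2, 3], dtype="Int64").dtype)     # Int64
print(pd.Series([1, None, 3], dtype="Int64").dtype)  # Int64
```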
Does this PR introduce any user-facing change?
Yes, controlled by a new config: spark.sql.execution.pythonUDF.pandas.preferIntExtensionDtype
How was this patch tested?
Added tests
Was this patch authored or co-authored using generative AI tooling?
No