
[SPARK-55788][PYTHON] Support ExtensionDType for integers in Pandas UDF#54568

Open
zhengruifeng wants to merge 9 commits into apache:master from zhengruifeng:add_config

Conversation

@zhengruifeng (Contributor)

What changes were proposed in this pull request?

Always use ExtensionDType for integers in Pandas UDF

Why are the changes needed?

The dtype used for integer columns is currently not predictable: it depends on the nullability of the current batch (a batch containing nulls falls back to a floating-point dtype, since NumPy integer dtypes cannot represent missing values).

Does this PR introduce any user-facing change?

Yes; the new behavior is controlled by the new config spark.sql.execution.pythonUDF.pandas.preferIntExtensionDtype

How was this patch tested?

Added tests

Was this patch authored or co-authored using generative AI tooling?

No
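The unpredictability described above can be illustrated with plain pandas (a minimal sketch, independent of Spark): NumPy-backed integer Series silently fall back to float64 when a batch contains nulls, while the nullable Int64 extension dtype stays stable across batches.

```python
import pandas as pd

# Default NumPy-backed dtypes: a null in the batch forces a float64
# fallback, because NumPy's int64 cannot represent missing values.
print(pd.Series([1, 2, 3]).dtype)     # int64
print(pd.Series([1, None, 3]).dtype)  # float64 (dtype depends on the batch)

# pandas' nullable extension dtype: stable regardless of nulls.
print(pd.Series([1, 2, 3], dtype="Int64").dtype)     # Int64
print(pd.Series([1, None, 3], dtype="Int64").dtype)  # Int64
```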

@@ -1,11 +1,11 @@
Test Case Spark Type Spark Value Python Type Python Value
- 0 byte_values tinyint [-128, 127, 0] ['int', 'int', 'int'] ['-128', '127', '0']
+ 0 byte_values tinyint [-128, 127, 0] ['int8', 'int8', 'int8'] ['-128', '127', '0']
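The diff above changes the observed element type from a plain Python int to a NumPy int8 scalar. A minimal pandas sketch of the difference, assuming an int8-backed Series is what the UDF now receives:

```python
import pandas as pd

# Indexing into an int8-backed Series yields NumPy int8 scalars,
# not plain Python ints.
s = pd.Series([-128, 127, 0], dtype="int8")
print(type(s[0]).__name__)  # int8 (a NumPy scalar)
print(int(s[0]))            # -128 (explicit conversion recovers a Python int)
```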
@zhengruifeng (Contributor, Author), Mar 2, 2026:

For arrow-optimized Python UDFs with legacy pandas conversion, the input int is changed to np.int8. Is this acceptable? @ueshin

@zhengruifeng (Contributor, Author), Mar 3, 2026:

Otherwise, we can exclude it from this PR and always convert to int/float.

def my_udf(x):
    logger = PySparkLogger.getLogger("PySparkLogger")
-   logger.warning("PySparkLogger test", x=x)
+   logger.warning("PySparkLogger test", x=str(x))
@zhengruifeng (Contributor, Author):

This test fails: https://github.com/zhengruifeng/spark/actions/runs/22566466865/job/65363804340

! Row(level='ERROR', msg='Traceback (most recent call last):', context={'func_name': 'my_udf'}, logger='stderr')

! Row(level='ERROR', msg='TypeError: Object of type int64 is not JSON serializable', context={'func_name': 'my_udf'}, logger='stderr')

I guess we need some fix in the UDF logging.
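The TypeError above is the stdlib json module refusing NumPy scalars. A minimal reproduction, plus one possible fix (the np_safe hook is a hypothetical name for illustration, not the actual PySpark logging code):

```python
import json

import numpy as np

# Reproduce the failure: json cannot serialize NumPy scalar types.
try:
    json.dumps({"x": np.int64(1)})
except TypeError as exc:
    print(exc)  # Object of type int64 is not JSON serializable

# One possible fix in the UDF logging path: a default= hook that
# unwraps NumPy scalars into plain Python numbers.
def np_safe(obj):
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

print(json.dumps({"x": np.int64(1)}, default=np_safe))  # {"x": 1}
```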

@zhengruifeng zhengruifeng requested a review from ueshin March 3, 2026 02:18
runner_conf.timezone,
runner_conf.safecheck,
runner_conf.assign_cols_by_name,
runner_conf.prefer_int_ext_dtype,
@zhengruifeng (Contributor, Author):

@Yicong-Huang shall we enforce keyword arguments in the serializers' __init__? I feel it is error-prone since they take too many arguments.
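The keyword-only suggestion can be sketched with a bare `*` in the constructor signature (class and parameter names below are illustrative, not the actual PySpark serializers):

```python
# Illustrative serializer with keyword-only constructor arguments:
# the bare `*` forces call sites to name every parameter, so a long
# positional argument list can no longer bind to the wrong parameters.
class PandasBatchSerializer:
    def __init__(self, *, timezone, safecheck, assign_cols_by_name,
                 prefer_int_ext_dtype=False):
        self.timezone = timezone
        self.safecheck = safecheck
        self.assign_cols_by_name = assign_cols_by_name
        self.prefer_int_ext_dtype = prefer_int_ext_dtype

# Keyword call sites work as before.
ser = PandasBatchSerializer(
    timezone="UTC",
    safecheck=True,
    assign_cols_by_name=True,
    prefer_int_ext_dtype=True,
)

# Positional calls now fail fast with a TypeError instead of misbinding.
try:
    PandasBatchSerializer("UTC", True, True, True)
except TypeError as exc:
    print(exc)
```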

Contributor:
That's a good point; we should enforce keyword arguments on serializers. Also, the majority of serializers will eventually be removed after the refactoring, and those parameters won't need to be passed to the remaining ones.

@zhengruifeng (Contributor, Author), Mar 3, 2026:

I think we will stay in this intermediate state for a while before the refactoring is done.

Contributor:

OK, I will raise a PR to change it.

Contributor:

PR opened: #54605

@zhengruifeng changed the title from "[SPARK-55788][PYTHON] Always use ExtensionDType for integers in Pandas UDF" to "[SPARK-55788][PYTHON] Support ExtensionDType for integers in Pandas UDF" on Mar 3, 2026.
defaults to false
@ueshin (Member) left a comment:

Otherwise, LGTM.

schema: Optional["StructType"] = None,
struct_in_pandas: str = "dict",
ndarray_as_list: bool = False,
prefer_int_ext_dtype: bool = True,
Member:

This should be False?
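The review point above is about keeping the Python-side keyword default in sync with the documented conf default of false. A hedged sketch of how the worker side might resolve the value (the function name and dict handling are illustrative, not the actual PySpark code):

```python
# Spark confs reach the Python worker as strings; the documented default
# for this conf is "false", so the Python-side keyword default should be
# False as well to stay consistent.
CONF_KEY = "spark.sql.execution.pythonUDF.pandas.preferIntExtensionDtype"

def resolve_prefer_int_ext_dtype(conf):
    return conf.get(CONF_KEY, "false").lower() == "true"

print(resolve_prefer_int_ext_dtype({}))                  # False
print(resolve_prefer_int_ext_dtype({CONF_KEY: "true"}))  # True
```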


4 participants