[SPARK-55788][PYTHON] Support ExtensionDType for integers in Pandas UDF #54568
zhengruifeng wants to merge 9 commits into apache:master from
Conversation
| @@ -1,11 +1,11 @@ | |||
| Test Case Spark Type Spark Value Python Type Python Value | |||
| 0 byte_values tinyint [-128, 127, 0] ['int', 'int', 'int'] ['-128', '127', '0'] | |||
| 0 byte_values tinyint [-128, 127, 0] ['int8', 'int8', 'int8'] ['-128', '127', '0'] | |||
for arrow-optimized Python UDFs with legacy pandas conversion, the input int is changed to np.int8. Is this acceptable? @ueshin
otherwise, we can exclude it from this PR and always convert to int/float
python/pyspark/sql/tests/test_udf.py
Outdated
| def my_udf(x): | ||
| logger = PySparkLogger.getLogger("PySparkLogger") | ||
| logger.warning("PySparkLogger test", x=x) | ||
| logger.warning("PySparkLogger test", x=str(x)) |
this test fails https://github.com/zhengruifeng/spark/actions/runs/22566466865/job/65363804340
! Row(level='ERROR', msg='Traceback (most recent call last):', context={'func_name': 'my_udf'}, logger='stderr')
! Row(level='ERROR', msg='Traceback (most recent call last):', context={'func_name': 'my_udf'}, logger='stderr')
! Row(level='ERROR', msg='TypeError: Object of type int64 is not JSON serializable', context={'func_name': 'my_udf'}, logger='stderr')
! Row(level='ERROR', msg='TypeError: Object of type int64 is not JSON serializable', context={'func_name': 'my_udf'}, logger='stderr')
I guess we need some fix in udf logging
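The failure above reproduces with plain json: the stdlib encoder rejects NumPy scalars. A small sketch of the error and two possible fixes (casting before logging, as the `str(x)` change in the test does, or passing a `default` encoder):

```python
import json
import numpy as np

x = np.int64(7)

# Reproduces the logged failure: np.int64 is not JSON serializable
try:
    json.dumps({"x": x})
except TypeError as e:
    print(e)  # Object of type int64 is not JSON serializable

# Fix 1: cast to a native Python type before logging
print(json.dumps({"x": int(x)}))        # {"x": 7}

# Fix 2: give the encoder a fallback for unknown types
print(json.dumps({"x": x}, default=str))
```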
| runner_conf.timezone, | ||
| runner_conf.safecheck, | ||
| runner_conf.assign_cols_by_name, | ||
| runner_conf.prefer_int_ext_dtype, |
@Yicong-Huang shall we enforce keyword arguments in the serializers' init? I feel it is error-prone since they take too many arguments
That's a good point. We should enforce keyword arguments on serializers. Also, the majority of serializers will eventually be removed after the refactoring, and those parameters won't need to be passed to the remaining serializers.
I think we will stay at the intermediate step for a while before the refactoring is done
ok will raise a PR to change it
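The keyword-only enforcement being proposed can be sketched with a bare `*` in the signature; the class name `PandasSerializer` here is illustrative, not the actual Spark class, and the parameter names mirror the snippet above.

```python
class PandasSerializer:
    # Everything after the bare `*` must be passed by keyword
    def __init__(self, *, timezone, safecheck, assign_cols_by_name,
                 prefer_int_ext_dtype=False):
        self.timezone = timezone
        self.safecheck = safecheck
        self.assign_cols_by_name = assign_cols_by_name
        self.prefer_int_ext_dtype = prefer_int_ext_dtype

# Positional construction now fails fast instead of silently
# misassigning a long run of look-alike arguments
try:
    PandasSerializer("UTC", True, False, True)
except TypeError:
    print("positional args rejected")

ok = PandasSerializer(timezone="UTC", safecheck=True,
                      assign_cols_by_name=False, prefer_int_ext_dtype=True)
print(ok.timezone)  # UTC
```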
defaults to false
| schema: Optional["StructType"] = None, | ||
| struct_in_pandas: str = "dict", | ||
| ndarray_as_list: bool = False, | ||
| prefer_int_ext_dtype: bool = True, |
What changes were proposed in this pull request?
Always use ExtensionDType for integers in Pandas UDF
Why are the changes needed?
The current dtype for integers is not predictable: it depends on the nullability of the current batch
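The unpredictability can be shown with plain pandas: the same integer column lands on a different dtype depending on whether the batch happens to contain a null, while the nullable extension dtype stays integral either way.

```python
import pandas as pd

# Without nulls the column infers int64; one null silently flips it to float64
print(pd.Series([1, 2, 3]).dtype)        # int64
print(pd.Series([1, None, 3]).dtype)     # float64

# The nullable extension dtype keeps integers integral regardless of nulls
print(pd.Series([1, 2, 3], dtype="Int64").dtype)     # Int64
print(pd.Series([1, None, 3], dtype="Int64").dtype)  # Int64
```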
Does this PR introduce any user-facing change?
Yes, controlled by a new config: spark.sql.execution.pythonUDF.pandas.preferIntExtensionDtype
How was this patch tested?
Added tests
Was this patch authored or co-authored using generative AI tooling?
No