fix: re-enable Comet abs
#2595
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2595      +/-   ##
============================================
+ Coverage     56.12%   59.33%    +3.20%
- Complexity      976     1444      +468
============================================
  Files           119      146       +27
  Lines         11743    13758     +2015
  Branches       2251     2353      +102
============================================
+ Hits           6591     8163     +1572
- Misses         4012     4373      +361
- Partials       1140     1222       +82

☔ View full report in Codecov by Sentry.
Seq(2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 15, 16, 17).foreach { col =>
  checkSparkAnswerAndOperator(s"SELECT abs(_${col}) FROM tbl")
test("abs") {
  Seq(true, false).foreach { ansi_enabled =>
This is the diff; the test now runs with ANSI mode both on and off.
optional int32 _10(UINT_16);
optional int32 _11(UINT_32);
optional int64 _12(UINT_64);
optional int32 _9(INT_8);
We store negative values in these columns, so I think the schema should not use unsigned int types.
// CometTestBase.scala
record.add(8, (-i).toByte)
record.add(9, (-i).toShort)
record.add(10, -i)
record.add(11, (-i).toLong)
We have UINT here to make sure we cover all types that parquet has. The data files created here are specifically designed to test whether parquet readers can handle all types correctly. Negative values stored in a UINT parquet type test the values around the boundary of allowed values.
To illustrate with an example, when you store the value -1 in a UINT_8 field, what gets stored is the bit pattern 0xff. On reading, this comes back as the value 255, which is the maximum value for a UINT_8.
This is both correct and desirable.
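A minimal Rust sketch of that reinterpretation, using plain casts rather than the actual parquet reader:

```rust
fn main() {
    // The i8 value -1 is stored as the bit pattern 0xff.
    let signed: i8 = -1;
    assert_eq!(signed.to_ne_bytes(), [0xff]);

    // Reading those same bits back as an unsigned byte yields 255,
    // the maximum value representable by a UINT_8 column.
    let unsigned: u8 = signed as u8;
    assert_eq!(unsigned, u8::MAX); // 255
}
```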
I commented in https://github.com/apache/datafusion-comet/pull/2595/files#r2445168046 but I don't think that we should continue writing new tests that rely on makeParquetFileAllPrimitiveTypes because it doesn't explicitly generate edge case values.
Thanks @hsiang-c. WDYT of implementing abs with Spark flavor in DF, like I did recently for concat in apache/datafusion#18128?
withParquetTable(path.toString, "tbl") {
  Seq(2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 15, 16, 17).foreach { col =>
    checkSparkAnswerAndOperator(s"SELECT abs(_${col}) FROM tbl")
test("abs") {
I'd prefer to see this test added to CometFuzzMathSuite because the generated Parquet file there tests for specific edge cases such as negative floating point zero and min/max values for all types. The data generated by makeParquetFileAllPrimitiveTypes does not explicitly add these edge cases.
test("abs") {
  Seq(true, false).foreach { ansi_enabled =>
    Seq(true, false).foreach { dictionaryEnabled =>
      withSQLConf(SQLConf.ANSI_ENABLED.key -> ansi_enabled.toString) {
The test isn't currently testing any scenarios where ANSI mode would cause an exception to be thrown.
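For reference, the only integral inputs where ANSI mode would change the outcome are the type minimums, since their absolute value does not fit in the same type. A small illustration with plain Rust integer methods (not the Comet kernel):

```rust
fn main() {
    // ANSI semantics: abs(i32::MIN) overflows, so it must surface an error.
    assert_eq!(i32::MIN.checked_abs(), None);

    // Non-ANSI (legacy) semantics: the value wraps back to i32::MIN,
    // matching Spark's behavior when spark.sql.ansi.enabled=false.
    assert_eq!(i32::MIN.wrapping_abs(), i32::MIN);
}
```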
}
}
})
.unwrap(),
What should the behavior be here if the argument is ColumnarValue::Scalar(ScalarValue::Int8(None))?
Currently the .unwrap() would panic.
IMO it should just return ColumnarValue::Scalar(ScalarValue::Int8(None)).
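A minimal sketch of the null passthrough being suggested, assuming the datafusion crate's ColumnarValue/ScalarValue re-exports and a hypothetical helper name (this is not the code in the PR):

```rust
use datafusion::error::Result;
use datafusion::logical_expr::ColumnarValue;

// Hypothetical helper: short-circuit a null scalar argument instead of
// unwrapping it. abs(NULL) is NULL in Spark, so a null scalar such as
// ScalarValue::Int8(None) is returned unchanged.
fn abs_null_scalar_passthrough(arg: &ColumnarValue) -> Option<Result<ColumnarValue>> {
    match arg {
        ColumnarValue::Scalar(s) if s.is_null() => Some(Ok(ColumnarValue::Scalar(s.clone()))),
        _ => None, // fall through to the regular abs computation
    }
}
```

Returning the scalar unchanged also keeps its data type, so the caller gets a typed null rather than a panic.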
…mode (#18205)

## Which issue does this PR close?

- Part of #15914

## Rationale for this change

- Apache Spark's `abs()` behaves differently from DataFusion's.
- Apache Spark's [ANSI-compliant](https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#ansi-compliance) dialect can be toggled by the SparkConf `spark.sql.ansi.enabled`. When ANSI mode is off, arithmetic overflow does not throw an exception the way DataFusion does.
- DataFusion Comet can leverage it in apache/datafusion-comet#2595.

## What changes are included in this PR?

- This is the 1st PR to support the non-ANSI-mode, Spark-compatible `abs` math function.
- Mimics Apache Spark `v4.0.1`'s [abs expression](https://github.com/apache/spark/blob/v4.0.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L148) for numeric types only and non-ANSI mode, i.e. `spark.sql.ansi.enabled=false`.

### Tasks breakdown

| Non-ANSI mode | ANSI mode | ANSI Interval Types |
| - | - | - |
| this PR | hsiang-c#1 (will change base branch) | TODO |

## Are these changes tested?

- unit tests
- sqllogictest: `test_files/spark/math/abs.slt`

## Are there any user-facing changes?

Yes, the abs function can be specified in SQL.

- An exception will NOT be thrown on arithmetic overflow.

---------

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
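As a rough illustration of the non-ANSI semantics described above, here is a hedged sketch of a legacy-mode abs over an Int32 column, assuming arrow-rs's arity::unary kernel; it is a simplified stand-in, not the actual DataFusion spark-function implementation:

```rust
use arrow::array::{Array, Int32Array};
use arrow::compute::kernels::arity::unary;
use arrow::datatypes::Int32Type;

// Legacy (non-ANSI) abs: i32::MIN wraps back to i32::MIN instead of erroring.
fn abs_int32_legacy(input: &Int32Array) -> Int32Array {
    unary::<Int32Type, _, Int32Type>(input, |v| v.wrapping_abs())
}

fn main() {
    let input = Int32Array::from(vec![Some(-3), None, Some(i32::MIN)]);
    let result = abs_int32_legacy(&input);

    assert_eq!(result.value(0), 3);
    assert!(result.is_null(1)); // nulls propagate through the kernel
    assert_eq!(result.value(2), i32::MIN); // matches Spark with spark.sql.ansi.enabled=false
}
```

An ANSI-mode variant would presumably use a fallible kernel (e.g. checked_abs with an error on None) so that overflow surfaces as Spark's arithmetic exception instead of wrapping.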
Which issue does this PR close?
Closes #1890
Partially closes #2314
Rationale for this change
The abs implementation returns unexpected results and was turned off.

What changes are included in this PR?

- Throw org.apache.spark.SparkArithmeticException on the MIN_VALUE of Spark's IntegralType, see doc.
- In CometTestBase, changed the types of columns _9, _10, _11 and _12 from UINT_8/16/32/64 to INT_8/16/32/64 because we actually have negative values in the test data.

How are these changes tested?

- Tests cover MIN_VALUE and decimal values with different precision and scale.