feat: enable native_datafusion scan in auto mode by andygrove · Pull Request #3781 · apache/datafusion-comet

andygrove · 2026-03-24T13:38:45Z

Which issue does this PR close?

Part of #3321

Rationale for this change

Improve performance with default config. The native_datafusion scan does not have the FFI roundtrip overhead that native_iceberg_compat has.

What changes are included in this PR?

Update "auto" mode to try native_datafusion first, falling back to Spark when not supported (no longer falls back to native_iceberg_compat)
Add COMET_EXEC_ENABLED guard to nativeDataFusionScan() since the scan node requires CometExecRule.transform() to wrap it in CometNativeExec
Update shuffle test suites to use COMET_EXEC_ENABLED=true so they run with auto mode (native_datafusion) scans
Remove stale assume(COMET_NATIVE_SCAN_IMPL != SCAN_NATIVE_DATAFUSION) guards from shuffle tests
Review ignored tests across all Spark diffs and make sure they are up-to-date
Ensure documentation is up-to-date
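
The selection behaviour described in the list above can be sketched roughly as follows. This is a simplified, self-contained model rather than the actual CometScanRule code: `ScanChoice`, `ScanContext`, and `chooseScan` are hypothetical names introduced for illustration, and the real rule works on Spark plan nodes, not plain flags.

```scala
// Simplified model of the scan selection described above.
// All names are hypothetical; this is not Comet's actual API.
object AutoScanSketch {
  sealed trait ScanChoice
  case object NativeDataFusion extends ScanChoice
  case object SparkFallback extends ScanChoice

  // Stand-in for what the real rule reads from the session config and the plan.
  final case class ScanContext(execEnabled: Boolean, scanSupported: Boolean)

  // Mirrors the guard added in this PR: native_datafusion needs Comet exec
  // enabled, because the scan node must later be wrapped by the exec rule.
  def nativeDataFusionScan(ctx: ScanContext): Option[ScanChoice] =
    if (!ctx.execEnabled) None
    else if (!ctx.scanSupported) None
    else Some(NativeDataFusion)

  // "auto" mode: try native_datafusion, otherwise leave the scan to Spark.
  def chooseScan(ctx: ScanContext): ScanChoice =
    nativeDataFusionScan(ctx).getOrElse(SparkFallback)
}
```

In this model there is no second native candidate, so a failed attempt goes straight to Spark.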

How are these changes tested?

Existing tests. Shuffle test suites (CometShuffleSuite, CometAsyncShuffleSuite, CometNativeShuffleSuite) updated and verified to pass with auto mode native_datafusion scans.

andygrove · 2026-03-24T15:58:12Z

Some row index generation - vectorized reader tests are currently failing.

auto scan mode now uses native_datafusion scan, which does not support row index generation, so these tests should be skipped for auto mode just as they are for native_datafusion mode.


comphead · 2026-03-31T19:43:09Z

dev/diffs/3.4.3.diff

@@ -2377,7 +2377,7 @@ index 351c6d698fc..583d9225cca 100644

     test(s"invalid row index column type - ${conf.desc}") {
 +      // native_datafusion Parquet scan does not support row index generation.


should we point to github issue?

comphead · 2026-03-31T19:47:54Z

spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

+              .orElse {
+                // clear explain info tags from the failed nativeDataFusionScan
+                // attempt so they don't leak into the fallback path
+                scanExec.unsetTagValue(CometExplainInfo.EXTENSION_INFO)


Just wondering why we don't unset tags for the failed variants below?

if fallback reasons are added for native_datafusion then we need to remove them before trying native_iceberg_compat, otherwise the plan still falls back to Spark.

I feel we should keep the reason, so that we know why we used native_iceberg_compat instead of native_datafusion

If we keep a fallback reason then we will fall back to Spark. The goal was to use native_iceberg_compat for the cases that native_datafusion cannot support.

An alternative approach is for auto mode to just try native_datafusion and then fall back to Spark, rather than trying native_iceberg_compat before falling back.

I updated this PR to just use native_datafusion in auto mode. It is likely that I will need to fix some test assumptions as well, but will wait for CI to run first.
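
To illustrate the trade-off discussed in this thread, here is a minimal sketch of why a leftover fallback reason defeats a second native attempt. It is a toy model under stated assumptions: `ScanNode`, `tryNativeDataFusion`, `finalChoice`, and `twoStep` are hypothetical names, and the rule "any recorded reason means Spark fallback" is taken from the replies above, not from the Comet source.

```scala
import scala.collection.mutable

// Toy model of the tag-leak problem; names are hypothetical, not Comet's API.
object FallbackTagSketch {
  // Stand-in for a plan node that carries fallback reasons as a tag.
  final class ScanNode {
    val fallbackReasons: mutable.Set[String] = mutable.Set.empty
  }

  // A failed attempt records why it could not convert the scan.
  def tryNativeDataFusion(node: ScanNode, supported: Boolean): Option[String] =
    if (supported) Some("native_datafusion")
    else {
      node.fallbackReasons += "native_datafusion: unsupported scan"
      None
    }

  // In this model, any recorded reason means the node falls back to Spark.
  def finalChoice(node: ScanNode, candidate: Option[String]): String =
    candidate.filter(_ => node.fallbackReasons.isEmpty).getOrElse("spark")

  // Two-step variant from the earlier revision: optionally clear reasons
  // before trying the second native scan.
  def twoStep(node: ScanNode, clearTags: Boolean): String = {
    val attempt = tryNativeDataFusion(node, supported = false).orElse {
      if (clearTags) node.fallbackReasons.clear()
      Some("native_iceberg_compat")
    }
    finalChoice(node, attempt)
  }
}
```

Clearing the tags lets the second scan through but loses the reason; keeping them preserves the reason but forces Spark. Trying only native_datafusion, as the merged version does, avoids the conflict.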


parthchandra · 2026-04-01T17:27:05Z

spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

+              .orElse {
+                // clear explain info tags from the failed nativeDataFusionScan
+                // attempt so they don't leak into the fallback path
+                scanExec.unsetTagValue(CometExplainInfo.EXTENSION_INFO)


I feel we should keep the reason, so that we know why we used native_iceberg_compat instead of native_datafusion

parthchandra · 2026-04-01T17:45:47Z

spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala

-                  .get(conf) == CometConf.SCAN_NATIVE_DATAFUSION) {
+              val scan = CometConf.COMET_NATIVE_SCAN_IMPL.get(conf)
+              val isNativeDataFusionScan =
+                scan == CometConf.SCAN_NATIVE_DATAFUSION || scan == CometConf.SCAN_AUTO


nit: this is not strictly correct since we could still fall back to native_iceberg_compat when mode is AUTO

With the latest commits, auto mode no longer falls back to native_iceberg_compat
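
With that change, a test helper can treat auto the same as native_datafusion when deciding what scan behaviour to expect. A minimal standalone sketch, assuming the mode strings match the names used in this PR; `ScanModeSketch` and `expectsNativeDataFusion` are hypothetical and only mirror the condition in the diff above.

```scala
// Standalone version of the check in the diff above; names are illustrative.
object ScanModeSketch {
  // Mode strings assumed to match the scan mode names used in this PR.
  val ScanNativeDataFusion = "native_datafusion"
  val ScanNativeIcebergCompat = "native_iceberg_compat"
  val ScanAuto = "auto"

  // True when a test should expect native_datafusion behaviour. Since auto
  // no longer selects native_iceberg_compat, auto counts as native_datafusion.
  def expectsNativeDataFusion(scanImpl: String): Boolean =
    scanImpl == ScanNativeDataFusion || scanImpl == ScanAuto
}
```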

… shuffle tests native_datafusion scan needs CometExecRule.transform() to wrap it in CometNativeExec, so it cannot work when COMET_EXEC_ENABLED is false. Add a guard in nativeDataFusionScan() to return None in this case, matching the existing pattern for v2 Iceberg scans. Update shuffle test suites to use COMET_EXEC_ENABLED=true so they run with auto mode (native_datafusion) scans. Remove stale assume guards that checked for explicit native_datafusion config but didn't account for auto mode now using native_datafusion.

Auto mode now falls back to Spark directly instead of native_iceberg_compat. Document that native_datafusion scan requires spark.comet.exec.enabled=true.

andygrove · 2026-04-02T15:18:44Z

@parthchandra @comphead @mbutrovich Thanks for the feedback so far. I simplified this PR so that auto mode now chooses native_datafusion instead of native_iceberg_compat. This is much simpler than trying one scan then the other, which makes fallback reporting complex. This was the eventual goal anyway but I was just trying to do it in smaller steps. PTAL when you can.

comphead · 2026-04-02T15:38:27Z

spark/src/test/scala/org/apache/comet/exec/CometColumnarShuffleSuite.scala

        CometConf.COMET_COLUMNAR_SHUFFLE_ASYNC_ENABLED.key -> asyncShuffleEnable.toString,
        CometConf.COMET_COLUMNAR_SHUFFLE_SPILL_THRESHOLD.key -> numElementsForceSpillThreshold.toString,
-        CometConf.COMET_EXEC_ENABLED.key -> "false",
+        CometConf.COMET_EXEC_ENABLED.key -> "true",


is it a cleanup?

native_datafusion scan cannot work if COMET_EXEC_ENABLED is disabled

comphead · 2026-04-02T15:39:30Z

spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

+      return None
+    }
    if (!CometNativeScan.isSupported(scanExec)) {
      return None


should we also add withInfo here?

yes, good point, I will add this

actually, nm, withInfo is called from within isSupported already

comphead

Thanks @andygrove, LGTM overall

mbutrovich

Just one overall question that applies to the Spark SQL tests.

dev/diffs/3.4.3.diff

Row index tests were being skipped for native_datafusion and auto scan modes. Instead of skipping, add CometNativeScanExec to the pattern match in ParquetRowIndexSuite so the test correctly counts partitions and output rows when the v1 scan uses CometNativeScanExec. Also fix FileBasedDataSourceSuite line length violation in 3.5.8 diff.
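
The pattern-match change described in this commit message can be pictured with a small model. The case classes below are hypothetical stand-ins that only borrow the names of the real plan operators, and `scanStats` is an illustrative helper, not the code in ParquetRowIndexSuite.

```scala
// Toy model of matching on the scan node type; not the real Spark/Comet classes.
object RowIndexScanSketch {
  sealed trait PlanNode
  final case class FileSourceScanExec(numPartitions: Int, numOutputRows: Long) extends PlanNode
  final case class CometNativeScanExec(numPartitions: Int, numOutputRows: Long) extends PlanNode
  final case class ProjectExec(child: PlanNode) extends PlanNode

  // Return (partitions, output rows) from whichever v1 scan node is in the plan.
  def scanStats(plan: PlanNode): (Int, Long) = plan match {
    case FileSourceScanExec(p, r) => (p, r)
    // Extra case so plans that use the native scan are counted too.
    case CometNativeScanExec(p, r) => (p, r)
    case ProjectExec(child) => scanStats(child)
  }
}
```

Without the extra case, a plan containing only the native scan node would not match any scan pattern, which is why the tests previously had to be skipped.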

mbutrovich

Thanks @andygrove! Glad we’re making progress cleaning up the scan code.

Comet's NativeBatchReader throws RuntimeException instead of SparkException for invalid row index column types. Skip the test for SCAN_NATIVE_DATAFUSION and SCAN_AUTO modes. See apache#3886

andygrove · 2026-04-02T22:58:15Z

Merged. Thanks @comphead @mbutrovich

andygrove added 6 commits March 24, 2026 06:32

enable native_datafusion scan in auto mode

9ea95ab

scalastyle

5d6c4ff

add CometDateTimeUtilsSuite to CI workflow

44f9673

update schema evolution test

ea28ffd

add link to issue

c7a402f

Merge branch 'add-suite' into auto-native-df

03c691f

andygrove added 2 commits March 24, 2026 09:03

skip row index tests for auto scan mode

d38ffc4

auto scan mode now uses native_datafusion scan, which does not support row index generation, so these tests should be skipped for auto mode just as they are for native_datafusion mode.

Merge remote-tracking branch 'apache/main' into auto-native-df

d094726

andygrove changed the title ~~feat: enable native_datafusion scan in auto mode [WIP]~~ feat: enable native_datafusion scan in auto mode Mar 24, 2026

andygrove added 7 commits March 24, 2026 13:38

chore: Enable spark SQL tests for issues now resolved in native_dataf…

d7fd22e

…usion mode Remove IgnoreCometNativeDataFusion annotations from 3.5.8 Spark SQL test diff for issues that have been closed: apache#3311, apache#3313, apache#3314, apache#3315, apache#3320, apache#3401.

Revert "chore: Enable spark SQL tests for issues now resolved in nati…

1b35dbc

…ve_datafusion mode" This reverts commit d7fd22e.

docs: update parquet_scans.md to reflect auto mode now prefers native…

dc2506f

…_datafusion The auto scan mode now tries native_datafusion first and falls back to native_iceberg_compat if the scan cannot be converted, rather than always using native_iceberg_compat.

chore: re-enable Spark SQL tests for fixed issues in 3.5.8 diff

96622cf

Remove IgnoreCometNativeDataFusion tags for issues that have been resolved and closed: apache#3312, apache#3313, apache#3314, apache#3315.

Revert "chore: re-enable Spark SQL tests for fixed issues in 3.5.8 diff"

7e4ee86

This reverts commit 96622cf.

Merge remote-tracking branch 'apache/main' into auto-native-df

da052f8

Merge remote-tracking branch 'apache/main' into auto-native-df

03cc8ee

andygrove marked this pull request as ready for review March 30, 2026 14:37

Merge branch 'main' into auto-native-df

99deeaf

andygrove requested review from comphead, mbutrovich and parthchandra March 31, 2026 14:54

Merge branch 'main' into auto-native-df

e34531b

comphead reviewed Mar 31, 2026

fix: add issue link for row index generation skips in Spark SQL test …

b4a5aa7

…diffs Add reference to apache#3432 in ParquetRowIndexSuite assume() comments across all three Spark version diffs.

parthchandra reviewed Apr 1, 2026

andygrove added 2 commits April 2, 2026 07:37

docs: update parquet_scans.md for auto mode fallback to Spark

870274a

Auto mode now falls back to Spark directly instead of native_iceberg_compat. Document that native_datafusion scan requires spark.comet.exec.enabled=true.

comphead reviewed Apr 2, 2026

comphead approved these changes Apr 2, 2026

mbutrovich reviewed Apr 2, 2026

mbutrovich approved these changes Apr 2, 2026

fix: skip invalid row index column type test for native_datafusion scan

0adbacc

Comet's NativeBatchReader throws RuntimeException instead of SparkException for invalid row index column types. Skip the test for SCAN_NATIVE_DATAFUSION and SCAN_AUTO modes. See apache#3886

andygrove merged commit 4f5eaf0 into apache:main Apr 2, 2026
159 checks passed

andygrove deleted the auto-native-df branch April 2, 2026 22:58

vaibhawvipul pushed a commit to vaibhawvipul/datafusion-comet that referenced this pull request Apr 4, 2026

feat: enable native_datafusion scan in auto mode (apache#3781)

1caff9e

