feat: enable native_datafusion scan in auto mode #3781

Merged
andygrove merged 22 commits into apache:main from andygrove:auto-native-df
Apr 2, 2026

@andygrove
Member

@andygrove andygrove commented Mar 24, 2026

Which issue does this PR close?

Part of #3321

Rationale for this change

Improve performance with default config. The native_datafusion scan does not have the FFI roundtrip overhead that native_iceberg_compat has.

What changes are included in this PR?

  • Update "auto" mode to try native_datafusion first, falling back to Spark when not supported (no longer falls back to native_iceberg_compat)
  • Add COMET_EXEC_ENABLED guard to nativeDataFusionScan() since the scan node requires CometExecRule.transform() to wrap it in CometNativeExec
  • Update shuffle test suites to use COMET_EXEC_ENABLED=true so they run with auto mode (native_datafusion) scans
  • Remove stale assume(COMET_NATIVE_SCAN_IMPL != SCAN_NATIVE_DATAFUSION) guards from shuffle tests
  • Review ignored tests across all Spark diffs and make sure they are up-to-date
  • Ensure documentation is up-to-date
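
The first two bullets above can be pictured with a small self-contained sketch. It is not Comet's code: `ScanChoice`, `nativeDataFusionScan` as written here, and `chooseScanInAutoMode` are hypothetical stand-ins that model only what the description states, namely that auto mode tries native_datafusion, that the attempt is skipped when Comet exec is disabled, and that anything else falls back to Spark.

```scala
// Hypothetical model of auto-mode scan selection; none of these names are Comet APIs.
object ScanSelectionSketch {
  sealed trait ScanChoice
  case object NativeDataFusion extends ScanChoice
  case object SparkFallback extends ScanChoice

  // Returns None when native_datafusion cannot be used for this scan.
  // execEnabled stands in for COMET_EXEC_ENABLED; supported stands in for
  // whatever per-scan support checks the real implementation performs.
  def nativeDataFusionScan(execEnabled: Boolean, supported: Boolean): Option[ScanChoice] =
    if (!execEnabled) None      // guard: the scan node must be wrapped by the exec rule
    else if (!supported) None   // the scan itself is not supported natively
    else Some(NativeDataFusion)

  // Auto mode: a single attempt, then straight back to Spark.
  def chooseScanInAutoMode(execEnabled: Boolean, supported: Boolean): ScanChoice =
    nativeDataFusionScan(execEnabled, supported).getOrElse(SparkFallback)
}
```

Under these assumptions there is exactly one native attempt per scan, so the reported fallback reason always refers to native_datafusion.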

How are these changes tested?

Existing tests. Shuffle test suites (CometShuffleSuite, CometAsyncShuffleSuite, CometNativeShuffleSuite) updated and verified to pass with auto mode native_datafusion scans.

@andygrove
Member Author

Some row index generation - vectorized reader tests are currently failing.

auto scan mode now uses native_datafusion scan, which does not support
row index generation, so these tests should be skipped for auto mode
just as they are for native_datafusion mode.
@andygrove andygrove changed the title from "feat: enable native_datafusion scan in auto mode" to "[WIP] feat: enable native_datafusion scan in auto mode" Mar 24, 2026
…usion mode

Remove IgnoreCometNativeDataFusion annotations from 3.5.8 Spark SQL test
diff for issues that have been closed: apache#3311, apache#3313, apache#3314, apache#3315, apache#3320,
apache#3401.
…_datafusion

The auto scan mode now tries native_datafusion first and falls back to
native_iceberg_compat if the scan cannot be converted, rather than always
using native_iceberg_compat.
Remove IgnoreCometNativeDataFusion tags for issues that have been
resolved and closed: apache#3312, apache#3313, apache#3314, apache#3315.
@andygrove andygrove marked this pull request as ready for review March 30, 2026 14:37
@@ -2377,7 +2377,7 @@ index 351c6d698fc..583d9225cca 100644

test(s"invalid row index column type - ${conf.desc}") {
+ // native_datafusion Parquet scan does not support row index generation.
Contributor

should we point to github issue?

Member Author

Added

  .orElse {
    // clear explain info tags from the failed nativeDataFusionScan
    // attempt so they don't leak into the fallback path
    scanExec.unsetTagValue(CometExplainInfo.EXTENSION_INFO)
Contributor

Just wondering why we don't also unset the tags when the variants below fail?

Member Author

if fallback reasons are added for native_datafusion then we need to remove them before trying native_iceberg_compat, otherwise the plan still falls back to Spark.

Contributor

I feel we should keep the reason, so that we know why we used native_iceberg_compat instead of native_datafusion

Member Author

If we keep a fallback reason then we will fall back to Spark. The goal was to use native_iceberg_compat for the cases that native_datafusion cannot support.
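
To make the constraint in this thread concrete, here is a toy model. All names are hypothetical and this is not Comet's implementation; it only assumes what the comments above say, that a node carrying any recorded fallback reason ends up on Spark. Under that assumption, reasons recorded by a failed first attempt must be cleared before a second implementation is tried.

```scala
import scala.collection.mutable

// Toy model of fallback reasons attached to a plan node (hypothetical names).
object FallbackTagSketch {
  final class Node {
    val reasons: mutable.Set[String] = mutable.Set.empty
  }

  // The first attempt fails and records why; the second attempt succeeds.
  def tryFirst(n: Node): Option[String] = {
    n.reasons += "first impl: unsupported"
    None
  }
  def trySecond(n: Node): Option[String] = Some("second impl")

  // Assumption of this sketch: any recorded reason forces the Spark operator.
  def resolve(n: Node, chosen: Option[String]): String =
    if (n.reasons.nonEmpty) "spark" else chosen.getOrElse("spark")

  def plan(clearBetweenAttempts: Boolean): String = {
    val n = new Node
    val chosen = tryFirst(n).orElse {
      if (clearBetweenAttempts) n.reasons.clear()
      trySecond(n)
    }
    resolve(n, chosen)
  }
}
```

In this model `plan(clearBetweenAttempts = false)` stays on Spark even though the second attempt succeeded, which is the conflict between keeping the reason for reporting and actually using the second scan.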

Member Author

An alternative approach is for auto mode to just try native_datafusion and then fall back to Spark, rather than trying native_iceberg_compat before falling back.

Member Author

I updated this PR to just use native_datafusion in auto mode. It is likely that I will need to fix some test assumptions as well, but will wait for CI to run first.

…diffs

Add reference to apache#3432
in ParquetRowIndexSuite assume() comments across all three Spark version
diffs.

-        .get(conf) == CometConf.SCAN_NATIVE_DATAFUSION) {
+    val scan = CometConf.COMET_NATIVE_SCAN_IMPL.get(conf)
+    val isNativeDataFusionScan =
+      scan == CometConf.SCAN_NATIVE_DATAFUSION || scan == CometConf.SCAN_AUTO
Contributor

nit: this is not strictly correct, since we could still fall back to native_iceberg_compat when the mode is AUTO

Member Author

With the latest commits, auto mode no longer falls back to native_iceberg_compat

… shuffle tests

native_datafusion scan needs CometExecRule.transform() to wrap it in
CometNativeExec, so it cannot work when COMET_EXEC_ENABLED is false.
Add a guard in nativeDataFusionScan() to return None in this case,
matching the existing pattern for v2 Iceberg scans.

Update shuffle test suites to use COMET_EXEC_ENABLED=true so they run
with auto mode (native_datafusion) scans. Remove stale assume guards
that checked for explicit native_datafusion config but didn't account
for auto mode now using native_datafusion.
Auto mode now falls back to Spark directly instead of
native_iceberg_compat. Document that native_datafusion scan requires
spark.comet.exec.enabled=true.
@andygrove
Member Author

@parthchandra @comphead @mbutrovich Thanks for the feedback so far. I simplified this PR so that auto mode now chooses native_datafusion instead of native_iceberg_compat. This is much simpler than trying one scan then the other, which makes fallback reporting complex. This was the eventual goal anyway but I was just trying to do it in smaller steps. PTAL when you can.

  CometConf.COMET_COLUMNAR_SHUFFLE_ASYNC_ENABLED.key -> asyncShuffleEnable.toString,
  CometConf.COMET_COLUMNAR_SHUFFLE_SPILL_THRESHOLD.key -> numElementsForceSpillThreshold.toString,
- CometConf.COMET_EXEC_ENABLED.key -> "false",
+ CometConf.COMET_EXEC_ENABLED.key -> "true",
Contributor

is it a cleanup?

Member Author

native_datafusion scan cannot work if COMET_EXEC_ENABLED is disabled

      return None
    }
    if (!CometNativeScan.isSupported(scanExec)) {
      return None
Contributor

should we also add withInfo here?

Member Author

yes, good point, I will add this

Member Author

actually, nm, withInfo is called from within isSupported already

Contributor

@comphead comphead left a comment

Thanks @andygrove, LGTM overall.

Contributor

@mbutrovich mbutrovich left a comment

Just one overall question that applies to the Spark SQL tests.

Row index tests were being skipped for native_datafusion and auto scan
modes. Instead of skipping, add CometNativeScanExec to the pattern
match in ParquetRowIndexSuite so the test correctly counts partitions
and output rows when the v1 scan uses CometNativeScanExec.

Also fix FileBasedDataSourceSuite line length violation in 3.5.8 diff.
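
The pattern-match change described in this commit can be illustrated with stand-in types. `CometNativeScanExec` is the operator named above; the case classes below and their `numPartitions`/`numOutputRows` fields are placeholders invented for this sketch, since the real suite inspects Spark plan nodes and their metrics.

```scala
// Stand-in plan nodes; the real classes live in Spark and Comet.
object RowIndexCountSketch {
  sealed trait Plan
  final case class FileSourceScanExec(numPartitions: Int, numOutputRows: Long) extends Plan
  final case class CometNativeScanExec(numPartitions: Int, numOutputRows: Long) extends Plan
  final case class Other(name: String) extends Plan

  // Collect (partitions, rows) from every scan node the test recognises.
  def scanStats(nodes: Seq[Plan]): Seq[(Int, Long)] = nodes.collect {
    case FileSourceScanExec(p, r) => (p, r)
    // Without this case the native scan is silently skipped and nothing is counted.
    case CometNativeScanExec(p, r) => (p, r)
  }
}
```

Adding a case rather than skipping the test keeps the assertion coverage for the native scan path.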
Contributor

@mbutrovich mbutrovich left a comment

Thanks @andygrove! Glad we’re making progress cleaning up the scan code.

Comet's NativeBatchReader throws RuntimeException instead of
SparkException for invalid row index column types. Skip the test
for SCAN_NATIVE_DATAFUSION and SCAN_AUTO modes.

See apache#3886
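
The mismatch can be reproduced in miniature. `FakeSparkException` is a stand-in defined only for this sketch (the real `SparkException` lives in Spark), and `throwsA` is a hypothetical helper that imitates a typed exception assertion: it succeeds only when the thrown exception has the expected type. That the skipped test expects a `SparkException` is inferred from the commit message above.

```scala
import scala.reflect.ClassTag
import scala.util.{Failure, Success, Try}

object ExceptionTypeSketch {
  // Stand-in for Spark's exception type; not the real class.
  class FakeSparkException(msg: String) extends Exception(msg)

  // True only if `body` throws an exception that is an instance of T.
  def throwsA[T <: Throwable](body: => Any)(implicit ct: ClassTag[T]): Boolean =
    Try(body) match {
      case Failure(e) => ct.runtimeClass.isInstance(e)
      case Success(_) => false
    }

  // Two readers that reject the same input with different exception types.
  def sparkReader(): Unit =
    throw new FakeSparkException("invalid row index column type")
  def nativeReader(): Unit =
    throw new RuntimeException("invalid row index column type")
}
```

A test written against the first reader's exception type fails against the second one even though both reject the input, which is why the test is skipped for these scan modes rather than rewritten.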
@andygrove andygrove merged commit 4f5eaf0 into apache:main Apr 2, 2026
159 checks passed
@andygrove andygrove deleted the auto-native-df branch April 2, 2026 22:58
@andygrove
Member Author

Merged. Thanks @comphead @mbutrovich

vaibhawvipul pushed a commit to vaibhawvipul/datafusion-comet that referenced this pull request Apr 4, 2026