Description
Currently, the pyspark.sql.DataFrame.mergeInto API (introduced in Spark 4.0: SPARK-48714, #47086 by @xupefei) only accepts a table name that must be registered in the Spark Catalog.
While Delta Lake's standalone Python API allows DeltaTable.forPath(), the native PySpark mergeInto method lacks a direct way to target a Delta table (or any supported provider) via a URI/path without first registering it as a table in the catalog.
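As a concrete illustration of the workaround this forces today, one can fall back to SQL's path-addressed table form; the sketch below only builds the statement string (the view name `source_view` and the `id` join column are hypothetical, and actually running it requires a Delta-enabled Spark session):

```python
# Hypothetical sketch of the current SQL workaround: DataFrame.mergeInto only
# accepts a catalog table name, so a path-addressed Delta table must be merged
# into via SQL's delta.`path` identifier form instead.
path = "abfss://container@account.dfs.core.windows.net/layer/table_name"

merge_sql = f"""
MERGE INTO delta.`{path}` AS target
USING source_view AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# With a live session (not runnable here):
# df.createOrReplaceTempView("source_view")
# spark.sql(merge_sql)
```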
Motivation
In many modern Data Lake architectures, data engineers often interact with tables directly via their storage paths to:
- Avoid catalog overhead for transient or landing data.
- Operate in environments where a shared Hive Metastore or Unity Catalog might not be the primary source of truth for every directory.
- Simplify CI/CD pipelines where physical paths are parameterized.
Adding support for paths would bring mergeInto in line with other PySpark APIs like spark.read.load(path) or df.write.save(path).
Proposed Change
Modify pyspark.sql.DataFrame.mergeInto(table, condition) to either:
- Automatically detect paths: if the string starts with a protocol prefix (e.g., abfss://, s3://) or a forward slash, treat it as a path.
- Add an optional parameter: a boolean flag or a dedicated method to distinguish between a catalog table and a path.
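The detection option could be sketched as a small heuristic; the function below is purely illustrative (its name and exact rules are assumptions, not Spark code):

```python
from urllib.parse import urlparse

def looks_like_path(identifier: str) -> bool:
    """Hypothetical heuristic for Option A: classify an identifier as a
    storage path (URI scheme or leading slash) rather than a catalog
    table name."""
    if identifier.startswith("/"):
        return True
    # Require a real "scheme://" prefix so dotted catalog names such as
    # "prod.db.target_table" are never mistaken for URIs.
    return bool(urlparse(identifier).scheme) and "://" in identifier
```

A real implementation would likely reuse Spark's existing path/identifier resolution rather than a standalone check like this.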
Proposed Syntax Example:

```python
from pyspark.sql.functions import expr

# Illustrative merge condition on the join keys
condition = expr("target.id = source.id")

# Current limitation: requires catalog registration
# df.mergeInto("prod.db.target_table", condition).whenMatched().updateAll().merge()

# Proposed: reference via path directly
path = "abfss://container@account.dfs.core.windows.net/layer/table_name"

# Option A: string detection (similar to SQL: MERGE INTO delta.`path`)
df.mergeInto(f"delta.`{path}`", condition).whenMatched().updateAll().merge()

# Option B: explicit parameter
df.mergeInto(path=path, format="delta", condition=condition).whenMatched().updateAll().merge()
```
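For Option A, the qualified identifier would need careful quoting; this helper is a hypothetical sketch (not part of any Spark or Delta API) showing the escaping a user or the implementation would have to apply:

```python
def path_identifier(path: str, fmt: str = "delta") -> str:
    """Hypothetical helper: build the SQL-style qualified identifier used in
    Option A (e.g. delta.`<path>`), doubling any backticks inside the path
    so the identifier stays well-formed."""
    escaped = path.replace("`", "``")
    return f"{fmt}.`{escaped}`"
```

Making Spark handle this form natively in mergeInto would spare users from hand-rolling such quoting.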