Delta: Updating Delta to Iceberg conversion #15407
vladislav-sidorovich wants to merge 28 commits into apache:main
Conversation
anoopj left a comment
Thank you for the PR. Moving to the Delta kernel is a great improvement. Here is my initial feedback.
Resolved review threads on:
...c/integration/java/org/apache/iceberg/delta/DeltaLakeToIcebergMigrationSparkIntegration.java
delta-lake/src/main/java/org/apache/iceberg/delta/BaseSnapshotDeltaLakeKernelTableAction.java
import io.delta.kernel.exceptions.TableNotFoundException;
import io.delta.kernel.internal.DeltaHistoryManager;
import io.delta.kernel.internal.DeltaLogActionUtils;
import io.delta.kernel.internal.SnapshotImpl;
We are using internal APIs of the kernel. This is fragile - can we refactor this to use the public APIs instead? Snapshot, Table etc. Or are we doing this because we are trying to preserve the table history during the conversion? I would try to avoid this as much as possible.
No, there are no public APIs available for what we need.
Yes, I want to go through the table history step by step, so we will have exactly the same granularity in the history.
At the same time, it's fairly safe to use the internal API because it depends on the Delta protocol, which is stable.
The internal APIs can change or disappear without notice. I would think hard about avoiding dependencies on internal APIs, even if that means changing semantics (e.g. not preserving all the history by default).
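The granularity trade-off being debated can be modeled without the kernel at all. The sketch below is a hypothetical, self-contained model (none of these types are Delta Kernel classes): replaying commits one version at a time yields one target snapshot per source commit, whereas a latest-snapshot-only conversion keeps just the final state.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HistoryReplaySketch {
  // Hypothetical model of a Delta commit: files added and removed at one version.
  record Commit(long version, Set<String> adds, Set<String> removes) {}

  // Replay commits in order, producing the live file set after each version.
  // This mirrors the "step through history" approach: one snapshot per source commit.
  static List<Set<String>> replay(List<Commit> commits) {
    List<Set<String>> snapshots = new ArrayList<>();
    Set<String> live = new LinkedHashSet<>();
    for (Commit c : commits) {
      live.addAll(c.adds);
      live.removeAll(c.removes);
      snapshots.add(new LinkedHashSet<>(live));
    }
    return snapshots;
  }

  public static void main(String[] args) {
    List<Commit> log = List.of(
        new Commit(0, Set.of("a.parquet"), Set.of()),
        new Commit(1, Set.of("b.parquet"), Set.of()),
        new Commit(2, Set.of("c.parquet"), Set.of("a.parquet")));

    List<Set<String>> history = replay(log);
    // Replaying preserves granularity: three commits -> three snapshots.
    System.out.println(history.size());  // 3
    System.out.println(history.get(2));  // [b.parquet, c.parquet]
    // A latest-snapshot-only conversion would keep only history.get(2).
  }
}
```

The reviewer's point maps onto this model: dropping the replay loop removes the need for internal history APIs, at the cost of the intermediate snapshots.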
while (rows.hasNext()) {
  Row row = rows.next();
  if (DeltaLakeActionsTranslationUtil.isAdd(row)) {
    AddFile addFile = DeltaLakeActionsTranslationUtil.toAdd(row);
Can we avoid the use of the internal AddFile class and read fields directly from the Row using ordinals defined by the scan file schema?
Yes, I will refactor this part once all the conversion features are in place.
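A minimal sketch of the ordinal-based access the reviewer suggests. The `Schema` and `Row` types below are hypothetical stand-ins, not Delta Kernel's real classes; they only mirror the pattern: resolve each column's ordinal once from the scan file schema, then read fields by index instead of materializing an internal `AddFile`.

```java
import java.util.List;

public class OrdinalAccessSketch {
  // Hypothetical stand-in for a struct schema: an ordered list of field names.
  record Schema(List<String> fields) {
    int indexOf(String name) {
      return fields.indexOf(name);
    }
  }

  // Hypothetical stand-in for a kernel Row: values addressed by ordinal.
  record Row(Schema schema, List<Object> values) {
    boolean isNullAt(int ordinal) { return values.get(ordinal) == null; }
    String getString(int ordinal) { return (String) values.get(ordinal); }
    long getLong(int ordinal)     { return (Long) values.get(ordinal); }
  }

  public static void main(String[] args) {
    // Example scan-file schema: path and size columns (names are illustrative).
    Schema scanSchema = new Schema(List.of("path", "size"));

    // Resolve ordinals once, outside the per-row loop.
    int pathOrdinal = scanSchema.indexOf("path");
    int sizeOrdinal = scanSchema.indexOf("size");

    Row row = new Row(scanSchema, List.of("part-0000.parquet", 1024L));

    // Read fields directly by ordinal -- no internal AddFile object needed.
    String path = row.getString(pathOrdinal);
    long size = row.getLong(sizeOrdinal);

    System.out.println(path + " " + size);  // part-0000.parquet 1024
  }
}
```

The design benefit is that only the scan file schema (a public contract) is depended on, not the internal action classes.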
public SnapshotDeltaLakeTable deltaLakeConfiguration(Configuration conf) {
  deltaEngine = DefaultEngine.create(conf);
  deltaLakeFileIO = new HadoopFileIO(conf);
  deltaTable = (TableImpl) Table.forPath(deltaEngine, deltaTableLocation);
The cast is necessary because I use the internal API below: the getChanges API is available only on TableImpl, not on the Table interface.
@nastra since you kindly reviewed the earlier version, I'd love to get your thoughts on the updated core logic before I do the final refactoring to remove internal Delta classes.

@aokolnychyi since you've contributed so much to the Deletion Vectors implementation in Iceberg, could you take a quick look at the DV conversion logic in my PR to make sure I've wired everything up correctly?
The current PR contains an initial version of the code to update the existing functionality (https://iceberg.apache.org/docs/1.4.3/delta-lake-migration/) to the recent Delta Lake protocol (reader version 3, writer version 7). The motivation for the PR is to get early feedback from the community.
Note: the PR doesn't remove the old logic but adds a new interface implementation, so it will be easier to compare/review. Also, based on the usage scenario of the module, this approach will not introduce any issues.
The PR scope:
- Add action
- Remove action

Future steps:
Tests:
Unit tests: cover all supported data types, including complex arrays and structures.
Integration tests: cover an insert-only scenario with Spark 3.5. The tests must be updated for a newer Delta Lake version once the previous solution is deleted from the code.
In the following PRs, I will add all the tables from: Delta golden tables