
Delta: Updating Delta to Iceberg conversion#15407

Open
vladislav-sidorovich wants to merge 28 commits into apache:main from vladislav-sidorovich:delta-conversion

Conversation

@vladislav-sidorovich
Contributor

@vladislav-sidorovich vladislav-sidorovich commented Feb 22, 2026

This PR contains an initial version of the code that updates the existing functionality (https://iceberg.apache.org/docs/1.4.3/delta-lake-migration/) to the recent Delta Lake protocol version (reader: 3, writer: 7). The motivation of the PR is to receive the earliest possible feedback from the community.

Note: The PR doesn't remove the old logic but adds a new interface implementation, so it will be easier to compare/review. Also, based on the usage scenario of the module, this approach will not introduce any issues.

The PR scope:

  1. Supports the existing interface
  2. Uses the Delta Lake Kernel library instead of the deprecated Delta Standalone library
  3. Contains the basic flow
  4. Converts all data types
  5. Converts the table schema and partition spec
  6. Supports the INSERT operation (Delta Lake Add action)
  7. Supports UPDATEs and DELETEs (Delta Lake Remove action)
  8. Supports the Delta VACUUM scenario
  9. Supports DVs (deletion vectors)

Future steps:

  1. Support all Delta Lake actions
  2. Support all Delta Lake features (column mapping, generated columns, and so on)
  3. Handle edge cases for partitions and generated columns
  4. Handle schema evolution
  5. Incremental conversion (from/to a specific Delta version)

Tests:
Unit tests: cover all supported data types, including complex arrays and structures.
Integration tests: cover an inserts-only scenario with Spark 3.5. The tests must be updated for a newer Delta Lake version once the previous solution is deleted from the code.

In the following PRs, I will add all the tables from: Delta golden tables

Contributor

@anoopj left a comment


Thank you for the PR. Moving to the Delta kernel is a great improvement. Here is my initial feedback.

import io.delta.kernel.exceptions.TableNotFoundException;
import io.delta.kernel.internal.DeltaHistoryManager;
import io.delta.kernel.internal.DeltaLogActionUtils;
import io.delta.kernel.internal.SnapshotImpl;
Contributor

We are using internal APIs of the kernel. This is fragile - can we refactor this to use the public APIs instead? Snapshot, Table etc. Or are we doing this because we are trying to preserve the table history during the conversion? I would try to avoid this as much as possible.
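For reference, a minimal sketch of the public-API path being suggested here. This is an illustration only: the class names (`Table`, `Snapshot`, `Engine`) are from `io.delta.kernel`, and the exact method signatures vary between Kernel releases, so verify against the version on the classpath.

```java
// Sketch: obtaining the latest snapshot and its schema through the public
// Kernel API (Table, Snapshot) instead of internal classes. Assumes Delta
// Kernel 3.x-style signatures; later releases may differ.
import io.delta.kernel.Snapshot;
import io.delta.kernel.Table;
import io.delta.kernel.engine.Engine;
import io.delta.kernel.types.StructType;

public final class PublicApiSketch {
  static StructType latestSchema(Engine engine, String tableLocation) {
    Table table = Table.forPath(engine, tableLocation);  // public entry point
    Snapshot snapshot = table.getLatestSnapshot(engine); // latest committed state
    return snapshot.getSchema(engine);                   // logical table schema
  }
}
```

The trade-off discussed in this thread is that the public API exposes snapshots, not the per-commit change stream, which is why the PR currently reaches for internal classes.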

Contributor Author

No, there is no public API available for the purposes we need.

Yes, I want to go through the table history step by step, so we will have exactly the same granularity in the history.

At the same time, it's quite safe to use an internal API because it depends on the Delta protocol, which is stable.

Contributor

The internal APIs can change or disappear without any notice. I would think hard about avoiding dependencies on internal APIs, even if it means changing semantics (e.g., not preserving all the history by default).

while (rows.hasNext()) {
Row row = rows.next();
if (DeltaLakeActionsTranslationUtil.isAdd(row)) {
AddFile addFile = DeltaLakeActionsTranslationUtil.toAdd(row);
Contributor

Can we avoid the use of the internal AddFile class and read fields directly from the Row using ordinals defined by the scan file schema?
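One possible shape for that refactor, as a sketch. The field names ("add", "path") follow the Delta scan file schema, `Row` and `StructType` are from `io.delta.kernel`, and ordinals are resolved from the schema rather than hard-coded; check the exact schema against the Kernel version in use.

```java
// Sketch: reading the add file path from a scan file Row by ordinal,
// without the internal AddFile class. Assumes the scan file schema has
// a top-level "add" struct containing a "path" field.
import io.delta.kernel.data.Row;
import io.delta.kernel.types.StructType;

public final class ScanFileReader {
  static String addFilePath(Row scanFileRow) {
    StructType scanSchema = scanFileRow.getSchema();
    int addOrdinal = scanSchema.indexOf("add");                 // locate the "add" struct
    Row addStruct = scanFileRow.getStruct(addOrdinal);
    StructType addSchema =
        (StructType) scanSchema.at(addOrdinal).getDataType();   // schema of the nested struct
    return addStruct.getString(addSchema.indexOf("path"));      // read "path" by ordinal
  }
}
```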

Contributor Author

Yes, I will refactor this part after all the conversion features are in place.

public SnapshotDeltaLakeTable deltaLakeConfiguration(Configuration conf) {
deltaEngine = DefaultEngine.create(conf);
deltaLakeFileIO = new HadoopFileIO(conf);
deltaTable = (TableImpl) Table.forPath(deltaEngine, deltaTableLocation);
Contributor

Unnecessary cast?

Contributor Author

It's necessary because I use an internal API below. The getChanges API is available only in TableImpl, not in the Table interface.
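To illustrate for other reviewers, this is roughly the shape of the call that forces the cast. It is a sketch only: `getChanges` and `DeltaAction` live under `io.delta.kernel.internal`, so this is exactly the internal-API dependency discussed above, and the signature may change without notice between Kernel releases.

```java
// Sketch: the cast is needed because getChanges(...) is defined on the
// internal TableImpl, not on the public Table interface.
import io.delta.kernel.Table;
import io.delta.kernel.data.ColumnarBatch;
import io.delta.kernel.engine.Engine;
import io.delta.kernel.internal.DeltaLogActionUtils.DeltaAction;
import io.delta.kernel.internal.TableImpl;
import io.delta.kernel.utils.CloseableIterator;
import java.util.Set;

public final class ChangesSketch {
  static CloseableIterator<ColumnarBatch> changes(
      Engine engine, String location, long startVersion, long endVersion) {
    // Cast to the internal implementation to reach getChanges(...).
    TableImpl table = (TableImpl) Table.forPath(engine, location);
    return table.getChanges(
        engine, startVersion, endVersion, Set.of(DeltaAction.ADD, DeltaAction.REMOVE));
  }
}
```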

@vladislav-sidorovich changed the title from "Delta: Updating Delta to Iceberg conversion - Inserts only" to "Delta: Updating Delta to Iceberg conversion" on Mar 1, 2026
@github-actions github-actions bot added the data label Mar 22, 2026
@github-actions github-actions bot added the INFRA label Mar 22, 2026
@vladislav-sidorovich
Contributor Author

@nastra since you kindly reviewed the earlier version, I'd love to get your thoughts on the updated core logic before I do the final refactoring to remove internal Delta classes.

@aokolnychyi Since you’ve contributed so much to the Deletion Vectors implementation in Iceberg, I wanted to reach out. Could you take a quick look at the DV conversion logic in my PR to make sure I’ve wired everything up correctly?
