feat(iceberg): stream Iceberg writes with constant memory #3753
ajinzrathod wants to merge 4 commits into dlt-hub:devel from
Conversation
this is really cool! and I think it solves a lot of practical problems before pyiceberg is fixed. two things before we can merge this:
- some of the methods do garbage collection. could you parametrize how often it is done (which will also allow disabling it)?
- this PR does not include tests. if you are using claude code (or probably any other decent agent) you should be able to ask it to add tests in the correct place and with compatible parametrization
- we do not need to test on buckets. here IMO tests on the local filesystem are sufficient
i.e. the `test_open_table_pipeline` module is a good start, and

```python
@pytest.mark.parametrize(
    "destination_config",
    destinations_configs(
        table_format_local_configs=True,
        with_table_format="iceberg",
    ),
    ids=lambda x: x.name,
)
def test_iceberg_table_properties(
    destination_config: DestinationTestConfiguration,
) -> None:
```

is a good example of parametrization.
Updated with both changes:

- **GC parametrization**: added an `iceberg_gc_collect_interval` config option (default 0 = disabled). Any positive integer triggers `gc.collect()` every N batches. Threaded through from `FilesystemDestinationClientConfiguration` -> `IcebergLoadFilesystemJob` -> streaming functions. Progress logging is decoupled from GC and always runs every 10 batches.
- **Tests**: added 3 tests in `test_open_table_pipeline.py` using `table_format_local_configs=True` with `with_table_format="iceberg"`.

All passing locally on filesystem. Let me know if this looks good or if any other changes are needed.
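A minimal sketch of how such an interval can gate `gc.collect()` inside the batch loop — `stream_batches` and its parameter names are illustrative, not the PR's actual code:

```python
import gc
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def stream_batches(batches: Iterable[T], gc_collect_interval: int = 0) -> Iterator[T]:
    """Yield batches, running gc.collect() every `gc_collect_interval` batches.

    A value of 0 (the default) disables collection entirely, matching the
    config default described above. Progress logging (not shown) would stay
    decoupled from GC and run on its own fixed interval.
    """
    for i, batch in enumerate(batches, start=1):
        yield batch
        if gc_collect_interval > 0 and i % gc_collect_interval == 0:
            gc.collect()  # reclaim buffers released by earlier batches
```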
- Add `store_decimal_as_integer=True` to manual `pq.write_table()` calls to fix the FIXED_LEN_BYTE_ARRAY vs INT64 mismatch
- Use `txn.append()` instead of `add_files` for partitioned tables, since batches can span multiple partition values
- Separate update and insert transactions for partitioned upserts to work around a pyiceberg 0.9.x SIGILL crash when mixing overwrite + append in one transaction
- Extract `_upload_parquet_to_remote` and `_process_upsert_batch` helpers
I was looking at the failed test cases on CI.

**1. Decimal encoding mismatch**

PyArrow defaults to encoding decimals as `FIXED_LEN_BYTE_ARRAY`, which mismatches the expected `INT64`. Fix: included `store_decimal_as_integer=True` in the manual `pq.write_table()` calls.

**2. Partitioned tables failing on streamed writes**

While fixing the above, I noticed the streaming path used `add_files`, which does not work when batches span multiple partition values. Fix: split into two strategies — `txn.append()` for partitioned tables, `add_files` for unpartitioned ones.

**3. SIGILL crash during upsert on partitioned tables**

After fixing the decimal encoding, I re-ran the tests locally. The tests that were failing earlier now got past the load step, but the upsert test crashed with SIGILL when mixing overwrite + append in one pyiceberg 0.9.x transaction. Fix: separated updates and inserts into separate transactions.

**Cleanup**: extracted the `_upload_parquet_to_remote` and `_process_upsert_batch` helpers.
@ajinzrathod I reviewed the code again. the fundamental change — to stream data via a batch reader — makes total sense. but it seems to me that if we bump pyiceberg to 0.10, we can simplify this significantly.
**Review & Suggested Simplification**

Thanks for the PR — it solves a real OOM problem and the streaming approach is the right direction. After investigation, we believe the implementation can be significantly simplified by bumping the pyiceberg minimum to 0.10.

**Context**

With pyiceberg 0.10, pyiceberg's own writer handles parquet creation, partitioning, and file registration internally, so most of the manual streaming machinery in this PR becomes unnecessary.

**Key findings**

- SIGILL crash doesn't reproduce: we tested the single-transaction approach (mixing overwrite + append in one transaction) on 0.10 and could not reproduce the 0.9.x crash.
- Memory is constant with pyiceberg's internal writer.

**Proposed changes**

1. Bump the pyiceberg minimum to 0.10. The following can then be removed:
| Function/Config | Reason |
|---|---|
| `_upload_parquet_to_remote()` | pyiceberg writes files internally |
| `_write_iceberg_table_streamed()` | Replaced by `txn.append()` loop |
| `_write_streamed_partitioned()` | pyiceberg handles partitions |
| `_write_streamed_add_files()` | pyiceberg handles file registration |
| `_process_upsert_batch()` | Replaced by `txn.upsert()` |
| `_upsert_iceberg_table()` | Replaced by `txn.upsert()` loop |
| `_UPLOAD_CHUNK_SIZE` | No manual uploads |
| `_iter_parquet_batches()` | Scanner does this |
| `iceberg_gc_collect_interval` | Not needed with pyiceberg's writer |
| `recommended_file_size = 128MB` | Unrelated global change |
| `store_decimal_as_integer` patch | Native in 0.10+ |
**1. Scanner OOMs on large datasets**

The scanner works fine on small data, which is why our earlier tests passed; the OOM only shows up at scale. Fix: streaming the read batch-by-batch instead of scanning everything at once.
**2. Schema evolution breaks**
**3. Trade-off in the replace operation: `txn.append()` files are deletable but O(N²); `fast_append()` is O(N) but its files can't be deleted**

For a streaming replace we need to delete existing data first, then append the new batches. We tested both append paths. Problem: files written by `fast_append()` cannot be deleted afterwards, which we verified.
Summary:

FWIW, it would be valuable to validate append, replace, and merge operations on datasets larger than 10 GB (with container RAM smaller than the data size) to confirm the streaming approach at scale. I tested the proposed approach on Apple Silicon using a 4 GB Docker container with datasets ranging from 2 GB to 55 GB.

I'm not pushing code to this repo yet; I will after we have proper sign-off on the approach. My current working code is on my personal fork in the `feat/3752-v2-iceberg-streaming-atomic-commit` branch, for reference against the findings above.
Sounds good, happy to look at the implementation plan. Given the findings above (especially the O(N²) `txn.append()` performance, the SIGILL, and the `fast_append()` delete issue), interested to hear your thoughts. Open to collaborating on this or letting you take it forward — whatever works best.


Description
Closes #3752
Previously, Iceberg loads materialised the entire Arrow dataset in memory (`self.arrow_dataset.to_table()`) and upserts created one snapshot per batch. This caused OOM risk on large datasets and snapshot bloat that slowed metadata operations.

This PR streams all Iceberg write paths batch-by-batch with constant memory and produces minimal Iceberg snapshots — bringing Iceberg to parity with the Delta Lake code path, which already uses `RecordBatchReader`.

Changes
`dlt/common/libs/pyiceberg.py`

- `write_iceberg_table` now accepts `Union[pa.Table, pa.RecordBatchReader]`. When given a `RecordBatchReader`, it dispatches to the new `_write_iceberg_table_streamed` function.
- `_write_iceberg_table_streamed` (new): streams Arrow batches one at a time — each batch is written to a temp parquet file, uploaded to remote storage via Iceberg's IO, then discarded. All files are registered atomically with `table.add_files()` (append) or `delete` + `add_files` in a single transaction (replace). Memory stays constant: only one batch + one parquet file in memory at a time.
- `merge_iceberg_table` now accepts `Union[pa.Table, pa.RecordBatchReader]` and delegates to the new `_upsert_iceberg_table` function.
- `_upsert_iceberg_table` (new): streams batches within a single `table.transaction()`. Updates go via `txn.overwrite()` only when rows actually changed (and only for the `upsert` strategy, not `insert-only`). Inserts are collected as remote parquet files and registered via `txn.add_files()` at the end of the transaction. For pure-insert loads (e.g. the first load into an empty table) this produces exactly one Iceberg snapshot.

`dlt/destinations/impl/filesystem/filesystem.py`

- `IcebergLoadFilesystemJob.run()`: reads the schema cheaply from the first parquet file header (`pq.read_schema`) instead of materialising the full dataset. The merge path streams via `scanner(batch_readahead=0, fragment_readahead=0, use_threads=False).to_reader()`; the append/replace path uses the `_iter_parquet_batches` generator wrapped in `RecordBatchReader.from_batches()`. Never calls `.to_table()` on the full dataset. Explicit `gc.collect()` after loads and periodically during batching.
- `_iter_parquet_batches` (new static method): yields Arrow batches from parquet files one at a time for constant memory.

`dlt/destinations/impl/filesystem/factory.py`

- `recommended_file_size = 128MB` for the filesystem destination, so upstream produces reasonably-sized parquet files for streaming.

Key invariants
- Never calls `.to_table()` on the full dataset — data streams batch-by-batch
- Only one `RecordBatch` + one temp parquet file in memory at a time
- Preserves `insert-only` strategy support (PR "(Feat) insert only merge strategy" #3741)
- Uses `time.monotonic()` throughout (aligned with PR "(fix) implements monotonic wall clock" #3695)
- Keeps the `properties` parameter in `create_table` calls (PR "(feat) adds iceberg table properties" #3699)
- Existing `pa.Table` inputs still work as before

Related upstream issues
This addresses a known limitation in pyiceberg itself:

- OOM in `table.append()` due to full materialization
- Missing `RecordBatchReader` write support

A pyiceberg maintainer (@kevinjqliu) pointed to iceberg-go's approach: materialize arrow streams as individual parquet files, then register them via `add_files` — which is exactly the approach this PR takes.

Testing
Tested locally with Docker (MinIO + Iceberg REST catalog, 4GB memory limit) across varying dataset sizes (up to 100 parquet files, 1M+ rows × 400 columns). Behavior is consistent at all scales.
A ready-to-use test harness with Docker Compose (MinIO + Iceberg REST catalog), data generators, and test scripts is available on the `feat/iceberg-streaming-testing` branch of github.com/ajinzrathod/dlt (my fork) if you'd like to reproduce locally.