Conversation

@Ronxvier
Contributor

Adds batching support to fetch_df_by_partition via two new optional parameters, use_batching and batch_size. When batching is enabled, files are processed in smaller batches instead of being loaded into memory all at once, which should reduce peak memory usage for partitions with extremely large file counts. The batched path reuses the existing fetch_dfs_by_paths_batching function. Backward compatibility is maintained: existing code works unchanged.
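
For concreteness, here is a minimal sketch of how the two parameters could thread through the function, assuming a pandas-based reader. The bodies of fetch_dfs_by_paths_batching and the list_partition_paths helper, as well as the default batch_size value, are assumptions for illustration only, not the repository's actual implementation.

```python
from typing import List
import glob
import os

import pandas as pd


def list_partition_paths(partition: str) -> List[str]:
    # Hypothetical helper: in the real codebase the paths would come from the
    # table's storage layout; here we simply glob a local directory.
    return sorted(glob.glob(os.path.join(partition, "*.parquet")))


def fetch_dfs_by_paths_batching(paths: List[str], batch_size: int) -> List[pd.DataFrame]:
    # Placeholder for the existing helper: read `paths` in chunks of
    # `batch_size` so only one batch of files is opened at a time.
    frames: List[pd.DataFrame] = []
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        frames.extend(pd.read_parquet(path) for path in batch)
    return frames


def fetch_df_by_partition(
    partition: str,
    use_batching: bool = False,
    batch_size: int = 100,  # assumed default; not specified in the PR
) -> pd.DataFrame:
    """Fetch every file under `partition` as a single DataFrame.

    When `use_batching` is True, files are read in batches of `batch_size`
    via fetch_dfs_by_paths_batching, which should lower peak memory usage for
    partitions with very large file counts. The defaults preserve the original
    non-batched behavior, so existing callers are unaffected.
    """
    paths = list_partition_paths(partition)
    if not paths:
        return pd.DataFrame()
    if use_batching:
        frames = fetch_dfs_by_paths_batching(paths, batch_size)
    else:
        frames = [pd.read_parquet(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```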

@houqp houqp requested review from asura-io and Copilot August 17, 2025 23:19

Copilot AI left a comment

Pull Request Overview

This PR adds optional batching functionality to the fetch_df_by_partition function to improve memory efficiency when processing large numbers of files. The change introduces two new optional parameters to control batching behavior while maintaining full backward compatibility.

  • Added use_batching and batch_size parameters to fetch_df_by_partition
  • Implemented fetch_dfs_by_paths_batching function to handle batched file processing
  • Enhanced function documentation with proper parameter descriptions
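
For illustration, opting in at a call site might look like the snippet below; the parameter names come from the PR, while the partition string and batch size are made up for the example.

```python
# Existing call sites keep working exactly as before (no batching):
df = fetch_df_by_partition("dt=2025-08-17")

# Partitions with very large file counts can opt in to batching:
df_batched = fetch_df_by_partition("dt=2025-08-17", use_batching=True, batch_size=500)
```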

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@houqp houqp requested a review from PeterKeDer August 18, 2025 04:49