Skip to content

GH-48467: [C++][Parquet] Add total_buffered_bytes() API for RowGroupWriter#49527

Open
wecharyu wants to merge 2 commits intoapache:mainfrom
wecharyu:GH-48467_spilit1
Open

GH-48467: [C++][Parquet] Add total_buffered_bytes() API for RowGroupWriter#49527
wecharyu wants to merge 2 commits intoapache:mainfrom
wecharyu:GH-48467_spilit1

Conversation

@wecharyu
Copy link

@wecharyu wecharyu commented Mar 16, 2026

Rationale for this change

Expose an API for buffered bytes of values and levels in RowGroupWriter, it's useful in deciding whether a new row group is needed.
Discussion in #48468 (comment)

What changes are included in this PR?

Add new API total_buffered_bytes() in RowGroupWriter.

Are these changes tested?

Test locally.

Are there any user-facing changes?

Yes, user could use the new API to implement their customized row group strategy.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good that @pitrou can take a look at the API.

if (column_writers_[i]) {
estimated_buffered_value_bytes +=
column_writers_[i]->estimated_buffered_value_bytes() +
column_writers_[i]->EstimatedBufferedLevelsBytes();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have the question: do we need to consider buffered dictionary values?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of return a sum, should we return a struct like below:

struct BufferedStats {
  int64_t def_level_bytes;
  int64_t rep_level_bytes;
  int64_t value_bytes;
  int64_t dict_bytes;
};

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 19, 2026
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM

@@ -34,6 +34,13 @@ class ColumnWriter;
static constexpr uint8_t kParquetMagic[4] = {'P', 'A', 'R', '1'};
static constexpr uint8_t kParquetEMagic[4] = {'P', 'A', 'R', 'E'};

struct PARQUET_EXPORT BufferedStats {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BufferedStats is a bit weird naming for me. AFAIK, if we append column batch by batch, before flushing the row-group, the pages are still "buffered". Can we choose a better name or add the comment for it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants