[POC] Metadata index for Parquet files #8714
Conversation
CC @alamb @XiangpengHao @adriangb who may have interest in the design. The index is currently a new Thrift structure, but could easily be switched to fixed-length arrays. Right now I'm just trying to get something working to prove this is a) doable, b) worthwhile.
add a builder for the index and move that to the encoder
Sweet!
```rust
// TODO(ets): this should use UUID rather than simple string, but this works for prototype
let idx_len = index.len() as u64;
index.extend_from_slice(idx_len.as_bytes());
index.extend_from_slice("PARI".as_bytes());
```
PARI -- love it
```rust
    return Ok(None);
}
let magic = &buf[buf.len() - 5..buf.len() - 1];
if magic != "PARI".as_bytes() {
```
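Taken together, the write and read hunks above suggest a trailer layout of index bytes, then the index length, then the 4-byte `PARI` magic, followed by the Thrift stop byte that ends `FileMetaData`. Here is a self-contained sketch of that round trip under stated assumptions: the length is little-endian (standing in for the `as_bytes` call above), and the function names are invented for illustration, not the PR's actual API.

```rust
// Hypothetical trailer layout, per the hunks above:
// [ MetaIndex bytes ][ index_len: u64 LE ][ b"PARI" ][ 0x00 stop-field byte ]

fn append_trailer(buf: &mut Vec<u8>, index: &[u8]) {
    buf.extend_from_slice(index);
    let idx_len = index.len() as u64;
    buf.extend_from_slice(&idx_len.to_le_bytes()); // assuming little-endian encoding
    buf.extend_from_slice(b"PARI");
    buf.push(0); // Thrift stop field terminating FileMetaData
}

/// Mirror of the read hunk: check for the magic just before the final
/// stop byte, then walk back over the length to recover the index bytes.
fn read_trailer(buf: &[u8]) -> Option<&[u8]> {
    if buf.len() < 13 {
        return None;
    }
    let magic = &buf[buf.len() - 5..buf.len() - 1];
    if magic != b"PARI" {
        return None;
    }
    let len_start = buf.len() - 13;
    let idx_len = u64::from_le_bytes(buf[len_start..len_start + 8].try_into().unwrap()) as usize;
    let start = len_start.checked_sub(idx_len)?;
    Some(&buf[start..len_start])
}

fn main() {
    let mut buf = vec![0xAA; 32]; // pretend these are the ordinary footer bytes
    append_trailer(&mut buf, b"fake-metaindex");
    assert_eq!(read_trailer(&buf), Some(&b"fake-metaindex"[..]));
}
```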
Do I read this right that the format would look like:
```text
(.. data pages ..)
(.. metadata ..)
(.. current footer - PAR1 w/ len ..)
(.. MetaIndex ..)
(.. new footer - PARI w/ len ..)
```
It is clever, but would not be backwards compatible with existing readers. Though this is a nice way to get some sense of how much faster thrift-parsing could go without having to implement an entirely new parsing system...
This uses the same approach as the flatbuffers stuff. The index becomes binary field 32767 in the FileMetaData struct. Old readers will simply skip this unparsed and then hit end-of-struct and return. I'm modifying the current reader to search backwards some for the "PARI", and if found then parse that field first to get the index. That part seems to be working 😄
So it's
```text
(.. data pages ..)
(.. page indexes ..)
(.. current footer, not terminated ..)
(.. field 32767 marker ..)
(.. thrift encoded MetaIndex|len|PARI ..)
(.. 0 (to end FileMetaData) ..)
(.. PAR1 w/ len ..)
```
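Because field 32767 sits inside the footer length, the backwards search described above can run over bytes the reader has already fetched. A minimal sketch of that scan, assuming the magic sits within a bounded distance of the end of the footer buffer (the function name and scan bound are illustrative, not the PR's actual code):

```rust
/// Scan backwards over the tail of already-fetched footer bytes looking
/// for the b"PARI" magic, returning its offset if found. `max_scan`
/// bounds how far back we look, since the magic sits near the end.
fn find_pari_magic(footer: &[u8], max_scan: usize) -> Option<usize> {
    let start = footer.len().saturating_sub(max_scan);
    // Check 4-byte windows, nearest the end first.
    (start..footer.len().saturating_sub(3))
        .rev()
        .find(|&i| &footer[i..i + 4] == b"PARI")
}

fn main() {
    let mut footer = vec![1u8; 100];
    footer.extend_from_slice(b"PARI");
    footer.push(0); // stop field ending FileMetaData
    assert_eq!(find_pari_magic(&footer, 16), Some(100));
}
```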
Seems like everyone is racing to grab field 32767 lol
I'm modifying the current reader to search backwards some for the "PARI", and if found then parse that field first to get the index. That part seems to be working 😄
Another approach would be to place it right before the current metadata -- since you know where the current footer metadata starts, you could check for PARI at an offset you know after you read the last 8 bytes 🤔
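The offset arithmetic for that alternative is simple: the last 8 bytes of a Parquet file are a 4-byte little-endian metadata length plus `PAR1`, so the metadata start is known after one tail read. A sketch, assuming a hypothetical fixed-size `len | PARI` trailer placed immediately before the metadata (names and trailer size are invented for illustration):

```rust
/// The last 8 bytes of a Parquet file are: metadata_len (u32 LE) + b"PAR1".
/// Once those are read, the footer metadata's start offset is known.
fn metadata_start(file_len: u64, metadata_len: u32) -> u64 {
    file_len - 8 - metadata_len as u64
}

/// Hypothetical: if an index were placed just before the metadata with a
/// 12-byte `len (u64 LE) | b"PARI"` trailer, the trailer would end exactly
/// where the metadata begins, so it could be checked with one ranged read.
fn candidate_index_trailer(file_len: u64, metadata_len: u32) -> u64 {
    metadata_start(file_len, metadata_len) - 12
}

fn main() {
    // e.g. a 1 KiB file with a 100-byte footer metadata
    assert_eq!(metadata_start(1024, 100), 916);
    assert_eq!(candidate_index_trailer(1024, 100), 904);
}
```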
I was just wanting to use the existing protocol extension recently added to parquet-format. It makes things easy for testing, and shouldn't conflict with the flatbuffers stuff: they'll see the field but not recognize the footer, so they should ignore it.
This brings up an issue I ran across while doing the remodel: the Thrift implementation of skip for binary fields uses read_string. So when you try to skip a pure binary field that isn't all valid UTF-8, it throws an error. 56.0 can't read my footer (nor will it be able to read the flatbuffer one either). It seems the Python Thrift parser suffers from the same problem. Guess it's time for a PR there, but that doesn't really help us with backwards compatibility.
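The failure mode described here can be reproduced in miniature: skipping a binary field by decoding it as a string fails as soon as the payload isn't valid UTF-8, while a length-based skip never needs to look at the bytes. A toy illustration (not the actual Thrift codepath):

```rust
/// Toy stand-in for a skip path that decodes the field as a UTF-8 string,
/// as the Thrift binary-skip described above effectively does.
fn skip_as_string(field: &[u8]) -> Result<(), std::str::Utf8Error> {
    std::str::from_utf8(field).map(|_| ())
}

/// A correct binary skip only needs the length to advance the cursor;
/// it never inspects the payload bytes.
fn skip_as_bytes(field: &[u8]) -> usize {
    field.len()
}

fn main() {
    let binary_field = [0x82u8, 0xFF, 0x00, 0x41]; // 0x82 is a bare continuation byte: not valid UTF-8
    assert!(skip_as_string(&binary_field).is_err()); // string-based skip errors out
    assert_eq!(skip_as_bytes(&binary_field), 4); // length-based skip is fine
}
```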
Another approach would be to place it right before the current metadata
The advantage of using field 32767 is that it's already included in the footer length. No changes necessary to code used to fetch the footer in as few GETs as possible. The footer just gets a bit larger.
Now down the road, if the community decides an index is good enough and we don't need a complete rewrite of the metadata, there would be a need for a more permanent solution, which likely would involve tucking it in above the footer.
The advantage of using field 32767 is that it's already included in the footer length. No changes necessary to code used to fetch the footer in as few GETs as possible. The footer just gets a bit larger.
That is true in theory, though I am not sure how much it would matter in practice.
So the argument goes something like even if you have optimistically fetched a bunch of bytes in the hopes of reading the entire footer in the first read, you could not guarantee that a second fetch would get the index too (you would have to do a second optimistic fetch or something)
🤔
So the argument goes something like even if you have optimistically fetched a bunch of bytes in the hopes of reading the entire footer in the first read, you could not guarantee that a second fetch would get the index too (you would have to do a second optimistic fetch or something)
Yes. Worst case would be two fetches for the file metadata (one for the 8 byte footer, a second for the thrift encoded file meta), followed by two for the index. If you get really lucky you might get everything in a single fetch, less lucky you'd most likely do two.
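The fetch-count argument can be sketched as a small model: a speculative suffix read always covers the 8-byte tail, and if it also covers the footer length (which, with field 32767, includes the index), one read suffices; otherwise a second exact read is needed. Names and the decision rule here are illustrative, not the crate's real API:

```rust
/// Number of object-store reads needed to get the full footer (including
/// an index embedded in the footer length), given a speculative suffix
/// fetch of `speculative_suffix` bytes. Illustrative model only.
fn reads_needed(file_len: u64, footer_len: u64, speculative_suffix: u64) -> u32 {
    // The speculative read includes the 8-byte tail, from which footer_len
    // is learned. If the suffix already covers footer_len + 8 bytes (or the
    // whole file), no second read is needed.
    if speculative_suffix >= footer_len + 8 || speculative_suffix >= file_len {
        1
    } else {
        2
    }
}

fn main() {
    // Lucky: 64 KiB footer fits inside a 1 MiB speculative read.
    assert_eq!(reads_needed(1 << 30, 64 * 1024, 1024 * 1024), 1);
    // Unlucky: 8 MiB footer exceeds the guess, so a second read follows.
    assert_eq!(reads_needed(1 << 30, 8 * 1024 * 1024, 1024 * 1024), 2);
}
```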
I just did a quick experiment. Here's an old run on my workstation with 57.0 just before release, and here's a run using the index (I didn't set the page index offsets to 0, so they're still parsed).
🤔 I bet we would see a crazy speedup if we could also skip parsing ColumnChunk metadata for columns that are not read in the query. The benchmark above parses all the columns.
For sure. I did a quick test with b367562 where I only read every other row group's metadata. The "wide" benchmark (which happily now includes the index, thanks again @lichuang!) went from 54s to 30s. I'd bet only decoding 10 out of 10000 columns would be crazy fast (still have to do more plumbing before I can try that one).

Edit: Added column plumbing in cc2e1ec. Decoding only 10 columns takes 9s. The remaining time is likely schema parsing and reading the index.

On a related note, if you (@alamb, but others welcome) could opine on #8643 I'd appreciate it. I'm having a hard time wrapping my head around how best to convey down to the thrift parsing code which bits of metadata are wanted. I get confused with multiple readers, each with different options objects, that all then sort of use


Which issue does this PR close?
Rationale for this change
This is a proof of concept implementation of an index into the Parquet metadata. The hope is this will greatly speed up acquiring only the needed bits of the Parquet footer when not all the metadata is needed (such as when projecting a few columns out of an extremely large table).
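The rationale implies an index mapping each metadata component to a byte range inside the serialized footer, so a reader can Thrift-decode only the pieces it needs. A hypothetical sketch of such a structure (field names invented for illustration; the PR's actual Thrift definition may differ):

```rust
/// A byte range into the serialized footer metadata.
#[derive(Debug, PartialEq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Hypothetical shape of the metadata index: locations of the schema,
/// each RowGroup, and each ColumnMetaData within the footer bytes.
#[derive(Debug)]
struct MetaIndex {
    schema: ByteRange,
    row_groups: Vec<ByteRange>,
    // One inner Vec per row group, one range per column chunk.
    columns: Vec<Vec<ByteRange>>,
}

fn main() {
    let idx = MetaIndex {
        schema: ByteRange { start: 0, end: 128 },
        row_groups: vec![ByteRange { start: 128, end: 4096 }],
        columns: vec![vec![ByteRange { start: 200, end: 900 }]],
    };
    // Projecting a single column: only its byte range needs decoding.
    assert_eq!(idx.columns[0][0], ByteRange { start: 200, end: 900 });
}
```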
What changes are included in this PR?
Modifications are made to the Thrift footer encoding to allow for noting the start and end positions of each RowGroup and ColumnMetaData, as well as the location and size of the schema.
Are these changes tested?
Tests will be added as this progresses.
Are there any user-facing changes?
No, this should only impact private APIs