@etseidl
Contributor

@etseidl etseidl commented Oct 27, 2025

Which issue does this PR close?

Rationale for this change

This is a proof of concept implementation of an index into the Parquet metadata. The hope is this will greatly speed up acquiring only the needed bits of the Parquet footer when not all the metadata is needed (such as when projecting a few columns out of an extremely large table).

What changes are included in this PR?

Modifications are made to the Thrift footer encoding to allow for noting the start and end positions of each RowGroup and ColumnMetaData, as well as the location and size of the schema.
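As a rough illustration of what such an index could hold, here is a minimal sketch. The type names (`MetaIndexSketch`, `RowGroupRanges`) and the helper `ranges_for_projection` are hypothetical and are not taken from this PR; the sketch only assumes that the index records one byte range for the schema, one per RowGroup, and one per ColumnMetaData.

```rust
use std::ops::Range;

/// Hypothetical in-memory form of the index; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct MetaIndexSketch {
    /// Byte range of the schema within the encoded footer.
    pub schema: Range<usize>,
    /// One entry per row group, in file order.
    pub row_groups: Vec<RowGroupRanges>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupRanges {
    /// Byte range of the whole encoded RowGroup.
    pub row_group: Range<usize>,
    /// Byte range of each ColumnMetaData, in leaf-column order.
    pub columns: Vec<Range<usize>>,
}

/// Collect the footer byte ranges a reader would have to decode in order to
/// project the `wanted` leaf columns. The schema range is always included.
/// Column indexes that do not exist in a row group are ignored.
pub fn ranges_for_projection(index: &MetaIndexSketch, wanted: &[usize]) -> Vec<Range<usize>> {
    let mut out = vec![index.schema.clone()];
    for rg in &index.row_groups {
        for &col in wanted {
            if let Some(range) = rg.columns.get(col) {
                out.push(range.clone());
            }
        }
    }
    out
}
```

With ranges like these, a reader projecting a handful of columns can jump straight to the relevant bytes instead of walking every Thrift field in the footer.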

Are these changes tested?

Tests will be added as this progresses.

Are there any user-facing changes?

No, this should only impact private APIs

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 27, 2025
@etseidl
Contributor Author

etseidl commented Oct 27, 2025

CC @alamb @XiangpengHao @adriangb who may have interest in the design.

The index is currently a new Thrift structure, but could easily be switched to fixed-length arrays. Right now I'm just trying to get something working to prove this is a) doable, b) worthwhile.

add a builder for the index and move that to the encoder
Contributor

@alamb alamb left a comment

Sweet!

// TODO(ets): this should use UUID rather than simple string, but this works for prototype
let idx_len = index.len() as u64;
index.extend_from_slice(idx_len.as_bytes());
index.extend_from_slice("PARI".as_bytes());
Contributor

PARI -- love it

return Ok(None);
}
let magic = &buf[buf.len() - 5..buf.len() - 1];
if magic != "PARI".as_bytes() {
Contributor

Do I read this right that the format would look like:

(.. data pages ..)
(.. metadata ..)
(.. current footer - PAR1 w/ len)
(.. MetaIndex ..)
(.. new footer - PARI w/ len)

It is clever, but would not be backwards compatible with existing readers. Though this is a nice way to get some sense of how much faster thrift-parsing could go without having to implement an entirely new parsing system...

Contributor Author

This uses the same approach as the flatbuffers stuff. The index becomes binary field 32767 in the FileMetaData struct. Old readers will simply skip this unparsed and then hit end-of-struct and return. I'm modifying the current reader to search backwards some for the "PARI", and if found then parse that field first to get the index. That part seems to be working 😄

So it's

(.. data pages ..)
(.. page indexes ..)
(.. current footer, not terminated ..)
(.. field 32767 marker ..)
(.. thrift encoded MetaIndex|len|PARI ..)
(.. 0 (to end FileMetaData) ..)
(.. PAR1 w/ len ..)
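
A minimal sketch of how a reader might locate the index given the layout above. It follows the offsets in the diff snippet (magic at `len - 5 .. len - 1`, followed by the struct's stop byte), but the function name `find_index` is hypothetical, and it assumes the length is an 8-byte little-endian integer that counts only the index bytes. The thread does not confirm the byte order.

```rust
/// Magic bytes the prototype writes after the index length.
const INDEX_MAGIC: [u8; 4] = *b"PARI";

/// Given the Thrift-encoded FileMetaData bytes (ending with the struct's
/// stop byte), return the encoded index bytes if the trailer is present.
///
/// Assumption: the length is an 8-byte little-endian integer covering only
/// the index itself, not the length or magic.
pub fn find_index(buf: &[u8]) -> Option<&[u8]> {
    // Need at least an 8-byte length, the 4-byte magic and 1 stop byte.
    if buf.len() < 13 {
        return None;
    }
    let stop = buf.len() - 1;
    if buf[stop] != 0 || buf[stop - 4..stop] != INDEX_MAGIC[..] {
        return None;
    }
    let len_end = stop - 4;
    let len_start = len_end - 8;
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&buf[len_start..len_end]);
    let idx_len = u64::from_le_bytes(len_bytes);
    // Reject lengths that would reach before the start of the buffer.
    if idx_len > len_start as u64 {
        return None;
    }
    let idx_start = len_start - idx_len as usize;
    Some(&buf[idx_start..len_start])
}
```

A reader that gets `None` back can fall back to parsing the footer as it does today.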

Contributor

Seems like everyone is racing to grab field 32767 lol

> I'm modifying the current reader to search backwards some for the "PARI", and if found then parse that field first to get the index. That part seems to be working 😄

Another approach would be to place it right before the current metadata -- since you know where the current footer metadata starts, you could check for PARI at an offset you know after you read the last 8 bytes 🤔

Contributor Author

I just wanted to use the existing protocol extension recently added for parquet-format. It makes things easy for testing, and shouldn't conflict with the flatbuffers stuff... they'll see the field but not recognize the footer, so they should ignore it.

This brings up an issue I ran across while doing the remodel: the thrift implementation of skip for binary fields uses read_string. So when you try to skip a pure binary field that isn't all valid UTF-8, it throws an error. 56.0 can't read my footer (nor will it be able to read the flatbuffer one). It seems the Python thrift parser suffers from the same problem. Guess it's time for a PR there, but that doesn't really help us with backwards compatibility.
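
To illustrate the failure mode described here, a small sketch contrasting the two ways of skipping a compact-protocol binary field (a varint length followed by that many bytes). This is not the thrift crate's code; the function names are hypothetical and only the general behaviour is modelled.

```rust
/// Decode an unsigned LEB128 varint, as used for lengths in the Thrift
/// compact protocol. Returns (value, bytes consumed).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in buf.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Locate the payload of a binary field; returns (payload, offset past it).
fn payload(buf: &[u8]) -> Result<(&[u8], usize), String> {
    let (len, n) = read_varint(buf).ok_or("bad varint")?;
    let end = n
        .checked_add(len as usize)
        .filter(|&e| e <= buf.len())
        .ok_or("truncated field")?;
    Ok((&buf[n..end], end))
}

/// Skip by decoding the payload as a string: fails on non-UTF-8 bytes.
pub fn skip_binary_as_string(buf: &[u8]) -> Result<usize, String> {
    let (bytes, end) = payload(buf)?;
    std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    Ok(end)
}

/// Skip by advancing past the payload without inspecting it.
pub fn skip_binary_raw(buf: &[u8]) -> Result<usize, String> {
    let (_, end) = payload(buf)?;
    Ok(end)
}
```

Both functions agree on text payloads; only the raw version gets past arbitrary bytes such as an embedded index.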

Contributor Author

> Another approach would be to place it right before the current metadata

The advantage of using field 32767 is that it's already included in the footer length. No changes necessary to code used to fetch the footer in as few GETs as possible. The footer just gets a bit larger.

Now down the road, if the community decides an index is good enough and we don't need a complete rewrite of the metadata, there would be a need for a more permanent solution, which likely would involve tucking it in above the footer.

Contributor

@alamb alamb Oct 28, 2025

> The advantage of using field 32767 is that it's already included in the footer length. No changes necessary to code used to fetch the footer in as few GETs as possible. The footer just gets a bit larger.

That is true in theory, though I am not sure how much it would matter in practice.

So the argument goes something like even if you have optimistically fetched a bunch of bytes in the hopes of reading the entire footer in the first read, you could not guarantee that a second fetch would get the index too (you would have to do a second optimistic fetch or something)

🤔

Contributor Author

> So the argument goes something like even if you have optimistically fetched a bunch of bytes in the hopes of reading the entire footer in the first read, you could not guarantee that a second fetch would get the index too (you would have to do a second optimistic fetch or something)

Yes. Worst case would be two fetches for the file metadata (one for the 8 byte footer, a second for the thrift encoded file meta), followed by two for the index. If you get really lucky you might get everything in a single fetch, less lucky you'd most likely do two.
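
The fetch counting for the plain footer can be sketched as follows. The 8-byte tail (a 4-byte little-endian metadata length followed by `PAR1`) is the standard Parquet layout; the function names and the notion of a single optimistic `prefetch` of at least 8 bytes are assumptions made for illustration.

```rust
/// Parse the last 8 bytes of a Parquet file: a 4-byte little-endian
/// metadata length followed by the magic "PAR1".
pub fn metadata_len_from_tail(tail: &[u8; 8]) -> Option<u32> {
    if tail[4..] != *b"PAR1" {
        return None;
    }
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

/// Number of range requests needed to obtain the full footer when the
/// reader optimistically fetches the last `prefetch` bytes of the file
/// (assumed to be at least 8, so the tail is always in the first fetch).
pub fn footer_fetches(prefetch: u64, metadata_len: u32) -> u32 {
    if prefetch >= u64::from(metadata_len) + 8 {
        1 // the first read already covers metadata plus the 8-byte tail
    } else {
        2 // a second read is needed for the rest of the metadata
    }
}
```

Because an index stored in field 32767 is counted in `metadata_len`, this logic is unchanged; an index stored outside the footer would need its own lookup on top of these fetches.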

@etseidl
Contributor Author

etseidl commented Oct 27, 2025

I just did a quick experiment with the parquet_footer_parsing rig. I had to fix 56.2 to skip binary properly. The "57 no stats" is using the index to completely skip the bytes for the statistics, rather than still parse the thrift but not materialize anything.

Here's an old run on my workstation with 57.0 just before release

+-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------+
| Description                       | Parse Time Arrow 56 | Parse Time Arrow 56       | Parse Time Arrow 57 | Parse Time Arrow 57       | Parse Time Arrow 57 (no stats) | Parse Time Arrow 57 (no stats) |
|                                   |                     |                           |                     |                           |                                |                                |
|                                   | Metadata            | PageIndex (Column/Offset) | Metadata            | PageIndex (Column/Offset) | Metadata                       | PageIndex (Column/Offset)      |
+=========================================================================================================================================================================================================+
|  Float 100 cols 20 row groups     | 1.818656ms          | 2.742926ms                | 371.597µs           | 412.292µs                 | 278.182µs                      | 0ns                            |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  Float 1000 cols 20 row groups    | 17.94358ms          | 27.645205ms               | 3.660315ms          | 4.193104ms                | 2.802049ms                     | 0ns                            |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  Float 10000 cols 20 row groups   | 185.972585ms        | 307.935846ms              | 38.203277ms         | 44.805143ms               | 29.839642ms                    | 0ns                            |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  Float 100000 cols 20 row groups  | 1.859111093s        | 3.277136801s              | 387.584434ms        | 464.782303ms              | 311.1496ms                     | 0ns                            |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  String 100 cols 20 row groups    | 1.590131ms          | 2.502389ms                | 445.781µs           | 513.278µs                 | 277.58µs                       | 0ns                            |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  String 1000 cols 20 row groups   | 15.814435ms         | 25.203266ms               | 4.424308ms          | 5.022333ms                | 2.780101ms                     | 0ns                            |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  String 10000 cols 20 row groups  | 163.855822ms        | 269.453287ms              | 45.111337ms         | 55.967408ms               | 29.530731ms                    | 0ns                            |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  String 100000 cols 20 row groups | 1.650930706s        | 2.882455606s              | 457.848214ms        | 567.259783ms              | 304.9529ms                     | 0ns                            |
+-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------+

and here's a run using the index (didn't set the page index offsets to 0 so they're still parsed)

+-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------+
| Description                       | Parse Time Arrow 56 | Parse Time Arrow 56       | Parse Time Arrow 57 | Parse Time Arrow 57       | Parse Time Arrow 57 (no stats) | Parse Time Arrow 57 (no stats) |
|                                   |                     |                           |                     |                           |                                |                                |
|                                   | Metadata            | PageIndex (Column/Offset) | Metadata            | PageIndex (Column/Offset) | Metadata                       | PageIndex (Column/Offset)      |
+=========================================================================================================================================================================================================+
|  Float 100 cols 20 row groups     | 1.782124ms          | 2.788277ms                | 384.006µs           | 437.833µs                 | 190.7µs                        | 442.136µs                      |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  Float 1000 cols 20 row groups    | 17.773675ms         | 27.990747ms               | 3.689049ms          | 4.011346ms                | 1.803588ms                     | 4.077435ms                     |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  Float 10000 cols 20 row groups   | 186.135302ms        | 319.160885ms              | 38.658397ms         | 48.454485ms               | 20.437285ms                    | 45.337169ms                    |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  Float 100000 cols 20 row groups  | 1.850730717s        | 3.308524542s              | 392.43504ms         | 468.130178ms              | 208.983728ms                   | 452.502117ms                   |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  String 100 cols 20 row groups    | 1.551055ms          | 3.540635ms                | 451.13µs            | 535.625µs                 | 190.766µs                      | 522.67µs                       |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  String 1000 cols 20 row groups   | 15.781785ms         | 25.655606ms               | 4.420568ms          | 5.245406ms                | 1.85453ms                      | 5.031554ms                     |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  String 10000 cols 20 row groups  | 162.570823ms        | 272.449084ms              | 45.722412ms         | 58.275058ms               | 20.454023ms                    | 55.498968ms                    |
|-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------|
|  String 100000 cols 20 row groups | 1.624907725s        | 2.803188356s              | 465.555399ms        | 570.005157ms              | 208.100961ms                   | 548.227371ms                   |
+-----------------------------------+---------------------+---------------------------+---------------------+---------------------------+--------------------------------+--------------------------------+

@alamb
Contributor

alamb commented Oct 28, 2025

🤔 I bet we would see a crazy speedup if we could also skip parsing ColumnChunk metadata for columns that are not read in the query

The benchmark above parses all the columns

@etseidl
Contributor Author

etseidl commented Oct 28, 2025

> 🤔 I bet we would see a crazy speedup if we could also skip parsing ColumnChunk metadata for columns that are not read in the query
>
> The benchmark above parses all the columns

For sure. I did a quick test with b367562 where I only read every other row group's metadata. The "wide" benchmark (which happily now includes the index, thanks again @lichuang!) went from 54s to 30s. I'd bet only decoding 10 out of 10000 columns would be crazy fast (still have to do more plumbing before I can try that one).

Edit: Added column plumbing in cc2e1ec. Decoding only 10 columns takes 9s. The remaining time is likely schema parsing and reading the index.

On a related note, if you (@alamb, but others welcome) could opine on #8643 I'd appreciate it. I'm having a hard time wrapping my head around how best to convey down to the thrift parsing code which bits of metadata are wanted. I get confused with multiple readers, each with different options objects, that all then sort of use ParquetMetaDataReader, except now there's the push decoder and MetadataParser. For instance, how would one hook a column projection or pushdown predicate into the metadata parsing?

@etseidl
Contributor Author

etseidl commented Oct 29, 2025

Decoding all columns:
[Screenshot: Screen Shot 2025-10-28 at 4 59 04 PM]

Decoding 10 columns:
[Screenshot: Screen Shot 2025-10-28 at 4 58 27 PM]

Now if we had a cached schema lying around... 🤔

Development

Successfully merging this pull request may close these issues.

Add metadata index for Parquet files

2 participants