[POC] Metadata index for Parquet files #8714
Conversation
CC @alamb @XiangpengHao @adriangb who may have interest in the design. The index is currently a new Thrift structure, but could easily be switched to fixed-length arrays. Right now I'm just trying to get something working to prove this is a) doable, b) worthwhile.
add a builder for the index and move that to the encoder
Sweet!
```rust
// TODO(ets): this should use UUID rather than simple string, but this works for prototype
let idx_len = index.len() as u64;
index.extend_from_slice(idx_len.as_bytes());
index.extend_from_slice("PARI".as_bytes());
```
PARI -- love it
```rust
    return Ok(None);
}
let magic = &buf[buf.len() - 5..buf.len() - 1];
if magic != "PARI".as_bytes() {
```
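Taken together, the write and read hunks above suggest a trailer layout of index bytes, then the index length, then the 4-byte `PARI` magic, followed by the Thrift stop byte that ends `FileMetaData`. Here is a self-contained sketch of that round trip under stated assumptions: the length is little-endian (standing in for the `as_bytes` call above), and the function names are invented for illustration, not the PR's actual API.

```rust
// Hypothetical trailer layout, per the hunks above:
// [ MetaIndex bytes ][ index_len: u64 LE ][ b"PARI" ][ 0x00 stop-field byte ]

fn append_trailer(buf: &mut Vec<u8>, index: &[u8]) {
    buf.extend_from_slice(index);
    let idx_len = index.len() as u64;
    buf.extend_from_slice(&idx_len.to_le_bytes()); // assuming little-endian encoding
    buf.extend_from_slice(b"PARI");
    buf.push(0); // Thrift stop field terminating FileMetaData
}

/// Mirror of the read hunk: check for the magic just before the final
/// stop byte, then walk back over the length to recover the index bytes.
fn read_trailer(buf: &[u8]) -> Option<&[u8]> {
    if buf.len() < 13 {
        return None;
    }
    let magic = &buf[buf.len() - 5..buf.len() - 1];
    if magic != b"PARI" {
        return None;
    }
    let len_start = buf.len() - 13;
    let idx_len = u64::from_le_bytes(buf[len_start..len_start + 8].try_into().unwrap()) as usize;
    let start = len_start.checked_sub(idx_len)?;
    Some(&buf[start..len_start])
}

fn main() {
    let mut buf = vec![0xAA; 32]; // pretend these are the ordinary footer bytes
    append_trailer(&mut buf, b"fake-metaindex");
    assert_eq!(read_trailer(&buf), Some(&b"fake-metaindex"[..]));
}
```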
Do I read this right that the format would look like:
```text
(.. data pages ..)
(.. metadata ..)
(.. current footer - PAR1 w/ len ..)
(.. MetaIndex ..)
(.. new footer - PARI w/ len ..)
```
It is clever, but would not be backwards compatible with existing readers. Though this is a nice way to get some sense of how much faster thrift-parsing could go without having to implement an entirely new parsing system...
This uses the same approach as the flatbuffers stuff. The index becomes binary field 32767 in the FileMetaData struct. Old readers will simply skip this unparsed and then hit end-of-struct and return. I'm modifying the current reader to search backwards some for the "PARI", and if found then parse that field first to get the index. That part seems to be working 😄
So it's
```text
(.. data pages ..)
(.. page indexes ..)
(.. current footer, not terminated ..)
(.. field 32767 marker ..)
(.. thrift encoded MetaIndex|len|PARI ..)
(.. 0 (to end FileMetaData) ..)
(.. PAR1 w/ len ..)
```
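Because field 32767 sits inside the footer length, the backwards search described above can run over bytes the reader has already fetched. A minimal sketch of that scan, assuming the magic sits within a bounded distance of the end of the footer buffer (the function name and scan bound are illustrative, not the PR's actual code):

```rust
/// Scan backwards over the tail of already-fetched footer bytes looking
/// for the b"PARI" magic, returning its offset if found. `max_scan`
/// bounds how far back we look, since the magic sits near the end.
fn find_pari_magic(footer: &[u8], max_scan: usize) -> Option<usize> {
    let start = footer.len().saturating_sub(max_scan);
    // Check 4-byte windows, nearest the end first.
    (start..footer.len().saturating_sub(3))
        .rev()
        .find(|&i| &footer[i..i + 4] == b"PARI")
}

fn main() {
    let mut footer = vec![1u8; 100];
    footer.extend_from_slice(b"PARI");
    footer.push(0); // stop field ending FileMetaData
    assert_eq!(find_pari_magic(&footer, 16), Some(100));
}
```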
Seems like everyone is racing to grab field 32767 lol
I'm modifying the current reader to search backwards some for the "PARI", and if found then parse that field first to get the index. That part seems to be working 😄
Another approach would be to place it right before the current metadata -- since you know where the current footer metadata starts, you could check for PARI at an offset you know after you read the last 8 bytes 🤔
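The offset arithmetic for that alternative is simple: the last 8 bytes of a Parquet file are a 4-byte little-endian metadata length plus `PAR1`, so the metadata start is known after one tail read. A sketch, assuming a hypothetical fixed-size `len | PARI` trailer placed immediately before the metadata (names and trailer size are invented for illustration):

```rust
/// The last 8 bytes of a Parquet file are: metadata_len (u32 LE) + b"PAR1".
/// Once those are read, the footer metadata's start offset is known.
fn metadata_start(file_len: u64, metadata_len: u32) -> u64 {
    file_len - 8 - metadata_len as u64
}

/// Hypothetical: if an index were placed just before the metadata with a
/// 12-byte `len (u64 LE) | b"PARI"` trailer, the trailer would end exactly
/// where the metadata begins, so it could be checked with one ranged read.
fn candidate_index_trailer(file_len: u64, metadata_len: u32) -> u64 {
    metadata_start(file_len, metadata_len) - 12
}

fn main() {
    // e.g. a 1 KiB file with a 100-byte footer metadata
    assert_eq!(metadata_start(1024, 100), 916);
    assert_eq!(candidate_index_trailer(1024, 100), 904);
}
```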
I was just wanting to use the existing protocol extension recently added to parquet-format. It makes things easy for testing, and shouldn't conflict with the flatbuffers stuff: they'll see the field but not recognize the footer, so they should ignore it.
This brings up an issue I ran across while doing the remodel: the Thrift implementation of skip for binary fields uses read_string. So when you try to skip a pure binary field that isn't all valid UTF-8, it throws an error. 56.0 can't read my footer (nor will it be able to read the flatbuffer one either). It seems the Python Thrift parser suffers from the same problem. Guess it's time for a PR there, but that doesn't really help us with backwards compatibility.
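The failure mode described here can be reproduced in miniature: skipping a binary field by decoding it as a string fails as soon as the payload isn't valid UTF-8, while a length-based skip never needs to look at the bytes. A toy illustration (not the actual Thrift codepath):

```rust
/// Toy stand-in for a skip path that decodes the field as a UTF-8 string,
/// as the Thrift binary-skip described above effectively does.
fn skip_as_string(field: &[u8]) -> Result<(), std::str::Utf8Error> {
    std::str::from_utf8(field).map(|_| ())
}

/// A correct binary skip only needs the length to advance the cursor;
/// it never inspects the payload bytes.
fn skip_as_bytes(field: &[u8]) -> usize {
    field.len()
}

fn main() {
    let binary_field = [0x82u8, 0xFF, 0x00, 0x41]; // 0x82 is a bare continuation byte: not valid UTF-8
    assert!(skip_as_string(&binary_field).is_err()); // string-based skip errors out
    assert_eq!(skip_as_bytes(&binary_field), 4); // length-based skip is fine
}
```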
Another approach would be to place it right before the current metadata
The advantage of using field 32767 is that it's already included in the footer length. No changes necessary to code used to fetch the footer in as few GETs as possible. The footer just gets a bit larger.
Now down the road, if the community decides an index is good enough and we don't need a complete rewrite of the metadata, there would be a need for a more permanent solution, which likely would involve tucking it in above the footer.
The advantage of using field 32767 is that it's already included in the footer length. No changes necessary to code used to fetch the footer in as few GETs as possible. The footer just gets a bit larger.
That is true in theory, though I am not sure how much it would matter in practice.
So the argument goes something like even if you have optimistically fetched a bunch of bytes in the hopes of reading the entire footer in the first read, you could not guarantee that a second fetch would get the index too (you would have to do a second optimistic fetch or something)
🤔
So the argument goes something like even if you have optimistically fetched a bunch of bytes in the hopes of reading the entire footer in the first read, you could not guarantee that a second fetch would get the index too (you would have to do a second optimistic fetch or something)
Yes. Worst case would be two fetches for the file metadata (one for the 8 byte footer, a second for the thrift encoded file meta), followed by two for the index. If you get really lucky you might get everything in a single fetch, less lucky you'd most likely do two.
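The fetch-count argument can be sketched as a small model: a speculative suffix read always covers the 8-byte tail, and if it also covers the footer length (which, with field 32767, includes the index), one read suffices; otherwise a second exact read is needed. Names and the decision rule here are illustrative, not the crate's real API:

```rust
/// Number of object-store reads needed to get the full footer (including
/// an index embedded in the footer length), given a speculative suffix
/// fetch of `speculative_suffix` bytes. Illustrative model only.
fn reads_needed(file_len: u64, footer_len: u64, speculative_suffix: u64) -> u32 {
    // The speculative read includes the 8-byte tail, from which footer_len
    // is learned. If the suffix already covers footer_len + 8 bytes (or the
    // whole file), no second read is needed.
    if speculative_suffix >= footer_len + 8 || speculative_suffix >= file_len {
        1
    } else {
        2
    }
}

fn main() {
    // Lucky: 64 KiB footer fits inside a 1 MiB speculative read.
    assert_eq!(reads_needed(1 << 30, 64 * 1024, 1024 * 1024), 1);
    // Unlucky: 8 MiB footer exceeds the guess, so a second read follows.
    assert_eq!(reads_needed(1 << 30, 8 * 1024 * 1024, 1024 * 1024), 2);
}
```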
I just did a quick experiment. Here's an old run on my workstation with 57.0 just before release, and here's a run using the index (I didn't set the page index offsets to 0, so they're still parsed).
🤔 I bet we would see a crazy speedup if we could also skip parsing ColumnChunk metadata for columns that are not read in the query. The benchmark above parses all the columns.
For sure. I did a quick test with b367562 where I only read every other row group's metadata. The "wide" benchmark (which happily now includes the index, thanks again @lichuang!) went from 54s to 30s. I'd bet only decoding 10 out of 10000 columns would be crazy fast (still have to do more plumbing before I can try that one).

Edit: Added column plumbing in cc2e1ec. Decoding only 10 columns takes 9s. The remaining time is likely schema parsing and reading the index.

On a related note, if you (@alamb, but others welcome) could opine on #8643 I'd appreciate it. I'm having a hard time wrapping my head around how best to convey down to the thrift parsing code which bits of metadata are wanted. I get confused with multiple readers, each with different options objects, that all then sort of use


Which issue does this PR close?
Rationale for this change
This is a proof of concept implementation of an index into the Parquet metadata. The hope is this will greatly speed up acquiring only the needed bits of the Parquet footer when not all the metadata is needed (such as when projecting a few columns out of an extremely large table).
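The rationale implies an index mapping each metadata component to a byte range inside the serialized footer, so a reader can Thrift-decode only the pieces it needs. A hypothetical sketch of such a structure (field names invented for illustration; the PR's actual Thrift definition may differ):

```rust
/// A byte range into the serialized footer metadata.
#[derive(Debug, PartialEq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Hypothetical shape of the metadata index: locations of the schema,
/// each RowGroup, and each ColumnMetaData within the footer bytes.
#[derive(Debug)]
struct MetaIndex {
    schema: ByteRange,
    row_groups: Vec<ByteRange>,
    // One inner Vec per row group, one range per column chunk.
    columns: Vec<Vec<ByteRange>>,
}

fn main() {
    let idx = MetaIndex {
        schema: ByteRange { start: 0, end: 128 },
        row_groups: vec![ByteRange { start: 128, end: 4096 }],
        columns: vec![vec![ByteRange { start: 200, end: 900 }]],
    };
    // Projecting a single column: only its byte range needs decoding.
    assert_eq!(idx.columns[0][0], ByteRange { start: 200, end: 900 });
}
```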
What changes are included in this PR?
Modifications are made to the Thrift footer encoding to allow for noting the start and end positions of each RowGroup and ColumnMetaData, as well as the location and size of the schema.
Are these changes tested?
Tests will be added as this progresses.
Are there any user-facing changes?
No, this should only impact private APIs