Add Apache Parquet Adapter #208
base: main
Conversation
- digest fields can now be initialized using bytes as well.
- fix bug where value would be set, even if the value was bad.
- added test_digest.py
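For illustration, a minimal sketch of the behaviour exercised in `test_digest.py`, assuming the `digest` fieldtype accepts an `(md5, sha1, sha256)` tuple and that each element may now be raw bytes as well as a hex string (the descriptor and values below are made up, not taken from this PR):

```python
import hashlib

from flow.record import RecordDescriptor

# Hypothetical descriptor; only the built-in "digest" fieldtype is the point here.
FileHash = RecordDescriptor(
    "example/filehash",
    [
        ("digest", "hashes"),
        ("string", "filename"),
    ],
)

data = b"hello world"
record = FileHash(
    filename="hello.txt",
    # Per this PR, raw bytes digests should now be accepted alongside hex strings
    # (the exact accepted shapes are an assumption, not verified against the code).
    hashes=(
        hashlib.md5(data).digest(),
        hashlib.sha1(data).digest(),
        hashlib.sha256(data).digest(),
    ),
)
print(record.hashes.md5, record.hashes.sha256)
```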
Codecov Report

❌ Patch coverage is …

@@            Coverage Diff             @@
##             main     #208      +/-   ##
==========================================
+ Coverage   83.33%   84.26%   +0.92%
==========================================
  Files          35       37       +2
  Lines        3714     4035     +321
==========================================
+ Hits         3095     3400     +305
- Misses        619      635      +16
Force-pushed from ba28bcf to 84200b9.
Could not solve the …

Note that when Python 3.10 is deprecated and is no longer the base Python version, sphinx will be auto-upgraded to a newer version (e.g. v9.0.4) that seems to throw more warnings.
Force-pushed from 652e297 to 831ecb6.
This fixes the ValueError: I/O operation on closed file warnings during tests. To summarize what was happening:

- Tests calling rdump.main() were setting up logging handlers.
- When those tests finished, the handlers' streams were getting closed.
- But the handlers remained attached to the root logger.
- Python tried to write to the closed stream → ValueError: I/O operation on closed file.

The `reset_logging` fixture ensures every test gets a clean logging state.
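For reference, a minimal sketch of what such a fixture could look like (assumptions: it is autouse and simply detaches and closes whatever handlers a test left on the root logger; the fixture actually added in this PR may differ):

```python
import logging

import pytest


@pytest.fixture(autouse=True)
def reset_logging():
    """Give every test a clean logging state (sketch, not the PR's exact fixture)."""
    yield
    root = logging.getLogger()
    # Detach and close any handlers a test (e.g. one calling rdump.main()) left behind,
    # so later log writes never hit an already-closed stream.
    for handler in list(root.handlers):
        root.removeHandler(handler)
        handler.close()
```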
Description
This PR adds support for reading and writing Apache Parquet files using pyarrow. This allows `flow.record` tools like `rdump` to interact with the Parquet ecosystem, enabling efficient storage and integration with other data analysis tools.

Implementation Details
The implementation introduces a new adapter in `flow/record/adapter/parquet.py` with the following features:

ParquetWriter
Usage: `rdump -w parquet://[PATH]?batch_size=[BATCH_SIZE]&compression=[COMPRESSION]` or `rdump -w output.parquet`

Compression: Supports `snappy`, `gzip`, `brotli`, `zstd` (default), `lz4`, and `none`.
Schema Handling: Converts `flow.record` `RecordDescriptor` definitions to PyArrow schemas and stores the `flow.record` descriptor metadata (`descriptor_name`, `descriptor_fields`) in the Parquet file metadata to ensure lossless round-tripping.
Type Mapping: Maps `flow.record` types to their Arrow equivalents, for example:

- datetime → `timestamp('us', tz='UTC')`
- integer fields → `int64`
- digest → `struct<md5: binary, sha1: binary, sha256: binary>`
- path → `struct<path: string, path_type: uint8>`
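For illustration, a sketch of an Arrow schema following this mapping, built with pyarrow. The field names and the metadata value are made up; this is not the adapter's actual construction code:

```python
import pyarrow as pa

# Hypothetical schema mirroring the type mapping above.
example_schema = pa.schema(
    [
        pa.field("ts", pa.timestamp("us", tz="UTC")),  # datetime field
        pa.field("count", pa.int64()),                 # integer field
        pa.field(
            "digest",
            pa.struct(
                [
                    pa.field("md5", pa.binary()),
                    pa.field("sha1", pa.binary()),
                    pa.field("sha256", pa.binary()),
                ]
            ),
        ),
        pa.field(
            "path",
            pa.struct(
                [
                    pa.field("path", pa.string()),
                    pa.field("path_type", pa.uint8()),
                ]
            ),
        ),
    ],
    # The PR stores descriptor metadata in the Parquet file metadata;
    # the exact key/value format used here is assumed.
    metadata={"descriptor_name": "example/record"},
)
print(example_schema)
```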
ParquetReader

Usage: `rdump parquet://[PATH]` or `rdump dataset.parquet`
rdump

New arguments were added to `rdump` to exclude or include columns when reading Parquet files (example below):

- `--fields-read` or `-Fr`: columns to read, skipping reading of other columns
- `--exclude-read` or `-Xr`: columns to exclude/skip for reading
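A hypothetical invocation of the new flags, assuming they take a comma-separated list of field names like rdump's existing field selection options (syntax and field names not verified against this PR):

```sh
# Read only two columns from a Parquet dataset (assumed comma-separated syntax)
rdump dataset.parquet --fields-read ts,path

# Skip an expensive column while reading everything else
rdump parquet://dataset.parquet --exclude-read digest
```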
Related Improvements

- Updated `fieldtypes.digest` to support initialization from bytes and fixed a bug where invalid values could be set.
- Added test data `tests/_data/iris-zstd.parquet` (git-lfs).
- An `.arrow` adapter has also been added as a bonus to this PR. It's basically the IPC streaming format for Apache Arrow, which could be handy for the future.

Dependencies
pyarrow
Usage Example
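A minimal hypothetical round-trip with `rdump`; the input file name and option values are illustrative, only the adapter URL parameters come from this PR:

```sh
# Convert an existing records file to Parquet (zstd compression by default)
rdump input.records -w output.parquet

# Write with explicit adapter options
rdump input.records -w "parquet://output.parquet?batch_size=1024&compression=gzip"

# Read the Parquet file back and print the records
rdump output.parquet
```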
resolves #207