Closed
Description
Describe the bug
I'm using odbc2parquet (https://github.com/pacman82/odbc2parquet), which is based on this crate.
Statistics such as min/max are not written for string columns:
In [4]: pq.ParquetFile("/tmp/o2p").metadata.row_group(0).column(1)
Out[4]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x1033c1080>
  file_offset: 1123
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 100
  path_in_schema: XXX
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x103476070>
      has_min_max: False
      min: None
      max: None
      null_count: None
      distinct_count: None
      num_values: 100
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: ZSTD
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 394
  data_page_offset: 938
  total_compressed_size: 729
  total_uncompressed_size: 2993
Relevant code: https://github.com/pacman82/odbc2parquet/blob/b571cad6fae1b58e1aab8348f14b32f20d6ec165/src/query/parquet_writer.rs#L47
To Reproduce
Use odbc2parquet to export any table that contains a string column, then inspect the column chunk metadata of the resulting file.
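To check a whole file rather than a single column, something like the sketch below could be used. It only relies on the pyarrow metadata attributes already visible in the output above (`row_group`, `column`, `statistics`, `has_min_max`, `path_in_schema`) plus `num_row_groups` and `num_columns`. The helper name `columns_missing_min_max` is made up for this illustration.

```python
def columns_missing_min_max(metadata):
    """Return (row_group_index, column_path) pairs that lack min/max statistics.

    `metadata` is assumed to behave like pyarrow's FileMetaData, e.g.
    pq.ParquetFile(path).metadata. The helper name is hypothetical.
    """
    missing = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            column = row_group.column(col_index)
            stats = column.statistics  # None when no statistics were written at all
            if stats is None or not stats.has_min_max:
                missing.append((rg_index, column.path_in_schema))
    return missing
```

With the file from this report, `columns_missing_min_max(pq.ParquetFile("/tmp/o2p").metadata)` should include the `XXX` column of row group 0.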
Expected behavior
String columns should have min/max statistics written, just like other column types.
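For reference, here is a minimal sketch of the values I would expect to see for a string column chunk. It assumes, as I understand the Parquet format, that String/UTF8 columns are ordered by unsigned byte-wise comparison of the UTF-8 bytes. The function name `expected_min_max` is hypothetical and this is not how the crate computes statistics internally.

```python
def expected_min_max(values):
    """Compute the min/max a writer could record for one string column chunk.

    Assumption: strings are compared as their UTF-8 bytes, byte by byte and
    unsigned, which matches how Python compares `bytes` objects.
    Nulls (None) are skipped because they do not take part in min/max.
    """
    encoded = [v.encode("utf-8") for v in values if v is not None]
    if not encoded:
        return None  # empty or all-null chunk: there is no min/max to record
    return min(encoded), max(encoded)
```

Comparing the output of such a helper against `statistics.min` and `statistics.max` in pyarrow would be one way to confirm a fix.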
Additional context