Description
I encountered a bug while trying to retrieve Parquet metadata for a column chunk with logical type Decimal128(15, 2).
- The Parquet file was generated using arrow-rs, and I can successfully access its metadata via arrow-rs, DataFusion, or this tool: https://parquet-viewer.xiangpeng.systems/
- However, I run into an error when attempting to read the metadata using PyArrow.
I attached the Parquet file (under 50MB) along with a minimal Python script to reproduce the issue. If the bug isn’t reproducible on your end, I’m happy to help investigate further.
The Parquet file is in a separate repo of mine: https://github.com/JigaoLuo/arrow-47955
#!/usr/bin/env python3
# $ python parquet_metadata_reader.py customer.parquet
import sys

import pyarrow.parquet as pq


def print_parquet_metadata(parquet_file):
    pq_metadata = pq.read_metadata(parquet_file)
    schema = pq_metadata.schema.to_arrow_schema()
    for col_idx in range(len(schema)):
        field = schema.field(col_idx)
        col_name = field.name
        column_meta = pq_metadata.schema.column(col_idx)
        print(f"Column {col_idx}: {col_name}")
        print(f" Type: {column_meta.physical_type}")
        row_group = pq_metadata.row_group(0)  # Stats of the first row group
        rg_column = row_group.column(col_idx)
        print(" Stats:", rg_column.statistics)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python parquet_metadata_reader.py <parquet_file>")
        sys.exit(1)
    try:
        print_parquet_metadata(sys.argv[1])
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

The error message:
Column 0: c_custkey
Type: INT64
Stats: <pyarrow._parquet.Statistics object at 0x7f11d4accd60>
has_min_max: True
min: 1
max: 14999999
null_count: 0
distinct_count: None
num_values: 3000188
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
Column 1: c_nationkey
Type: INT32
Stats: <pyarrow._parquet.Statistics object at 0x7f11d4accd10>
has_min_max: True
min: 0
max: 24
null_count: 0
distinct_count: None
num_values: 3000188
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
Column 2: c_acctbal
Type: INT64
Stats: Error: Cannot extract statistics for type
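For context on column 2: the INT64 physical type reported for c_acctbal is consistent with the Parquet DECIMAL annotation, which may sit on an integer physical type that stores the unscaled value (INT64 allows a precision of up to 18 digits). The sketch below only illustrates that encoding. The helper name `unscaled_to_decimal` and the sample value are hypothetical and are not taken from the attached file or from PyArrow.

```python
from decimal import Decimal


def unscaled_to_decimal(unscaled: int, scale: int) -> Decimal:
    """Interpret an unscaled integer as a decimal with `scale` fractional digits."""
    # A DECIMAL value equals unscaled * 10**(-scale); scaleb shifts the exponent.
    return Decimal(unscaled).scaleb(-scale)


# Hypothetical raw value for a Decimal128(15, 2) column stored as INT64.
print(unscaled_to_decimal(123456, 2))  # 1234.56
```

If the statistics for this column are stored the same way, the raw min and max would be plain 64-bit integers that a reader has to rescale by the column's scale of 2.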
Thanks!
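A diagnostic variant of the script above may help narrow this down. It is a minimal sketch that catches the failure per column, so the remaining columns are still listed. In the output above, " Stats:" appears before the error text, which suggests the exception is raised while the statistics object is being formatted, so the sketch forces the string conversion inside the `try`. The names `safe_statistics` and `print_metadata_tolerant` are my own and are not part of PyArrow.

```python
def safe_statistics(column_chunk):
    """Return (text, error) for a column chunk's statistics.

    Both the attribute access and the string conversion happen inside the
    try block, so a failure in either step is reported instead of raised.
    """
    try:
        return str(column_chunk.statistics), None
    except Exception as exc:  # deliberately broad: this is a diagnostic aid
        return None, f"{type(exc).__name__}: {exc}"


def print_metadata_tolerant(parquet_file, row_group_index=0):
    """Print per-column metadata, continuing past columns whose stats fail."""
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    metadata = pq.read_metadata(parquet_file)
    row_group = metadata.row_group(row_group_index)
    for col_idx in range(row_group.num_columns):
        chunk = row_group.column(col_idx)
        print(f"Column {col_idx}: {chunk.path_in_schema}")
        print(f"  Type: {chunk.physical_type}")
        text, error = safe_statistics(chunk)
        print("  Stats:", text if error is None else f"<unavailable: {error}>")
```

Calling `print_metadata_tolerant("customer.parquet")` should then report the exception type for c_acctbal and continue with any later columns, which would show whether other columns are affected as well.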
Version
I installed pyarrow via conda:
$ conda list | grep pyarrow
pyarrow 21.0.0 py313h78bf25f_1 conda-forge
pyarrow-core 21.0.0 py313he109ebe_1_cpu conda-forge
Platform
I run on bare metal: an AMD EPYC 7742 64-Core Processor with Ubuntu on the NVIDIA kernel 5.15.0-1042-nvidia.
$ uname -a
Linux dgx 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Related issue (?)
I could only find one similar issue, but it is not exactly the same: microsoft/semantic-link-labs#909
Component(s)
Parquet, Python