[Python] [PyArrow] [Parquet] Error: Cannot extract statistics for type in Decimal128(15, 2) Column #47955

@JigaoLuo

Description

I encountered a bug while trying to retrieve Parquet metadata for a column chunk with logical type Decimal128(15, 2).

  • The Parquet file was generated using arrow-rs, and I can successfully access its metadata via arrow-rs, DataFusion, or this tool: https://parquet-viewer.xiangpeng.systems/
  • However, I run into an error when attempting to read the metadata using PyArrow.

I attached the Parquet file (under 50MB) along with a minimal Python script to reproduce the issue. If the bug isn’t reproducible on your end, I’m happy to help investigate further.

The Parquet file is in a separate repo: https://github.com/JigaoLuo/arrow-47955

#!/usr/bin/env python3
# $ python parquet_metadata_reader.py customer.parquet 

import sys
import pyarrow.parquet as pq

def print_parquet_metadata(parquet_file):
    pq_metadata = pq.read_metadata(parquet_file)
    schema = pq_metadata.schema.to_arrow_schema()
    for col_idx in range(len(schema)):
        field = schema.field(col_idx)
        col_name = field.name
        column_meta = pq_metadata.schema.column(col_idx)
        print(f"Column {col_idx}: {col_name}")
        print(f"  Type: {column_meta.physical_type}")
        row_group = pq_metadata.row_group(0) # Stats of the first row group
        rg_column = row_group.column(col_idx)
        print("  Stats:", rg_column.statistics)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python parquet_metadata_reader.py <parquet_file>")
        sys.exit(1)
    try:
        print_parquet_metadata(sys.argv[1])
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
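
The script above stops at the first failing column and discards the traceback. As a hedged sketch (not part of the original report), the variant below isolates each column so that one failure does not hide the rest, and it also records the logical type next to the physical type. The helper names `collect_column_statistics` and `report` are hypothetical. It assumes the exception surfaces when `min`/`max` are read from the statistics object, which is what the output suggests, since `  Stats:` is printed before the error.

```python
import sys


def collect_column_statistics(metadata, row_group_index=0):
    """Return one record per leaf column, capturing any statistics error.

    `metadata` is expected to behave like pyarrow.parquet.FileMetaData.
    """
    records = []
    row_group = metadata.row_group(row_group_index)
    for col_idx in range(metadata.num_columns):
        column_schema = metadata.schema.column(col_idx)
        record = {
            "name": column_schema.name,
            "physical_type": column_schema.physical_type,
            "logical_type": str(column_schema.logical_type),
            "min": None,
            "max": None,
            "error": None,
        }
        try:
            stats = row_group.column(col_idx).statistics
            # Assumption: the failure happens while min/max are converted,
            # so both reads stay inside the try block.
            if stats is not None and stats.has_min_max:
                record["min"] = stats.min
                record["max"] = stats.max
        except Exception as exc:
            record["error"] = f"{type(exc).__name__}: {exc}"
        records.append(record)
    return records


def report(parquet_file):
    # Imported lazily so the helper above can be exercised without pyarrow.
    import pyarrow.parquet as pq

    for record in collect_column_statistics(pq.read_metadata(parquet_file)):
        print(record)


if __name__ == "__main__" and len(sys.argv) == 2:
    report(sys.argv[1])
```

Iterating over `metadata.num_columns` instead of the Arrow schema length keeps the index aligned with Parquet leaf columns, which matters only for nested schemas; for the flat file in this report both approaches visit the same columns.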

The output, ending with the error message:

Column 0: c_custkey
  Type: INT64
  Stats: <pyarrow._parquet.Statistics object at 0x7f11d4accd60>
  has_min_max: True
  min: 1
  max: 14999999
  null_count: 0
  distinct_count: None
  num_values: 3000188
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE
Column 1: c_nationkey
  Type: INT32
  Stats: <pyarrow._parquet.Statistics object at 0x7f11d4accd10>
  has_min_max: True
  min: 0
  max: 24
  null_count: 0
  distinct_count: None
  num_values: 3000188
  physical_type: INT32
  logical_type: None
  converted_type (legacy): NONE
Column 2: c_acctbal
  Type: INT64
  Stats: Error: Cannot extract statistics for type 
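
For context on what the missing statistics would contain: the failing column `c_acctbal` has physical type INT64 with a Decimal128(15, 2) logical type. Under the Parquet format, a DECIMAL annotation on an integer physical type means the stored integer is the unscaled value, so the real value is that integer divided by 10^scale. The sketch below illustrates the conversion with the standard library only; the helper name `unscaled_to_decimal` and the sample values are hypothetical and are not taken from the attached file.

```python
from decimal import Decimal


def unscaled_to_decimal(unscaled, scale):
    """Interpret an unscaled integer as a decimal with the given scale."""
    # scaleb(-scale) shifts the decimal point left by `scale` digits.
    return Decimal(unscaled).scaleb(-scale)


# Hypothetical raw INT64 values for a Decimal(15, 2) column.
for raw in (123456, -99999, 0):
    print(raw, "->", unscaled_to_decimal(raw, 2))
```

A reader that handles this case would be expected to apply the same interpretation to the raw min and max of the column chunk.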

Thanks!

Version

I installed pyarrow via conda:

$ conda list | grep pyarrow
pyarrow                             21.0.0              py313h78bf25f_1               conda-forge
pyarrow-core                        21.0.0              py313he109ebe_1_cpu           conda-forge

Platform

I am running on bare metal: an AMD EPYC 7742 64-Core Processor with Ubuntu on the NVIDIA kernel 5.15.0-1042-nvidia.

$ uname -a
Linux dgx 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Related issue (?)

I could only find one similar issue, though it is not exactly the same: microsoft/semantic-link-labs#909

Component(s)

Parquet, Python
