How to enable statistics for string columns? #5270

@jonashaag

Describe the bug

I'm using odbc2parquet (https://github.com/pacman82/odbc2parquet), which is built on this crate.

I observe that statistics like min/max are not written for string columns:

In [4]: pq.ParquetFile("/tmp/o2p").metadata.row_group(0).column(1)
Out[4]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x1033c1080>
  file_offset: 1123
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 100
  path_in_schema: XXX
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x103476070>
      has_min_max: False
      min: None
      max: None
      null_count: None
      distinct_count: None
      num_values: 100
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: ZSTD
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 394
  data_page_offset: 938
  total_compressed_size: 729
  total_uncompressed_size: 2993

Relevant code: https://github.com/pacman82/odbc2parquet/blob/b571cad6fae1b58e1aab8348f14b32f20d6ec165/src/query/parquet_writer.rs#L47
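
To check every column chunk in a file rather than a single one, the inspection above can be generalized. This is a minimal sketch, assuming only the pyarrow metadata attributes already visible in the output (`statistics`, `has_min_max`, `path_in_schema`) plus `num_row_groups` and `num_columns`; the function names `columns_missing_min_max` and `check_file` are hypothetical helpers, not part of pyarrow or odbc2parquet.

```python
def columns_missing_min_max(metadata):
    """Return (row_group_index, column_path) pairs lacking min/max statistics.

    `metadata` is expected to behave like pyarrow's FileMetaData,
    e.g. pq.ParquetFile(path).metadata.
    """
    missing = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            stats = chunk.statistics  # None when no statistics were written
            if stats is None or not stats.has_min_max:
                missing.append((rg_index, chunk.path_in_schema))
    return missing


def check_file(path):
    """Hypothetical convenience wrapper around a Parquet file on disk."""
    # Imported lazily so the helper above has no hard pyarrow dependency.
    import pyarrow.parquet as pq

    return columns_missing_min_max(pq.ParquetFile(path).metadata)
```

Calling `check_file("/tmp/o2p")` on the file from this report should list the string column among the results, since its `has_min_max` is `False`.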

To Reproduce

Use odbc2parquet to download any table that contains a string column, then inspect the column chunk metadata of the resulting file.

Expected behavior

String columns should have min/max statistics written.

Additional context

Labels: bug, parquet (Changes to the parquet crate)