Merged
37 changes: 4 additions & 33 deletions README.md
@@ -155,40 +155,11 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md

### Sort Order

Parquet stores min/max statistics at several levels (such as Column Chunk,
-Column Index and Data Page). Comparison for values of a type obey the
-following rules:
-
-1. Each logical type has a specified comparison order. If a column is
-   annotated with an unknown logical type, statistics may not be used
-   for pruning data. The sort order for logical types is documented in
-   the [LogicalTypes.md][logical-types] page.
-2. For primitive types, the following rules apply:
-
-   * BOOLEAN - false, true
-   * INT32, INT64 - Signed comparison.
-   * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
-     signed zeros. The details are documented in the
-     [Thrift definition](src/main/thrift/parquet.thrift) in the
-     `ColumnOrder` union. They are summarized here but the Thrift definition
-     is considered authoritative:
-     * NaNs should not be written to min or max statistics fields.
-     * If the computed max value is zero (whether negative or positive),
-       `+0.0` should be written into the max statistics field.
-     * If the computed min value is zero (whether negative or positive),
-       `-0.0` should be written into the min statistics field.
-
-     For backwards compatibility when reading files:
-     * If the min is a NaN, it should be ignored.
-     * If the max is a NaN, it should be ignored.
-     * If the min is +0, the row group may contain -0 values as well.
-     * If the max is -0, the row group may contain +0 values as well.
-     * When looking for NaN values, min and max should be ignored.
-
-   * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
-     comparison.
-
+Column Index, and Data Page). These statistics are according to a sort order,
+which is defined for each column in the file footer. Parquet supports common
+sort orders for logical and primitive types. The details are documented in the
+[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
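
For reviewers checking that nothing is lost by moving the FLOAT/DOUBLE rules out of the README, the behaviour described in the removed text could be sketched roughly as below. This is only an illustration in Python, not code from any Parquet implementation: the function names `float_min_max_stats` and `may_contain` are hypothetical, and returning `None` when every value is NaN is an assumption about how a writer might "not write" statistics.

```python
import math
from typing import Iterable, Optional, Tuple

Stats = Optional[Tuple[float, float]]


def float_min_max_stats(values: Iterable[float]) -> Stats:
    """Writer side (hypothetical helper): compute (min, max) to store.

    NaNs are never written; a zero min is stored as -0.0 and a zero max
    as +0.0. Returns None (assumption) when no non-NaN value exists.
    """
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None
    lo, hi = min(non_nan), max(non_nan)
    if lo == 0.0:  # true for both +0.0 and -0.0
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi


def may_contain(stats: Stats, target: float) -> bool:
    """Reader side (hypothetical helper): can the chunk hold `target`?

    Follows the backwards-compatibility rules: NaN bounds are ignored,
    and min/max are ignored entirely when looking for NaN.
    """
    if stats is None or math.isnan(target):
        return True
    lo, hi = stats
    lower_ok = math.isnan(lo) or lo <= target
    upper_ok = math.isnan(hi) or target <= hi
    # IEEE comparison treats -0.0 == +0.0, so a min of +0 still admits -0
    # and a max of -0 still admits +0.
    return lower_ok and upper_ok
```

A reader using this sketch would skip a chunk only when `may_contain` returns `False`; every ambiguous case falls back to reading the data.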

## Nested Encoding
To encode nested columns, Parquet uses the Dremel encoding with definition and
5 changes: 3 additions & 2 deletions src/main/thrift/parquet.thrift
@@ -313,12 +313,12 @@ struct Statistics {

/** Empty structs to use as logical type annotations */
struct StringType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
-struct UUIDType {} // allowed for FIXED[16], must encoded raw UUID bytes
+struct UUIDType {} // allowed for FIXED[16], must be encoded as raw UUID bytes
struct MapType {} // see LogicalTypes.md
struct ListType {} // see LogicalTypes.md
struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
struct DateType {} // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md)

/**
* Logical type to annotate a column that is always null.
@@ -1057,6 +1057,7 @@ union ColumnOrder {
* UINT64 - unsigned comparison
* DECIMAL - signed comparison of the represented value
* DATE - signed comparison
+* FLOAT16 - signed comparison of the represented value (*)
* TIME_MILLIS - signed comparison
* TIME_MICROS - signed comparison
* TIMESTAMP_MILLIS - signed comparison
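
To illustrate why the new FLOAT16 entry says "signed comparison of the represented value" rather than inheriting the byte-wise order of its FIXED[2] physical type, here is a small Python sketch. It assumes the two bytes hold a little-endian IEEE 754 half-precision value; `float16_value` and `float16_less` are hypothetical names, and the NaN and signed-zero handling that the `(*)` footnote refers to is deliberately left out.

```python
import struct


def float16_value(raw: bytes) -> float:
    """Decode 2 raw bytes as a half-precision float (little-endian assumed)."""
    if len(raw) != 2:
        raise ValueError("FLOAT16 values must be exactly 2 bytes")
    return struct.unpack("<e", raw)[0]


def float16_less(a: bytes, b: bytes) -> bool:
    """Order two FLOAT16 values by the numbers they represent."""
    return float16_value(a) < float16_value(b)


minus_one = struct.pack("<e", -1.0)
one = struct.pack("<e", 1.0)

# Comparing represented values gives the expected order ...
by_value = float16_less(minus_one, one)
# ... whereas unsigned byte-wise comparison of the raw bytes does not,
# because the sign bit makes the encoding of -1.0 compare as larger.
by_bytes = minus_one < one
```

With these inputs `by_value` is `True` while `by_bytes` is `False`, which is the mismatch a dedicated sort order for FLOAT16 avoids.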