apache · JFinis · Nov 10, 2023 · Feb 11, 2026 · Mar 23, 2026 · rdblue
diff --git a/LogicalTypes.md b/LogicalTypes.md
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 
 The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+Like `FLOAT` and `DOUBLE`, the sort order for `FLOAT16` is signed with special
+handling for NaNs and signed zeros. Writers should use IEEE754TotalOrder for
+consistent handling of these edge cases. See the `ColumnOrder` union in the
+[Thrift definition](src/main/thrift/parquet.thrift) for details.
 
 ## Temporal Types
 

diff --git a/README.md b/README.md
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
 Parquet stores min/max statistics at several levels (such as Column Chunk,
 Column Index, and Data Page). These statistics are according to a sort order,
 which is defined for each column in the file footer. Parquet supports common
-sort orders for logical and primitive types. The details are documented in the
+sort orders for logical and primitive types and also special orders for types
+with potentially ambiguous semantics (e.g., NaN ordering for floating point
+types). The details are documented in the
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
@@ -309,6 +309,13 @@ struct Statistics {
    7: optional bool is_max_value_exact;
    /** If true, min_value is the actual minimum value for a column */
    8: optional bool is_min_value_exact;
+   /**
+    * Count of NaN values in the column; only present if physical type is FLOAT
+    * or DOUBLE, or logical type is FLOAT16.
+    * If this field is not present, readers MUST assume NaNs may be present
+    * (i.e. MUST assume nan_count > 0 and MAY NOT assume nan_count == 0).
+    */
+   9: optional i64 nan_count;
 }
 
 /** Empty structs to use as logical type annotations */
@@ -1050,6 +1057,9 @@ struct RowGroup {
 /** Empty struct to signal the order defined by the physical or logical type */
 struct TypeDefinedOrder {}
 
+/** Empty struct to signal IEEE 754 total order for floating point types */
+struct IEEE754TotalOrder {}
+
 /**
  * Union to specify the order used for the min_value and max_value fields for a
  * column. This union takes the role of an enhanced enum that allows rich
@@ -1058,6 +1068,7 @@ struct TypeDefinedOrder {}
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
  *                      physical type (if there is no logical type).
+ * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
  *
  * If the reader does not support the value of this union, min and max stats
  * for this column should be ignored.
@@ -1111,23 +1122,99 @@ union ColumnOrder {
    *      64-bit signed integer (nanos)
    *    See https://github.com/apache/parquet-format/issues/502 for more details
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because TYPE_ORDER is ambiguous for floating point types due to
+   *     underspecified handling of NaN and -0/+0, it is recommended that writers
+   *     use IEEE_754_TOTAL_ORDER for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-null
+   *       values are NaN.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - Always set the nan_count field for floating point types, even if
+   *       it is zero.
+   *     - NaNs should not be written to min or max statistics fields except
+   *       in the column index when a page contains only NaN values. In this
+   *       case, since min_values and max_values are required, a NaN value
+   *       must be written.
    *     - If the computed max value is zero (whether negative or positive),
    *       `+0.0` should be written into the max statistics field.
    *     - If the computed min value is zero (whether negative or positive),
    *       `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /*
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
+   *
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * The formal definition is as follows:
+   *   a) If x<y, totalOrder(x, y) is true.
+   *   b) If x>y, totalOrder(x, y) is false.
+   *   c) If x=y:
+   *     1) totalOrder(−0, +0) is true.
+   *     2) totalOrder(+0, −0) is false.
+   *     3) If x and y represent the same floating-point datum:
+   *        i) If x and y have negative sign, totalOrder(x, y) is true if and
+   *           only if the exponent of x ≥ the exponent of y
+   *       ii) otherwise totalOrder(x, y) is true if and only if the exponent
+   *           of x ≤ the exponent of y.
+   *   d) If x and y are unordered numerically because x or y is NaN:
+   *     1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
+   *        negative sign bit and y is a non-NaN floating-point number.
+   *     2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
+   *        positive sign bit and x is a non-NaN floating-point number.
+   *     3) If x and y are both NaNs, then totalOrder reflects a total ordering
+   *        based on:
+   *         i) negative sign orders below positive sign
+   *        ii) signaling orders below quiet for +NaN, reverse for −NaN
+   *       iii) lesser payload, when regarded as an integer, orders below
+   *            greater payload for +NaN, reverse for −NaN.
+   *
+   * Note that this ordering can be implemented efficiently in software by bit-wise
+   * operations on the integer representation of the floating point values.
+   * E.g., this is a possible implementation for DOUBLE in Rust:
+   *
+   *   pub fn totalOrder(x: f64, y: f64) -> bool {
+   *     let mut x_int = x.to_bits() as i64;
+   *     let mut y_int = y.to_bits() as i64;
+   *     x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
+   *     y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
+   *     return x_int <= y_int;
+   *   }
+   *
+   * When writing statistics for columns with this order, the following rules
+   * must be followed:
+   * - Writing the nan_count field is mandatory when using this ordering.
+   * - Min and max statistics must contain the smallest and largest non-NaN
+   *   values respectively, or if all non-null values are NaN, the smallest and
+   *   largest NaN values as defined by IEEE 754 total order.
+   *
+   * When reading statistics for columns with this order, the following rules
+   * should be followed:
+   * - Readers should consult the nan_count field to determine whether NaNs
+   *   are present.
+   * - A reader can compute nan_count + null_count == num_values to deduce
+   *   whether all non-null values are NaN. In the page index, which does not
+   *   have a num_values field, the presence of a NaN value in min_values
+   *   or max_values indicates that all non-null values are NaN.
+   */
+  2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
 }
 
 struct PageLocation {
@@ -1199,6 +1286,18 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated before
    * using them by inspecting null_pages.
+   *
+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TYPE_ORDER, then no column index
+   *   must be written for this column chunk. While this is unfortunate for
+   *   performance, it is necessary to avoid conflict with legacy files that
+   *   still included NaN in min_values and max_values even if the page had
+   *   non-NaN values. To mitigate this, IEEE754_TOTAL_ORDER is recommended.
+   * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
+   *   and max_values[i] of that page must be set to the smallest and largest
+   *   NaN values as defined by IEEE 754 total order.
    */
   2: required list<binary> min_values
   3: required list<binary> max_values
@@ -1240,6 +1339,15 @@ struct ColumnIndex {
     * Same as repetition_level_histograms except for definitions levels.
     **/
    7: optional list<i64> definition_level_histograms;
+
+   /**
+    * A list containing the number of NaN values for each page. Only present
+    * for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16.
+    * If this field is not present, readers MUST assume that there might be
+    * NaN values in any page.
+    */
+   8: optional list<i64> nan_counts
+
 }
 
 struct AesGcmV1 {