
feat: Dictionary page pruning for row filter predicates #9574

Open
Dandandan wants to merge 7 commits into apache:main from Dandandan:dictionary-page-pruning

Conversation


@Dandandan Dandandan commented Mar 18, 2026

Closes: #9588

Summary

  • Adds dictionary page pruning for row filter predicates in the parquet reader
  • When evaluating predicates on dictionary-encoded columns, the predicate is first evaluated against dictionary values before decoding data pages
  • If no dictionary values match (AllFalse): skip the entire column chunk
  • If all dictionary values match (AllTrue): skip per-row predicate evaluation
  • Adds evaluate_dictionary method to ArrowPredicate trait with a default implementation that delegates to evaluate
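The trait extension in the last bullet can be sketched as follows. This is a simplified stand-in, not the real API: the actual `ArrowPredicate` in the parquet crate operates on Arrow `RecordBatch`es and returns `Result<BooleanArray>`, while the sketch uses plain slices to show only the delegation pattern.

```rust
// Hypothetical, simplified stand-in for the ArrowPredicate extension:
// a new method that evaluates against dictionary values, with a default
// implementation delegating to `evaluate` so existing predicates keep
// working unchanged.
trait ArrowPredicateSketch {
    /// Evaluate the predicate row-by-row.
    fn evaluate(&mut self, values: &[i64]) -> Vec<bool>;

    /// Evaluate against the (small) dictionary instead of the data pages.
    fn evaluate_dictionary(&mut self, dict_values: &[i64]) -> Vec<bool> {
        self.evaluate(dict_values)
    }
}

// A toy equality predicate, e.g. `CounterID = 62`.
struct Equals(i64);

impl ArrowPredicateSketch for Equals {
    fn evaluate(&mut self, values: &[i64]) -> Vec<bool> {
        values.iter().map(|v| *v == self.0).collect()
    }
}

fn main() {
    let mut p = Equals(62);
    // One evaluation over 3 distinct dictionary values instead of one
    // evaluation per row of decoded data.
    let mask = p.evaluate_dictionary(&[10, 62, 99]);
    assert_eq!(mask, vec![false, true, false]);
}
```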

Details

  • Supports BYTE_ARRAY (strings), INT32, and INT64 physical types
  • Only applies when all data pages use dictionary encoding (no plain fallback)
  • Uses column encoding metadata and page encoding stats to verify safety
  • Currently implemented for the async push decoder path

Benchmark Results (ClickBench async_object_store)

Query   Before   After    Change   Notes
Q19     2.57ms   1.66ms   -35%     CounterID=62; prunes 1 of 3 row groups
Q42     3.63ms   3.35ms   -8%      Same CounterID filter
Q36     17.3ms   16.7ms   -3%      CounterID + string predicates
Others  -        -        ~0%      No regressions

The optimization is most effective for selective equality filters on dictionary-encoded columns where the target value doesn't appear in some row groups' dictionaries.
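The pruning decision described above can be illustrated with a small stand-in: evaluate the predicate once over the dictionary's distinct values, then classify the resulting mask. The enum and function names here are hypothetical; the real code works on Arrow boolean filter masks inside the reader.

```rust
// Simplified stand-in for the dictionary-pruning decision.
#[derive(Debug, PartialEq)]
enum DictPruneResult {
    AllFalse, // no dictionary value matches: skip the whole column chunk
    AllTrue,  // every dictionary value matches: skip per-row evaluation
    Mixed,    // some match: fall back to normal row-level filtering
}

fn classify(dict_mask: &[bool]) -> DictPruneResult {
    if dict_mask.iter().all(|m| !*m) {
        DictPruneResult::AllFalse
    } else if dict_mask.iter().all(|m| *m) {
        DictPruneResult::AllTrue
    } else {
        DictPruneResult::Mixed
    }
}

fn main() {
    // e.g. predicate `CounterID = 62` over a row group whose dictionary
    // does not contain 62 at all: the chunk can be skipped entirely.
    let mask: Vec<bool> = [10i64, 17, 99].iter().map(|v| *v == 62).collect();
    assert_eq!(classify(&mask), DictPruneResult::AllFalse);
    assert_eq!(classify(&[true, false]), DictPruneResult::Mixed);
    assert_eq!(classify(&[true, true]), DictPruneResult::AllTrue);
}
```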

Test plan

  • Existing parquet tests pass
  • ClickBench benchmark verifies correctness (row counts match expected)
  • Tested with columns that have plain encoding fallback (correctly skips pruning)
  • Structure size test unchanged (no new state machine variants)

🤖 Generated with Claude Code

When evaluating row filter predicates on dictionary-encoded columns,
evaluate the predicate against dictionary values before decoding data
pages. If no dictionary values match (AllFalse), skip the entire column
chunk. If all dictionary values match (AllTrue), skip per-row predicate
evaluation entirely.

This optimization is most effective for selective equality filters
(e.g. `CounterID = 62`) on dictionary-encoded columns where the value
doesn't exist in some row groups' dictionaries.

Benchmark results on ClickBench (async_object_store):
- Q19 (CounterID=62, 3 predicates): -35% (2.57ms → 1.66ms)
- Q42 (CounterID=62, 2 predicates): -8% (3.63ms → 3.35ms)
- No regressions on other queries

Supports BYTE_ARRAY (strings), INT32, and INT64 physical types.
Only applies when all data pages are dictionary-encoded (no fallback).
Currently implemented for the async push decoder path only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 18, 2026
@Dandandan
Contributor Author

run benchmark arrow_reader_clickbench

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4085313791-433-jqc52 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing dictionary-page-pruning (d2d1d28) to 3931179 (merge-base) diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader_clickbench
BENCH_FILTER=
Results will be posted here when complete

@adriangbot

Benchmark for this request failed.

Last 20 lines of output:

    Updating crates.io index
     Locking 417 packages to latest compatible versions
      Adding generic-array v0.14.7 (available: v0.14.9)
      Adding lz4_flex v0.12.1 (available: v0.13.0)
      Adding matchit v0.8.4 (available: v0.8.6)
      Adding rand v0.9.2 (available: v0.10.0)
rustc 1.94.0 (4a4ef493e 2026-03-02)
d2d1d28adf23dcb75dbb2ae0c01b02b9410d05ac
393117979882e97a15125edd142c70a5e2c16386
Looking for ClickBench files starting in current_dir and all parent directories: "/workspace/arrow-rs-base/parquet"
    Finished `bench` profile [optimized] target(s) in 0.15s
     Running benches/arrow_reader_clickbench.rs (target/release/deps/arrow_reader_clickbench-783851a0a179a92e)
Could not find hits_1.parquet in directory or parents: "/workspace/arrow-rs-base/parquet". Download it via

wget --continue https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet

thread 'main' (15622) panicked at parquet/benches/arrow_reader_clickbench.rs:618:9:
Stopping
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: bench failed, to rerun pass `-p parquet --bench arrow_reader_clickbench`

@Dandandan
Contributor Author

run benchmark arrow_reader_clickbench

@adriangbot
Copy link

🤖 Arrow criterion benchmark running (GKE) | trigger
Linux bench-c4085358733-437-vzs9l 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing dictionary-page-pruning (fe5dc07) to 88422cb (merge-base) diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader_clickbench
BENCH_FILTER=
Results will be posted here when complete

@alamb
Contributor

alamb commented Mar 18, 2026

YAAAS

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Details

group                                             dictionary-page-pruning                main
-----                                             -----------------------                ----
arrow_reader_clickbench/async/Q1                  1.00   1087.0±6.86µs        ? ?/sec    1.01   1093.3±3.29µs        ? ?/sec
arrow_reader_clickbench/async/Q10                 1.00      6.7±0.03ms        ? ?/sec    1.01      6.8±0.05ms        ? ?/sec
arrow_reader_clickbench/async/Q11                 1.00      7.7±0.03ms        ? ?/sec    1.01      7.8±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q12                 1.00     14.3±0.04ms        ? ?/sec    1.03     14.7±0.11ms        ? ?/sec
arrow_reader_clickbench/async/Q13                 1.00     16.9±0.06ms        ? ?/sec    1.03     17.4±0.10ms        ? ?/sec
arrow_reader_clickbench/async/Q14                 1.00     15.8±0.04ms        ? ?/sec    1.02     16.2±0.08ms        ? ?/sec
arrow_reader_clickbench/async/Q19                 1.00  1976.5±21.72µs        ? ?/sec    1.55      3.1±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q20                 1.07     82.6±9.34ms        ? ?/sec    1.00     77.1±4.56ms        ? ?/sec
arrow_reader_clickbench/async/Q21                 1.00     80.2±0.19ms        ? ?/sec    1.08     86.8±0.65ms        ? ?/sec
arrow_reader_clickbench/async/Q22                 1.00    122.8±4.28ms        ? ?/sec    1.06    130.3±4.26ms        ? ?/sec
arrow_reader_clickbench/async/Q23                 1.00    237.4±0.83ms        ? ?/sec    1.05    248.7±1.09ms        ? ?/sec
arrow_reader_clickbench/async/Q24                 1.00     19.3±0.11ms        ? ?/sec    1.03     19.9±0.17ms        ? ?/sec
arrow_reader_clickbench/async/Q27                 1.00     57.1±0.19ms        ? ?/sec    1.02     58.3±0.21ms        ? ?/sec
arrow_reader_clickbench/async/Q28                 1.00     57.0±0.18ms        ? ?/sec    1.05     59.6±0.28ms        ? ?/sec
arrow_reader_clickbench/async/Q30                 1.00     18.4±0.07ms        ? ?/sec    1.02     18.8±0.09ms        ? ?/sec
arrow_reader_clickbench/async/Q36                 1.00     14.6±0.16ms        ? ?/sec    1.03     15.1±0.14ms        ? ?/sec
arrow_reader_clickbench/async/Q37                 1.12      6.1±0.03ms        ? ?/sec    1.00      5.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async/Q38                 1.00     12.9±0.13ms        ? ?/sec    1.05     13.5±0.15ms        ? ?/sec
arrow_reader_clickbench/async/Q39                 1.00     23.8±0.22ms        ? ?/sec    1.04     24.9±0.32ms        ? ?/sec
arrow_reader_clickbench/async/Q40                 1.00      5.3±0.03ms        ? ?/sec    1.08      5.7±0.04ms        ? ?/sec
arrow_reader_clickbench/async/Q41                 1.00      4.6±0.02ms        ? ?/sec    1.07      5.0±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q42                 1.00      3.2±0.02ms        ? ?/sec    1.09      3.5±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q1     1.00   1067.9±6.11µs        ? ?/sec    1.00   1068.9±3.07µs        ? ?/sec
arrow_reader_clickbench/async_object_store/Q10    1.00      6.5±0.02ms        ? ?/sec    1.01      6.6±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q11    1.00      7.5±0.05ms        ? ?/sec    1.01      7.6±0.05ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q12    1.00     14.3±0.05ms        ? ?/sec    1.02     14.6±0.12ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q13    1.00     16.8±0.10ms        ? ?/sec    1.03     17.2±0.14ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q14    1.00     15.8±0.13ms        ? ?/sec    1.02     16.1±0.09ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q19    1.00  1884.4±16.61µs        ? ?/sec    1.55      2.9±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q20    1.00     70.7±0.22ms        ? ?/sec    1.01     71.7±0.42ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q21    1.00     79.1±0.22ms        ? ?/sec    1.02     80.8±0.54ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q22    1.00     98.2±1.92ms        ? ?/sec    1.00     98.6±0.58ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q23    1.00    217.3±0.63ms        ? ?/sec    1.08    234.1±1.15ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q24    1.00     19.1±0.11ms        ? ?/sec    1.01     19.3±0.13ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q27    1.00     56.1±0.20ms        ? ?/sec    1.02     57.1±0.46ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q28    1.00     56.1±0.17ms        ? ?/sec    1.03     57.9±0.72ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q30    1.00     18.2±0.06ms        ? ?/sec    1.02     18.5±0.11ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q36    1.00     14.6±2.10ms        ? ?/sec    1.00     14.6±0.15ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q37    1.12      6.0±0.02ms        ? ?/sec    1.00      5.4±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q38    1.00     12.5±0.15ms        ? ?/sec    1.02     12.8±0.15ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q39    1.00     22.9±0.19ms        ? ?/sec    1.02     23.4±0.30ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q40    1.00      5.1±0.03ms        ? ?/sec    1.06      5.5±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q41    1.00      4.5±0.03ms        ? ?/sec    1.06      4.8±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q42    1.00      3.2±0.02ms        ? ?/sec    1.08      3.4±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q1                   1.00    872.1±2.46µs        ? ?/sec    1.00    869.0±1.88µs        ? ?/sec
arrow_reader_clickbench/sync/Q10                  1.00      5.1±0.02ms        ? ?/sec    1.02      5.1±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q11                  1.00      6.0±0.03ms        ? ?/sec    1.02      6.1±0.08ms        ? ?/sec
arrow_reader_clickbench/sync/Q12                  1.00     21.6±0.14ms        ? ?/sec    1.04     22.6±0.64ms        ? ?/sec
arrow_reader_clickbench/sync/Q13                  1.02     29.5±0.12ms        ? ?/sec    1.00     28.9±0.81ms        ? ?/sec
arrow_reader_clickbench/sync/Q14                  1.14     26.8±0.37ms        ? ?/sec    1.00     23.4±0.08ms        ? ?/sec
arrow_reader_clickbench/sync/Q19                  1.00      2.7±0.04ms        ? ?/sec    1.01      2.7±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q20                  1.00    120.5±0.23ms        ? ?/sec    1.01    122.0±1.54ms        ? ?/sec
arrow_reader_clickbench/sync/Q21                  1.00     96.4±0.11ms        ? ?/sec    1.02     98.0±0.14ms        ? ?/sec
arrow_reader_clickbench/sync/Q22                  1.00    142.4±0.67ms        ? ?/sec    1.02    145.1±0.34ms        ? ?/sec
arrow_reader_clickbench/sync/Q23                  1.00   276.9±14.50ms        ? ?/sec    1.02   281.8±13.65ms        ? ?/sec
arrow_reader_clickbench/sync/Q24                  1.00     26.9±0.07ms        ? ?/sec    1.02     27.6±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q27                  1.00    106.1±0.11ms        ? ?/sec    1.02    108.1±0.16ms        ? ?/sec
arrow_reader_clickbench/sync/Q28                  1.00    103.8±0.16ms        ? ?/sec    1.04    107.8±0.27ms        ? ?/sec
arrow_reader_clickbench/sync/Q30                  1.00     18.5±0.06ms        ? ?/sec    1.03     19.1±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q36                  1.00     22.5±0.20ms        ? ?/sec    1.00     22.4±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q37                  1.00      6.9±0.01ms        ? ?/sec    1.00      6.9±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q38                  1.00     11.4±0.05ms        ? ?/sec    1.00     11.5±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q39                  1.00     20.9±0.05ms        ? ?/sec    1.00     21.0±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q40                  1.00      5.1±0.02ms        ? ?/sec    1.02      5.2±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q41                  1.00      5.5±0.02ms        ? ?/sec    1.01      5.6±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q42                  1.00      4.3±0.02ms        ? ?/sec    1.00      4.3±0.02ms        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 786.3s
Peak memory 3.1 GiB
Avg memory 2.9 GiB
CPU user 706.4s
CPU sys 79.2s
Disk read 0 B
Disk write 846.8 MiB

branch

Metric Value
Wall time 776.9s
Peak memory 3.2 GiB
Avg memory 3.1 GiB
CPU user 705.8s
CPU sys 70.3s
Disk read 0 B
Disk write 171.2 MiB

@Dandandan
Contributor Author

Dandandan commented Mar 18, 2026

arrow_reader_clickbench/async_object_store/Q19 1.00 1884.4±16.61µs ? ?/sec 1.55 2.9±0.03ms ? ?/sec

🎉

We could also do this before loading the column chunks, I think (at the cost of an extra IO request, though it saves IO for some chunks) -
but that probably needs to be configurable.

Perhaps it also makes sense for other row filters to stop making IO requests small/sequential (especially on object storage): don't try to save IO for small/medium sized columns, but still prune to save CPU.

@alamb
Contributor

alamb commented Mar 18, 2026

We could (at the cost of extra IO request, but saving for some) also do this before loading the column chunks I think -
but probably needs to be some config.

I agree

- Use arrow type from ParquetField tree instead of hardcoded Utf8View
- Support Utf8, LargeUtf8, BinaryView string types
- Support Timestamp types for INT64 dictionary columns
- Skip nested/struct columns (only prune top-level primitives)
- Update snapshot for changed I/O pattern (AllTrue skips filter eval)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
" Row Group 1, column 'b': DictionaryPage (1617 bytes, 1 requests) [data]",
" Row Group 1, column 'b': DataPage(0) (113 bytes , 1 requests) [data]",
" Row Group 1, column 'b': DataPage(1) (126 bytes , 1 requests) [data]",
" Row Group 1, column 'b': MultiPage(dictionary_page: true, data_pages: [0, 1]) (1856 bytes, 1 requests) [data]",
Contributor Author

I was a bit surprised to see this change (it's more optimal).
It seems that when all values pass, it creates 3 requests?

Contributor

For anyone following along, I think @Dandandan fixed it here:

Contributor Author

Yes :)

I think for object storage it will still coalesce the ranges afterwards - but on a local filesystem it will issue a number of syscalls (which, since they happen one after another, should be less efficient than a single read anyway).
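The range coalescing mentioned here can be illustrated generically: object-store style readers typically merge byte ranges whose gaps are small enough, so several page reads become one request. This is a sketch of the idea, not the actual arrow-rs/object_store implementation; the function name and gap parameter are made up.

```rust
use std::ops::Range;

// Generic sketch of byte-range coalescing: merge sorted ranges whose gap
// is at most `max_gap` bytes, so adjacent page reads become one IO request.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            // Close enough to the previous range: extend it.
            Some(last) if r.start <= last.end + max_gap => {
                last.end = last.end.max(r.end);
            }
            // Otherwise start a new request.
            _ => out.push(r),
        }
    }
    out
}

fn main() {
    // Dictionary page + two data pages requested separately...
    let pages = vec![0..1617, 1617..1730, 1730..1856];
    // ...coalesce into a single contiguous request.
    assert_eq!(coalesce(pages, 0), vec![0..1856]);
}
```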

The test_row_numbers_with_multiple_row_groups_and_filter test used a
stateful, position-based predicate that broke when evaluate_dictionary
called evaluate on the dictionary values, advancing the internal offset
incorrectly. Replace it with a stateless, value-based filter (value % 2 != 0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
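The failure mode described in this commit can be shown with a toy predicate (hypothetical types, not the real test code): a predicate that keys off its internal position is silently skewed by the extra dictionary-values call, while a value-based one is not.

```rust
// Toy illustration: a stateful position-based predicate breaks when it is
// additionally invoked on dictionary values, because that call advances
// its internal offset, so later row-level calls see the wrong positions.
struct PositionBased {
    offset: usize,
}

impl PositionBased {
    // Intends to keep rows at even absolute positions.
    fn evaluate(&mut self, values: &[i64]) -> Vec<bool> {
        let mask = (0..values.len())
            .map(|i| (self.offset + i) % 2 == 0)
            .collect();
        self.offset += values.len();
        mask
    }
}

// A stateless value-based filter (like the replacement `value % 2 != 0`)
// gives the same answer no matter how often, or on what, it is called.
fn stateless(values: &[i64]) -> Vec<bool> {
    values.iter().map(|v| v % 2 != 0).collect()
}

fn main() {
    let mut p = PositionBased { offset: 0 };
    // An unexpected extra call on a 3-entry dictionary advances the offset...
    let _ = p.evaluate(&[7, 8, 9]);
    // ...so the first data page now starts at offset 3 instead of 0:
    assert_eq!(p.evaluate(&[1, 2]), vec![false, true]);
    // The stateless filter is unaffected by any extra calls.
    assert_eq!(stateless(&[1, 2]), vec![true, false]);
}
```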
@alamb
Contributor

alamb commented Mar 19, 2026

@Dandandan
Contributor Author

Is this similar to

No - this PR applies the predicate to the dictionary (which is small, and thus fast to evaluate) and avoids decompressing/decoding the data pages.

The linked PR #9464, as far as I can see, tries to reuse the booleans from the predicate by gathering them back onto the rows.
I think in the case where you only need to apply the row filter (and don't need the values) it could save materializing the values.

@Dandandan Dandandan marked this pull request as ready for review March 19, 2026 20:38
Dandandan and others added 2 commits March 19, 2026 21:38
Add Method 1a to is_all_dictionary_encoded that checks
col_meta.page_encoding_stats() (the full Vec<PageEncodingStats>) when
the mask form wasn't used, covering the case where
ParquetMetaDataOptions::with_encoding_stats_as_mask was set to false.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
// These are used for definition/repetition levels, not data
#[allow(deprecated)]
Encoding::RLE | Encoding::BIT_PACKED => {}
// PLAIN is ambiguous - used for def/rep levels in V1 pages AND
Contributor

@etseidl etseidl Mar 20, 2026

I'm not so sure about this one. So the way the encodings work is:

  • levels are RLE or BIT_PACKED (as noted above)
  • V1 dictionary uses the PLAIN_DICTIONARY variant for both the dictionary and data
  • V2 dictionary uses PLAIN for dictionary and RLE_DICTIONARY for data
  • all other encodings are for data

So if we see PLAIN_DICTIONARY and PLAIN, we can know that fallback has definitely occurred. In the case of RLE_DICTIONARY, it's a coin flip, but I'd err on the side of caution and return false in that case. In my experience fallback is quite common.

    let mut has_plain_dict_encoding = false;
    let mut has_plain_encoding = false;
    for enc in col_meta.encodings() {
        match enc {
            Encoding::PLAIN_DICTIONARY => has_plain_dict_encoding = true,
            // for RLE_DICT we can't know if fallback has occurred...be pessimistic
            Encoding::RLE_DICTIONARY => return false,
            Encoding::PLAIN => has_plain_encoding = true,
            // These are used for definition/repetition levels, not data
            #[allow(deprecated)]
            Encoding::RLE | Encoding::BIT_PACKED => {}
            // Any other encoding (DELTA_*, etc.) means non-dictionary data
            _ => return false,
        }
    }

    has_plain_dict_encoding && !has_plain_encoding

Contributor

or more simply:

    for enc in col_meta.encodings() {
        match enc {
            #[allow(deprecated)]
            // Either V1 dict encoded or level data
            Encoding::PLAIN_DICTIONARY | Encoding::RLE | Encoding::BIT_PACKED => {}
            // Any other encoding means non-dictionary data
            _ => return false,
        }
    }

    true
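The simpler check above can be exercised standalone with a stand-in `Encoding` enum (the real one lives in the parquet crate; the variant set here is abbreviated for illustration):

```rust
// Stand-in for parquet's Encoding enum, just enough to exercise the check.
enum Encoding {
    Plain,
    PlainDictionary,
    RleDictionary,
    Rle,
    BitPacked,
}

// Conservative "all data pages dictionary encoded" check, mirroring the
// simpler variant suggested in the review: only V1 dictionary encoding
// plus the level encodings are allowed; anything else (including the
// ambiguous V2 RLE_DICTIONARY) means possible fallback, so return false.
fn is_all_dictionary_encoded(encodings: &[Encoding]) -> bool {
    encodings.iter().all(|enc| {
        matches!(
            enc,
            Encoding::PlainDictionary | Encoding::Rle | Encoding::BitPacked
        )
    })
}

fn main() {
    use Encoding::*;
    // Pure V1 dictionary column: safe to prune via the dictionary.
    assert!(is_all_dictionary_encoded(&[PlainDictionary, Rle]));
    // PLAIN present: fallback may have occurred, so no pruning.
    assert!(!is_all_dictionary_encoded(&[PlainDictionary, Rle, Plain]));
    // V2 RLE_DICTIONARY is ambiguous: be pessimistic.
    assert!(!is_all_dictionary_encoded(&[RleDictionary, Rle]));
}
```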

Contributor

@etseidl etseidl left a comment

Thanks @Dandandan, this looks really cool. I'm happy to see the encodings mask being used :)

I'm not super up on the filtering bits, but they look correct to me. I just have a few questions, and think the all_dict test should be more conservative.


let physical_type = schema_descr.column(col_idx).physical_type();

// Only support BYTE_ARRAY and INT32/INT64 columns
Contributor

Just curious why not other physical types?

)),
}
}
_ => Ok(Arc::new(arrow_array::Int64Array::from(values))),
Contributor

UInt64Array?
