
webgpu/rans: advance split shared-table path and decode hotspot pass #97

Merged

ChrisLundquist merged 6 commits into master from codex/advance-roadmap on Feb 20, 2026

Conversation

@ChrisLundquist
Owner

Summary

  • deepen async overlap for WebGPU rANS batched paths and retune profile defaults
  • add nvCOMP-style independent-block split probe, shared-table split encode/decode, and packed shared-table decode gating
  • remove decode prep hotspots in src/webgpu/rans.rs by replacing repeated per-lane packed writes with bulk u16 copy/pack and trimming shared-table setup overhead
  • add generated profiling/samply artifacts and update active plan/probe docs

Validation

  • cargo fmt --check
  • cargo clippy --features webgpu -- -D warnings
  • cargo test --features webgpu batched

Performance Notes

  • sampled split-decode hotspots (write_packed_u16_slice, simd::avx2::byte_frequencies) no longer appear among the top rows in follow-up captures
  • throughput remains variable under current host contention; the default path held steady around 69 MB/s in the hotspot reruns, while split paths varied

Chris Lundquist added 6 commits February 19, 2026 20:56
- raise rANS batched in-flight ring cap to 8 (memory-bounded)
- add batched decode completion/readback to reduce per-output submit/map/poll overhead
- retune profile default GPU batch to 6 after post-change sweeps
- refresh active roadmap status and add 2026-02-20 baseline/perf artifacts
…dback
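The in-flight ring cap mentioned in the first commit can be sketched as follows. This is a minimal illustration with hypothetical names (the real structure lives in the batched WebGPU paths, not shown here): keep at most 8 batched submissions outstanding, draining the oldest when a new one would exceed the cap.

```rust
use std::collections::VecDeque;

// Hypothetical cap matching the commit's "in-flight ring cap of 8";
// the real value and naming in src/webgpu/rans.rs may differ.
const IN_FLIGHT_CAP: usize = 8;

struct InFlightRing<T> {
    pending: VecDeque<T>,
}

impl<T> InFlightRing<T> {
    fn new() -> Self {
        Self { pending: VecDeque::with_capacity(IN_FLIGHT_CAP) }
    }

    /// Submit new work. If the ring is already full, returns the oldest
    /// entry so the caller can complete/read it back first, keeping at
    /// most IN_FLIGHT_CAP submissions in flight (memory-bounded overlap).
    fn submit(&mut self, work: T) -> Option<T> {
        let drained = if self.pending.len() == IN_FLIGHT_CAP {
            self.pending.pop_front()
        } else {
            None
        };
        self.pending.push_back(work);
        drained
    }
}
```

The point of the cap is that GPU submissions can overlap host-side readback without unbounded buffer growth; the drain-on-full shape also naturally amortizes map/poll work across the ring.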

- add --rans-independent-block-bytes to the profile harness for nvCOMP-style split-stage runs
- route split-mode stage profiling through batched GPU encode/decode over per-block payloads
- batch rANS encode readback/completion to amortize map/poll overhead across ring-drained work
- record split probe results and update the active execution plan notes
- add batched shared-table API for chunked payload encode to avoid per-block normalization in split mode
- wire profile split path to seed one normalized table from full input
- add WebGPU round-trip test for shared-table batched encode path
- capture new 1MB split artifacts and update plan/probe notes; result is mixed and does not clear the perf gate
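The "seed one normalized table from full input" idea above can be sketched like this. These are hypothetical scalar helpers (the repo's profiled path uses simd::avx2::byte_frequencies; the real normalization in src/webgpu/rans.rs may differ): count byte frequencies once over the whole input, normalize them to a power-of-two total, and reuse that single table for every split block instead of re-normalizing per block.

```rust
/// Count byte frequencies over the full input (scalar sketch of what the
/// SIMD path does).
fn byte_frequencies(data: &[u8]) -> [u32; 256] {
    let mut freq = [0u32; 256];
    for &b in data {
        freq[b as usize] += 1;
    }
    freq
}

/// Normalize raw counts so they sum to `1 << scale_bits`, keeping every
/// present symbol at count >= 1 (required for a valid rANS table).
/// Assumes non-empty input; rounding slack is pushed onto the most
/// frequent symbol.
fn normalize(freq: &[u32; 256], scale_bits: u32) -> [u32; 256] {
    let total: u64 = freq.iter().map(|&f| f as u64).sum();
    let target = 1u64 << scale_bits;
    let mut norm = [0u32; 256];
    let mut assigned = 0u64;
    for i in 0..256 {
        if freq[i] > 0 {
            let scaled = ((freq[i] as u64 * target) / total).max(1);
            norm[i] = scaled as u32;
            assigned += scaled;
        }
    }
    let max_i = (0..256).max_by_key(|&i| norm[i]).unwrap();
    norm[max_i] = (norm[max_i] as i64 + target as i64 - assigned as i64) as u32;
    norm
}
```

Encoding every block against one table trades a slightly worse fit per block for skipping per-block normalization and per-block GPU table uploads, which is exactly the trade the split-mode probe was measuring.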
- add batched shared-table decode API to reuse one GPU table across split payload batches
- wire split decode profiling to the new shared-table path
- add WebGPU round-trip coverage for shared-table batched decode
- rerun 1MB split profiles (256KB/64KB, encode+decode) and record rerun artifacts
- update probe and active plan with the confirmed outcome: table reuse still does not clear split-mode perf gate
- add packed shared-table decode submission path that consolidates split payloads into one dispatch/readback cycle
- gate packed decode to higher split payload counts to avoid regressions on small split sets
- refactor chunked payload decode parsing into reusable host-side parser
- add tests covering packed-path round-trip and mixed-lane fallback behavior
- record gated packed-decode profiling artifacts and update active plan/probe notes
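The gating described in this commit can be sketched as a simple plan selector. The threshold and names below are hypothetical (the actual gate in src/webgpu/rans.rs is not quoted in this PR): consolidate split payloads into one packed dispatch/readback only when there are enough of them to amortize the packing work, otherwise keep the per-payload path.

```rust
// Hypothetical threshold; the real gate value is not stated in the PR.
const PACKED_DECODE_MIN_PAYLOADS: usize = 8;

enum DecodePlan<'a> {
    /// One consolidated dispatch/readback cycle over all split payloads.
    Packed(&'a [Vec<u8>]),
    /// Small split sets: skip packing to avoid regressions.
    PerPayload(&'a [Vec<u8>]),
}

fn choose_decode_plan(payloads: &[Vec<u8>]) -> DecodePlan<'_> {
    if payloads.len() >= PACKED_DECODE_MIN_PAYLOADS {
        DecodePlan::Packed(payloads)
    } else {
        DecodePlan::PerPayload(payloads)
    }
}
```

Keeping the fallback path is what the "mixed-lane fallback" tests in the commit exercise: the packed path only wins once its fixed consolidation cost is spread across enough payloads.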
Replace per-lane packed writes with bulk u16 copy/pack in decode prep, and avoid per-call seed frequency counting in shared-table split decode setup.

Add hotspot-pass profiling artifacts plus plan/probe notes documenting symbol-level wins and remaining split decode variance.
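The per-lane-vs-bulk change in the hotspot pass can be illustrated as follows. Function names are hypothetical stand-ins for the repo's write_packed_u16_slice-adjacent code: the old shape issued one small packed write per u16 value; the new shape reserves once and packs the whole slice in a single pass the optimizer can turn into wide copies.

```rust
/// Per-value path (old shape): one two-byte append per element, so the
/// hot loop is dominated by many small writes.
fn write_packed_u16_per_value(values: &[u16], out: &mut Vec<u8>) {
    for &v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

/// Bulk path (new shape): reserve the full output once, then emit all
/// little-endian bytes in a single flat pass.
fn write_packed_u16_bulk(values: &[u16], out: &mut Vec<u8>) {
    out.reserve(values.len() * 2);
    out.extend(values.iter().flat_map(|v| v.to_le_bytes()));
}
```

Both produce identical bytes; the win claimed in the PR is purely in how the writes are batched, which is why the symbol dropped out of the sampled top rows rather than the output changing.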
@ChrisLundquist ChrisLundquist merged commit 34a4ea0 into master Feb 20, 2026
5 checks passed
