
webgpu/rans: advance split shared-table path and decode hotspot pass #97

Merged

ChrisLundquist merged 6 commits into master from codex/advance-roadmap on Feb 20, 2026

Conversation

@ChrisLundquist
Owner

Summary

  • deepen async overlap for WebGPU rANS batched paths and retune profile defaults
  • add nvCOMP-style independent-block split probe, shared-table split encode/decode, and packed shared-table decode gating
  • remove decode prep hotspots in src/webgpu/rans.rs by replacing repeated per-lane packed writes with bulk u16 copy/pack and trimming shared-table setup overhead
  • add generated profiling/samply artifacts and update active plan/probe docs

Validation

  • cargo fmt --check
  • cargo clippy --features webgpu -- -D warnings
  • cargo test --features webgpu batched

Performance Notes

  • sampled split-decode hotspots (write_packed_u16_slice, simd::avx2::byte_frequencies) no longer appear among the top rows in follow-up captures
  • throughput remains variable under current host contention; the default path held steady around 69 MB/s in the hotspot reruns, while split paths varied

Chris Lundquist added 6 commits February 19, 2026 20:56
- raise rANS batched in-flight ring cap to 8 (memory-bounded)
- add batched decode completion/readback to reduce per-output submit/map/poll overhead
- retune profile default GPU batch to 6 after post-change sweeps
- refresh active roadmap status and add 2026-02-20 baseline/perf artifacts
…dback
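The in-flight ring cap mentioned in the first commit can be sketched as follows. This is a minimal illustration with hypothetical names (the real structure lives in the batched WebGPU paths, not shown here): keep at most 8 batched submissions outstanding, draining the oldest when a new one would exceed the cap.

```rust
use std::collections::VecDeque;

// Hypothetical cap matching the commit's "in-flight ring cap of 8";
// the real value and naming in src/webgpu/rans.rs may differ.
const IN_FLIGHT_CAP: usize = 8;

struct InFlightRing<T> {
    pending: VecDeque<T>,
}

impl<T> InFlightRing<T> {
    fn new() -> Self {
        Self { pending: VecDeque::with_capacity(IN_FLIGHT_CAP) }
    }

    /// Submit new work. If the ring is already full, returns the oldest
    /// entry so the caller can complete/read it back first, keeping at
    /// most IN_FLIGHT_CAP submissions in flight (memory-bounded overlap).
    fn submit(&mut self, work: T) -> Option<T> {
        let drained = if self.pending.len() == IN_FLIGHT_CAP {
            self.pending.pop_front()
        } else {
            None
        };
        self.pending.push_back(work);
        drained
    }
}
```

The point of the cap is that GPU submissions can overlap host-side readback without unbounded buffer growth; the drain-on-full shape also naturally amortizes map/poll work across the ring.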

- add --rans-independent-block-bytes to the profile harness for nvCOMP-style split-stage runs
- route split-mode stage profiling through batched GPU encode/decode over per-block payloads
- batch rANS encode readback/completion to amortize map/poll overhead across ring-drained work
- record split probe results and update the active execution plan notes
- add batched shared-table API for chunked payload encode to avoid per-block normalization in split mode
- wire profile split path to seed one normalized table from full input
- add WebGPU round-trip test for shared-table batched encode path
- capture new 1MB split artifacts and update plan/probe notes; result is mixed and does not clear the perf gate
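The "seed one normalized table from full input" idea above can be sketched like this. These are hypothetical scalar helpers (the repo's profiled path uses simd::avx2::byte_frequencies; the real normalization in src/webgpu/rans.rs may differ): count byte frequencies once over the whole input, normalize them to a power-of-two total, and reuse that single table for every split block instead of re-normalizing per block.

```rust
/// Count byte frequencies over the full input (scalar sketch of what the
/// SIMD path does).
fn byte_frequencies(data: &[u8]) -> [u32; 256] {
    let mut freq = [0u32; 256];
    for &b in data {
        freq[b as usize] += 1;
    }
    freq
}

/// Normalize raw counts so they sum to `1 << scale_bits`, keeping every
/// present symbol at count >= 1 (required for a valid rANS table).
/// Assumes non-empty input; rounding slack is pushed onto the most
/// frequent symbol.
fn normalize(freq: &[u32; 256], scale_bits: u32) -> [u32; 256] {
    let total: u64 = freq.iter().map(|&f| f as u64).sum();
    let target = 1u64 << scale_bits;
    let mut norm = [0u32; 256];
    let mut assigned = 0u64;
    for i in 0..256 {
        if freq[i] > 0 {
            let scaled = ((freq[i] as u64 * target) / total).max(1);
            norm[i] = scaled as u32;
            assigned += scaled;
        }
    }
    let max_i = (0..256).max_by_key(|&i| norm[i]).unwrap();
    norm[max_i] = (norm[max_i] as i64 + target as i64 - assigned as i64) as u32;
    norm
}
```

Encoding every block against one table trades a slightly worse fit per block for skipping per-block normalization and per-block GPU table uploads, which is exactly the trade the split-mode probe was measuring.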
- add batched shared-table decode API to reuse one GPU table across split payload batches
- wire split decode profiling to the new shared-table path
- add WebGPU round-trip coverage for shared-table batched decode
- rerun 1MB split profiles (256KB/64KB, encode+decode) and record rerun artifacts
- update probe and active plan with the confirmed outcome: table reuse still does not clear split-mode perf gate
- add packed shared-table decode submission path that consolidates split payloads into one dispatch/readback cycle
- gate packed decode to higher split payload counts to avoid regressions on small split sets
- refactor chunked payload decode parsing into reusable host-side parser
- add tests covering packed-path round-trip and mixed-lane fallback behavior
- record gated packed-decode profiling artifacts and update active plan/probe notes
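The gating described in this commit can be sketched as a simple plan selector. The threshold and names below are hypothetical (the actual gate in src/webgpu/rans.rs is not quoted in this PR): consolidate split payloads into one packed dispatch/readback only when there are enough of them to amortize the packing work, otherwise keep the per-payload path.

```rust
// Hypothetical threshold; the real gate value is not stated in the PR.
const PACKED_DECODE_MIN_PAYLOADS: usize = 8;

enum DecodePlan<'a> {
    /// One consolidated dispatch/readback cycle over all split payloads.
    Packed(&'a [Vec<u8>]),
    /// Small split sets: skip packing to avoid regressions.
    PerPayload(&'a [Vec<u8>]),
}

fn choose_decode_plan(payloads: &[Vec<u8>]) -> DecodePlan<'_> {
    if payloads.len() >= PACKED_DECODE_MIN_PAYLOADS {
        DecodePlan::Packed(payloads)
    } else {
        DecodePlan::PerPayload(payloads)
    }
}
```

Keeping the fallback path is what the "mixed-lane fallback" tests in the commit exercise: the packed path only wins once its fixed consolidation cost is spread across enough payloads.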
Replace per-lane packed writes with bulk u16 copy/pack in decode prep, and avoid per-call seed frequency counting in shared-table split decode setup.

Add hotspot-pass profiling artifacts plus plan/probe notes documenting symbol-level wins and remaining split decode variance.
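The per-lane-vs-bulk change in the hotspot pass can be illustrated as follows. Function names are hypothetical stand-ins for the repo's write_packed_u16_slice-adjacent code: the old shape issued one small packed write per u16 value; the new shape reserves once and packs the whole slice in a single pass the optimizer can turn into wide copies.

```rust
/// Per-value path (old shape): one two-byte append per element, so the
/// hot loop is dominated by many small writes.
fn write_packed_u16_per_value(values: &[u16], out: &mut Vec<u8>) {
    for &v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

/// Bulk path (new shape): reserve the full output once, then emit all
/// little-endian bytes in a single flat pass.
fn write_packed_u16_bulk(values: &[u16], out: &mut Vec<u8>) {
    out.reserve(values.len() * 2);
    out.extend(values.iter().flat_map(|v| v.to_le_bytes()));
}
```

Both produce identical bytes; the win claimed in the PR is purely in how the writes are batched, which is why the symbol dropped out of the sampled top rows rather than the output changing.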
@ChrisLundquist ChrisLundquist merged commit 34a4ea0 into master Feb 20, 2026
5 checks passed
