webgpu/rans: advance split shared-table path and decode hotspot pass #97
Merged
ChrisLundquist merged 6 commits into master on Feb 20, 2026
added 6 commits
February 19, 2026 20:56
- raise rANS batched in-flight ring cap to 8 (memory-bounded)
- add batched decode completion/readback to reduce per-output submit/map/poll overhead
- retune profile default GPU batch to 6 after post-change sweeps
- refresh active roadmap status and add 2026-02-20 baseline/perf artifacts
…dback
- add --rans-independent-block-bytes to the profile harness for nvCOMP-style split-stage runs
- route split-mode stage profiling through batched GPU encode/decode over per-block payloads
- batch rANS encode readback/completion to amortize map/poll overhead across ring-drained work
- record split probe results and update the active execution plan notes
- add batched shared-table API for chunked payload encode to avoid per-block normalization in split mode
- wire profile split path to seed one normalized table from full input
- add WebGPU round-trip test for shared-table batched encode path
- capture new 1MB split artifacts and update plan/probe notes; result is mixed and does not clear the perf gate
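
For readers unfamiliar with the shared-table idea in this commit, here is a minimal host-side sketch of seeding one normalized frequency table from the full input and reusing it for every split block. It is not the code from this PR: `normalize_freqs`, `split_blocks`, the 12-bit `SCALE_BITS`, and the 4096-byte block size are all hypothetical names and values chosen for illustration.

```python
from collections import Counter

SCALE_BITS = 12  # assumed table precision; the PR does not state it


def normalize_freqs(data: bytes, scale_bits: int = SCALE_BITS) -> list[int]:
    """Scale byte counts so they sum to exactly 2**scale_bits.

    Every symbol present in the input keeps a frequency of at least 1,
    so any block cut from the same input can be coded with this table.
    """
    if not data:
        raise ValueError("cannot build a table from empty input")
    total = 1 << scale_bits
    freqs = [0] * 256
    for sym, count in Counter(data).items():
        freqs[sym] = max(1, count * total // len(data))
    # Rounding leaves the sum slightly off; nudge the largest entry
    # one step at a time until the table sums to the target.
    drift = total - sum(freqs)
    step = 1 if drift > 0 else -1
    while drift != 0:
        top = max(range(256), key=freqs.__getitem__)
        freqs[top] += step
        drift -= step
    return freqs


def split_blocks(data: bytes, block_bytes: int) -> list[bytes]:
    """Cut the input into independent fixed-size blocks."""
    return [data[i:i + block_bytes] for i in range(0, len(data), block_bytes)]


data = b"abracadabra" * 1000 + bytes(range(256))
shared_table = normalize_freqs(data)  # normalized once, from the full input
blocks = split_blocks(data, 4096)     # each block would reuse shared_table
```

The trade-off in this sketch is that one table skips per-block normalization but models each block less precisely than a table built from that block alone.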
- add batched shared-table decode API to reuse one GPU table across split payload batches
- wire split decode profiling to the new shared-table path
- add WebGPU round-trip coverage for shared-table batched decode
- rerun 1MB split profiles (256KB/64KB, encode+decode) and record rerun artifacts
- update probe and active plan with the confirmed outcome: table reuse still does not clear split-mode perf gate
- add packed shared-table decode submission path that consolidates split payloads into one dispatch/readback cycle
- gate packed decode to higher split payload counts to avoid regressions on small split sets
- refactor chunked payload decode parsing into reusable host-side parser
- add tests covering packed-path round-trip and mixed-lane fallback behavior
- record gated packed-decode profiling artifacts and update active plan/probe notes
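
The gating described in this commit can be pictured with a small host-side sketch: consolidate payloads into one buffer with an offset table when there are enough of them, otherwise fall back to per-payload submission. The threshold `PACKED_MIN_PAYLOADS`, the 4-byte alignment, and the function names are assumptions for illustration; the PR does not state the actual cutoff or buffer layout.

```python
PACKED_MIN_PAYLOADS = 8  # hypothetical cutoff; the real gate value is not given
ALIGN = 4                # assumed alignment for offsets inside the packed buffer


def pack_payloads(payloads: list[bytes]) -> tuple[bytes, list[tuple[int, int]]]:
    """Concatenate payloads into one buffer and record (offset, length) spans."""
    buf = bytearray()
    spans = []
    for payload in payloads:
        spans.append((len(buf), len(payload)))
        buf += payload
        buf += b"\x00" * (-len(buf) % ALIGN)  # pad so the next offset is aligned
    return bytes(buf), spans


def plan_decode(payloads: list[bytes]):
    """Choose the packed path only when the split set is large enough."""
    if len(payloads) >= PACKED_MIN_PAYLOADS:
        return "packed", pack_payloads(payloads)
    return "per_payload", None


payloads = [bytes([i]) * (i + 1) for i in range(10)]
mode, packed = plan_decode(payloads)
```

With ten payloads the sketch takes the packed path; a set of two would fall back to per-payload submission, mirroring the intent of avoiding regressions on small split sets.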
Replace per-lane packed writes with bulk u16 copy/pack in decode prep, and avoid per-call seed frequency counting in shared-table split decode setup. Add hotspot-pass profiling artifacts plus plan/probe notes documenting symbol-level wins and remaining split decode variance.
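
As a rough illustration of the per-lane versus bulk distinction in this commit, the sketch below packs the same u16 values two ways and produces identical bytes. It is not the decode-prep code from the PR; `pack_per_lane`, `pack_bulk`, and the little-endian layout are assumptions.

```python
import struct


def pack_per_lane(values: list[int]) -> bytes:
    """Write each u16 individually into a preallocated buffer."""
    buf = bytearray(2 * len(values))
    for i, value in enumerate(values):
        struct.pack_into("<H", buf, 2 * i, value)
    return bytes(buf)


def pack_bulk(values: list[int]) -> bytes:
    """Pack all u16 values in a single call (little-endian assumed)."""
    return struct.pack(f"<{len(values)}H", *values)


lanes = [0, 1, 0x1234, 0xFFFF]
```

Both functions yield the same byte layout; the bulk form replaces one call per lane with a single call over the whole slice.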
Summary
Validation
Performance Notes