
docs: GPU strategy doc and north-star plan update #102

Merged
ChrisLundquist merged 3 commits into master from claude/docs-gpu-strategy
Mar 2, 2026

@ChrisLundquist
Owner

Summary

  • Add docs/design-docs/gpu-strategy.md — a single document capturing the GPU compression strategy based on empirical results across the project's history
  • Update PLAN-unified-scheduler-north-star.md to reflect current reality (Phase 1 done but perf gate failed; Phase 3 done via PR #101, "Unify GPU schedulers into a single coordinator thread")

What the strategy doc covers

  • What GPU is good at: LZ77 cooperative-stitch matching (1,788 parallel probes, 94% quality)
  • Why hash tables failed on GPU: atomic ordering destroys match quality (6.25% on repetitive data)
  • Compression ratio gap: PZ-LZR 41% vs gzip 29% — primarily a match-finding quality gap, not entropy
  • What GPU is bad at: entropy coding (rANS at 0.77x CPU encode, 0.54x decode — serial state machine bottleneck)
  • Current architecture: unified scheduler with GPU coordinator, try_send() deadlock prevention, GPU-to-CPU fallback
  • The FusedGpu problem: routing entropy to GPU is currently counterproductive
  • What would need to change: match quality improvements, on-device chaining (blocked on GPU entropy parity)
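
The try_send() deadlock prevention and GPU-to-CPU fallback listed above can be illustrated with a minimal Rust sketch. This is not the project's scheduler: `Block`, `Route`, `dispatch`, and `compress_on_cpu` are hypothetical names, and the bounded `std::sync::mpsc::sync_channel` queue is an assumption about how a coordinator thread might be fed. The point is only that a non-blocking send never parks the producer behind a full GPU queue, so the producer can fall back to the CPU instead of waiting.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Hypothetical work item; the real scheduler's block type is not shown in this PR.
#[derive(Debug)]
struct Block(Vec<u8>);

/// Where a block ended up (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Route {
    Gpu,
    CpuFallback,
}

/// Stand-in for the CPU compression path (hypothetical).
/// It just reports the input length instead of compressing anything.
fn compress_on_cpu(block: &Block) -> usize {
    block.0.len()
}

/// Offer a block to the GPU coordinator without blocking.
/// If the bounded queue is full, or the coordinator has gone away,
/// the block is handed back by `try_send` and processed on the CPU.
fn dispatch(gpu_tx: &SyncSender<Block>, block: Block) -> Route {
    match gpu_tx.try_send(block) {
        Ok(()) => Route::Gpu,
        Err(TrySendError::Full(block)) | Err(TrySendError::Disconnected(block)) => {
            let _ = compress_on_cpu(&block);
            Route::CpuFallback
        }
    }
}

fn main() {
    // Queue depth of 1 so the second send finds the queue full.
    let (gpu_tx, gpu_rx) = sync_channel::<Block>(1);
    let first = dispatch(&gpu_tx, Block(vec![0; 16]));
    let second = dispatch(&gpu_tx, Block(vec![1; 16]));
    println!("{:?} {:?}", first, second);
    drop(gpu_rx);
}
```

A blocking `send` in the same position would stall the producer whenever the coordinator is busy. If the coordinator were in turn waiting on that producer, neither side would make progress, which is the failure mode a non-blocking send avoids.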

North-star plan changes

  • Removed stale "Critical gap: No GPU rANS kernels" (they've existed since Feb 17)
  • Added status table: Phase 1 DONE (perf gate FAIL), Phase 3 DONE (PR #101, "Unify GPU schedulers into a single coordinator thread"), Phase 2/5 DEFERRED
  • Updated existing assets table with GPU rANS kernels
  • Added recommended next actions based on current evidence

Test plan

  • No code changes — docs only
  • Pre-commit hook passes (fmt, clippy, tests)

🤖 Generated with Claude Code

Add docs/design-docs/gpu-strategy.md documenting the GPU compression
strategy: what GPU is good at (LZ77 parallel probes), what it's bad at
(serial entropy coding), why hash tables failed, the compression ratio
gap vs gzip, and the current unified scheduler architecture.

Update PLAN-unified-scheduler-north-star.md to reflect reality:
- Phase 1 (GPU rANS): kernels exist but perf gate FAILED (0.77x CPU)
- Phase 3 (scheduler): DONE via PR #101 unified scheduler
- Remove "Critical gap: No GPU rANS kernels" (they have existed since Feb)
- Add status table and recommended next actions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist force-pushed the claude/docs-gpu-strategy branch from 040354d to eb061d0 on March 2, 2026 00:27
Chris Lundquist and others added 2 commits March 1, 2026 17:25
Key insights for future agents:
- Ratio gap vs gzip is encoding efficiency, not match quality
- GPU rANS kernels exist but are 0.77x CPU (don't re-implement)
- GPU wins on LZ77, loses on entropy — FusedGpu is counterproductive
- Per-stream frequency table overhead is real but small (~9% of gap)
- LzSeq is the right pipeline family for ratio improvements
- GPU hash tables don't work for LZ77 due to atomic ordering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document five approaches for future work:
1. dietGPU-style warp-per-segment rANS (proven, needs subgroup ops)
2. Huffman sync-point decode (simplest, plan exists)
3. Recoil-style arbitrary-position rANS decode (best ratio)
4. Sparse frequency tables (small but free win)
5. Match encoding improvements (zstd sequences, repeat offsets, etc.)

Includes comparison table, WebGPU feasibility notes, and risk assessment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit 8ff3b19 into master Mar 2, 2026
5 checks passed