Add zstd-style sequence encoding (LzSeq) #99
Merged
ChrisLundquist merged 23 commits into master on Feb 22, 2026
Conversation
New lzseq module with log2-based offset/length code tables and packed extra-bits bitstreams. Matches encode as (code byte + extra bits) instead of fixed-width u16+u16, making close matches cheaper and enabling future window expansion beyond 32KB. Integrated as Pipeline::LzSeqR (ID=8) with rANS entropy coding across 6 independent streams. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
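A minimal sketch of the code-plus-extra-bits idea described above, assuming pure log2 bucketing in the spirit of zstd's offset codes (the module's actual code tables may differ):

```rust
/// Hypothetical log2-based code assignment: the code byte is
/// floor(log2(v)), and the remaining low bits are sent as `code`
/// extra bits. A close match (small offset) therefore costs far
/// fewer bits than a fixed-width u16 offset.
fn encode_log2(v: u32) -> (u8, u32, u8) {
    assert!(v >= 1);
    let code = 31 - v.leading_zeros(); // floor(log2(v))
    let extra = v - (1u32 << code);    // bits below the top set bit
    (code as u8, extra, code as u8)    // (code byte, extra value, extra bit count)
}

fn decode_log2(code: u8, extra: u32) -> u32 {
    (1u32 << code) + extra
}

fn main() {
    // Offset 1000 falls in bucket 512..=1023: code 9, 9 extra bits.
    let (c, e, nbits) = encode_log2(1000);
    assert_eq!((c, e, nbits), (9, 488, 9));
    assert_eq!(decode_log2(c, e), 1000);
    // Offset 1: code 0, zero extra bits, so one code byte total.
    assert_eq!(encode_log2(1), (0, 0, 0));
}
```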
…ase 3) Parameterize HashChainFinder for larger windows via with_window() and find_match_wide() (u32 offsets). LzSeq pipeline defaults to 128KB window via SeqConfig, exposed as seq_window_size on CompressOptions. Existing 32KB-window pipelines (Deflate, Lzr, etc.) are completely unaffected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject short matches at large distances where code+extra-bits cost exceeds literal cost. Add min_profitable_length(offset) to lzseq with tiered thresholds (close=3, medium=4, far=5+). Integrate into lazy matching loop. Widen MatchCandidate to u32 offset/length for LzSeq support. Add distance-aware match_cost() to optimal CostModel using actual code+extra-bits costs instead of fixed overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
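The tiered threshold can be sketched as follows; the close=3 / medium=4 / far=5+ tiers come from the commit message, but the distance cutoffs (1 KB and 32 KB) are illustrative assumptions, since the real boundaries follow the code-plus-extra-bits cost table:

```rust
/// Hedged sketch of a distance-tiered minimum match length. A 3-byte
/// match at a far distance costs more to encode (offset code + many
/// extra bits) than emitting 3 literals, so it is rejected.
fn min_profitable_length(offset: u32) -> u32 {
    match offset {
        0..=1_024 => 3,       // close: cheap offset code, 3 bytes already pays off
        1_025..=32_768 => 4,  // medium: more extra bits, need 4 bytes
        _ => 5,               // far: long offset encoding, need 5+ bytes
    }
}

fn main() {
    assert_eq!(min_profitable_length(64), 3);
    assert_eq!(min_profitable_length(10_000), 4);
    assert_eq!(min_profitable_length(100_000), 5);
}
```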
Track last 3 offsets in RepeatOffsets state. Matches reusing a recent offset encode with code 0-2 (0 extra bits), saving the full offset encoding cost. Literal offsets shift to code 3+ in the offset_codes stream. Encoder checks repeat candidates alongside hash-chain matches, preferring repeats when they're competitive (within offset-encoding-cost bytes). Decoder maintains identical RepeatOffsets state for correct round-trip. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
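A sketch of the 3-slot repeat-offset history described above. The move-to-front update policy and initial values here are assumptions (zstd-like), not necessarily what the PR implements; the key property shown is that a repeat hit encodes as code 0-2 with no extra bits, while a miss shifts to a literal code at 3+:

```rust
const NUM_REPEAT_CODES: u32 = 3;

/// Last three match offsets, most recent first.
struct RepeatOffsets([u32; 3]);

impl RepeatOffsets {
    fn new() -> Self {
        RepeatOffsets([1, 4, 8]) // assumed initial values
    }
    /// Repeat code 0-2 if `offset` is in the history, else None.
    fn code_for(&self, offset: u32) -> Option<u32> {
        self.0.iter().position(|&o| o == offset).map(|i| i as u32)
    }
    /// Move-to-front update after emitting a match at `offset`.
    fn push(&mut self, offset: u32) {
        if let Some(i) = self.0.iter().position(|&o| o == offset) {
            self.0[..=i].rotate_right(1);
        } else {
            self.0 = [offset, self.0[0], self.0[1]];
        }
    }
}

/// Literal offsets shift past the repeat codes: code 3+.
fn literal_code(log2_code: u32) -> u32 {
    NUM_REPEAT_CODES + log2_code
}

fn main() {
    let mut reps = RepeatOffsets::new();
    reps.push(100);                          // new offset moves to front
    assert_eq!(reps.code_for(100), Some(0)); // repeat hit: code 0, zero extra bits
    assert_eq!(reps.code_for(1), Some(1));   // older history entry: code 1
    assert_eq!(reps.code_for(999), None);    // miss: encode as literal offset
    assert_eq!(literal_code(9), 12);         // log2 bucket 9 becomes code 12
}
```

The decoder must apply the identical update rule so both sides agree on what codes 0-2 mean at every position.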
- Add upfront validation of stream lengths in decode() to prevent panics on malformed input (flags, offset_codes, length_codes)
- Add bounds check before the copy loop to prevent unbounded allocation
- Fix optimal_parse to use match_token() instead of match_cost(), keeping the DP cost model consistent with LZ77's fixed encoding
- Return an is_repeat flag from select_best_match to avoid a redundant repeat-offset scan in the caller
- Remove debug eprintln from tests
- Gate decode_offset with #[cfg(test)] (only used in tests)
- Update module documentation for repeat offsets
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add overflow tracking to BitReader (bits_consumed vs total_bits) and check it in decode() to reject truncated extra-bits streams
- Add a debug_assert in unpack_flags for defense in depth
- Wire match_cost() into optimal_parse: distance-aware costing improves parse decisions for all pipelines (closer matches genuinely cost less)
- Document the match_cost() limitation: it uses raw offsets, not repeat-shifted ones
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
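The overflow-tracking idea can be sketched as below; the field and method names are illustrative stand-ins for the real BitReader, and the point is that a truncated stream is detected with one check after decoding instead of a panic in the hot loop:

```rust
/// Minimal sketch of an overflow-tracked bit reader: reads past the end
/// return 0 bits but are counted, and the caller rejects the stream if
/// more bits were consumed than exist.
struct BitReader<'a> {
    data: &'a [u8],
    bit_pos: usize,
    total_bits: usize,
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, bit_pos: 0, total_bits: data.len() * 8 }
    }

    /// Reads `n` bits LSB-first; silently pads with zeros past the end.
    fn read(&mut self, n: usize) -> u32 {
        let mut v = 0u32;
        for i in 0..n {
            let p = self.bit_pos + i;
            if p < self.total_bits {
                let bit = (self.data[p / 8] >> (p % 8)) & 1;
                v |= (bit as u32) << i;
            }
        }
        self.bit_pos += n;
        v
    }

    /// True if the stream was truncated relative to what decode needed.
    fn overflowed(&self) -> bool {
        self.bit_pos > self.total_bits
    }
}

fn main() {
    let mut r = BitReader::new(&[0b1010_1101]);
    assert_eq!(r.read(4), 0b1101); // low nibble, LSB-first
    assert!(!r.overflowed());
    r.read(8);                     // asks past the single byte
    assert!(r.overflowed());       // decode() would return an error here
}
```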
- ARCHITECTURE.md: add LzSeq to algorithms and pipelines
- QUALITY.md: add lzseq module (grade B) and lzseqr pipeline entries
- PLAN-competitive-roadmap: add LzSeq as Phase 2 Task 5 (complete)
- tech-debt-tracker, exec-plans index: update dates
- CLAUDE.md: add lzseq to project layout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 1d40f79 to 520232c
Add lzseq_demux.wgsl kernel that walks the GPU match buffer and produces all 6 LzSeq output streams (flags, literals, offset_codes, offset_extra, length_codes, length_extra) entirely on-device, eliminating the PCIe bottleneck of downloading 12 bytes/position from the match buffer.
Key design decisions:
- Single-thread serial kernel (@workgroup_size(1)), since the dedupe walk is inherently sequential (each token position depends on the prior match length)
- No repeat offsets on GPU — all offset codes are literal (shifted by NUM_REPEAT_CODES=3); the decoder handles this transparently
- Match lengths capped at u16::MAX (65535) for decode_length compatibility
- Distance-dependent min_profitable_length matches the CPU logic exactly
GPU pipeline: find_matches_to_device() → lzseq_demux → download 6 streams
- Reuses the input buffer from match finding (no double upload)
- Output buffer is driver zero-initialized (no host upload of zeros)
- extract_lzseq_streams validates all counter/offset bounds before slicing
- GpuMatchBuf now carries input_buf for downstream kernel reuse
Benchmark results (Canterbury corpus, 13.32 MB):
- LzSeqR achieves a 32.0% ratio at 22 MB/s end-to-end (CPU path)
- GPU demux eliminates one PCIe round-trip, but the overall GPU pipeline is still 8-42x slower than CPU due to per-block transfer overhead
- Full GPU benefit requires chaining rANS on-device (future Step 6)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire GPU rANS encoding into the LzSeqR pipeline's entropy stage using the batched API with ring-buffered submit/readback overlap. All 6 demuxed streams are encoded in a single batched GPU dispatch instead of 6 serial calls, amortizing GPU synchronization overhead. Small streams fall back to CPU basic rANS. The wire format is unchanged — the existing decoder handles GPU-encoded payloads transparently via RANS_INTERLEAVED_FLAG. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pack 64/num_lanes chunks into each 64-wide workgroup instead of wasting 93.75% of RDNA wave lanes (4 threads per 64-lane wave). Also reduce rANS chunk size from 1024 to 256 bytes for 4x more workgroups per stream. No compression ratio impact — chunk_size only affects rANS entropy coding granularity, not the LZ search window. Benchmarks: 22.3 MB/s compress (was 22.0), identical 32.0% ratio. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
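The dispatch arithmetic behind the wave-packing change can be illustrated as follows (a sketch of the occupancy math only, not the kernel itself): with 4 rANS lanes per chunk on a 64-wide workgroup, the unpacked layout uses 4/64 = 6.25% of the lanes, wasting the 93.75% quoted above, whereas packing 64/num_lanes chunks per workgroup fills every lane.

```rust
/// Workgroups needed for one stream when 64/num_lanes chunks are packed
/// into each 64-wide workgroup. Requires num_lanes to divide 64, which
/// is the condition the dispatch sites check before selecting the
/// packed kernel.
fn workgroups_needed(stream_len: usize, chunk_size: usize, num_lanes: usize) -> usize {
    assert_eq!(64 % num_lanes, 0, "packed kernel requires num_lanes | 64");
    let chunks = stream_len.div_ceil(chunk_size);
    let chunks_per_wg = 64 / num_lanes;
    chunks.div_ceil(chunks_per_wg)
}

fn main() {
    // 1 MB stream, 256-byte chunks, 4 lanes:
    // 4096 chunks / 16 chunks-per-workgroup = 256 workgroups.
    assert_eq!(workgroups_needed(1 << 20, 256, 4), 256);
    // The old 1024-byte chunks gave 4x fewer workgroups (less parallelism).
    assert_eq!(workgroups_needed(1 << 20, 1024, 4), 64);
}
```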
Mirror the encode-side wave-packing optimization for decode: pack 64/num_lanes chunks into each 64-wide workgroup. Both dispatch sites (rans_decode_chunked_gpu and rans_decode_chunked_gpu_with_chunk_meta) now select the packed kernel when num_lanes divides 64 evenly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Decode: replace byte-at-a-time match copy loop with extend_from_within (bulk memcpy for non-overlapping, offset-chunked for overlapping). Iterate packed flag bytes directly instead of allocating Vec<bool>. Profile: add lzseqr pipeline to the profiling harness. Profiling shows the decode bottleneck is the CPU rANS decode loop (stage 0), not LzSeq reconstruction (stage 1), so the match copy improvement has minimal end-to-end impact (~0% on bench.sh). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
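The bulk match copy can be sketched with std's `Vec::extend_from_within`: a non-overlapping match is one bulk copy, and an overlapping match (offset < length, e.g. run-like data) is copied in chunks that are each non-overlapping. This is a hedged reconstruction of the technique named in the commit, not the PR's exact code:

```rust
/// Append `length` bytes copied from `offset` back in `out`.
fn copy_match(out: &mut Vec<u8>, offset: usize, length: usize) {
    let start = out.len() - offset;
    if offset >= length {
        // Non-overlapping: a single bulk copy (memcpy under the hood).
        out.extend_from_within(start..start + length);
    } else {
        // Overlapping: the region from `start` is periodic with period
        // `offset`, so copying successively larger prefixes is correct,
        // and the copyable span doubles each iteration.
        let mut remaining = length;
        while remaining > 0 {
            let n = remaining.min(out.len() - start);
            out.extend_from_within(start..start + n);
            remaining -= n;
        }
    }
}

fn main() {
    let mut out = b"abcdef".to_vec();
    copy_match(&mut out, 6, 3); // non-overlapping
    assert_eq!(out, b"abcdefabc".to_vec());

    let mut run = b"x".to_vec();
    copy_match(&mut run, 1, 4); // overlapping single-byte run
    assert_eq!(run, b"xxxxx".to_vec());
}
```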
Documents why CPU-only LzSeqR (328 MB/s, 32% ratio) outperforms GPU-accelerated paths on AMD Radeon Pro 5500M due to PCIe overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds fuzz testing for all standalone algorithms (BWT, rANS, FSE, Huffman, LZ77, LZ78, LZSS, LzSeq, RLE, MTF) and both pipeline-level targets (roundtrip and crash-resistance decompress). Each target tests encode/decode roundtrip correctness and feeds arbitrary bytes to decode paths to verify crash resistance. Requires nightly + cargo-fuzz to run. Completes milestone M5.3 (12/12 milestones). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
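The property each target checks can be shown self-contained; a trivial byte-stuffing codec stands in here for the real algorithms (rANS, BWT, etc.), which are not reproduced, and in the actual fuzz targets this body sits inside a libfuzzer `fuzz_target!` closure:

```rust
/// Stand-in encoder: escape 0xFF by doubling it (purely illustrative).
fn encode(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for &b in data {
        out.push(b);
        if b == 0xFF {
            out.push(0xFF);
        }
    }
    out
}

/// Stand-in decoder: returns None on malformed input instead of panicking.
fn decode(data: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < data.len() {
        if data[i] == 0xFF {
            if i + 1 >= data.len() || data[i + 1] != 0xFF {
                return None; // truncated or invalid escape
            }
            out.push(0xFF);
            i += 2;
        } else {
            out.push(data[i]);
            i += 1;
        }
    }
    Some(out)
}

/// The two properties each fuzz target asserts on arbitrary bytes.
fn roundtrip_property(data: &[u8]) {
    // 1. Roundtrip correctness: decode(encode(x)) == x.
    assert_eq!(decode(&encode(data)).as_deref(), Some(data));
    // 2. Crash resistance: decoding raw bytes may fail, but must not panic.
    let _ = decode(data);
}

fn main() {
    roundtrip_property(b"hello \xFF world");
    assert_eq!(decode(&[0xFF]), None); // truncated escape is rejected
}
```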
Update ARCHITECTURE.md, QUALITY.md, and tech-debt-tracker to reflect completed fuzz testing infrastructure (12 targets, all algorithms and pipelines). 24h CI campaign still pending. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-allocate the full output buffer and reuse per-chunk parsing buffers (initial_states, word_counts, word_slices) instead of creating new Vecs on each of ~160 chunk iterations. Add rans_decode_4way_into and rans_decode_interleaved_into variants that write directly into a provided buffer. No measurable throughput change (bottleneck is rANS state machine, not allocation), but reduces memory churn and GC pressure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
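The `_into` variant pattern described above can be sketched like this; `decode_into` is a stand-in body (a copy, not real rANS), and the names mirror the commit's `rans_decode_4way_into` shape without claiming its signature:

```rust
/// Allocation-free variant: write decoded bytes into a caller-provided
/// buffer, so a per-chunk loop can reuse one allocation.
fn decode_into(src: &[u8], out: &mut [u8]) {
    // Stand-in for the real rANS inner loop.
    out.copy_from_slice(src);
}

/// The original allocating API becomes a thin wrapper.
fn decode(src: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; src.len()];
    decode_into(src, &mut out);
    out
}

fn main() {
    // ~160 chunk iterations share one pre-allocated buffer.
    let mut out = vec![0u8; 4];
    for chunk in [&b"abcd"[..], &b"efgh"[..]] {
        decode_into(chunk, &mut out);
    }
    assert_eq!(out, b"efgh".to_vec());
    assert_eq!(decode(b"abcd"), b"abcd".to_vec());
}
```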
Covers backward DP algorithm, cost model (literal/match overhead, distance-aware LzSeq costs), GPU top-K handoff, MatchTable layout, and tuning parameters. Updates index, QUALITY.md grades, and tech-debt-tracker to reflect completion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Analyzes how NVIDIA nvcomp achieves 90-320 GB/s on A100 and identifies 6 applicable patterns: massive batching, block-independent LZ, segmented ANS, persistent buffers, minimize transfers, and hardware-aware kernels. Includes throughput comparisons, gap analysis, and actionable recommendations ranked by impact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gsl) nvcomp-inspired kernel: 4KB blocks, 4096-slot hash table in var<workgroup> shared memory, atomicStore last-writer-wins. 3-14x faster kernel execution but CATASTROPHIC compression quality — nearly zero matches found.
Root cause: the parallel BUILD phase with atomicStore retains only late positions, so FIND at early positions sees only candidates after them (which are filtered out by candidate < pos). This is the same fundamental flaw as the earlier lz77_hash.wgsl — parallel hash build is incompatible with LZ77's sequential lookup-then-update requirement.
Also tested single mega-dispatch vs ring buffer: the ring's GPU/CPU overlap (44 MB/s) beats a single dispatch (30-38 MB/s) by interleaving compute and readback.
This commit preserves the experiment for historical reference and will be reverted in the next commit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the code changes from 89db5f3 (per-workgroup shared-memory hash table experiment). The kernel produced nearly zero matches due to a fundamental flaw in parallel hash build with last-writer-wins semantics.
Preserved for reference:
- kernels/lz77_local.wgsl (kernel source)
- docs/experiments/bulk_vs_ring.rs.txt (dispatch strategy benchmark)
- docs/experiments/local_vs_coop.rs.txt (quality comparison benchmark)
- docs/design-docs/experiments.md (detailed findings)
See experiments.md "Failed Experiment #2" for the full analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline variant Pipeline::LzSeqH = 9 that pairs the LzSeq sequence encoder with Huffman entropy coding instead of rANS. Reuses the existing stage_huffman_encode/decode and LzSeq demuxer.
Benchmarking showed LzSeqH decode is 24% slower than LzSeqR (59 MB/s vs 77 MB/s) because Huffman's bit-level accumulator is fundamentally slower than rANS's word-level multiply-add. The LzSeq decoder itself (58% of total time) is the bottleneck, not entropy coding. Kept as an available pipeline option since it demonstrates the modular stage architecture.
- Pipeline enum variant, TryFrom, trial candidates, GPU options
- Entropy encode/decode dispatch in blocks.rs
- Demuxer mapping in demux.rs
- Stage dispatch + empty-stream handling in stages.rs
- CLI argument parsing in pz.rs
- 6 new round-trip tests (empty, hello, repeating, binary, all_same, large)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
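How a new pipeline ID slots into the enum dispatch can be sketched as below; the variant names and values (LzSeqR = 8, LzSeqH = 9) come from the PR, but the rest of the enum is abbreviated and the error type is a placeholder:

```rust
/// Abbreviated sketch of the pipeline ID enum; the real enum has many
/// more variants (Deflate, Lzr, ...).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Pipeline {
    LzSeqR = 8, // LzSeq + rANS
    LzSeqH = 9, // LzSeq + Huffman
}

impl TryFrom<u8> for Pipeline {
    type Error = u8; // placeholder; the crate likely uses its own error type

    fn try_from(v: u8) -> Result<Self, u8> {
        match v {
            8 => Ok(Pipeline::LzSeqR),
            9 => Ok(Pipeline::LzSeqH),
            other => Err(other), // unknown ID in a compressed header
        }
    }
}

fn main() {
    assert_eq!(Pipeline::try_from(9u8), Ok(Pipeline::LzSeqH));
    assert!(Pipeline::try_from(42u8).is_err());
}
```

Keeping the wire ID in the enum discriminant means the header byte and the dispatch value can never drift apart.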
Profiles LzSeqR, LzSeqH, and Deflate pipeline decode, then isolates raw LzSeq decode (no entropy) and raw rANS vs Huffman entropy decode separately. Uses Canterbury alice29.txt if available, otherwise generates 150KB of mixed text+noise data.
Key findings from profiling:
- Raw LzSeq decode: 133 MB/s (58% of pipeline decode time)
- rANS decode: 118 MB/s per stream
- Huffman decode: 25 MB/s per stream (5x slower than rANS)
- The LzSeq decoder is the bottleneck, not entropy coding
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds a new lzseq compression module implementing zstd-style code+extra-bits encoding with:
- SeqConfig
- Pipeline::LzSeqR (LzSeq + rANS entropy coding)
Benchmark results (Canterbury corpus)
LzSeqR achieves the best compression ratio of any libpz pipeline, ~8.7pp better than LzssR.
Key files
- src/lzseq.rs
- src/lz77.rs: HashChainFinder::with_window() for configurable windows, find_match_wide() for u32 offsets
- src/optimal.rs: MatchCandidate widened to u32, added match_cost() for distance-aware costing
- src/pipeline/demux.rs: LzDemuxer::LzSeq (6-stream demux/remux)
- src/pipeline/mod.rs: Pipeline::LzSeqR = 8, seq_window_size option
- src/pipeline/stages.rs
- src/bin/pz.rs: pz -p lzseqr
Commits
- 89785fe Phase 1+2: Core module + pipeline integration
- 53c1b2a Phase 3: Window expansion to 128KB
- c03edac Phase 4: Distance-dependent MIN_MATCH
- 3502828 Phase 5: Repeat offset tracking
- 1d40f79 Review: Harden decode paths, fix optimal parser cost model
Test plan
- Decoding malformed input returns PzError instead of panicking
- pz -p lzseqr file && pz -d file.pz round-trips correctly
🤖 Generated with Claude Code