
Add zstd-style sequence encoding (LzSeq) #99

Merged
ChrisLundquist merged 23 commits into master from worktree-lzseq on Feb 22, 2026
Conversation

@ChrisLundquist
Owner

Summary

Adds a new lzseq compression module implementing zstd-style code+extra-bits encoding with:

  • Log2-based offset/length codes: Close matches cost 2-3 bytes, far matches scale with log2(distance)
  • 128KB sliding window (4x larger than LZSS's 32KB), configurable via SeqConfig
  • Repeat offset tracking: Last 3 offsets cached; repeat matches encode with 0 extra bits
  • Distance-dependent MIN_MATCH: Rejects unprofitable short matches at large distances
  • 6-stream demux: flags, literals, offset_codes, offset_extra, length_codes, length_extra
  • Pipeline integration: Pipeline::LzSeqR (LzSeq + rANS entropy coding)
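The log2-based offset codes above can be sketched as follows. This is an illustrative reconstruction of the code+extra-bits idea, not the actual `lzseq` API: the function names and exact bit layout are assumptions.

```rust
/// Split an offset into (code, extra): code = floor(log2(offset)),
/// and `code` low-order extra bits hold the remainder below that
/// power of two. Close offsets get small codes with few extra bits;
/// far offsets scale with log2(distance).
fn offset_code(offset: u32) -> (u8, u32) {
    debug_assert!(offset >= 1);
    let code = 31 - offset.leading_zeros(); // floor(log2(offset))
    (code as u8, offset - (1u32 << code))
}

/// Inverse: rebuild the offset from its code and extra bits.
fn decode_offset(code: u8, extra: u32) -> u32 {
    (1u32 << code) + extra
}

fn main() {
    // Offset 7 -> code 2 with 2 extra bits; offset 100_000 -> code 16
    // with 16 extra bits, so encoding cost grows with log2(distance).
    assert_eq!(offset_code(7), (2, 3));
    let (c, e) = offset_code(100_000);
    assert_eq!(decode_offset(c, e), 100_000);
    println!("ok");
}
```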

Benchmark results (Canterbury corpus)

| Pipeline | Ratio | Encode MB/s | Decode MB/s |
|----------|-------|-------------|-------------|
| LzSeqR   | 32.0% | 22.2        | 33.5        |
| LzssR    | 40.7% | 33.3        | 37.2        |
| Deflate  | 35.3% | 28.2        | 65.0        |

LzSeqR achieves the best compression ratio of any libpz pipeline, ~8.7pp better than LzssR.

Key files

| File | Description |
|------|-------------|
| `src/lzseq.rs` | New module: code tables, BitWriter/BitReader, encode/decode, repeat offsets, ~45 tests |
| `src/lz77.rs` | `HashChainFinder::with_window()` for configurable windows, `find_match_wide()` for u32 offsets |
| `src/optimal.rs` | Widened `MatchCandidate` to u32, added `match_cost()` for distance-aware costing |
| `src/pipeline/demux.rs` | `LzDemuxer::LzSeq` (6-stream demux/remux) |
| `src/pipeline/mod.rs` | `Pipeline::LzSeqR = 8`, `seq_window_size` option |
| `src/pipeline/stages.rs` | LzSeqR stage dispatch |
| `src/bin/pz.rs` | CLI: `pz -p lzseqr` |

Commits

  1. 89785fe Phase 1+2: Core module + pipeline integration
  2. 53c1b2a Phase 3: Window expansion to 128KB
  3. c03edac Phase 4: Distance-dependent MIN_MATCH
  4. 3502828 Phase 5: Repeat offset tracking
  5. 1d40f79 Review: Harden decode paths, fix optimal parser cost model

Test plan

  • All 651+ existing tests pass
  • 45 new lzseq unit tests (code table consistency, BitWriter/BitReader round-trip, encode/decode round-trip, repeat offsets, edge cases)
  • 7 pipeline integration tests (round-trip, multiblock)
  • Decode validation: malformed input returns PzError instead of panicking
  • `pz -p lzseqr file && pz -d file.pz` round-trips correctly
  • No regression on existing pipelines

🤖 Generated with Claude Code

Chris Lundquist and others added 7 commits February 21, 2026 04:03
New lzseq module with log2-based offset/length code tables and packed
extra-bits bitstreams. Matches encode as (code byte + extra bits) instead
of fixed-width u16+u16, making close matches cheaper and enabling future
window expansion beyond 32KB. Integrated as Pipeline::LzSeqR (ID=8) with
rANS entropy coding across 6 independent streams.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ase 3)

Parameterize HashChainFinder for larger windows via with_window() and
find_match_wide() (u32 offsets). LzSeq pipeline defaults to 128KB window
via SeqConfig, exposed as seq_window_size on CompressOptions. Existing
32KB-window pipelines (Deflate, Lzr, etc.) are completely unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject short matches at large distances where code+extra-bits cost
exceeds literal cost. Add min_profitable_length(offset) to lzseq with
tiered thresholds (close=3, medium=4, far=5+). Integrate into lazy
matching loop.
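A minimal sketch of the tiered threshold described above. The tiers (close=3, medium=4, far=5+) come from the commit message, but the exact distance boundaries here are assumptions, not the real `min_profitable_length` constants:

```rust
/// Minimum match length worth encoding at a given distance: at large
/// offsets the code byte plus extra bits can exceed the cost of just
/// emitting the bytes as literals, so short far matches are rejected.
fn min_profitable_length(offset: u32) -> u32 {
    match offset {
        0..=1024 => 3,      // close: cheap offset code beats 3 literals
        1025..=65536 => 4,  // medium: needs one more byte of payoff
        _ => 5,             // far: many extra bits, demand longer matches
    }
}

fn main() {
    assert_eq!(min_profitable_length(64), 3);
    assert_eq!(min_profitable_length(10_000), 4);
    assert_eq!(min_profitable_length(100_000), 5);
    println!("ok");
}
```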

Widen MatchCandidate to u32 offset/length for LzSeq support. Add
distance-aware match_cost() to optimal CostModel using actual
code+extra-bits costs instead of fixed overhead.
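The distance-aware costing can be sketched like this. The constants are illustrative (the real `CostModel` may weight flags and streams differently), and, as documented later in this PR, repeat offsets (0 extra bits) are deliberately not modeled:

```rust
/// Approximate cost of a match in bits: one code byte plus
/// floor(log2) extra bits for each of offset and length.
fn match_cost_bits(offset: u32, length: u32) -> u32 {
    let off_extra = 31 - offset.leading_zeros(); // floor(log2(offset))
    let len_extra = 31 - length.leading_zeros(); // floor(log2(length))
    (8 + off_extra) + (8 + len_extra)
}

fn main() {
    // A close match is genuinely cheaper than a far one at the same
    // length, so the DP parser can prefer nearer matches.
    assert!(match_cost_bits(8, 10) < match_cost_bits(100_000, 10));
    println!("ok");
}
```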

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Track last 3 offsets in RepeatOffsets state. Matches reusing a recent
offset encode with code 0-2 (0 extra bits), saving the full offset
encoding cost. Literal offsets shift to code 3+ in the offset_codes
stream.

Encoder checks repeat candidates alongside hash-chain matches, preferring
repeats when they're competitive (within offset-encoding-cost bytes).
Decoder maintains identical RepeatOffsets state for correct round-trip.
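The mechanism above can be sketched as a 3-slot cache. This is a hedged reconstruction: the real `RepeatOffsets` update policy and initial values may differ.

```rust
/// Cache of the last 3 match offsets. Reusing one encodes as
/// code 0-2 with zero extra bits; literal offsets shift to code 3+.
struct RepeatOffsets {
    recent: [u32; 3],
}

impl RepeatOffsets {
    fn new() -> Self {
        Self { recent: [1, 4, 8] } // assumed initial values
    }

    /// Some(code 0-2) if `offset` is cached, else None (literal path).
    fn code_for(&self, offset: u32) -> Option<u8> {
        self.recent.iter().position(|&o| o == offset).map(|i| i as u8)
    }

    /// Move-to-front update after emitting a match at `offset`.
    /// Encoder and decoder must run this identically for round-trip.
    fn push(&mut self, offset: u32) {
        if self.recent[0] == offset {
            return;
        }
        if self.recent[1] == offset {
            self.recent.swap(0, 1);
        } else {
            self.recent[2] = self.recent[1];
            self.recent[1] = self.recent[0];
            self.recent[0] = offset;
        }
    }
}

fn main() {
    let mut reps = RepeatOffsets::new();
    reps.push(100);
    assert_eq!(reps.code_for(100), Some(0)); // repeat: 0 extra bits
    assert_eq!(reps.code_for(999), None);    // literal: full offset cost
    println!("ok");
}
```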

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add upfront validation of stream lengths in decode() to prevent
  panics on malformed input (flags, offset_codes, length_codes)
- Add bounds check before copy loop to prevent unbounded allocation
- Fix optimal_parse to use match_token() instead of match_cost(),
  keeping the DP cost model consistent with LZ77's fixed encoding
- Return is_repeat flag from select_best_match to avoid redundant
  repeat-offset scan in the caller
- Remove debug eprintln from tests
- Gate decode_offset with #[cfg(test)] (only used in tests)
- Update module documentation for repeat offsets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add overflow tracking to BitReader (bits_consumed vs total_bits) and
  check it in decode() to reject truncated extra-bits streams
- Add debug_assert in unpack_flags for defense-in-depth
- Wire match_cost() into optimal_parse: distance-aware costing improves
  parse decisions for all pipelines (closer matches genuinely cost less)
- Document match_cost() limitation: uses raw offsets, not repeat-shifted
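The overflow-tracking idea in the first bullet can be sketched as below. Types and method names are illustrative, not the actual lzseq `BitReader`; the point is comparing bits consumed against the total so truncated extra-bits streams are rejected instead of silently read past the end.

```rust
/// LSB-first bit reader that tolerates over-reads but records them,
/// so decode() can reject truncated streams after the fact.
struct BitReader<'a> {
    data: &'a [u8],
    bit_pos: usize,
    total_bits: usize,
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, bit_pos: 0, total_bits: data.len() * 8 }
    }

    fn read_bits(&mut self, n: usize) -> u32 {
        let mut v = 0u32;
        for i in 0..n {
            let p = self.bit_pos + i;
            if p < self.total_bits {
                let bit = (self.data[p / 8] >> (p % 8)) & 1;
                v |= (bit as u32) << i;
            } // bits past the end read as 0; flagged by overflowed()
        }
        self.bit_pos += n;
        v
    }

    /// True if more bits were requested than the stream holds.
    fn overflowed(&self) -> bool {
        self.bit_pos > self.total_bits
    }
}

fn main() {
    let mut r = BitReader::new(&[0b1010_0110]);
    assert_eq!(r.read_bits(4), 0b0110); // LSB-first
    assert_eq!(r.read_bits(4), 0b1010);
    assert!(!r.overflowed());
    r.read_bits(1); // truncated stream
    assert!(r.overflowed());
    println!("ok");
}
```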

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ARCHITECTURE.md: add LzSeq to algorithms and pipelines
- QUALITY.md: add lzseq module (grade B) and lzseqr pipeline entries
- PLAN-competitive-roadmap: add LzSeq as Phase 2 Task 5 (complete)
- tech-debt-tracker, exec-plans index: update dates
- CLAUDE.md: add lzseq to project layout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Chris Lundquist and others added 16 commits February 21, 2026 20:28
Add lzseq_demux.wgsl kernel that walks the GPU match buffer and produces
all 6 LzSeq output streams (flags, literals, offset_codes, offset_extra,
length_codes, length_extra) entirely on-device, eliminating the PCIe
bottleneck of downloading 12 bytes/position from the match buffer.

Key design decisions:
- Single-thread serial kernel (@workgroup_size(1)) since the dedupe walk
  is inherently sequential (each token position depends on prior match length)
- No repeat offsets on GPU — all offset codes are literal (shifted by
  NUM_REPEAT_CODES=3). The decoder handles this transparently.
- Match lengths capped to u16::MAX (65535) for decode_length compatibility
- Distance-dependent min_profitable_length matches CPU logic exactly

GPU pipeline: find_matches_to_device() → lzseq_demux → download 6 streams
- Reuses input buffer from match finding (no double upload)
- Output buffer is driver zero-initialized (no host upload of zeros)
- extract_lzseq_streams validates all counter/offset bounds before slicing
- GpuMatchBuf now carries input_buf for downstream kernel reuse

Benchmark results (Canterbury corpus, 13.32 MB):
- LzSeqR achieves 32.0% ratio at 22 MB/s end-to-end (CPU path)
- GPU demux eliminates one PCIe round-trip but overall GPU pipeline
  still slower than CPU (8-42x) due to per-block transfer overhead
- Full GPU benefit requires chaining rANS on-device (future Step 6)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire GPU rANS encoding into the LzSeqR pipeline's entropy stage using
the batched API with ring-buffered submit/readback overlap. All 6 demuxed
streams are encoded in a single batched GPU dispatch instead of 6 serial
calls, amortizing GPU synchronization overhead. Small streams fall back
to CPU basic rANS. The wire format is unchanged — the existing decoder
handles GPU-encoded payloads transparently via RANS_INTERLEAVED_FLAG.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pack 64/num_lanes chunks into each 64-wide workgroup instead of wasting
93.75% of RDNA wave lanes (4 threads per 64-lane wave). Also reduce rANS
chunk size from 1024 to 256 bytes for 4x more workgroups per stream.

No compression ratio impact — chunk_size only affects rANS entropy coding
granularity, not the LZ search window.

Benchmarks: 22.3 MB/s compress (was 22.0), identical 32.0% ratio.
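The wave-packing arithmetic can be sketched as a dispatch-count helper. This is an assumed host-side formulation of the idea, not code from the PR:

```rust
/// With 64-wide workgroups and `num_lanes` threads per rANS chunk,
/// pack 64 / num_lanes chunks into each workgroup so every lane of
/// a 64-lane RDNA wave does useful work.
fn packed_dispatch(num_chunks: u32, num_lanes: u32) -> u32 {
    assert!(num_lanes > 0 && 64 % num_lanes == 0);
    let chunks_per_group = 64 / num_lanes; // e.g. 16 chunks at 4 lanes
    num_chunks.div_ceil(chunks_per_group)
}

fn main() {
    // 4 lanes per chunk: one chunk per group used only 4/64 lanes;
    // packing 16 chunks per group fills the wave and cuts dispatches.
    assert_eq!(packed_dispatch(100, 4), 7); // ceil(100 / 16)
    assert_eq!(packed_dispatch(100, 64), 100);
    println!("ok");
}
```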

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mirror the encode-side wave-packing optimization for decode: pack
64/num_lanes chunks into each 64-wide workgroup. Both dispatch sites
(rans_decode_chunked_gpu and rans_decode_chunked_gpu_with_chunk_meta)
now select the packed kernel when num_lanes divides 64 evenly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Decode: replace byte-at-a-time match copy loop with extend_from_within
(bulk memcpy for non-overlapping, offset-chunked for overlapping).
Iterate packed flag bytes directly instead of allocating Vec<bool>.
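The bulk match copy can be sketched as below, using the standard `Vec::extend_from_within`. This is a hedged reconstruction of the described change, not the exact decoder code:

```rust
/// Copy `len` bytes from `offset` back in the output. Non-overlapping
/// copies become a single extend_from_within (bulk memcpy); overlapping
/// copies are chunked by the offset so each chunk reads only
/// already-written bytes.
fn copy_match(out: &mut Vec<u8>, offset: usize, len: usize) {
    let mut start = out.len() - offset;
    if offset >= len {
        out.extend_from_within(start..start + len);
    } else {
        let mut remaining = len;
        while remaining > 0 {
            let chunk = remaining.min(out.len() - start);
            out.extend_from_within(start..start + chunk);
            start += chunk;
            remaining -= chunk;
        }
    }
}

fn main() {
    // Non-overlapping: duplicate the last 4 bytes in one memcpy.
    let mut out = b"abcd".to_vec();
    copy_match(&mut out, 4, 4);
    assert_eq!(out, b"abcdabcd");
    // Overlapping RLE-style: period-2 pattern from a 2-byte seed.
    let mut out = b"ab".to_vec();
    copy_match(&mut out, 2, 5);
    assert_eq!(out, b"abababa");
    println!("ok");
}
```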

Profile: add lzseqr pipeline to the profiling harness.

Profiling shows the decode bottleneck is the CPU rANS decode loop
(stage 0), not LzSeq reconstruction (stage 1), so the match copy
improvement has minimal end-to-end impact (~0% on bench.sh).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents why CPU-only LzSeqR (328 MB/s, 32% ratio) outperforms
GPU-accelerated paths on AMD Radeon Pro 5500M due to PCIe overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds fuzz testing for all standalone algorithms (BWT, rANS, FSE,
Huffman, LZ77, LZ78, LZSS, LzSeq, RLE, MTF) and both pipeline-level
targets (roundtrip and crash-resistance decompress). Each target tests
encode/decode roundtrip correctness and feeds arbitrary bytes to decode
paths to verify crash resistance. Requires nightly + cargo-fuzz to run.

Completes milestone M5.3 (12/12 milestones).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update ARCHITECTURE.md, QUALITY.md, and tech-debt-tracker to reflect
completed fuzz testing infrastructure (12 targets, all algorithms
and pipelines). 24h CI campaign still pending.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-allocate the full output buffer and reuse per-chunk parsing
buffers (initial_states, word_counts, word_slices) instead of
creating new Vecs on each of ~160 chunk iterations. Add
rans_decode_4way_into and rans_decode_interleaved_into variants
that write directly into a provided buffer.

No measurable throughput change (the bottleneck is the rANS state
machine, not allocation), but it reduces memory churn and allocator pressure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers backward DP algorithm, cost model (literal/match overhead,
distance-aware LzSeq costs), GPU top-K handoff, MatchTable layout,
and tuning parameters. Updates index, QUALITY.md grades, and
tech-debt-tracker to reflect completion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Analyzes how NVIDIA nvcomp achieves 90-320 GB/s on A100 and
identifies 6 applicable patterns: massive batching, block-independent
LZ, segmented ANS, persistent buffers, minimize transfers, and
hardware-aware kernels. Includes throughput comparisons, gap analysis,
and actionable recommendations ranked by impact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gsl)

nvcomp-inspired kernel: 4KB blocks, 4096-slot hash table in var<workgroup>
shared memory, atomicStore last-writer-wins. 3-14x faster kernel execution
but CATASTROPHIC compression quality — nearly zero matches found.

Root cause: parallel BUILD phase with atomicStore stores only late positions.
FIND at early positions gets candidates after them (filtered by candidate < pos).
Same fundamental flaw as the earlier lz77_hash.wgsl — parallel hash build is
incompatible with LZ77's sequential lookup-then-update requirement.

Also tested single mega-dispatch vs ring buffer: ring's GPU/CPU overlap
(44 MB/s) beats single dispatch (30-38 MB/s) by interleaving compute
and readback.

This commit preserves the experiment for historical reference.
It will be reverted in the next commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the code changes from 89db5f3 (per-workgroup shared-memory hash
table experiment). The kernel produced nearly zero matches due to a
fundamental flaw in parallel hash build with last-writer-wins semantics.

Preserved for reference:
- kernels/lz77_local.wgsl (kernel source)
- docs/experiments/bulk_vs_ring.rs.txt (dispatch strategy benchmark)
- docs/experiments/local_vs_coop.rs.txt (quality comparison benchmark)
- docs/design-docs/experiments.md (detailed findings)

See experiments.md "Failed Experiment #2" for full analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline variant Pipeline::LzSeqH = 9 that pairs the LzSeq
sequence encoder with Huffman entropy coding instead of rANS.
Reuses existing stage_huffman_encode/decode and LzSeq demuxer.

Benchmarking showed LzSeqH decode is 24% slower than LzSeqR
(59 MB/s vs 77 MB/s) because Huffman's bit-level accumulator
is fundamentally slower than rANS's word-level multiply-add.
The LzSeq decoder itself (58% of total time) is the bottleneck,
not entropy coding. Kept as an available pipeline option since
it demonstrates the modular stage architecture.

- Pipeline enum variant, TryFrom, trial candidates, GPU options
- Entropy encode/decode dispatch in blocks.rs
- Demuxer mapping in demux.rs
- Stage dispatch + empty stream handling in stages.rs
- CLI argument parsing in pz.rs
- 6 new round-trip tests (empty, hello, repeating, binary,
  all_same, large)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Profiles LzSeqR, LzSeqH, and Deflate pipeline decode, then
isolates raw LzSeq decode (no entropy) and raw rANS vs Huffman
entropy decode separately. Uses Canterbury alice29.txt if
available, otherwise generates 150KB of mixed text+noise data.

Key findings from profiling:
- Raw LzSeq decode: 133 MB/s (58% of pipeline decode time)
- rANS decode: 118 MB/s per stream
- Huffman decode: 25 MB/s per stream (5x slower than rANS)
- The LzSeq decoder is the bottleneck, not entropy coding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit 3d53789 into master Feb 22, 2026
5 checks passed
@ChrisLundquist ChrisLundquist deleted the worktree-lzseq branch February 25, 2026 09:31
