
Add zstd-style sequence encoding (LzSeq) #99

Merged
ChrisLundquist merged 23 commits into master from worktree-lzseq on Feb 22, 2026
Conversation

@ChrisLundquist
Owner

Summary

Adds a new lzseq compression module implementing zstd-style code+extra-bits encoding with:

  • Log2-based offset/length codes: Close matches cost 2-3 bytes, far matches scale with log2(distance)
  • 128KB sliding window (4x larger than LZSS's 32KB), configurable via SeqConfig
  • Repeat offset tracking: Last 3 offsets cached; repeat matches encode with 0 extra bits
  • Distance-dependent MIN_MATCH: Rejects unprofitable short matches at large distances
  • 6-stream demux: flags, literals, offset_codes, offset_extra, length_codes, length_extra
  • Pipeline integration: Pipeline::LzSeqR (LzSeq + rANS entropy coding)
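The log2-based offset codes above can be sketched as follows. This is an illustrative reconstruction of the code+extra-bits idea, not the actual `lzseq` API: the function names and exact bit layout are assumptions.

```rust
/// Split an offset into (code, extra): code = floor(log2(offset)),
/// and `code` low-order extra bits hold the remainder below that
/// power of two. Close offsets get small codes with few extra bits;
/// far offsets scale with log2(distance).
fn offset_code(offset: u32) -> (u8, u32) {
    debug_assert!(offset >= 1);
    let code = 31 - offset.leading_zeros(); // floor(log2(offset))
    (code as u8, offset - (1u32 << code))
}

/// Inverse: rebuild the offset from its code and extra bits.
fn decode_offset(code: u8, extra: u32) -> u32 {
    (1u32 << code) + extra
}

fn main() {
    // Offset 7 -> code 2 with 2 extra bits; offset 100_000 -> code 16
    // with 16 extra bits, so encoding cost grows with log2(distance).
    assert_eq!(offset_code(7), (2, 3));
    let (c, e) = offset_code(100_000);
    assert_eq!(decode_offset(c, e), 100_000);
    println!("ok");
}
```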

Benchmark results (Canterbury corpus)

| Pipeline | Ratio | Encode MB/s | Decode MB/s |
|----------|-------|-------------|-------------|
| LzSeqR   | 32.0% | 22.2        | 33.5        |
| LzssR    | 40.7% | 33.3        | 37.2        |
| Deflate  | 35.3% | 28.2        | 65.0        |

LzSeqR achieves the best compression ratio of any libpz pipeline, ~8.7pp better than LzssR.

Key files

| File | Description |
|------|-------------|
| `src/lzseq.rs` | New module: code tables, BitWriter/BitReader, encode/decode, repeat offsets, ~45 tests |
| `src/lz77.rs` | `HashChainFinder::with_window()` for configurable windows, `find_match_wide()` for u32 offsets |
| `src/optimal.rs` | Widened `MatchCandidate` to u32, added `match_cost()` for distance-aware costing |
| `src/pipeline/demux.rs` | `LzDemuxer::LzSeq` (6-stream demux/remux) |
| `src/pipeline/mod.rs` | `Pipeline::LzSeqR = 8`, `seq_window_size` option |
| `src/pipeline/stages.rs` | LzSeqR stage dispatch |
| `src/bin/pz.rs` | CLI: `pz -p lzseqr` |

Commits

  1. 89785fe Phase 1+2: Core module + pipeline integration
  2. 53c1b2a Phase 3: Window expansion to 128KB
  3. c03edac Phase 4: Distance-dependent MIN_MATCH
  4. 3502828 Phase 5: Repeat offset tracking
  5. 1d40f79 Review: Harden decode paths, fix optimal parser cost model

Test plan

  • All 651+ existing tests pass
  • 45 new lzseq unit tests (code table consistency, BitWriter/BitReader round-trip, encode/decode round-trip, repeat offsets, edge cases)
  • 7 pipeline integration tests (round-trip, multiblock)
  • Decode validation: malformed input returns PzError instead of panicking
  • `pz -p lzseqr file && pz -d file.pz` round-trips correctly
  • No regression on existing pipelines

🤖 Generated with Claude Code

Chris Lundquist and others added 7 commits February 21, 2026 04:03
New lzseq module with log2-based offset/length code tables and packed
extra-bits bitstreams. Matches encode as (code byte + extra bits) instead
of fixed-width u16+u16, making close matches cheaper and enabling future
window expansion beyond 32KB. Integrated as Pipeline::LzSeqR (ID=8) with
rANS entropy coding across 6 independent streams.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ase 3)

Parameterize HashChainFinder for larger windows via with_window() and
find_match_wide() (u32 offsets). LzSeq pipeline defaults to 128KB window
via SeqConfig, exposed as seq_window_size on CompressOptions. Existing
32KB-window pipelines (Deflate, Lzr, etc.) are completely unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject short matches at large distances where code+extra-bits cost
exceeds literal cost. Add min_profitable_length(offset) to lzseq with
tiered thresholds (close=3, medium=4, far=5+). Integrate into lazy
matching loop.
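A minimal sketch of the tiered threshold described above. The tiers (close=3, medium=4, far=5+) come from the commit message, but the exact distance boundaries here are assumptions, not the real `min_profitable_length` constants:

```rust
/// Minimum match length worth encoding at a given distance: at large
/// offsets the code byte plus extra bits can exceed the cost of just
/// emitting the bytes as literals, so short far matches are rejected.
fn min_profitable_length(offset: u32) -> u32 {
    match offset {
        0..=1024 => 3,      // close: cheap offset code beats 3 literals
        1025..=65536 => 4,  // medium: needs one more byte of payoff
        _ => 5,             // far: many extra bits, demand longer matches
    }
}

fn main() {
    assert_eq!(min_profitable_length(64), 3);
    assert_eq!(min_profitable_length(10_000), 4);
    assert_eq!(min_profitable_length(100_000), 5);
    println!("ok");
}
```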

Widen MatchCandidate to u32 offset/length for LzSeq support. Add
distance-aware match_cost() to optimal CostModel using actual
code+extra-bits costs instead of fixed overhead.
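The distance-aware costing can be sketched like this. The constants are illustrative (the real `CostModel` may weight flags and streams differently), and, as documented later in this PR, repeat offsets (0 extra bits) are deliberately not modeled:

```rust
/// Approximate cost of a match in bits: one code byte plus
/// floor(log2) extra bits for each of offset and length.
fn match_cost_bits(offset: u32, length: u32) -> u32 {
    let off_extra = 31 - offset.leading_zeros(); // floor(log2(offset))
    let len_extra = 31 - length.leading_zeros(); // floor(log2(length))
    (8 + off_extra) + (8 + len_extra)
}

fn main() {
    // A close match is genuinely cheaper than a far one at the same
    // length, so the DP parser can prefer nearer matches.
    assert!(match_cost_bits(8, 10) < match_cost_bits(100_000, 10));
    println!("ok");
}
```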

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Track last 3 offsets in RepeatOffsets state. Matches reusing a recent
offset encode with code 0-2 (0 extra bits), saving the full offset
encoding cost. Literal offsets shift to code 3+ in the offset_codes
stream.

Encoder checks repeat candidates alongside hash-chain matches, preferring
repeats when they're competitive (within offset-encoding-cost bytes).
Decoder maintains identical RepeatOffsets state for correct round-trip.
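The mechanism above can be sketched as a 3-slot cache. This is a hedged reconstruction: the real `RepeatOffsets` update policy and initial values may differ.

```rust
/// Cache of the last 3 match offsets. Reusing one encodes as
/// code 0-2 with zero extra bits; literal offsets shift to code 3+.
struct RepeatOffsets {
    recent: [u32; 3],
}

impl RepeatOffsets {
    fn new() -> Self {
        Self { recent: [1, 4, 8] } // assumed initial values
    }

    /// Some(code 0-2) if `offset` is cached, else None (literal path).
    fn code_for(&self, offset: u32) -> Option<u8> {
        self.recent.iter().position(|&o| o == offset).map(|i| i as u8)
    }

    /// Move-to-front update after emitting a match at `offset`.
    /// Encoder and decoder must run this identically for round-trip.
    fn push(&mut self, offset: u32) {
        if self.recent[0] == offset {
            return;
        }
        if self.recent[1] == offset {
            self.recent.swap(0, 1);
        } else {
            self.recent[2] = self.recent[1];
            self.recent[1] = self.recent[0];
            self.recent[0] = offset;
        }
    }
}

fn main() {
    let mut reps = RepeatOffsets::new();
    reps.push(100);
    assert_eq!(reps.code_for(100), Some(0)); // repeat: 0 extra bits
    assert_eq!(reps.code_for(999), None);    // literal: full offset cost
    println!("ok");
}
```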

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add upfront validation of stream lengths in decode() to prevent
  panics on malformed input (flags, offset_codes, length_codes)
- Add bounds check before copy loop to prevent unbounded allocation
- Fix optimal_parse to use match_token() instead of match_cost(),
  keeping the DP cost model consistent with LZ77's fixed encoding
- Return is_repeat flag from select_best_match to avoid redundant
  repeat-offset scan in the caller
- Remove debug eprintln from tests
- Gate decode_offset with #[cfg(test)] (only used in tests)
- Update module documentation for repeat offsets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add overflow tracking to BitReader (bits_consumed vs total_bits) and
  check it in decode() to reject truncated extra-bits streams
- Add debug_assert in unpack_flags for defense-in-depth
- Wire match_cost() into optimal_parse: distance-aware costing improves
  parse decisions for all pipelines (closer matches genuinely cost less)
- Document match_cost() limitation: uses raw offsets, not repeat-shifted
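The overflow-tracking idea in the first bullet can be sketched as below. Types and method names are illustrative, not the actual lzseq `BitReader`; the point is comparing bits consumed against the total so truncated extra-bits streams are rejected instead of silently read past the end.

```rust
/// LSB-first bit reader that tolerates over-reads but records them,
/// so decode() can reject truncated streams after the fact.
struct BitReader<'a> {
    data: &'a [u8],
    bit_pos: usize,
    total_bits: usize,
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, bit_pos: 0, total_bits: data.len() * 8 }
    }

    fn read_bits(&mut self, n: usize) -> u32 {
        let mut v = 0u32;
        for i in 0..n {
            let p = self.bit_pos + i;
            if p < self.total_bits {
                let bit = (self.data[p / 8] >> (p % 8)) & 1;
                v |= (bit as u32) << i;
            } // bits past the end read as 0; flagged by overflowed()
        }
        self.bit_pos += n;
        v
    }

    /// True if more bits were requested than the stream holds.
    fn overflowed(&self) -> bool {
        self.bit_pos > self.total_bits
    }
}

fn main() {
    let mut r = BitReader::new(&[0b1010_0110]);
    assert_eq!(r.read_bits(4), 0b0110); // LSB-first
    assert_eq!(r.read_bits(4), 0b1010);
    assert!(!r.overflowed());
    r.read_bits(1); // truncated stream
    assert!(r.overflowed());
    println!("ok");
}
```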

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ARCHITECTURE.md: add LzSeq to algorithms and pipelines
- QUALITY.md: add lzseq module (grade B) and lzseqr pipeline entries
- PLAN-competitive-roadmap: add LzSeq as Phase 2 Task 5 (complete)
- tech-debt-tracker, exec-plans index: update dates
- CLAUDE.md: add lzseq to project layout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Chris Lundquist and others added 16 commits February 21, 2026 20:28
Add lzseq_demux.wgsl kernel that walks the GPU match buffer and produces
all 6 LzSeq output streams (flags, literals, offset_codes, offset_extra,
length_codes, length_extra) entirely on-device, eliminating the PCIe
bottleneck of downloading 12 bytes/position from the match buffer.

Key design decisions:
- Single-thread serial kernel (@workgroup_size(1)) since the dedupe walk
  is inherently sequential (each token position depends on prior match length)
- No repeat offsets on GPU — all offset codes are literal (shifted by
  NUM_REPEAT_CODES=3). The decoder handles this transparently.
- Match lengths capped to u16::MAX (65535) for decode_length compatibility
- Distance-dependent min_profitable_length matches CPU logic exactly

GPU pipeline: find_matches_to_device() → lzseq_demux → download 6 streams
- Reuses input buffer from match finding (no double upload)
- Output buffer is driver zero-initialized (no host upload of zeros)
- extract_lzseq_streams validates all counter/offset bounds before slicing
- GpuMatchBuf now carries input_buf for downstream kernel reuse

Benchmark results (Canterbury corpus, 13.32 MB):
- LzSeqR achieves 32.0% ratio at 22 MB/s end-to-end (CPU path)
- GPU demux eliminates one PCIe round-trip but overall GPU pipeline
  still slower than CPU (8-42x) due to per-block transfer overhead
- Full GPU benefit requires chaining rANS on-device (future Step 6)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire GPU rANS encoding into the LzSeqR pipeline's entropy stage using
the batched API with ring-buffered submit/readback overlap. All 6 demuxed
streams are encoded in a single batched GPU dispatch instead of 6 serial
calls, amortizing GPU synchronization overhead. Small streams fall back
to CPU basic rANS. The wire format is unchanged — the existing decoder
handles GPU-encoded payloads transparently via RANS_INTERLEAVED_FLAG.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pack 64/num_lanes chunks into each 64-wide workgroup instead of wasting
93.75% of RDNA wave lanes (4 threads per 64-lane wave). Also reduce rANS
chunk size from 1024 to 256 bytes for 4x more workgroups per stream.

No compression ratio impact — chunk_size only affects rANS entropy coding
granularity, not the LZ search window.

Benchmarks: 22.3 MB/s compress (was 22.0), identical 32.0% ratio.
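The wave-packing arithmetic can be sketched as a dispatch-count helper. This is an assumed host-side formulation of the idea, not code from the PR:

```rust
/// With 64-wide workgroups and `num_lanes` threads per rANS chunk,
/// pack 64 / num_lanes chunks into each workgroup so every lane of
/// a 64-lane RDNA wave does useful work.
fn packed_dispatch(num_chunks: u32, num_lanes: u32) -> u32 {
    assert!(num_lanes > 0 && 64 % num_lanes == 0);
    let chunks_per_group = 64 / num_lanes; // e.g. 16 chunks at 4 lanes
    num_chunks.div_ceil(chunks_per_group)
}

fn main() {
    // 4 lanes per chunk: one chunk per group used only 4/64 lanes;
    // packing 16 chunks per group fills the wave and cuts dispatches.
    assert_eq!(packed_dispatch(100, 4), 7); // ceil(100 / 16)
    assert_eq!(packed_dispatch(100, 64), 100);
    println!("ok");
}
```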

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mirror the encode-side wave-packing optimization for decode: pack
64/num_lanes chunks into each 64-wide workgroup. Both dispatch sites
(rans_decode_chunked_gpu and rans_decode_chunked_gpu_with_chunk_meta)
now select the packed kernel when num_lanes divides 64 evenly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Decode: replace byte-at-a-time match copy loop with extend_from_within
(bulk memcpy for non-overlapping, offset-chunked for overlapping).
Iterate packed flag bytes directly instead of allocating Vec<bool>.
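The bulk match copy can be sketched as below, using the standard `Vec::extend_from_within`. This is a hedged reconstruction of the described change, not the exact decoder code:

```rust
/// Copy `len` bytes from `offset` back in the output. Non-overlapping
/// copies become a single extend_from_within (bulk memcpy); overlapping
/// copies are chunked by the offset so each chunk reads only
/// already-written bytes.
fn copy_match(out: &mut Vec<u8>, offset: usize, len: usize) {
    let mut start = out.len() - offset;
    if offset >= len {
        out.extend_from_within(start..start + len);
    } else {
        let mut remaining = len;
        while remaining > 0 {
            let chunk = remaining.min(out.len() - start);
            out.extend_from_within(start..start + chunk);
            start += chunk;
            remaining -= chunk;
        }
    }
}

fn main() {
    // Non-overlapping: duplicate the last 4 bytes in one memcpy.
    let mut out = b"abcd".to_vec();
    copy_match(&mut out, 4, 4);
    assert_eq!(out, b"abcdabcd");
    // Overlapping RLE-style: period-2 pattern from a 2-byte seed.
    let mut out = b"ab".to_vec();
    copy_match(&mut out, 2, 5);
    assert_eq!(out, b"abababa");
    println!("ok");
}
```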

Profile: add lzseqr pipeline to the profiling harness.

Profiling shows the decode bottleneck is the CPU rANS decode loop
(stage 0), not LzSeq reconstruction (stage 1), so the match copy
improvement has minimal end-to-end impact (~0% on bench.sh).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents why CPU-only LzSeqR (328 MB/s, 32% ratio) outperforms
GPU-accelerated paths on AMD Radeon Pro 5500M due to PCIe overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds fuzz testing for all standalone algorithms (BWT, rANS, FSE,
Huffman, LZ77, LZ78, LZSS, LzSeq, RLE, MTF) and both pipeline-level
targets (roundtrip and crash-resistance decompress). Each target tests
encode/decode roundtrip correctness and feeds arbitrary bytes to decode
paths to verify crash resistance. Requires nightly + cargo-fuzz to run.

Completes milestone M5.3 (12/12 milestones).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update ARCHITECTURE.md, QUALITY.md, and tech-debt-tracker to reflect
completed fuzz testing infrastructure (12 targets, all algorithms
and pipelines). 24h CI campaign still pending.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-allocate the full output buffer and reuse per-chunk parsing
buffers (initial_states, word_counts, word_slices) instead of
creating new Vecs on each of ~160 chunk iterations. Add
rans_decode_4way_into and rans_decode_interleaved_into variants
that write directly into a provided buffer.

No measurable throughput change (the bottleneck is the rANS state
machine, not allocation), but it reduces memory churn and allocator pressure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers backward DP algorithm, cost model (literal/match overhead,
distance-aware LzSeq costs), GPU top-K handoff, MatchTable layout,
and tuning parameters. Updates index, QUALITY.md grades, and
tech-debt-tracker to reflect completion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Analyzes how NVIDIA nvcomp achieves 90-320 GB/s on A100 and
identifies 6 applicable patterns: massive batching, block-independent
LZ, segmented ANS, persistent buffers, minimize transfers, and
hardware-aware kernels. Includes throughput comparisons, gap analysis,
and actionable recommendations ranked by impact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gsl)

nvcomp-inspired kernel: 4KB blocks, 4096-slot hash table in var<workgroup>
shared memory, atomicStore last-writer-wins. 3-14x faster kernel execution
but CATASTROPHIC compression quality — nearly zero matches found.

Root cause: parallel BUILD phase with atomicStore stores only late positions.
FIND at early positions gets candidates after them (filtered by candidate < pos).
Same fundamental flaw as the earlier lz77_hash.wgsl — parallel hash build is
incompatible with LZ77's sequential lookup-then-update requirement.

Also tested single mega-dispatch vs ring buffer: ring's GPU/CPU overlap
(44 MB/s) beats single dispatch (30-38 MB/s) by interleaving compute
and readback.

This commit preserves the experiment for historical reference.
It will be reverted in the next commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the code changes from 89db5f3 (per-workgroup shared-memory hash
table experiment). The kernel produced nearly zero matches due to a
fundamental flaw in parallel hash build with last-writer-wins semantics.

Preserved for reference:
- kernels/lz77_local.wgsl (kernel source)
- docs/experiments/bulk_vs_ring.rs.txt (dispatch strategy benchmark)
- docs/experiments/local_vs_coop.rs.txt (quality comparison benchmark)
- docs/design-docs/experiments.md (detailed findings)

See experiments.md "Failed Experiment #2" for full analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New pipeline variant Pipeline::LzSeqH = 9 that pairs the LzSeq
sequence encoder with Huffman entropy coding instead of rANS.
Reuses existing stage_huffman_encode/decode and LzSeq demuxer.

Benchmarking showed LzSeqH decode is 24% slower than LzSeqR
(59 MB/s vs 77 MB/s) because Huffman's bit-level accumulator
is fundamentally slower than rANS's word-level multiply-add.
The LzSeq decoder itself (58% of total time) is the bottleneck,
not entropy coding. Kept as an available pipeline option since
it demonstrates the modular stage architecture.

- Pipeline enum variant, TryFrom, trial candidates, GPU options
- Entropy encode/decode dispatch in blocks.rs
- Demuxer mapping in demux.rs
- Stage dispatch + empty stream handling in stages.rs
- CLI argument parsing in pz.rs
- 6 new round-trip tests (empty, hello, repeating, binary,
  all_same, large)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Profiles LzSeqR, LzSeqH, and Deflate pipeline decode, then
isolates raw LzSeq decode (no entropy) and raw rANS vs Huffman
entropy decode separately. Uses Canterbury alice29.txt if
available, otherwise generates 150KB of mixed text+noise data.

Key findings from profiling:
- Raw LzSeq decode: 133 MB/s (58% of pipeline decode time)
- rANS decode: 118 MB/s per stream
- Huffman decode: 25 MB/s per stream (5x slower than rANS)
- The LzSeq decoder is the bottleneck, not entropy coding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit 3d53789 into master Feb 22, 2026
5 checks passed
@ChrisLundquist ChrisLundquist deleted the worktree-lzseq branch February 25, 2026 09:31
