perf: unified scheduler telemetry, backpressure, and hot-loop optimizations by ChrisLundquist · Pull Request #103 · ChrisLundquist/libpz

ChrisLundquist · 2026-03-02T02:43:55Z

Summary

Overhauls the unified scheduler for reduced lock contention and smarter GPU/CPU work distribution, adds opt-in telemetry instrumentation, and optimizes critical LZ/FSE hot loops. Introduces a local perf-gate script for regression-guarding throughput and scheduler overhead.

Unified scheduler improvements (parallel.rs)

Reduced queue-lock churn: Workers now prepare the next task before re-acquiring the queue lock, performing a single lock/unlock for both task retirement and next-task enqueue. Previously, intermediate-stage completions held the lock across two separate critical sections (retire + enqueue), causing convoy effects under high worker counts.
GPU coordinator fairness reordering: The coordinator now processes StageN completions first, then Fused requests, then Stage0 last. Processing downstream continuations first reduces in-flight work and avoids starving completions when bursts arrive together.
Backpressure-aware Auto routing: Auto mode for stage1 GPU entropy now checks a pressure score before routing to GPU. try_send(Full) increments pressure by 2, Disconnected by 1, and successful sends decrement by 1. When pressure exceeds workers * 2, new blocks are routed to CPU. Explicit Gpu/Cpu backend assignments bypass backpressure.
Eliminated per-block Vec<u8> clones for GPU requests: GpuRequest::Stage0 and GpuRequest::Fused now carry only a block index — the coordinator reads input data directly from the shared blocks slice, avoiding allocation + copy per GPU-routed block.
Wider sync channel: Channel depth changed from min(4, num_blocks) to min(num_blocks, workers*2).clamp(1, 16) to reduce transient try_send(Full) fallbacks without unbounded buffering.
Localized worker stage continuations: Per-block continuation state stays local in worker threads to avoid queue round-trips between intermediate stages, with direct GPU handoff when routing stays on GPU.
Extracted shared completion state machine: The duplicated worker/GPU completion logic is now a single shared helper, reducing maintenance surface.

Telemetry infrastructure

UnifiedSchedulerStats: per-run timing for stage compute, queue wait, queue admin, GPU handoff, and try_send failure counters.
Collection gated behind AtomicBool (opt-in via set_unified_scheduler_stats_enabled()) — zero overhead in production.
SchedulerRunRecorder uses Drop to flush per-invocation local atomics into global stats.
--print-scheduler-stats flag in profile example emits SCHEDULER_STATS and PROFILE_STATS machine-readable lines.

Hot-loop optimizations

FSE spread_symbols: Eliminates intermediate Vec<usize> position buffer — walks the permutation cycle in-place.
FSE build_encode_tables: Replaces skip/take loop with slice::fill() for range initialization.
FSE fse_encode_internal: SoA layout — splits Vec<(u32, u32)> into separate bit_values: Vec<u32> + bit_counts: Vec<u8>, reducing memory bandwidth in the write pass.
LZ77 find_best_match / find_top_k_matches: Raw pointer arithmetic (compare_bytes_ptr) to skip repeated slice construction in the inner match-finding loop. Local variable hoisting for prev/window_mask.
LzSeq select_best_match: Deduplicates repeat-offset comparisons when offsets are equal (common at stream start, saves up to 2 redundant calls).
simd.rs: New compare_bytes_ptr unsafe wrapper for callers with pre-validated pointer bounds.

Perf gate script

scripts/perf-gate.sh: Runs a fixed matrix of pipeline/mode combinations, collects median throughput and scheduler overhead across N repeats, and compares against a baseline TSV.
Configurable thresholds: --throughput-regression-pct (default 4%), --overhead-regression-abs (default 0.02).
Supports --threads, --cpu-only, --update-baseline, and WebGPU auto-detection.

Docs

Deletes superseded gpu-strategy.md and 6 stale feedback files
Updates unified scheduler north-star plan
Adds perf-validation exec plan

Test plan

./scripts/test.sh --quick — fmt, clippy, all tests pass
./scripts/test.sh --all — all feature combinations pass
./scripts/perf-gate.sh --update-baseline — establish baseline
./scripts/perf-gate.sh — verify no regressions from baseline
Manual: ./examples/profile --pipeline lzr --print-scheduler-stats shows SCHEDULER_STATS and PROFILE_STATS lines
WebGPU round-trip: test_lzr_backend_assignments_are_interchangeable covers cpu/cpu, gpu/cpu, cpu/gpu, auto/auto
Backpressure unit tests: test_stage1_auto_backpressure_biases_to_cpu, test_stage1_backpressure_does_not_override_explicit_backend

🤖 Generated with Claude Code

Summary - Added unified scheduler telemetry in src/pipeline/parallel.rs with opt-in aggregation for: total_ns, stage_compute_ns, queue_wait_ns, queue_admin_ns, gpu_handoff_ns, and GPU handoff fallback counters. - Exposed public pipeline API in src/pipeline/mod.rs: set_unified_scheduler_stats_enabled, reset_unified_scheduler_stats, unified_scheduler_stats, plus UnifiedSchedulerStats type export. - Extended examples/profile.rs with --print-scheduler-stats and --gpu for full-pipeline profiling. Stats are emitted in a machine-readable SCHEDULER_STATS line for tooling. - Added scripts/perf-gate.sh to run a repeatable same-laptop matrix, compute medians, compare to baseline TSV, and fail on configured throughput/overhead regressions. - Captured initial baseline and passing run artifacts in docs/generated/perf-gate-baseline.tsv and docs/generated/2026-03-01-161450-perf-gate-run.tsv. - Added/updated plan docs in docs/exec-plans/active to track progress through phases and current constraints (single unified path, local validation). - Preserved CPU/GPU stage interchangeability and added explicit coverage via test_lzr_backend_assignments_are_interchangeable for backend mixes (cpu/cpu, gpu/cpu, cpu/gpu, auto/auto).

Summary - Added pointer-based SIMD compare entry point and switched LZ77 hot loops to pointer calls to reduce slice construction overhead in find_best/find_top_k. - Optimized FSE table-build/encode loops: direct symbol spreading, contiguous lookup range fill, and lower-bandwidth bit-chunk staging. - Optimized LzSeq repeat candidate selection by deduplicating repeated offsets and using cheaper repeat-match scan indexing. - Preserved unified scheduler architecture and CPU/GPU stage interchangeability (no alternate fast path introduced). Validation - cargo test lz77::tests::test_find_match_length_bounded_deflate - cargo test lz77::tests::test_compress_lazy_with_limit_round_trip_patterns - cargo test lz77::tests::test_lazy_round_trip_large - cargo test lz77::tests::test_lazy_quality_repeated_pattern - cargo test fse::tests::test_spread_counts_match - cargo test fse::tests::test_round_trip_medium - cargo test fse::tests::test_interleaved_medium - cargo test fse::tests::test_all_accuracy_logs - cargo test lzseq::tests::test_repeat_offsets_round_trip_various - cargo test lzseq::tests::test_repeat_matches_used_on_structured_data - cargo test lzseq::tests::test_round_trip_150kb_mixed - cargo test --features webgpu pipeline::parallel::tests::test_lzr_backend_assignments_are_interchangeable Perf notes - Full-pipeline same-laptop medians remain noisy; stage-mode comparison vs bde477f indicates strong fse encode gain and flat lz77 stage throughput in this environment.

Summary - Refactored unified scheduler completion paths (CPU workers and GPU completion helper) to use a single queue lock per finished task instead of a two-lock enqueue/decrement sequence. - Preserved pending-task invariants by treating non-final success as task replacement (pending unchanged) and final/error paths as task retirement (pending decremented). - Kept unified single-path execution and CPU/GPU stage interchangeability semantics intact; no separate fast path introduced. Validation - cargo test pipeline::parallel::tests - cargo test --features webgpu pipeline::parallel::tests::test_lzr_backend_assignments_are_interchangeable - Local A/B comparison vs c502fb5 (alternating order, 1MB, 20 iters, 3 repeats): deflate: overhead -0.0663 abs, throughput -0.92% median lzr: overhead -0.0401 abs, throughput -3.62% median lzf: overhead -0.0337 abs, throughput +0.91% median lzseqr: overhead +0.0032 abs, throughput -0.71% median

Summary - Increased unified scheduler GPU request channel depth from fixed 4 to adaptive min(num_blocks, 2*worker_count) clamped to 1..16 to reduce transient try_send(Full) fallbacks. - Reordered GPU coordinator servicing to prioritize StageN and fused requests before Stage0 batches so downstream continuations are less likely to wait behind fresh Stage0 bursts. - Preserved unified single-path scheduling semantics and CPU/GPU stage interchangeability; no alternate fast path added. Validation - cargo test pipeline::parallel::tests - cargo test pipeline::parallel::tests::test_channel_full_cpu_fallback - cargo test pipeline::parallel::tests::test_heterogeneous_compress_with_gpu_entropy - cargo test --features webgpu pipeline::parallel::tests::test_lzr_backend_assignments_are_interchangeable Notes - GPU device was unavailable in this environment for perf matrix execution; this commit focuses on scheduler-side fairness/backpressure behavior and correctness coverage.

Summary - Added backpressure-aware Auto stage1 entropy routing heuristic in unified scheduler using GPU request channel pressure signals. - Pressure score is updated from worker try_send outcomes (success/full/disconnected) and consulted only for BackendAssignment::Auto. - Explicit backend assignments remain strict: Cpu is never promoted, Gpu is never demoted by pressure. - Kept unified single-path scheduler and CPU/GPU stage interchangeability semantics intact. Validation - cargo test pipeline::parallel::tests - cargo test --features webgpu pipeline::parallel::tests::test_stage1_auto_backpressure_biases_to_cpu - cargo test --features webgpu pipeline::parallel::tests::test_stage1_backpressure_does_not_override_explicit_backend - cargo test --features webgpu pipeline::parallel::tests::test_lzr_backend_assignments_are_interchangeable - ./scripts/perf-gate.sh --cpu-only --pipelines deflate,lzr,lzf,lzseqr --repeats 3 --iterations 20 --size 1048576 (pass on rerun)

- keep per-block continuation state local in worker threads to avoid queue round-trips between intermediate stages - directly hand off StageN GPU entropy work from worker continuation path when routing stays on GPU - on StageGpu send backpressure/disconnect, keep StageN payload local for CPU fallback instead of restoring/re-taking intermediate slots - retain existing unified queue invariants for task retirement/close/failure handling and existing routing semantics

- extract shared queue completion state machine to reduce duplicated worker/GPU completion logic - document Stage0 GPU request asymmetry (block index only, coordinator reads shared blocks) - add PROFILE_STATS machine-readable throughput output in profile example - add --threads to profile and perf-gate to pin thread count for local runs - update perf-gate to parse throughput from PROFILE_STATS with fallback for older profile output

The previous commit rewrote the plan to a speculative pre-implementation state, erasing the status table, GPU rANS empirical findings (0.77x CPU encode, 0.54x decode), phase completion annotations, and recommended next actions. Restore the master version which accurately reflects what was tried, measured, and found wanting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Chris Lundquist and others added 8 commits March 1, 2026 16:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: unified scheduler telemetry, backpressure, and hot-loop optimizations#103

perf: unified scheduler telemetry, backpressure, and hot-loop optimizations#103
ChrisLundquist wants to merge 8 commits intomasterfrom
codex/unified-scheduler-perf-plan

ChrisLundquist commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChrisLundquist commented Mar 2, 2026

Summary

Unified scheduler improvements (parallel.rs)

Telemetry infrastructure

Hot-loop optimizations

Perf gate script

Docs

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant