perf: unified scheduler telemetry, backpressure, and hot-loop optimizations#103
Open
ChrisLundquist wants to merge 8 commits intomasterfrom
Open
perf: unified scheduler telemetry, backpressure, and hot-loop optimizations#103ChrisLundquist wants to merge 8 commits intomasterfrom
ChrisLundquist wants to merge 8 commits intomasterfrom
Conversation
Summary - Added unified scheduler telemetry in src/pipeline/parallel.rs with opt-in aggregation for: total_ns, stage_compute_ns, queue_wait_ns, queue_admin_ns, gpu_handoff_ns, and GPU handoff fallback counters. - Exposed public pipeline API in src/pipeline/mod.rs: set_unified_scheduler_stats_enabled, reset_unified_scheduler_stats, unified_scheduler_stats, plus UnifiedSchedulerStats type export. - Extended examples/profile.rs with --print-scheduler-stats and --gpu for full-pipeline profiling. Stats are emitted in a machine-readable SCHEDULER_STATS line for tooling. - Added scripts/perf-gate.sh to run a repeatable same-laptop matrix, compute medians, compare to baseline TSV, and fail on configured throughput/overhead regressions. - Captured initial baseline and passing run artifacts in docs/generated/perf-gate-baseline.tsv and docs/generated/2026-03-01-161450-perf-gate-run.tsv. - Added/updated plan docs in docs/exec-plans/active to track progress through phases and current constraints (single unified path, local validation). - Preserved CPU/GPU stage interchangeability and added explicit coverage via test_lzr_backend_assignments_are_interchangeable for backend mixes (cpu/cpu, gpu/cpu, cpu/gpu, auto/auto).
Summary - Added pointer-based SIMD compare entry point and switched LZ77 hot loops to pointer calls to reduce slice construction overhead in find_best/find_top_k. - Optimized FSE table-build/encode loops: direct symbol spreading, contiguous lookup range fill, and lower-bandwidth bit-chunk staging. - Optimized LzSeq repeat candidate selection by deduplicating repeated offsets and using cheaper repeat-match scan indexing. - Preserved unified scheduler architecture and CPU/GPU stage interchangeability (no alternate fast path introduced). Validation - cargo test lz77::tests::test_find_match_length_bounded_deflate - cargo test lz77::tests::test_compress_lazy_with_limit_round_trip_patterns - cargo test lz77::tests::test_lazy_round_trip_large - cargo test lz77::tests::test_lazy_quality_repeated_pattern - cargo test fse::tests::test_spread_counts_match - cargo test fse::tests::test_round_trip_medium - cargo test fse::tests::test_interleaved_medium - cargo test fse::tests::test_all_accuracy_logs - cargo test lzseq::tests::test_repeat_offsets_round_trip_various - cargo test lzseq::tests::test_repeat_matches_used_on_structured_data - cargo test lzseq::tests::test_round_trip_150kb_mixed - cargo test --features webgpu pipeline::parallel::tests::test_lzr_backend_assignments_are_interchangeable Perf notes - Full-pipeline same-laptop medians remain noisy; stage-mode comparison vs bde477f indicates strong fse encode gain and flat lz77 stage throughput in this environment.
Summary - Refactored unified scheduler completion paths (CPU workers and GPU completion helper) to use a single queue lock per finished task instead of a two-lock enqueue/decrement sequence. - Preserved pending-task invariants by treating non-final success as task replacement (pending unchanged) and final/error paths as task retirement (pending decremented). - Kept unified single-path execution and CPU/GPU stage interchangeability semantics intact; no separate fast path introduced. Validation - cargo test pipeline::parallel::tests - cargo test --features webgpu pipeline::parallel::tests::test_lzr_backend_assignments_are_interchangeable - Local A/B comparison vs c502fb5 (alternating order, 1MB, 20 iters, 3 repeats): deflate: overhead -0.0663 abs, throughput -0.92% median lzr: overhead -0.0401 abs, throughput -3.62% median lzf: overhead -0.0337 abs, throughput +0.91% median lzseqr: overhead +0.0032 abs, throughput -0.71% median
Summary - Increased unified scheduler GPU request channel depth from fixed 4 to adaptive min(num_blocks, 2*worker_count) clamped to 1..16 to reduce transient try_send(Full) fallbacks. - Reordered GPU coordinator servicing to prioritize StageN and fused requests before Stage0 batches so downstream continuations are less likely to wait behind fresh Stage0 bursts. - Preserved unified single-path scheduling semantics and CPU/GPU stage interchangeability; no alternate fast path added. Validation - cargo test pipeline::parallel::tests - cargo test pipeline::parallel::tests::test_channel_full_cpu_fallback - cargo test pipeline::parallel::tests::test_heterogeneous_compress_with_gpu_entropy - cargo test --features webgpu pipeline::parallel::tests::test_lzr_backend_assignments_are_interchangeable Notes - GPU device was unavailable in this environment for perf matrix execution; this commit focuses on scheduler-side fairness/backpressure behavior and correctness coverage.
Summary - Added backpressure-aware Auto stage1 entropy routing heuristic in unified scheduler using GPU request channel pressure signals. - Pressure score is updated from worker try_send outcomes (success/full/disconnected) and consulted only for BackendAssignment::Auto. - Explicit backend assignments remain strict: Cpu is never promoted, Gpu is never demoted by pressure. - Kept unified single-path scheduler and CPU/GPU stage interchangeability semantics intact. Validation - cargo test pipeline::parallel::tests - cargo test --features webgpu pipeline::parallel::tests::test_stage1_auto_backpressure_biases_to_cpu - cargo test --features webgpu pipeline::parallel::tests::test_stage1_backpressure_does_not_override_explicit_backend - cargo test --features webgpu pipeline::parallel::tests::test_lzr_backend_assignments_are_interchangeable - ./scripts/perf-gate.sh --cpu-only --pipelines deflate,lzr,lzf,lzseqr --repeats 3 --iterations 20 --size 1048576 (pass on rerun)
- keep per-block continuation state local in worker threads to avoid queue round-trips between intermediate stages - directly hand off StageN GPU entropy work from worker continuation path when routing stays on GPU - on StageGpu send backpressure/disconnect, keep StageN payload local for CPU fallback instead of restoring/re-taking intermediate slots - retain existing unified queue invariants for task retirement/close/failure handling and existing routing semantics
- extract shared queue completion state machine to reduce duplicated worker/GPU completion logic - document Stage0 GPU request asymmetry (block index only, coordinator reads shared blocks) - add PROFILE_STATS machine-readable throughput output in profile example - add --threads to profile and perf-gate to pin thread count for local runs - update perf-gate to parse throughput from PROFILE_STATS with fallback for older profile output
The previous commit rewrote the plan to a speculative pre-implementation state, erasing the status table, GPU rANS empirical findings (0.77x CPU encode, 0.54x decode), phase completion annotations, and recommended next actions. Restore the master version which accurately reflects what was tried, measured, and found wanting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Overhauls the unified scheduler for reduced lock contention and smarter GPU/CPU work distribution, adds opt-in telemetry instrumentation, and optimizes critical LZ/FSE hot loops. Introduces a local perf-gate script for regression-guarding throughput and scheduler overhead.
Unified scheduler improvements (parallel.rs)
Automode for stage1 GPU entropy now checks a pressure score before routing to GPU.try_send(Full)increments pressure by 2,Disconnectedby 1, and successful sends decrement by 1. When pressure exceedsworkers * 2, new blocks are routed to CPU. ExplicitGpu/Cpubackend assignments bypass backpressure.Vec<u8>clones for GPU requests:GpuRequest::Stage0andGpuRequest::Fusednow carry only a block index — the coordinator reads input data directly from the sharedblocksslice, avoiding allocation + copy per GPU-routed block.min(4, num_blocks)tomin(num_blocks, workers*2).clamp(1, 16)to reduce transienttry_send(Full)fallbacks without unbounded buffering.Telemetry infrastructure
UnifiedSchedulerStats: per-run timing for stage compute, queue wait, queue admin, GPU handoff, and try_send failure counters.AtomicBool(opt-in viaset_unified_scheduler_stats_enabled()) — zero overhead in production.SchedulerRunRecorderusesDropto flush per-invocation local atomics into global stats.--print-scheduler-statsflag inprofileexample emitsSCHEDULER_STATSandPROFILE_STATSmachine-readable lines.Hot-loop optimizations
spread_symbols: Eliminates intermediateVec<usize>position buffer — walks the permutation cycle in-place.build_encode_tables: Replaces skip/take loop withslice::fill()for range initialization.fse_encode_internal: SoA layout — splitsVec<(u32, u32)>into separatebit_values: Vec<u32>+bit_counts: Vec<u8>, reducing memory bandwidth in the write pass.find_best_match/find_top_k_matches: Raw pointer arithmetic (compare_bytes_ptr) to skip repeated slice construction in the inner match-finding loop. Local variable hoisting forprev/window_mask.select_best_match: Deduplicates repeat-offset comparisons when offsets are equal (common at stream start, saves up to 2 redundant calls).compare_bytes_ptrunsafe wrapper for callers with pre-validated pointer bounds.Perf gate script
scripts/perf-gate.sh: Runs a fixed matrix of pipeline/mode combinations, collects median throughput and scheduler overhead across N repeats, and compares against a baseline TSV.--throughput-regression-pct(default 4%),--overhead-regression-abs(default 0.02).--threads,--cpu-only,--update-baseline, and WebGPU auto-detection.Docs
gpu-strategy.mdand 6 stale feedback filesTest plan
./scripts/test.sh --quick— fmt, clippy, all tests pass./scripts/test.sh --all— all feature combinations pass./scripts/perf-gate.sh --update-baseline— establish baseline./scripts/perf-gate.sh— verify no regressions from baseline./examples/profile --pipeline lzr --print-scheduler-statsshowsSCHEDULER_STATSandPROFILE_STATSlinestest_lzr_backend_assignments_are_interchangeablecovers cpu/cpu, gpu/cpu, cpu/gpu, auto/autotest_stage1_auto_backpressure_biases_to_cpu,test_stage1_backpressure_does_not_override_explicit_backend🤖 Generated with Claude Code