labeille's benchmarking system measures whole test suite execution time across
different conditions — JIT vs no-JIT, different interpreters, with/without coverage,
varying resource constraints. It is not a microbenchmark tool: it runs the same test
suites you use with `labeille run`, collects wall/user/sys time and peak RSS for each
iteration, and produces statistical comparisons.
This answers questions like:
- How much overhead does the JIT add to test suite execution?
- Is a new CPython build faster or slower than the previous one?
- Which packages show the most JIT overhead?
- Are there performance regressions over time?
Compare JIT-enabled vs JIT-disabled with two inline conditions:

```bash
labeille bench run \
  --condition "jit:target_python=/opt/cpython-jit/python,env.PYTHON_JIT=1" \
  --condition "nojit:target_python=/opt/cpython-jit/python,env.PYTHON_JIT=0" \
  --work-dir ~/bench-work \
  --packages requests,click,flask
```

For a fast sanity check — 3 iterations, no warmup, top 20 packages:
```bash
labeille bench run --quick \
  --condition "jit:target_python=/opt/cpython-jit/python,env.PYTHON_JIT=1" \
  --condition "nojit:target_python=/opt/cpython-jit/python,env.PYTHON_JIT=0" \
  --work-dir ~/bench-work
```

Or run from a profile:

```bash
labeille bench run --profile jit-overhead.yaml \
  --work-dir ~/bench-work
```

```bash
# Display results
labeille bench show results/bench_20260303_140000

# Compare conditions within a run
labeille bench compare results/bench_20260303_140000

# Compare across runs
labeille bench compare results/bench_run1 results/bench_run2
```

A condition is a named configuration that defines how to run tests. Each benchmark compares one or more conditions. A condition can specify:
- `target_python` — path to the Python interpreter
- `env` — environment variables (e.g., `PYTHON_JIT=1`)
- `extra_deps` — additional packages to install
- `test_command_prefix` — prepend to test commands (e.g., `coverage run -m`)
- `test_command_suffix` — append to test commands
- `test_command_override` — replace test commands entirely
- `install_command` — override install commands
- `constraints` — resource limits (memory, CPU affinity, CPU time)
Each package's test suite runs multiple times per condition:
- Warmup iterations (default: 1) — not included in statistics, allows caches and JIT compilation to stabilize
- Measured iterations (default: 5, minimum: 3) — collected for statistical analysis
More iterations improve statistical confidence but increase total runtime linearly.
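The per-condition statistics reported later (median, IQR, coefficient of variation) fall out of the measured iterations directly. A minimal sketch of that arithmetic, not labeille's actual implementation:

```python
import statistics

def describe(timings: list[float]) -> dict:
    """Summarize measured iterations: median, IQR, coefficient of variation."""
    q1, med, q3 = statistics.quantiles(timings, n=4)  # quartile cut points
    return {
        "median": med,
        "iqr": q3 - q1,  # spread of the middle half, robust to outliers
        # CV = stdev / mean: scale-free noise measure, comparable across packages
        "cv": statistics.stdev(timings) / statistics.mean(timings),
    }

# Five measured iterations, one of them slow:
stats = describe([12.1, 11.9, 12.3, 12.0, 15.8])
```

With only three iterations (the minimum) the quartiles are barely meaningful, which is why extra iterations buy real statistical confidence.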
When comparing multiple conditions:
- Alternating (default for multi-condition): runs condition A then B for package 1, then A then B for package 2, etc. Reduces systematic bias from time-varying factors (thermal throttling, background load).
- Block: runs all iterations of condition A for all packages, then all of condition B. Faster but more susceptible to systematic bias.
- Interleaved (`--interleave`): interleaves packages across iterations. Useful when you want to distribute cache/memory effects across the run.
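The difference between the first two strategies is easiest to see as the order of (package, condition) pairs they produce. A toy illustration (function names hypothetical):

```python
from itertools import product

def alternating(packages, conditions):
    """Package-major: run every condition for a package before the next package."""
    return [(p, c) for p in packages for c in conditions]

def block(packages, conditions):
    """Condition-major: finish one condition for all packages, then the next."""
    return [(p, c) for c, p in product(conditions, packages)]

pkgs, conds = ["requests", "click"], ["jit", "nojit"]
# alternating: requests/jit, requests/nojit, click/jit, click/nojit
# block:       requests/jit, click/jit, requests/nojit, click/nojit
```

Because alternating keeps the two conditions close together in time, slow drift (thermal throttling, background load) hits both roughly equally instead of biasing one.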
A profile defines conditions and shared settings in a YAML file:

```yaml
name: JIT overhead measurement
description: Compare JIT-enabled vs JIT-disabled CPython
iterations: 7
warmup: 2
timeout: 600
conditions:
  jit:
    target_python: /opt/cpython-jit/python
    env:
      PYTHON_JIT: "1"
  nojit:
    target_python: /opt/cpython-jit/python
    env:
      PYTHON_JIT: "0"
# Shared settings applied to all conditions unless overridden
default_env:
  ASAN_OPTIONS: "detect_leaks=0"
default_extra_deps:
  - pytest-timeout
# Optional: package filtering
packages:
  - requests
  - click
  - flask
# Optional: resource constraints applied to all conditions
default_constraints:
  cpu_affinity: "0,1"
  memory_limit_mb: 4096
```

The `--condition` flag uses the format `name:key=value,key=value`:

```bash
--condition "jit:target_python=/opt/python,env.PYTHON_JIT=1"
--condition "nojit:target_python=/opt/python,env.PYTHON_JIT=0"
```

Supported keys: `target_python`, `env.KEY`, `extra_deps`, `test_command_prefix`, `test_command_suffix`, `test_command_override`, `install_command`.
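To illustrate the shape of this mini-format, a hypothetical parser sketch (labeille's real parsing, quoting, and validation may differ):

```python
def parse_condition(spec: str) -> tuple[str, dict]:
    """Parse 'name:key=value,key=value' into (name, settings)."""
    name, _, rest = spec.partition(":")
    settings: dict = {}
    for pair in filter(None, rest.split(",")):
        key, _, value = pair.partition("=")
        if key.startswith("env."):
            # env.KEY entries collect into a nested environment mapping
            settings.setdefault("env", {})[key[4:]] = value
        else:
            settings[key] = value
    return name, settings

name, cfg = parse_condition("jit:target_python=/opt/python,env.PYTHON_JIT=1")
```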
For reliable results, ensure the system is quiet:

```bash
# Check stability before starting (warns if load is high)
labeille bench run --check-stability --profile profile.yaml ...

# Wait for system to stabilize before starting
labeille bench run --wait-for-stability --profile profile.yaml ...
```

Package selection uses the same options as `labeille run`:

```bash
--packages requests,click   # Specific packages
--top 50                    # Top N by downloads
```

Reuse repos and venvs across benchmark runs:

```bash
--work-dir ~/bench-work   # Sets both repos-dir and venvs-dir
--repos-dir ~/repos       # Or set individually
--venvs-dir ~/venvs
```

Control resource usage per iteration:

```bash
# Memory limit (ulimit -v)
labeille bench run --memory-limit 4096 ...

# CPU affinity (taskset) — pin to specific cores
labeille bench run --cpu-affinity "0,1" ...

# CPU time limit (ulimit -t)
labeille bench run --cpu-time-limit 300 ...
```

Constraints can also be set per-condition in a YAML profile.
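These constraints map onto standard POSIX mechanisms. One way a runner could apply them inside a child process, shown as a Linux-oriented sketch (not labeille's actual code):

```python
import os
import resource

def apply_constraints(memory_limit_mb=None, cpu_time_limit=None, cpu_affinity=None):
    """Apply benchmark resource constraints; intended for a subprocess pre-exec hook."""
    if memory_limit_mb is not None:
        limit = memory_limit_mb * 1024 * 1024
        # Equivalent of `ulimit -v`: cap the virtual address space
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    if cpu_time_limit is not None:
        # Equivalent of `ulimit -t`: cap CPU seconds (kernel sends SIGXCPU, then SIGKILL)
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_time_limit, cpu_time_limit))
    if cpu_affinity is not None:
        # Equivalent of `taskset`: pin the process to specific cores (Linux-only call)
        os.sched_setaffinity(0, {int(c) for c in cpu_affinity.split(",")})
```

Applying the limits in the child rather than the parent keeps the benchmark harness itself unconstrained.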
Capture individual test timings via pytest's `--durations=0`:

```bash
labeille bench run --per-test-timing --profile profile.yaml ...
```

This enables `--per-test` in `bench show` and `bench compare` to identify which specific tests contribute most to overhead.

For cold-start benchmarks, drop filesystem caches between iterations:

```bash
# First: set up the cache-drop helper (requires sudo configuration)
labeille bench setup-cache-drop

# Then run with cache dropping
labeille bench run --drop-caches --profile profile.yaml ...

# Or compare warm vs cold automatically
labeille bench run --warm-vs-cold --profile profile.yaml ...
```

Display results from a benchmark run:

```bash
labeille bench show results/bench_20260303_140000
```

Shows the system profile, the conditions defined, and a per-package table with median wall time, IQR, coefficient of variation, and status for each condition.
Flag measurement anomalies — high variance, bimodal distributions, outliers:

```bash
labeille bench show results/bench_20260303_140000 --anomalies
```

Anomaly types: `high_cv` (coefficient of variation above the threshold), `bimodal` (suspected multimodal distribution), `outlier_heavy` (many outlier iterations), `status_mixed` (some iterations pass, some fail), `trend` (monotonic drift).
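Two of these checks reduce to simple quartile arithmetic. A hypothetical sketch, with thresholds chosen for illustration rather than matching labeille's defaults:

```python
import statistics

def find_anomalies(timings, cv_threshold=0.05, max_outliers=1):
    """Flag iteration sets that look too noisy to trust."""
    anomalies = []
    if statistics.stdev(timings) / statistics.mean(timings) > cv_threshold:
        anomalies.append("high_cv")
    q1, _, q3 = statistics.quantiles(timings, n=4)
    iqr = q3 - q1
    # Tukey's rule: anything beyond 1.5 * IQR outside the quartiles is an outlier
    outliers = [t for t in timings if t < q1 - 1.5 * iqr or t > q3 + 1.5 * iqr]
    if len(outliers) > max_outliers:
        anomalies.append("outlier_heavy")
    return anomalies
```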
Show individual test timings for a specific package:

```bash
labeille bench show results/bench_20260303_140000 --per-test requests
```

Compare conditions defined in the same benchmark:

```bash
labeille bench compare results/bench_20260303_140000
```

Shows the overhead percentage, confidence intervals, and statistical significance (Welch's t-test) for each package.
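Welch's t-test makes no equal-variance assumption, which suits two conditions with different runtime characteristics. Its statistic, plus the overhead figure, in a standard-library sketch (labeille may compute these differently internally):

```python
import math
import statistics

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic: mean difference scaled by unpooled standard errors."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)

def overhead_pct(measured: list[float], baseline: list[float]) -> float:
    """Overhead of `measured` relative to `baseline`, as a percent of its median."""
    base = statistics.median(baseline)
    return 100.0 * (statistics.median(measured) - base) / base

# Hypothetical wall times for five iterations under each condition:
jit = [13.2, 13.4, 13.1, 13.3, 13.2]
nojit = [12.0, 12.1, 11.9, 12.0, 12.2]
```

Turning the t statistic into a p-value requires the Welch–Satterthwaite degrees of freedom; SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` handles that end to end if it is available.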
Compare results from different benchmark executions:

```bash
labeille bench compare results/bench_run1 results/bench_run2
```

Identify which tests contribute most to overhead:

```bash
labeille bench compare results/bench_20260303_140000 --per-test requests
```

Select the metric to compare:

```bash
--metric wall   # Wall clock time (default)
--metric cpu    # User + sys CPU time
--metric rss    # Peak resident set size
```

Track benchmark performance over time with tracking series.
```bash
# Initialize a series
labeille bench track init jit-perf --description "JIT performance over CPython commits"

# Add a run to the series
labeille bench track add jit-perf results/bench_20260303_140000 \
  --notes "CPython main @ abc1234" \
  --commit sha=abc1234,branch=main

# Show the series
labeille bench track show jit-perf            # All runs
labeille bench track show jit-perf --last 5   # Last 5 runs
```

Pin a specific run as the reference point for trend analysis:

```bash
labeille bench track pin jit-perf bench_20260301_100000
labeille bench track unpin jit-perf   # Remove pin
```

Detect performance trends and regressions across the series:
```bash
labeille bench track trend jit-perf
labeille bench track trend jit-perf --condition jit --format markdown
```

Classifies each package as stable, improving, regressing, or volatile. Thresholds are configurable:

- `--regression-threshold 0.02` — per-run change threshold (fraction)
- `--trend-threshold 0.05` — overall slope threshold for classification
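A rough mental model of the classification, using the two thresholds above (the exact rules labeille applies are not specified here):

```python
def classify(medians: list[float], regression_threshold=0.02, trend_threshold=0.05) -> str:
    """Classify a series of per-run median timings for one package."""
    # Fractional change between consecutive runs
    changes = [(b - a) / a for a, b in zip(medians, medians[1:])]
    big = any(abs(c) > regression_threshold for c in changes)
    monotone = all(c > 0 for c in changes) or all(c < 0 for c in changes)
    if big and not monotone:
        return "volatile"  # large swings in both directions
    overall = (medians[-1] - medians[0]) / medians[0]  # slope proxy over the series
    if overall > trend_threshold:
        return "regressing"
    if overall < -trend_threshold:
        return "improving"
    return "stable"
```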
Check for new regressions compared to the baseline and previous run:

```bash
labeille bench track alert jit-perf
```

List all tracking series:

```bash
labeille bench track list
```

One row per package per condition per iteration — for pandas, R, or spreadsheets:

```bash
labeille bench export results/bench_20260303_140000 --format csv
labeille bench export results/bench_20260303_140000 --format csv -o data.csv
```

One row per package per condition with aggregated statistics:

```bash
labeille bench export results/bench_20260303_140000 --format csv-summary
```

Summary table suitable for GitHub issues and reports:

```bash
labeille bench export results/bench_20260303_140000 --format markdown
```

A benchmark run produces:
```
results/bench_20260303_140000/
├── bench_meta.json       # System profile, Python profile, conditions, timing
└── bench_results.jsonl   # One JSON line per package with per-condition data
```
`bench_meta.json` contains:
- System characterization (CPU, RAM, OS, kernel)
- Python profile for each condition (version, JIT status, GIL, build flags)
- Condition definitions as resolved
- Execution timestamps
`bench_results.jsonl` contains per-package:
- Per-condition iteration timings (wall, user, sys, RSS)
- Descriptive statistics (mean, median, std, percentiles, IQR, CV)
- Outlier flags
- Per-test timings (if `--per-test-timing` was used)
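Because the file is JSON Lines, downstream analysis needs only the standard library. A loading sketch (the field names in the comment are illustrative, not a schema guarantee):

```python
import json

def load_results(path: str) -> list[dict]:
    """Read one JSON record per package from a bench_results.jsonl file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical field access; consult an actual file for the real schema:
# for rec in load_results("results/bench_20260303_140000/bench_results.jsonl"):
#     print(rec["package"])
```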
Print system characterization for documentation and reproducibility:

```bash
labeille bench system
labeille bench system --target-python /opt/cpython/python
labeille bench system --json
```
```bash
# Create a profile
cat > jit-profile.yaml << 'EOF'
name: JIT overhead
iterations: 7
warmup: 2
conditions:
  jit:
    target_python: /opt/cpython-jit/python
    env: { PYTHON_JIT: "1" }
  nojit:
    target_python: /opt/cpython-jit/python
    env: { PYTHON_JIT: "0" }
EOF

# Run the benchmark
labeille bench run --profile jit-profile.yaml \
  --work-dir ~/bench-work --top 30

# View results
labeille bench show results/bench_*

# Compare conditions
labeille bench compare results/bench_*
```
```bash
# Initialize a series
labeille bench track init jit-tracking -d "Track JIT overhead across CPython commits"

# After each CPython build, run and add:
labeille bench run --profile jit-profile.yaml --work-dir ~/bench
labeille bench track add jit-tracking results/bench_* --commit sha=$(git -C ~/cpython rev-parse HEAD)

# Check for regressions
labeille bench track trend jit-tracking
labeille bench track alert jit-tracking
```

For reliable measurements:

- Close other applications and background processes
- Use `--check-stability` or `--wait-for-stability`
- Pin CPU cores with `--cpu-affinity` to avoid migration
- Check for thermal throttling (sustained heavy loads)
- Increase `--iterations` for better statistical confidence
Flaky tests pollute timing data. Use `--anomalies` with `bench show` to identify packages with `status_mixed` anomalies. Consider adding `--test-command-suffix "-k 'not flaky_test'"` to exclude known flaky tests.
If runs hit memory limits or the OOM killer:

- ASAN-enabled builds use ~2-3x memory; increase `--memory-limit` or use a non-ASAN build
- Reduce `--workers` if running other processes
- Check the `oom_detected` field in results for confirmation