This document explains how to run and interpret typegrade benchmarks.
Score all packages in the train corpus and evaluate assertions:

```sh
pnpm benchmark:train
```

This scores every package in `benchmarks/manifest.train.json`, runs pairwise assertions from `benchmarks/assertions.ts`, and reports pass/fail results.
Score the holdout corpus (reserved packages not used for tuning):

```sh
pnpm benchmark:holdout
```

Run the calibration tool for detailed diagnostics:

```sh
pnpm benchmark:calibrate
```

This produces:
- Sorted scores by `consumerApi`, `agentReadiness`, and `typeSafety`.
- Per-assertion pass/fail with deltas and `minDelta` enforcement.
- Ranking loss calculation: failed / evaluated.
- Per-dimension concordance analysis.
- Tie analysis (packages with delta < 2).
- False equivalence analysis (different-tier packages with delta < 3).
- Margin analysis (must-pass assertions with delta < 5).
- Weight sensitivity analysis (simulates +/-20% perturbation).
- Delta histogram by assertion class.
- Top misranked packages.
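The tie, false-equivalence, and margin checks above all classify a score delta against fixed thresholds. A minimal sketch of that classification, assuming a hypothetical `deltaFlags` helper (the thresholds 2, 3, and 5 come from the list above; the function and parameter names are illustrative, not the real calibration code):

```typescript
// Illustrative delta-band classification mirroring the diagnostics above.
// Thresholds: tie < 2, false equivalence < 3 (different tiers only),
// thin margin < 5 (must-pass assertions only).
function deltaFlags(delta: number, sameTier: boolean, mustPass: boolean): string[] {
  const flags: string[] = [];
  if (delta < 2) flags.push("tie");
  if (!sameTier && delta < 3) flags.push("false-equivalence");
  if (mustPass && delta < 5) flags.push("thin-margin");
  return flags;
}
```

A single pair can trip several flags at once (for example, a must-pass assertion across tiers with delta 1.5), which is why the calibration output reports each analysis separately.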
Search for better weights using train data only:

```sh
pnpm benchmark:optimize
```

Run the gate checks:

```sh
pnpm gate:train    # Run train gate (must-pass assertions + ranking loss)
pnpm gate:holdout  # Run holdout gate (reserved validation set)
pnpm gate:eval     # Run eval gate (main branch only — Pareto, seed robustness, drift)
pnpm gate:shadow   # Run shadow gate (judge-only — random npm package generalization)
```

The benchmark corpus is split to prevent overfitting:
- Train: packages used for weight tuning and calibration. Builder agents may read and modify.
- Eval-fixed: stable set for generalization testing. Builder agents must NOT read.
- Eval-pool: large pool for stratified sampling. Builder agents must NOT read.
- Holdout: reserved for final validation. Builder agents may read results but must not tune against them.
See Benchmark Policy for quarantine rules.
Benchmark results are stored in split-specific subdirectories to prevent cross-contamination:

- Train results: `benchmarks/results/train/`
- Holdout results: `benchmarks/results/holdout/`
- Eval results: `benchmarks-output/eval-raw/` (judge-only, never in `benchmarks/results/`)
- Shadow results: `benchmarks-output/shadow-raw/` (judge-only, never in `benchmarks/results/`)
Each assertion is a pairwise ranking claim: "package A should score higher than package B on composite X."

- PASS: the assertion holds, with delta meeting any `minDelta` requirement.
- FAIL: the higher-expected package scored lower.
- MARGIN: the assertion direction is correct, but delta is below `minDelta`.

```
ranking loss = failed assertions / evaluated assertions
```

Target: < 5%. Above this, calibration will suggest weight adjustments.
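The outcome rules and the loss formula above can be sketched as follows. This is an illustrative reimplementation, not the actual code in `benchmarks/assertions.ts`; the type and function names are made up:

```typescript
// Sketch of pairwise-assertion classification and ranking loss.
type Outcome = "PASS" | "FAIL" | "MARGIN";

interface PairwiseCheck {
  scoreA: number;    // package expected to rank higher
  scoreB: number;    // package expected to rank lower
  minDelta?: number; // optional required margin
}

function classify(c: PairwiseCheck): Outcome {
  const delta = c.scoreA - c.scoreB;
  if (delta <= 0) return "FAIL"; // ranking direction is wrong
  if (c.minDelta !== undefined && delta < c.minDelta) return "MARGIN";
  return "PASS"; // direction holds and margin (if any) is met
}

// ranking loss = failed / evaluated; MARGIN counts as evaluated, not failed.
function rankingLoss(checks: PairwiseCheck[]): number {
  const failed = checks.filter((c) => classify(c) === "FAIL").length;
  return checks.length === 0 ? 0 : failed / checks.length;
}
```

Note that under this reading, MARGIN results leave the ranking loss unchanged: only outright direction flips count toward the < 5% target.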
Benchmarks can show:
- Relative ranking correctness (does a type-safe library outscore a loosely-typed one?).
- Weight stability (do small weight changes flip important assertions?).
- Cross-version consistency (does a library's score track expected improvements?).
Benchmarks cannot show:
- Absolute quality (a score of 70 does not mean "70% good").
- Future performance (scores depend on the current analyzer implementations).
- Universal ordering (different composites may rank the same libraries differently, and that's correct).
Some benchmark packages may be undersampled (few public declarations, fallback glob resolution). Check `coverageDiagnostics` and `confidenceSummary` in benchmark output:
- Undersampled packages get confidence caps and should not be used as strong ranking anchors.
- Fallback glob packages have confidence capped at 0.55.
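A minimal sketch of how such a confidence cap could be applied. Only the 0.55 glob cap is stated in the text above; the field names and the 0.7 undersampled cap are assumptions for illustration:

```typescript
// Hypothetical confidence-cap logic; field names are illustrative.
interface CoverageDiagnostics {
  undersampled: boolean;     // few public declarations observed
  usedFallbackGlob: boolean; // types resolved via fallback glob
}

function cappedConfidence(raw: number, diag: CoverageDiagnostics): number {
  let cap = 1.0;
  if (diag.usedFallbackGlob) cap = Math.min(cap, 0.55); // documented glob cap
  if (diag.undersampled) cap = Math.min(cap, 0.7);      // illustrative value only
  return Math.min(raw, cap);
}
```

The cap never raises a low raw confidence; it only prevents an undersampled package from presenting as a strong ranking anchor.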
Two additional gates validate typegrade's own output quality:
The first runs typegrade on its own codebase and checks that minimum scores are met:

- `consumerApi` >= 40
- `typeSafety` >= 40
This ensures the tool's own type quality does not regress below acceptable thresholds.
The second validates that the self-analyze agent report has a consistent structure, checking that the report contains:

- `issues`: detected type quality issues
- `fixBatches`: grouped fix recommendations
- `stopConditions`: criteria for halting the agent loop
- `verificationSteps`: steps to verify applied fixes
This gate prevents structural regressions in the agent-facing output format.
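A structural check in this spirit can be sketched as below. The required key names come from the list above; the report shape (array-valued fields) and the helper name are assumptions, not the real gate implementation:

```typescript
// Return the required report keys that are missing or malformed.
// Assumes each required field is an array; that shape is an assumption.
const REQUIRED_KEYS = ["issues", "fixBatches", "stopConditions", "verificationSteps"] as const;

function missingReportKeys(report: Record<string, unknown>): string[] {
  // An array-valued field counts as present; anything else fails the check.
  return REQUIRED_KEYS.filter((k) => !Array.isArray(report[k]));
}
```

A gate like this catches renamed or dropped fields before agents consume the report, without asserting anything about the report's contents.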
The shadow validation track (`pnpm benchmark:shadow` / `pnpm gate:shadow`) measures generalization on random npm packages sampled from the eval pool. It is a judge-only command: raw per-package results are written to `benchmarks-output/shadow-raw/`, and only a `RedactedShadowSummary` is builder-visible.
Shadow validation scores a random sample of npm packages and measures aggregate trust contract behavior:
| Metric | Description |
|---|---|
| `comparableRate` | Proportion of packages producing comparable (trusted or directional) results |
| `abstentionCorrectnessRate` | Proportion of packages where abstention (degraded/not-comparable) was appropriate |
| `falseAuthoritativeRate` | Proportion of packages that received trusted classification despite poor data |
| `installFailureRate` | Proportion of packages that failed to install |
| `fallbackGlobRate` | Proportion of packages that fell back to glob resolution |
| `domainOverreachRate` | Proportion of domain inferences on packages without clear domain signals |
| `scenarioOverreachRate` | Proportion of scenario scores on packages without matching domain |
| `scoreCompressionRate` | Proportion of scores clustering in a narrow band |
| `crossRunStability` | Stability coefficient across repeated runs (1.0 = perfectly stable) |
The `RedactedShadowSummary` also includes 99% confidence bounds for key metrics and stratification breakdowns by types source, size band, and module kind. No package names or per-package scores are included.
The `pnpm gate:shadow` command reads the saved `benchmarks-output/shadow-summary.json` and checks it against gate thresholds. It does not re-run the shadow benchmark; it validates the most recent summary. Gate results are emitted as pass/fail with aggregate metrics only.
Checks include:
- Summary freshness (must be within 24 hours).
- All shadow gates from the saved summary passed.
- Aggregate assertions: degraded rate, comparable rate, domain coverage, and scenario coverage.
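The 24-hour freshness rule above can be sketched as a simple timestamp check. The `generatedAt` field name is an assumption about the summary format, not the documented schema:

```typescript
// Illustrative freshness check for a saved shadow summary.
// The summary must have been generated within the last 24 hours.
const MAX_AGE_MS = 24 * 60 * 60 * 1000;

function summaryIsFresh(generatedAt: string, now: Date = new Date()): boolean {
  const age = now.getTime() - new Date(generatedAt).getTime();
  // Reject future timestamps as well as stale ones.
  return age >= 0 && age <= MAX_AGE_MS;
}
```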
Holdout and shadow tracks use aggregate assertions (`AggregateAssertion` in `benchmarks/types.ts`) instead of per-package pairwise assertions. These are quarantine-compliant: they check only aggregate rates (degraded rate, fallback rate, comparable rate, domain/scenario coverage) with no package-specific logic.
- Holdout assertions (`HOLDOUT_ASSERTIONS`): strict, zero-tolerance. All holdout packages must produce complete, non-degraded, non-fallback, fully-comparable results.
- Shadow assertions (`SHADOW_ASSERTIONS`): relaxed thresholds for random npm packages. Degraded rate below 10%, comparable rate above 40%, domain coverage above 70%, scenario coverage above 35%.
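An aggregate assertion in this style inspects only corpus-level rates, never individual packages. A sketch using the relaxed shadow thresholds quoted above (the interface and function names are illustrative, not the real `AggregateAssertion` type):

```typescript
// Quarantine-compliant check: operates on aggregate rates only.
interface AggregateRates {
  degradedRate: number;     // fraction of packages with degraded results
  comparableRate: number;   // fraction producing comparable results
  domainCoverage: number;   // fraction with a domain inference
  scenarioCoverage: number; // fraction with scenario scores
}

// Relaxed shadow-track thresholds from the text above.
function shadowAssertionsPass(r: AggregateRates): boolean {
  return (
    r.degradedRate < 0.10 &&
    r.comparableRate > 0.40 &&
    r.domainCoverage > 0.70 &&
    r.scenarioCoverage > 0.35
  );
}
```

The holdout variant would use the same shape with zero-tolerance thresholds (no degraded, fallback, or non-comparable results at all).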
Both `RedactedEvalSummary` and `RedactedShadowSummary` include an optional `analysisSchemaVersion` field that records which version of the analysis schema was used for the benchmark run. This enables tracking score changes across schema versions and detecting when benchmark results were produced with an outdated schema.
Each benchmark run saves per-package results to split-specific subdirectories under `benchmarks/results/`. Each result includes:
- Dimension scores, confidence, and metrics.
- Graph stats and dedup stats.
- Domain inference with signals and matched rules.
- Top issues.
- Coverage diagnostics.
| Command | What it does |
|---|---|
| `pnpm benchmark:train` | Score train corpus, run assertions |
| `pnpm benchmark:holdout` | Score holdout corpus |
| `pnpm benchmark:eval` | Score eval corpus (CI/judge only) |
| `pnpm benchmark:pool` | Score eval pool with stratified sampling (CI/judge only) |
| `pnpm benchmark:calibrate` | Detailed calibration diagnostics |
| `pnpm benchmark:optimize` | Weight search on train data |
| `pnpm benchmark:judge` | Evaluate eval results, emit redacted summary (CI/judge only) |
| `pnpm benchmark:shadow` | Shadow validation on random npm packages (CI/judge only) |
| `pnpm gate:train` | Train gate check |
| `pnpm gate:holdout` | Holdout gate check |
| `pnpm gate:eval` | Eval gate check (CI/judge only) |
| `pnpm gate:shadow` | Shadow gate check (reads saved summary, builder-accessible) |
| `pnpm perf` | Run all performance benchmarks |
| `pnpm perf:cli` | Measure CLI cold-start (`--version`, `--help`) |
| `pnpm perf:score` | Measure analyze, boundaries, fix-plan latency |
| `pnpm perf:benchmark` | Measure full train benchmark throughput |
| `pnpm perf:ci` | Run all perf benchmarks with regression checks |
The performance harness (`perf/run.ts`) measures built artifacts directly by spawning `node dist/bin.js` subprocesses.
| Command | Target (p50) |
|---|---|
| `typegrade --version` | ≤ 40ms |
| `typegrade --help` | ≤ 80ms |
| `typegrade analyze <fixture> --json` | ≤ 450ms |
| `typegrade boundaries <fixture> --json` | ≤ 250ms |
| `typegrade fix-plan <fixture> --json` | ≤ 450ms |
`pnpm perf:ci` compares current measurements against the most recent baseline in `benchmarks-output/perf/` and fails if:
- Any measurement regresses by more than 20% from the previous baseline.
- Any measurement exceeds its target threshold.
Results are written to `benchmarks-output/perf/perf-<timestamp>.json`.
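The two failure conditions above can be sketched as a single predicate. The interface and function names are illustrative, not the real `perf/run.ts` code:

```typescript
// Sketch of the perf:ci failure rule: a measurement fails if it is
// more than 20% slower than the previous baseline, or over its target.
interface Measurement {
  p50Ms: number;      // current p50 latency
  baselineMs: number; // most recent baseline p50
  targetMs: number;   // absolute target threshold
}

function perfCheckFails(m: Measurement): boolean {
  const regressed = m.p50Ms > m.baselineMs * 1.2; // >20% slower than baseline
  const overTarget = m.p50Ms > m.targetMs;        // exceeds absolute target
  return regressed || overTarget;
}
```

Note the conditions are independent: a measurement can be within target but still fail on regression, which keeps gradual slowdowns from accumulating under a generous absolute threshold.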