feat(ci): gate release workflow on resolution precision/recall thresholds#886
feat(ci): gate release workflow on resolution precision/recall thresholds#886carlos-alm merged 6 commits intomainfrom
Conversation
…olds (#875) Add resolution quality gates to the benchmark pipeline so regressions are caught before publishing: - benchmark.yml: run vitest resolution test after the benchmark script, failing the workflow if any language drops below its threshold - update-benchmark-report.ts: warn on precision >5pp or recall >10pp drop per language between releases - regression-guard.test.ts: hard-fail CI on precision/recall regressions across releases, with KNOWN_REGRESSIONS exemption support
|
Claude finished @carlos-alm's task —— View job Root Cause AnalysisProblem Being Solved: This PR addresses issue #875 by adding CI gates to prevent resolution precision/recall regressions. The codebase runs resolution benchmarks but had no CI enforcement to block PRs when resolution quality degrades below thresholds. Why This Approach: The implementation follows a three-pronged strategy:
Risk Assessment: LOW - This is defensive infrastructure that prevents quality regressions rather than introducing new functionality. Backlog Compliance
Technical ReviewWorkflow Changes (.github/workflows/benchmark.yml:109-111)✅ Clean integration - New gate step runs between resolution benchmark and merge step Update Script Changes (scripts/update-benchmark-report.ts:323-354)✅ Consistent thresholds - 5pp precision drop, 10pp recall drop match test suite values Minor issue: Lines 336, 347 - Consider adding console.error(`::warning Regression (${lang})::${msg}`);Regression Guard Test (tests/benchmarks/regression-guard.test.ts:515-595)✅ Well-structured - Follows existing patterns from build/query/incremental regression guards Potential improvements:
Code Quality✅ No configuration relaxation - No linting rules disabled or test timeouts increased Critical ConcernsNONE - This is a well-implemented defensive feature that strengthens CI quality gates. Minor Recommendations
Final Recommendation
The PR correctly implements CI gating for resolution quality without introducing configuration relaxation, test weakening, or breaking changes. It follows established patterns and provides proper visibility through both hard-fail tests and warning annotations. Todo Checklist
|
Greptile SummaryThis PR adds resolution precision/recall gating to the benchmark workflow: a new CI step runs Confidence Score: 5/5Safe to merge; all prior feedback addressed and no blocking issues remain. The three previous P1/P2 concerns (constant duplication, redundant file read, missing step timeout) are all resolved. The only remaining observation is a P2 design note about the gate testing local source rather than the npm-benchmarked version, which is acceptable on release runs where they are identical. No files require special attention for merge. Important Files Changed
Sequence DiagramsequenceDiagram
participant CI as GitHub Actions
participant BM as benchmark.ts (npm/local)
participant RM as resolution-benchmark.ts (npm/local)
participant GT as resolution-benchmark.test.ts (local src)
participant MG as merge step
participant RPT as update-benchmark-report.ts
participant RG as regression-guard.test.ts (every PR)
CI->>BM: Run build benchmark → benchmark-result.json
CI->>RM: Run resolution benchmark → resolution-result.json
CI->>GT: Gate: vitest resolution thresholds (local source only)
GT-->>CI: pass / fail on per-language threshold
CI->>MG: Merge resolution-result.json into benchmark-result.json
CI->>RPT: Update BUILD-BENCHMARKS.md + emit ::warning on drops >5pp/10pp
note over RG: Runs on every PR (not in benchmark workflow)
RG->>RPT: Read BUILD-BENCHMARKS.md history
RG-->>CI: fail if precision/recall dropped vs previous release
Reviews (3): Last reviewed commit: "fix: document resolution key format in K..." | Re-trigger Greptile |
| const PRECISION_DROP_PP = 0.05; | ||
| const RECALL_DROP_PP = 0.1; |
There was a problem hiding this comment.
Threshold constants duplicated across files
The 5 pp / 10 pp drop limits are defined independently here and again as PRECISION_DROP_THRESHOLD / RECALL_DROP_THRESHOLD in scripts/update-benchmark-report.ts (lines 325–326). If one side is tightened (e.g. recall narrowed from 10 pp to 5 pp), the ::warning signal and the hard-fail gate silently diverge — engineers would see no warning but a failing test, with no obvious reason for the mismatch. Consider extracting a shared constants file, or at minimum adding a cross-reference comment pointing to the mirror definition.
There was a problem hiding this comment.
Fixed — added bidirectional SYNC comments to both regression-guard.test.ts (pointing to the script) and update-benchmark-report.ts (pointing to the test). Extracting a shared constants file was considered but the script runs as a standalone node invocation in CI, so cross-reference comments are the pragmatic choice here.
| const fullHistory = extractJsonData<BuildEntryWithResolution>( | ||
| path.join(BENCHMARKS_DIR, 'BUILD-BENCHMARKS.md'), | ||
| 'BENCHMARK_DATA', | ||
| ); |
There was a problem hiding this comment.
fullHistory reads BUILD-BENCHMARKS.md with the same 'BENCHMARK_DATA' marker as buildHistory (line 338–342). Since BuildEntryWithResolution extends BuildEntry, the earlier array can be widened with a cast instead of re-parsing the file:
| const fullHistory = extractJsonData<BuildEntryWithResolution>( | |
| path.join(BENCHMARKS_DIR, 'BUILD-BENCHMARKS.md'), | |
| 'BENCHMARK_DATA', | |
| ); | |
| const fullHistory = buildHistory as BuildEntryWithResolution[]; |
There was a problem hiding this comment.
Fixed — replaced the second extractJsonData call with buildHistory as BuildEntryWithResolution[] since both read the same file (BUILD-BENCHMARKS.md) with the same marker (BENCHMARK_DATA). BuildEntryWithResolution extends BuildEntry so the widening cast is safe.
| - name: Gate on resolution thresholds | ||
| if: steps.existing.outputs.skip != 'true' | ||
| run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose |
There was a problem hiding this comment.
The resolution test suite builds graphs for every language fixture (~30 + languages, 60 s beforeAll budget each). Neither this step nor the build-benchmark job has a timeout-minutes cap, so a hanging WASM build can stall the entire job indefinitely. Consider adding a step-level timeout-minutes: 30 to bound the gate:
| - name: Gate on resolution thresholds | |
| if: steps.existing.outputs.skip != 'true' | |
| run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose | |
| - name: Gate on resolution thresholds | |
| if: steps.existing.outputs.skip != 'true' | |
| timeout-minutes: 30 | |
| run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose |
There was a problem hiding this comment.
Fixed — added timeout-minutes: 30 to the gate step. 30 minutes is generous enough for the full language fixture suite while bounding worst-case CI stalls.
…ant file read (#886) Add cross-reference SYNC comments between regression-guard.test.ts and update-benchmark-report.ts so the duplicated precision/recall thresholds stay in lockstep. Replace the second extractJsonData call with a type cast of buildHistory since both read the same file and marker.
Prevents a hanging WASM build from stalling the entire benchmark job indefinitely. 30-minute cap is generous enough for the full language fixture suite while still bounding worst-case CI time.
|
Addressed the P2 nit from the second Greptile review: added resolution-specific key format documentation to the KNOWN_REGRESSIONS comment block ("version:resolution precision" / "version:resolution recall"). Also resolved merge conflicts with main (PR #885 landed on main with type map confidence-aware dedup changes). |
Summary
Closes #875.
Gate on resolution thresholdsstep that runs the vitest resolution test after the benchmark script — fails the workflow if any language drops below its per-language precision/recall threshold::warningannotations in CIresolution benchmarksdescribe block that compares precision/recall across releases and hard-fails CI on regressions, withKNOWN_REGRESSIONSexemption support (key format:version:resolution <lang> precision/recall)Test plan
regression-guard.test.tspasses locally (17/17 tests including 2 new resolution tests)