feat(ci): gate release workflow on resolution precision/recall thresholds#886

Merged
carlos-alm merged 6 commits into main from feat/resolution-ci-gate
Apr 6, 2026

Conversation

@carlos-alm
Contributor

Summary

Closes #875.

  • benchmark.yml: Added Gate on resolution thresholds step that runs the vitest resolution test after the benchmark script — fails the workflow if any language drops below its per-language precision/recall threshold
  • update-benchmark-report.ts: Added resolution-specific regression warnings (precision >5pp drop, recall >10pp drop per language) emitted as ::warning annotations in CI
  • regression-guard.test.ts: Added resolution benchmarks describe block that compares precision/recall across releases and hard-fails CI on regressions, with KNOWN_REGRESSIONS exemption support (key format: version:resolution <lang> precision/recall)
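
The release-to-release check in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in regression-guard.test.ts: the `ResolutionMetrics` shape, the `findResolutionRegressions` helper, and the sample exemption entry are assumptions. Only the 5pp/10pp limits and the exemption key format come from the description above.

```typescript
// Hypothetical shapes: the real benchmark entries may be structured differently.
type Metric = 'precision' | 'recall';
type ResolutionMetrics = Record<string, { precision: number; recall: number }>;

// Drop limits from the PR description, expressed as fractions of a [0, 1] metric.
const PRECISION_DROP_PP = 0.05;
const RECALL_DROP_PP = 0.1;

// Exemption keys follow "version:resolution <lang> <metric>" (sample entry is made up).
const KNOWN_REGRESSIONS = new Set<string>(['9.9.9:resolution python recall']);

// Returns the keys of all non-exempt regressions between two releases.
function findResolutionRegressions(
  version: string,
  previous: ResolutionMetrics,
  current: ResolutionMetrics,
): string[] {
  const limits: Record<Metric, number> = {
    precision: PRECISION_DROP_PP,
    recall: RECALL_DROP_PP,
  };
  const failures: string[] = [];
  for (const [lang, now] of Object.entries(current)) {
    const before = previous[lang];
    if (!before) continue; // New language: nothing to compare against.
    for (const metric of ['precision', 'recall'] as const) {
      const drop = before[metric] - now[metric];
      const key = `${version}:resolution ${lang} ${metric}`;
      if (drop > limits[metric] && !KNOWN_REGRESSIONS.has(key)) {
        failures.push(key);
      }
    }
  }
  return failures;
}
```

A test would then assert that the returned list is empty, so any unexempted drop fails CI while the key names the exact language and metric to investigate.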

Test plan

  • regression-guard.test.ts passes locally (17/17 tests including 2 new resolution tests)
  • CI passes on this PR
  • Next release benchmark run exercises the new threshold gate step

…olds (#875)

Add resolution quality gates to the benchmark pipeline so regressions
are caught before publishing:

- benchmark.yml: run vitest resolution test after the benchmark script,
  failing the workflow if any language drops below its threshold
- update-benchmark-report.ts: warn on precision >5pp or recall >10pp
  drop per language between releases
- regression-guard.test.ts: hard-fail CI on precision/recall regressions
  across releases, with KNOWN_REGRESSIONS exemption support
@claude

claude bot commented Apr 6, 2026

Claude finished @carlos-alm's task


Root Cause Analysis

Problem Being Solved: This PR addresses issue #875 by adding CI gates to prevent resolution precision/recall regressions. The codebase runs resolution benchmarks but had no CI enforcement to block PRs when resolution quality degrades below thresholds.

Why This Approach: The implementation follows a three-pronged strategy:

  1. CI gate: New workflow step runs the resolution test suite after benchmarks, failing if any language drops below thresholds
  2. Warning annotations: Update script emits GitHub Actions warnings when precision/recall drops significantly
  3. Regression guard: Test suite with hard-fail logic and exemption support for known issues
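
The first prong, a per-language threshold gate, amounts to something like the sketch below. The threshold values, the `THRESHOLDS` table, and the `belowThreshold` helper are all hypothetical; the repository's actual per-language floors are not shown in this thread.

```typescript
// Hypothetical per-language floors; the real thresholds live in the resolution test suite.
interface Scores {
  precision: number;
  recall: number;
}

const THRESHOLDS: Record<string, Scores> = {
  javascript: { precision: 0.9, recall: 0.8 },
  python: { precision: 0.85, recall: 0.7 },
};

// Lists every language whose measured scores fall below its floor.
function belowThreshold(results: Record<string, Scores>): string[] {
  const violations: string[] = [];
  for (const [lang, floor] of Object.entries(THRESHOLDS)) {
    const got = results[lang];
    if (!got) {
      // A missing result is treated as a failure rather than silently skipped.
      violations.push(`${lang}: no resolution result`);
      continue;
    }
    if (got.precision < floor.precision) {
      violations.push(`${lang}: precision ${got.precision} < ${floor.precision}`);
    }
    if (got.recall < floor.recall) {
      violations.push(`${lang}: recall ${got.recall} < ${floor.recall}`);
    }
  }
  return violations;
}
```

Unlike the release-to-release comparison, this check is absolute: it fails whenever a score is under its floor, regardless of what the previous release measured.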

Risk Assessment: LOW - This is defensive infrastructure that prevents quality regressions rather than introducing new functionality.

Backlog Compliance

  • Zero-dep: ✓ — No new runtime dependencies added
  • Foundation-aligned: ✓ — Improves detection accuracy (P1: always-current) by preventing resolution regressions
  • Problem-fit: 4 — Directly addresses AI accuracy by preventing hallucinated dependencies from resolution degradation
  • Breaking: No — Purely additive CI infrastructure
  • Tier: 1

Technical Review

Workflow Changes (.github/workflows/benchmark.yml:109-111)

  • Clean integration - New gate step runs between resolution benchmark and merge step
  • Proper conditional logic - Only runs when benchmarks aren't skipped
  • Simple implementation - Uses existing vitest test runner

Update Script Changes (scripts/update-benchmark-report.ts:323-354)

  • Consistent thresholds - 5pp precision drop, 10pp recall drop match test suite values
  • Proper CI formatting - Uses ::warning annotations for GitHub Actions visibility
  • Backwards compatible - Only adds new warning logic, doesn't change existing behavior

Minor issue: Lines 336, 347 - Consider adding a title property to the warnings for better grouping:

console.error(`::warning title=Regression (${lang})::${msg}`);
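
For reference, GitHub Actions warning annotations take the form `::warning title={title}::{message}`, and special characters in the title and message are percent-encoded. A small sketch of a formatter that follows those escaping rules is shown here; the `formatWarning` helper name is hypothetical and is not part of update-benchmark-report.ts.

```typescript
// Message data: percent-encode '%', carriage return, and line feed.
function escapeData(value: string): string {
  return value.replace(/%/g, '%25').replace(/\r/g, '%0D').replace(/\n/g, '%0A');
}

// Property values (such as title) additionally encode ':' and ','.
function escapeProperty(value: string): string {
  return escapeData(value).replace(/:/g, '%3A').replace(/,/g, '%2C');
}

// Hypothetical helper: builds a titled warning annotation line.
function formatWarning(title: string, message: string): string {
  return `::warning title=${escapeProperty(title)}::${escapeData(message)}`;
}
```

Escaping matters mainly when the message contains a newline or the title contains a colon, since either would otherwise be misread as part of the command syntax.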

Regression Guard Test (tests/benchmarks/regression-guard.test.ts:515-595)

  • Well-structured - Follows existing patterns from build/query/incremental regression guards
  • Proper thresholds - 5pp/10pp drops are reasonable for bounded [0,1] metrics
  • Exemption system - KNOWN_REGRESSIONS format matches existing pattern: version:resolution <lang> <metric>
  • Clear test organization - Dedicated describe('resolution benchmarks') block

Potential improvements:

  1. Line 564-565, 574-575: The exemption key format (resolution ${lang} precision/recall) differs slightly from comment examples. Consider documenting the exact format in comments.
  2. Line 547: Could add version gap checking like other benchmark types to avoid spurious failures across major feature changes.
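
One possible shape for the version gap check suggested in item 2, assuming releases use `major.minor.patch` version strings. The `MAX_MINOR_GAP` value and both helper names are hypothetical, and the existing guards for other benchmark types may measure the gap differently.

```typescript
// Hypothetical limit: skip comparisons that span more than this many minor releases.
const MAX_MINOR_GAP = 1;

// Parses "1.2.3" or "v1.2.3" into numeric parts; returns null for anything else.
function parseVersion(version: string): [number, number, number] | null {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Returns true when two releases are close enough for a metric drop to be meaningful.
function isComparable(previous: string, current: string): boolean {
  const a = parseVersion(previous);
  const b = parseVersion(current);
  if (!a || !b) return false; // Unparseable label (e.g. "dev"): do not compare.
  if (a[0] !== b[0]) return false; // Different major version.
  return Math.abs(b[1] - a[1]) <= MAX_MINOR_GAP;
}
```

A resolution regression check would call `isComparable` first and skip the precision/recall comparison when it returns false.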

Code Quality

  • No configuration relaxation - No linting rules disabled or test timeouts increased
  • Test integrity maintained - New tests follow existing patterns without weakening assertions
  • Error handling - Proper null checks and graceful degradation when resolution data unavailable

Critical Concerns

NONE - This is a well-implemented defensive feature that strengthens CI quality gates.

Minor Recommendations

  1. Consistency: Update warning titles in update-benchmark-report.ts for better CI log grouping
  2. Documentation: Clarify exemption key format in regression-guard.test.ts comments
  3. Gap checking: Consider adding version gap logic to resolution regression checks

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE WITH CONDITIONS
  • Reasoning: Solid implementation of important quality infrastructure. The minor formatting and documentation issues don't block the core functionality. This directly prevents the kind of silent resolution regressions that would hurt AI agent accuracy.

The PR correctly implements CI gating for resolution quality without introducing configuration relaxation, test weakening, or breaking changes. It follows established patterns and provides proper visibility through both hard-fail tests and warning annotations.

Todo Checklist

  • Read and analyze all changed files
  • Understand the problem being solved and approach
  • Review configuration changes and test integrity
  • Assess backlog compliance (zero-dep, foundation-aligned, problem-fit, breaking)
  • Provide comprehensive review feedback

@greptile-apps
Contributor

greptile-apps bot commented Apr 6, 2026

Greptile Summary

This PR adds resolution precision/recall gating to the benchmark workflow: a new CI step runs resolution-benchmark.test.ts against fixture graphs after each benchmark run, and update-benchmark-report.ts now emits ::warning annotations when precision drops >5pp or recall >10pp vs the previous release. The regression-guard.test.ts gains a complementary resolution benchmarks describe block that performs the same comparison against the historical data embedded in BUILD-BENCHMARKS.md, running on every PR. All three previous review concerns (constant duplication, redundant file read, missing step timeout) have been addressed.

Confidence Score: 5/5

Safe to merge; all prior feedback addressed and no blocking issues remain.

The three previous P1/P2 concerns (constant duplication, redundant file read, missing step timeout) are all resolved. The only remaining observation is a P2 design note about the gate testing local source rather than the npm-benchmarked version, which is acceptable on release runs where they are identical.

No files require special attention for merge.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/benchmark.yml | Adds Gate on resolution thresholds step (timeout-minutes: 30) running resolution-benchmark.test.ts after the resolution benchmark script; gate always tests local source code regardless of which npm version is being benchmarked. |
| scripts/update-benchmark-report.ts | Adds resolution regression warning logic emitting ::warning annotations when precision drops >5pp or recall drops >10pp between releases; SYNC comments link to matching constants in regression-guard.test.ts. |
| tests/benchmarks/regression-guard.test.ts | Adds resolution benchmarks describe block that checks historical precision/recall drops from BUILD-BENCHMARKS.md; reuses buildHistory cast as BuildEntryWithResolution to avoid redundant file read. |

Sequence Diagram

sequenceDiagram
    participant CI as GitHub Actions
    participant BM as benchmark.ts (npm/local)
    participant RM as resolution-benchmark.ts (npm/local)
    participant GT as resolution-benchmark.test.ts (local src)
    participant MG as merge step
    participant RPT as update-benchmark-report.ts
    participant RG as regression-guard.test.ts (every PR)

    CI->>BM: Run build benchmark → benchmark-result.json
    CI->>RM: Run resolution benchmark → resolution-result.json
    CI->>GT: Gate: vitest resolution thresholds (local source only)
    GT-->>CI: pass / fail on per-language threshold
    CI->>MG: Merge resolution-result.json into benchmark-result.json
    CI->>RPT: Update BUILD-BENCHMARKS.md + emit ::warning on drops >5pp/10pp
    note over RG: Runs on every PR (not in benchmark workflow)
    RG->>RPT: Read BUILD-BENCHMARKS.md history
    RG-->>CI: fail if precision/recall dropped vs previous release

Reviews (3): Last reviewed commit: "fix: document resolution key format in K..."

Comment on lines +525 to +526
const PRECISION_DROP_PP = 0.05;
const RECALL_DROP_PP = 0.1;
Contributor

P2 Threshold constants duplicated across files

The 5 pp / 10 pp drop limits are defined independently here and again as PRECISION_DROP_THRESHOLD / RECALL_DROP_THRESHOLD in scripts/update-benchmark-report.ts (lines 325–326). If one side is tightened (e.g. recall narrowed from 10 pp to 5 pp), the ::warning signal and the hard-fail gate silently diverge — engineers would see no warning but a failing test, with no obvious reason for the mismatch. Consider extracting a shared constants file, or at minimum adding a cross-reference comment pointing to the mirror definition.

Contributor Author

Fixed — added bidirectional SYNC comments to both regression-guard.test.ts (pointing to the script) and update-benchmark-report.ts (pointing to the test). Extracting a shared constants file was considered but the script runs as a standalone node invocation in CI, so cross-reference comments are the pragmatic choice here.

Comment on lines +542 to +545
const fullHistory = extractJsonData<BuildEntryWithResolution>(
path.join(BENCHMARKS_DIR, 'BUILD-BENCHMARKS.md'),
'BENCHMARK_DATA',
);
Contributor

P2 Redundant file read

fullHistory reads BUILD-BENCHMARKS.md with the same 'BENCHMARK_DATA' marker as buildHistory (lines 338–342). Since BuildEntryWithResolution extends BuildEntry, the earlier array can be widened with a cast instead of re-parsing the file:

Suggested change
- const fullHistory = extractJsonData<BuildEntryWithResolution>(
-   path.join(BENCHMARKS_DIR, 'BUILD-BENCHMARKS.md'),
-   'BENCHMARK_DATA',
- );
+ const fullHistory = buildHistory as BuildEntryWithResolution[];

Contributor Author

Fixed — replaced the second extractJsonData call with buildHistory as BuildEntryWithResolution[] since both read the same file (BUILD-BENCHMARKS.md) with the same marker (BENCHMARK_DATA). BuildEntryWithResolution extends BuildEntry so the widening cast is safe.
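
A self-contained sketch of the cast described above. The fields on `BuildEntry`, the shape of `resolution`, and the sample data are assumptions made for illustration; only the type names and the cast itself come from this thread. A TypeScript `as` assertion is not checked at runtime, so entries without resolution data still need a guard.

```typescript
// Hypothetical field layout; the real interfaces live in regression-guard.test.ts.
interface BuildEntry {
  version: string;
}

interface BuildEntryWithResolution extends BuildEntry {
  // Optional: older history entries may predate resolution benchmarks.
  resolution?: Record<string, { precision: number; recall: number }>;
}

// Stand-in for the array parsed once from BUILD-BENCHMARKS.md (sample data is made up).
const buildHistory: BuildEntry[] = JSON.parse(
  '[{"version":"2.0.0","resolution":{"python":{"precision":0.9,"recall":0.8}}},' +
    '{"version":"1.0.0"}]',
);

// Reuse the already-parsed array instead of reading the file a second time.
const fullHistory = buildHistory as BuildEntryWithResolution[];

// The assertion adds no runtime check, so keep only entries that carry resolution data.
const withResolution = fullHistory.filter((entry) => entry.resolution !== undefined);
```

Because the extra field is optional in this sketch, every `BuildEntry` is also a valid `BuildEntryWithResolution`, which is what makes the assertion harmless here.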

Comment on lines +109 to +111
- name: Gate on resolution thresholds
if: steps.existing.outputs.skip != 'true'
run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose
Contributor

P2 No timeout on the gate step

The resolution test suite builds graphs for every language fixture (~30+ languages, 60 s beforeAll budget each). Neither this step nor the build-benchmark job has a timeout-minutes cap, so a hanging WASM build can stall the entire job indefinitely. Consider adding a step-level timeout-minutes: 30 to bound the gate:

Suggested change
  - name: Gate on resolution thresholds
    if: steps.existing.outputs.skip != 'true'
+   timeout-minutes: 30
    run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose

Contributor Author

Fixed — added timeout-minutes: 30 to the gate step. 30 minutes is generous enough for the full language fixture suite while bounding worst-case CI stalls.

…ant file read (#886)

Add cross-reference SYNC comments between regression-guard.test.ts and
update-benchmark-report.ts so the duplicated precision/recall thresholds
stay in lockstep. Replace the second extractJsonData call with a type
cast of buildHistory since both read the same file and marker.

Prevents a hanging WASM build from stalling the entire benchmark job
indefinitely. 30-minute cap is generous enough for the full language
fixture suite while still bounding worst-case CI time.
@carlos-alm
Contributor Author

@greptileai

@carlos-alm
Contributor Author

Addressed the P2 nit from the second Greptile review: added resolution-specific key format documentation to the KNOWN_REGRESSIONS comment block ("version:resolution <lang> precision" / "version:resolution <lang> recall"). Also resolved merge conflicts with main (PR #885 landed on main with type map confidence-aware dedup changes).

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 5ee0070 into main Apr 6, 2026
13 checks passed
@carlos-alm carlos-alm deleted the feat/resolution-ci-gate branch April 6, 2026 23:42
@github-actions github-actions bot locked and limited conversation to collaborators Apr 6, 2026
Development

Successfully merging this pull request may close these issues.

Run resolution precision/recall benchmark in CI release workflow

1 participant