feat(ci): gate release workflow on resolution precision/recall thresholds by carlos-alm · Pull Request #886 · optave/ops-codegraph-tool

carlos-alm · 2026-04-06T22:28:49Z

Summary

Closes #875.

benchmark.yml: Added Gate on resolution thresholds step that runs the vitest resolution test after the benchmark script — fails the workflow if any language drops below its per-language precision/recall threshold
update-benchmark-report.ts: Added resolution-specific regression warnings (precision >5pp drop, recall >10pp drop per language) emitted as ::warning annotations in CI
regression-guard.test.ts: Added resolution benchmarks describe block that compares precision/recall across releases and hard-fails CI on regressions, with KNOWN_REGRESSIONS exemption support (key format: version:resolution <lang> precision/recall)

Test plan

regression-guard.test.ts passes locally (17/17 tests including 2 new resolution tests)
CI passes on this PR
Next release benchmark run exercises the new threshold gate step

…olds (#875)

Add resolution quality gates to the benchmark pipeline so regressions are caught before publishing:

- benchmark.yml: run vitest resolution test after the benchmark script, failing the workflow if any language drops below its threshold
- update-benchmark-report.ts: warn on precision >5pp or recall >10pp drop per language between releases
- regression-guard.test.ts: hard-fail CI on precision/recall regressions across releases, with KNOWN_REGRESSIONS exemption support

claude · 2026-04-06T22:29:07Z

Claude finished @carlos-alm's task

Root Cause Analysis

Problem Being Solved: This PR addresses issue #875 by adding CI gates to prevent resolution precision/recall regressions. The codebase runs resolution benchmarks but had no CI enforcement to block PRs when resolution quality degrades below thresholds.

Why This Approach: The implementation follows a three-pronged strategy:

CI gate: New workflow step runs the resolution test suite after benchmarks, failing if any language drops below thresholds
Warning annotations: Update script emits GitHub Actions warnings when precision/recall drops significantly
Regression guard: Test suite with hard-fail logic and exemption support for known issues
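The first prong, a per-language floor on precision and recall, can be sketched roughly as below. This is a minimal illustration, not the contents of resolution-benchmark.test.ts: the `ResolutionMetrics` shape, the `findThresholdViolations` helper, and the choice to treat a missing language as a violation are all assumptions made for the example.

```typescript
// Hypothetical shapes: the real fixture format and threshold values are not shown in this PR.
interface ResolutionMetrics {
  precision: number; // fraction in [0, 1]
  recall: number; // fraction in [0, 1]
}

// Returns one message per language/metric that falls below its configured floor.
function findThresholdViolations(
  results: Record<string, ResolutionMetrics>,
  thresholds: Record<string, ResolutionMetrics>,
): string[] {
  const violations: string[] = [];
  for (const [lang, floor] of Object.entries(thresholds)) {
    const actual = results[lang];
    if (!actual) {
      // Assumption: a language with a threshold but no result counts as a failure.
      violations.push(`${lang}: no resolution result`);
      continue;
    }
    if (actual.precision < floor.precision) {
      violations.push(`${lang}: precision ${actual.precision} < ${floor.precision}`);
    }
    if (actual.recall < floor.recall) {
      violations.push(`${lang}: recall ${actual.recall} < ${floor.recall}`);
    }
  }
  return violations;
}
```

A test runner would then assert that the returned list is empty, so that a single low metric in any language fails the workflow step.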

Risk Assessment: LOW - This is defensive infrastructure that prevents quality regressions rather than introducing new functionality.

Backlog Compliance

Zero-dep: ✓ — No new runtime dependencies added
Foundation-aligned: ✓ — Improves detection accuracy (P1: always-current) by preventing resolution regressions
Problem-fit: 4 — Directly addresses AI accuracy by preventing hallucinated dependencies from resolution degradation
Breaking: No — Purely additive CI infrastructure
Tier: 1

Technical Review

Workflow Changes (.github/workflows/benchmark.yml:109-111)

✅ Clean integration - New gate step runs between resolution benchmark and merge step
✅ Proper conditional logic - Only runs when benchmarks aren't skipped
✅ Simple implementation - Uses existing vitest test runner

Update Script Changes (scripts/update-benchmark-report.ts:323-354)

✅ Consistent thresholds - 5pp precision drop, 10pp recall drop match test suite values
✅ Proper CI formatting - Uses ::warning annotations for GitHub Actions visibility
✅ Backwards compatible - Only adds new warning logic, doesn't change existing behavior

Minor issue: Lines 336, 347 - Consider adding title attribute to warnings for better grouping:

console.error(`::warning title=Regression (${lang})::${msg}`);
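For context, GitHub Actions workflow commands take the title as a `title=` parameter before the `::` that introduces the message. A minimal sketch of how the warning logic might look with a title is shown here. The constant names come from the review comments on this PR; `LangMetrics`, `formatWarning`, and `resolutionWarnings` are hypothetical helpers, and the sketch assumes language names contain no `,` or `:` characters that would need escaping in a parameter value.

```typescript
// Drop limits quoted in this PR: precision > 5pp, recall > 10pp (metrics are fractions in [0, 1]).
const PRECISION_DROP_THRESHOLD = 0.05;
const RECALL_DROP_THRESHOLD = 0.1;

interface LangMetrics {
  precision: number;
  recall: number;
}

// Builds a GitHub Actions warning annotation with a title for grouping in the CI log.
function formatWarning(lang: string, msg: string): string {
  return `::warning title=Resolution regression (${lang})::${msg}`;
}

// Compares one language's metrics between two releases and returns annotation lines.
function resolutionWarnings(lang: string, prev: LangMetrics, curr: LangMetrics): string[] {
  const warnings: string[] = [];
  const precisionDrop = prev.precision - curr.precision;
  const recallDrop = prev.recall - curr.recall;
  if (precisionDrop > PRECISION_DROP_THRESHOLD) {
    warnings.push(formatWarning(lang, `precision dropped ${(precisionDrop * 100).toFixed(1)}pp`));
  }
  if (recallDrop > RECALL_DROP_THRESHOLD) {
    warnings.push(formatWarning(lang, `recall dropped ${(recallDrop * 100).toFixed(1)}pp`));
  }
  return warnings;
}

// Usage: print each annotation so the runner can pick it up from the step output.
for (const line of resolutionWarnings(
  'python',
  { precision: 0.9, recall: 0.8 },
  { precision: 0.8, recall: 0.75 },
)) {
  console.error(line);
}
```

Keeping the formatting in one helper means the title text only has to change in one place if the grouping convention changes.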

Regression Guard Test (tests/benchmarks/regression-guard.test.ts:515-595)

✅ Well-structured - Follows existing patterns from build/query/incremental regression guards
✅ Proper thresholds - 5pp/10pp drops are reasonable for bounded [0,1] metrics
✅ Exemption system - KNOWN_REGRESSIONS format matches existing pattern: version:resolution <lang> <metric>
✅ Clear test organization - Dedicated describe('resolution benchmarks') block
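The release-over-release comparison with exemptions might look roughly like the sketch below. The drop limits and the key format (`version:resolution <lang> <metric>`) are taken from this PR. Everything else is an assumption for illustration: the `ReleaseEntry` shape, the `findRegressions` helper, the use of a `Set` for `KNOWN_REGRESSIONS`, and the example exemption entry.

```typescript
const PRECISION_DROP_PP = 0.05;
const RECALL_DROP_PP = 0.1;

type Metric = 'precision' | 'recall';

interface ReleaseEntry {
  version: string;
  // Assumption: older entries may lack resolution data entirely.
  resolution?: Record<string, Record<Metric, number>>;
}

// Hypothetical exemption entry, using the key format described in the PR.
const KNOWN_REGRESSIONS = new Set<string>(['9.9.9:resolution python recall']);

// Returns the exemption-style key of every non-exempt regression between two releases.
function findRegressions(prev: ReleaseEntry, curr: ReleaseEntry): string[] {
  const prevRes = prev.resolution;
  const currRes = curr.resolution;
  if (!prevRes || !currRes) return []; // degrade gracefully when data is unavailable

  const limits: Array<[Metric, number]> = [
    ['precision', PRECISION_DROP_PP],
    ['recall', RECALL_DROP_PP],
  ];
  const failures: string[] = [];
  for (const [lang, now] of Object.entries(currRes)) {
    const before = prevRes[lang];
    if (!before) continue; // language added in this release: nothing to compare
    for (const [metric, limit] of limits) {
      const key = `${curr.version}:resolution ${lang} ${metric}`;
      if (before[metric] - now[metric] > limit && !KNOWN_REGRESSIONS.has(key)) {
        failures.push(key);
      }
    }
  }
  return failures;
}
```

Returning the keys themselves makes a failure message directly copyable into the exemption list when a regression is accepted on purpose.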

Potential improvements:

Line 564-565, 574-575: The exemption key format (resolution ${lang} precision/recall) differs slightly from comment examples. Consider documenting the exact format in comments.
Line 547: Could add version gap checking like other benchmark types to avoid spurious failures across major feature changes.

Code Quality

✅ No configuration relaxation - No linting rules disabled or test timeouts increased
✅ Test integrity maintained - New tests follow existing patterns without weakening assertions
✅ Error handling - Proper null checks and graceful degradation when resolution data unavailable

Critical Concerns

NONE - This is a well-implemented defensive feature that strengthens CI quality gates.

Minor Recommendations

Consistency: Update warning titles in update-benchmark-report.ts for better CI log grouping
Documentation: Clarify exemption key format in regression-guard.test.ts comments
Gap checking: Consider adding version gap logic to resolution regression checks

Final Recommendation

Rating: ⭐⭐⭐⭐☆ (4/5)
Action: APPROVE WITH CONDITIONS
Reasoning: Solid implementation of important quality infrastructure. The minor formatting and documentation issues don't block the core functionality. This directly prevents the kind of silent resolution regressions that would hurt AI agent accuracy.

The PR correctly implements CI gating for resolution quality without introducing configuration relaxation, test weakening, or breaking changes. It follows established patterns and provides proper visibility through both hard-fail tests and warning annotations.

Todo Checklist

Read and analyze all changed files
Understand the problem being solved and approach
Review configuration changes and test integrity
Assess backlog compliance (zero-dep, foundation-aligned, problem-fit, breaking)
Provide comprehensive review feedback

greptile-apps · 2026-04-06T22:36:00Z

Greptile Summary

This PR adds resolution precision/recall gating to the benchmark workflow: a new CI step runs resolution-benchmark.test.ts against fixture graphs after each benchmark run, and update-benchmark-report.ts now emits ::warning annotations when precision drops >5pp or recall >10pp vs the previous release. The regression-guard.test.ts gains a complementary resolution benchmarks describe block that performs the same comparison against the historical data embedded in BUILD-BENCHMARKS.md, running on every PR. All three previous review concerns (constant duplication, redundant file read, missing step timeout) have been addressed.

Confidence Score: 5/5

Safe to merge; all prior feedback addressed and no blocking issues remain.

The three previous P1/P2 concerns (constant duplication, redundant file read, missing step timeout) are all resolved. The only remaining observation is a P2 design note about the gate testing local source rather than the npm-benchmarked version, which is acceptable on release runs where they are identical.

No files require special attention for merge.

Important Files Changed

Filename	Overview
.github/workflows/benchmark.yml	Adds Gate on resolution thresholds step (timeout-minutes: 30) running resolution-benchmark.test.ts after the resolution benchmark script; gate always tests local source code regardless of which npm version is being benchmarked.
scripts/update-benchmark-report.ts	Adds resolution regression warning logic emitting ::warning annotations when precision drops >5pp or recall drops >10pp between releases; SYNC comments link to matching constants in regression-guard.test.ts.
tests/benchmarks/regression-guard.test.ts	Adds resolution benchmarks describe block that checks historical precision/recall drops from BUILD-BENCHMARKS.md; reuses buildHistory cast as BuildEntryWithResolution to avoid redundant file read.

Sequence Diagram

sequenceDiagram
    participant CI as GitHub Actions
    participant BM as benchmark.ts (npm/local)
    participant RM as resolution-benchmark.ts (npm/local)
    participant GT as resolution-benchmark.test.ts (local src)
    participant MG as merge step
    participant RPT as update-benchmark-report.ts
    participant RG as regression-guard.test.ts (every PR)

    CI->>BM: Run build benchmark → benchmark-result.json
    CI->>RM: Run resolution benchmark → resolution-result.json
    CI->>GT: Gate: vitest resolution thresholds (local source only)
    GT-->>CI: pass / fail on per-language threshold
    CI->>MG: Merge resolution-result.json into benchmark-result.json
    CI->>RPT: Update BUILD-BENCHMARKS.md + emit ::warning on drops >5pp/10pp
    note over RG: Runs on every PR (not in benchmark workflow)
    RG->>RPT: Read BUILD-BENCHMARKS.md history
    RG-->>CI: fail if precision/recall dropped vs previous release

Reviews (3): Last reviewed commit: "fix: document resolution key format in K..."

greptile-apps · 2026-04-06T22:36:03Z

tests/benchmarks/regression-guard.test.ts

+    const PRECISION_DROP_PP = 0.05;
+    const RECALL_DROP_PP = 0.1;


Threshold constants duplicated across files

The 5 pp / 10 pp drop limits are defined independently here and again as PRECISION_DROP_THRESHOLD / RECALL_DROP_THRESHOLD in scripts/update-benchmark-report.ts (lines 325–326). If one side is tightened (e.g. recall narrowed from 10 pp to 5 pp), the ::warning signal and the hard-fail gate silently diverge — engineers would see no warning but a failing test, with no obvious reason for the mismatch. Consider extracting a shared constants file, or at minimum adding a cross-reference comment pointing to the mirror definition.

Fixed — added bidirectional SYNC comments to both regression-guard.test.ts (pointing to the script) and update-benchmark-report.ts (pointing to the test). Extracting a shared constants file was considered but the script runs as a standalone node invocation in CI, so cross-reference comments are the pragmatic choice here.

greptile-apps · 2026-04-06T22:36:04Z

tests/benchmarks/regression-guard.test.ts

+    const fullHistory = extractJsonData<BuildEntryWithResolution>(
+      path.join(BENCHMARKS_DIR, 'BUILD-BENCHMARKS.md'),
+      'BENCHMARK_DATA',
+    );


Redundant file read

fullHistory reads BUILD-BENCHMARKS.md with the same 'BENCHMARK_DATA' marker as buildHistory (line 338–342). Since BuildEntryWithResolution extends BuildEntry, the earlier array can be widened with a cast instead of re-parsing the file:

Suggested change

-    const fullHistory = extractJsonData<BuildEntryWithResolution>(
-      path.join(BENCHMARKS_DIR, 'BUILD-BENCHMARKS.md'),
-      'BENCHMARK_DATA',
-    );
+    const fullHistory = buildHistory as BuildEntryWithResolution[];

Fixed — replaced the second extractJsonData call with buildHistory as BuildEntryWithResolution[] since both read the same file (BUILD-BENCHMARKS.md) with the same marker (BENCHMARK_DATA). BuildEntryWithResolution extends BuildEntry so the widening cast is safe.
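A small sketch of this reuse pattern follows. The interface fields are hypothetical stand-ins, since the real `BuildEntry` definition is not shown in the thread, and the example assumes the resolution field is optional. Note that `as` is a compile-time assertion only: nothing is checked at runtime, so entries without resolution data still need a guard.

```typescript
// Hypothetical field shapes: the real BuildEntry has more fields than shown here.
interface BuildEntry {
  version: string;
}

interface BuildEntryWithResolution extends BuildEntry {
  // Assumption: optional, because older releases were recorded without resolution data.
  resolution?: Record<string, { precision: number; recall: number }>;
}

// Stand-in for the array parsed once from BUILD-BENCHMARKS.md.
const buildHistory: BuildEntry[] = JSON.parse(
  '[{"version":"1.0.0"},{"version":"1.1.0","resolution":{"go":{"precision":0.9,"recall":0.8}}}]',
);

// Reuse the parsed array under the richer type instead of re-reading the file.
const fullHistory = buildHistory as BuildEntryWithResolution[];

// The assertion does not validate anything, so filter out entries lacking the field.
const withResolution = fullHistory.filter((entry) => entry.resolution !== undefined);
```

The trade-off is one fewer file read and parse in exchange for relying on the optional field being handled wherever `fullHistory` is consumed.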

greptile-apps · 2026-04-06T22:36:05Z

.github/workflows/benchmark.yml

+      - name: Gate on resolution thresholds
+        if: steps.existing.outputs.skip != 'true'
+        run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose


No timeout on the gate step

The resolution test suite builds graphs for every language fixture (~30+ languages, 60 s beforeAll budget each). Neither this step nor the build-benchmark job has a timeout-minutes cap, so a hanging WASM build can stall the entire job indefinitely. Consider adding a step-level timeout-minutes: 30 to bound the gate:

Suggested change

-      - name: Gate on resolution thresholds
-        if: steps.existing.outputs.skip != 'true'
-        run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose
+      - name: Gate on resolution thresholds
+        if: steps.existing.outputs.skip != 'true'
+        timeout-minutes: 30
+        run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose

Fixed — added timeout-minutes: 30 to the gate step. 30 minutes is generous enough for the full language fixture suite while bounding worst-case CI stalls.

…ant file read (#886)

Add cross-reference SYNC comments between regression-guard.test.ts and update-benchmark-report.ts so the duplicated precision/recall thresholds stay in lockstep. Replace the second extractJsonData call with a type cast of buildHistory since both read the same file and marker.

Prevents a hanging WASM build from stalling the entire benchmark job indefinitely. 30-minute cap is generous enough for the full language fixture suite while still bounding worst-case CI time.

carlos-alm · 2026-04-06T22:48:26Z

@greptileai

carlos-alm · 2026-04-06T23:28:43Z

Addressed the P2 nit from the second Greptile review: added resolution-specific key format documentation to the KNOWN_REGRESSIONS comment block ("version:resolution <lang> precision" / "version:resolution <lang> recall"). Also resolved merge conflicts with main (PR #885 landed on main with type map confidence-aware dedup changes).

carlos-alm · 2026-04-06T23:28:53Z

@greptileai

carlos-alm added 2 commits April 6, 2026 16:28

style: fix biome formatting in regression guard

0583f57

greptile-apps bot reviewed Apr 6, 2026

carlos-alm added 2 commits April 6, 2026 16:46

fix: add timeout-minutes to resolution gate step (#886)

8b532b2

carlos-alm added 2 commits April 6, 2026 17:26

fix: resolve merge conflicts with main

4c4aa03

fix: document resolution key format in KNOWN_REGRESSIONS comment (#886)

54bb452

carlos-alm merged commit 5ee0070 into main Apr 6, 2026
13 checks passed

carlos-alm deleted the feat/resolution-ci-gate branch April 6, 2026 23:42

github-actions bot locked and limited conversation to collaborators Apr 6, 2026

carlos-alm merged 6 commits into main from feat/resolution-ci-gate
