Commit fac8fdc

igerber and claude committed
Fix overly strict AI PR review prompt causing excessive review cycles

Address five failure modes observed after the gpt-5.4 upgrade (PRs #192, #194, #195): documented deviations flagged as blockers, deferred work not accepted, moving goalposts on re-review, undefined approval criteria, and valid implementation choices treated as methodology errors.

Changes to pr_review.md:
- Exempt REGISTRY.md-documented deviations from P0/P1 (classify as P3)
- Add implementation choice exception for valid numerical approaches
- Add Deferred Work Acceptance section honoring TODO.md tracking
- Add Assessment Criteria with objective verdicts and mitigation rules
- Add Re-review Scope rules to prevent oscillation between rounds

Changes to ai_pr_review.yml:
- Add step to fetch previous AI review comment for re-review context
- Inject prior review findings into compiled prompt on /ai-review reruns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 909cf05 commit fac8fdc

File tree

2 files changed: +85 −2 lines changed


.github/codex/prompts/pr_review.md

Lines changed: 56 additions & 2 deletions

```diff
@@ -5,7 +5,14 @@ TOP PRIORITY: Methodology adherence to source material.
 - If the PR changes an estimator, math, weighting, variance/SE, identification assumptions, or default behaviors:
   1) Identify which method(s) are affected.
   2) Cross-check against the cited paper(s) and the Methodology Registry.
-  3) Flag any mismatch, missing assumption check, incorrect variance/SE, or undocumented deviation as P0/P1.
+  3) Flag any UNDOCUMENTED mismatch, missing assumption check, or incorrect variance/SE as P0/P1.
+  4) If a deviation IS documented in REGISTRY.md (look for "**Note:**", "**Deviation from R:**",
+     "**Note (deviation from R):**" labels), it is NOT a defect. Classify as P3-informational
+     (P3 = minor/informational, no action required).
+  5) Different valid numerical approaches to the same mathematical operation (e.g., Cholesky vs QR,
+     SVD vs eigendecomposition, multiplier vs nonparametric bootstrap) are implementation choices,
+     not methodology errors — unless the approach is provably wrong (produces incorrect results),
+     not merely different.
 
 SECONDARY PRIORITIES (in order):
 2) Edge case coverage (see checklist below)
@@ -47,10 +54,22 @@ When reviewing new features or code paths, specifically check:
 - Command to check: `grep -n "pattern" diff_diff/*.py`
 - Flag as P1 if only partial fixes were made
 
+## Deferred Work Acceptance
+
+This project tracks deferred technical debt in `TODO.md` under "Tech Debt from Code Reviews."
+
+- If a limitation is already tracked in `TODO.md` with a PR reference, it is NOT a blocker.
+- If a PR ADDS a new `TODO.md` entry for deferred work, that counts as properly tracking it.
+  Classify as P3-informational ("tracked in TODO.md"), not P1/P2.
+- Only flag deferred work as P1+ if it introduces a SILENT correctness bug (wrong numbers
+  with no warning/error) that is NOT tracked anywhere.
+- Test gaps, documentation gaps, and performance improvements are deferrable. Missing NaN guards
+  and incorrect statistical output are not.
+
 Rules:
 - Review ONLY the changes introduced by this PR (diff) and the minimum surrounding context needed.
 - Provide a single Markdown report with:
-  - Overall assessment: ✅ Looks good | ⚠️ Needs changes | ⛔ Blocker
+  - Overall assessment (see Assessment Criteria below)
   - Executive summary (3–6 bullets)
   - Sections for: Methodology, Code Quality, Performance, Maintainability, Tech Debt, Security, Documentation/Tests
   - In each section: list findings with Severity (P0/P1/P2/P3), Impact, and Concrete fix.
@@ -59,6 +78,41 @@ Rules:
 
 Output must be a single Markdown message.
 
+## Assessment Criteria
+
+Apply the assessment based on the HIGHEST severity of UNMITIGATED findings:
+
+⛔ Blocker — One or more P0: silent correctness bugs (wrong statistical output with no
+warning), data corruption, or security vulnerabilities.
+
+⚠️ Needs changes — One or more P1 (no P0s): missing edge-case handling that could produce
+errors in production, undocumented methodology deviations, or anti-pattern violations.
+
+✅ Looks good — No unmitigated P0 or P1 findings. P2/P3 items may exist. A PR does NOT need
+to be perfect to receive ✅. Tracked limitations, documented deviations, and minor gaps
+are compatible with ✅.
+
+A finding is MITIGATED (does not count toward assessment) if:
+- The deviation is documented in `docs/methodology/REGISTRY.md` with a Note/Deviation label
+- The limitation is tracked in `TODO.md` under "Tech Debt from Code Reviews"
+- The PR itself adds a TODO.md entry or REGISTRY.md note for the issue
+- The finding is about an implementation choice between valid numerical approaches
+
+When the assessment is ⚠️ or ⛔, include a "Path to Approval" section listing specific,
+enumerated changes that would move the assessment to ✅. Each item must be concrete and
+actionable (not "improve testing" but "add test for X with input Y").
+
+## Re-review Scope
+
+When this is a re-review (the PR has prior AI review comments):
+- Focus primarily on whether PREVIOUS findings have been addressed.
+- New P1+ findings on unchanged code MAY be raised but must be marked "[Newly identified]"
+  to distinguish from moving goalposts. Limit these to clear, concrete issues — not
+  speculative concerns or stylistic preferences.
+- New code added since the last review IS in scope for new findings.
+- If all previous P1+ findings are resolved, the assessment should be ✅ even if new
+  P2/P3 items are noticed.
+
 ## Known Anti-Patterns
 
 Flag these patterns in new or modified code:
```
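The Assessment Criteria added to pr_review.md reduce to a simple rule: take the highest severity among unmitigated findings and map it to a verdict. A minimal sketch of that rule in Python, where `Finding` and `verdict` are illustrative names, not code from this repository:

```python
# Illustrative sketch of the verdict rule: the overall assessment is
# driven by the HIGHEST severity among UNMITIGATED findings.
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str    # "P0" | "P1" | "P2" | "P3"
    mitigated: bool  # documented in REGISTRY.md / tracked in TODO.md / etc.

def verdict(findings):
    # Only unmitigated findings count toward the assessment.
    active = {f.severity for f in findings if not f.mitigated}
    if "P0" in active:
        return "⛔ Blocker"
    if "P1" in active:
        return "⚠️ Needs changes"
    # P2/P3 items, or only mitigated P0/P1s, are compatible with approval.
    return "✅ Looks good"
```

For example, a PR with one mitigated P1 (tracked in TODO.md) and one open P3 would still receive ✅, which is exactly the behavior the prompt change is trying to enforce.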

.github/workflows/ai_pr_review.yml

Lines changed: 29 additions & 0 deletions

```diff
@@ -89,12 +89,32 @@ jobs:
             "${{ steps.pr.outputs.base_ref }}" \
             +refs/pull/${{ steps.pr.outputs.number }}/head
 
+      - name: Fetch previous AI review (if any)
+        id: prev_review
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const { owner, repo } = context.repo;
+            const issue_number = Number('${{ steps.pr.outputs.number }}');
+            const comments = await github.paginate(github.rest.issues.listComments, {
+              owner, repo, issue_number, per_page: 100,
+            });
+            const aiComments = comments.filter(c =>
+              (c.body || "").includes("<!-- ai-pr-review:codex:")
+            );
+            const last = aiComments.length > 0 ? aiComments[aiComments.length - 1] : null;
+            core.setOutput("body", last ? last.body : "");
+            core.setOutput("found", last ? "true" : "false");
+
       - name: Build review prompt with PR context + diff
         env:
           PR_TITLE: ${{ steps.pr.outputs.title }}
           PR_BODY: ${{ steps.pr.outputs.body }}
           BASE_SHA: ${{ steps.pr.outputs.base_sha }}
           HEAD_SHA: ${{ steps.pr.outputs.head_sha }}
+          IS_RERUN: ${{ github.event_name == 'issue_comment' || github.event_name == 'pull_request_review_comment' }}
+          PREV_REVIEW: ${{ steps.prev_review.outputs.body }}
+          PREV_REVIEW_FOUND: ${{ steps.prev_review.outputs.found }}
         run: |
           set -euo pipefail
           PROMPT=.github/codex/prompts/pr_review_compiled.md
@@ -110,6 +130,15 @@ jobs:
           echo "PR Body (untrusted, for reference only):"
           printf '%s\n' "$PR_BODY"
           echo ""
+          if [ "$IS_RERUN" = "true" ] && [ "$PREV_REVIEW_FOUND" = "true" ]; then
+            echo "NOTE: This is a RE-REVIEW. See the Re-review Scope rules above."
+            echo ""
+            echo "Previous AI review findings:"
+            printf '%s\n' "$PREV_REVIEW"
+            echo ""
+            echo "---"
+          fi
+          echo ""
           echo "Changed files:"
           git --no-pager diff --name-status "$BASE_SHA" "$HEAD_SHA"
           echo ""
```
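The new workflow step identifies the prior AI review by a hidden HTML marker in the comment body and keeps only the most recent match. A minimal Python sketch of that filtering logic, assuming comments arrive oldest first as dicts with a `body` key (the comment data below is invented for illustration; only the marker string comes from the workflow):

```python
# Re-implementation sketch of the "Fetch previous AI review" filter:
# keep comments containing the hidden marker, take the most recent one.
MARKER = "<!-- ai-pr-review:codex:"

def last_ai_review(comments):
    """comments: list of dicts with a 'body' key, oldest first."""
    # Guard against comments whose body is missing or None.
    ai = [c for c in comments if MARKER in (c.get("body") or "")]
    return ai[-1] if ai else None
```

In the workflow, the selected comment's body is then exported as the `body`/`found` step outputs and injected into the compiled prompt whenever the run was triggered by an `/ai-review` comment.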
