Description
Problem
The current approach makes one API call per open PR to fetch its changed files and patch data (`GET /repos/{owner}/{repo}/pulls/{number}/files`). For a repo with 500 open PRs, that's 500 API calls just for file data, before any verification calls. Across an organization with dozens of active repos, this adds up quickly and can hit rate limits (5,000 requests/hour for PATs; GitHub App installations get between 5,000 and 12,500, or 15,000 when installed on a GitHub Enterprise Cloud organization).
Proposed Approach
Instead of fetching file diffs via the API, clone the repository locally and use git diff to compute changed files and line ranges between each PR's head branch and the base branch.
How It Would Work
- Clone the repo (bare and blobless) and fetch the PR head refs:

      git clone --bare --filter=blob:none https://github.com/{owner}/{repo}.git
      cd {repo}.git
      git fetch origin '+refs/pull/*/head:refs/pull/*/head'

  Using a blobless clone (`--filter=blob:none`) keeps the initial clone fast: only commit and tree objects are fetched eagerly, and blobs are fetched on demand when `git diff` needs them. Note that `refs/pull/*/head` also exists for closed PRs, so the analysis should still iterate over the open PR list returned by the API.
- Compute diffs locally for each open PR:

      git diff --name-only main...refs/pull/{number}/head
      git diff -U0 main...refs/pull/{number}/head -- {file}

  A bare clone has no `origin/main` remote-tracking ref, so the base branch is referenced directly as `main`. Parse the unified diff output to extract the same line range data we currently get from the API's `patch` field.
- Run conflict detection using the same `find_file_overlaps()` algorithm; only the data source changes.
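The diff-parsing step could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `parse_changed_ranges` is a hypothetical helper name, the output shape (path mapped to new-side line ranges) is an assumption about what `find_file_overlaps()` consumes, and quoted paths (names with unusual characters) are not handled.

```python
import re

# Hunk header, e.g. "@@ -10,2 +12,3 @@"; an omitted count means 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_changed_ranges(diff_text):
    """Map each changed file to the (start, end) line ranges it gains in the PR."""
    ranges = {}
    current = None           # new-side path of the file being parsed
    old_left = new_left = 0  # hunk body lines still expected
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk body: consume lines so that content which happens
            # to start with "+++ " or "@@" is never mistaken for a header.
            if line.startswith("-"):
                old_left -= 1
            elif line.startswith("+"):
                new_left -= 1
            elif not line.startswith("\\"):  # skip "\ No newline at end of file"
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("diff --git "):
            current = None  # a new file section begins
            continue
        if line.startswith("+++ "):
            path = line[4:].rstrip("\t")
            # "/dev/null" means the file was deleted: no new-side lines exist.
            current = None if path == "/dev/null" else path.removeprefix("b/")
            if current is not None:
                ranges.setdefault(current, [])
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(1) or 1)
            start = int(match.group(2))
            new_left = int(match.group(3) or 1)
            # A new-side count of 0 is a pure deletion and adds no range.
            if current is not None and new_left > 0:
                ranges[current].append((start, start + new_left - 1))
    return ranges
```

Feeding it the output of `git diff -U0` for one PR yields per-file ranges that can be compared across PRs; whether deletions should also count as overlapping regions is a design decision this sketch leaves open.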
API Savings
| Step | Current (API) | Proposed (Local) |
|---|---|---|
| List open PRs | 1 call per 100 PRs | Same (still needed) |
| Fetch PR files/patches | 1 call per PR | 0 calls |
| Verify conflicts (optional) | 1 call per conflict pair | Same or could use local merge |
| Total for 500 PRs | ~506 calls | ~6 calls |
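The totals in the table follow from simple arithmetic, sketched below. The function name is hypothetical, verification and issue-creation calls are deliberately left out, and the count assumes each PR's file list fits in a single API page.

```python
import math


def estimate_api_calls(open_prs, per_page=100):
    """Return (api_mode_calls, local_mode_calls), excluding verification calls."""
    list_calls = math.ceil(open_prs / per_page)  # paginated "list open PRs"
    file_calls = open_prs                        # one files request per PR
    return list_calls + file_calls, list_calls


api_calls, local_calls = estimate_api_calls(500)
print(api_calls, local_calls)  # 505 5
```

These figures are in line with the table's approximate totals of ~506 and ~6.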
Additional Benefits
- True merge conflict detection: With a local clone, we could run `git merge-tree` (Git 2.38+) to simulate merges without any API calls, replacing the current Phase 2 `VERIFY_CONFLICTS` API check. This would give us actual merge conflict detection rather than the heuristic-based overlap analysis.

      git merge-tree --write-tree refs/pull/1/head refs/pull/2/head
      # Exit code 1 = conflict, exit code 0 = clean merge

- No rate limit concerns: Once cloned, all analysis is CPU-bound, not API-bound.
- Richer diff data: Local diffs give full context and are not subject to the API's cap on files returned per PR (3,000 for the pull request files endpoint).
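A local verification step built on `git merge-tree` might look like the sketch below. `prs_conflict` and its `run` parameter are hypothetical names introduced for illustration (the injectable runner only exists to make the function testable without a repository); the exit-code handling follows the documented behavior of `git merge-tree --write-tree`.

```python
import subprocess


def prs_conflict(repo_dir, ref_a, ref_b, run=subprocess.run):
    """Return True if merging ref_b into ref_a would conflict, False if clean."""
    result = run(
        ["git", "-C", repo_dir, "merge-tree", "--write-tree", ref_a, ref_b],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return False  # clean merge
    if result.returncode == 1:
        return True   # merge completed with conflicts
    # Any other status means git itself failed (bad ref, Git older than 2.38, ...).
    raise RuntimeError(f"git merge-tree failed: {result.stderr.strip()}")
```

For example, `prs_conflict("repo.git", "refs/pull/1/head", "refs/pull/2/head")` would replace one `VERIFY_CONFLICTS` API call for that pair. In a blobless clone the command may still fetch blobs on demand, so it is not entirely network-free.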
Trade-offs to Consider
- Disk space: A blobless clone of a large monorepo could still be significant. Should be cleaned up after the run.
- Clone time: Initial clone + ref fetch takes time, especially for large repos. For small repos with few PRs, the API approach may actually be faster. This argues for making it configurable or auto-selecting based on PR count.
- Git dependency: Requires
gitto be available in the container (already present in the Docker image's basepython:3.14-slim, but would needgitinstalled). - GitHub Actions disk limits: Runners have ~14GB of free space. Most repos will be fine, but extremely large monorepos could be tight with blobless clones.
Configuration
    ANALYSIS_MODE: "local"  # "api" (default, current behavior) or "local" (clone-based)

Could also support "auto", which uses the API for repos with <50 open PRs and a local clone for larger ones.
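One way the mode selection could work is sketched here. `resolve_analysis_mode` is a hypothetical helper, and the threshold of 50 is taken from the "auto" suggestion above rather than from any benchmark.

```python
VALID_MODES = {"api", "local", "auto"}


def resolve_analysis_mode(configured, open_pr_count, threshold=50):
    """Turn the ANALYSIS_MODE setting into the concrete mode to run."""
    # An unset or empty value keeps the current behavior.
    mode = (configured or "api").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"ANALYSIS_MODE must be one of {sorted(VALID_MODES)}")
    if mode == "auto":
        # Fewer than `threshold` open PRs: the API is likely cheaper than cloning.
        return "local" if open_pr_count >= threshold else "api"
    return mode
```

With this shape, `resolve_analysis_mode("auto", 500)` selects the clone-based path while `resolve_analysis_mode(None, 500)` stays on the API.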
Acceptance Criteria
- Local clone mode fetches all open PR head refs efficiently
- Diff parsing produces identical results to the current API-based approach
- API call count is reduced to only what's needed (list PRs, create issues)
- Disk is cleaned up after each repo is processed
- Falls back to API mode gracefully if clone fails
- Performance benchmarks comparing API vs local for small/medium/large repos
- Documentation updated with new config option
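The cleanup and fallback criteria could be met with a wrapper along these lines. This is a sketch under assumptions: `analyze_repo`, `analyze_local`, and `analyze_api` are hypothetical names standing in for the project's real entry points, and `run` is injectable only so the behavior can be tested without network access.

```python
import subprocess
import tempfile


def analyze_repo(clone_url, analyze_local, analyze_api, run=subprocess.run):
    """Try clone-based analysis; fall back to the API path if git fails."""
    # TemporaryDirectory deletes the clone when the block exits, even on error.
    with tempfile.TemporaryDirectory(prefix="pr-conflicts-") as workdir:
        try:
            run(
                ["git", "clone", "--bare", "--filter=blob:none", clone_url, workdir],
                check=True,
                capture_output=True,
            )
            run(
                ["git", "-C", workdir, "fetch", "origin",
                 "+refs/pull/*/head:refs/pull/*/head"],
                check=True,
                capture_output=True,
            )
        except (subprocess.CalledProcessError, OSError):
            # Clone or fetch failed, or git is not installed: use API mode.
            return analyze_api()
        return analyze_local(workdir)
```

Errors raised by `analyze_local` itself are left to propagate here; whether those should also trigger the API fallback is a separate decision.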