Feature: Reduce API usage by cloning repos and analyzing diffs locally #5

@zkoppert

Description

Problem

The current approach makes at least one API call per open PR to fetch its changed files and patch data (GET /repos/{owner}/{repo}/pulls/{number}/files). For a repo with 500 open PRs, that's at least 500 API calls just for file data, before any verification calls. Across an organization with dozens of active repos, this adds up quickly and can hit rate limits (5,000 requests/hour for PATs; GitHub App installation limits start at 5,000 and reach 15,000 for installations on GitHub Enterprise Cloud organizations).

Proposed Approach

Instead of fetching file diffs via the API, clone the repository locally and use git diff to compute changed files and line ranges between each PR's head branch and the base branch.

How It Would Work

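  1. Create a blobless partial clone of the repo, then fetch the PR head refs:

    git clone --bare --filter=blob:none https://github.com/{owner}/{repo}.git
    git -C {repo}.git fetch origin '+refs/pull/*/head:refs/pull/*/head'

    Using a blobless clone (--filter=blob:none) keeps the initial clone fast: commits and trees are fetched eagerly, and blobs are fetched on demand when git diff needs them. Note that this is a partial clone, not a shallow one, so full history is available for merge-base computation. The wildcard refspec fetches the head of every PR, open or closed; to limit it to open PRs, fetch refs/pull/{number}/head for each number returned by the list call.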

  2. Compute diffs locally for each open PR:

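    git diff --name-only main...refs/pull/{number}/head
    git diff -U0 main...refs/pull/{number}/head -- {file}

    Parse the unified diff output to extract the same line range data we currently get from the API's patch field. Because a bare clone stores branches directly under refs/heads/ with no origin/ remote-tracking refs, the base is written as main rather than origin/main; substitute each PR's actual base branch where it differs.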

  3. Run conflict detection using the same find_file_overlaps() algorithm — only the data source changes.
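Steps 2 and 3 could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `Hunk`, `parse_hunks`, and `ranges_overlap` are hypothetical names, the input is assumed to be the text printed by `git diff -U0` for a single file, and `ranges_overlap` only illustrates the kind of range comparison an overlap check needs; it is not `find_file_overlaps()`.

```python
import re
from typing import NamedTuple

# Unified diff hunk header: @@ -old_start[,old_count] +new_start[,new_count] @@
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


class Hunk(NamedTuple):
    old_start: int
    old_count: int
    new_start: int
    new_count: int


def parse_hunks(diff_text: str) -> list[Hunk]:
    """Extract hunk line ranges from unified diff text (e.g. `git diff -U0`)."""
    hunks = []
    for line in diff_text.splitlines():
        match = HUNK_RE.match(line)
        if match is None:
            continue  # file headers and +/- content lines are skipped
        old_start, old_count, new_start, new_count = match.groups()
        hunks.append(
            Hunk(
                int(old_start),
                # An omitted count in a hunk header means exactly one line.
                int(old_count) if old_count is not None else 1,
                int(new_start),
                int(new_count) if new_count is not None else 1,
            )
        )
    return hunks


def ranges_overlap(start_a: int, count_a: int, start_b: int, count_b: int) -> bool:
    """Return True if two line ranges share at least one line."""
    if count_a == 0 or count_b == 0:
        return False  # a zero-length side (pure insertion or deletion) covers no lines
    return start_a < start_b + count_b and start_b < start_a + count_a
```

Whether the old-side or new-side ranges should feed the overlap check depends on what the existing algorithm expects from the API's patch field, so the sketch returns both.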

API Savings

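| Step                        | Current (API)            | Proposed (Local)              |
|-----------------------------|--------------------------|-------------------------------|
| List open PRs               | 1 call per 100 PRs       | Same (still needed)           |
| Fetch PR files/patches      | 1 call per PR            | 0 calls                       |
| Verify conflicts (optional) | 1 call per conflict pair | Same, or could use local merge |
| Total for 500 PRs           | ~506 calls               | ~6 calls                      |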
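The arithmetic behind the first two rows can be checked with a short Python sketch. `estimate_calls` is a hypothetical helper, and it assumes each PR's file list fits in a single API page and that no verification calls are made, so it gives a lower bound; the table's totals are approximate and may count calls outside these two steps.

```python
import math

PER_PAGE = 100  # page size taken from the "List open PRs" row


def estimate_calls(open_prs: int, mode: str) -> int:
    """Estimate calls for listing PRs plus fetching files; excludes verification."""
    list_calls = math.ceil(open_prs / PER_PAGE)
    # Only API mode pays one files call per PR; local mode reads diffs from the clone.
    file_calls = open_prs if mode == "api" else 0
    return list_calls + file_calls
```

For 500 open PRs this gives 505 calls in API mode and 5 in local mode, in line with the rough totals in the table.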

Additional Benefits

  • True merge conflict detection: With a local clone, we could run git merge-tree (Git 2.38+) to simulate merges without any API calls, replacing the current Phase 2 VERIFY_CONFLICTS API check. This would give us actual merge conflict detection rather than the heuristic-based overlap analysis.
    git merge-tree --write-tree refs/pull/1/head refs/pull/2/head
    # Exit code 1 = conflict, exit code 0 = clean merge
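  • No rate limit concerns: Once cloned, the analysis no longer consumes API quota. On-demand blob fetches in a blobless clone go over the Git protocol, not the REST API.
  • Richer diff data: Local diffs cover every changed file, whereas the API's file listing is paginated (up to 100 files per page) and capped at 3,000 files per PR.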
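The merge-tree check from the first bullet could be wrapped in Python along these lines. This is a sketch rather than a finished design: `has_merge_conflict` is a hypothetical name, `repo_dir` is assumed to point at the local clone, and the `run` parameter exists only so the function can be exercised without a real repository. The exit-code handling follows the documented behavior of `git merge-tree --write-tree` (0 for a clean merge, 1 for conflicts, anything else for an error).

```python
import subprocess
from typing import Callable


def has_merge_conflict(
    repo_dir: str,
    ref_a: str,
    ref_b: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Simulate merging ref_b into ref_a; True means the merge would conflict."""
    cmd = ["git", "-C", repo_dir, "merge-tree", "--write-tree", ref_a, ref_b]
    result = run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return False  # clean merge
    if result.returncode == 1:
        return True  # merge completed with conflicts
    # Any other status means the merge could not be attempted (bad ref, old Git, ...).
    raise RuntimeError(
        f"git merge-tree failed ({result.returncode}): {result.stderr.strip()}"
    )
```

Raising on unexpected exit codes keeps an environment problem, such as a Git older than 2.38, from being misreported as a conflict.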

Trade-offs to Consider

  • Disk space: A blobless clone of a large monorepo could still be significant. Should be cleaned up after the run.
  • Clone time: Initial clone + ref fetch takes time, especially for large repos. For small repos with few PRs, the API approach may actually be faster. This argues for making it configurable or auto-selecting based on PR count.
  • Git dependency: Requires git to be available in the container. The Docker image's base, python:3.14-slim, does not include git, so it would need to be installed in the image.
  • GitHub Actions disk limits: Runners have ~14GB of free space. Most repos will be fine, but extremely large monorepos could be tight even with blobless clones.
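The git and disk-space prerequisites above could be checked before attempting a clone. A minimal sketch, assuming a hypothetical helper named `local_mode_available` and a caller-chosen `min_free_bytes` threshold (the issue does not specify one):

```python
import shutil


def local_mode_available(workdir: str, min_free_bytes: int) -> bool:
    """Return True if git is on PATH and workdir's filesystem has enough free space."""
    if shutil.which("git") is None:
        return False
    return shutil.disk_usage(workdir).free >= min_free_bytes
```

A check like this could feed the decision to use the API path instead, rather than failing partway through a clone.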

Configuration

ANALYSIS_MODE: "local"  # "api" (default, current behavior) or "local" (clone-based)

Could also support "auto" which uses the API for repos with <50 open PRs and local clone for larger ones.
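One way the setting could be resolved, including the suggested "auto" behavior, is sketched below in Python. `resolve_analysis_mode` and `AUTO_THRESHOLD` are hypothetical names; the threshold of 50 comes from the suggestion above and would presumably be tuned against benchmarks.

```python
AUTO_THRESHOLD = 50  # from the "<50 open PRs" suggestion; an assumption, not a tuned value


def resolve_analysis_mode(setting: str, open_pr_count: int) -> str:
    """Map an ANALYSIS_MODE value to the concrete mode to run: "api" or "local"."""
    setting = setting.strip().lower()
    if setting in ("api", "local"):
        return setting
    if setting == "auto":
        # Small repos stay on the API; larger ones pay the clone cost once.
        return "api" if open_pr_count < AUTO_THRESHOLD else "local"
    raise ValueError(f"Unsupported ANALYSIS_MODE: {setting!r}")
```

Rejecting unknown values makes a typo in the configuration visible instead of silently selecting a mode.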

Acceptance Criteria

  • Local clone mode fetches all open PR head refs efficiently
  • Diff parsing produces identical results to the current API-based approach
  • API call count is reduced to only what's needed (list PRs, create issues)
  • Disk is cleaned up after each repo is processed
  • Falls back to API mode gracefully if clone fails
  • Performance benchmarks comparing API vs local for small/medium/large repos
  • Documentation updated with new config option
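The cleanup and fallback criteria could be combined in one small wrapper. The Python sketch below is an assumption about structure, not the project's code: `analyze_with_fallback` is a hypothetical name, and `analyze_local` and `analyze_api` stand in for whatever functions perform the clone-based and API-based analysis.

```python
import subprocess
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def analyze_with_fallback(
    analyze_local: Callable[[str], T],
    analyze_api: Callable[[], T],
) -> T:
    """Try clone-based analysis in a temp dir; fall back to the API path on failure."""
    try:
        # The directory and anything cloned into it are removed when the block
        # exits, whether the analysis returns or raises.
        with tempfile.TemporaryDirectory(prefix="pr-analysis-") as workdir:
            return analyze_local(workdir)
    except (OSError, subprocess.CalledProcessError):
        # Covers a missing git binary, disk errors, and failed git commands
        # run with check=True.
        return analyze_api()
```

Catching only clone-related errors keeps genuine bugs in the analysis itself from being hidden behind the fallback.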

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
