
fix: reduce stop hook API timeout from 10m to 90s #567

Merged
greynewell merged 2 commits into main from claude/issue-api-outage-20260309-0000
Mar 9, 2026

@greynewell greynewell commented Mar 9, 2026

Problem

When the Supermodel API is unreachable, the Stop hook hangs for up to 10 minutes before giving up. During this window, Claude Code sessions are effectively frozen.

Root cause: runHandler and runWithoutCache both create a 10-minute context for the API call. When the API is down, pollJob retries immediately-failing connections every 10 seconds — so with a 10-minute context, the hook blocks for ~600 seconds before calling silentExit().

Fix

Reduce the API fetch timeout in the run command from 10 minutes to 90 seconds.

  • API unreachable → Stop hook gives up after ~90s instead of ~10 minutes ✓
  • API slow (large repo, first run) → 90s limit, then graceful silent exit; pregen (background, 20-minute timeout) warms the cache for the next compaction ✓
  • Fresh/stale cache hit → API is never called; behaviour unchanged ✓
  • Background stale refresh goroutine and pregen both retain their 20-minute timeouts ✓

The 10-minute timeout was designed for large repos on first run, but that use case is now handled by the background pregen hook — making the shorter Stop hook timeout safe.
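
For illustration, the effect of the shorter deadline can be sketched as follows. This is a minimal sketch, not the code in cmd/run.go: `fetchGraph`, `fetchWithLimit`, and `timedOut` are hypothetical names, and the durations in `main` are scaled down (90ms stands in for 90s) so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchGraph is a hypothetical stand-in for the API call. It blocks until the
// simulated work finishes or the context is done, whichever comes first.
func fetchGraph(ctx context.Context, work time.Duration) (string, error) {
	select {
	case <-time.After(work):
		return "graph", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// fetchWithLimit bounds a single fetch by limit, in the same way the run
// command bounds its API fetch with context.WithTimeout.
func fetchWithLimit(limit, work time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	return fetchGraph(ctx, work)
}

// timedOut reports whether err is a context deadline error.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// A fetch that would hang for 10s is cut off at the 90ms limit.
	_, err := fetchWithLimit(90*time.Millisecond, 10*time.Second)
	fmt.Println("hung fetch timed out:", timedOut(err))

	// A fetch that finishes inside the limit is unaffected.
	graph, err := fetchWithLimit(90*time.Millisecond, time.Millisecond)
	fmt.Println("fast fetch:", graph, err)
}
```

On a deadline error the caller would take the silent-exit path described above rather than print anything.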

Test plan

  • Simulate API outage (block the API host via /etc/hosts) — Stop hook should give up within ~90s with no output
  • Fresh-cache scenario — hook still serves instantly from cache
  • Stale-cache scenario — hook serves stale immediately, background refresh attempts (and fails silently) in parallel
  • go build ./... and go vet ./... pass
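
The cache scenarios in the test plan can be pictured with the sketch below. It is an assumption-heavy illustration, not the project's cache code: `cacheEntry`, `fetcher`, `serve`, and `serveWithDownAPI` are hypothetical names, and only the 90-second and 20-minute timeouts come from this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// cacheEntry is a hypothetical cached result; Stale marks it as outdated.
type cacheEntry struct {
	Graph string
	Stale bool
}

// fetcher is a hypothetical stand-in for the API call.
type fetcher func(ctx context.Context) (string, error)

// serve returns what the hook would output and whether it blocked on the API.
// A cached entry is served at once; a stale one also starts a background
// refresh under its own longer timeout, and refresh errors are dropped.
func serve(e *cacheEntry, fetch fetcher, wg *sync.WaitGroup) (graph string, blocked bool) {
	if e == nil {
		// No cache: the only path that waits on the API, capped at 90s.
		ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
		defer cancel()
		g, err := fetch(ctx)
		if err != nil {
			return "", true // the caller would exit silently here
		}
		return g, true
	}
	if e.Stale {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
			defer cancel()
			_, _ = fetch(ctx) // a failed refresh is intentionally silent
		}()
	}
	return e.Graph, false
}

// serveWithDownAPI runs serve against a fetcher that always fails,
// simulating an outage.
func serveWithDownAPI(e *cacheEntry) (string, bool) {
	var wg sync.WaitGroup
	down := func(ctx context.Context) (string, error) {
		return "", errors.New("connection refused")
	}
	graph, blocked := serve(e, down, &wg)
	wg.Wait() // only so the demo exits cleanly; a real hook would not wait
	return graph, blocked
}

func main() {
	fmt.Println(serveWithDownAPI(&cacheEntry{Graph: "cached"}))
	fmt.Println(serveWithDownAPI(&cacheEntry{Graph: "cached", Stale: true}))
	fmt.Println(serveWithDownAPI(nil))
}
```

In this sketch only the no-cache case touches the blocking path, which is why the shorter timeout leaves the cache-hit scenarios unchanged.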

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced connection error handling: The API client now immediately returns errors for connection-level failures (such as network problems or DNS issues) instead of attempting automatic retries with delays. This provides faster feedback when the service is unreachable. HTTP-level errors continue to be handled as before.

…blocking

When the Supermodel API is unreachable, pollJob retries failed connections
every 10 seconds. With the previous 10-minute context, the Stop hook would
hang for up to ~10 minutes before giving up — making Claude Code sessions
unusable during API outages.

Reduce the API fetch timeout in runHandler and runWithoutCache to 90 seconds.
Long-running first-time fetches for large repos are already handled by the
background pregen hook (20-minute timeout), so the stop hook can fail fast
and gracefully on API outage without disrupting sessions.

Co-Authored-By: Grey Newell <greyshipscode@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot commented Mar 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4773fee2-030c-47fc-bc66-176a89aaa004

📥 Commits

Reviewing files that changed from the base of the PR and between d01464b and dc468ea.

📒 Files selected for processing (1)
  • internal/api/client.go

Walkthrough

The API client's connection error handling was simplified. Instead of logging a warning and retrying with backoff delays when connection-level errors occur (DNS failures, refused connections, network issues), the function now immediately returns an error signal. HTTP-level errors continue using the existing retry logic.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Connection Error Handling: internal/api/client.go | Removed retry/backoff logic for connection-level errors (DNS, connection refused, network timeouts); now immediately returns unreachable API error. HTTP-level error handling unchanged. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🔌 When networks fail and DNS breaks,
No more waiting, no more shakes,
Fail fast now, cut to the chase,
Connection errors show their face.
Simple paths, cleaner trace! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title mentions reducing API timeout from 10m to 90s, but the actual change in internal/api/client.go is about making connection-level errors fail fast instead of retrying, a different fix than the timeout reduction. | Update the title to reflect the actual change: something like 'fix: fail fast on connection-level errors in pollJob' would better describe what the code change actually does. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


…nutes

When the Supermodel API is unreachable, pollJob was retrying connection
errors (connection refused, DNS failure, network down) every 10 seconds
for the full context duration — up to 10 minutes — before giving up.
This blocked the Claude Code Stop hook for the entire outage window.

Connection errors are fundamentally different from job-processing delays:
- "pending"/"processing" status → API is working, polling makes sense
- Connection error → API is unreachable, retrying won't help

Change pollJob to return immediately on connection-level errors so the
Stop hook can call silentExit() and unblock the session without waiting
for the context deadline. 5xx errors, rate limits, and job-in-progress
responses continue to be retried as before.

Co-Authored-By: Grey Newell <greyshipscode@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
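
The distinction drawn in this commit message can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in internal/api/client.go: `poll`, `step`, `isConnError`, `errUnreachable`, `connRefused`, and `pollScript` are hypothetical names, and treating `*net.OpError` and `*net.DNSError` as the connection-level errors is an assumption about how such failures surface.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// errUnreachable is a hypothetical sentinel meaning the API cannot be reached.
var errUnreachable = errors.New("api unreachable")

// step is one simulated probe result: a job status or a request error.
type step struct {
	status string
	err    error
}

// connRefused builds an error shaped like a failed TCP dial.
func connRefused() error {
	return &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}
}

// isConnError reports whether err wraps a connection-level failure.
func isConnError(err error) bool {
	var dnsErr *net.DNSError
	var opErr *net.OpError
	return errors.As(err, &dnsErr) || errors.As(err, &opErr)
}

// unreachable reports whether err wraps errUnreachable.
func unreachable(err error) bool { return errors.Is(err, errUnreachable) }

// poll returns at once on a connection-level error, returns nil when the job
// completes, and otherwise waits for interval (or the deadline) and retries.
func poll(ctx context.Context, interval time.Duration, check func() step) error {
	for {
		s := check()
		if s.err != nil && isConnError(s.err) {
			return fmt.Errorf("%w: %v", errUnreachable, s.err)
		}
		if s.err == nil && s.status == "completed" {
			return nil
		}
		select {
		case <-time.After(interval): // pending, processing, or a retryable error
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// pollScript replays the given steps and returns the number of probes made.
func pollScript(steps ...step) (calls int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err = poll(ctx, time.Millisecond, func() step {
		i := calls
		if i >= len(steps) {
			i = len(steps) - 1 // keep repeating the last step
		}
		calls++
		return steps[i]
	})
	return calls, err
}

func main() {
	calls, err := pollScript(step{err: connRefused()})
	fmt.Println("outage:", calls, "probe(s), unreachable =", unreachable(err))

	calls, err = pollScript(step{status: "pending"}, step{status: "processing"}, step{status: "completed"})
	fmt.Println("healthy:", calls, "probe(s), err =", err)
}
```

With this shape an outage costs a single probe instead of one probe per polling interval until the context deadline.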

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/run.go`:
- Around line 229-235: The current 90s timeout only wraps the Supermodel fetch
and doesn't bound the full Stop hook; create a single parent context with a
total deadline for the run path (e.g., in the function that calls
runWithoutCache) and pass that parent ctx into runWithoutCache so all
sub-operations (project.Detect, GetWorkingMemory, and the Supermodel fetch)
derive sub-contexts from it (use context.WithTimeout/WithDeadline on the parent
to create shorter child contexts where needed and replace the local
context.WithTimeout in the Supermodel fetch with a child of the parent). Ensure
functions like runWithoutCache, project.Detect, and GetWorkingMemory accept and
use the passed parent context so the whole end-to-end Stop hook is capped by the
single parent deadline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5585285b-f478-418c-8877-9b6653e90172

📥 Commits

Reviewing files that changed from the base of the PR and between 317cc93 and d01464b.

📒 Files selected for processing (1)
  • cmd/run.go

cmd/run.go Outdated
Comment on lines +229 to +235
// If no cache or forced refresh, fetch from API.
// Use a short timeout so the Stop hook never blocks a Claude Code session
// for more than ~90 seconds during an API outage. Long-running first-time
// fetches for large repos are handled by the background pregen hook.
if graph == nil || forceRefresh {
logFn("[debug] fetching from Supermodel API...")
- ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
+ ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
⚠️ Potential issue | 🟡 Minor

The 90s cap only applies to the fetch phase, not the whole Stop hook.

Right now the command can still spend up to 5s in project.Detect, 15s in GetWorkingMemory, and then another 90s here, so the real wall-clock cap is closer to ~110s (5s + 15s + 90s). If the PR goal is a true end-to-end Stop-hook limit, give the whole run path a parent deadline and derive these sub-contexts from it.

Possible shape of the fix
 func runHandler(cmd *cobra.Command, args []string) error {
+	runCtx, runCancel := context.WithTimeout(context.Background(), 90*time.Second)
+	defer runCancel()
+
 	...
-	gitCtx, gitCancel := context.WithTimeout(context.Background(), 5*time.Second)
+	gitCtx, gitCancel := context.WithTimeout(runCtx, 5*time.Second)
 	defer gitCancel()

 	...
-	wmCtx, wmCancel := context.WithTimeout(context.Background(), 15*time.Second)
+	wmCtx, wmCancel := context.WithTimeout(runCtx, 15*time.Second)
 	defer wmCancel()

 	...
-	ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
+	ctx, cancel := context.WithTimeout(runCtx, 90*time.Second)
 	defer cancel()

You'd want to thread that same parent context into runWithoutCache too.

Also applies to: 475-479
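
The effect of deriving the sub-contexts from one parent can be checked with a small standalone sketch. It is illustrative only: `runStages` and `cappedByParent` are hypothetical helpers, the stages simply block until their context expires, and the limits in `main` are scaled from seconds to milliseconds.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cappedByParent reports whether a child context with childLimit, derived
// from a parent with parentLimit, ends up with the parent's deadline.
func cappedByParent(parentLimit, childLimit time.Duration) bool {
	parent, cancelParent := context.WithTimeout(context.Background(), parentLimit)
	defer cancelParent()
	child, cancelChild := context.WithTimeout(parent, childLimit)
	defer cancelChild()
	pd, _ := parent.Deadline()
	cd, _ := child.Deadline()
	return cd.Equal(pd)
}

// runStages simulates hung stages: each one blocks until its own context is
// done. It returns the total wall-clock time spent across all stages.
func runStages(parent context.Context, limits ...time.Duration) time.Duration {
	start := time.Now()
	for _, limit := range limits {
		ctx, cancel := context.WithTimeout(parent, limit)
		<-ctx.Done()
		cancel()
	}
	return time.Since(start)
}

func main() {
	// 5ms, 15ms, 90ms stand in for the 5s, 15s, and 90s stage limits.
	limits := []time.Duration{5 * time.Millisecond, 15 * time.Millisecond, 90 * time.Millisecond}

	// Each stage rooted in context.Background(): the limits add up.
	independent := runStages(context.Background(), limits...)

	// All stages derived from one parent: the parent bounds the total.
	parent, cancel := context.WithTimeout(context.Background(), 90*time.Millisecond)
	defer cancel()
	shared := runStages(parent, limits...)

	fmt.Println("independent contexts:", independent.Round(10*time.Millisecond))
	fmt.Println("shared parent:", shared.Round(10*time.Millisecond))
	fmt.Println("90s child capped by 90s parent:", cappedByParent(90*time.Second, 90*time.Second))
}
```

A child created with context.WithTimeout never outlives its parent, so the per-stage limits stay in place while the parent supplies the end-to-end cap.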


@greynewell greynewell merged commit 57756c9 into main Mar 9, 2026
3 checks passed
@greynewell greynewell deleted the claude/issue-api-outage-20260309-0000 branch March 9, 2026 16:57