fix: reduce stop hook API timeout from 10m to 90s #567
…blocking

When the Supermodel API is unreachable, pollJob retries failed connections every 10 seconds. With the previous 10-minute context, the Stop hook would hang for up to ~10 minutes before giving up, making Claude Code sessions unusable during API outages.

Reduce the API fetch timeout in runHandler and runWithoutCache to 90 seconds. Long-running first-time fetches for large repos are already handled by the background pregen hook (20-minute timeout), so the stop hook can fail fast and gracefully on API outage without disrupting sessions.

Co-Authored-By: Grey Newell <greyshipscode@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉
Recent review info
Run configuration: Configuration used: Organization UI; Review profile: CHILL; Plan: Pro
Files selected for processing: 1
Walkthrough
The API client's connection error handling was simplified. Instead of logging a warning and retrying with backoff delays when connection-level errors occur (DNS failures, refused connections, network issues), the function now immediately returns an error signal. HTTP-level errors continue using the existing retry logic.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
…nutes

When the Supermodel API is unreachable, pollJob was retrying connection errors (connection refused, DNS failure, network down) every 10 seconds for the full context duration, up to 10 minutes, before giving up. This blocked the Claude Code Stop hook for the entire outage window.

Connection errors are fundamentally different from job-processing delays:
- "pending"/"processing" status → API is working, polling makes sense
- Connection error → API is unreachable, retrying won't help

Change pollJob to return immediately on connection-level errors so the Stop hook can call silentExit() and unblock the session without waiting for the context deadline. 5xx errors, rate limits, and job-in-progress responses continue to be retried as before.

Co-Authored-By: Grey Newell <greyshipscode@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/run.go`:
- Around line 229-235: The current 90s timeout only wraps the Supermodel fetch
and doesn't bound the full Stop hook; create a single parent context with a
total deadline for the run path (e.g., in the function that calls
runWithoutCache) and pass that parent ctx into runWithoutCache so all
sub-operations (project.Detect, GetWorkingMemory, and the Supermodel fetch)
derive sub-contexts from it (use context.WithTimeout/WithDeadline on the parent
to create shorter child contexts where needed and replace the local
context.WithTimeout in the Supermodel fetch with a child of the parent). Ensure
functions like runWithoutCache, project.Detect, and GetWorkingMemory accept and
use the passed parent context so the whole end-to-end Stop hook is capped by the
single parent deadline.
cmd/run.go
Outdated
  // If no cache or forced refresh, fetch from API.
+ // Use a short timeout so the Stop hook never blocks a Claude Code session
+ // for more than ~90 seconds during an API outage. Long-running first-time
+ // fetches for large repos are handled by the background pregen hook.
  if graph == nil || forceRefresh {
      logFn("[debug] fetching from Supermodel API...")
-     ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
+     ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
The 90s cap only applies to the fetch phase, not the whole Stop hook.
Right now the command can still spend up to 5s in project.Detect, 15s in GetWorkingMemory, and then another 90s here, so the real wall-clock cap is still closer to ~110s. If the PR goal is a true end-to-end Stop-hook limit, give the whole run path a parent deadline and derive these sub-contexts from it.
Possible shape of the fix
func runHandler(cmd *cobra.Command, args []string) error {
+ runCtx, runCancel := context.WithTimeout(context.Background(), 90*time.Second)
+ defer runCancel()
+
...
- gitCtx, gitCancel := context.WithTimeout(context.Background(), 5*time.Second)
+ gitCtx, gitCancel := context.WithTimeout(runCtx, 5*time.Second)
defer gitCancel()
...
- wmCtx, wmCancel := context.WithTimeout(context.Background(), 15*time.Second)
+ wmCtx, wmCancel := context.WithTimeout(runCtx, 15*time.Second)
defer wmCancel()
...
- ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
+ ctx, cancel := context.WithTimeout(runCtx, 90*time.Second)
defer cancel()
You'd want to thread that same parent context into runWithoutCache too.
Also applies to: 475-479
Problem
When the Supermodel API is unreachable, the Stop hook hangs for up to 10 minutes before giving up. During this window, Claude Code sessions are effectively frozen.
Root cause:
runHandler and runWithoutCache both create a 10-minute context for the API call. When the API is down, pollJob retries immediately-failing connections every 10 seconds, so with a 10-minute context the hook blocks for ~600 seconds before calling silentExit().
Fix
Reduce the API fetch timeout in the run command from 10 minutes to 90 seconds.
… pregen both retain their 20-minute timeouts ✓
The 10-minute timeout was designed for large repos on first run, but that use case is now handled by the background pregen hook, making the shorter Stop hook timeout safe.
Test plan
/etc/hosts): the Stop hook should give up within ~90s with no output
go build ./... and go vet ./... pass
🤖 Generated with Claude Code