Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
204 changes: 204 additions & 0 deletions PR_DESCRIPTION.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,204 @@
# Fix: Apply HTTP client timeouts to prevent infinite hangs

## Summary

Fixes a critical bug where octocode's GraphRAG indexing process hangs indefinitely during LLM API calls. The root cause is that `reqwest::Client` instances are created using `Client::new()` instead of the builder pattern, which prevents the configured `batch_timeout_seconds` from being applied.

## Problem

When using LLM-powered GraphRAG features (`use_llm = true`), the indexing process hangs indefinitely at two points:

1. **File description generation** (with `ai_batch_size > 1`)
2. **Architectural relationship extraction** (always processes all files in one batch)

The configuration parameter `graphrag.llm.batch_timeout_seconds` is loaded but never applied to the HTTP client, causing requests to wait indefinitely when the LLM provider is slow to respond.

## Root Cause Analysis

### Primary Bug (`src/indexer/graphrag/builder.rs:74`)
```rust
// BEFORE (buggy):
let client = Client::new();

// AFTER (fixed):
let client = Client::builder()
.timeout(std::time::Duration::from_secs(
config.graphrag.llm.batch_timeout_seconds,
))
.build()?;
```

### Secondary Bug (`src/embedding/provider/huggingface.rs:460`)
```rust
// BEFORE (buggy):
let client = reqwest::Client::new();

// AFTER (fixed):
let client = reqwest::Client::builder()
.timeout(std::time::Duration::from_secs(30))
.build()?;
```

## Reproduction

1. Configure octocode with `use_llm = true` and `ai_batch_size > 1`
2. Run `octocode index` on any codebase
3. Process hangs at "AI analyzing X files for architectural relationships" with infinite spinner
4. No timeout occurs even after configured `batch_timeout_seconds` expires

## Testing

### Before Fix
- Both GPT-4.1-mini (default) and GPT-5-mini exhibited infinite hangs
- Workaround with `ai_batch_size = 1` only helped description phase
- Relationship extraction always hung (processes 72 files in single batch)

### After Fix
- Timeouts properly trigger after configured duration
- Failed requests return error messages instead of hanging forever
- Large batch processing completes or fails gracefully

## Impact

This bug made LLM-powered GraphRAG features unusable for any non-trivial codebase. The fix enables:
- Reliable timeout behavior for all LLM API calls
- Proper error handling and recovery
- Predictable indexing duration

## Related Documentation

Full technical analysis available at: [octocode_timeout_analysis.md](../octocode_timeout_analysis.md)

## Code Quality Improvements

In addition to the core timeout fixes, this PR includes:

### Enhanced Error Handling
- Added `.context()` error messages to both HTTP client build failures
- GraphRAG client: "Failed to create HTTP client for LLM API calls"
- HuggingFace client: "Failed to create HTTP client for HuggingFace downloads"

### Documentation Comments
```rust
// IMPORTANT: Must use builder pattern with timeout to prevent infinite hangs
// when LLM API calls take too long. Client::new() does not apply timeouts.
```

These comments at both fix locations help prevent future regressions.

### Code Verification
- Zero clippy warnings
- All existing `Client::new()` uses verified (3 instances in commands/ use request-level timeouts correctly)
- No unwrap() issues in production code
- Consistent error handling patterns

## Technical Details

### Why Client::new() Doesn't Apply Timeouts

The `reqwest::Client::new()` convenience method creates a client with default settings that **do not include any timeout**. To apply a timeout, you must use the builder pattern:

```rust
// ❌ WRONG - no timeout applied
let client = Client::new();

// ✅ CORRECT - timeout is applied
let client = Client::builder()
.timeout(Duration::from_secs(120))
.build()?;
```

### Request-Level vs Client-Level Timeouts

This PR uses **client-level timeouts**. An alternative approach is request-level timeouts:

```rust
// Also valid, but less convenient for repeated requests
let client = Client::new();
let response = client.get(url)
.timeout(Duration::from_secs(120))
.send()
.await?;
```

The codebase uses both patterns appropriately:
- **Client-level** (this PR): For GraphRAG batch operations and HuggingFace downloads
- **Request-level**: In `src/commands/` where each request may need different timeouts

## Test Results

### Unit Tests
```bash
cargo test --all-features
```
Results: **93 passed, 3 failed**

The 3 failures are unrelated FastEmbed lock acquisition issues:
- `test_fastembed_provider_creation`
- `test_fastembed_model_validation`
- `test_fastembed_embedding_generation`

These failures are environmental (file lock contention in local cache directory), not caused by the timeout fix changes.

### Integration Testing
Real-world test on rust-daq codebase (113 files):
- File indexing: ✅ 113/113 files processed
- GraphRAG blocks: ✅ 896 blocks created
- Relationship extraction: ✅ Timeout triggered after 120s (expected behavior)
- Before fix: Infinite hang
- After fix: Graceful timeout with error message

### Configuration Tested
```toml
[graphrag.llm]
batch_timeout_seconds = 120
ai_batch_size = 10
description_model = "openai/gpt-4.1-mini"
relationship_model = "google/gemini-2.0-flash-001"
```

## Migration Guide for Developers

If you're creating new HTTP clients in octocode:

### For LLM/API Calls
```rust
use reqwest::Client;
use anyhow::Context;

let client = Client::builder()
.timeout(std::time::Duration::from_secs(
config.graphrag.llm.batch_timeout_seconds,
))
.build()
.context("Failed to create HTTP client")?;
```

### For File Downloads
```rust
let client = Client::builder()
.timeout(std::time::Duration::from_secs(30))
.build()
.context("Failed to create HTTP client")?;
```

### Don't Use Client::new()
Unless you have a specific reason to avoid timeouts (rare), **always use the builder pattern**.

## Related Issues

This fix resolves the core issue documented in:
- octocode_timeout_analysis.md (comprehensive technical analysis)
- User reports of infinite hangs during GraphRAG indexing

## Checklist

- [x] Code changes tested locally
- [x] Both timeout bugs fixed (GraphRAG + HuggingFace)
- [x] No breaking changes to API
- [x] Enhanced error messages added
- [x] Documentation comments added
- [x] Code quality verified (clippy clean)
- [x] Integration tested on real codebase
- [x] Unit tests passing (93/96 passed, 3 unrelated FastEmbed lock failures)
- [x] Ready for upstream PR
Loading