feat: add per-MCP server graceful degradation#2371
feat: add per-MCP server graceful degradation#2371jpshackelford wants to merge 2 commits intomainfrom
Conversation
Implement graceful degradation for MCP server initialization so that conversations can start even when some MCP servers fail, while only the failing servers are disabled. Changes: - Add MCPServerError exception for per-server failures - Add MCPToolsResult dataclass to return both tools and errors - Add create_mcp_tools_graceful() function that initializes each MCP server individually and continues on failure - Update AgentBase._initialize() to use graceful degradation and log warnings for failed servers instead of blocking conversation startup This addresses the issue where FastMCP 3.x's stricter error handling blocks conversation startup when any MCP server fails. Now only the failing servers are unavailable while others work normally. Related: OpenHands/OpenHands#13321 Co-authored-by: openhands <openhands@all-hands.dev>
API breakage checks (Griffe)Result: Passed |
Agent server REST API breakage checks (OpenAPI)Result: Failed Log excerpt (first 1000 characters) |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟢 Good Taste - Clean Solution to Real Problem
Linus-Style Assessment: This is pragmatic engineering. You identified a real production problem (FastMCP 3.x strict error handling breaking conversations) and solved it with simple, clean data structures. No unnecessary abstraction, no special cases in the core logic.
What I Like:
- Data Structure:
MCPToolsResultis exactly right—tools + errors, nothing fancy - Solves Real Problem: Production systems fail when ANY MCP server is misconfigured; this fixes that
- Backward Compatible: Old
create_mcp_tools()still exists, no breaking changes - Comprehensive Tests: 9 new tests covering success, failure, and mixed scenarios
Minor Notes:
-
The separate
mcp_futurehandling inagent/base.py(lines 332-349) looks like special-case complexity at first glance, but it's justified—MCP returns error information we need to log. 🟢 Acceptable given the constraint. -
Testing Gap (not blocking): Tests cover
create_mcp_tools_graceful()thoroughly, but there's no integration test verifying agent initialization succeeds with partial MCP failures. Given the simplicity of the integration code, this is acceptable risk.
This PR modifies agent initialization behavior and tool loading, which falls under the repo's "eval risk" category per the review guidelines:
"Do NOT submit an APPROVE review when the PR changes agent behavior or anything that could plausibly affect benchmark/evaluation performance."
Recommendation: Run lightweight MCP integration tests to confirm agents function correctly with:
- All MCP servers working
- Some MCP servers failing
- All MCP servers failing
Once confirmed by a human maintainer (with or without eval runs), this is ready to merge.
Verdict: 🟡 Clean implementation, needs eval confirmation before merge
Key Insight: Solving server reliability at the data structure level (per-server error tracking) is better than trying to handle it with complex control flow.
|
@OpenHands address failing ci checks |
|
I'm on it! jpshackelford can track my progress at all-hands.dev |
The test was still referencing the old create_mcp_tools function, but the code was changed to use create_mcp_tools_graceful with MCPToolsResult. Co-authored-by: openhands <openhands@all-hands.dev>
SummaryI've fixed the failing CI check for PR #2371. Root CauseThe test Fix AppliedUpdated the test to use the new function signature:
Checklist
CommitThe changes are minimal and focused - only 1 file modified with 10 insertions and 5 deletions, directly addressing the test incompatibility. |
Problem
When V1 conversation creation fails due to MCP server errors (e.g., MCP authentication failures like a 401 from Atlassian), the entire conversation startup fails even if other MCP servers are working correctly.
This is because FastMCP 3.x has stricter error handling - if ANY MCP server fails to connect, ALL servers fail and the conversation cannot start.
Solution
This PR implements per-MCP server graceful degradation so that conversations can start even when some MCP servers fail, while only the failing servers are disabled.
Changes
openhands-sdk/openhands/sdk/mcp/exceptions.py: AddMCPServerErrorexception for per-server failures withserver_nameandcauseattributesopenhands-sdk/openhands/sdk/mcp/utils.py:MCPToolsResultdataclass that returns bothtoolsanderrorscreate_mcp_tools_graceful()function that initializes each MCP server individually and continues on failure_create_single_server_tools()helper functionopenhands-sdk/openhands/sdk/mcp/__init__.py: Export new classes and functionopenhands-sdk/openhands/sdk/agent/base.py: UpdateAgentBase._initialize()to use graceful degradation and log warnings for failed servers instead of blocking conversation startuptests/sdk/mcp/test_create_mcp_tools_graceful.py: Comprehensive tests for graceful degradation scenariosBehavior Change
Before:
After:
Example
Context
Related to OpenHands/OpenHands#13321 which addresses surfacing detailed MCP errors. This PR implements the server-side SDK changes needed for graceful degradation as described in that PR's implementation guide.
Testing
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:b750bd7-pythonRun
All tags pushed for this build
About Multi-Architecture Support
b750bd7-python) is a multi-arch manifest supporting both amd64 and arm64b750bd7-python-amd64) are also available if needed