feat: add per-MCP server graceful degradation #2371

Open
jpshackelford wants to merge 2 commits into main from feat/mcp-graceful-degradation

Conversation

@jpshackelford (Contributor) commented Mar 9, 2026

Problem

When a single MCP server errors during V1 conversation creation (e.g., an MCP authentication failure such as a 401 from Atlassian), the entire conversation startup fails, even if the other MCP servers are working correctly.

This is because FastMCP 3.x has stricter error handling: if ANY MCP server fails to connect, ALL servers fail and the conversation cannot start.

Solution

This PR implements per-MCP server graceful degradation: a conversation can start even when some MCP servers fail, and only the failing servers are disabled.

Changes

  • openhands-sdk/openhands/sdk/mcp/exceptions.py: Add MCPServerError exception for per-server failures with server_name and cause attributes

  • openhands-sdk/openhands/sdk/mcp/utils.py:

    • Add MCPToolsResult dataclass that returns both tools and errors
    • Add create_mcp_tools_graceful() function that initializes each MCP server individually and continues on failure
    • Add _create_single_server_tools() helper function
  • openhands-sdk/openhands/sdk/mcp/__init__.py: Export new classes and function

  • openhands-sdk/openhands/sdk/agent/base.py: Update AgentBase._initialize() to use graceful degradation and log warnings for failed servers instead of blocking conversation startup

  • tests/sdk/mcp/test_create_mcp_tools_graceful.py: Comprehensive tests for graceful degradation scenarios
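
The pieces listed above can be pictured with a minimal sketch. This is not the code in this PR: the `load_server` parameter is a hypothetical stand-in for the real per-server connection step, and the `mcpServers` config key is an assumption about the config shape. Only the names `MCPServerError`, `MCPToolsResult`, `create_mcp_tools_graceful`, `server_name`, `cause`, `tools`, `errors`, `has_errors`, and `error_summary()` come from the PR description.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class MCPServerError(Exception):
    """Failure of one MCP server; keeps the server name and underlying cause."""

    def __init__(self, server_name: str, cause: Exception) -> None:
        super().__init__(f"MCP server '{server_name}' failed: {cause}")
        self.server_name = server_name
        self.cause = cause


@dataclass
class MCPToolsResult:
    """Tools from servers that started, plus one error per server that did not."""

    tools: list[Any] = field(default_factory=list)
    errors: list[MCPServerError] = field(default_factory=list)

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)

    def error_summary(self) -> str:
        return "; ".join(str(error) for error in self.errors)


def create_mcp_tools_graceful(
    config: dict[str, Any],
    load_server: Callable[[str, Any], list[Any]],  # hypothetical injection point
) -> MCPToolsResult:
    """Initialize each server on its own and record failures instead of raising."""
    result = MCPToolsResult()
    # Assumption: servers are listed under an "mcpServers" mapping of name -> config.
    for name, server_config in config.get("mcpServers", {}).items():
        try:
            result.tools.extend(load_server(name, server_config))
        except Exception as exc:  # isolate the failure to this one server
            result.errors.append(MCPServerError(name, exc))
    return result
```

Injecting the loader keeps the sketch runnable without a network; the function in the PR takes only `config`.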

Behavior Change

Before:

  • MCP server A fails → ALL MCP servers fail → Conversation cannot start

After:

  • MCP server A fails → Warning logged → Tools from MCP servers B, C are available → Conversation starts successfully

Example

result = create_mcp_tools_graceful(config)
if result.has_errors:
    logger.warning(result.error_summary())
for tool in result.tools:
    ...  # use tool; only tools from successful servers are listed

Context

Related to OpenHands/OpenHands#13321, which addresses surfacing detailed MCP errors. This PR implements the server-side SDK changes needed for graceful degradation, as described in that PR's implementation guide.

Testing

  • Added 9 new tests for graceful degradation scenarios
  • All 71 existing MCP tests pass
  • All 201 agent tests pass

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:b750bd7-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-b750bd7-python \
  ghcr.io/openhands/agent-server:b750bd7-python

All tags pushed for this build

ghcr.io/openhands/agent-server:b750bd7-golang-amd64
ghcr.io/openhands/agent-server:b750bd7-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:b750bd7-golang-arm64
ghcr.io/openhands/agent-server:b750bd7-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:b750bd7-java-amd64
ghcr.io/openhands/agent-server:b750bd7-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:b750bd7-java-arm64
ghcr.io/openhands/agent-server:b750bd7-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:b750bd7-python-amd64
ghcr.io/openhands/agent-server:b750bd7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:b750bd7-python-arm64
ghcr.io/openhands/agent-server:b750bd7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:b750bd7-golang
ghcr.io/openhands/agent-server:b750bd7-java
ghcr.io/openhands/agent-server:b750bd7-python

About Multi-Architecture Support

  • Each variant tag (e.g., b750bd7-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., b750bd7-python-amd64) are also available if needed

Implement graceful degradation for MCP server initialization so that
conversations can start even when some MCP servers fail, while only
the failing servers are disabled.

Changes:
- Add MCPServerError exception for per-server failures
- Add MCPToolsResult dataclass to return both tools and errors
- Add create_mcp_tools_graceful() function that initializes each MCP
  server individually and continues on failure
- Update AgentBase._initialize() to use graceful degradation and log
  warnings for failed servers instead of blocking conversation startup

This addresses the issue where FastMCP 3.x's stricter error handling
blocks conversation startup when any MCP server fails. Now only the
failing servers are unavailable while others work normally.

Related: OpenHands/OpenHands#13321

Co-authored-by: openhands <openhands@all-hands.dev>

@github-actions bot (Contributor) commented Mar 9, 2026

API breakage checks (Griffe)

Result: Passed

Action log

@github-actions bot (Contributor) commented Mar 9, 2026

Agent server REST API breakage checks (OpenAPI)

Result: Failed

Log excerpt (first 1000 characters)
{"asctime": "2026-03-10 02:10:47,032", "levelname": "WARNING", "name": "openhands.agent_server.config", "filename": "config.py", "lineno": 173, "message": "\u26a0\ufe0f OH_SECRET_KEY was not defined. Secrets will not be persisted between restarts."}
::error title=openhands-agent-server REST API::Breaking REST API change detected without MINOR version bump (1.12.0 -> 1.12.0).

Breaking REST API changes detected compared to baseline release:
- the 'file' request property type/format changed from 'string'/'' to 'string'/'binary'
/home/runner/work/software-agent-sdk/software-agent-sdk/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:66: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()

Action log

@all-hands-bot (Collaborator) left a comment

🟢 Good Taste - Clean Solution to Real Problem

Linus-Style Assessment: This is pragmatic engineering. You identified a real production problem (FastMCP 3.x strict error handling breaking conversations) and solved it with simple, clean data structures. No unnecessary abstraction, no special cases in the core logic.

What I Like:

  • Data Structure: MCPToolsResult is exactly right—tools + errors, nothing fancy
  • Solves Real Problem: Production systems fail when ANY MCP server is misconfigured; this fixes that
  • Backward Compatible: Old create_mcp_tools() still exists, no breaking changes
  • Comprehensive Tests: 9 new tests covering success, failure, and mixed scenarios

Minor Notes:

  1. The separate mcp_future handling in agent/base.py (lines 332-349) looks like special-case complexity at first glance, but it's justified—MCP returns error information we need to log. 🟢 Acceptable given the constraint.

  2. Testing Gap (not blocking): Tests cover create_mcp_tools_graceful() thoroughly, but there's no integration test verifying agent initialization succeeds with partial MCP failures. Given the simplicity of the integration code, this is acceptable risk.

⚠️ Review Policy - Eval Risk Category:

This PR modifies agent initialization behavior and tool loading, which falls under the repo's "eval risk" category per the review guidelines:

"Do NOT submit an APPROVE review when the PR changes agent behavior or anything that could plausibly affect benchmark/evaluation performance."

Recommendation: Run lightweight MCP integration tests to confirm agents function correctly with:

  • All MCP servers working
  • Some MCP servers failing
  • All MCP servers failing
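
One way to picture those three scenarios is the sketch below. It is an illustration only: `FakeResult` and `initialize_tools` are hypothetical stand-ins for `MCPToolsResult` and the agent's initialization path, and the tool names are made up.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("mcp_sketch")


@dataclass
class FakeResult:
    """Hypothetical stand-in with the same tools/errors shape as MCPToolsResult."""

    tools: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def initialize_tools(builtin_tools: list, mcp_result: FakeResult) -> list:
    """Merge built-in and MCP tools; warn about failed servers and never raise."""
    for error in mcp_result.errors:
        logger.warning("MCP server unavailable: %s", error)
    return builtin_tools + mcp_result.tools


# The three cases named in the recommendation above.
SCENARIOS = {
    "all working": FakeResult(tools=["a.search", "b.fetch"]),
    "some failing": FakeResult(tools=["b.fetch"], errors=["a: 401 Unauthorized"]),
    "all failing": FakeResult(errors=["a: 401 Unauthorized", "b: timeout"]),
}


def run_scenarios() -> dict:
    """Return the tool list the agent would start with in each scenario."""
    return {
        name: initialize_tools(["bash"], result)
        for name, result in SCENARIOS.items()
    }
```

In every scenario initialization completes; the scenarios differ only in which MCP tools end up available next to the built-in ones.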

Once confirmed by a human maintainer (with or without eval runs), this is ready to merge.


Verdict: 🟡 Clean implementation, needs eval confirmation before merge
Key Insight: Solving server reliability at the data structure level (per-server error tracking) is better than trying to handle it with complex control flow.

@jpshackelford (Contributor, Author)

@OpenHands address failing ci checks

@openhands-ai bot commented Mar 10, 2026

I'm on it! jpshackelford can track my progress at all-hands.dev

The test was still referencing the old create_mcp_tools function, but
the code was changed to use create_mcp_tools_graceful with MCPToolsResult.

Co-authored-by: openhands <openhands@all-hands.dev>

@github-actions (Contributor)

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-sdk/openhands/sdk/agent/base.py | 200 | 25 | 87% | 200, 257–259, 289, 293–297, 349–350, 354, 363–365, 375, 385, 393–394, 504, 541–542, 552–553 |
| openhands-sdk/openhands/sdk/mcp/utils.py | 89 | 10 | 88% | 53–54, 57, 60, 113–114, 141–142, 150–151 |
| TOTAL | 19866 | 5762 | 71% | |

@openhands-ai bot commented Mar 10, 2026

Summary

I've fixed the failing CI check for PR #2371.

Root Cause

The test test_plugin_mcp_config_is_initialized in tests/sdk/conversation/test_local_conversation_plugins.py was trying to monkeypatch create_mcp_tools, but the PR changed the code to use the new create_mcp_tools_graceful function, which returns an MCPToolsResult dataclass instead of a plain list.

Fix Applied

Updated the test to use the new function signature:

  • Imported MCPToolsResult from openhands.sdk.mcp
  • Renamed the mock from mock_create_mcp_tools to mock_create_mcp_tools_graceful
  • Changed the return value from [] to MCPToolsResult(tools=[], errors=[])
  • Updated the monkeypatch target from create_mcp_tools to create_mcp_tools_graceful
  • Updated the related comment

Checklist

  • ✅ Identified the failing test (AttributeError: module has no attribute 'create_mcp_tools')
  • ✅ Fixed the test to use the new create_mcp_tools_graceful API
  • ✅ Ran pre-commit hooks successfully
  • ✅ Verified all 2579 SDK tests pass locally
  • ✅ Committed and pushed the fix to feat/mcp-graceful-degradation branch
  • ✅ CI sdk-tests check is now passing

Commit

fix: update test to use create_mcp_tools_graceful

The test was still referencing the old create_mcp_tools function, but
the code was changed to use create_mcp_tools_graceful with MCPToolsResult.

Co-authored-by: openhands <openhands@all-hands.dev>

The changes are minimal and focused: only 1 file was modified, with 10 insertions and 5 deletions, directly addressing the test incompatibility.
