feat: add per-MCP server graceful degradation by jpshackelford · Pull Request #2371 · OpenHands/software-agent-sdk

jpshackelford · 2026-03-09T22:51:34Z

Problem

When any MCP server fails during V1 conversation creation (e.g., an authentication failure such as a 401 from Atlassian), the entire conversation startup fails, even if the other MCP servers are working correctly.

This happens because FastMCP 3.x has stricter error handling: if ANY MCP server fails to connect, ALL servers fail and the conversation cannot start.

Solution

This PR implements per-MCP server graceful degradation so that conversations can start even when some MCP servers fail, while only the failing servers are disabled.

Changes

- openhands-sdk/openhands/sdk/mcp/exceptions.py: Add MCPServerError exception for per-server failures, with server_name and cause attributes
- openhands-sdk/openhands/sdk/mcp/utils.py:
  - Add MCPToolsResult dataclass that returns both tools and errors
  - Add create_mcp_tools_graceful() function that initializes each MCP server individually and continues on failure
  - Add _create_single_server_tools() helper function
- openhands-sdk/openhands/sdk/mcp/__init__.py: Export new classes and function
- openhands-sdk/openhands/sdk/agent/base.py: Update AgentBase._initialize() to use graceful degradation and log warnings for failed servers instead of blocking conversation startup
- tests/sdk/mcp/test_create_mcp_tools_graceful.py: Comprehensive tests for graceful degradation scenarios
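
To make the shape of the two new types concrete, here is a minimal sketch of what an exception carrying server_name and cause, and a result object holding both tools and errors, might look like. It is not the code from this PR: only the names MCPServerError, MCPToolsResult, server_name, cause, tools, errors, has_errors and error_summary() come from the description, while the message format, the summary format and the placeholder tool value are assumptions.

```python
from dataclasses import dataclass, field


class MCPServerError(Exception):
    """Failure of one MCP server; keeps the server name and the original cause."""

    def __init__(self, server_name: str, cause: Exception) -> None:
        # Assumed message format, for illustration only.
        super().__init__(f"MCP server '{server_name}' failed: {cause}")
        self.server_name = server_name
        self.cause = cause


@dataclass
class MCPToolsResult:
    """Tools from servers that started, plus one error per server that did not."""

    tools: list = field(default_factory=list)
    errors: list[MCPServerError] = field(default_factory=list)

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)

    def error_summary(self) -> str:
        # Assumed format: one line per failed server.
        return "\n".join(str(err) for err in self.errors)


result = MCPToolsResult(
    tools=["github_search"],  # placeholder standing in for a real tool object
    errors=[MCPServerError("atlassian", PermissionError("401 Unauthorized"))],
)
```

Returning errors as data rather than raising lets the caller decide whether a failed server is fatal, which is what allows AgentBase._initialize() to log a warning and continue.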

Behavior Change

Before:

MCP server A fails → ALL MCP servers fail → Conversation cannot start

After:

MCP server A fails → Warning logged → Tools from MCP servers B, C are available → Conversation starts successfully
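
The before/after difference can be illustrated with a small, self-contained simulation. This is a sketch under stated assumptions, not the SDK implementation: connect(), load_strict() and load_graceful() are hypothetical stand-ins, and server "A" is hard-coded to fail.

```python
def connect(name: str) -> list[str]:
    """Hypothetical stand-in for connecting to one MCP server and listing its tools."""
    if name == "A":
        raise ConnectionError("401 Unauthorized")
    return [f"{name}.tool"]


def load_strict(names: list[str]) -> list[str]:
    # Before: the first failure propagates, so no tools are returned at all.
    tools: list[str] = []
    for name in names:
        tools.extend(connect(name))
    return tools


def load_graceful(names: list[str]) -> tuple[list[str], dict[str, Exception]]:
    # After: each server is tried on its own; failures are recorded, not raised.
    tools: list[str] = []
    errors: dict[str, Exception] = {}
    for name in names:
        try:
            tools.extend(connect(name))
        except Exception as exc:
            errors[name] = exc
    return tools, errors


tools, errors = load_graceful(["A", "B", "C"])
```

Here tools holds the entries from B and C and errors maps "A" to its exception, whereas load_strict(["A", "B", "C"]) raises before B or C is ever contacted.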

Example

from openhands.sdk.mcp import create_mcp_tools_graceful

result = create_mcp_tools_graceful(config)
if result.has_errors:
    logger.warning(result.error_summary())
for tool in result.tools:
    ...  # use tool; only tools from successful servers are present

Context

Related to OpenHands/OpenHands#13321, which addresses surfacing detailed MCP errors. This PR implements the SDK-side changes needed for graceful degradation, as described in that PR's implementation guide.

Testing

Added 9 new tests for graceful degradation scenarios
All 71 existing MCP tests pass
All 201 agent tests pass

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | `eclipse-temurin:17-jdk` | Link |
| python | amd64, arm64 | `nikolaik/python-nodejs:python3.13-nodejs22` | Link |
| golang | amd64, arm64 | `golang:1.21-bookworm` | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:b750bd7-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-b750bd7-python \
  ghcr.io/openhands/agent-server:b750bd7-python

All tags pushed for this build

ghcr.io/openhands/agent-server:b750bd7-golang-amd64
ghcr.io/openhands/agent-server:b750bd7-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:b750bd7-golang-arm64
ghcr.io/openhands/agent-server:b750bd7-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:b750bd7-java-amd64
ghcr.io/openhands/agent-server:b750bd7-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:b750bd7-java-arm64
ghcr.io/openhands/agent-server:b750bd7-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:b750bd7-python-amd64
ghcr.io/openhands/agent-server:b750bd7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:b750bd7-python-arm64
ghcr.io/openhands/agent-server:b750bd7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:b750bd7-golang
ghcr.io/openhands/agent-server:b750bd7-java
ghcr.io/openhands/agent-server:b750bd7-python

About Multi-Architecture Support

Each variant tag (e.g., b750bd7-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., b750bd7-python-amd64) are also available if needed

Implement graceful degradation for MCP server initialization so that conversations can start even when some MCP servers fail, while only the failing servers are disabled.

Changes:
- Add MCPServerError exception for per-server failures
- Add MCPToolsResult dataclass to return both tools and errors
- Add create_mcp_tools_graceful() function that initializes each MCP server individually and continues on failure
- Update AgentBase._initialize() to use graceful degradation and log warnings for failed servers instead of blocking conversation startup

This addresses the issue where FastMCP 3.x's stricter error handling blocks conversation startup when any MCP server fails. Now only the failing servers are unavailable while others work normally.

Related: OpenHands/OpenHands#13321

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-03-09T22:51:59Z

API breakage checks (Griffe)

Result: Passed

Action log

github-actions · 2026-03-09T22:52:17Z

Agent server REST API breakage checks (OpenAPI)

Result: Failed

Log excerpt (first 1000 characters)

{"asctime": "2026-03-10 02:10:47,032", "levelname": "WARNING", "name": "openhands.agent_server.config", "filename": "config.py", "lineno": 173, "message": "\u26a0\ufe0f OH_SECRET_KEY was not defined. Secrets will not be persisted between restarts."}
::error title=openhands-agent-server REST API::Breaking REST API change detected without MINOR version bump (1.12.0 -> 1.12.0).

Breaking REST API changes detected compared to baseline release:
- the 'file' request property type/format changed from 'string'/'' to 'string'/'binary'
/home/runner/work/software-agent-sdk/software-agent-sdk/.venv/lib/python3.13/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:66: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()

Action log

all-hands-bot

🟢 Good Taste - Clean Solution to Real Problem

Linus-Style Assessment: This is pragmatic engineering. You identified a real production problem (FastMCP 3.x strict error handling breaking conversations) and solved it with simple, clean data structures. No unnecessary abstraction, no special cases in the core logic.

What I Like:

Data Structure: MCPToolsResult is exactly right—tools + errors, nothing fancy
Solves Real Problem: Production systems fail when ANY MCP server is misconfigured; this fixes that
Backward Compatible: Old create_mcp_tools() still exists, no breaking changes
Comprehensive Tests: 9 new tests covering success, failure, and mixed scenarios

Minor Notes:

The separate mcp_future handling in agent/base.py (lines 332-349) looks like special-case complexity at first glance, but it's justified—MCP returns error information we need to log. 🟢 Acceptable given the constraint.
Testing Gap (not blocking): Tests cover create_mcp_tools_graceful() thoroughly, but there's no integration test verifying agent initialization succeeds with partial MCP failures. Given the simplicity of the integration code, this is acceptable risk.

⚠️ Review Policy - Eval Risk Category:

This PR modifies agent initialization behavior and tool loading, which falls under the repo's "eval risk" category per the review guidelines:

"Do NOT submit an APPROVE review when the PR changes agent behavior or anything that could plausibly affect benchmark/evaluation performance."

Recommendation: Run lightweight MCP integration tests to confirm agents function correctly with:

All MCP servers working
Some MCP servers failing
All MCP servers failing

Once confirmed by a human maintainer (with or without eval runs), this is ready to merge.

Verdict: 🟡 Clean implementation, needs eval confirmation before merge
Key Insight: Solving server reliability at the data structure level (per-server error tracking) is better than trying to handle it with complex control flow.

jpshackelford · 2026-03-10T02:01:00Z

@OpenHands address failing ci checks

openhands-ai · 2026-03-10T02:01:13Z

I'm on it! jpshackelford can track my progress at all-hands.dev

The test was still referencing the old create_mcp_tools function, but the code was changed to use create_mcp_tools_graceful with MCPToolsResult.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-03-10T02:12:33Z

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-sdk/openhands/sdk/agent/base.py | 200 | 25 | 87% | 200, 257–259, 289, 293–297, 349–350, 354, 363–365, 375, 385, 393–394, 504, 541–542, 552–553 |
| openhands-sdk/openhands/sdk/mcp/utils.py | 89 | 10 | 88% | 53–54, 57, 60, 113–114, 141–142, 150–151 |
| TOTAL | 19866 | 5762 | 71% | |

openhands-ai · 2026-03-10T02:16:21Z

Summary

I've fixed the failing CI check for PR #2371.

Root Cause

The test test_plugin_mcp_config_is_initialized in tests/sdk/conversation/test_local_conversation_plugins.py was trying to monkeypatch create_mcp_tools, but the PR changed the code to use the new create_mcp_tools_graceful function, which returns an MCPToolsResult dataclass instead of a plain list.

Fix Applied

Updated the test to use the new function signature:

Imported MCPToolsResult from openhands.sdk.mcp
Renamed the mock from mock_create_mcp_tools to mock_create_mcp_tools_graceful
Changed the return value from [] to MCPToolsResult(tools=[], errors=[])
Updated the monkeypatch target from create_mcp_tools to create_mcp_tools_graceful
Updated the related comment
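
The shape of this test change can be sketched as follows. This is an illustration only, not the actual test: agent_module and initialize() are hypothetical stand-ins for the code under test, MCPToolsResult is redefined locally so the snippet runs without the SDK installed, and unittest.mock is used in place of pytest's monkeypatch fixture to keep the example within the standard library.

```python
import types
from dataclasses import dataclass, field
from unittest import mock


@dataclass
class MCPToolsResult:  # local stand-in for the class exported by openhands.sdk.mcp
    tools: list = field(default_factory=list)
    errors: list = field(default_factory=list)


# Hypothetical module under test that calls the graceful loader at start-up.
agent_module = types.SimpleNamespace(
    create_mcp_tools_graceful=lambda config: MCPToolsResult(tools=["real"]),
)


def initialize(config: dict) -> list:
    # Looks the function up on the module at call time, so patching takes effect.
    return agent_module.create_mcp_tools_graceful(config).tools


def mock_create_mcp_tools_graceful(config: dict) -> MCPToolsResult:
    # The old mock returned a plain []; the new API expects a result object.
    return MCPToolsResult(tools=[], errors=[])


with mock.patch.object(
    agent_module, "create_mcp_tools_graceful", mock_create_mcp_tools_graceful
):
    patched_tools = initialize({})

restored_tools = initialize({})
```

A mock that still returned a bare list would fail as soon as the caller accessed .tools or .errors, which is why the return value had to change along with the patch target.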

Checklist

✅ Identified the failing test (AttributeError: module has no attribute 'create_mcp_tools')
✅ Fixed the test to use the new create_mcp_tools_graceful API
✅ Ran pre-commit hooks successfully
✅ Verified all 2579 SDK tests pass locally
✅ Committed and pushed the fix to feat/mcp-graceful-degradation branch
✅ CI sdk-tests check is now passing

Commit

fix: update test to use create_mcp_tools_graceful

The test was still referencing the old create_mcp_tools function, but
the code was changed to use create_mcp_tools_graceful with MCPToolsResult.

Co-authored-by: openhands <openhands@all-hands.dev>

The changes are minimal and focused - only 1 file modified with 10 insertions and 5 deletions, directly addressing the test incompatibility.

all-hands-bot reviewed Mar 9, 2026

jpshackelford mentioned this pull request Mar 10, 2026

fix: surface detailed MCP errors in V1 conversation start failures OpenHands/OpenHands#13321

Draft

fix: update test to use create_mcp_tools_graceful

138af9c

cbagwell mentioned this pull request Mar 10, 2026

[Feature]: STDIO MCP Usage is not currently scalable. OpenHands/OpenHands#13271

Open

1 task

jpshackelford wants to merge 2 commits into main from feat/mcp-graceful-degradation

3 participants
