Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
112 changes: 112 additions & 0 deletions skills/openclaw-native/mcp-health-checker/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
---
name: mcp-health-checker
version: "1.0"
category: openclaw-native
description: Monitors MCP server connections for health, latency, and availability — detects stale connections, timeouts, and unreachable servers before they cause silent tool failures.
stateful: true
cron: "0 */6 * * *"
---

# MCP Health Checker

## What it does

MCP (Model Context Protocol) servers are how OpenClaw connects to external tools — but connections go stale silently. A crashed MCP server doesn't throw an error until the agent tries to use it, causing confusing mid-task failures.

MCP Health Checker proactively monitors all configured MCP connections. It pings servers, measures latency, tracks uptime history, and alerts you before a stale connection causes a problem.

Inspired by OpenLobster's MCP connection health monitoring and OAuth 2.1+PKCE token refresh tracking.

## When to invoke

- Automatically every 6 hours (cron) — silent background health check
- Manually before starting a task that depends on MCP tools
- When an MCP tool call fails unexpectedly — diagnose the connection
- After restarting MCP servers — verify all connections restored

## Health checks performed

| Check | What it tests | Severity on failure |
|---|---|---|
| REACHABLE | Server responds to connection probe | CRITICAL |
| LATENCY | Response time under threshold (default: 5s) | HIGH |
| STALE | Connection age exceeds max (default: 24h) | HIGH |
| TOOL_COUNT | Server exposes expected number of tools | MEDIUM |
| CONFIG_VALID | MCP config entry has required fields | MEDIUM |
| AUTH_EXPIRY | OAuth/API token approaching expiration | HIGH |

## How to use

```bash
python3 check.py --ping # Ping all configured MCP servers
python3 check.py --ping --server <name> # Ping a specific server
python3 check.py --ping --timeout 3 # Custom timeout in seconds
python3 check.py --status # Last check summary from state
python3 check.py --history # Show past check results
python3 check.py --config # Validate MCP config entries
python3 check.py --format json # Machine-readable output
```

## Cron wakeup behaviour

Every 6 hours:

1. Read MCP server configuration from `~/.openclaw/config/` (YAML/JSON)
2. For each configured server:
- Attempt connection probe (TCP or HTTP depending on transport)
- Measure response latency
- Check connection age against staleness threshold
- Verify tool listing matches expected count (if tracked)
- Check auth token expiry (if applicable)
3. Update state with per-server health records
4. Print summary: healthy / degraded / unreachable counts
5. Exit 1 if any CRITICAL findings

## Procedure

**Step 1 — Run a health check**

```bash
python3 check.py --ping
```

Review the output. Healthy servers show a green check. Degraded servers show latency warnings. Unreachable servers show a critical alert.

**Step 2 — Diagnose a specific server**

```bash
python3 check.py --ping --server filesystem
```

Detailed output for a single server: latency, last seen, tool count, auth status.

**Step 3 — Validate configuration**

```bash
python3 check.py --config
```

Checks that all MCP config entries have the required fields (`command`, `args` or `url` depending on transport type).

**Step 4 — Review history**

```bash
python3 check.py --history
```

Shows uptime trends over the last 20 checks. Spot servers that are intermittently failing.

## State

Per-server health records and check history stored in `~/.openclaw/skill-state/mcp-health-checker/state.yaml`.

Fields: `last_check_at`, `servers` list, `check_history`.

## Notes

- Does not modify MCP configuration — read-only monitoring
- Connection probes use the same transport as the MCP server (stdio subprocess spawn or HTTP GET)
- For stdio servers: probes verify the process can start and respond to `initialize`
- For HTTP/SSE servers: probes send a health-check HTTP request
- Latency threshold configurable via `--timeout` (default: 5s)
- Staleness threshold configurable via `--max-age` (default: 24h)
31 changes: 31 additions & 0 deletions skills/openclaw-native/mcp-health-checker/STATE_SCHEMA.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
version: "1.0"
description: MCP server health records, per-server status, and check history.
fields:
last_check_at:
type: datetime
servers:
type: list
description: Per-server health status from the most recent check
items:
name: { type: string, description: "Server name from config" }
transport: { type: string, description: "stdio or http" }
status: { type: enum, values: [healthy, degraded, unreachable, unknown] }
latency_ms: { type: integer, description: "Response time in milliseconds" }
last_seen_at: { type: datetime, description: "Last successful probe" }
tool_count: { type: integer, description: "Number of tools exposed" }
findings:
type: list
items:
check: { type: string }
severity: { type: string }
detail: { type: string }
checked_at: { type: datetime }
check_history:
type: list
description: Rolling log of past checks (last 20)
items:
checked_at: { type: datetime }
servers_checked: { type: integer }
healthy: { type: integer }
degraded: { type: integer }
unreachable: { type: integer }
Loading
Loading