From c249f0d3bebb26f0457ee4c2349354b419990cad Mon Sep 17 00:00:00 2001
From: ArchieIndian <mitra.arkid@gmail.com>
Date: Mon, 16 Mar 2026 00:56:50 +0530
Subject: [PATCH] Add mcp-health-checker skill

Monitors MCP server connections for health, latency, and availability.
Probes stdio servers via JSON-RPC initialize and HTTP servers via GET.
Detects stale connections, timeouts, and unreachable servers. Cron runs
every 6 hours. Companion script: check.py with --ping, --config,
--status, --history commands.

Inspired by OpenLobster's MCP connection health monitoring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
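Reviewer note: the stdio probe summarised in the commit message can be
sketched roughly as below. This is a minimal illustration modelled on
`probe_stdio_server` in this patch, not a replacement for it. The helper
names `build_initialize_request` and `advertises_tools` are hypothetical
and do not exist in check.py.

```python
import json


def build_initialize_request(request_id: int = 1) -> bytes:
    """Encode a JSON-RPC 2.0 `initialize` request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "mcp-health-checker", "version": "1.0"},
        },
    }
    return (json.dumps(message) + "\n").encode()


def advertises_tools(raw_stdout: bytes) -> bool:
    """Return True if the first stdout line is a JSON-RPC result with a `tools` capability."""
    lines = raw_stdout.decode(errors="replace").strip().splitlines()
    if not lines:
        return False
    try:
        response = json.loads(lines[0])
    except json.JSONDecodeError:
        return False
    if not isinstance(response, dict):
        return False
    result = response.get("result")
    if not isinstance(result, dict):
        return False
    caps = result.get("capabilities") or {}
    return isinstance(caps, dict) and "tools" in caps
```

The request bytes would be written to the server's stdin and the first
line of its stdout passed to `advertises_tools`. Anything that is not a
JSON object with a `result` is treated as "no tools advertised".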
 .../mcp-health-checker/SKILL.md               | 112 ++++
 .../mcp-health-checker/STATE_SCHEMA.yaml      |  31 ++
 .../mcp-health-checker/check.py               | 514 ++++++++++++++++++
 .../mcp-health-checker/example-state.yaml     |  86 +++
 4 files changed, 743 insertions(+)
 create mode 100644 skills/openclaw-native/mcp-health-checker/SKILL.md
 create mode 100644 skills/openclaw-native/mcp-health-checker/STATE_SCHEMA.yaml
 create mode 100755 skills/openclaw-native/mcp-health-checker/check.py
 create mode 100644 skills/openclaw-native/mcp-health-checker/example-state.yaml
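
Reviewer note: the staleness rule (a server is stale when its last
successful probe is older than `--max-age` hours) can be sketched as
below. This is an illustrative reduction of `check_staleness` in this
patch. The function name `is_stale` and its signature are hypothetical.

```python
from datetime import datetime, timedelta


def is_stale(last_seen_at, now, max_age_hours=24):
    """Return True when the ISO-8601 `last_seen_at` is more than `max_age_hours` before `now`."""
    if not last_seen_at:
        # A server that has never been seen has no age to compare.
        return False
    try:
        last_seen = datetime.fromisoformat(last_seen_at)
    except (ValueError, TypeError):
        # Unparseable timestamps are ignored rather than reported as stale.
        return False
    return now - last_seen > timedelta(hours=max_age_hours)


# Values taken from the `database` entry in example-state.yaml.
check_time = datetime(2026, 3, 16, 12, 0, 0)
print(is_stale("2026-03-15T06:00:00.000000", check_time))  # 30h old, threshold 24h
```

Passing `now` explicitly keeps the rule easy to test. check.py itself
calls `datetime.now()` inline.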
diff --git a/skills/openclaw-native/mcp-health-checker/SKILL.md b/skills/openclaw-native/mcp-health-checker/SKILL.md
new file mode 100644
index 0000000..c66abfe
--- /dev/null
+++ b/skills/openclaw-native/mcp-health-checker/SKILL.md
@@ -0,0 +1,112 @@
+---
+name: mcp-health-checker
+version: "1.0"
+category: openclaw-native
+description: Monitors MCP server connections for health, latency, and availability — detects stale connections, timeouts, and unreachable servers before they cause silent tool failures.
+stateful: true
+cron: "0 */6 * * *"
+---
+
+# MCP Health Checker
+
+## What it does
+
+MCP (Model Context Protocol) servers are how OpenClaw connects to external tools — but connections go stale silently. A crashed MCP server doesn't throw an error until the agent tries to use it, causing confusing mid-task failures.
+
+MCP Health Checker proactively monitors all configured MCP connections. It pings servers, measures latency, tracks uptime history, and alerts you before a stale connection causes a problem.
+
+Inspired by OpenLobster's MCP connection health monitoring and OAuth 2.1+PKCE token refresh tracking.
+
+## When to invoke
+
+- Automatically every 6 hours (cron) — silent background health check
+- Manually before starting a task that depends on MCP tools
+- When an MCP tool call fails unexpectedly — diagnose the connection
+- After restarting MCP servers — verify all connections restored
+
+## Health checks performed
+
+| Check | What it tests | Severity on failure |
+|---|---|---|
+| REACHABLE | Server responds to connection probe | CRITICAL |
+| LATENCY | Response time under threshold (default: 5s) | HIGH |
+| STALE | Connection age exceeds max (default: 24h) | HIGH |
+| TOOL_COUNT | Server exposes expected number of tools (`check.py` 1.0 only records whether the `tools` capability is advertised) | MEDIUM |
+| CONFIG_VALID | MCP config entry has required fields | MEDIUM |
+| AUTH_EXPIRY | OAuth/API token approaching expiration (not probed by `check.py` 1.0) | HIGH |
+
+## How to use
+
+```bash
+python3 check.py --ping                     # Ping all configured MCP servers
+python3 check.py --ping --server <name>     # Ping a specific server
+python3 check.py --ping --timeout 3         # Custom timeout in seconds
+python3 check.py --status                   # Last check summary from state
+python3 check.py --history                  # Show past check results
+python3 check.py --config                   # Validate MCP config entries
+python3 check.py --ping --format json       # Machine-readable output (needs a command flag)
+```
+
+## Cron wakeup behaviour
+
+Every 6 hours:
+
+1. Read the first MCP config found (`mcp.yaml` or `mcp.json`) in `~/.openclaw/config/`, `~/.openclaw/`, or `~/.config/openclaw/`
+2. For each configured server:
+   - Attempt connection probe (stdio subprocess or HTTP GET, depending on transport)
+   - Measure response latency
+   - Check connection age against staleness threshold
+   - Verify tool listing matches expected count (if tracked)
+   - Check auth token expiry (if applicable)
+3. Update state with per-server health records
+4. Print summary: healthy / degraded / unreachable counts
+5. Exit 1 if any CRITICAL findings
+
+## Procedure
+
+**Step 1 — Run a health check**
+
+```bash
+python3 check.py --ping
+```
+
+Review the output. Healthy servers are marked `+`. Degraded servers are marked `!` and list their findings, such as latency warnings. Unreachable servers are marked `x` with a critical finding.
+
+**Step 2 — Diagnose a specific server**
+
+```bash
+python3 check.py --ping --server filesystem
+```
+
+Output for a single server: status, transport, latency, and any findings. Add `--format json` to also see `last_seen_at` and `tool_count`.
+
+**Step 3 — Validate configuration**
+
+```bash
+python3 check.py --config
+```
+
+Checks that all MCP config entries have the required field for their transport type: `command` for stdio servers, `url` for HTTP servers.
+
+**Step 4 — Review history**
+
+```bash
+python3 check.py --history
+```
+
+Shows the share of healthy servers for the 10 most recent checks (state retains the last 20; `--format json` prints all of them). Use it to spot intermittent failures.
+
+## State
+
+Per-server health records and check history stored in `~/.openclaw/skill-state/mcp-health-checker/state.yaml`.
+
+Fields: `last_check_at`, `servers` list, `check_history`.
+
+## Notes
+
+- Does not modify MCP configuration — read-only monitoring
+- Connection probes use the same transport as the MCP server (stdio subprocess spawn or HTTP GET)
+- For stdio servers: probes verify the process can start and respond to `initialize`
+- For HTTP/SSE servers: probes send a health-check HTTP request
+- Latency threshold configurable via `--timeout` (default: 5s)
+- Staleness threshold configurable via `--max-age` (default: 24h)
diff --git a/skills/openclaw-native/mcp-health-checker/STATE_SCHEMA.yaml b/skills/openclaw-native/mcp-health-checker/STATE_SCHEMA.yaml
new file mode 100644
index 0000000..a756c7e
--- /dev/null
+++ b/skills/openclaw-native/mcp-health-checker/STATE_SCHEMA.yaml
@@ -0,0 +1,31 @@
+version: "1.0"
+description: MCP server health records, per-server status, and check history.
+fields:
+  last_check_at:
+    type: datetime
+  servers:
+    type: list
+    description: Per-server health status from the most recent check
+    items:
+      name:         { type: string, description: "Server name from config" }
+      transport:    { type: string, description: "stdio or http" }
+      status:       { type: enum, values: [healthy, degraded, unreachable, unknown] }
+      latency_ms:   { type: integer, description: "Response time in milliseconds" }
+      last_seen_at: { type: datetime, description: "Last successful probe" }
+      tool_count:   { type: integer, description: "Number of tools exposed" }
+      findings:
+        type: list
+        items:
+          check:    { type: string }
+          severity: { type: string }
+          detail:   { type: string }
+      checked_at:   { type: datetime }
+  check_history:
+    type: list
+    description: Rolling log of past checks (last 20)
+    items:
+      checked_at:       { type: datetime }
+      servers_checked:  { type: integer }
+      healthy:          { type: integer }
+      degraded:         { type: integer }
+      unreachable:      { type: integer }
diff --git a/skills/openclaw-native/mcp-health-checker/check.py b/skills/openclaw-native/mcp-health-checker/check.py
new file mode 100755
index 0000000..57757be
--- /dev/null
+++ b/skills/openclaw-native/mcp-health-checker/check.py
@@ -0,0 +1,514 @@
+#!/usr/bin/env python3
+"""
+MCP Health Checker for openclaw-superpowers.
+
+Monitors MCP server connections for health, latency, and availability.
+
+Usage:
+    python3 check.py --ping
+    python3 check.py --ping --server <name>
+    python3 check.py --ping --timeout 3
+    python3 check.py --config
+    python3 check.py --status
+    python3 check.py --history
+    python3 check.py --ping --format json
+"""
+
+import argparse
+import json
+import os
+import subprocess
+import sys
+import time
+from datetime import datetime, timedelta
+from pathlib import Path
+
+try:
+    import yaml
+    HAS_YAML = True
+except ImportError:
+    HAS_YAML = False
+
+OPENCLAW_DIR = Path(os.environ.get("OPENCLAW_HOME", Path.home() / ".openclaw"))
+STATE_FILE = OPENCLAW_DIR / "skill-state" / "mcp-health-checker" / "state.yaml"
+MAX_HISTORY = 20
+
+# MCP config locations to search
+MCP_CONFIG_PATHS = [
+    OPENCLAW_DIR / "config" / "mcp.yaml",
+    OPENCLAW_DIR / "config" / "mcp.json",
+    OPENCLAW_DIR / "mcp.yaml",
+    OPENCLAW_DIR / "mcp.json",
+    Path.home() / ".config" / "openclaw" / "mcp.yaml",
+    Path.home() / ".config" / "openclaw" / "mcp.json",
+]
+
+DEFAULT_TIMEOUT = 5  # seconds
+DEFAULT_MAX_AGE = 24  # hours
+
+
+# ── State helpers ────────────────────────────────────────────────────────────
+
+def load_state() -> dict:
+    if not STATE_FILE.exists():
+        return {"servers": [], "check_history": []}
+    try:
+        text = STATE_FILE.read_text()
+        return (yaml.safe_load(text) or {}) if HAS_YAML else {}
+    except Exception:
+        return {}
+
+
+def save_state(state: dict) -> None:
+    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
+    if HAS_YAML:
+        with open(STATE_FILE, "w") as f:
+            yaml.dump(state, f, default_flow_style=False, allow_unicode=True)
+
+
+# ── MCP config discovery ────────────────────────────────────────────────────
+
+def find_mcp_config() -> tuple[Path | None, dict]:
+    """Find and parse MCP configuration."""
+    for config_path in MCP_CONFIG_PATHS:
+        if not config_path.exists():
+            continue
+        try:
+            text = config_path.read_text()
+            if config_path.suffix == ".json":
+                data = json.loads(text)
+            elif HAS_YAML:
+                data = yaml.safe_load(text) or {}
+            else:
+                continue
+            return config_path, data
+        except Exception:
+            continue
+    return None, {}
+
+
+def extract_servers(config: dict) -> list[dict]:
+    """Extract server definitions from MCP config."""
+    servers = []
+    # Support both flat and nested formats
+    mcp_servers = config.get("mcpServers") or config.get("servers") or config
+    if isinstance(mcp_servers, dict):
+        for name, defn in mcp_servers.items():
+            if not isinstance(defn, dict):
+                continue
+            transport = "stdio"
+            if "url" in defn:
+                transport = "http"
+            elif "command" in defn:
+                transport = "stdio"
+            servers.append({
+                "name": name,
+                "transport": transport,
+                "command": defn.get("command"),
+                "args": [str(a) for a in (defn.get("args") or [])],  # tolerate null / non-string args
+                "url": defn.get("url"),
+                "env": defn.get("env") or {},  # tolerate an explicit null
+            })
+    return servers
+
+
+# ── Health checks ────────────────────────────────────────────────────────────
+
+def probe_stdio_server(server: dict, timeout: int) -> dict:
+    """Probe a stdio MCP server by attempting to start and initialize it."""
+    command = server.get("command")
+    args = server.get("args", [])
+    if not command:
+        return {
+            "status": "unreachable",
+            "latency_ms": 0,
+            "findings": [{"check": "CONFIG_VALID", "severity": "MEDIUM",
+                          "detail": "No command specified for stdio server"}],
+        }
+
+    # Build the initialize JSON-RPC request
+    init_request = json.dumps({
+        "jsonrpc": "2.0",
+        "id": 1,
+        "method": "initialize",
+        "params": {
+            "protocolVersion": "2024-11-05",
+            "capabilities": {},
+            "clientInfo": {"name": "mcp-health-checker", "version": "1.0"},
+        }
+    }) + "\n"
+
+    start = time.monotonic()
+    try:
+        env = os.environ.copy()
+        env.update({k: str(v) for k, v in (server.get("env") or {}).items()})
+        proc = subprocess.Popen(
+            [command] + args,
+            stdin=subprocess.PIPE,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            env=env,
+        )
+        stdout, stderr = proc.communicate(
+            input=init_request.encode(),
+            timeout=timeout,
+        )
+        elapsed_ms = int((time.monotonic() - start) * 1000)
+
+        if proc.returncode is not None and proc.returncode != 0 and not stdout:
+            return {
+                "status": "unreachable",
+                "latency_ms": elapsed_ms,
+                "findings": [{"check": "REACHABLE", "severity": "CRITICAL",
+                              "detail": f"Process exited with code {proc.returncode}"}],
+            }
+
+        # Try to parse response
+        findings = []
+        tool_count = 0
+        try:
+            response = json.loads(stdout.decode().strip().split("\n")[0])
+            if "result" in response:
+                caps = response["result"].get("capabilities", {})
+                if "tools" in caps:
+                    tool_count = -1  # Has tools capability but count unknown until list
+        except (json.JSONDecodeError, IndexError):
+            findings.append({"check": "REACHABLE", "severity": "HIGH",
+                             "detail": "Server responded but output not valid JSON-RPC"})
+
+        # Check latency
+        status = "healthy"
+        if elapsed_ms > timeout * 1000:
+            findings.append({"check": "LATENCY", "severity": "HIGH",
+                             "detail": f"Response time {elapsed_ms}ms exceeds {timeout}s threshold"})
+            status = "degraded"
+        elif elapsed_ms > (timeout * 1000) // 2:
+            findings.append({"check": "LATENCY", "severity": "MEDIUM",
+                             "detail": f"Response time {elapsed_ms}ms approaching threshold"})
+
+        if findings and status == "healthy":
+            status = "degraded"
+
+        return {
+            "status": status,
+            "latency_ms": elapsed_ms,
+            "tool_count": tool_count,
+            "findings": findings,
+        }
+
+    except subprocess.TimeoutExpired:
+        elapsed_ms = int((time.monotonic() - start) * 1000)
+        try:
+            proc.kill()
+            proc.wait(timeout=2)
+        except Exception:
+            pass
+        return {
+            "status": "unreachable",
+            "latency_ms": elapsed_ms,
+            "findings": [{"check": "LATENCY", "severity": "CRITICAL",
+                          "detail": f"Server did not respond within {timeout}s"}],
+        }
+    except FileNotFoundError:
+        return {
+            "status": "unreachable",
+            "latency_ms": 0,
+            "findings": [{"check": "REACHABLE", "severity": "CRITICAL",
+                          "detail": f"Command not found: {command}"}],
+        }
+    except Exception as e:
+        return {
+            "status": "unreachable",
+            "latency_ms": 0,
+            "findings": [{"check": "REACHABLE", "severity": "CRITICAL",
+                          "detail": f"Probe failed: {str(e)[:100]}"}],
+        }
+
+
+def probe_http_server(server: dict, timeout: int) -> dict:
+    """Probe an HTTP/SSE MCP server via HTTP GET."""
+    url = server.get("url")
+    if not url:
+        return {
+            "status": "unreachable",
+            "latency_ms": 0,
+            "findings": [{"check": "CONFIG_VALID", "severity": "MEDIUM",
+                          "detail": "No URL specified for HTTP server"}],
+        }
+
+    start = time.monotonic()
+    try:
+        import urllib.request
+        req = urllib.request.Request(url, method="GET")
+        req.add_header("User-Agent", "mcp-health-checker/1.0")
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            elapsed_ms = int((time.monotonic() - start) * 1000)
+            status_code = resp.status
+
+            findings = []
+            if status_code >= 400:
+                findings.append({"check": "REACHABLE", "severity": "CRITICAL",
+                                 "detail": f"HTTP {status_code} response"})
+                return {"status": "unreachable", "latency_ms": elapsed_ms, "findings": findings}
+
+            if elapsed_ms > timeout * 1000:
+                findings.append({"check": "LATENCY", "severity": "HIGH",
+                                 "detail": f"Response time {elapsed_ms}ms exceeds threshold"})
+
+            status = "degraded" if findings else "healthy"
+            return {"status": status, "latency_ms": elapsed_ms, "findings": findings}
+
+    except Exception as e:
+        elapsed_ms = int((time.monotonic() - start) * 1000)
+        return {
+            "status": "unreachable",
+            "latency_ms": elapsed_ms,
+            "findings": [{"check": "REACHABLE", "severity": "CRITICAL",
+                          "detail": f"Connection failed: {str(e)[:100]}"}],
+        }
+
+
+def check_staleness(server_name: str, state: dict, max_age_hours: int) -> list[dict]:
+    """Check if a server connection is stale based on last seen time."""
+    findings = []
+    prev_servers = state.get("servers") or []
+    for prev in prev_servers:
+        if prev.get("name") == server_name and prev.get("last_seen_at"):
+            try:
+                last_seen = datetime.fromisoformat(prev["last_seen_at"])
+                age = datetime.now() - last_seen
+                if age > timedelta(hours=max_age_hours):
+                    findings.append({
+                        "check": "STALE",
+                        "severity": "HIGH",
+                        "detail": f"Last successful probe was {age.total_seconds()/3600:.1f}h ago "
+                                  f"(threshold: {max_age_hours}h)",
+                    })
+            except (ValueError, TypeError):
+                pass
+    return findings
+
+
+def validate_config_entry(server: dict) -> list[dict]:
+    """Validate a server config entry has required fields."""
+    findings = []
+    if server["transport"] == "stdio":
+        if not server.get("command"):
+            findings.append({"check": "CONFIG_VALID", "severity": "MEDIUM",
+                             "detail": "Missing 'command' field for stdio server"})
+    elif server["transport"] == "http":
+        if not server.get("url"):
+            findings.append({"check": "CONFIG_VALID", "severity": "MEDIUM",
+                             "detail": "Missing 'url' field for HTTP server"})
+    return findings
+
+
+# ── Commands ─────────────────────────────────────────────────────────────────
+
+def cmd_ping(state: dict, server_filter: str, timeout: int, max_age: int, fmt: str) -> None:
+    config_path, config = find_mcp_config()
+    now = datetime.now().isoformat()
+
+    if not config_path:
+        print("No MCP configuration found. Searched:")
+        for p in MCP_CONFIG_PATHS:
+            print(f"  {p}")
+        print("\nCreate an MCP config to enable health checking.")
+        sys.exit(1)
+
+    servers = extract_servers(config)
+    if server_filter:
+        servers = [s for s in servers if s["name"] == server_filter]
+        if not servers:
+            print(f"Error: server '{server_filter}' not found in config.")
+            sys.exit(1)
+
+    results = []
+    healthy = degraded = unreachable = 0
+
+    for server in servers:
+        # Probe based on transport
+        if server["transport"] == "http":
+            probe = probe_http_server(server, timeout)
+        else:
+            probe = probe_stdio_server(server, timeout)
+
+        # Add staleness check
+        stale_findings = check_staleness(server["name"], state, max_age)
+        all_findings = probe.get("findings", []) + stale_findings
+
+        # Determine final status
+        status = probe["status"]
+        if status == "healthy" and stale_findings:
+            status = "degraded"
+
+        last_seen = now if probe["status"] != "unreachable" else None  # any responding probe refreshes last_seen, so STALE can clear
+        # Preserve previous last_seen if current probe failed
+        if not last_seen:
+            for prev in (state.get("servers") or []):
+                if prev.get("name") == server["name"]:
+                    last_seen = prev.get("last_seen_at")
+                    break
+
+        result = {
+            "name": server["name"],
+            "transport": server["transport"],
+            "status": status,
+            "latency_ms": probe.get("latency_ms", 0),
+            "last_seen_at": last_seen,
+            "tool_count": probe.get("tool_count", 0),
+            "findings": all_findings,
+            "checked_at": now,
+        }
+        results.append(result)
+
+        if status == "healthy":
+            healthy += 1
+        elif status == "degraded":
+            degraded += 1
+        else:
+            unreachable += 1
+
+    if fmt == "json":
+        print(json.dumps({
+            "config_path": str(config_path),
+            "servers_checked": len(results),
+            "healthy": healthy, "degraded": degraded, "unreachable": unreachable,
+            "servers": results,
+        }, indent=2))
+    else:
+        print(f"\nMCP Health Check — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
+        print("-" * 55)
+        print(f"  Config: {config_path}")
+        print(f"  {len(results)} servers | {healthy} healthy | {degraded} degraded | {unreachable} unreachable")
+        print()
+        for r in results:
+            if r["status"] == "healthy":
+                icon = "+"
+            elif r["status"] == "degraded":
+                icon = "!"
+            else:
+                icon = "x"
+            print(f"  {icon} [{r['status'].upper():>11}] {r['name']} ({r['transport']}) — {r['latency_ms']}ms")
+            for f in r.get("findings", []):
+                print(f"     [{f['severity']}] {f['check']}: {f['detail']}")
+            print()
+
+    # Persist
+    state["last_check_at"] = now
+    state["servers"] = results
+    history = state.get("check_history") or []
+    history.insert(0, {
+        "checked_at": now, "servers_checked": len(results),
+        "healthy": healthy, "degraded": degraded, "unreachable": unreachable,
+    })
+    state["check_history"] = history[:MAX_HISTORY]
+    save_state(state)
+
+    sys.exit(1 if unreachable > 0 else 0)
+
+
+def cmd_config(fmt: str) -> None:
+    config_path, config = find_mcp_config()
+    if not config_path:
+        print("No MCP configuration found.")
+        sys.exit(1)
+
+    servers = extract_servers(config)
+    issues = []
+    for server in servers:
+        findings = validate_config_entry(server)
+        if findings:
+            issues.append({"server": server["name"], "findings": findings})
+
+    if fmt == "json":
+        print(json.dumps({
+            "config_path": str(config_path),
+            "servers": len(servers),
+            "issues": issues,
+        }, indent=2))
+    else:
+        print(f"\nMCP Config Validation — {config_path}")
+        print("-" * 50)
+        print(f"  {len(servers)} servers configured")
+        print()
+        if not issues:
+            print("  All config entries valid.")
+        else:
+            for issue in issues:
+                print(f"  ! {issue['server']}:")
+                for f in issue["findings"]:
+                    print(f"    [{f['severity']}] {f['detail']}")
+        print()
+        for server in servers:
+            print(f"  {server['name']}: transport={server['transport']}", end="")
+            if server.get("command"):
+                print(f" cmd={server['command']}", end="")
+            if server.get("url"):
+                print(f" url={server['url']}", end="")
+            print()
+
+
+def cmd_status(state: dict) -> None:
+    last = state.get("last_check_at", "never")
+    print(f"\nMCP Health Checker — Last check: {last}")
+    servers = state.get("servers") or []
+    if servers:
+        healthy = sum(1 for s in servers if s.get("status") == "healthy")
+        degraded = sum(1 for s in servers if s.get("status") == "degraded")
+        unreachable = sum(1 for s in servers if s.get("status") == "unreachable")
+        print(f"  {len(servers)} servers | {healthy} healthy | {degraded} degraded | {unreachable} unreachable")
+        for s in servers:
+            icon = {"healthy": "+", "degraded": "!", "unreachable": "x"}.get(s.get("status", ""), "?")
+            print(f"    {icon} {s['name']}: {s.get('status', 'unknown')} ({s.get('latency_ms', 0)}ms)")
+    print()
+
+
+def cmd_history(state: dict, fmt: str) -> None:
+    history = state.get("check_history") or []
+    if fmt == "json":
+        print(json.dumps({"check_history": history}, indent=2))
+    else:
+        print("\nMCP Health Check History")
+        print("-" * 50)
+        if not history:
+            print("  No check history yet.")
+        else:
+            for h in history[:10]:
+                total = h.get("servers_checked", 0)
+                healthy = h.get("healthy", 0)
+                degraded = h.get("degraded", 0)
+                unreachable = h.get("unreachable", 0)
+                pct = round(healthy / total * 100) if total else 0
+                ts = h.get("checked_at", "?")[:16]
+                bar = "=" * (pct // 10) + "-" * (10 - pct // 10)
+                print(f"  {ts}  [{bar}] {pct}% healthy  ({healthy}/{total})")
+        print()
+
+
+def main():
+    parser = argparse.ArgumentParser(description="MCP Health Checker")
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument("--ping", action="store_true", help="Ping all configured MCP servers")
+    group.add_argument("--config", action="store_true", help="Validate MCP config entries")
+    group.add_argument("--status", action="store_true", help="Last check summary from state")
+    group.add_argument("--history", action="store_true", help="Show past check results")
+    parser.add_argument("--server", type=str, metavar="NAME", help="Check a specific server only")
+    parser.add_argument("--timeout", type=int, default=DEFAULT_TIMEOUT, help="Timeout in seconds (default: 5)")
+    parser.add_argument("--max-age", type=int, default=DEFAULT_MAX_AGE, help="Max connection age in hours (default: 24)")
+    parser.add_argument("--format", choices=["text", "json"], default="text")
+    args = parser.parse_args()
+
+    state = load_state()
+    if args.ping:
+        cmd_ping(state, args.server, args.timeout, args.max_age, args.format)
+    elif args.config:
+        cmd_config(args.format)
+    elif args.status:
+        cmd_status(state)
+    elif args.history:
+        cmd_history(state, args.format)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/skills/openclaw-native/mcp-health-checker/example-state.yaml b/skills/openclaw-native/mcp-health-checker/example-state.yaml
new file mode 100644
index 0000000..1005202
--- /dev/null
+++ b/skills/openclaw-native/mcp-health-checker/example-state.yaml
@@ -0,0 +1,86 @@
+# Example runtime state for mcp-health-checker
+last_check_at: "2026-03-16T12:00:08.554000"
+servers:
+  - name: filesystem
+    transport: stdio
+    status: healthy
+    latency_ms: 120
+    last_seen_at: "2026-03-16T12:00:02.000000"
+    tool_count: 11
+    findings: []
+    checked_at: "2026-03-16T12:00:02.000000"
+  - name: github
+    transport: stdio
+    status: healthy
+    latency_ms: 340
+    last_seen_at: "2026-03-16T12:00:04.000000"
+    tool_count: 18
+    findings: []
+    checked_at: "2026-03-16T12:00:04.000000"
+  - name: web-search
+    transport: http
+    status: degraded
+    latency_ms: 4200
+    last_seen_at: "2026-03-16T12:00:08.000000"
+    tool_count: 3
+    findings:
+      - check: LATENCY
+        severity: MEDIUM
+        detail: "Response time 4200ms approaching threshold"
+    checked_at: "2026-03-16T12:00:08.000000"
+  - name: database
+    transport: stdio
+    status: unreachable
+    latency_ms: 0
+    last_seen_at: "2026-03-15T06:00:00.000000"
+    tool_count: 0
+    findings:
+      - check: REACHABLE
+        severity: CRITICAL
+        detail: "Command not found: pg-mcp-server"
+      - check: STALE
+        severity: HIGH
+        detail: "Last successful probe was 30.0h ago (threshold: 24h)"
+    checked_at: "2026-03-16T12:00:08.000000"
+check_history:
+  - checked_at: "2026-03-16T12:00:08.554000"
+    servers_checked: 4
+    healthy: 2
+    degraded: 1
+    unreachable: 1
+  - checked_at: "2026-03-16T06:00:05.000000"
+    servers_checked: 4
+    healthy: 3
+    degraded: 1
+    unreachable: 0
+  - checked_at: "2026-03-16T00:00:04.000000"
+    servers_checked: 4
+    healthy: 4
+    degraded: 0
+    unreachable: 0
+# ── Walkthrough ──────────────────────────────────────────────────────────────
+# Cron runs every 6 hours:  python3 check.py --ping
+#
+#   MCP Health Check — 2026-03-16 12:00
+#   -------------------------------------------------------
+#     Config: /Users/you/.openclaw/config/mcp.yaml
+#     4 servers | 2 healthy | 1 degraded | 1 unreachable
+#
+#     + [    HEALTHY] filesystem (stdio) — 120ms
+#
+#     + [    HEALTHY] github (stdio) — 340ms
+#
+#     ! [   DEGRADED] web-search (http) — 4200ms
+#        [MEDIUM] LATENCY: Response time 4200ms approaching threshold
+#
+#     x [UNREACHABLE] database (stdio) — 0ms
+#        [CRITICAL] REACHABLE: Command not found: pg-mcp-server
+#        [HIGH] STALE: Last successful probe was 30.0h ago (threshold: 24h)
+#
+# python3 check.py --history
+#
+#   MCP Health Check History
+#   --------------------------------------------------
+#     2026-03-16T12:00  [=====-----] 50% healthy  (2/4)
+#     2026-03-16T06:00  [=======---] 75% healthy  (3/4)
+#     2026-03-16T00:00  [==========] 100% healthy  (4/4)