Watchdog health check fails when gateway.tailscale.mode=serve (WS CLI broken) #21

@Lazydayz137

Problem

When gateway.tailscale.mode is set to "serve", all OpenClaw CLI commands that use WebSocket connections to the gateway fail with:

gateway connect failed: Error: gateway closed (1000): 
nodes status failed: Error: gateway closed (1000 normal closure): no close reason
Gateway target: ws://127.0.0.1:18789
Source: local loopback

This affects:

  • openclaw health --json (used by AlphaClaw watchdog for health checks)
  • openclaw nodes status --json (used by /api/nodes route)
  • openclaw nodes pending --json
  • openclaw devices list --json

The HTTP /health endpoint works fine; only the WS-based CLI commands are broken.

Impact

  1. False health failures: AlphaClaw's watchdog uses openclaw health --json for health checks. When these fail, the UI shows the gateway as "constantly restarting" even though it's running fine.

  2. Hung process flood: The /api/nodes dashboard route spawns openclaw nodes status and openclaw nodes pending CLI processes that hang indefinitely because the WS handshake never completes. Dozens of these stuck processes accumulate and consume CPU and memory.

  3. Gateway overload: The stuck WS connections flood the gateway with handshake timeouts, degrading performance.
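
The process buildup in item 2 could be bounded by giving every CLI subprocess a hard timeout. Below is a minimal TypeScript sketch, not AlphaClaw's actual code: `runBounded` and `CliResult` are hypothetical names, and it relies only on the `timeout` and `killSignal` options of Node's `child_process.execFile`.

```typescript
import { execFile } from "node:child_process";

// Outcome of one bounded CLI invocation (hypothetical shape for this sketch).
interface CliResult {
  ok: boolean;       // true only if the process exited with code 0
  timedOut: boolean; // true if execFile itself killed the child
  stdout: string;
}

// Run a command and kill it if it has not exited within timeoutMs.
// The promise always resolves, so a caller never waits on a hung child.
function runBounded(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<CliResult> {
  return new Promise((resolve) => {
    execFile(
      command,
      args,
      { timeout: timeoutMs, killSignal: "SIGKILL" },
      (error, stdout) => {
        // `killed` is set when execFile terminated the child itself, which
        // here means the timeout elapsed (or the output buffer overflowed).
        const timedOut = error !== null && error.killed === true;
        resolve({ ok: error === null, timedOut, stdout });
      },
    );
  });
}
```

For example, `runBounded("openclaw", ["nodes", "status", "--json"], 5000)` would resolve with `timedOut: true` after roughly five seconds instead of leaving a stuck process behind. The 5000 ms value is an arbitrary choice for illustration.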

Root Cause

Tailscale Serve intercepts loopback traffic to the gateway port (18789) and appears to break the WS handshake for CLI subprocess connections. When tailscale.mode is set to "off", all CLI commands work instantly.

Environment

  • OpenClaw: 2026.3.13
  • AlphaClaw: 0.8.0
  • Node: v24.14.0
  • OS: Ubuntu Noble (Linux 6.8.0-106-generic)
  • Tailscale: 1.94.2

Workaround

Set gateway.tailscale.mode: "off" and use SSH tunnels or gateway.bind: "tailnet" for remote node connections instead of Tailscale Serve.

Suggested Fix

Consider using the HTTP /health endpoint for watchdog health checks instead of the WS-based CLI, since HTTP works regardless of Tailscale Serve configuration. The /api/nodes route could also use an HTTP-based approach or add a timeout to prevent zombie processes.
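
A minimal TypeScript sketch of what an HTTP-based watchdog probe might look like. `probeHealth` and `GATEWAY_HEALTH_URL` are hypothetical names. The URL assumes /health is served on the same loopback address and port as the gateway target shown in the error output, and any 2xx status is treated as healthy because the response body format is not described in this issue.

```typescript
// Assumption: /health is served over HTTP on the gateway's loopback port.
const GATEWAY_HEALTH_URL = "http://127.0.0.1:18789/health";

// Resolve to true if the endpoint answers with a 2xx status within timeoutMs.
// Every failure mode (refused connection, timeout, non-2xx) maps to false,
// so the watchdog gets a plain boolean and never hangs on the probe.
async function probeHealth(
  url: string = GATEWAY_HEALTH_URL,
  timeoutMs: number = 3000,
): Promise<boolean> {
  try {
    // AbortSignal.timeout aborts the request once timeoutMs has elapsed.
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false;
  }
}
```

A watchdog loop could call `probeHealth()` on its existing interval and only count a failure when it resolves to `false`. Because the probe does not go through the WS-based CLI, it should be unaffected by the Tailscale Serve setting, as far as the behavior reported above holds.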
