
End-to-End Observability, Failure Cascades, and Recovery After Extended Outage #148

@vrknetha

Description


Summary

There is no way to trace a message across the full delivery chain, no circuit breakers to prevent cascading failures, no priority system, and limited recovery mechanisms after extended outages. These cross-cutting concerns affect every segment of the round-trip.

Full Round Trip (for context)

Agent A OpenClaw → transform → connector outbound → sign → POST to B proxy
  → B proxy auth → relay DO → WebSocket → B connector → inbox → B OpenClaw hook
  → B agent responds → reverse path back to A

Issues

🔴 P0: No end-to-end message tracing

Files: Multiple across all packages

Each hop generates its own requestId (typically a ULID). There is no correlation ID that survives the full round-trip:

| Hop | ID used |
| --- | --- |
| Transform → Connector outbound | None (fire-and-forget) |
| Connector → Peer proxy | New `requestId` per relay |
| Proxy → DO relay session | `deliveryId` (ULID) |
| DO → WebSocket deliver | `frame.id` (ULID) |
| Connector inbox → OpenClaw hook | `requestId` from frame |

If Agent A reports "I sent a message 10 minutes ago and B never got it", there is no way to find where it got stuck short of manually correlating logs across five or more systems.

Fix:

  • Introduce a correlationId (or traceId) that is:
    • Generated at the originating agent (or transform)
    • Passed as a header (X-Claw-Correlation-Id) through every hop
    • Included in every log entry across connector, proxy, DO, and inbox
    • Returned in delivery receipts and status queries
  • This enables: "trace message abc-123 → left connector at T1 → reached proxy at T2 → queued in DO at T3 → delivered via WS at T4 → replayed to OpenClaw at T5"
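The propagation rule above can be sketched in a few lines. This is a hypothetical helper, not code from the repo: the header name `X-Claw-Correlation-Id` is from this proposal, and `randomUUID` stands in for whatever ULID generator the packages already use.

```typescript
import { randomUUID } from "node:crypto";

const CORRELATION_HEADER = "X-Claw-Correlation-Id";

// Reuse an incoming correlation ID if present; mint one only at the
// originating hop (the transform). Every hop calls this before forwarding.
function resolveCorrelationId(incoming: Headers): string {
  return incoming.get(CORRELATION_HEADER) ?? randomUUID();
}

// Copy headers for the next hop, guaranteeing the correlation ID survives.
function forwardHeaders(incoming: Headers): Headers {
  const out = new Headers(incoming);
  out.set(CORRELATION_HEADER, resolveCorrelationId(incoming));
  return out;
}

// Structured log line that always carries the correlation ID, so a single
// log query reconstructs the full T1→T5 journey.
function logHop(hop: string, correlationId: string, detail: string): string {
  return JSON.stringify({ ts: Date.now(), hop, correlationId, detail });
}
```

The key property is that the ID is minted exactly once and thereafter only copied, never regenerated.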

🔴 P0: No circuit breakers anywhere in the chain

Files: Multiple

A slow or failing component cascades through the entire chain:

  1. OpenClaw hook slow (agent doing 30s+ turn) → inbox replay blocks → inbox fills up
  2. Inbox full → connector sends accepted: false ack to DO → DO queues (500 msg cap)
  3. DO queue fills → relay session returns 507 to sender proxy → sender proxy returns 502
  4. Sender connector retries → multiplies load on the already-stressed path

No component sheds load gracefully. Each layer just passes the failure upstream.

Fix:

  • Add circuit breakers at each boundary:
    • Connector → OpenClaw hook: if N consecutive failures, pause replay for M seconds (already partially handled by backoff, but no circuit state)
    • Connector → Peer proxy (outbound): if peer consistently fails, open circuit and queue locally
    • Proxy → Registry: covered in Auth Pipeline Resilience + Registry Single Point of Failure #147 but mention here for completeness
    • DO → WebSocket delivery: if connector repeatedly NAKs, back off delivery attempts (partially exists)
  • Each circuit breaker should expose state in relevant status endpoints
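A minimal sketch of the breaker state machine each boundary could wrap its calls in. All names (`CircuitBreaker`, `failureThreshold`, `resetTimeoutMs`) are illustrative, not from the codebase:

```typescript
type CircuitState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number,
    private resetTimeoutMs: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  // Returns false while the circuit is open; the caller sheds load
  // (e.g. queues locally) instead of hammering the failing component.
  allowRequest(): boolean {
    if (this.state === "open") {
      if (this.now() - this.openedAt >= this.resetTimeoutMs) {
        this.state = "half-open"; // let one probe request through
        return true;
      }
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  // Exposed so status endpoints can report circuit state, per the fix above.
  get currentState(): CircuitState {
    return this.state;
  }
}
```

The half-open probe is what distinguishes this from the existing backoff: the breaker holds explicit state and stops sending entirely, rather than retrying on a schedule.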

🟠 P1: No message priority system

Files: apps/proxy/src/agent-relay-session.ts, packages/connector/src/inbound-inbox.ts

All messages are treated equally. A flood of low-priority messages (e.g., batch status updates) can starve critical messages (e.g., auth challenges, human-initiated conversations). Both the DO queue and the inbound inbox process FIFO with no priority lanes.

Fix:

  • Add optional priority field to DeliverFrame (e.g., critical | normal | low)
  • High-priority messages: processed first in queue drains, bypass backoff delays
  • Low-priority messages: shed first when queues approach capacity
  • Default: normal (backward compatible)
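One way the priority lanes could work for both the DO queue and the inbox, as a sketch. The `critical | normal | low` values come from the proposed `DeliverFrame` field; the `PriorityLanes` class and its capacity behavior are assumptions:

```typescript
type Priority = "critical" | "normal" | "low";

interface QueuedMessage {
  id: string;
  priority: Priority;
}

const LANE_ORDER: Priority[] = ["critical", "normal", "low"];

class PriorityLanes {
  private lanes: Map<Priority, QueuedMessage[]> = new Map(
    LANE_ORDER.map((p): [Priority, QueuedMessage[]] => [p, []]),
  );

  constructor(private capacity: number) {}

  // At capacity, shed the lowest-priority queued message to make room.
  enqueue(msg: QueuedMessage): boolean {
    if (this.size >= this.capacity) {
      for (const p of [...LANE_ORDER].reverse()) {
        const lane = this.lanes.get(p)!;
        if (p !== "critical" && lane.length > 0) {
          lane.shift(); // drop oldest low-priority message
          break;
        }
      }
      if (this.size >= this.capacity) return false; // only critical left
    }
    this.lanes.get(msg.priority)!.push(msg);
    return true;
  }

  // Drain highest-priority lane first; FIFO within each lane.
  dequeue(): QueuedMessage | undefined {
    for (const p of LANE_ORDER) {
      const lane = this.lanes.get(p)!;
      if (lane.length > 0) return lane.shift();
    }
    return undefined;
  }

  get size(): number {
    let n = 0;
    for (const lane of this.lanes.values()) n += lane.length;
    return n;
  }
}
```

FIFO within a lane preserves today's ordering semantics for messages of equal priority, which keeps the default (`normal`) path backward compatible.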

🟠 P1: DO queue limits too tight for extended outages

File: apps/proxy/src/agent-relay-session.ts

  • Queue cap: 500 messages per agent (configurable via RELAY_QUEUE_MAX_MESSAGES_PER_AGENT)
  • Queue TTL: 1 hour (configurable via RELAY_QUEUE_TTL_SECONDS)

If a connector is offline for 2+ hours (e.g., laptop sleeping, server maintenance), all queued messages expire. The sender gets a 507 (queue full) or messages silently expire with no notification.

Fix:

  • Increase default TTL to at least 4-6 hours for production use
  • Add a "message expired" notification mechanism — when a queued message expires, create a delivery failure receipt that the sender can query
  • Add configurable per-peer queue limits (high-trust peers get larger queues)
  • Consider overflow to durable storage (D1) for messages exceeding DO storage limits
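The "message expired" notification could be driven by a sweep like the following. This is a sketch only; the `Queued` and `FailureReceipt` shapes are hypothetical, and in practice the receipt would be persisted where the sender's status query can find it:

```typescript
interface Queued {
  id: string;
  enqueuedAtMs: number;
}

interface FailureReceipt {
  messageId: string;
  reason: "expired";
  expiredAtMs: number;
}

// Partition the queue into still-live messages and failure receipts for
// expired ones, instead of dropping expired messages silently.
function sweepExpired(
  queue: Queued[],
  ttlMs: number,
  nowMs: number,
): { kept: Queued[]; receipts: FailureReceipt[] } {
  const kept: Queued[] = [];
  const receipts: FailureReceipt[] = [];
  for (const msg of queue) {
    if (nowMs - msg.enqueuedAtMs > ttlMs) {
      receipts.push({ messageId: msg.id, reason: "expired", expiredAtMs: nowMs });
    } else {
      kept.push(msg);
    }
  }
  return { kept, receipts };
}
```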

🟠 P1: No catch-up protocol after reconnection

Files: packages/connector/src/client.ts, apps/proxy/src/agent-relay-session.ts

When a connector reconnects after being offline, there is no handshake to reconcile state:

  • Connector does not tell the DO "I was offline since timestamp T, what did I miss?"
  • DO does not tell the connector "you have N queued messages, here is the first batch"
  • Messages that expired during the outage are silently gone

Fix:

  • Add a reconnection handshake frame:
    type: "reconnect"
    lastReceivedId: "<ULID of last successfully processed message>"
    offlineSinceMs: <timestamp>
    
  • DO responds with queue state summary:
    type: "reconnect_ack"
    queuedCount: N
    expiredCount: M
    oldestQueuedAt: "<ISO timestamp>"
    
  • This gives both sides shared state awareness on reconnection
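The two frames above could be typed roughly as follows. Field names follow the sketch in this issue; the exact shapes would be settled in the connector protocol:

```typescript
interface ReconnectFrame {
  type: "reconnect";
  lastReceivedId: string; // ULID of last successfully processed message
  offlineSinceMs: number; // epoch millis when the connector went offline
}

interface ReconnectAckFrame {
  type: "reconnect_ack";
  queuedCount: number;
  expiredCount: number;
  oldestQueuedAt: string | null; // ISO timestamp; null when queue is empty
}

// Assumed DO-side handler: summarize queue state for the reconnecting peer.
function buildReconnectAck(
  queuedTimestampsMs: number[],
  expiredCount: number,
): ReconnectAckFrame {
  const oldest = queuedTimestampsMs.length
    ? new Date(Math.min(...queuedTimestampsMs)).toISOString()
    : null;
  return {
    type: "reconnect_ack",
    queuedCount: queuedTimestampsMs.length,
    expiredCount,
    oldestQueuedAt: oldest,
  };
}
```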

🟡 P2: No aggregate metrics or dashboards

Files: Multiple

Individual components have status endpoints (/v1/status, connector status, DO health), but there are no aggregate metrics:

  • Message success/failure rate (per peer, per hour)
  • End-to-end delivery latency (p50, p95, p99)
  • Queue depth trends over time
  • Connection uptime percentage
  • Auth failure rate by error code
  • Registry call latency and failure rate

Fix:

  • Emit structured metrics events at each hop (connector, proxy, DO)
  • Add a metrics aggregation endpoint or export to a time-series store
  • Key metrics to track:
    • messages.outbound.total / messages.outbound.failed (per peer)
    • messages.inbound.total / messages.inbound.replayed / messages.inbound.dead_lettered
    • delivery.latency.e2e_ms (correlation ID enables this)
    • websocket.uptime_pct / websocket.reconnect_count
    • registry.call_latency_ms / registry.failure_count
    • queue.depth / queue.expired_count
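A minimal in-process registry that the hops could emit into, as a sketch. The metric names mirror the list above; the `Metrics` API itself is an assumption, and a real deployment would export to a time-series store instead:

```typescript
class Metrics {
  private counters = new Map<string, number>();
  private samples = new Map<string, number[]>();

  // e.g. increment("messages.outbound.total")
  increment(name: string, by = 1): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + by);
  }

  // e.g. observe("delivery.latency.e2e_ms", 420)
  observe(name: string, value: number): void {
    const arr = this.samples.get(name) ?? [];
    arr.push(value);
    this.samples.set(name, arr);
  }

  // Nearest-rank percentile over observed samples (q = 0.5 → p50).
  percentile(name: string, q: number): number | undefined {
    const arr = [...(this.samples.get(name) ?? [])].sort((a, b) => a - b);
    if (arr.length === 0) return undefined;
    return arr[Math.min(arr.length - 1, Math.ceil(q * arr.length) - 1)];
  }

  counter(name: string): number {
    return this.counters.get(name) ?? 0;
  }
}
```

Note that `delivery.latency.e2e_ms` is only computable once the correlation ID from the P0 item exists, since it requires matching the send and delivery timestamps of the same message.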

🟡 P2: No sender notification for permanently failed messages

Files: Multiple

When a message permanently fails (DO queue expired, max retries exhausted, dead-lettered in inbox), the sender is never notified. Agent A thinks the message might still be in transit when it has already been discarded.

Fix:

  • Implement delivery status notifications:
    • On permanent failure, POST a status update to the sender proxy
    • Sender proxy relays to sender connector → sender OpenClaw
    • Include: original correlation ID, failure reason, timestamp
    • Enable the sender agent to inform the user: "Your message to Agent B from 2 hours ago could not be delivered: recipient was offline and message expired"

🟡 P2: No rate limiting at the agent level

Files: apps/proxy/src/server.ts

There is no per-agent rate limiting on inbound message delivery. A misconfigured or malicious agent can flood another agent with messages, filling the DO queue and causing legitimate messages to get 507 (queue full).

Fix:

  • Add per-sender rate limiting at the proxy level:
    • E.g., max 60 messages per minute per sender agent DID
    • Configurable per trust pair (high-trust peers get higher limits)
    • Return 429 with Retry-After header when exceeded
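A token-bucket limiter keyed by sender DID would cover this. The 60/minute figure is the example from this issue; the class and method names are hypothetical:

```typescript
class PerSenderRateLimiter {
  private buckets = new Map<string, { tokens: number; lastRefillMs: number }>();

  constructor(
    private maxPerMinute: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  // Returns null if the message is allowed, otherwise the number of
  // seconds to put in the 429 response's Retry-After header.
  check(senderDid: string): number | null {
    const nowMs = this.now();
    let b = this.buckets.get(senderDid);
    if (!b) {
      b = { tokens: this.maxPerMinute, lastRefillMs: nowMs };
      this.buckets.set(senderDid, b);
    }
    // Continuous refill: maxPerMinute tokens accrue per 60 seconds.
    const refill = ((nowMs - b.lastRefillMs) / 60_000) * this.maxPerMinute;
    b.tokens = Math.min(this.maxPerMinute, b.tokens + refill);
    b.lastRefillMs = nowMs;
    if (b.tokens >= 1) {
      b.tokens -= 1;
      return null;
    }
    // Seconds until one full token accrues.
    return Math.ceil(((1 - b.tokens) * 60_000) / this.maxPerMinute / 1000);
  }
}
```

Making `maxPerMinute` resolvable per trust pair (rather than a single global) covers the configurable-limit bullet.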

🟡 P2: No idempotency guarantee on the outbound path

File: apps/openclaw-skill/src/transforms/relay-to-peer.ts

If the transform fires twice for the same message (OpenClaw hook retry, duplicate event), the same message is sent twice to the peer. The inbound side has dedup via requestId, but the outbound transform generates no stable ID — each invocation creates a new request.

Fix:

  • Generate a deterministic message ID in the transform based on payload content hash or OpenClaw event ID
  • Pass as X-Claw-Idempotency-Key header
  • Peer proxy deduplicates based on this key (short TTL window)
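The deterministic key could be derived like this. A sketch under the assumptions stated in the bullets: the header name `X-Claw-Idempotency-Key` is from this proposal, and `eventId` stands in for whatever stable OpenClaw event identifier the transform receives:

```typescript
import { createHash } from "node:crypto";

// Same event + same payload always yields the same key, so a retried
// transform invocation is recognized as a duplicate by the peer proxy.
function idempotencyKey(eventId: string, payload: unknown): string {
  const digest = createHash("sha256")
    .update(eventId)
    .update(JSON.stringify(payload))
    .digest("hex");
  return digest.slice(0, 32); // truncated for header friendliness
}
```

One caveat worth noting: `JSON.stringify` is only deterministic if the payload's key order is stable across invocations; if it is not, a canonical serialization step would be needed before hashing.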

Acceptance Criteria

  • Correlation ID traces a message across all hops (transform → connector → proxy → DO → connector → OpenClaw)
  • Circuit breakers at each inter-component boundary prevent cascading failures
  • Reconnection handshake reconciles state between connector and DO
  • Permanently failed messages generate sender notifications
  • Per-agent rate limiting prevents queue flooding
  • Aggregate metrics available for success rate, latency, queue depth
  • Outbound dedup prevents duplicate message delivery

Priority Order

  1. End-to-end correlation ID — foundational for debugging everything else
  2. Circuit breakers — prevents one slow component from taking down the whole chain
  3. Reconnection handshake — critical for recovery after outages
  4. Sender failure notifications — closes the "message disappeared" gap
  5. Priority system — prevents critical messages being starved
  6. Metrics + rate limiting — operational maturity

Labels: enhancement (New feature or request), resilience (System resilience and fault tolerance)
