
End-to-End Observability, Failure Cascades, and Recovery After Extended Outage #148

@vrknetha

Description


Summary

There is no way to trace a message across the full delivery chain, no circuit breakers to prevent cascading failures, no priority system, and limited recovery mechanisms after extended outages. These cross-cutting concerns affect every segment of the round-trip.

Full Round Trip (for context)

Agent A OpenClaw → transform → connector outbound → sign → POST to B proxy
  → B proxy auth → relay DO → WebSocket → B connector → inbox → B OpenClaw hook
  → B agent responds → reverse path back to A

Issues

🔴 P0: No end-to-end message tracing

Files: Multiple across all packages

Each hop generates its own requestId (typically a ULID). There is no correlation ID that survives the full round-trip:

| Hop | ID used |
| --- | --- |
| Transform → Connector outbound | None (fire-and-forget) |
| Connector → Peer proxy | New `requestId` per relay |
| Proxy → DO relay session | `deliveryId` (ULID) |
| DO → WebSocket deliver | `frame.id` (ULID) |
| Connector inbox → OpenClaw hook | `requestId` from frame |

If Agent A reports "I sent a message 10 minutes ago and B never got it", there is no way to find where it got stuck short of manually correlating logs across five or more systems.

Fix:

  • Introduce a correlationId (or traceId) that is:
    • Generated at the originating agent (or transform)
    • Passed as a header (X-Claw-Correlation-Id) through every hop
    • Included in every log entry across connector, proxy, DO, and inbox
    • Returned in delivery receipts and status queries
  • This enables: "trace message abc-123 → left connector at T1 → reached proxy at T2 → queued in DO at T3 → delivered via WS at T4 → replayed to OpenClaw at T5"
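The propagation rule above can be sketched in a few lines. This is a hypothetical helper, not code from the repo: the header name `X-Claw-Correlation-Id` is from this proposal, and `randomUUID` stands in for whatever ULID generator the packages already use.

```typescript
import { randomUUID } from "node:crypto";

const CORRELATION_HEADER = "X-Claw-Correlation-Id";

// Reuse an incoming correlation ID if present; mint one only at the
// originating hop (the transform). Every hop calls this before forwarding.
function resolveCorrelationId(incoming: Headers): string {
  return incoming.get(CORRELATION_HEADER) ?? randomUUID();
}

// Copy headers for the next hop, guaranteeing the correlation ID survives.
function forwardHeaders(incoming: Headers): Headers {
  const out = new Headers(incoming);
  out.set(CORRELATION_HEADER, resolveCorrelationId(incoming));
  return out;
}

// Structured log line that always carries the correlation ID, so a single
// log query reconstructs the full T1→T5 journey.
function logHop(hop: string, correlationId: string, detail: string): string {
  return JSON.stringify({ ts: Date.now(), hop, correlationId, detail });
}
```

The key property is that the ID is minted exactly once and thereafter only copied, never regenerated.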

🔴 P0: No circuit breakers anywhere in the chain

Files: Multiple

A slow or failing component cascades through the entire chain:

  1. OpenClaw hook slow (agent doing 30s+ turn) → inbox replay blocks → inbox fills up
  2. Inbox full → connector sends accepted: false ack to DO → DO queues (500 msg cap)
  3. DO queue fills → relay session returns 507 to sender proxy → sender proxy returns 502
  4. Sender connector retries → multiplies load on the already-stressed path

No component sheds load gracefully. Each layer just passes the failure upstream.

Fix:

  • Add circuit breakers at each boundary:
    • Connector → OpenClaw hook: if N consecutive failures, pause replay for M seconds (already partially handled by backoff, but no circuit state)
    • Connector → Peer proxy (outbound): if peer consistently fails, open circuit and queue locally
    • Proxy → Registry: covered in Auth Pipeline Resilience + Registry Single Point of Failure #147 but mention here for completeness
    • DO → WebSocket delivery: if connector repeatedly NAKs, back off delivery attempts (partially exists)
  • Each circuit breaker should expose state in relevant status endpoints
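A minimal sketch of the breaker state machine each boundary could wrap its calls in. All names (`CircuitBreaker`, `failureThreshold`, `resetTimeoutMs`) are illustrative, not from the codebase:

```typescript
type CircuitState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number,
    private resetTimeoutMs: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  // Returns false while the circuit is open; the caller sheds load
  // (e.g. queues locally) instead of hammering the failing component.
  allowRequest(): boolean {
    if (this.state === "open") {
      if (this.now() - this.openedAt >= this.resetTimeoutMs) {
        this.state = "half-open"; // let one probe request through
        return true;
      }
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  // Exposed so status endpoints can report circuit state, per the fix above.
  get currentState(): CircuitState {
    return this.state;
  }
}
```

The half-open probe is what distinguishes this from the existing backoff: the breaker holds explicit state and stops sending entirely, rather than retrying on a schedule.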

🟠 P1: No message priority system

Files: apps/proxy/src/agent-relay-session.ts, packages/connector/src/inbound-inbox.ts

All messages are treated equally. A flood of low-priority messages (e.g., batch status updates) can starve critical messages (e.g., auth challenges, human-initiated conversations). Both the DO queue and the inbound inbox process FIFO with no priority lanes.

Fix:

  • Add optional priority field to DeliverFrame (e.g., critical | normal | low)
  • High-priority messages: processed first in queue drains, bypass backoff delays
  • Low-priority messages: shed first when queues approach capacity
  • Default: normal (backward compatible)
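One way the priority lanes could work for both the DO queue and the inbox, as a sketch. The `critical | normal | low` values come from the proposed `DeliverFrame` field; the `PriorityLanes` class and its capacity behavior are assumptions:

```typescript
type Priority = "critical" | "normal" | "low";

interface QueuedMessage {
  id: string;
  priority: Priority;
}

const LANE_ORDER: Priority[] = ["critical", "normal", "low"];

class PriorityLanes {
  private lanes: Map<Priority, QueuedMessage[]> = new Map(
    LANE_ORDER.map((p): [Priority, QueuedMessage[]] => [p, []]),
  );

  constructor(private capacity: number) {}

  // At capacity, shed the lowest-priority queued message to make room.
  enqueue(msg: QueuedMessage): boolean {
    if (this.size >= this.capacity) {
      for (const p of [...LANE_ORDER].reverse()) {
        const lane = this.lanes.get(p)!;
        if (p !== "critical" && lane.length > 0) {
          lane.shift(); // drop oldest low-priority message
          break;
        }
      }
      if (this.size >= this.capacity) return false; // only critical left
    }
    this.lanes.get(msg.priority)!.push(msg);
    return true;
  }

  // Drain highest-priority lane first; FIFO within each lane.
  dequeue(): QueuedMessage | undefined {
    for (const p of LANE_ORDER) {
      const lane = this.lanes.get(p)!;
      if (lane.length > 0) return lane.shift();
    }
    return undefined;
  }

  get size(): number {
    let n = 0;
    for (const lane of this.lanes.values()) n += lane.length;
    return n;
  }
}
```

FIFO within a lane preserves today's ordering semantics for messages of equal priority, which keeps the default (`normal`) path backward compatible.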

🟠 P1: DO queue limits too tight for extended outages

File: apps/proxy/src/agent-relay-session.ts

  • Queue cap: 500 messages per agent (configurable via RELAY_QUEUE_MAX_MESSAGES_PER_AGENT)
  • Queue TTL: 1 hour (configurable via RELAY_QUEUE_TTL_SECONDS)

If a connector is offline for 2+ hours (e.g., laptop sleeping, server maintenance), all queued messages expire. The sender gets a 507 (queue full) or messages silently expire with no notification.

Fix:

  • Increase default TTL to at least 4-6 hours for production use
  • Add a "message expired" notification mechanism — when a queued message expires, create a delivery failure receipt that the sender can query
  • Add configurable per-peer queue limits (high-trust peers get larger queues)
  • Consider overflow to durable storage (D1) for messages exceeding DO storage limits
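The "message expired" notification could be driven by a sweep like the following. This is a sketch only; the `Queued` and `FailureReceipt` shapes are hypothetical, and in practice the receipt would be persisted where the sender's status query can find it:

```typescript
interface Queued {
  id: string;
  enqueuedAtMs: number;
}

interface FailureReceipt {
  messageId: string;
  reason: "expired";
  expiredAtMs: number;
}

// Partition the queue into still-live messages and failure receipts for
// expired ones, instead of dropping expired messages silently.
function sweepExpired(
  queue: Queued[],
  ttlMs: number,
  nowMs: number,
): { kept: Queued[]; receipts: FailureReceipt[] } {
  const kept: Queued[] = [];
  const receipts: FailureReceipt[] = [];
  for (const msg of queue) {
    if (nowMs - msg.enqueuedAtMs > ttlMs) {
      receipts.push({ messageId: msg.id, reason: "expired", expiredAtMs: nowMs });
    } else {
      kept.push(msg);
    }
  }
  return { kept, receipts };
}
```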

🟠 P1: No catch-up protocol after reconnection

Files: packages/connector/src/client.ts, apps/proxy/src/agent-relay-session.ts

When a connector reconnects after being offline, there is no handshake to reconcile state:

  • Connector does not tell the DO "I was offline since timestamp T, what did I miss?"
  • DO does not tell the connector "you have N queued messages, here is the first batch"
  • Messages that expired during the outage are silently gone

Fix:

  • Add a reconnection handshake frame:
    type: "reconnect"
    lastReceivedId: "<ULID of last successfully processed message>"
    offlineSinceMs: <timestamp>
    
  • DO responds with queue state summary:
    type: "reconnect_ack"
    queuedCount: N
    expiredCount: M
    oldestQueuedAt: "<ISO timestamp>"
    
  • This gives both sides shared state awareness on reconnection
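The two frames above could be typed roughly as follows. Field names follow the sketch in this issue; the exact shapes would be settled in the connector protocol:

```typescript
interface ReconnectFrame {
  type: "reconnect";
  lastReceivedId: string; // ULID of last successfully processed message
  offlineSinceMs: number; // epoch millis when the connector went offline
}

interface ReconnectAckFrame {
  type: "reconnect_ack";
  queuedCount: number;
  expiredCount: number;
  oldestQueuedAt: string | null; // ISO timestamp; null when queue is empty
}

// Assumed DO-side handler: summarize queue state for the reconnecting peer.
function buildReconnectAck(
  queuedTimestampsMs: number[],
  expiredCount: number,
): ReconnectAckFrame {
  const oldest = queuedTimestampsMs.length
    ? new Date(Math.min(...queuedTimestampsMs)).toISOString()
    : null;
  return {
    type: "reconnect_ack",
    queuedCount: queuedTimestampsMs.length,
    expiredCount,
    oldestQueuedAt: oldest,
  };
}
```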

🟡 P2: No aggregate metrics or dashboards

Files: Multiple

Individual components have status endpoints (/v1/status, connector status, DO health), but there are no aggregate metrics:

  • Message success/failure rate (per peer, per hour)
  • End-to-end delivery latency (p50, p95, p99)
  • Queue depth trends over time
  • Connection uptime percentage
  • Auth failure rate by error code
  • Registry call latency and failure rate

Fix:

  • Emit structured metrics events at each hop (connector, proxy, DO)
  • Add a metrics aggregation endpoint or export to a time-series store
  • Key metrics to track:
    • messages.outbound.total / messages.outbound.failed (per peer)
    • messages.inbound.total / messages.inbound.replayed / messages.inbound.dead_lettered
    • delivery.latency.e2e_ms (correlation ID enables this)
    • websocket.uptime_pct / websocket.reconnect_count
    • registry.call_latency_ms / registry.failure_count
    • queue.depth / queue.expired_count
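A minimal in-process registry that the hops could emit into, as a sketch. The metric names mirror the list above; the `Metrics` API itself is an assumption, and a real deployment would export to a time-series store instead:

```typescript
class Metrics {
  private counters = new Map<string, number>();
  private samples = new Map<string, number[]>();

  // e.g. increment("messages.outbound.total")
  increment(name: string, by = 1): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + by);
  }

  // e.g. observe("delivery.latency.e2e_ms", 420)
  observe(name: string, value: number): void {
    const arr = this.samples.get(name) ?? [];
    arr.push(value);
    this.samples.set(name, arr);
  }

  // Nearest-rank percentile over observed samples (q = 0.5 → p50).
  percentile(name: string, q: number): number | undefined {
    const arr = [...(this.samples.get(name) ?? [])].sort((a, b) => a - b);
    if (arr.length === 0) return undefined;
    return arr[Math.min(arr.length - 1, Math.ceil(q * arr.length) - 1)];
  }

  counter(name: string): number {
    return this.counters.get(name) ?? 0;
  }
}
```

Note that `delivery.latency.e2e_ms` is only computable once the correlation ID from the P0 item exists, since it requires matching the send and delivery timestamps of the same message.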

🟡 P2: No sender notification for permanently failed messages

Files: Multiple

When a message permanently fails (DO queue expired, max retries exhausted, dead-lettered in inbox), the sender is never notified. Agent A thinks the message might still be in transit when it has already been discarded.

Fix:

  • Implement delivery status notifications:
    • On permanent failure, POST a status update to the sender proxy
    • Sender proxy relays to sender connector → sender OpenClaw
    • Include: original correlation ID, failure reason, timestamp
    • Enable the sender agent to inform the user: "Your message to Agent B from 2 hours ago could not be delivered: recipient was offline and message expired"

🟡 P2: No rate limiting at the agent level

Files: apps/proxy/src/server.ts

There is no per-agent rate limiting on inbound message delivery. A misconfigured or malicious agent can flood another agent with messages, filling the DO queue and causing legitimate messages to get 507 (queue full).

Fix:

  • Add per-sender rate limiting at the proxy level:
    • E.g., max 60 messages per minute per sender agent DID
    • Configurable per trust pair (high-trust peers get higher limits)
    • Return 429 with Retry-After header when exceeded
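A token-bucket limiter keyed by sender DID would cover this. The 60/minute figure is the example from this issue; the class and method names are hypothetical:

```typescript
class PerSenderRateLimiter {
  private buckets = new Map<string, { tokens: number; lastRefillMs: number }>();

  constructor(
    private maxPerMinute: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  // Returns null if the message is allowed, otherwise the number of
  // seconds to put in the 429 response's Retry-After header.
  check(senderDid: string): number | null {
    const nowMs = this.now();
    let b = this.buckets.get(senderDid);
    if (!b) {
      b = { tokens: this.maxPerMinute, lastRefillMs: nowMs };
      this.buckets.set(senderDid, b);
    }
    // Continuous refill: maxPerMinute tokens accrue per 60 seconds.
    const refill = ((nowMs - b.lastRefillMs) / 60_000) * this.maxPerMinute;
    b.tokens = Math.min(this.maxPerMinute, b.tokens + refill);
    b.lastRefillMs = nowMs;
    if (b.tokens >= 1) {
      b.tokens -= 1;
      return null;
    }
    // Seconds until one full token accrues.
    return Math.ceil(((1 - b.tokens) * 60_000) / this.maxPerMinute / 1000);
  }
}
```

Making `maxPerMinute` resolvable per trust pair (rather than a single global) covers the configurable-limit bullet.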

🟡 P2: No idempotency guarantee on the outbound path

File: apps/openclaw-skill/src/transforms/relay-to-peer.ts

If the transform fires twice for the same message (OpenClaw hook retry, duplicate event), the same message is sent twice to the peer. The inbound side has dedup via requestId, but the outbound transform generates no stable ID — each invocation creates a new request.

Fix:

  • Generate a deterministic message ID in the transform based on payload content hash or OpenClaw event ID
  • Pass as X-Claw-Idempotency-Key header
  • Peer proxy deduplicates based on this key (short TTL window)
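The deterministic key could be derived like this. A sketch under the assumptions stated in the bullets: the header name `X-Claw-Idempotency-Key` is from this proposal, and `eventId` stands in for whatever stable OpenClaw event identifier the transform receives:

```typescript
import { createHash } from "node:crypto";

// Same event + same payload always yields the same key, so a retried
// transform invocation is recognized as a duplicate by the peer proxy.
function idempotencyKey(eventId: string, payload: unknown): string {
  const digest = createHash("sha256")
    .update(eventId)
    .update(JSON.stringify(payload))
    .digest("hex");
  return digest.slice(0, 32); // truncated for header friendliness
}
```

One caveat worth noting: `JSON.stringify` is only deterministic if the payload's key order is stable across invocations; if it is not, a canonical serialization step would be needed before hashing.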

Acceptance Criteria

  • Correlation ID traces a message across all hops (transform → connector → proxy → DO → connector → OpenClaw)
  • Circuit breakers at each inter-component boundary prevent cascading failures
  • Reconnection handshake reconciles state between connector and DO
  • Permanently failed messages generate sender notifications
  • Per-agent rate limiting prevents queue flooding
  • Aggregate metrics available for success rate, latency, queue depth
  • Outbound dedup prevents duplicate message delivery

Priority Order

  1. End-to-end correlation ID — foundational for debugging everything else
  2. Circuit breakers — prevents one slow component from taking down the whole chain
  3. Reconnection handshake — critical for recovery after outages
  4. Sender failure notifications — closes the "message disappeared" gap
  5. Priority system — prevents critical messages being starved
  6. Metrics + rate limiting — operational maturity

Labels: enhancement (New feature or request), resilience (System resilience and fault tolerance)
