
Auth Pipeline Resilience + Registry Single Point of Failure #147

@vrknetha


Summary

The proxy auth middleware makes multiple external calls to the registry on every inbound request. If the registry is slow or down, all message delivery stops with 503 errors. There is no circuit breaker, caching fallbacks are limited, and the in-memory nonce cache is lost on Durable Object eviction.

Current Auth Pipeline (per request)

Inbound request arrives at proxy
  → Parse Authorization: Claw <AIT>
  → Verify AIT signature (needs registry signing keys — cached 1hr)
  → Check CRL for revocation (needs registry CRL endpoint — cached with TTL)
  → Verify PoP (X-Claw-Timestamp + X-Claw-Nonce + X-Claw-Signature)
  → Check nonce for replay (in-memory cache)
  → Assert agent is known/trusted (trust store — D1/KV)
  → Validate agent access token (POST to registry /v1/auth/agent/validate — NO cache)
  → Route to handler

Issues

🔴 P0: Registry is a single point of failure

File: apps/proxy/src/auth-middleware.ts

Every authenticated request requires the registry for at least one of:

  • Signing keys fetch (/.well-known/claw-keys.json)
  • CRL fetch (/v1/crl)
  • Agent access token validation (/v1/auth/agent/validate) — called on every /hooks/agent and /relay/connect request with NO caching

If registry.clawdentity.com goes down, ALL proxies return 503 on every request. Every agent-to-agent message globally stops.

The agent access validation call (line ~520) is the worst offender — it is never cached and hits the registry synchronously on every single request.

Fix:

  1. Cache agent access validation with short TTL (30-60s). Same access token + agentDid + aitJti = same result within TTL window
  2. Circuit breaker on registry calls: after N consecutive failures (e.g., 5), open circuit for M seconds (e.g., 30s). During open circuit, use cached/stale data with degraded trust
  3. Stale-while-revalidate for signing keys: serve cached keys even if refresh fails, with a maximum staleness window (e.g., 24 hours)
  4. Local fallback for CRL: if registry CRL is unavailable AND cache is stale beyond max age, make a policy decision (fail-open with logging vs fail-closed). Current staleBehavior config handles this partially but the signing keys path does not

🔴 P0: Agent access validation has zero caching

File: apps/proxy/src/auth-middleware.ts (lines ~510-545)

// This runs on EVERY /hooks/agent and /relay/connect request:
validateResponse = await fetchImpl(agentAuthValidateUrl, {
  method: "POST",
  headers: { ... },
  body: JSON.stringify({ agentDid: claims.sub, aitJti: claims.jti }),
});

This is a synchronous HTTP call to the registry with no caching. Under load, this adds latency to every message and creates a hard dependency on registry availability.

Fix:

  • Cache validation result keyed on (accessToken, agentDid, aitJti) with 30-60s TTL
  • On cache hit, skip the HTTP call
  • On cache miss or expiry, validate and cache result
  • On registry failure with a still-valid cache entry, use the cached result and log a warning

Sketch (helper names `authorizeAgent` and `validateWithRegistry` are hypothetical):

const cacheKey = `${accessToken}:${claims.sub}:${claims.jti}`;
const cached = agentAccessCache.get(cacheKey);
if (cached && cached.validUntilMs > clock()) {
  // Cache hit: skip the registry round-trip entirely
  authorizeAgent(cached.result);
} else {
  // Cache miss or expiry: validate with registry, then cache for the TTL window
  const result = await validateWithRegistry(claims);
  agentAccessCache.set(cacheKey, { result, validUntilMs: clock() + ttlMs });
  authorizeAgent(result);
}

🟠 P1: Signing keys cache has no stale-while-revalidate

File: apps/proxy/src/auth-middleware.ts

The signing keys cache checks clock() - fetchedAtMs <= registryKeysCacheTtlMs (1 hour TTL). If the cache expires and the registry is down, the next request gets a hard 503.

The CRL cache has staleBehavior config, but signing keys do not. Since keys rotate very rarely (months/years), serving a 2-hour-stale key is far better than failing all auth.

Fix:

  • Add stale-while-revalidate: serve cached keys even after TTL expires, trigger background refresh
  • Add maximum staleness: reject keys only after a hard limit (e.g., 24 hours stale)
  • Log warnings when serving stale keys
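A minimal stale-while-revalidate sketch for the signing-keys path. The helper names (`getSigningKeys`, `fetchKeys`) and the cache shape are assumptions; the 1-hour TTL and 24-hour hard limit match the numbers above:

```typescript
type KeysEntry = { keys: unknown; fetchedAtMs: number };

const TTL_MS = 60 * 60 * 1000;            // 1 hour, per the existing cache
const MAX_STALE_MS = 24 * 60 * 60 * 1000; // hard staleness limit

async function getSigningKeys(
  cache: { entry?: KeysEntry },
  fetchKeys: () => Promise<unknown>,
  clock: () => number,
): Promise<unknown> {
  const entry = cache.entry;
  if (entry && clock() - entry.fetchedAtMs <= TTL_MS) return entry.keys;

  try {
    const keys = await fetchKeys();
    cache.entry = { keys, fetchedAtMs: clock() };
    return keys;
  } catch (err) {
    // Refresh failed: serve stale keys up to the hard limit, with a warning.
    if (entry && clock() - entry.fetchedAtMs <= MAX_STALE_MS) {
      console.warn("serving stale signing keys", { ageMs: clock() - entry.fetchedAtMs });
      return entry.keys;
    }
    throw err; // beyond max staleness: hard failure (503) is the right call
  }
}
```

A true stale-while-revalidate would also kick off the refresh in the background on the stale-serve path rather than awaiting it; the error-path fallback above is the part the current code is missing.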

🟠 P1: Nonce cache is in-memory — DO eviction loses replay protection

File: apps/proxy/src/auth-middleware.ts

createNonceCache() is an in-memory store. When the Cloudflare Durable Object is evicted (idle timeout, region migration), the nonce cache is lost. Between eviction and re-population, previously-seen nonces are accepted again, opening a replay attack window.

The window is bounded by the timestamp skew (300s), but within that window, a captured request can be replayed after DO eviction.

Fix:

  • Persist nonces to DO storage (KV or storage API) instead of in-memory only
  • Or use a hybrid: in-memory for fast lookup, periodic flush to durable storage
  • On DO wake, load recent nonces from storage before accepting requests
  • Size consideration: with 300s skew window, even at 100 req/s, that is only ~30K nonces to persist
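A sketch of the hybrid approach, with a pluggable `Store` interface standing in for the Durable Object storage API (the interface and class names are assumptions; the 300s window comes from the skew limit above):

```typescript
interface Store {
  put(nonce: string, expiresAtMs: number): Promise<void>;
  loadAll(): Promise<Array<[string, number]>>;
}

class NonceCache {
  private seen = new Map<string, number>(); // nonce -> expiry (ms)

  constructor(private store: Store, private clock: () => number = Date.now) {}

  // On DO wake: load recent nonces from durable storage before accepting requests.
  async warmUp(): Promise<void> {
    for (const [nonce, exp] of await this.store.loadAll()) {
      if (exp > this.clock()) this.seen.set(nonce, exp);
    }
  }

  // Returns false on replay; records in memory and persists otherwise.
  async checkAndRecord(nonce: string, windowMs = 300_000): Promise<boolean> {
    const now = this.clock();
    const exp = this.seen.get(nonce);
    if (exp !== undefined && exp > now) return false; // replay detected
    const expiresAt = now + windowMs;
    this.seen.set(nonce, expiresAt);
    await this.store.put(nonce, expiresAt); // durable write survives eviction
    return true;
  }
}
```

The per-request write could instead be batched (periodic flush) to trade a small persistence lag for lower storage write volume.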

🟠 P1: No circuit breaker on registry calls

File: apps/proxy/src/auth-middleware.ts

Each request independently calls the registry. If the registry is returning 500s, every concurrent request makes its own failing call. Under load (100 agents, 10 msgs/sec each), that is 1000 failing HTTP calls per second to a down registry — making recovery harder.

Fix:

  • Implement circuit breaker pattern with three states:
    • Closed (normal): all calls go through
    • Open (after N failures in M seconds): skip registry call, use cached data or reject fast
    • Half-open (after cooldown): allow one probe request to test recovery
  • Apply to: signing keys fetch, CRL fetch, and agent access validation independently
  • Log state transitions for observability
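A minimal three-state breaker sketch. The thresholds mirror the suggested numbers (5 failures, 30s cooldown); the class and method names are hypothetical, and one instance would be created per registry endpoint:

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAtMs = 0;

  constructor(
    private maxFailures = 5,
    private cooldownMs = 30_000,
    private clock: () => number = Date.now,
  ) {}

  // Call before each registry request. Open -> half-open after cooldown.
  canRequest(): boolean {
    if (this.state === "open" && this.clock() - this.openedAtMs >= this.cooldownMs) {
      this.state = "half-open"; // allow one probe request
      console.warn("circuit half-open: probing registry");
    }
    return this.state !== "open";
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  onFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.maxFailures) {
      this.state = "open";
      this.openedAtMs = this.clock();
      console.warn("circuit opened after registry failures", { failures: this.failures });
    }
  }
}
```

While the circuit is open, the caller falls back to cached/stale data where available, or rejects fast instead of waiting on a doomed HTTP call.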

🟡 P2: No timeout on registry calls

File: apps/proxy/src/auth-middleware.ts

Registry HTTP calls (signing keys, CRL, agent access validation) use bare fetchImpl() with no abort signal or timeout. A slow registry (e.g., 30s response time under load) blocks the entire auth pipeline.

Fix:

  • Add AbortSignal.timeout(5_000) (or configurable) to all registry calls
  • A slow registry should fail fast, not propagate latency to every message
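The wrapper is a one-liner; `AbortSignal.timeout()` is standard in modern runtimes, including Cloudflare Workers. The function name and 5s default are assumptions:

```typescript
// Bounded registry fetch: aborts the request if the registry takes too long.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 5_000,
): Promise<Response> {
  return fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
}
```

On timeout the promise rejects with a `TimeoutError` `DOMException`, which the caller can treat the same as a registry failure for circuit-breaker accounting.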

🟡 P2: Clock skew boundary is unmonitored

File: apps/proxy/src/auth-middleware.ts

The 300s (5 min) skew window is generous but there is no alerting or logging when agents consistently operate near the boundary. A drift trend is invisible until it crosses the threshold and breaks auth completely.

Fix:

  • Log a warning when timestamp skew exceeds 200s (approaching the 300s limit)
  • Include skew value in auth success logs for monitoring
  • Consider a /v1/time endpoint for agents to self-calibrate
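A sketch of the boundary check with near-limit warning; the 200s/300s thresholds come from the fix above, and the function name is hypothetical:

```typescript
const MAX_SKEW_MS = 300_000;  // existing hard limit (5 min)
const WARN_SKEW_MS = 200_000; // warn when approaching the limit

function checkTimestampSkew(tsMs: number, nowMs: number): { ok: boolean; skewMs: number } {
  const skewMs = Math.abs(nowMs - tsMs);
  if (skewMs > MAX_SKEW_MS) return { ok: false, skewMs };
  if (skewMs > WARN_SKEW_MS) {
    console.warn("timestamp skew approaching limit", { skewMs });
  }
  return { ok: true, skewMs }; // attach skewMs to auth success logs for trending
}
```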

🟡 P2: Trust store unavailability returns 503 — no degraded mode

File: apps/proxy/src/trust-policy.ts

If the trust store (D1/KV) is transiently unavailable, assertKnownTrustedAgent throws 503. There is no cached trust state to fall back on.

Fix:

  • Cache trust lookups in-memory with short TTL (60-120s)
  • On trust store failure with valid cache, use cached trust decision
  • On trust store failure with no cache, fail closed (403) rather than 503 (more informative for sender)
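A sketch of the cached lookup with fail-closed fallback. `lookupTrust` stands in for the real D1/KV query and the function name is an assumption; the 90s TTL sits inside the suggested 60-120s range:

```typescript
type TrustEntry = { trusted: boolean; validUntilMs: number };

async function isTrusted(
  agentDid: string,
  cache: Map<string, TrustEntry>,
  lookupTrust: (did: string) => Promise<boolean>,
  clock: () => number = Date.now,
  ttlMs = 90_000,
): Promise<boolean> {
  const hit = cache.get(agentDid);
  if (hit && hit.validUntilMs > clock()) return hit.trusted;
  try {
    const trusted = await lookupTrust(agentDid);
    cache.set(agentDid, { trusted, validUntilMs: clock() + ttlMs });
    return trusted;
  } catch (err) {
    if (hit) {
      // Degraded mode: trust store down, reuse the last known decision.
      console.warn("trust store unavailable, using cached decision", { agentDid });
      return hit.trusted;
    }
    return false; // no cache: fail closed (403) rather than 503
  }
}
```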

Acceptance Criteria

  • Agent access validation cached with 30-60s TTL — reduces registry load by ~95%
  • Circuit breaker on all registry calls — prevents thundering herd on registry failure
  • Signing keys cache serves stale keys during registry outage (up to 24hr)
  • Nonce cache persisted to DO storage — replay protection survives eviction
  • All registry calls have explicit timeout (5s)
  • Clock skew near-boundary logged as warning
  • Trust lookups cached with TTL for transient D1/KV failures

Test Scenarios

  1. Registry goes down for 10 min → proxy serves requests using cached keys/CRL/access validation, logs warnings, no 503 for cached agents
  2. Registry returns 500s under load → circuit breaker opens after 5 failures, stops hammering registry, probes recovery every 30s
  3. DO evicted and restarted → nonces reloaded from storage, replay protection intact
  4. Registry responds in 15s → auth timeout fires at 5s, request fails fast
  5. Clock skew reaches 250s → warning logged, admin alerted before auth breaks at 300s
  6. D1 trust store returns 500 → cached trust used for known agents, new agents rejected

    Labels: enhancement (New feature or request), resilience (System resilience and fault tolerance)
