feat: Gateway heartbeat endpoint, reduced stale TTL, and optional announce webhook #147

@Jing-yilin

Summary

The AgentWorlds platform is moving to a Gateway-as-single-source-of-truth architecture for world runtime liveness. This requires three protocol-level changes in the Gateway and SDK (sections 1-3), plus a hand-written OpenAPI spec for the Gateway (section 4).

Spec: https://gist.github.com/Jing-yilin/c2777c4b46fe0d52692ec159ba6e5d93 (Phase 2)


1. POST /peer/heartbeat — Lightweight liveness signal

Problem: The only liveness signal today is a full POST /peer/announce (Ed25519-signed, carrying the complete identity/endpoints/capabilities payload). Running it every 30s is overkill: it uses an expensive registration path as a lease-renewal path.

Solution: Add a lightweight heartbeat endpoint that only refreshes lastSeen:

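// gateway/server.mjs
const HEARTBEAT_MAX_SKEW_MS = 60_000; // assumed replay window, not part of the spec; ts assumed to be epoch ms

peer.post("/peer/heartbeat", async (req, reply) => {
  const { agentId, ts, signature } = req.body ?? {};
  const agent = registry.get(agentId);
  if (!agent?.publicKey) return reply.code(404).send({ error: "Unknown agent" });

  // Reject stale or future timestamps so a captured heartbeat cannot be replayed indefinitely
  if (!Number.isFinite(ts) || Math.abs(Date.now() - ts) > HEARTBEAT_MAX_SKEW_MS) {
    return reply.code(400).send({ error: "Invalid timestamp" });
  }

  if (!verifyWithDomainSeparator(DOMAIN_SEPARATORS.HEARTBEAT, agent.publicKey, { agentId, ts }, signature)) {
    return reply.code(403).send({ error: "Invalid signature" });
  }

  agent.lastSeen = Date.now();
  // Do NOT trigger saveRegistry(): memory only
  return { ok: true };
});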
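The handler above relies on verifyWithDomainSeparator, which this issue does not define. As a rough illustration of what domain-separated Ed25519 signing could look like with Node's built-in crypto module, here is a minimal sketch. The function names, the sorted-key JSON encoding, and the base64 signature format are all assumptions made for the example; the real SDK helper in crypto.ts may encode payloads differently. Only the "aw:hb:" separator comes from this proposal.

```javascript
// In an ES module (e.g. server.mjs) this would be an `import` from "node:crypto".
const { generateKeyPairSync, sign, verify } = require("node:crypto");

// "aw:hb:" is the separator proposed in this issue; everything else is assumed.
const HEARTBEAT_SEPARATOR = "aw:hb:";

// Hypothetical canonical encoding: separator followed by JSON with sorted keys,
// so signer and verifier always hash identical bytes.
function encodeForSigning(separator, payload) {
  const sorted = Object.fromEntries(
    Object.keys(payload).sort().map((k) => [k, payload[k]])
  );
  return Buffer.from(separator + JSON.stringify(sorted), "utf8");
}

function signWithSeparator(separator, privateKey, payload) {
  // Ed25519 takes no separate digest algorithm, hence `null`.
  return sign(null, encodeForSigning(separator, payload), privateKey).toString("base64");
}

function verifyWithSeparator(separator, publicKey, payload, signatureB64) {
  try {
    return verify(
      null,
      encodeForSigning(separator, payload),
      publicKey,
      Buffer.from(signatureB64, "base64")
    );
  } catch {
    return false; // missing or malformed signature
  }
}

// Usage: a heartbeat signature does not verify under any other separator.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const payload = { agentId: "agent-1", ts: 1700000000000 };
const sig = signWithSeparator(HEARTBEAT_SEPARATOR, privateKey, payload);
```

The point of the separator is that a signature produced for a heartbeat cannot be replayed against another signed endpoint (for example announce), because the signed bytes differ.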

SDK changes:

  • Add DOMAIN_SEPARATORS.HEARTBEAT = "aw:hb:" in crypto.ts
  • Add sendHeartbeat() in gateway-announce.ts
  • startGatewayAnnounce(): full announce every 10min (unchanged) + heartbeat every 30s (new)
  • createWorldServer(): automatically starts heartbeat alongside announce
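The two-cadence loop described in the SDK changes could be structured roughly as follows. This is a sketch under assumptions, not the SDK's actual startGatewayAnnounce(): the name startLivenessLoop, the callback-based options, and the injectable timers object are hypothetical and exist only to keep the example self-contained and testable.

```javascript
const ANNOUNCE_INTERVAL_MS = 10 * 60 * 1000; // full announce, unchanged
const HEARTBEAT_INTERVAL_MS = 30 * 1000;     // new lightweight heartbeat

// `announce` and `heartbeat` are hypothetical callbacks, e.g. wrappers around
// POST /peer/announce and POST /peer/heartbeat. `timers` is injectable for tests.
function startLivenessLoop(
  { announce, heartbeat, onError = () => {} },
  timers = { setInterval, clearInterval }
) {
  // Wrap each callback so a failed request never becomes an unhandled rejection.
  const safe = (fn) => () => Promise.resolve().then(fn).catch(onError);

  // Announce once up front: the Gateway needs the public key from a full
  // announce before it can verify any heartbeat.
  safe(announce)();

  const handles = [
    timers.setInterval(safe(announce), ANNOUNCE_INTERVAL_MS),
    timers.setInterval(safe(heartbeat), HEARTBEAT_INTERVAL_MS),
  ];

  return function stop() {
    handles.forEach((h) => timers.clearInterval(h));
  };
}
```

Returning a stop function lets createWorldServer() tear down both timers on shutdown, so a stopped world ages out of the Gateway within one TTL instead of continuing to heartbeat.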

2. Reduce Gateway stale TTL: 15min → 90s

Problem: With Gateway as the sole liveness source, a crashed world stays visible for up to 15 minutes. This is unacceptable for a live directory.

Solution:

const DEFAULT_STALE_TTL_MS = 90 * 1000;  // was: 15 * 60 * 1000
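To make the effect of the new constant concrete, here is a minimal sketch of TTL-based pruning. It assumes the registry is a Map of agentId to an object with a lastSeen timestamp in epoch milliseconds; pruneStale is a hypothetical name and the Gateway's existing pruning code may be shaped differently.

```javascript
const DEFAULT_STALE_TTL_MS = 90 * 1000;

// Assumed registry shape: Map of agentId -> { lastSeen: epoch ms, ... }.
// Removes every entry older than the TTL and returns the removed ids.
function pruneStale(registry, now = Date.now(), ttlMs = DEFAULT_STALE_TTL_MS) {
  const removed = [];
  for (const [agentId, agent] of registry) {
    if (now - agent.lastSeen > ttlMs) {
      registry.delete(agentId); // deleting during Map iteration is safe in JS
      removed.push(agentId);
    }
  }
  return removed;
}

// Usage: at t = 100s, an agent last seen at t = 0s is pruned,
// while one last seen at t = 60s (40s ago) is kept.
const exampleRegistry = new Map([
  ["crashed-world", { lastSeen: 0 }],
  ["live-world", { lastSeen: 60_000 }],
]);
const prunedIds = pruneStale(exampleRegistry, 100_000);
```

With heartbeats every 30s, a 90s TTL spans three heartbeat intervals, so a single dropped heartbeat does not remove a healthy world from the directory.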

Persistence adjustment: With a 90s TTL (three heartbeat intervals) and 30s heartbeats, lastSeen changes far more often than before. Heartbeats should update the in-memory registry only; a periodic disk snapshot every 30-60s covers crash recovery:

// Heartbeat: memory only (no saveRegistry)
// Announce: triggers saveRegistry (existing behavior)
// New: periodic snapshot every 30s for crash recovery
const _snapshotTimer = setInterval(() => {
  if (!registryModified) return;
  writeRegistry();
  registryModified = false; // assumes writeRegistry() does not already clear the flag
}, 30_000);
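One way to keep the dirty-flag bookkeeping in one place is a small snapshot helper, sketched below. The names createSnapshotter, markDirty, and flush are hypothetical, and write stands in for whatever serializes the registry to disk. Note the design question it surfaces: if heartbeats never mark the registry dirty, their lastSeen updates are only persisted when something else triggers a write.

```javascript
// `write` is a hypothetical synchronous function that persists the registry.
// `onError` receives write failures instead of letting them crash the timer.
function createSnapshotter(write, onError = console.error) {
  let dirty = false;
  return {
    // Call from any code path that changes state worth persisting.
    markDirty() {
      dirty = true;
    },
    // Intended to run from a timer, e.g. setInterval(snapshotter.flush, 30_000).
    // Returns true only when a snapshot was actually written.
    flush() {
      if (!dirty) return false;
      dirty = false; // clear first so changes made during write() are kept
      try {
        write();
        return true;
      } catch (err) {
        dirty = true; // retry on the next tick
        onError(err);
        return false;
      }
    },
  };
}
```

Because flush() is a no-op while nothing has changed, an idle Gateway performs no disk writes, and a failed write is retried on the next interval rather than lost.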

3. Optional announce webhook

Problem: When AgentWorlds deploys a world via SSM, the platform needs to know when the world has successfully registered with Gateway. Currently there is no callback mechanism.

Solution: Fire a webhook on first-seen announce (edge-triggered, idempotent):

const WEBHOOK_URL = process.env.WEBHOOK_URL || null;

// In upsertAgent():
if (isFirstSeen && WEBHOOK_URL) {
  fetch(WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ event: "world.announced", agentId, worldId, ts: Date.now() }),
    signal: AbortSignal.timeout(5000),
  }).catch(() => {});  // best-effort, not blocking
}
  • No WEBHOOK_URL → no webhook fired (local dev friendly)
  • Idempotent: only on first-seen after boot or after TTL expiry
  • Best-effort: fire-and-forget, not a critical path
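The edge-triggered behavior described above can be illustrated with a small sketch of how isFirstSeen might be derived. The registry shape (a Map of agentId to { lastSeen, worldId }), the function signature, and the injected notify callback are assumptions for the example; in the Gateway, notify would be the fire-and-forget fetch shown earlier.

```javascript
// Assumed registry shape: Map of agentId -> { lastSeen: epoch ms, worldId }.
// `notify` stands in for the best-effort webhook POST.
function upsertAgent(registry, { agentId, worldId }, notify, now = Date.now(), ttlMs = 90_000) {
  const existing = registry.get(agentId);

  // First-seen means: never registered since boot, or the previous entry
  // has outlived its TTL (whether or not the pruner has removed it yet).
  const isFirstSeen = !existing || now - existing.lastSeen > ttlMs;

  registry.set(agentId, { ...existing, worldId, lastSeen: now });

  if (isFirstSeen) {
    notify({ event: "world.announced", agentId, worldId, ts: now });
  }
  return isFirstSeen;
}

// Usage: only the first announce and the announce after expiry fire the event.
const events = [];
const agents = new Map();
const world = { agentId: "agent-1", worldId: "world-1" };
const first = upsertAgent(agents, world, (e) => events.push(e), 0);
const repeat = upsertAgent(agents, world, (e) => events.push(e), 30_000);
const afterExpiry = upsertAgent(agents, world, (e) => events.push(e), 30_000 + 90_001);
```

Since delivery is best-effort and the event can fire again after a Gateway restart or a TTL lapse, a receiver would likely want to treat world.announced as a hint and deduplicate on agentId and worldId rather than rely on exactly-once delivery.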

4. Hand-written gateway/openapi.yaml

Add an OpenAPI 3.1 spec covering the 7 public Gateway endpoints:

  • GET /health
  • GET /worlds
  • GET /world/{worldId}
  • GET /agents
  • POST /peer/announce
  • POST /peer/heartbeat (new)
  • WS /ws (described in the spec's info text only, since OpenAPI has no native WebSocket operation type)

This allows AgentWorlds (and other consumers) to generate TypeScript types from the spec instead of hand-writing interfaces that drift out of sync.


Checklist

  • Add DOMAIN_SEPARATORS.HEARTBEAT to SDK crypto.ts
  • Add sendHeartbeat() to SDK gateway-announce.ts
  • Integrate heartbeat into startGatewayAnnounce() (30s interval)
  • Integrate heartbeat into createWorldServer()
  • Add POST /peer/heartbeat to gateway/server.mjs
  • Reduce DEFAULT_STALE_TTL_MS to 90s in gateway/server.mjs
  • Add periodic disk snapshot (30s) in gateway/server.mjs
  • Heartbeat updates memory only, not disk
  • Add optional WEBHOOK_URL env + first-seen webhook in gateway/server.mjs
  • Create gateway/openapi.yaml
  • Tests for heartbeat endpoint
  • Tests for stale TTL pruning at 90s
  • Update gateway/Dockerfile with WEBHOOK_URL env documentation

Labels: enhancement (New feature or request)