feat: Gateway heartbeat endpoint, reduced stale TTL, and optional announce webhook #147
Description
Summary
The AgentWorlds platform is moving to a Gateway-as-single-source-of-truth architecture for world runtime liveness. This requires three protocol-level changes in the Gateway and SDK.
Spec: https://gist.github.com/Jing-yilin/c2777c4b46fe0d52692ec159ba6e5d93 (Phase 2)
1. POST /peer/heartbeat — Lightweight liveness signal
Problem: The only liveness signal today is a full POST /peer/announce (Ed25519-signed, with the complete identity/endpoints/capabilities payload). Running it every 30s is overkill: it uses an expensive registration path as a lease-renewal path.
Solution: Add a lightweight heartbeat endpoint that only refreshes lastSeen:
```javascript
// gateway/server.mjs
peer.post("/peer/heartbeat", async (req, reply) => {
  const { agentId, ts, signature } = req.body;
  const agent = registry.get(agentId);
  if (!agent?.publicKey) return reply.code(404).send({ error: "Unknown agent" });
  if (!verifyWithDomainSeparator(DOMAIN_SEPARATORS.HEARTBEAT, agent.publicKey, { agentId, ts }, signature)) {
    return reply.code(403).send({ error: "Invalid signature" });
  }
  agent.lastSeen = Date.now();
  // Do NOT trigger saveRegistry(): memory only
  return { ok: true };
});
```
SDK changes:
- Add `DOMAIN_SEPARATORS.HEARTBEAT = "aw:hb:"` in `crypto.ts`
- Add `sendHeartbeat()` in `gateway-announce.ts`
- `startGatewayAnnounce()`: full announce every 10min (unchanged) + heartbeat every 30s (new)
- `createWorldServer()`: automatically starts heartbeat alongside announce
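As a rough illustration of why a dedicated heartbeat separator matters, the sketch below signs a heartbeat body with Node's built-in Ed25519 support and shows the shape `sendHeartbeat()` might POST. The byte encoding (separator string followed by `JSON.stringify` of the payload), the `ANNOUNCE` separator value, and the helper names `encodeWithDomain`, `signWithDomainSeparator`, and `buildHeartbeat` are assumptions made for this example; the SDK's actual `crypto.ts` may canonicalize and encode differently.

```javascript
// Hypothetical sketch: domain-separated Ed25519 signing for a heartbeat.
// The encoding (separator + JSON.stringify) is an assumption, not the SDK's.
const nodeCrypto = require("node:crypto");

const DOMAIN_SEPARATORS = {
  ANNOUNCE: "aw:ann:", // assumed value, included only for contrast
  HEARTBEAT: "aw:hb:", // value proposed in this issue
};

// Prefix the serialized payload with the separator so a signature produced
// for one message type cannot be replayed as another message type.
function encodeWithDomain(separator, payload) {
  return Buffer.from(separator + JSON.stringify(payload), "utf8");
}

function signWithDomainSeparator(separator, privateKey, payload) {
  // Ed25519 takes no digest algorithm, so the first argument is null.
  return nodeCrypto
    .sign(null, encodeWithDomain(separator, payload), privateKey)
    .toString("base64");
}

function verifyWithDomainSeparator(separator, publicKey, payload, signature) {
  return nodeCrypto.verify(
    null,
    encodeWithDomain(separator, payload),
    publicKey,
    Buffer.from(signature, "base64"),
  );
}

// Build the body that a sendHeartbeat() could POST to /peer/heartbeat.
function buildHeartbeat(agentId, privateKey, now = Date.now()) {
  const payload = { agentId, ts: now };
  const signature = signWithDomainSeparator(
    DOMAIN_SEPARATORS.HEARTBEAT,
    privateKey,
    payload,
  );
  return { ...payload, signature };
}
```

With this shape, a signature made for a heartbeat does not verify under the announce separator, so the lightweight message cannot be passed off as a full registration. Note that the handler proposed above does not check `ts` for freshness; whether to bound its age is left open here.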
2. Reduce Gateway stale TTL: 15min → 90s
Problem: With Gateway as the sole liveness source, a crashed world stays visible for up to 15 minutes. This is unacceptable for a live directory.
Solution:
```javascript
const DEFAULT_STALE_TTL_MS = 90 * 1000; // was: 15 * 60 * 1000
```
Persistence adjustment: With a 90s TTL and 30s heartbeats, lastSeen updates are frequent. Heartbeats should update memory only, with a disk snapshot every 30-60s for crash recovery:
```javascript
// Heartbeat: memory only (no saveRegistry)
// Announce: triggers saveRegistry (existing behavior)
// New: periodic snapshot every 30s for crash recovery
let _snapshotTimer = setInterval(() => {
  if (registryModified) { writeRegistry(); }
}, 30_000);
```
3. Optional announce webhook
Problem: When AgentWorlds deploys a world via SSM, the platform needs to know when the world has successfully registered with Gateway. Currently there is no callback mechanism.
Solution: Fire a webhook on first-seen announce (edge-triggered, idempotent):
```javascript
const WEBHOOK_URL = process.env.WEBHOOK_URL || null;

// In upsertAgent():
if (isFirstSeen && WEBHOOK_URL) {
  fetch(WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ event: "world.announced", agentId, worldId, ts: Date.now() }),
    signal: AbortSignal.timeout(5000),
  }).catch(() => {}); // best-effort, not blocking
}
```
- No `WEBHOOK_URL` → no webhook fired (local dev friendly)
- Idempotent: only on first-seen after boot or after TTL expiry
- Best-effort: fire-and-forget, not a critical path
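The snippet above relies on an `isFirstSeen` flag without showing how it is derived. One possible reading, sketched below, treats an agent as first-seen whenever the in-memory registry holds no entry for it, which covers both a fresh boot and re-registration after the entry was pruned for exceeding the TTL. `createRegistry`, `pruneStale`, and the injected `notify` callback (standing in for the `fetch` to `WEBHOOK_URL`) are hypothetical names used only for this illustration.

```javascript
// Hypothetical sketch of edge-triggered first-seen detection.
// notify() stands in for the best-effort webhook call.
function createRegistry({ staleTtlMs = 90_000, notify = () => {} } = {}) {
  const agents = new Map();

  function upsertAgent(agentId, worldId, now = Date.now()) {
    // First-seen: no entry exists, either because the agent never announced
    // since boot or because its entry was pruned after TTL expiry.
    const isFirstSeen = !agents.has(agentId);
    agents.set(agentId, { worldId, lastSeen: now });
    if (isFirstSeen) {
      notify({ event: "world.announced", agentId, worldId, ts: now });
    }
    return isFirstSeen;
  }

  // Drop every agent whose last signal is older than the TTL.
  function pruneStale(now = Date.now()) {
    for (const [agentId, agent] of agents) {
      if (now - agent.lastSeen > staleTtlMs) agents.delete(agentId);
    }
  }

  return { upsertAgent, pruneStale, size: () => agents.size };
}
```

Repeated announces from a live world leave the flag false, so the platform receives one event per registration edge rather than one per announce.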
4. Hand-written gateway/openapi.yaml
Add an OpenAPI 3.1 spec covering the 7 public Gateway endpoints:
- `GET /health`
- `GET /worlds`
- `GET /world/{worldId}`
- `GET /agents`
- `POST /peer/announce`
- `POST /peer/heartbeat` (new)
- `WS /ws` (document as info)
This allows AgentWorlds (and other consumers) to generate TypeScript types from the spec instead of hand-writing interfaces that drift out of sync.
Checklist
- Add `DOMAIN_SEPARATORS.HEARTBEAT` to SDK `crypto.ts`
- Add `sendHeartbeat()` to SDK `gateway-announce.ts`
- Integrate heartbeat into `startGatewayAnnounce()` (30s interval)
- Integrate heartbeat into `createWorldServer()`
- Add `POST /peer/heartbeat` to `gateway/server.mjs`
- Reduce `DEFAULT_STALE_TTL_MS` to 90s in `gateway/server.mjs`
- Add periodic disk snapshot (30s) in `gateway/server.mjs`
- Heartbeat updates memory only, not disk
- Add optional `WEBHOOK_URL` env + first-seen webhook in `gateway/server.mjs`
- Create `gateway/openapi.yaml`
- Tests for heartbeat endpoint
- Tests for stale TTL pruning at 90s
- Update `gateway/Dockerfile` with `WEBHOOK_URL` env documentation
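For the "periodic disk snapshot" and "memory only" items, a small dirty-flag helper keeps the behavior testable without real timers or disk access. This is a sketch under assumptions: `createSnapshotter`, `markModified`, and `tick` are hypothetical names, the write function is injected, and it assumes heartbeats mark the registry as modified so that refreshed `lastSeen` values reach the next snapshot. The issue does not state whether `writeRegistry()` clears the flag itself, so the sketch clears it explicitly.

```javascript
// Hypothetical sketch: heartbeats flip an in-memory dirty flag, and a
// periodic tick writes at most one snapshot per interval.
function createSnapshotter(writeRegistry, intervalMs = 30_000) {
  let registryModified = false;
  let timer = null;

  return {
    // Called from the heartbeat handler: no disk I/O happens here.
    markModified() {
      registryModified = true;
    },
    // One snapshot tick: write only if something changed since the last write.
    tick() {
      if (!registryModified) return false;
      writeRegistry();
      registryModified = false;
      return true;
    },
    start() {
      timer = setInterval(() => this.tick(), intervalMs);
      timer.unref(); // do not keep the process alive just for snapshots
    },
    stop() {
      clearInterval(timer);
      timer = null;
    },
  };
}
```

Any number of heartbeats between two ticks collapses into a single write, which is the property a test for "heartbeat updates memory only, not disk" would want to assert.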