fix: skip per-tick stale recovery for local workers by tyxben · Pull Request #233 · Conway-Research/automaton

tyxben · 2026-02-26T05:19:06Z

Summary

Fixes a regression from #227 where the per-tick stale recovery causes an infinite loop for local workers:

1. Local workers (in-process async tasks) remove themselves from activeWorkers on completion.
2. The per-tick check sees them as "dead" and resets their task to pending.
3. The task gets re-assigned immediately → a new worker completes → is removed → is detected as dead → loop.
4. Each turn burns ~$0.03 (~15k tokens) doing nothing.
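The failure mode above, and the filter that breaks the loop, can be sketched roughly as follows. This is an illustration under assumptions, not the code in orchestrator.ts: the `Task` shape, the `findStaleTasks` helper, the `skipLocal` flag, and the `sandbox-42` address are hypothetical. Only `activeWorkers`, the `local://` prefix, and the assigned/pending states come from the description.

```typescript
// Hypothetical task shape; the real schema in the repository may differ.
type TaskStatus = "pending" | "assigned" | "completed";

interface Task {
  id: string;
  status: TaskStatus;
  workerAddress: string | null; // e.g. "local://worker-1" or a remote sandbox address
}

// Returns the tasks a per-tick check would treat as orphaned: still assigned,
// but their worker is no longer in the active set.
function findStaleTasks(
  tasks: Task[],
  activeWorkers: Set<string>,
  skipLocal: boolean,
): Task[] {
  return tasks.filter((task) => {
    if (task.status !== "assigned" || task.workerAddress === null) return false;
    // Local workers leave activeWorkers when they finish, so their absence
    // does not mean they died. With the fix enabled they are skipped.
    if (skipLocal && task.workerAddress.startsWith("local://")) return false;
    return !activeWorkers.has(task.workerAddress);
  });
}

const exampleTasks: Task[] = [
  { id: "t1", status: "assigned", workerAddress: "local://worker-1" },
  { id: "t2", status: "assigned", workerAddress: "sandbox-42" }, // hypothetical remote address
];

// Neither worker is in the active set: the local one removed itself,
// the remote one is assumed to have vanished.
const noActiveWorkers = new Set<string>();

const staleBeforeFix = findStaleTasks(exampleTasks, noActiveWorkers, false).map((t) => t.id);
const staleAfterFix = findStaleTasks(exampleTasks, noActiveWorkers, true).map((t) => t.id);
```

Without the filter both tasks are flagged, so the local task would be reset and re-assigned on every tick. With the filter only the remote task is flagged.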

Changes

1. Skip local:// workers in per-tick stale recovery (orchestrator.ts): only remote sandbox workers are checked.
2. One-time startup recovery for local workers (loop.ts): on process restart, reset stale local:// assigned tasks once (not every tick).
3. Validate goalId in loadState (orchestrator.ts): if the persisted goal is completed, cancelled, or missing, reset to idle.
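The goalId validation could look something like the sketch below. The `OrchestratorState` and `GoalStatus` types, the `validateLoadedState` helper, and the phase names are assumptions made for illustration. The PR only states that a completed, cancelled, or missing goal causes a reset to idle.

```typescript
// Hypothetical shapes for illustration only.
type GoalStatus = "active" | "completed" | "cancelled";

interface OrchestratorState {
  phase: string; // "idle" when nothing is in progress (assumed name)
  goalId: string | null;
}

// Returns the persisted state if its goal can still be resumed, else an idle state.
function validateLoadedState(
  persisted: OrchestratorState,
  goals: Map<string, GoalStatus>,
): OrchestratorState {
  if (persisted.goalId === null) return persisted;
  const status = goals.get(persisted.goalId);
  // A missing, completed, or cancelled goal cannot be resumed.
  if (status === undefined || status === "completed" || status === "cancelled") {
    return { phase: "idle", goalId: null };
  }
  return persisted;
}

const knownGoals = new Map<string, GoalStatus>([
  ["g-active", "active"],
  ["g-done", "completed"],
]);

const resumed = validateLoadedState({ phase: "executing", goalId: "g-active" }, knownGoals);
const resetCompleted = validateLoadedState({ phase: "executing", goalId: "g-done" }, knownGoals);
const resetMissing = validateLoadedState({ phase: "executing", goalId: "g-gone" }, knownGoals);
```

Returning a fresh idle state, rather than mutating the persisted one, keeps the check easy to test in isolation. Whether the real loadState does this is not stated in the PR.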

Test plan

pnpm typecheck passes
All 26 orchestrator tests pass (including the 2 fixed in 823ad70)
Manual test: start automaton, observe no "Recovering stale task from dead worker" loop
Manual test: startup recovery runs once, not repeated on subsequent wake cycles

Local workers are in-process async tasks that remove themselves from the activeWorkers map on completion. The per-tick stale recovery (introduced in e099808) treats this as a dead worker and resets the task to pending, causing an infinite assign→complete→reset loop that burns ~$0.03/turn.

Fix:
1. Skip local:// workers in per-tick stale recovery (remote only)
2. Add one-time startup recovery for local workers (process restart)
3. Validate goalId in loadState: reset if goal is completed/cancelled

…y-Research#233, Conway-Research#234)

PR Conway-Research#233, skip local worker stale recovery: Per-tick stale recovery now filters out local:// workers. Completed local workers remove themselves from the pool, making hasWorker() return false. Without this filter, the orchestrator enters an infinite assign → complete → detect-dead → reset → assign loop burning ~$0.03/turn.

PR Conway-Research#234, Plimsoll Transaction Guard (3 defense engines): New policy rule file at priority 450 (between path-protection and financial rules).

1. Trajectory Hash: FNV-1a fingerprint of (tool, target, amount) in a 60s sliding window. 3+ identical → deny. 2 → quarantine. Catches hallucination retry loops.
2. Capital Velocity: Cumulative spend across financial tools in a 5min sliding window. > … → deny. >80% → quarantine. Catches slow-bleed drain attacks.
3. Entropy Guard: Scans ALL tool args for Ethereum private keys (0x[hex]{64}), BIP-39 mnemonics (10+/12 words match), and high-entropy base64 blobs (Shannon >5.0 bits/char). Catches key exfiltration via exec, write_file, etc.

All engines are in-memory, zero-dependency, fail-open. To disable: remove one line from policy-rules/index.ts.
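As a rough illustration of the Trajectory Hash engine described above, the sketch below fingerprints (tool, target, amount) with 32-bit FNV-1a and counts repeats in a 60-second window. The `TrajectoryGuard` class, the `|` separator, the `Verdict` names, and the reading of the thresholds (the current call counts toward the total; two identical calls quarantine, three or more deny) are assumptions, not the Plimsoll implementation.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units.
// For ASCII input this matches the standard byte-wise FNV-1a.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, keeping 32 bits
  }
  return hash >>> 0; // reinterpret as unsigned
}

type Verdict = "allow" | "quarantine" | "deny";

class TrajectoryGuard {
  // fingerprint -> timestamps (ms) of calls still inside the window
  private seen = new Map<number, number[]>();
  private windowMs: number;

  constructor(windowMs: number = 60000) {
    this.windowMs = windowMs;
  }

  check(tool: string, target: string, amount: string, nowMs: number): Verdict {
    const key = fnv1a32(`${tool}|${target}|${amount}`);
    // Drop timestamps that have slid out of the window, then record this call.
    const recent = (this.seen.get(key) ?? []).filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.seen.set(key, recent);
    if (recent.length >= 3) return "deny";
    if (recent.length === 2) return "quarantine";
    return "allow";
  }
}

const guard = new TrajectoryGuard();
const verdicts: Verdict[] = [
  guard.check("transfer", "0xabc", "10", 0),
  guard.check("transfer", "0xabc", "10", 1000),
  guard.check("transfer", "0xabc", "10", 2000),
  guard.check("transfer", "0xabc", "10", 100000), // earlier calls have expired
];
```

A 32-bit hash can collide, so a production guard might key on the full tuple or a wider hash. The sketch keeps the 32-bit variant only because the commit message names FNV-1a without giving a width.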

…gnostics

Root cause: local workers are in-memory async tasks that die when pm2 restarts the process. They have no chance to write CRASHED to the DB, leaving ghost workers (status=running) and stuck tasks (status=assigned).

Fix 1: early worker logging (local-worker.ts). Workers now write a SPAWNED log to event_stream BEFORE claiming a task or calling inference. If the process dies during the first inference call, we at least see in dashboard diagnostics that the worker started.

Fix 2: startup recovery for local tasks (orchestrator.ts). On the first tick after startup, reset ALL tasks assigned to local:// workers back to pending. Local workers don't survive restarts, so their tasks MUST be reassigned. Uses a _localRecoveryDone flag to run only once (not every tick, avoiding the Conway-Research#233 regression).

Fix 3: dashboard diagnostics (diagnostics-snapshot.js).
- Stalled/zombie workers now show assigned tasks and likely cause
- New 'Orphaned local tasks' counter for tasks on dead workers
- Actionable message: auto-recovered on next restart
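The run-once startup recovery in Fix 2 might be structured along these lines. Only the `_localRecoveryDone` flag name, the `local://` prefix, and the assigned/pending states come from the commit message. The `OrchestratorSketch` class, the `AssignedTask` shape, and the choice to clear the worker address on reset are hypothetical.

```typescript
// Hypothetical task shape for illustration.
interface AssignedTask {
  id: string;
  status: "pending" | "assigned";
  workerAddress: string | null;
}

class OrchestratorSketch {
  private _localRecoveryDone = false;
  private tasks: AssignedTask[];

  constructor(tasks: AssignedTask[]) {
    this.tasks = tasks;
  }

  // Called on every tick; returns how many tasks were reset.
  // The recovery body runs on the first call only.
  tick(): number {
    if (this._localRecoveryDone) return 0;
    this._localRecoveryDone = true;
    let resetCount = 0;
    for (const task of this.tasks) {
      // After a restart no local worker can still be alive, so every task
      // assigned to one goes back to pending.
      if (task.status === "assigned" && task.workerAddress?.startsWith("local://")) {
        task.status = "pending";
        task.workerAddress = null;
        resetCount++;
      }
    }
    return resetCount;
  }
}

const recoveredTasks: AssignedTask[] = [
  { id: "a", status: "assigned", workerAddress: "local://worker-1" },
  { id: "b", status: "assigned", workerAddress: "sandbox-7" }, // hypothetical remote address
];
const sketch = new OrchestratorSketch(recoveredTasks);
const firstTickResets = sketch.tick();

// Simulate a fresh local assignment made after startup.
recoveredTasks[0].status = "assigned";
recoveredTasks[0].workerAddress = "local://worker-2";
const secondTickResets = sketch.tick();
```

Because the flag flips before the scan, later ticks leave live local assignments alone, which is what keeps the per-tick loop from returning.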

tyxben wants to merge 1 commit into Conway-Research:main from tyxben:fix/orchestrator-resilience
