-
Notifications
You must be signed in to change notification settings - Fork 214
Open
Description
Summary
Channel behavior can drift into two user-visible failure modes:
- background workers stall indefinitely (no completion, no progress)
- stale "Recently Completed" worker results pollute channel context, causing repeated irrelevant replies (e.g. repeated hockey output after topic change)
User-visible symptoms
- Bot says it will fetch data, then never returns results.
- Bot repeats old results (hockey) even when user asks unrelated follow-up questions.
- Background workers remain in
runningstate for long periods with no tool activity.
Observed evidence (local run)
worker_runscontained long-livedrunningrows withtool_calls=0.- Conversation timeline showed responses repeating earlier worker output after topic change.
- Assistant itself reported confusing stale "Recently Completed" state with fresh results.
Suspected root causes
A) Stalled workers without watchdog timeout
- Worker execution waits on
agent.prompt(...).awaitwith no wall-clock watchdog around the await. - If provider/tool flow hangs, worker can remain active indefinitely.
B) Stale completed worker items in status context
StatusBlock.completed_itemsis appended for worker completions and rendered into channel context.- Worker-completion retention is not bounded by recency/TTL, and trimming logic is asymmetric with branch path.
- This can keep old summaries injected and bias future responses.
C) Retrigger delivery robustness gaps
- Worker completion relies on event delivery + retrigger queueing.
- Under queue/backpressure/state edge cases, completion relay can be delayed/dropped, leaving user without expected follow-up.
Proposed fixes
- Add worker watchdog timeout + stale-worker reap policy.
- Mark timed-out workers failed/cancelled.
- Emit explicit completion event and persist terminal status.
- Bound/expire
StatusBlock.completed_itemsfor worker entries.- Keep a short window (count + age) and prune on update/render.
- Harden retrigger durability for worker completion relays.
- Ensure completion notifications are eventually delivered or persisted for replay.
- Add telemetry for queued/dropped/retried retriggers.
Acceptance criteria
- No worker remains
runningbeyond configured watchdog without a terminal transition. - Old completed worker summaries do not appear in context past TTL/window.
- Completion relay occurs within an expected bound after worker completion (or explicit failure path is surfaced).
- Regression tests cover:
- stalled worker timeout path
- stale completed-item pruning behavior
- retrigger retry/drop handling
Out of scope
- Warm-recall cache hardening work in PR Harden warm-recall degraded fallback and cache consistency #257.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels