Skip to content

Worker stalls and stale completed status cause repeated irrelevant replies #258

@vsumner

Description

@vsumner

Summary

Channel behavior can drift into two user-visible failure modes:

  1. background workers stall indefinitely (no completion, no progress)
  2. stale "Recently Completed" worker results pollute channel context, causing repeated irrelevant replies (e.g. repeated hockey output after topic change)

User-visible symptoms

  • Bot says it will fetch data, then never returns results.
  • Bot repeats old results (hockey) even when user asks unrelated follow-up questions.
  • Background workers remain in running state for long periods with no tool activity.

Observed evidence (local run)

  • worker_runs contained long-lived running rows with tool_calls=0.
  • Conversation timeline showed responses repeating earlier worker output after topic change.
  • Assistant itself reported confusing stale "Recently Completed" state with fresh results.

Suspected root causes

A) Stalled workers without watchdog timeout

  • Worker execution waits on agent.prompt(...).await with no wall-clock watchdog around the await.
  • If provider/tool flow hangs, worker can remain active indefinitely.

B) Stale completed worker items in status context

  • StatusBlock.completed_items is appended for worker completions and rendered into channel context.
  • Worker-completion retention is not bounded by recency/TTL, and trimming logic is asymmetric with branch path.
  • This can keep old summaries injected and bias future responses.

C) Retrigger delivery robustness gaps

  • Worker completion relies on event delivery + retrigger queueing.
  • Under queue/backpressure/state edge cases, completion relay can be delayed/dropped, leaving user without expected follow-up.

Proposed fixes

  1. Add worker watchdog timeout + stale-worker reap policy.
    • Mark timed-out workers failed/cancelled.
    • Emit explicit completion event and persist terminal status.
  2. Bound/expire StatusBlock.completed_items for worker entries.
    • Keep a short window (count + age) and prune on update/render.
  3. Harden retrigger durability for worker completion relays.
    • Ensure completion notifications are eventually delivered or persisted for replay.
    • Add telemetry for queued/dropped/retried retriggers.

Acceptance criteria

  • No worker remains running beyond configured watchdog without a terminal transition.
  • Old completed worker summaries do not appear in context past TTL/window.
  • Completion relay occurs within an expected bound after worker completion (or explicit failure path is surfaced).
  • Regression tests cover:
    • stalled worker timeout path
    • stale completed-item pruning behavior
    • retrigger retry/drop handling

Out of scope

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions