Bug: Duplicate observations for same tool_call_id after crash recovery #2298

@csmith49

Summary

When a server crashes during tool execution and restarts, crash recovery emits an AgentErrorEvent for the interrupted action. However, if run() is then called again, the same action is re-executed, so the event log ends up with both an AgentErrorEvent and an ObservationEvent for the same tool_call_id.

Root Cause

get_unmatched_actions() in state.py:454 only checks for ObservationEvent and UserRejectObservation:

if isinstance(event, (ObservationEvent, UserRejectObservation)):
    observed_action_ids.add(event.action_id)

It does NOT check for AgentErrorEvent, so the action remains "unmatched" even after crash recovery emits an error for it.
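A minimal sketch of the miss, for illustration only. The dataclasses below are hypothetical stand-ins that model just the fields discussed in this issue, and the loop is a simplified approximation of the matching logic, not a copy of state.py:

```python
from dataclasses import dataclass, field
from uuid import uuid4


# Hypothetical stand-ins for the SDK event classes (assumption: only the
# fields relevant to this issue are modelled).
@dataclass
class ActionEvent:
    tool_call_id: str
    id: str = field(default_factory=lambda: uuid4().hex)


@dataclass
class ObservationEvent:
    action_id: str
    tool_call_id: str


@dataclass
class UserRejectObservation:
    action_id: str
    tool_call_id: str


@dataclass
class AgentErrorEvent:
    tool_call_id: str  # no action_id, as described in this issue


def get_unmatched_actions(events):
    """Simplified approximation of the current matching logic."""
    observed_action_ids = set()
    unmatched = []
    for event in reversed(events):
        if isinstance(event, (ObservationEvent, UserRejectObservation)):
            observed_action_ids.add(event.action_id)
        elif isinstance(event, ActionEvent) and event.id not in observed_action_ids:
            unmatched.insert(0, event)
    return unmatched


# Event log as it looks right after crash recovery.
action = ActionEvent(tool_call_id="X")
log = [action, AgentErrorEvent(tool_call_id="X")]

# The AgentErrorEvent is ignored, so the action is still reported as pending.
still_pending = get_unmatched_actions(log)
```

In this model still_pending contains the crashed action, which is what lets the next step() execute it again.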

Complete Control Flow

  1. ActionEvent created (tool_call_id=X)
  2. Server crashes during tool execution
  3. On restart, event_service.start() detects execution_status==RUNNING
  4. Crash recovery emits AgentErrorEvent (tool_call_id=X) - event_service.py:479-488
  5. Crash recovery sets execution_status=ERROR
  6. User calls run() again
  7. run() allows ERROR status to proceed - local_conversation.py:549-554:
    if self._state.execution_status in [IDLE, PAUSED, ERROR]:
        self._state.execution_status = RUNNING
  8. agent.step() calls get_unmatched_actions() which returns the action (because AgentErrorEvent is not checked)
  9. agent.step() calls _execute_actions() on the "pending" action - agent.py:264-271
  10. Tool executes and emits ObservationEvent (tool_call_id=X)
  11. Result: BOTH AgentErrorEvent AND ObservationEvent for same tool_call_id
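The end state in step 11 could be caught in a regression test with a check along these lines. This is a sketch under assumptions: events are reduced to (kind, tool_call_id) tuples, and find_duplicate_results is a hypothetical helper, not an existing SDK function.

```python
from collections import defaultdict

# Event kinds that count as a "result" for a tool call in this sketch.
RESULT_KINDS = {"ObservationEvent", "UserRejectObservation", "AgentErrorEvent"}


def find_duplicate_results(events):
    """Return {tool_call_id: [kinds]} for calls with more than one result event.

    `events` is an iterable of (kind, tool_call_id) tuples, a simplification
    of the real event objects.
    """
    by_call = defaultdict(list)
    for kind, tool_call_id in events:
        if kind in RESULT_KINDS:
            by_call[tool_call_id].append(kind)
    return {tc: kinds for tc, kinds in by_call.items() if len(kinds) > 1}


# Event log after steps 1-11 above.
log = [
    ("ActionEvent", "X"),
    ("AgentErrorEvent", "X"),   # emitted by crash recovery
    ("ObservationEvent", "X"),  # emitted by the re-executed tool
]
duplicates = find_duplicate_results(log)
```

A test for any of the fixes below would assert that this mapping is empty after the crash, restart, and second run() call.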

Potential Fixes

  1. Add AgentErrorEvent to get_unmatched_actions() - But AgentErrorEvent does not have action_id, only tool_call_id. Would need to match by tool_call_id instead, or add action_id to AgentErrorEvent.

  2. Change crash recovery to NOT allow re-execution - Either use a different status that blocks run(), or do not emit AgentErrorEvent at all.

  3. Make get_unmatched_actions() also check AgentErrorEvent by tool_call_id - This would be a behavior change but might be the cleanest fix.
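A rough sketch of what matching by tool_call_id (fixes 1 and 3) might look like. The dataclasses are hypothetical stand-ins with only the fields mentioned in this issue, and the function is an illustration of the idea rather than a proposed patch against state.py:

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the SDK event classes.
@dataclass
class ActionEvent:
    id: str
    tool_call_id: str


@dataclass
class ObservationEvent:
    action_id: str
    tool_call_id: str


@dataclass
class AgentErrorEvent:
    tool_call_id: str


def get_unmatched_actions(events):
    """Treat an action as matched if it has an observation (by action_id)
    or an agent error (by tool_call_id)."""
    observed_action_ids = set()
    errored_tool_call_ids = set()
    for event in events:
        if isinstance(event, ObservationEvent):
            observed_action_ids.add(event.action_id)
        elif isinstance(event, AgentErrorEvent):
            errored_tool_call_ids.add(event.tool_call_id)
    return [
        e
        for e in events
        if isinstance(e, ActionEvent)
        and e.id not in observed_action_ids
        and e.tool_call_id not in errored_tool_call_ids
    ]


crashed = ActionEvent(id="a1", tool_call_id="X")
fresh = ActionEvent(id="a2", tool_call_id="Y")
log = [crashed, AgentErrorEvent(tool_call_id="X"), fresh]

# Only the action without any result is returned; the crashed one is skipped.
pending = get_unmatched_actions(log)
```

One thing to check before adopting this: whether AgentErrorEvent is emitted with a tool_call_id in situations where the action should still be retried, since this change would suppress re-execution in those cases too.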

Related Code

  • openhands-agent-server/openhands/agent_server/event_service.py:470-488 - Crash recovery
  • openhands-sdk/openhands/sdk/conversation/state.py:450-462 - get_unmatched_actions()
  • openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py:549-554 - ERROR status handling
  • openhands-sdk/openhands/sdk/agent/agent.py:264-271 - Pending action execution
