Bug: Duplicate observations for same tool_call_id after crash recovery

## Summary

When a server crashes during tool execution and restarts, crash recovery emits an `AgentErrorEvent`. However, if `run()` is called again, the action is **re-executed**, resulting in **both** an `AgentErrorEvent` and an `ObservationEvent` for the same `tool_call_id`.

## Root Cause

`get_unmatched_actions()` in `state.py:454` only checks for `ObservationEvent` and `UserRejectObservation`:

```python
if isinstance(event, (ObservationEvent, UserRejectObservation)):
    observed_action_ids.add(event.action_id)
```

It does **NOT** check for `AgentErrorEvent`, so the action remains "unmatched" even after crash recovery emits an error for it.
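The gap can be reproduced in isolation. The sketch below is not the SDK code: the dataclasses are simplified, hypothetical stand-ins reduced to the fields discussed in this issue, and the function body is an assumed reconstruction around the two lines quoted above.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for the SDK event classes.
@dataclass
class ActionEvent:
    id: str
    tool_call_id: str


@dataclass
class ObservationEvent:
    action_id: str
    tool_call_id: str


@dataclass
class UserRejectObservation:
    action_id: str
    tool_call_id: str


@dataclass
class AgentErrorEvent:
    tool_call_id: str  # note: no action_id field
    error: str


def get_unmatched_actions(events):
    """Assumed shape of the current logic: only action_id-bearing events match."""
    observed_action_ids = set()
    unmatched = []
    # Walk newest to oldest so results are seen before the actions they answer.
    for event in reversed(events):
        if isinstance(event, (ObservationEvent, UserRejectObservation)):
            observed_action_ids.add(event.action_id)
        elif isinstance(event, ActionEvent):
            if event.id not in observed_action_ids:
                unmatched.insert(0, event)
    return unmatched


# Event log as it looks right after crash recovery.
events = [
    ActionEvent(id="a1", tool_call_id="X"),
    AgentErrorEvent(tool_call_id="X", error="crash during tool execution"),
]
pending = get_unmatched_actions(events)
print([a.id for a in pending])  # prints ['a1']
```

The action is still reported as pending even though an error was already recorded for its `tool_call_id`.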

## Complete Control Flow

1. `ActionEvent` created (`tool_call_id=X`)
2. Server crashes during tool execution
3. On restart, `event_service.start()` detects `execution_status==RUNNING`
4. Crash recovery emits `AgentErrorEvent` (`tool_call_id=X`) - `event_service.py:479-488`
5. Crash recovery sets `execution_status=ERROR`
6. User calls `run()` again
7. `run()` allows ERROR status to proceed - `local_conversation.py:549-554`:
   ```python
   if self._state.execution_status in [IDLE, PAUSED, ERROR]:
       self._state.execution_status = RUNNING
   ```
8. `agent.step()` calls `get_unmatched_actions()` which returns the action (because `AgentErrorEvent` is not checked)
9. `agent.step()` calls `_execute_actions()` on the "pending" action - `agent.py:264-271`
10. Tool executes and emits `ObservationEvent` (`tool_call_id=X`)
11. **Result: BOTH `AgentErrorEvent` AND `ObservationEvent` for same `tool_call_id`**
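A regression test for this flow could assert that no `tool_call_id` ends up with more than one result event. The helper below is a minimal sketch of such a check. `find_duplicate_results` is a hypothetical name, and the dataclasses are simplified stand-ins rather than the SDK classes.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for the SDK event classes.
@dataclass
class ObservationEvent:
    action_id: str
    tool_call_id: str


@dataclass
class AgentErrorEvent:
    tool_call_id: str
    error: str


def find_duplicate_results(events):
    """Map each tool_call_id with more than one result to its result type names."""
    results = defaultdict(list)
    for event in events:
        if isinstance(event, (ObservationEvent, AgentErrorEvent)):
            results[event.tool_call_id].append(type(event).__name__)
    return {tcid: names for tcid, names in results.items() if len(names) > 1}


# Event log after steps 4 and 10 above, plus one unrelated, healthy call.
log = [
    AgentErrorEvent(tool_call_id="X", error="crash during tool execution"),
    ObservationEvent(action_id="a1", tool_call_id="X"),
    ObservationEvent(action_id="a2", tool_call_id="Y"),
]
duplicates = find_duplicate_results(log)
print(duplicates)  # prints {'X': ['AgentErrorEvent', 'ObservationEvent']}
```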


## Potential Fixes

1. **Add `AgentErrorEvent` to `get_unmatched_actions()`** - `AgentErrorEvent` has no `action_id`, only `tool_call_id`, so this requires either matching by `tool_call_id` or adding `action_id` to `AgentErrorEvent`.

2. **Change crash recovery to NOT allow re-execution** - Either use a different status that blocks `run()`, or do not emit `AgentErrorEvent` at all.

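3. **Make `get_unmatched_actions()` also check `AgentErrorEvent` by `tool_call_id`** - This is the `tool_call_id` variant of option 1. It changes behavior (an action with a recorded error is no longer treated as pending) but might be the cleanest fix.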
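A minimal sketch of the `tool_call_id` matching approach, under the same caveats as the rest of this issue: the dataclasses are simplified, hypothetical stand-ins, and the function is an assumed reconstruction, not the actual code in `state.py`.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for the SDK event classes.
@dataclass
class ActionEvent:
    id: str
    tool_call_id: str


@dataclass
class ObservationEvent:
    action_id: str
    tool_call_id: str


@dataclass
class UserRejectObservation:
    action_id: str
    tool_call_id: str


@dataclass
class AgentErrorEvent:
    tool_call_id: str
    error: str


def get_unmatched_actions(events):
    """Treat an action as matched if it has an observation OR a later error."""
    observed_action_ids = set()
    errored_tool_call_ids = set()
    unmatched = []
    # Newest to oldest: only errors emitted after an action can match it.
    for event in reversed(events):
        if isinstance(event, (ObservationEvent, UserRejectObservation)):
            observed_action_ids.add(event.action_id)
        elif isinstance(event, AgentErrorEvent):
            errored_tool_call_ids.add(event.tool_call_id)
        elif isinstance(event, ActionEvent):
            if event.id in observed_action_ids:
                continue
            if event.tool_call_id in errored_tool_call_ids:
                continue
            unmatched.insert(0, event)
    return unmatched


recovered = [
    ActionEvent(id="a1", tool_call_id="X"),
    AgentErrorEvent(tool_call_id="X", error="crash during tool execution"),
]
still_pending = [ActionEvent(id="a2", tool_call_id="Y")]

print(get_unmatched_actions(recovered))  # prints []
print([a.id for a in get_unmatched_actions(still_pending)])  # prints ['a2']
```

With this variant, an action that crash recovery has already answered with an error is not re-executed on the next `run()`. Whether such an action should instead be retried is a separate design question that this sketch does not settle.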

## Related Code

- `openhands-agent-server/openhands/agent_server/event_service.py:470-488` - Crash recovery
- `openhands-sdk/openhands/sdk/conversation/state.py:450-462` - `get_unmatched_actions()`
- `openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py:549-554` - ERROR status handling
- `openhands-sdk/openhands/sdk/agent/agent.py:264-271` - Pending action execution

