Level 10 — Multi-step API Orchestration: dependency chains, error recovery, and mid-flow cancellation #378

## Objective

Build and run a scenario where the agent must complete a multi-step workflow across simulated APIs with real dependency management and state tracking, using the new simulation and world-state infrastructure.

## What Changed (v2 — March 2026)

Previous attempt (March 13) scored 1.0 on round 1 because the "state" was entirely narrative — the agent described recovery strategies rather than executing them. This revision uses the new `SimulationInterface`, `WorldState`, and action-trace evaluation so the agent interacts with an actual simulated environment.

## Scenario Design

Use `SimulationInterface` (from `scenarios/simulation.py`) + `WorldState` (from `scenarios/world_state.py`):

* Agent orchestrates a travel booking workflow via `ActionSpec` actions: search_flights, book_flight, search_hotels, book_hotel, arrange_transport
* `WorldState` tracks booking entities, dependencies, and resource states
* Mid-flow failure injection: flight cancelled after hotel booked → agent must cascade-cancel or rebook
* Evaluation is based on the **action trace** and **terminal world state**, not prose output

State is managed via `WorldEntity` objects with dependency graphs. Actions have preconditions and effects that mutate the world.
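A minimal sketch of the precondition/effect model described above. The real `WorldEntity`, `WorldState`, and `ActionSpec` signatures are not shown in this issue, so `Entity`, `World`, `Action`, and `apply` below are hypothetical stand-ins that only illustrate the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Entity:  # hypothetical stand-in for WorldEntity
    id: str
    kind: str
    status: str = "active"
    depends_on: List[str] = field(default_factory=list)


@dataclass
class World:  # hypothetical stand-in for WorldState
    entities: Dict[str, Entity] = field(default_factory=dict)

    def active(self, kind: str) -> List[Entity]:
        """Return all active entities of the given kind."""
        return [e for e in self.entities.values()
                if e.kind == kind and e.status == "active"]


@dataclass
class Action:  # hypothetical stand-in for ActionSpec
    name: str
    precondition: Callable[[World], bool]
    effect: Callable[[World], None]


def apply(world: World, action: Action, trace: List[dict]) -> bool:
    """Run the effect only if the precondition holds; always record the attempt."""
    ok = action.precondition(world)
    if ok:
        action.effect(world)
    trace.append({"action": action.name, "ok": ok})
    return ok


def _book_flight(w: World) -> None:
    w.entities["flight-1"] = Entity("flight-1", "flight")


def _book_hotel(w: World) -> None:
    # The hotel booking records its dependency on the active flight.
    flight = w.active("flight")[0]
    w.entities["hotel-1"] = Entity("hotel-1", "hotel", depends_on=[flight.id])


book_flight = Action("book_flight", lambda w: True, _book_flight)
book_hotel = Action("book_hotel", lambda w: bool(w.active("flight")), _book_hotel)
```

In this sketch a rejected action still lands in the trace, so an evaluator can see that the agent tried `book_hotel` before any flight existed.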

## What This Tests

* **Action-trace evaluation** — judged on what the agent *did*, not what it *said*
* **Stateful dependency tracking** — each API call depends on results from prior calls, tracked in WorldState
* **Error recovery with real rollback** — mid-flow failures require actual state mutations, not narrative recovery
* **Knowledge accumulation** — early generations fumble ordering and error handling; later generations learn retry patterns and fallback strategies through playbook evolution

## Evaluation Dimensions

* Workflow completion rate (terminal state inspection)
* Error recovery quality (orphaned bookings in world state = penalty)
* Dependency ordering correctness (action trace analysis)
* Strategy sophistication across generations (Elo progression)
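The orphaned-booking and ordering checks could be sketched roughly as follows. This assumes the terminal world state can be exported as plain dicts with `status` and `depends_on` keys and that the action trace is a list of successfully applied action names; both shapes, and the helper names `find_orphans` and `ordering_violations`, are illustrative assumptions rather than the project's actual evaluator.

```python
from typing import Dict, List, Tuple


def find_orphans(entities: Dict[str, dict]) -> List[str]:
    """Return ids of active entities whose dependency is missing or not active."""
    orphans = []
    for eid, entity in entities.items():
        if entity["status"] != "active":
            continue
        for dep in entity.get("depends_on", []):
            parent = entities.get(dep)
            if parent is None or parent["status"] != "active":
                orphans.append(eid)
                break
    return sorted(orphans)


def ordering_violations(
    trace: List[str], requires: Dict[str, List[str]]
) -> List[Tuple[int, str]]:
    """Return (index, action) pairs where an action ran before a required one."""
    seen = set()
    violations = []
    for index, name in enumerate(trace):
        if any(req not in seen for req in requires.get(name, [])):
            violations.append((index, name))
        seen.add(name)
    return violations
```

A terminal state with a cancelled flight and a still-active hotel that depends on it would report the hotel as an orphan, which is the penalty case listed above.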

## Implementation Guidance

* Build a concrete `SimulationInterface` subclass for travel-booking using `WorldState`
* Register it in the scenario registry so `autoctx run --scenario travel_booking` works
* Use `ActionSpec` with preconditions (e.g., book_hotel requires active flight booking)
* Inject failure via world-state mutation after step 2
* Run 5+ generations with the live Anthropic provider
* Assert Elo improvement and playbook growth, not just pass/fail
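One way the "inject failure after step 2" guidance might look, sketched against a plain-dict world rather than the real `WorldState` API (which this issue does not spell out). The functions `inject_flight_cancellation`, `cascade_cancel`, and `run_with_injection` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

Entities = Dict[str, dict]  # assumed shape: id -> {"status": ..., "depends_on": [...]}


def inject_flight_cancellation(entities: Entities, flight_id: str) -> None:
    """Simulate the external failure by mutating world state directly."""
    entities[flight_id]["status"] = "cancelled"


def cascade_cancel(entities: Entities, root_id: str) -> List[str]:
    """Cancel every active entity that transitively depends on root_id."""
    cancelled = []
    frontier = [root_id]
    while frontier:
        current = frontier.pop()
        for eid, entity in entities.items():
            if entity["status"] == "active" and current in entity["depends_on"]:
                entity["status"] = "cancelled"
                cancelled.append(eid)
                frontier.append(eid)
    return cancelled


def run_with_injection(
    steps: List[Callable[[Entities], None]],
    entities: Entities,
    inject_after: int,
    flight_id: str,
) -> None:
    """Run steps in order, cancelling the flight right after step `inject_after`."""
    for number, step in enumerate(steps, start=1):
        step(entities)
        if number == inject_after:
            inject_flight_cancellation(entities, flight_id)
```

The injection only flips the flight's status; whether dependent bookings get cleaned up is left to the agent, which is what the orphan check is meant to measure. `cascade_cancel` shows one recovery an agent action could trigger, with rebooking as the alternative.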

## Success Criteria

* Scenario runs 5+ generations with measurable Elo improvement
* Action-trace evaluation catches ordering errors and orphaned state
* Playbook accumulates orchestration heuristics that transfer across seeds
* Clear before/after difference visible in generation artifacts
* Generated artifacts support inspection without reading full transcripts
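A rough sketch of how the Elo and playbook criteria could be asserted once per-generation numbers are available. It assumes the run exposes one Elo rating and one playbook size (for example, a heuristic count) per generation; that data shape and the function names are assumptions, not part of the existing tooling.

```python
from typing import Sequence


def elo_improved(
    elo_by_generation: Sequence[float],
    min_generations: int = 5,
    min_gain: float = 0.0,
) -> bool:
    """True if enough generations ran and the final Elo beats the first by more than min_gain."""
    if len(elo_by_generation) < min_generations:
        return False
    return elo_by_generation[-1] - elo_by_generation[0] > min_gain


def playbook_grew(sizes_by_generation: Sequence[int]) -> bool:
    """True if the playbook is larger at the end of the run than at the start."""
    return len(sizes_by_generation) >= 2 and sizes_by_generation[-1] > sizes_by_generation[0]
```

Comparing first and last generations is the simplest reading of "measurable improvement"; a stricter check might require a minimum gain via `min_gain` or a trend across all generations.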

## Acceptance

- [ ] SimulationInterface subclass implemented and registered
- [ ] WorldState tracks all booking entities and dependencies
- [ ] Mid-flow failure injection works via world-state mutation
- [ ] Action-trace evaluation (not prose) determines scores
- [ ] 5+ generations show Elo improvement
- [ ] Playbook accumulates transferable orchestration heuristics
