Level 10 — Multi-step API Orchestration: dependency chains, error recovery, and mid-flow cancellation #378

@cirdan-greyhaven

Objective

Build and run a scenario where the agent must complete a multi-step workflow across simulated APIs with real dependency management and state tracking, using the new simulation and world-state infrastructure.

What Changed (v2 — March 2026)

The previous attempt (March 13) scored 1.0 on round 1 because the "state" was entirely narrative: the agent described recovery strategies rather than executing them. This revision uses the new SimulationInterface, WorldState, and action-trace evaluation so that the agent interacts with an actual simulated environment.

Scenario Design

Use SimulationInterface (from scenarios/simulation.py) + WorldState (from scenarios/world_state.py):

  • Agent orchestrates a travel booking workflow via ActionSpec actions: search_flights, book_flight, search_hotels, book_hotel, arrange_transport
  • WorldState tracks booking entities, dependencies, and resource states
  • Mid-flow failure injection: flight cancelled after hotel booked → agent must cascade-cancel or rebook
  • Evaluation is based on the action trace and terminal world state, not prose output

State is managed via WorldEntity objects with dependency graphs. Actions have preconditions and effects that mutate the world.
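To make the precondition/effect idea concrete, here is a minimal sketch. It does not use the real classes from scenarios/simulation.py or scenarios/world_state.py, whose signatures are not shown in this issue. `Entity`, `World`, `Action`, and `apply` are hypothetical stand-ins for WorldEntity, WorldState, and ActionSpec, and only the `book_hotel` precondition (an active flight booking) comes from the issue text.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Entity:
    """Hypothetical stand-in for a WorldEntity: id, status, dependencies."""
    entity_id: str
    status: str = "active"
    depends_on: list[str] = field(default_factory=list)


@dataclass
class World:
    """Hypothetical stand-in for WorldState: a flat registry of entities."""
    entities: dict[str, Entity] = field(default_factory=dict)

    def add(self, entity: Entity) -> None:
        self.entities[entity.entity_id] = entity

    def is_active(self, entity_id: str) -> bool:
        entity = self.entities.get(entity_id)
        return entity is not None and entity.status == "active"


@dataclass
class Action:
    """Hypothetical stand-in for an ActionSpec: a precondition plus an effect."""
    name: str
    precondition: Callable[[World], bool]
    effect: Callable[[World], None]


def apply(world: World, action: Action) -> bool:
    """Run the effect only if the precondition holds; report whether it ran."""
    if not action.precondition(world):
        return False
    action.effect(world)
    return True


book_flight = Action(
    name="book_flight",
    precondition=lambda w: True,  # assumption: no precondition modelled here
    effect=lambda w: w.add(Entity("flight")),
)
book_hotel = Action(
    name="book_hotel",
    precondition=lambda w: w.is_active("flight"),  # from the issue text
    effect=lambda w: w.add(Entity("hotel", depends_on=["flight"])),
)

world = World()
hotel_before_flight = apply(world, book_hotel)  # rejected: no active flight
flight_booked = apply(world, book_flight)
hotel_after_flight = apply(world, book_hotel)   # accepted: flight is active
```

Because a rejected action leaves the world untouched, an out-of-order call shows up as a failed step in the trace rather than as corrupted state.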

What This Tests

  • Action-trace evaluation — judged on what the agent did, not what it said
  • Stateful dependency tracking — each API call depends on results from prior calls, tracked in WorldState
  • Error recovery with real rollback — mid-flow failures require actual state mutations, not narrative recovery
  • Knowledge accumulation — early generations fumble ordering and error handling; later generations learn retry patterns and fallback strategies through playbook evolution

Evaluation Dimensions

  • Workflow completion rate (terminal state inspection)
  • Error recovery quality (orphaned bookings in world state = penalty)
  • Dependency ordering correctness (action trace analysis)
  • Strategy sophistication across generations (Elo progression)
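The two trace-based dimensions above could be checked along the following lines. This is a sketch under stated assumptions, not the project's evaluator. The trace is assumed to be a list of successfully executed action names, bookings are assumed to be plain dicts with `status` and `depends_on` keys, and every entry in `REQUIRES` other than "book_hotel needs book_flight" is an illustrative guess.

```python
# Hypothetical ordering rules: action -> actions that must appear earlier.
# Only book_hotel -> book_flight is stated in the issue; the rest are guesses.
REQUIRES = {
    "book_flight": ["search_flights"],
    "book_hotel": ["book_flight", "search_hotels"],
    "arrange_transport": ["book_hotel"],
}


def ordering_errors(trace, requires=REQUIRES):
    """Return (step_index, action, missing_prerequisites) for each violation."""
    seen = set()
    errors = []
    for step, name in enumerate(trace):
        missing = [req for req in requires.get(name, []) if req not in seen]
        if missing:
            errors.append((step, name, missing))
        seen.add(name)
    return errors


def find_orphans(bookings):
    """Return ids of active bookings with a direct dependency that is not active."""
    orphans = []
    for booking_id, booking in bookings.items():
        if booking["status"] != "active":
            continue
        for dep in booking["depends_on"]:
            if bookings.get(dep, {}).get("status") != "active":
                orphans.append(booking_id)
                break
    return sorted(orphans)


# A trace that books the hotel before the flight.
bad_trace = ["search_flights", "search_hotels", "book_hotel", "book_flight"]
errors = ordering_errors(bad_trace)

# A terminal state where the flight was cancelled but the hotel was left active.
terminal = {
    "flight": {"status": "cancelled", "depends_on": []},
    "hotel": {"status": "active", "depends_on": ["flight"]},
}
orphans = find_orphans(terminal)
```

In this example `errors` flags the hotel booking at step index 2 and `orphans` contains only the hotel, which is the kind of penalty signal the dimensions describe. `find_orphans` checks direct dependencies only, so a transitive variant would be needed to flag bookings further down the chain.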

Implementation Guidance

  • Build a concrete SimulationInterface subclass for travel-booking using WorldState
  • Register it in the scenario registry so autoctx run --scenario travel_booking works
  • Use ActionSpec with preconditions (e.g., book_hotel requires active flight booking)
  • Inject failure via world-state mutation after step 2
  • Run 5+ generations with a live Anthropic provider
  • Assert Elo improvement and playbook growth, not just pass/fail
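As a rough illustration of failure injection by state mutation, the sketch below cancels the flight after the second booking step and then performs a cascade-cancel. All names (`run_with_failure`, `cascade_cancel`, the dict layout) are hypothetical, and reading "step 2" as the second booking step is an assumption. The real scenario would mutate WorldState through its own API.

```python
def run_with_failure(steps, inject_after=2):
    """Apply booking steps in order; cancel the flight after step `inject_after`.

    `steps` is a list of (booking_id, depends_on) pairs. This is a hypothetical
    format, not the ActionSpec interface.
    """
    bookings = {}
    for index, (booking_id, depends_on) in enumerate(steps, start=1):
        bookings[booking_id] = {"status": "active", "depends_on": list(depends_on)}
        if index == inject_after and "flight" in bookings:
            # Failure injection: mutate state directly, outside the agent's control.
            bookings["flight"]["status"] = "cancelled"
    return bookings


def cascade_cancel(bookings, root):
    """Cancel every active booking that transitively depends on `root`."""
    cancelled = []
    frontier = [root]
    while frontier:
        current = frontier.pop()
        for booking_id, booking in bookings.items():
            if booking["status"] == "active" and current in booking["depends_on"]:
                booking["status"] = "cancelled"
                cancelled.append(booking_id)
                frontier.append(booking_id)
    return cancelled


# Flight is cancelled right after the hotel is booked (the case in the issue).
state = run_with_failure([("flight", []), ("hotel", ["flight"])])
recovered = cascade_cancel(state, "flight")
still_active = [bid for bid, b in state.items() if b["status"] == "active"]
```

After the cascade no booking remains active, so the terminal state has no orphans. Rebooking is the other recovery path named in the issue; it would replace the cancelled flight and repoint the hotel's dependency instead of cancelling it.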

Success Criteria

  • Scenario runs 5+ generations with measurable Elo improvement
  • Action-trace evaluation catches ordering errors and orphaned state
  • Playbook accumulates orchestration heuristics that transfer across seeds
  • Clear before/after difference visible in generation artifacts
  • Generated artifacts support inspection without reading full transcripts
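One possible way to turn "measurable Elo improvement" and playbook growth into an assertion is sketched below. The function name, the `min_gain` threshold, and the first-versus-last comparison are all assumptions, and the numbers in the example are made up purely for illustration rather than taken from any run.

```python
def check_progress(elo_by_generation, playbook_sizes, min_gain=0.0, min_generations=5):
    """Summarise whether a run shows Elo improvement and playbook growth.

    Compares the last generation against the first. A stricter check (for
    example a trend fit) may be more appropriate for noisy runs.
    """
    if len(elo_by_generation) < min_generations:
        raise ValueError(
            f"need at least {min_generations} generations, got {len(elo_by_generation)}"
        )
    if len(playbook_sizes) != len(elo_by_generation):
        raise ValueError("expected one playbook size per generation")
    elo_gain = elo_by_generation[-1] - elo_by_generation[0]
    return {
        "elo_gain": elo_gain,
        "elo_improved": elo_gain > min_gain,
        "playbook_grew": playbook_sizes[-1] > playbook_sizes[0],
    }


# Illustrative, invented values: one Elo rating and one playbook entry count
# per generation.
report = check_progress(
    elo_by_generation=[1000.0, 1004.0, 1010.0, 1015.0, 1021.0],
    playbook_sizes=[0, 2, 3, 5, 6],
)
assert report["elo_improved"] and report["playbook_grew"]
```

Returning a small report dict rather than a bare boolean keeps the result inspectable in generation artifacts without reading full transcripts.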

Acceptance

  • SimulationInterface subclass implemented and registered
  • WorldState tracks all booking entities and dependencies
  • Mid-flow failure injection works via world-state mutation
  • Action-trace evaluation (not prose) determines scores
  • 5+ generations show Elo improvement
  • Playbook accumulates transferable orchestration heuristics
