Objective
Build and run a scenario where the agent must complete a multi-step workflow across simulated APIs with real dependency management and state tracking, using the new simulation and world-state infrastructure.
What Changed (v2 — March 2026)
Previous attempt (March 13) scored 1.0 on round 1 because the "state" was entirely narrative — the agent described recovery strategies rather than executing them. This revision uses the new SimulationInterface, WorldState, and action-trace evaluation so the agent interacts with an actual simulated environment.
Scenario Design
Use SimulationInterface (from scenarios/simulation.py) + WorldState (from scenarios/world_state.py):
- Agent orchestrates a travel booking workflow via ActionSpec actions: search_flights, book_flight, search_hotels, book_hotel, arrange_transport
- WorldState tracks booking entities, dependencies, and resource states
- Mid-flow failure injection: flight cancelled after hotel booked → agent must cascade-cancel or rebook
- Evaluation is based on the action trace and terminal world state, not prose output
State is managed via WorldEntity objects with dependency graphs. Actions have preconditions and effects that mutate the world.
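To make the precondition/effect model and the cascade-cancel path concrete, here is a minimal self-contained sketch. It does not use the real classes in scenarios/simulation.py or scenarios/world_state.py, whose signatures are not shown in this ticket. ToyEntity, ToyWorld, ToyAction, and the add helper are hypothetical stand-ins for WorldEntity, WorldState, and ActionSpec.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyEntity:
    """Hypothetical stand-in for a WorldEntity: one booking and its dependencies."""
    entity_id: str
    status: str = "active"
    depends_on: List[str] = field(default_factory=list)


@dataclass
class ToyWorld:
    """Hypothetical stand-in for WorldState: a dict of entities keyed by id."""
    entities: Dict[str, ToyEntity] = field(default_factory=dict)

    def is_active(self, entity_id: str) -> bool:
        entity = self.entities.get(entity_id)
        return entity is not None and entity.status == "active"

    def cascade_cancel(self, entity_id: str) -> List[str]:
        """Cancel an entity and everything that transitively depends on it."""
        cancelled: List[str] = []
        stack = [entity_id]
        while stack:
            current = stack.pop()
            entity = self.entities.get(current)
            if entity is None or entity.status == "cancelled":
                continue
            entity.status = "cancelled"
            cancelled.append(current)
            # Queue every entity that lists the cancelled one as a dependency.
            stack.extend(
                e.entity_id for e in self.entities.values() if current in e.depends_on
            )
        return cancelled


@dataclass
class ToyAction:
    """Hypothetical stand-in for an ActionSpec: a guarded world mutation."""
    name: str
    precondition: Callable[[ToyWorld], bool]
    effect: Callable[[ToyWorld], None]

    def apply(self, world: ToyWorld) -> bool:
        # The effect runs only when the precondition holds in the current world.
        if not self.precondition(world):
            return False
        self.effect(world)
        return True


def add(world: ToyWorld, entity_id: str, depends_on=()) -> None:
    world.entities[entity_id] = ToyEntity(entity_id, depends_on=list(depends_on))


book_flight = ToyAction("book_flight", lambda w: True, lambda w: add(w, "flight"))
book_hotel = ToyAction(
    "book_hotel",
    lambda w: w.is_active("flight"),  # book_hotel requires an active flight booking
    lambda w: add(w, "hotel", ["flight"]),
)

world = ToyWorld()
hotel_first = book_hotel.apply(world)  # rejected: no flight booked yet
flight_ok = book_flight.apply(world)
hotel_ok = book_hotel.apply(world)
cancelled = world.cascade_cancel("flight")  # simulates the injected flight cancellation
```

In this sketch the failure injection is just a direct call to cascade_cancel. In the real scenario the cancellation would be a world-state mutation, and the agent would have to issue the cancel or rebook actions itself.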
What This Tests
- Action-trace evaluation — judged on what the agent did, not what it said
- Stateful dependency tracking — each API call depends on results from prior calls, tracked in WorldState
- Error recovery with real rollback — mid-flow failures require actual state mutations, not narrative recovery
- Knowledge accumulation — early generations fumble ordering and error handling; later generations learn retry patterns and fallback strategies through playbook evolution
Evaluation Dimensions
- Workflow completion rate (terminal state inspection)
- Error recovery quality (orphaned bookings in world state = penalty)
- Dependency ordering correctness (action trace analysis)
- Strategy sophistication across generations (Elo progression)
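The two trace-based dimensions above (dependency ordering and orphaned bookings) could be checked roughly as follows. This is a sketch under assumptions: the trace is taken to be a plain list of action names and the terminal state a dict of entity records, which may not match the real action-trace or WorldState formats. Only the book_hotel rule comes from this ticket. The other entries in REQUIRES are illustrative guesses.

```python
from typing import Dict, List, Tuple

# Assumed rule set: action -> actions that must appear earlier in the trace.
# Only "book_hotel requires a flight booking" is stated in the ticket.
REQUIRES: Dict[str, List[str]] = {
    "book_flight": ["search_flights"],
    "book_hotel": ["book_flight"],
    "arrange_transport": ["book_hotel"],
}


def ordering_violations(trace: List[str]) -> List[Tuple[int, str, str]]:
    """Return (index, action, missing_prerequisite) for each out-of-order action."""
    seen = set()
    violations = []
    for index, action in enumerate(trace):
        for needed in REQUIRES.get(action, []):
            if needed not in seen:
                violations.append((index, action, needed))
        seen.add(action)
    return violations


def orphaned_bookings(terminal_state: Dict[str, dict]) -> List[str]:
    """Return active bookings whose dependency is missing or no longer active."""
    orphans = []
    for entity_id, entity in terminal_state.items():
        if entity["status"] != "active":
            continue
        for dep in entity.get("depends_on", []):
            if terminal_state.get(dep, {}).get("status") != "active":
                orphans.append(entity_id)
                break
    return sorted(orphans)


# Hotel booked before the flight: one ordering violation at index 2.
trace = ["search_flights", "search_hotels", "book_hotel", "book_flight"]
violations = ordering_violations(trace)

# Flight cancelled but hotel left active: the hotel is an orphan.
terminal = {
    "flight": {"status": "cancelled", "depends_on": []},
    "hotel": {"status": "active", "depends_on": ["flight"]},
}
orphans = orphaned_bookings(terminal)
```

Both functions read only the trace and the terminal state, so the agent's prose output cannot affect the score.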
Implementation Guidance
- Build a concrete SimulationInterface subclass for travel-booking using WorldState
- Register it in the scenario registry so autoctx run --scenario travel_booking works
- Use ActionSpec with preconditions (e.g., book_hotel requires active flight booking)
- Inject failure via world-state mutation after step 2
- Run 5+ generations with live Anthropic provider
- Assert Elo improvement and playbook growth, not just pass/fail
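The last bullet could be turned into a check along these lines. This is a rough sketch: GenerationSummary and progress_failures are hypothetical names, and how per-generation Elo and playbook size are actually read from the run artifacts is not specified here. The sample values are placeholders for exercising the check, not measured results.

```python
from typing import List, NamedTuple


class GenerationSummary(NamedTuple):
    """Hypothetical per-generation record extracted from run artifacts."""
    generation: int
    elo: float
    playbook_entries: int


def progress_failures(
    history: List[GenerationSummary], min_generations: int = 5
) -> List[str]:
    """Return human-readable reasons the run fails the progress criteria."""
    failures: List[str] = []
    if len(history) < min_generations:
        failures.append(f"only {len(history)} generations, need {min_generations}")
        return failures
    first, last = history[0], history[-1]
    if last.elo <= first.elo:
        failures.append(f"Elo did not improve: {first.elo} -> {last.elo}")
    if last.playbook_entries <= first.playbook_entries:
        failures.append(
            f"playbook did not grow: {first.playbook_entries} -> {last.playbook_entries}"
        )
    return failures


# Placeholder histories used only to exercise the check.
improving = [GenerationSummary(i, 1000.0 + 10 * i, 2 + i) for i in range(5)]
flat = [GenerationSummary(i, 1000.0, 2) for i in range(5)]

improving_failures = progress_failures(improving)
flat_failures = progress_failures(flat)
short_failures = progress_failures(improving[:3])
```

Comparing only the first and last generation is the simplest reading of "Elo improvement". A stricter variant might require a minimum gain or a trend across all generations.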
Success Criteria
- Scenario runs 5+ generations with measurable Elo improvement
- Action-trace evaluation catches ordering errors and orphaned state
- Playbook accumulates orchestration heuristics that transfer across seeds
- Clear before/after difference visible in generation artifacts
- Generated artifacts support inspection without reading full transcripts
Acceptance