Description
I'm trying to train two heterogeneous agents in a turn-based AEC (Agent Environment Cycle) environment from PettingZoo, where the second agent's observation includes the first agent's action. However, my agents are not learning, and I suspect it is related to how I handle the state transitions and the advantage computation.
Environment Setup
- Environment Type: PettingZoo AEC (turn-based)
- Number of Agents: 2 (heterogeneous)
- Agent Dependencies: Agent 2's observation includes Agent 1's action
- Reward Structure: sparse reward, only given after both agents have acted
- Framework: TorchRL with PPO (a rough wrapping sketch follows this list)
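A minimal sketch of how such an AEC environment can be wrapped, assuming TorchRL's PettingZooWrapper and its use_mask option (the tic-tac-toe environment is just a placeholder for a turn-based AEC env; treat the exact flag names as assumptions for your TorchRL version):

```python
from pettingzoo.classic import tictactoe_v3  # placeholder turn-based AEC env
from torchrl.envs.libs.pettingzoo import PettingZooWrapper

# Wrap the AEC environment. With use_mask=True the wrapper is documented to add
# a per-agent "mask" entry marking which agent is acting at each step, which is
# what I rely on for filtering below.
base_env = tictactoe_v3.env()
env = PettingZooWrapper(env=base_env, use_mask=True)

td = env.reset()
print(td)  # grouped per-agent observations plus the acting-agent mask
```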
Problem
During sample collection with the TorchRL collector, each sample contains entries for both agents, but only one agent is active at a time (tracked via an agent_mask). I filter the collector's output TensorDict by this mask to get per-agent batches. The key issue is:
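For illustration, this is roughly the kind of per-agent filtering meant here, using boolean indexing on a TensorDict (shapes and key names are made up, not my actual spec):

```python
import torch
from tensordict import TensorDict

# Toy rollout of 8 turns where the two agents alternate; keys are placeholders.
rollout = TensorDict(
    {
        "observation": torch.randn(8, 4),
        "action": torch.randint(0, 2, (8,)),
        "agent_mask": torch.tensor([1, 0, 1, 0, 1, 0, 1, 0], dtype=torch.bool),
        ("next", "reward"): torch.zeros(8, 1),
    },
    batch_size=[8],
)

# Boolean indexing over the batch dimension splits the rollout per agent.
agent1_batch = rollout[rollout["agent_mask"]]   # turns where agent 1 acted
agent2_batch = rollout[~rollout["agent_mask"]]  # turns where agent 2 acted
print(agent1_batch.batch_size, agent2_batch.batch_size)  # torch.Size([4]) each
```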
The environment state only updates after the second agent acts, which means:
- Agent 1 acts: current_state = S0 → next_state = S0 (unchanged!)
- Agent 2 acts: current_state = S0 → next_state = S1 (finally updated)
This breaks Agent 1's advantage computation: because current_state == next_state, V(next_state) = V(S0), and since the sparse reward is still zero at this step, the one-step TD error reduces to 0 + gamma * V(S0) - V(S0) = (gamma - 1) * V(S0), which carries no information about Agent 1's action.
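A toy numerical check of that collapse (arbitrary values, just to make the point concrete):

```python
import torch

gamma = 0.99
reward = torch.tensor(0.0)   # sparse reward: nothing yet after agent 1's turn
v_s0 = torch.tensor(0.37)    # arbitrary value estimate V(S0)
v_next = v_s0                # next_state == current_state, so V(next) == V(S0)

# td_error = r + gamma * V(next) - V(S0) = (gamma - 1) * V(S0)
td_error = reward + gamma * v_next - v_s0
print(td_error)  # tensor(-0.0037): independent of whatever agent 1 did
```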
Is there an example or other documentation on how to train in such a setting, and what should be masked or adapted compared to a normal ParallelEnv?