diff --git a/.gitmodules b/.gitmodules
index ad8c0ea2..d1783a9d 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,6 +1,3 @@
-[submodule "experiments/gym-microrts-static-files"]
- path = experiments/gym-microrts-static-files
- url = https://github.com/vwxyzjn/gym-microrts-static-files
-[submodule "gym_microrts/microrts"]
- path = gym_microrts/microrts
- url = https://github.com/Farama-Foundation/MicroRTS.git
+[submodule "gym_microrts/microrts"]
+ path = gym_microrts/microrts
+ url = https://github.com/Farama-Foundation/MicroRTS.git
diff --git a/README.md b/README.md
index b8c59319..18ebe100 100644
--- a/README.md
+++ b/README.md
@@ -1,213 +1,234 @@
-
-
-
-
-Formerly Gym-μRTS/Gym-MicroRTS
-
-[
](https://discord.gg/DdJsrdry6F)
-[
](
-https://github.com/vwxyzjn/gym-microrts/actions)
-[
](
-https://pypi.org/project/gym-microrts/)
-
-This repo contains the source code for the gym wrapper of μRTS authored by [Santiago Ontañón](https://github.com/santiontanon/microrts).
-
-MicroRTS-Py will eventually be updated, maintained, and made compliant with the standards of the Farama Foundation (https://farama.org/project_standards). However, this is currently a lower priority than other projects we're working to maintain. If you'd like to contribute to development, you can join our discord server here- https://discord.gg/jfERDCSw.
-
-
-
-## Get Started
-
-Prerequisites:
-* Python 3.8+
-* [Poetry](https://python-poetry.org)
-* Java 8.0+
-* FFmpeg (for video recording utilities)
-
-```bash
-$ git clone --recursive https://github.com/Farama-Foundation/MicroRTS-Py.git && \
-cd MicroRTS-Py
-poetry install
-# The `poetry install` command above creates a virtual environment for us, in which all the dependencies are installed.
-# We can use `poetry shell` to create a new shell in which this environment is activated. Once we are done working with
-# MicroRTS, we can leave it again using `exit`.
-poetry shell
-# By default, the torch wheel is built with CUDA 10.2. If you are using newer NVIDIA GPUs (e.g., 3060 TI), you may need to specifically install CUDA 11.3 wheels by overriding the torch dependency with pip:
-# poetry run pip install "torch==1.12.1" --upgrade --extra-index-url https://download.pytorch.org/whl/cu113
-python hello_world.py
-```
-
-If the `poetry install` command gets stuck on a Linux machine, [it may help to first run](https://github.com/python-poetry/poetry/issues/8623): `export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring`
-
-To train an agent, run the following
-
-```bash
-cd experiments
-python ppo_gridnet.py \
- --total-timesteps 100000000 \
- --capture-video \
- --seed 1
-```
-
-[](https://asciinema.org/a/586754)
-
-For running a partial observable example, tune the `partial_obs` argument.
-```bash
-cd experiments
-python ppo_gridnet.py \
- --partial-obs \
- --capture-video \
- --seed 1
-```
-
-## Technical Paper
-
-Before diving into the code, we highly recommend reading the preprint of our paper: [Gym-μRTS: Toward Affordable Deep Reinforcement Learning Research in Real-time Strategy Games](https://arxiv.org/abs/2105.13807).
-
-### Depreciation note
-
-Note that the experiments in the technical paper above are done with [`gym_microrts==0.3.2`](https://github.com/vwxyzjn/gym-microrts/tree/v0.3.2). As we move forward beyond `v0.4.x`, we are planing to deprecate UAS despite its better performance in the paper. This is because UAS has more complex implementation and makes it really difficult to incorporate selfplay or imitation learning in the future.
-
-
-
-## Environment Specification
-
-Here is a description of Gym-μRTS's observation and action space:
-
-* **Observation Space.** (`Box(0, 1, (h, w, 27), int32)`) Given a map of size `h x w`, the observation is a tensor of shape `(h, w, n_f)`, where `n_f` is a number of feature planes that have binary values. The observation space used in this paper uses 27 feature planes as shown in the following table. A feature plane can be thought of as a concatenation of multiple one-hot encoded features. As an example, if there is a worker with hit points equal to 1, not carrying any resources, owner being Player 1, and currently not executing any actions, then the one-hot encoding features will look like the following:
-
- `[0,1,0,0,0], [1,0,0,0,0], [1,0,0], [0,0,0,0,1,0,0,0], [1,0,0,0,0,0]`
-
-
- The 27 values of each feature plane for the position in the map of such worker will thus be:
-
- `[0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0]`
-
-* **Partial Observation Space.** (`Box(0, 1, (h, w, 29), int32)`) Given a map of size `h x w`, the observation is a tensor of shape `(h, w, n_f)`, where `n_f` is a number of feature planes that have binary values. The observation space for partial observability uses 29 feature planes as shown in the following table. A feature plane can be thought of as a concatenation of multiple one-hot encoded features. As an example, if there is a worker with hit points equal to 1, not carrying any resources, owner being Player 1, currently not executing any actions, and not visible to the opponent, then the one-hot encoding features will look like the following:
-
- `[0,1,0,0,0], [1,0,0,0,0], [1,0,0], [0,0,0,0,1,0,0,0], [1,0,0,0,0,0], [1,0]`
-
-
- The 29 values of each feature plane for the position in the map of such worker will thus be:
-
- `[0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0]`
-
-* **Action Space.** (`MultiDiscrete(concat(h * w * [[6 4 4 4 4 7 a_r]]))`) Given a map of size `h x w` and the maximum attack range `a_r=7`, the action is an (7hw)-dimensional vector of discrete values as specified in the following table. The first 7 component of the action vector represents the actions issued to the unit at `x=0,y=0`, and the second 7 component represents actions issued to the unit at `x=0,y=1`, etc. In these 7 components, the first component is the action type, and the rest of components represent the different parameters different action types can take. Depending on which action type is selected, the game engine will use the corresponding parameters to execute the action. As an example, if the RL agent issues a move south action to the worker at $x=0, y=1$ in a 2x2 map, the action will be encoded in the following way:
-
- `concat([0,0,0,0,0,0,0], [1,2,0,0,0,0,0], [0,0,0,0,0,0,0], [0,0,0,0,0,0,0]]`
- `=[0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]`
-
-
-
-## Evaluation
-
-You can evaluate trained agents against a built-in bot:
-
-```bash
-cd experiments
-python ppo_gridnet_eval.py \
- --agent-model-path gym-microrts-static-files/agent_sota.pt \
- --ai coacAI
-```
-
-Alternatively, you can evaluate the trained RL bots against themselves
-
-```bash
-cd experiments
-python ppo_gridnet_eval.py \
- --agent-model-path gym-microrts-static-files/agent_sota.pt \
- --agent2-model-path gym-microrts-static-files/agent_sota.pt
-```
-
-### Evaluate Trueskill of the agents
-
-This repository already contains a preset Trueskill database in `experiments/league.db`. To evaluate a new AI, try running the following command, which will iteratively find good matches for `agent.pt` until the engine is confident `agent.pt`'s Trueskill (by having the agent's Trueskill sigma below `--highest-sigma 1.4`).
-
-```bash
-cd experiments
-python league.py --evals gym-microrts-static-files/agent_sota.pt --highest-sigma 1.4 --update-db False
-```
-
-To recreate the preset Trueskill database, start a round-robin Trueskill evaluation among built-in AIs by removing the database in `experiments/league.db`.
-```bash
-cd experiments
-rm league.csv league.db
-python league.py --evals randomBiasedAI workerRushAI lightRushAI coacAI
-```
-
-## Multi-maps support
-
-The training script allows you to train the agents with more than one maps and evaluate with more than one maps. Try executing:
-
-```
-cd experiments
-python ppo_gridnet.py \
- --train-maps maps/16x16/basesWorkers16x16B.xml maps/16x16/basesWorkers16x16C.xml maps/16x16/basesWorkers16x16D.xml maps/16x16/basesWorkers16x16E.xml maps/16x16/basesWorkers16x16F.xml \
- --eval-maps maps/16x16/basesWorkers16x16B.xml maps/16x16/basesWorkers16x16C.xml maps/16x16/basesWorkers16x16D.xml maps/16x16/basesWorkers16x16E.xml maps/16x16/basesWorkers16x16F.xml
-```
-
-where `--train-maps` allows you to specify the training maps and `--eval-maps` the evaluation maps. `--train-maps` and `--eval-maps` do not have to match (so you can evaluate on maps the agent has never trained on before).
-
-## Known issues
-
-[ ] Rendering does not exactly work in macos. See https://github.com/jpype-project/jpype/issues/906
-
-## Papers written using Gym-μRTS
-
-* AIIDE 2022 Strategy Games Workshop: [Transformers as Policies for Variable Action Environments](https://arxiv.org/abs/2301.03679)
-* CoG 2021: [Gym-μRTS: Toward Affordable Deep Reinforcement Learning Research in Real-time Strategy Games](https://arxiv.org/abs/2105.13807),
-* AAAI RLG 2021: [Generalization in Deep Reinforcement Learning with Real-time Strategy Games](http://aaai-rlg.mlanctot.info/papers/AAAI21-RLG_paper_33.pdf),
-* AIIDE 2020 Strategy Games Workshop: [Action Guidance: Getting the Best of Training Agents with Sparse Rewards and Shaped Rewards](https://arxiv.org/abs/2010.03956),
-* AIIDE 2019 Strategy Games Workshop: [Comparing Observation and Action Representations for Deep Reinforcement Learning in MicroRTS](https://arxiv.org/abs/1910.12134),
-
-## PettingZoo API
-
-We wrapped our Gym-µRTS simulator into a PettingZoo environment, which is defined in `gym_microrts/pettingzoo_api.py`. An example usage of the Gym-µRTS PettingZoo environment can be found in `hello_world_pettingzoo.py`.
-
-
-## Cite this project
-
-To cite the Gym-µRTS simulator:
-
-```bibtex
-@inproceedings{huang2021gym,
- author = {Shengyi Huang and
- Santiago Onta{\~{n}}{\'{o}}n and
- Chris Bamford and
- Lukasz Grela},
- title = {Gym-{\(\mathrm{\mu}\)}RTS: Toward Affordable Full Game Real-time Strategy
- Games Research with Deep Reinforcement Learning},
- booktitle = {2021 {IEEE} Conference on Games (CoG), Copenhagen, Denmark, August
- 17-20, 2021},
- pages = {671--678},
- publisher = {{IEEE}},
- year = {2021},
- url = {https://doi.org/10.1109/CoG52621.2021.9619076},
- doi = {10.1109/CoG52621.2021.9619076},
- timestamp = {Fri, 10 Dec 2021 10:41:01 +0100},
- biburl = {https://dblp.org/rec/conf/cig/HuangO0G21.bib},
- bibsource = {dblp computer science bibliography, https://dblp.org}
-}
-```
-
-To cite the invalid action masking technique used in our training script:
-
-```bibtex
-@inproceedings{huang2020closer,
- author = {Shengyi Huang and
- Santiago Onta{\~{n}}{\'{o}}n},
- editor = {Roman Bart{\'{a}}k and
- Fazel Keshtkar and
- Michael Franklin},
- title = {A Closer Look at Invalid Action Masking in Policy Gradient Algorithms},
- booktitle = {Proceedings of the Thirty-Fifth International Florida Artificial Intelligence
- Research Society Conference, {FLAIRS} 2022, Hutchinson Island, Jensen
- Beach, Florida, USA, May 15-18, 2022},
- year = {2022},
- url = {https://doi.org/10.32473/flairs.v35i.130584},
- doi = {10.32473/flairs.v35i.130584},
- timestamp = {Thu, 09 Jun 2022 16:44:11 +0200},
- biburl = {https://dblp.org/rec/conf/flairs/HuangO22.bib},
- bibsource = {dblp computer science bibliography, https://dblp.org}
-}
-```
+
+
+
+
+Formerly Gym-μRTS/Gym-MicroRTS
+
+[
](https://discord.gg/DdJsrdry6F)
+[
](https://github.com/Farama-Foundation/MicroRTS-Py/actions)
+[
](
+https://pypi.org/project/gym-microrts/)
+
+This repo contains the source code for the gym wrapper of μRTS authored by [Santiago Ontañón](https://github.com/santiontanon/microrts).
+
+MicroRTS-Py will eventually be updated, maintained, and made compliant with the standards of the Farama Foundation (https://farama.org/project_standards). However, this is currently a lower priority than other projects we're working to maintain. If you'd like to contribute to development, you can join our Discord server here: https://discord.gg/jfERDCSw.
+
+
+
+## Get Started
+
+Prerequisites:
+* Python 3.8+
+* [Poetry](https://python-poetry.org)
+* Java 8.0+
+* FFmpeg (for video recording utilities)
+
+```bash
+git clone --recursive https://github.com/Farama-Foundation/MicroRTS-Py.git
+cd MicroRTS-Py
+poetry install
+# The `poetry install` command above creates a virtual environment for us, in which all the dependencies are installed.
+# We can use `poetry shell` to create a new shell in which this environment is activated. Once we are done working with
+# MicroRTS, we can leave it again using `exit`.
+poetry shell
+# By default, the torch wheel is built with CUDA 10.2. If you are using newer NVIDIA GPUs (e.g., 3060 TI), you may need to specifically install CUDA 11.3 wheels by overriding the torch dependency with pip:
+# poetry run pip install "torch==1.12.1" --upgrade --extra-index-url https://download.pytorch.org/whl/cu113
+python hello_world.py
+```
+
+If the `poetry install` command gets stuck on a Linux machine, [it may help to first run](https://github.com/python-poetry/poetry/issues/8623): `export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring`.
+
+To train an agent, run the following:
+
+```bash
+cd experiments
+python ppo_gridnet.py \
+ --total-timesteps 100000000 \
+ --capture-video \
+ --seed 1
+```
+
+[Demo of the training run (asciinema)](https://asciinema.org/a/586754)
+
+To run a partially observable example, use the `--partial-obs` flag:
+```bash
+cd experiments
+python ppo_gridnet.py \
+ --partial-obs \
+ --capture-video \
+ --seed 1
+```
+
+## Technical Paper
+
+Before diving into the code, we highly recommend reading the preprint of our paper: [Gym-μRTS: Toward Affordable Deep Reinforcement Learning Research in Real-time Strategy Games](https://arxiv.org/abs/2105.13807).
+
+### Deprecation notes
+
+1. Note that the experiments in the technical paper above were done with [`gym_microrts==0.3.2`](https://github.com/vwxyzjn/gym-microrts/tree/v0.3.2). As we move beyond `v0.4.x`, we are planning to deprecate UAS despite its better performance in the paper, because its more complex implementation makes it difficult to incorporate self-play or imitation learning in the future.
+2. [v0.6.1](https://github.com/Farama-Foundation/MicroRTS-Py/releases/tag/v0.6.1) is the last version in which wall/terrain observations were not present in state tensors. As of December 2023, every state observation has an extra channel encoding the presence of walls, and models trained before this will therefore no longer be compatible with code in the `master` branch. Such models should use the code from `v0.6.1` instead.
+
+
+
+## Environment Specification
+
+Here is a description of Gym-μRTS's observation and action space:
+
+* **Observation Space.** (`Box(0, 1, (h, w, 29), int32)`) Given a map of size `h x w`, the observation is a tensor of shape `(h, w, n_f)`, where `n_f` is the number of binary feature planes. The observation space used in the original paper had 27 feature planes; since then, 2 more planes (for terrain/walls) have been added, bringing the total to 29, as shown below. A feature plane can be thought of as a concatenation of multiple one-hot encoded features. As an example, the unit at a cell could be encoded as follows:
+
+ * the unit has 1 hit point -> `[0,1,0,0,0]`
+ * the unit is not carrying any resources -> `[1,0,0,0,0]`
+ * the unit is owned by Player 1 -> `[0,1,0]`
+ * the unit is a worker -> `[0,0,0,0,1,0,0,0]`
+ * the unit is not executing any actions -> `[1,0,0,0,0,0]`
+ * the unit is standing on a free terrain cell -> `[1,0]`
+
+ The 29 values of each feature plane for the position in the map of such a worker will thus be:
+
+ `[0,1,0,0,0, 1,0,0,0,0, 0,1,0, 0,0,0,0,1,0,0,0, 1,0,0,0,0,0, 1,0]`
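The per-cell encoding above can be reproduced with a short NumPy sketch. This is an illustration of the one-hot layout only, not code from the library; the plane sizes come from the feature table in this section:

```python
import numpy as np

# Feature-plane sizes: hit points, resources, owner, unit type,
# current action, terrain (29 channels in total).
PLANE_SIZES = [5, 5, 3, 8, 6, 2]

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    v = np.zeros(size, dtype=np.int32)
    v[index] = 1
    return v

# The example worker: 1 hit point, carrying no resources, owned by
# player 1, unit type "worker", executing no action, on free terrain.
indices = [1, 0, 1, 4, 0, 0]
obs_cell = np.concatenate([one_hot(i, s) for i, s in zip(indices, PLANE_SIZES)])

print(obs_cell.tolist())
# [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
```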
+
+* **Partial Observation Space.** (`Box(0, 1, (h, w, 31), int32)`) Under partial observability, there are two additional planes indicating whether the unit is visible to the opponent: `[0,1]` if it is visible and `[1,0]` if it is not. Using the example above and assuming the worker unit is not visible to the opponent, the 31 values of each feature plane for that worker's position will thus be:
+
+ `[0,1,0,0,0, 1,0,0,0,0, 0,1,0, 0,0,0,0,1,0,0,0, 1,0,0,0,0,0, 1,0, 1,0]`
+
+* **Action Space.** (`MultiDiscrete(concat(h * w * [[6 4 4 4 4 7 a_r]]))`) Given a map of size `h x w` and the maximum attack range `a_r=7`, the action is a (7hw)-dimensional vector of discrete values as specified in the following table. The first 7 components of the action vector represent the action issued to the unit at `x=0,y=0`, the next 7 components represent the action issued to the unit at `x=0,y=1`, and so on. In each group of 7, the first component is the action type, and the remaining components are the parameters that the different action types can take. Depending on which action type is selected, the game engine uses the corresponding parameters to execute the action. As an example, if the RL agent issues a move south action to the worker at $x=0, y=1$ in a 2x2 map, the action will be encoded in the following way:
+
+ `concat([0,0,0,0,0,0,0], [1,2,0,0,0,0,0], [0,0,0,0,0,0,0], [0,0,0,0,0,0,0])`
+ `=[0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]`
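The flattened action vector in this example can likewise be built with a small NumPy sketch (again purely illustrative; the cell ordering follows the description above):

```python
import numpy as np

h, w = 2, 2          # map size from the example
NUM_COMPONENTS = 7   # [action_type, move_dir, harvest_dir, return_dir,
                     #  produce_dir, produce_type, relative_attack_pos]

action = np.zeros((h * w, NUM_COMPONENTS), dtype=np.int64)

# The worker at x=0, y=1 occupies the second 7-component slot,
# per the ordering described above.
cell = 1
action[cell, 0] = 1  # action type 1 = move
action[cell, 1] = 2  # move parameter 2 = south

print(action.reshape(-1).tolist())
# [0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```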
+
+
+
+Here are tables summarizing observation features and action components, where $a_r=7$ is the maximum attack range, and `-` means not applicable.
+
+| Observation Features | Planes | Description |
+|-----------------------------|--------------------|----------------------------------------------------------|
+| Hit Points | 5 | 0, 1, 2, 3, $\geq 4$ |
+| Resources | 5 | 0, 1, 2, 3, $\geq 4$ |
+| Owner                       | 3                  | -, player 1, player 2                                    |
+| Unit Types | 8 | -, resource, base, barrack, worker, light, heavy, ranged |
+| Current Action | 6 | -, move, harvest, return, produce, attack |
+| Terrain | 2 | free, wall |
+
+| Action Components | Range | Description |
+|-----------------------------|--------------------|----------------------------------------------------------|
+| Source Unit | $[0,h \times w-1]$ | the location of the unit selected to perform an action |
+| Action Type | $[0,5]$ | NOOP, move, harvest, return, produce, attack |
+| Move Parameter | $[0,3]$ | north, east, south, west |
+| Harvest Parameter | $[0,3]$ | north, east, south, west |
+| Return Parameter | $[0,3]$ | north, east, south, west |
+| Produce Direction Parameter | $[0,3]$ | north, east, south, west |
+| Produce Type Parameter | $[0,6]$ | resource, base, barrack, worker, light, heavy, ranged |
+| Relative Attack Position | $[0,a_r^2 - 1]$ | the relative location of the unit that will be attacked |
+
+## Evaluation
+
+You can evaluate trained agents against a built-in bot:
+
+```bash
+cd experiments
+python ppo_gridnet_eval.py \
+ --agent-model-path gym-microrts-static-files/agent_sota.pt \
+ --ai coacAI
+```
+
+Alternatively, you can evaluate trained RL bots against each other:
+
+```bash
+cd experiments
+python ppo_gridnet_eval.py \
+ --agent-model-path gym-microrts-static-files/agent_sota.pt \
+ --agent2-model-path gym-microrts-static-files/agent_sota.pt
+```
+
+### Evaluate Trueskill of the agents
+
+This repository already contains a preset Trueskill database in `experiments/league.db`. To evaluate a new AI, try running the following command, which will iteratively find good matches for `agent.pt` until the engine is confident about `agent.pt`'s Trueskill (i.e., until the agent's Trueskill sigma drops below `--highest-sigma 1.4`).
+
+```bash
+cd experiments
+python league.py --evals gym-microrts-static-files/agent_sota.pt --highest-sigma 1.4 --update-db False
+```
+
+To recreate the preset Trueskill database, remove the existing database files and start a round-robin Trueskill evaluation among the built-in AIs:
+```bash
+cd experiments
+rm league.csv league.db
+python league.py --evals randomBiasedAI workerRushAI lightRushAI coacAI
+```
+
+## Multi-maps support
+
+The training script allows you to train and evaluate the agents on more than one map. Try executing:
+
+```bash
+cd experiments
+python ppo_gridnet.py \
+ --train-maps maps/16x16/basesWorkers16x16B.xml maps/16x16/basesWorkers16x16C.xml maps/16x16/basesWorkers16x16D.xml maps/16x16/basesWorkers16x16E.xml maps/16x16/basesWorkers16x16F.xml \
+ --eval-maps maps/16x16/basesWorkers16x16B.xml maps/16x16/basesWorkers16x16C.xml maps/16x16/basesWorkers16x16D.xml maps/16x16/basesWorkers16x16E.xml maps/16x16/basesWorkers16x16F.xml
+```
+
+where `--train-maps` allows you to specify the training maps and `--eval-maps` the evaluation maps. `--train-maps` and `--eval-maps` do not have to match (so you can evaluate on maps the agent has never trained on before).
+
+## Known issues
+
+- [ ] Rendering does not work correctly on macOS. See https://github.com/jpype-project/jpype/issues/906
+
+## Papers written using Gym-μRTS
+
+* AIIDE 2022 Strategy Games Workshop: [Transformers as Policies for Variable Action Environments](https://arxiv.org/abs/2301.03679)
+* CoG 2021: [Gym-μRTS: Toward Affordable Deep Reinforcement Learning Research in Real-time Strategy Games](https://arxiv.org/abs/2105.13807)
+* AAAI RLG 2021: [Generalization in Deep Reinforcement Learning with Real-time Strategy Games](http://aaai-rlg.mlanctot.info/papers/AAAI21-RLG_paper_33.pdf)
+* AIIDE 2020 Strategy Games Workshop: [Action Guidance: Getting the Best of Training Agents with Sparse Rewards and Shaped Rewards](https://arxiv.org/abs/2010.03956)
+* AIIDE 2019 Strategy Games Workshop: [Comparing Observation and Action Representations for Deep Reinforcement Learning in MicroRTS](https://arxiv.org/abs/1910.12134)
+
+## PettingZoo API
+
+We wrapped our Gym-µRTS simulator into a PettingZoo environment, which is defined in `gym_microrts/pettingzoo_api.py`. An example usage of the Gym-µRTS PettingZoo environment can be found in `hello_world_pettingzoo.py`.
+
+
+## Cite this project
+
+To cite the Gym-µRTS simulator:
+
+```bibtex
+@inproceedings{huang2021gym,
+ author = {Shengyi Huang and
+ Santiago Onta{\~{n}}{\'{o}}n and
+ Chris Bamford and
+ Lukasz Grela},
+ title = {Gym-{\(\mathrm{\mu}\)}RTS: Toward Affordable Full Game Real-time Strategy
+ Games Research with Deep Reinforcement Learning},
+ booktitle = {2021 {IEEE} Conference on Games (CoG), Copenhagen, Denmark, August
+ 17-20, 2021},
+ pages = {671--678},
+ publisher = {{IEEE}},
+ year = {2021},
+ url = {https://doi.org/10.1109/CoG52621.2021.9619076},
+ doi = {10.1109/CoG52621.2021.9619076},
+ timestamp = {Fri, 10 Dec 2021 10:41:01 +0100},
+ biburl = {https://dblp.org/rec/conf/cig/HuangO0G21.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
+
+To cite the invalid action masking technique used in our training script:
+
+```bibtex
+@inproceedings{huang2020closer,
+ author = {Shengyi Huang and
+ Santiago Onta{\~{n}}{\'{o}}n},
+ editor = {Roman Bart{\'{a}}k and
+ Fazel Keshtkar and
+ Michael Franklin},
+ title = {A Closer Look at Invalid Action Masking in Policy Gradient Algorithms},
+ booktitle = {Proceedings of the Thirty-Fifth International Florida Artificial Intelligence
+ Research Society Conference, {FLAIRS} 2022, Hutchinson Island, Jensen
+ Beach, Florida, USA, May 15-18, 2022},
+ year = {2022},
+ url = {https://doi.org/10.32473/flairs.v35i.130584},
+ doi = {10.32473/flairs.v35i.130584},
+ timestamp = {Thu, 09 Jun 2022 16:44:11 +0200},
+ biburl = {https://dblp.org/rec/conf/flairs/HuangO22.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
diff --git a/experiments/gym-microrts-static-files b/experiments/gym-microrts-static-files
deleted file mode 160000
index 405f909f..00000000
--- a/experiments/gym-microrts-static-files
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 405f909fd98dd1adae5904e3facb54d8381f6291
diff --git a/experiments/gym-microrts-static-files/agent_sota.pt b/experiments/gym-microrts-static-files/agent_sota.pt
new file mode 100644
index 00000000..f738e9b1
Binary files /dev/null and b/experiments/gym-microrts-static-files/agent_sota.pt differ
diff --git a/experiments/gym-microrts-static-files/league.csv b/experiments/gym-microrts-static-files/league.csv
new file mode 100644
index 00000000..c6b165a2
--- /dev/null
+++ b/experiments/gym-microrts-static-files/league.csv
@@ -0,0 +1,14 @@
+name,mu,sigma,trueskill
+coacAI,37.01208300442514,1.2057837814702337,33.39473166001444
+workerRushAI,32.177999126535994,1.0158657259647073,29.13040194864187
+droplet,32.046013299709365,1.0151344815092278,29.000609855181683
+mixedBot,31.485630788984253,1.0538188458547628,28.324174251419965
+izanagi,30.251048792700562,1.048141521227313,27.10662422901862
+tiamat,27.8300726697548,1.0437640107384225,24.69878063753953
+lightRushAI,26.430335172946624,1.0158591694179326,23.382757664692825
+rojo,25.174781972400247,1.0053993662036687,22.15858387378924
+guidedRojoA3N,23.137975074471896,0.9950992581462096,20.152677300033268
+naiveMCTSAI,20.932758765019557,0.9990214832647916,17.93569431522518
+randomBiasedAI,16.41033348962392,1.1908998345256558,12.837633986046953
+passiveAI,6.315587903310498,2.2292417057662304,-0.3721372139881929
+randomAI,5.9383357094823985,2.1599970738074847,-0.5416555119400552
diff --git a/experiments/gym-microrts-static-files/league.db b/experiments/gym-microrts-static-files/league.db
new file mode 100644
index 00000000..af88de47
Binary files /dev/null and b/experiments/gym-microrts-static-files/league.db differ
diff --git a/experiments/league.py b/experiments/league.py
index 66d02d32..b29e9f64 100644
--- a/experiments/league.py
+++ b/experiments/league.py
@@ -1,493 +1,493 @@
-# http://proceedings.mlr.press/v97/han19a/han19a.pdf
-
-import argparse
-import datetime
-import itertools
-import os
-import random
-import shutil
-import uuid
-from distutils.util import strtobool
-from enum import Enum
-
-import numpy as np
-import pandas as pd
-import torch
-from peewee import (
- JOIN,
- CharField,
- DateTimeField,
- FloatField,
- ForeignKeyField,
- Model,
- SmallIntegerField,
- SqliteDatabase,
- fn,
-)
-from stable_baselines3.common.vec_env import VecMonitor
-from trueskill import Rating, quality_1vs1, rate_1vs1
-
-from gym_microrts import microrts_ai # fmt: off
-
-torch.set_num_threads(1)
-
-
-def parse_args():
- # fmt: off
- parser = argparse.ArgumentParser()
- parser.add_argument('--exp-name', type=str, default=os.path.basename(__file__).rstrip(".py"),
- help='the name of this experiment')
- parser.add_argument('--prod-mode', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='run the script in production mode and use wandb to log outputs')
- parser.add_argument('--wandb-project-name', type=str, default="cleanRL",
- help="the wandb's project name")
- parser.add_argument('--wandb-entity', type=str, default=None,
- help="the entity (team) of wandb's project")
-
- parser.add_argument('--partial-obs', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='if toggled, the game will have partial observability')
- parser.add_argument('--evals', nargs='+', default=["randomBiasedAI","workerRushAI","lightRushAI", "coacAI"], # [],
- help='the ais')
- parser.add_argument('--num-matches', type=int, default=10,
- help='seed of the experiment')
- parser.add_argument('--update-db', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help='if toggled, the database will be updated')
- parser.add_argument('--cuda', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help='if toggled, cuda will not be enabled by default')
- parser.add_argument('--highest-sigma', type=float, default=1.4,
- help='the highest sigma of the trueskill evaluation')
- parser.add_argument('--output-path', type=str, default=f"league.temp.csv",
- help='the output path of the leaderboard csv')
- parser.add_argument('--model-type', type=str, default=f"ppo_gridnet_large", choices=["ppo_gridnet_large", "ppo_gridnet"],
- help='the output path of the leaderboard csv')
- parser.add_argument('--maps', nargs='+', default=["maps/16x16/basesWorkers16x16A.xml"],
- help="the maps to do trueskill evaluations")
- # ["randomBiasedAI","workerRushAI","lightRushAI","coacAI"]
- # default=["randomBiasedAI","workerRushAI","lightRushAI","coacAI","randomAI","passiveAI","naiveMCTSAI","mixedBot","rojo","izanagi","tiamat","droplet","guidedRojoA3N"]
- args = parser.parse_args()
- # fmt: on
- return args
-
-
-args = parse_args()
-dbname = "league"
-if args.partial_obs:
- dbname = "po_league"
-dbpath = f"gym-microrts-static-files/{dbname}.db"
-csvpath = f"gym-microrts-static-files/{dbname}.csv"
-if not args.update_db:
- if not os.path.exists(f"gym-microrts-static-files/tmp"):
- os.makedirs(f"gym-microrts-static-files/tmp")
- tmp_dbpath = f"gym-microrts-static-files/tmp/{str(uuid.uuid4())}.db"
- shutil.copyfile(dbpath, tmp_dbpath)
- dbpath = tmp_dbpath
-db = SqliteDatabase(dbpath)
-
-if args.model_type == "ppo_gridnet_large":
- from ppo_gridnet_large import Agent, MicroRTSStatsRecorder
-
- from gym_microrts.envs.vec_env import MicroRTSBotVecEnv, MicroRTSGridModeVecEnv
-else:
- from ppo_gridnet import Agent, MicroRTSStatsRecorder
-
- from gym_microrts.envs.vec_env import MicroRTSBotVecEnv, MicroRTSGridModeVecEnv
-
-
-class BaseModel(Model):
- class Meta:
- database = db
-
-
-class AI(BaseModel):
- name = CharField(unique=True)
- mu = FloatField()
- sigma = FloatField()
- ai_type = CharField()
-
- def __str__(self):
- return f"🤖 {self.name} with N({round(self.mu, 3)}, {round(self.sigma, 3)})"
-
-
-class MatchHistory(BaseModel):
- challenger = ForeignKeyField(AI, backref="challenger_match_histories")
- defender = ForeignKeyField(AI, backref="defender_match_histories")
- win = SmallIntegerField()
- draw = SmallIntegerField()
- loss = SmallIntegerField()
- created_date = DateTimeField(default=datetime.datetime.now)
-
-
-db.connect()
-db.create_tables([AI, MatchHistory])
-
-
-class Outcome(Enum):
- WIN = 1
- DRAW = 0
- LOSS = -1
-
-
-class Match:
- def __init__(self, partial_obs: bool, match_up=None, map_path="maps/16x16/basesWorkers16x16A.xml"):
- # mode 0: rl-ai vs built-in-ai
- # mode 1: rl-ai vs rl-ai
- # mode 2: built-in-ai vs built-in-ai
-
- built_in_ais = None
- built_in_ais2 = None
- rl_ai = None
- rl_ai2 = None
- self.map_path = map_path
-
- # determine mode
- rl_ais = []
- built_in_ais = []
- for ai in match_up:
- if ai[-3:] == ".pt":
- rl_ais += [ai]
- else:
- built_in_ais += [ai]
- if len(rl_ais) == 1:
- mode = 0
- p0 = rl_ais[0]
- p1 = built_in_ais[0]
- rl_ai = p0
- built_in_ais = [eval(f"microrts_ai.{p1}")]
- elif len(rl_ais) == 2:
- mode = 1
- p0 = rl_ais[0]
- p1 = rl_ais[1]
- rl_ai = p0
- rl_ai2 = p1
- else:
- mode = 2
- p0 = built_in_ais[0]
- p1 = built_in_ais[1]
- built_in_ais = [eval(f"microrts_ai.{p0}")]
- built_in_ais2 = [eval(f"microrts_ai.{p1}")]
-
- self.p0, self.p1 = p0, p1
-
- self.mode = mode
- self.partial_obs = partial_obs
- self.built_in_ais = built_in_ais
- self.built_in_ais2 = built_in_ais2
- self.rl_ai = rl_ai
- self.rl_ai2 = rl_ai2
- self.device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
- max_steps = 5000
- if mode == 0:
- self.envs = MicroRTSGridModeVecEnv(
- num_bot_envs=len(built_in_ais),
- num_selfplay_envs=0,
- partial_obs=partial_obs,
- max_steps=max_steps,
- render_theme=2,
- ai2s=built_in_ais,
- map_paths=[map_path],
- reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
- autobuild=False,
- )
- self.agent = Agent(self.envs).to(self.device)
- self.agent.load_state_dict(torch.load(self.rl_ai, map_location=self.device))
- self.agent.eval()
- elif mode == 1:
- self.envs = MicroRTSGridModeVecEnv(
- num_bot_envs=0,
- num_selfplay_envs=2,
- partial_obs=partial_obs,
- max_steps=max_steps,
- render_theme=2,
- map_paths=[map_path],
- reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
- autobuild=False,
- )
- self.agent = Agent(self.envs).to(self.device)
- self.agent.load_state_dict(torch.load(self.rl_ai, map_location=self.device))
- self.agent.eval()
- self.agent2 = Agent(self.envs).to(self.device)
- self.agent2.load_state_dict(torch.load(self.rl_ai2, map_location=self.device))
- self.agent2.eval()
- else:
- self.envs = MicroRTSBotVecEnv(
- ai1s=built_in_ais,
- ai2s=built_in_ais2,
- max_steps=max_steps,
- render_theme=2,
- map_paths=[map_path],
- reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
- autobuild=False,
- )
- self.envs = MicroRTSStatsRecorder(self.envs)
- self.envs = VecMonitor(self.envs)
-
- def run(self, num_matches=7):
- if self.mode == 0:
- return self.run_m0(num_matches)
- elif self.mode == 1:
- return self.run_m1(num_matches)
- else:
- return self.run_m2(num_matches)
-
- def run_m0(self, num_matches):
- results = []
- 16 * 16
- next_obs = torch.Tensor(self.envs.reset()).to(self.device)
- while True:
- # self.envs.render()
- # ALGO LOGIC: put action logic here
- with torch.no_grad():
- mask = torch.tensor(np.array(self.envs.get_action_mask())).to(self.device)
- action, _, _, _, _ = self.agent.get_action_and_value(
- next_obs, envs=self.envs, invalid_action_masks=mask, device=self.device
- )
- try:
- next_obs, rs, ds, infos = self.envs.step(action.cpu().numpy().reshape(self.envs.num_envs, -1))
- next_obs = torch.Tensor(next_obs).to(self.device)
- except Exception as e:
- e.printStackTrace()
- raise
-
- for idx, info in enumerate(infos):
- if "episode" in info.keys():
- results += [info["microrts_stats"]["WinLossRewardFunction"]]
- if len(results) >= num_matches:
- return results
-
- def run_m1(self, num_matches):
- results = []
- 16 * 16
- next_obs = torch.Tensor(self.envs.reset()).to(self.device)
- while True:
- # self.envs.render()
- # ALGO LOGIC: put action logic here
- with torch.no_grad():
- mask = torch.tensor(np.array(self.envs.get_action_mask())).to(self.device)
-
- p1_obs = next_obs[::2]
- p2_obs = next_obs[1::2]
- p1_mask = mask[::2]
- p2_mask = mask[1::2]
-
- p1_action, _, _, _, _ = self.agent.get_action_and_value(
- p1_obs, envs=self.envs, invalid_action_masks=p1_mask, device=self.device
- )
- p2_action, _, _, _, _ = self.agent2.get_action_and_value(
- p2_obs, envs=self.envs, invalid_action_masks=p2_mask, device=self.device
- )
- action = torch.zeros((self.envs.num_envs, p2_action.shape[1], p2_action.shape[2]))
- action[::2] = p1_action
- action[1::2] = p2_action
-
- try:
- next_obs, rs, ds, infos = self.envs.step(action.cpu().numpy().reshape(self.envs.num_envs, -1))
- next_obs = torch.Tensor(next_obs).to(self.device)
- except Exception as e:
- e.printStackTrace()
- raise
-
- for idx, info in enumerate(infos):
- if "episode" in info.keys():
- results += [info["microrts_stats"]["WinLossRewardFunction"]]
- if len(results) >= num_matches:
- return results
-
- def run_m2(self, num_matches):
- results = []
- self.envs.reset()
- while True:
- # self.envs.render()
- # dummy actions
- next_obs, reward, done, infos = self.envs.step(
- [
- [
- [0, 0, 0, 0, 0, 0, 0, 0],
- [0, 0, 0, 0, 0, 0, 0, 0],
- ]
- ]
- )
- for idx, info in enumerate(infos):
- if "episode" in info.keys():
- results += [info["microrts_stats"]["WinLossRewardFunction"]]
- if len(results) >= num_matches:
- return results
-
-
-def get_ai_type(ai_name):
- if ai_name[-3:] == ".pt":
- return "rl_ai"
- else:
- return "built_in_ai"
-
-
-def get_match_history(ai_name):
- query = (
- MatchHistory.select(
- AI.name,
- fn.SUM(MatchHistory.win).alias("wins"),
- fn.SUM(MatchHistory.draw).alias("draws"),
- fn.SUM(MatchHistory.loss).alias("losss"),
- )
- .join(AI, JOIN.LEFT_OUTER, on=MatchHistory.defender)
- .group_by(MatchHistory.defender)
- .where(MatchHistory.challenger == AI.get(name=ai_name))
- )
- return pd.DataFrame(list(query.dicts()))
-
-
-def get_leaderboard():
- query = AI.select(
- AI.name,
- AI.mu,
- AI.sigma,
- (AI.mu - 3 * AI.sigma).alias("trueskill"),
- ).order_by((AI.mu - 3 * AI.sigma).desc())
- return pd.DataFrame(list(query.dicts()))
-
-
-def get_leaderboard_existing_ais(existing_ai_names):
- query = (
- AI.select(
- AI.name,
- AI.mu,
- AI.sigma,
- (AI.mu - 3 * AI.sigma).alias("trueskill"),
- )
- .where((AI.name.in_(existing_ai_names)))
- .order_by((AI.mu - 3 * AI.sigma).desc())
- )
- return pd.DataFrame(list(query.dicts()))
-
-
-if __name__ == "__main__":
- print(f"evaluation maps is", args.maps)
- existing_ai_names = [item.name for item in AI.select()]
- all_ai_names = set(existing_ai_names + args.evals)
-
- for ai_name in all_ai_names:
- ai = AI.get_or_none(name=ai_name)
- if ai is None:
- ai = AI(name=ai_name, mu=25.0, sigma=8.333333333333334, ai_type=get_ai_type(ai_name))
- ai.save()
-
- # case 1: initialize the league with round robin
- if len(existing_ai_names) == 0:
- match_ups = list(itertools.combinations(all_ai_names, 2))
- np.random.shuffle(match_ups)
- for idx in range(2): # switch player 1 and 2's starting locations
- for match_up in match_ups:
- if idx == 0:
- match_up = list(reversed(match_up))
-
- for index in range(len(args.maps)):
- m = Match(args.partial_obs, match_up, args.maps[index])
- challenger = AI.get_or_none(name=m.p0)
- defender = AI.get_or_none(name=m.p1)
-
- r = m.run(args.num_matches // 2)
- for item in r:
- drawn = False
- if item == Outcome.WIN.value:
- winner = challenger
- loser = defender
- elif item == Outcome.DRAW.value:
- drawn = True
- else:
- winner = defender
- loser = challenger
-
- print(f"{winner.name} {'draws' if drawn else 'wins'} {loser.name}")
-
- winner_rating, loser_rating = rate_1vs1(
- Rating(winner.mu, winner.sigma), Rating(loser.mu, loser.sigma), drawn=drawn
- )
-
- winner.mu, winner.sigma = winner_rating.mu, winner_rating.sigma
- loser.mu, loser.sigma = loser_rating.mu, loser_rating.sigma
- winner.save()
- loser.save()
-
- MatchHistory(
- challenger=challenger,
- defender=defender,
- win=int(item == 1),
- draw=int(item == 0),
- loss=int(item == -1),
- ).save()
- get_leaderboard().to_csv(csvpath, index=False)
-
- # case 2: new AIs
- else:
- leaderboard = get_leaderboard_existing_ais(existing_ai_names)
- new_ai_names = [ai_name for ai_name in args.evals if ai_name not in existing_ai_names]
- for new_ai_name in new_ai_names:
- ai = AI.get(name=new_ai_name)
-
- while ai.sigma > args.highest_sigma:
-
- match_qualities = []
- for ai2_name in leaderboard["name"]:
- opponent_ai = AI.get(name=ai2_name)
- if ai.name == opponent_ai.name:
- continue
- match_qualities += [[opponent_ai, quality_1vs1(ai, opponent_ai)]]
-
- # sort by quality
- match_qualities = sorted(match_qualities, key=lambda x: x[1], reverse=True)
- print("match_qualities[:3]", match_qualities[:3])
-
- # run a match if the quality of the opponent is high enough
- top_3_ai = [item[0] for item in match_qualities[:3]]
- opponent_ai = random.choice(top_3_ai)
- match_up = (ai.name, opponent_ai.name)
- match_quality = quality_1vs1(ai, opponent_ai)
- print(f"the match up is ({ai}, {opponent_ai}), quality is {round(match_quality, 4)}")
- winner = ai # dummy setting
- for idx in range(2): # switch player 1 and 2's starting locations
- if idx == 0:
- match_up = list(reversed(match_up))
-
- for index in range(len(args.maps)):
- m = Match(args.partial_obs, match_up, args.maps[index])
- challenger = AI.get_or_none(name=m.p0)
- defender = AI.get_or_none(name=m.p1)
-
- r = m.run(1)
- for item in r:
- drawn = False
- if item == Outcome.WIN.value:
- winner = challenger
- loser = defender
- elif item == Outcome.DRAW.value:
- drawn = True
- winner = defender
- loser = challenger
- else:
- winner = defender
- loser = challenger
- print(f"{winner.name} {'draws' if drawn else 'wins'} {loser.name}")
- winner_rating, loser_rating = rate_1vs1(
- Rating(winner.mu, winner.sigma), Rating(loser.mu, loser.sigma), drawn=drawn
- )
-
- # freeze existing AIs ratings
- if winner.name == ai.name:
- ai.mu, ai.sigma = winner_rating.mu, winner_rating.sigma
- ai.save()
- else:
- ai.mu, ai.sigma = loser_rating.mu, loser_rating.sigma
- ai.save()
- MatchHistory(
- challenger=challenger,
- defender=defender,
- win=int(item == 1),
- draw=int(item == 0),
- loss=int(item == -1),
- ).save()
-
- get_leaderboard().to_csv(args.output_path, index=False)
-
- print("=======================")
- print(get_leaderboard())
- if not args.update_db:
- os.remove(dbpath)
+# http://proceedings.mlr.press/v97/han19a/han19a.pdf
+
+import argparse
+import datetime
+import itertools
+import os
+import random
+import shutil
+import uuid
+from distutils.util import strtobool
+from enum import Enum
+
+import numpy as np
+import pandas as pd
+import torch
+from peewee import (
+ JOIN,
+ CharField,
+ DateTimeField,
+ FloatField,
+ ForeignKeyField,
+ Model,
+ SmallIntegerField,
+ SqliteDatabase,
+ fn,
+)
+from stable_baselines3.common.vec_env import VecMonitor
+from trueskill import Rating, quality_1vs1, rate_1vs1
+
+from gym_microrts import microrts_ai # fmt: off
+
+torch.set_num_threads(1)
+
+
+def parse_args():
+ # fmt: off
+ parser = argparse.ArgumentParser()
+    parser.add_argument('--exp-name', type=str, default=os.path.splitext(os.path.basename(__file__))[0],
+        help='the name of this experiment')
+ parser.add_argument('--prod-mode', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='run the script in production mode and use wandb to log outputs')
+ parser.add_argument('--wandb-project-name', type=str, default="cleanRL",
+ help="the wandb's project name")
+ parser.add_argument('--wandb-entity', type=str, default=None,
+ help="the entity (team) of wandb's project")
+
+ parser.add_argument('--partial-obs', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='if toggled, the game will have partial observability')
+    parser.add_argument('--evals', nargs='+', default=["randomBiasedAI", "workerRushAI", "lightRushAI", "coacAI"],
+        help='the AIs to evaluate: built-in AI names or RL model checkpoints ending in ".pt"')
+    parser.add_argument('--num-matches', type=int, default=10,
+        help='the number of matches to run for each match-up')
+ parser.add_argument('--update-db', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help='if toggled, the database will be updated')
+    parser.add_argument('--cuda', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+        help='if toggled, cuda will be enabled by default')
+ parser.add_argument('--highest-sigma', type=float, default=1.4,
+ help='the highest sigma of the trueskill evaluation')
+    parser.add_argument('--output-path', type=str, default="league.temp.csv",
+        help='the output path of the leaderboard csv')
+    parser.add_argument('--model-type', type=str, default="ppo_gridnet_large", choices=["ppo_gridnet_large", "ppo_gridnet"],
+        help='the model type of the RL agents being evaluated')
+ parser.add_argument('--maps', nargs='+', default=["maps/16x16/basesWorkers16x16A.xml"],
+ help="the maps to do trueskill evaluations")
+ # ["randomBiasedAI","workerRushAI","lightRushAI","coacAI"]
+ # default=["randomBiasedAI","workerRushAI","lightRushAI","coacAI","randomAI","passiveAI","naiveMCTSAI","mixedBot","rojo","izanagi","tiamat","droplet","guidedRojoA3N"]
+ args = parser.parse_args()
+ # fmt: on
+ return args
+
+
+args = parse_args()
+dbname = "league"
+if args.partial_obs:
+ dbname = "po_league"
+dbpath = f"gym-microrts-static-files/{dbname}.db"
+csvpath = f"gym-microrts-static-files/{dbname}.csv"
+if not args.update_db:
+    os.makedirs("gym-microrts-static-files/tmp", exist_ok=True)
+    tmp_dbpath = f"gym-microrts-static-files/tmp/{uuid.uuid4()}.db"
+ shutil.copyfile(dbpath, tmp_dbpath)
+ dbpath = tmp_dbpath
+db = SqliteDatabase(dbpath)
+
+if args.model_type == "ppo_gridnet_large":
+    from ppo_gridnet_large import Agent, MicroRTSStatsRecorder
+else:
+    from ppo_gridnet import Agent, MicroRTSStatsRecorder
+
+from gym_microrts.envs.vec_env import MicroRTSBotVecEnv, MicroRTSGridModeVecEnv
+
+
+class BaseModel(Model):
+ class Meta:
+ database = db
+
+
+class AI(BaseModel):
+ name = CharField(unique=True)
+ mu = FloatField()
+ sigma = FloatField()
+ ai_type = CharField()
+
+ def __str__(self):
+ return f"🤖 {self.name} with N({round(self.mu, 3)}, {round(self.sigma, 3)})"
+
+
+class MatchHistory(BaseModel):
+ challenger = ForeignKeyField(AI, backref="challenger_match_histories")
+ defender = ForeignKeyField(AI, backref="defender_match_histories")
+ win = SmallIntegerField()
+ draw = SmallIntegerField()
+ loss = SmallIntegerField()
+ created_date = DateTimeField(default=datetime.datetime.now)
+
+
+db.connect()
+db.create_tables([AI, MatchHistory])
+
+
+class Outcome(Enum):
+ WIN = 1
+ DRAW = 0
+ LOSS = -1
+
+
+class Match:
+ def __init__(self, partial_obs: bool, match_up=None, map_path="maps/16x16/basesWorkers16x16A.xml"):
+ # mode 0: rl-ai vs built-in-ai
+ # mode 1: rl-ai vs rl-ai
+ # mode 2: built-in-ai vs built-in-ai
+
+ built_in_ais = None
+ built_in_ais2 = None
+ rl_ai = None
+ rl_ai2 = None
+ self.map_path = map_path
+
+ # determine mode
+ rl_ais = []
+ built_in_ais = []
+ for ai in match_up:
+ if ai[-3:] == ".pt":
+ rl_ais += [ai]
+ else:
+ built_in_ais += [ai]
+ if len(rl_ais) == 1:
+ mode = 0
+ p0 = rl_ais[0]
+ p1 = built_in_ais[0]
+ rl_ai = p0
+            built_in_ais = [getattr(microrts_ai, p1)]
+ elif len(rl_ais) == 2:
+ mode = 1
+ p0 = rl_ais[0]
+ p1 = rl_ais[1]
+ rl_ai = p0
+ rl_ai2 = p1
+ else:
+ mode = 2
+ p0 = built_in_ais[0]
+ p1 = built_in_ais[1]
+            built_in_ais = [getattr(microrts_ai, p0)]
+            built_in_ais2 = [getattr(microrts_ai, p1)]
+
+ self.p0, self.p1 = p0, p1
+
+ self.mode = mode
+ self.partial_obs = partial_obs
+ self.built_in_ais = built_in_ais
+ self.built_in_ais2 = built_in_ais2
+ self.rl_ai = rl_ai
+ self.rl_ai2 = rl_ai2
+ self.device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
+ max_steps = 5000
+ if mode == 0:
+ self.envs = MicroRTSGridModeVecEnv(
+ num_bot_envs=len(built_in_ais),
+ num_selfplay_envs=0,
+ partial_obs=partial_obs,
+ max_steps=max_steps,
+ render_theme=2,
+ ai2s=built_in_ais,
+ map_paths=[map_path],
+ reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
+ autobuild=False,
+ )
+ self.agent = Agent(self.envs).to(self.device)
+ self.agent.load_state_dict(torch.load(self.rl_ai, map_location=self.device))
+ self.agent.eval()
+ elif mode == 1:
+ self.envs = MicroRTSGridModeVecEnv(
+ num_bot_envs=0,
+ num_selfplay_envs=2,
+ partial_obs=partial_obs,
+ max_steps=max_steps,
+ render_theme=2,
+ map_paths=[map_path],
+ reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
+ autobuild=False,
+ )
+ self.agent = Agent(self.envs).to(self.device)
+ self.agent.load_state_dict(torch.load(self.rl_ai, map_location=self.device))
+ self.agent.eval()
+ self.agent2 = Agent(self.envs).to(self.device)
+ self.agent2.load_state_dict(torch.load(self.rl_ai2, map_location=self.device))
+ self.agent2.eval()
+ else:
+ self.envs = MicroRTSBotVecEnv(
+ ai1s=built_in_ais,
+ ai2s=built_in_ais2,
+ max_steps=max_steps,
+ render_theme=2,
+ map_paths=[map_path],
+ reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
+ autobuild=False,
+ )
+ self.envs = MicroRTSStatsRecorder(self.envs)
+ self.envs = VecMonitor(self.envs)
+
+ def run(self, num_matches=7):
+ if self.mode == 0:
+ return self.run_m0(num_matches)
+ elif self.mode == 1:
+ return self.run_m1(num_matches)
+ else:
+ return self.run_m2(num_matches)
+
+ def run_m0(self, num_matches):
+ results = []
+ next_obs = torch.Tensor(self.envs.reset()).to(self.device)
+ while True:
+ # self.envs.render()
+ # ALGO LOGIC: put action logic here
+ with torch.no_grad():
+ mask = torch.tensor(np.array(self.envs.get_action_mask())).to(self.device)
+ action, _, _, _, _ = self.agent.get_action_and_value(
+ next_obs, envs=self.envs, invalid_action_masks=mask, device=self.device
+ )
+ try:
+ next_obs, rs, ds, infos = self.envs.step(action.cpu().numpy().reshape(self.envs.num_envs, -1))
+ next_obs = torch.Tensor(next_obs).to(self.device)
+            except Exception as e:
+                print(e)
+                raise
+
+ for idx, info in enumerate(infos):
+ if "episode" in info.keys():
+ results += [info["microrts_stats"]["WinLossRewardFunction"]]
+ if len(results) >= num_matches:
+ return results
+
+ def run_m1(self, num_matches):
+ results = []
+ next_obs = torch.Tensor(self.envs.reset()).to(self.device)
+ while True:
+ # self.envs.render()
+ # ALGO LOGIC: put action logic here
+ with torch.no_grad():
+ mask = torch.tensor(np.array(self.envs.get_action_mask())).to(self.device)
+
+ p1_obs = next_obs[::2]
+ p2_obs = next_obs[1::2]
+ p1_mask = mask[::2]
+ p2_mask = mask[1::2]
+
+ p1_action, _, _, _, _ = self.agent.get_action_and_value(
+ p1_obs, envs=self.envs, invalid_action_masks=p1_mask, device=self.device
+ )
+ p2_action, _, _, _, _ = self.agent2.get_action_and_value(
+ p2_obs, envs=self.envs, invalid_action_masks=p2_mask, device=self.device
+ )
+ action = torch.zeros((self.envs.num_envs, p2_action.shape[1], p2_action.shape[2]))
+ action[::2] = p1_action
+ action[1::2] = p2_action
+
+ try:
+ next_obs, rs, ds, infos = self.envs.step(action.cpu().numpy().reshape(self.envs.num_envs, -1))
+ next_obs = torch.Tensor(next_obs).to(self.device)
+            except Exception as e:
+                print(e)
+                raise
+
+ for idx, info in enumerate(infos):
+ if "episode" in info.keys():
+ results += [info["microrts_stats"]["WinLossRewardFunction"]]
+ if len(results) >= num_matches:
+ return results
+
+ def run_m2(self, num_matches):
+ results = []
+ self.envs.reset()
+ while True:
+ # self.envs.render()
+ # dummy actions
+ next_obs, reward, done, infos = self.envs.step(
+ [
+ [
+ [0, 0, 0, 0, 0, 0, 0, 0],
+ [0, 0, 0, 0, 0, 0, 0, 0],
+ ]
+ ]
+ )
+ for idx, info in enumerate(infos):
+ if "episode" in info.keys():
+ results += [info["microrts_stats"]["WinLossRewardFunction"]]
+ if len(results) >= num_matches:
+ return results
+
+
+def get_ai_type(ai_name):
+ if ai_name[-3:] == ".pt":
+ return "rl_ai"
+ else:
+ return "built_in_ai"
+
+
+def get_match_history(ai_name):
+ query = (
+ MatchHistory.select(
+ AI.name,
+ fn.SUM(MatchHistory.win).alias("wins"),
+ fn.SUM(MatchHistory.draw).alias("draws"),
+            fn.SUM(MatchHistory.loss).alias("losses"),
+ )
+ .join(AI, JOIN.LEFT_OUTER, on=MatchHistory.defender)
+ .group_by(MatchHistory.defender)
+ .where(MatchHistory.challenger == AI.get(name=ai_name))
+ )
+ return pd.DataFrame(list(query.dicts()))
+
+
+def get_leaderboard():
+ query = AI.select(
+ AI.name,
+ AI.mu,
+ AI.sigma,
+ (AI.mu - 3 * AI.sigma).alias("trueskill"),
+ ).order_by((AI.mu - 3 * AI.sigma).desc())
+ return pd.DataFrame(list(query.dicts()))
+
+
+def get_leaderboard_existing_ais(existing_ai_names):
+ query = (
+ AI.select(
+ AI.name,
+ AI.mu,
+ AI.sigma,
+ (AI.mu - 3 * AI.sigma).alias("trueskill"),
+ )
+ .where((AI.name.in_(existing_ai_names)))
+ .order_by((AI.mu - 3 * AI.sigma).desc())
+ )
+ return pd.DataFrame(list(query.dicts()))
+
+
+if __name__ == "__main__":
+    print("evaluation maps:", args.maps)
+ existing_ai_names = [item.name for item in AI.select()]
+ all_ai_names = set(existing_ai_names + args.evals)
+
+ for ai_name in all_ai_names:
+ ai = AI.get_or_none(name=ai_name)
+ if ai is None:
+ ai = AI(name=ai_name, mu=25.0, sigma=8.333333333333334, ai_type=get_ai_type(ai_name))
+ ai.save()
+
+ # case 1: initialize the league with round robin
+ if len(existing_ai_names) == 0:
+ match_ups = list(itertools.combinations(all_ai_names, 2))
+ np.random.shuffle(match_ups)
+ for idx in range(2): # switch player 1 and 2's starting locations
+ for match_up in match_ups:
+ if idx == 0:
+ match_up = list(reversed(match_up))
+
+ for index in range(len(args.maps)):
+ m = Match(args.partial_obs, match_up, args.maps[index])
+ challenger = AI.get_or_none(name=m.p0)
+ defender = AI.get_or_none(name=m.p1)
+
+ r = m.run(args.num_matches // 2)
+ for item in r:
+ drawn = False
+                        if item == Outcome.WIN.value:
+                            winner = challenger
+                            loser = defender
+                        elif item == Outcome.DRAW.value:
+                            drawn = True
+                            winner = challenger
+                            loser = defender
+                        else:
+                            winner = defender
+                            loser = challenger
+
+ print(f"{winner.name} {'draws' if drawn else 'wins'} {loser.name}")
+
+ winner_rating, loser_rating = rate_1vs1(
+ Rating(winner.mu, winner.sigma), Rating(loser.mu, loser.sigma), drawn=drawn
+ )
+
+ winner.mu, winner.sigma = winner_rating.mu, winner_rating.sigma
+ loser.mu, loser.sigma = loser_rating.mu, loser_rating.sigma
+ winner.save()
+ loser.save()
+
+ MatchHistory(
+ challenger=challenger,
+ defender=defender,
+ win=int(item == 1),
+ draw=int(item == 0),
+ loss=int(item == -1),
+ ).save()
+ get_leaderboard().to_csv(csvpath, index=False)
+
+ # case 2: new AIs
+ else:
+ leaderboard = get_leaderboard_existing_ais(existing_ai_names)
+ new_ai_names = [ai_name for ai_name in args.evals if ai_name not in existing_ai_names]
+ for new_ai_name in new_ai_names:
+ ai = AI.get(name=new_ai_name)
+
+ while ai.sigma > args.highest_sigma:
+
+ match_qualities = []
+ for ai2_name in leaderboard["name"]:
+ opponent_ai = AI.get(name=ai2_name)
+ if ai.name == opponent_ai.name:
+ continue
+ match_qualities += [[opponent_ai, quality_1vs1(ai, opponent_ai)]]
+
+ # sort by quality
+ match_qualities = sorted(match_qualities, key=lambda x: x[1], reverse=True)
+ print("match_qualities[:3]", match_qualities[:3])
+
+ # run a match if the quality of the opponent is high enough
+ top_3_ai = [item[0] for item in match_qualities[:3]]
+ opponent_ai = random.choice(top_3_ai)
+ match_up = (ai.name, opponent_ai.name)
+ match_quality = quality_1vs1(ai, opponent_ai)
+ print(f"the match up is ({ai}, {opponent_ai}), quality is {round(match_quality, 4)}")
+ winner = ai # dummy setting
+ for idx in range(2): # switch player 1 and 2's starting locations
+ if idx == 0:
+ match_up = list(reversed(match_up))
+
+ for index in range(len(args.maps)):
+ m = Match(args.partial_obs, match_up, args.maps[index])
+ challenger = AI.get_or_none(name=m.p0)
+ defender = AI.get_or_none(name=m.p1)
+
+ r = m.run(1)
+ for item in r:
+ drawn = False
+ if item == Outcome.WIN.value:
+ winner = challenger
+ loser = defender
+ elif item == Outcome.DRAW.value:
+ drawn = True
+ winner = defender
+ loser = challenger
+ else:
+ winner = defender
+ loser = challenger
+ print(f"{winner.name} {'draws' if drawn else 'wins'} {loser.name}")
+ winner_rating, loser_rating = rate_1vs1(
+ Rating(winner.mu, winner.sigma), Rating(loser.mu, loser.sigma), drawn=drawn
+ )
+
+ # freeze existing AIs ratings
+ if winner.name == ai.name:
+ ai.mu, ai.sigma = winner_rating.mu, winner_rating.sigma
+ ai.save()
+ else:
+ ai.mu, ai.sigma = loser_rating.mu, loser_rating.sigma
+ ai.save()
+ MatchHistory(
+ challenger=challenger,
+ defender=defender,
+ win=int(item == 1),
+ draw=int(item == 0),
+ loss=int(item == -1),
+ ).save()
+
+ get_leaderboard().to_csv(args.output_path, index=False)
+
+ print("=======================")
+ print(get_leaderboard())
+ if not args.update_db:
+ os.remove(dbpath)
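The league script above seeds every AI at the TrueSkill prior (mu=25, sigma=25/3), classifies participants by filename suffix, and ranks them by the conservative estimate mu - 3*sigma. A minimal pure-Python sketch of that classification and ranking logic (hypothetical helper names and example ratings; the script itself uses the `trueskill` package and peewee models):

```python
def classify_ai(name: str) -> str:
    # Mirrors get_ai_type: checkpoint files ending in ".pt" are RL agents,
    # everything else is treated as a built-in microrts AI name.
    return "rl_ai" if name.endswith(".pt") else "built_in_ai"

def conservative_leaderboard(ratings: dict) -> list:
    # Rank by the conservative TrueSkill estimate mu - 3*sigma,
    # the same ordering get_leaderboard() computes in SQL.
    scored = [(name, mu - 3 * sigma) for name, (mu, sigma) in ratings.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical (mu, sigma) values for illustration only.
ratings = {
    "coacAI": (30.0, 1.2),
    "agent.pt": (28.0, 4.0),
    "workerRushAI": (26.0, 1.0),
}
board = conservative_leaderboard(ratings)
```

Note how a high-mu but high-sigma agent (`agent.pt`) ranks below better-established opponents until enough matches shrink its sigma, which is exactly why the script keeps matching a new AI until `ai.sigma` drops under `--highest-sigma`.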
diff --git a/experiments/ppo_gridnet.py b/experiments/ppo_gridnet.py
index 720d95bc..fd6cd8c2 100644
--- a/experiments/ppo_gridnet.py
+++ b/experiments/ppo_gridnet.py
@@ -1,566 +1,566 @@
-# http://proceedings.mlr.press/v97/han19a/han19a.pdf
-
-import argparse
-import os
-import random
-import subprocess
-import time
-from distutils.util import strtobool
-from typing import List
-
-import numpy as np
-import pandas as pd
-import torch
-import torch.nn as nn
-import torch.optim as optim
-from gym.spaces import MultiDiscrete
-from stable_baselines3.common.vec_env import VecEnvWrapper, VecMonitor, VecVideoRecorder
-from torch.distributions.categorical import Categorical
-from torch.utils.tensorboard import SummaryWriter
-
-from gym_microrts import microrts_ai
-from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
-
-
-def parse_args():
- # fmt: off
- parser = argparse.ArgumentParser()
- parser.add_argument('--exp-name', type=str, default=os.path.basename(__file__).rstrip(".py"),
- help='the name of this experiment')
- parser.add_argument('--gym-id', type=str, default="MicroRTSGridModeVecEnv",
- help='the id of the gym environment')
- parser.add_argument('--learning-rate', type=float, default=2.5e-4,
- help='the learning rate of the optimizer')
- parser.add_argument('--seed', type=int, default=1,
- help='seed of the experiment')
- parser.add_argument('--total-timesteps', type=int, default=50000000,
- help='total timesteps of the experiments')
- parser.add_argument('--torch-deterministic', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help='if toggled, `torch.backends.cudnn.deterministic=False`')
- parser.add_argument('--cuda', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help='if toggled, cuda will not be enabled by default')
- parser.add_argument('--prod-mode', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='run the script in production mode and use wandb to log outputs')
- parser.add_argument('--capture-video', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='whether to capture videos of the agent performances (check out `videos` folder)')
- parser.add_argument('--wandb-project-name', type=str, default="gym-microrts",
- help="the wandb's project name")
- parser.add_argument('--wandb-entity', type=str, default=None,
- help="the entity (team) of wandb's project")
-
- # Algorithm specific arguments
- parser.add_argument('--partial-obs', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='if toggled, the game will have partial observability')
- parser.add_argument('--n-minibatch', type=int, default=4,
- help='the number of mini batch')
- parser.add_argument('--num-bot-envs', type=int, default=0,
- help='the number of bot game environment; 16 bot envs means 16 games')
- parser.add_argument('--num-selfplay-envs', type=int, default=24,
- help='the number of self play envs; 16 self play envs means 8 games')
- parser.add_argument('--num-steps', type=int, default=256,
- help='the number of steps per game environment')
- parser.add_argument('--gamma', type=float, default=0.99,
- help='the discount factor gamma')
- parser.add_argument('--gae-lambda', type=float, default=0.95,
- help='the lambda for the general advantage estimation')
- parser.add_argument('--ent-coef', type=float, default=0.01,
- help="coefficient of the entropy")
- parser.add_argument('--vf-coef', type=float, default=0.5,
- help="coefficient of the value function")
- parser.add_argument('--max-grad-norm', type=float, default=0.5,
- help='the maximum norm for the gradient clipping')
- parser.add_argument('--clip-coef', type=float, default=0.1,
- help="the surrogate clipping coefficient")
- parser.add_argument('--update-epochs', type=int, default=4,
- help="the K epochs to update the policy")
- parser.add_argument('--kle-stop', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='If toggled, the policy updates will be early stopped w.r.t target-kl')
- parser.add_argument('--kle-rollback', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='If toggled, the policy updates will roll back to previous policy if KL exceeds target-kl')
- parser.add_argument('--target-kl', type=float, default=0.03,
- help='the target-kl variable that is referred by --kl')
- parser.add_argument('--gae', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help='Use GAE for advantage computation')
- parser.add_argument('--norm-adv', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help="Toggles advantages normalization")
- parser.add_argument('--anneal-lr', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help="Toggle learning rate annealing for policy and value networks")
- parser.add_argument('--clip-vloss', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help='Toggles whether or not to use a clipped loss for the value function, as per the paper.')
- parser.add_argument('--num-models', type=int, default=100,
- help='the number of models saved')
- parser.add_argument('--max-eval-workers', type=int, default=4,
- help='the maximum number of eval workers (skips evaluation when set to 0)')
- parser.add_argument('--train-maps', nargs='+', default=["maps/16x16/basesWorkers16x16A.xml"],
- help='the list of maps used during training')
- parser.add_argument('--eval-maps', nargs='+', default=["maps/16x16/basesWorkers16x16A.xml"],
- help='the list of maps used during evaluation')
-
- args = parser.parse_args()
- if not args.seed:
- args.seed = int(time.time())
- args.num_envs = args.num_selfplay_envs + args.num_bot_envs
- args.batch_size = int(args.num_envs * args.num_steps)
- args.minibatch_size = int(args.batch_size // args.n_minibatch)
- args.num_updates = args.total_timesteps // args.batch_size
- args.save_frequency = max(1, int(args.num_updates // args.num_models))
- # fmt: on
- return args
-
-
-class MicroRTSStatsRecorder(VecEnvWrapper):
- def __init__(self, env, gamma=0.99) -> None:
- super().__init__(env)
- self.gamma = gamma
-
- def reset(self):
- obs = self.venv.reset()
- self.raw_rewards = [[] for _ in range(self.num_envs)]
- self.ts = np.zeros(self.num_envs, dtype=np.float32)
- self.raw_discount_rewards = [[] for _ in range(self.num_envs)]
- return obs
-
- def step_wait(self):
- obs, rews, dones, infos = self.venv.step_wait()
- newinfos = list(infos[:])
- for i in range(len(dones)):
- self.raw_rewards[i] += [infos[i]["raw_rewards"]]
- self.raw_discount_rewards[i] += [
- (self.gamma ** self.ts[i])
- * np.concatenate((infos[i]["raw_rewards"], infos[i]["raw_rewards"].sum()), axis=None)
- ]
- self.ts[i] += 1
- if dones[i]:
- info = infos[i].copy()
- raw_returns = np.array(self.raw_rewards[i]).sum(0)
- raw_names = [str(rf) for rf in self.rfs]
- raw_discount_returns = np.array(self.raw_discount_rewards[i]).sum(0)
- raw_discount_names = ["discounted_" + str(rf) for rf in self.rfs] + ["discounted"]
- info["microrts_stats"] = dict(zip(raw_names, raw_returns))
- info["microrts_stats"].update(dict(zip(raw_discount_names, raw_discount_returns)))
- self.raw_rewards[i] = []
- self.raw_discount_rewards[i] = []
- self.ts[i] = 0
- newinfos[i] = info
- return obs, rews, dones, newinfos
-
-
-# ALGO LOGIC: initialize agent here:
-class CategoricalMasked(Categorical):
- def __init__(self, probs=None, logits=None, validate_args=None, masks=[], mask_value=None):
- logits = torch.where(masks.bool(), logits, mask_value)
- super(CategoricalMasked, self).__init__(probs, logits, validate_args)
-
-
-class Transpose(nn.Module):
- def __init__(self, permutation):
- super().__init__()
- self.permutation = permutation
-
- def forward(self, x):
- return x.permute(self.permutation)
-
-
-def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
- torch.nn.init.orthogonal_(layer.weight, std)
- torch.nn.init.constant_(layer.bias, bias_const)
- return layer
-
-
-class Agent(nn.Module):
- def __init__(self, envs, mapsize=16 * 16):
- super(Agent, self).__init__()
- self.mapsize = mapsize
- h, w, c = envs.observation_space.shape
- self.encoder = nn.Sequential(
- Transpose((0, 3, 1, 2)),
- layer_init(nn.Conv2d(c, 32, kernel_size=3, padding=1)),
- nn.MaxPool2d(3, stride=2, padding=1),
- nn.ReLU(),
- layer_init(nn.Conv2d(32, 64, kernel_size=3, padding=1)),
- nn.MaxPool2d(3, stride=2, padding=1),
- nn.ReLU(),
- )
-
- self.actor = nn.Sequential(
- layer_init(nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1)),
- nn.ReLU(),
- layer_init(nn.ConvTranspose2d(32, 78, 3, stride=2, padding=1, output_padding=1)),
- Transpose((0, 2, 3, 1)),
- )
- self.critic = nn.Sequential(
- nn.Flatten(),
- layer_init(nn.Linear(64 * 4 * 4, 128)),
- nn.ReLU(),
- layer_init(nn.Linear(128, 1), std=1),
- )
- self.register_buffer("mask_value", torch.tensor(-1e8))
-
- def get_action_and_value(self, x, action=None, invalid_action_masks=None, envs=None, device=None):
- hidden = self.encoder(x)
- logits = self.actor(hidden)
- grid_logits = logits.reshape(-1, envs.action_plane_space.nvec.sum())
- split_logits = torch.split(grid_logits, envs.action_plane_space.nvec.tolist(), dim=1)
-
- if action is None:
- invalid_action_masks = invalid_action_masks.view(-1, invalid_action_masks.shape[-1])
- split_invalid_action_masks = torch.split(invalid_action_masks, envs.action_plane_space.nvec.tolist(), dim=1)
- multi_categoricals = [
- CategoricalMasked(logits=logits, masks=iam, mask_value=self.mask_value)
- for (logits, iam) in zip(split_logits, split_invalid_action_masks)
- ]
- action = torch.stack([categorical.sample() for categorical in multi_categoricals])
- else:
- invalid_action_masks = invalid_action_masks.view(-1, invalid_action_masks.shape[-1])
- action = action.view(-1, action.shape[-1]).T
- split_invalid_action_masks = torch.split(invalid_action_masks, envs.action_plane_space.nvec.tolist(), dim=1)
- multi_categoricals = [
- CategoricalMasked(logits=logits, masks=iam, mask_value=self.mask_value)
- for (logits, iam) in zip(split_logits, split_invalid_action_masks)
- ]
- logprob = torch.stack([categorical.log_prob(a) for a, categorical in zip(action, multi_categoricals)])
- entropy = torch.stack([categorical.entropy() for categorical in multi_categoricals])
- num_predicted_parameters = len(envs.action_plane_space.nvec)
- logprob = logprob.T.view(-1, self.mapsize, num_predicted_parameters)
- entropy = entropy.T.view(-1, self.mapsize, num_predicted_parameters)
- action = action.T.view(-1, self.mapsize, num_predicted_parameters)
- return action, logprob.sum(1).sum(1), entropy.sum(1).sum(1), invalid_action_masks, self.critic(hidden)
-
- def get_value(self, x):
- return self.critic(self.encoder(x))
-
-
-def run_evaluation(model_path: str, output_path: str, eval_maps: List[str]):
- args = [
- "python",
- "league.py",
- "--evals",
- model_path,
- "--update-db",
- "false",
- "--cuda",
- "false",
- "--output-path",
- output_path,
- "--model-type",
- "ppo_gridnet",
- "--maps",
- *eval_maps,
- ]
- fd = subprocess.Popen(args)
- print(f"Evaluating {model_path}")
- return_code = fd.wait()
- assert return_code == 0
- return (model_path, output_path)
-
-
-class TrueskillWriter:
- def __init__(self, prod_mode, writer, league_path: str, league_step_path: str):
- self.prod_mode = prod_mode
- self.writer = writer
- self.trueskill_df = pd.read_csv(league_path)
- self.trueskill_step_df = pd.read_csv(league_step_path)
- self.trueskill_step_df["type"] = self.trueskill_step_df["name"]
- self.trueskill_step_df["step"] = 0
- # xxx(okachaiev): not sure we need this copy
- self.preset_trueskill_step_df = self.trueskill_step_df.copy()
-
- def on_evaluation_done(self, future):
- if future.cancelled():
- return
- model_path, output_path = future.result()
- league = pd.read_csv(output_path, index_col="name")
- assert model_path in league.index
- model_global_step = int(model_path.split("/")[-1][:-3])
- self.writer.add_scalar("charts/trueskill", league.loc[model_path]["trueskill"], model_global_step)
- print(f"global_step={model_global_step}, trueskill={league.loc[model_path]['trueskill']}")
-
- # table visualization logic
- if self.prod_mode:
- trueskill_data = {
- "name": league.loc[model_path].name,
- "mu": league.loc[model_path]["mu"],
- "sigma": league.loc[model_path]["sigma"],
- "trueskill": league.loc[model_path]["trueskill"],
- }
- self.trueskill_df = self.trueskill_df.append(trueskill_data, ignore_index=True)
- wandb.log({"trueskill": wandb.Table(dataframe=self.trueskill_df)})
- trueskill_data["type"] = "training"
- trueskill_data["step"] = model_global_step
- self.trueskill_step_df = self.trueskill_step_df.append(trueskill_data, ignore_index=True)
- preset_trueskill_step_df_clone = self.preset_trueskill_step_df.copy()
- preset_trueskill_step_df_clone["step"] = model_global_step
- self.trueskill_step_df = self.trueskill_step_df.append(preset_trueskill_step_df_clone, ignore_index=True)
- wandb.log({"trueskill_step": wandb.Table(dataframe=self.trueskill_step_df)})
-
-
-if __name__ == "__main__":
- args = parse_args()
-
- print(f"Save frequency: {args.save_frequency}")
-
- # TRY NOT TO MODIFY: setup the environment
- experiment_name = f"{args.gym_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
- if args.prod_mode:
- import wandb
-
- run = wandb.init(
- project=args.wandb_project_name,
- entity=args.wandb_entity,
- # sync_tensorboard=True,
- config=vars(args),
- name=experiment_name,
- monitor_gym=True,
- save_code=True,
- )
- wandb.tensorboard.patch(save=False)
- writer = SummaryWriter(f"runs/{experiment_name}")
- writer.add_text(
- "hyperparameters", "|param|value|\n|-|-|\n%s" % ("\n".join([f"|{key}|{value}|" for key, value in vars(args).items()]))
- )
-
- # TRY NOT TO MODIFY: seeding
- device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
-
- print(f"Device: {device}")
-
- random.seed(args.seed)
- np.random.seed(args.seed)
- torch.manual_seed(args.seed)
- torch.backends.cudnn.deterministic = args.torch_deterministic
- envs = MicroRTSGridModeVecEnv(
- num_selfplay_envs=args.num_selfplay_envs,
- num_bot_envs=args.num_bot_envs,
- partial_obs=args.partial_obs,
- max_steps=2000,
- render_theme=2,
- ai2s=[microrts_ai.coacAI for _ in range(args.num_bot_envs - 6)]
- + [microrts_ai.randomBiasedAI for _ in range(min(args.num_bot_envs, 2))]
- + [microrts_ai.lightRushAI for _ in range(min(args.num_bot_envs, 2))]
- + [microrts_ai.workerRushAI for _ in range(min(args.num_bot_envs, 2))],
- map_paths=[args.train_maps[0]],
- reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
- cycle_maps=args.train_maps,
- )
- envs = MicroRTSStatsRecorder(envs, args.gamma)
- envs = VecMonitor(envs)
- if args.capture_video:
- envs = VecVideoRecorder(
- envs, f"videos/{experiment_name}", record_video_trigger=lambda x: x % 100000 == 0, video_length=2000
- )
- assert isinstance(envs.action_space, MultiDiscrete), "only MultiDiscrete action space is supported"
-
- eval_executor = None
- if args.max_eval_workers > 0:
- from concurrent.futures import ThreadPoolExecutor
-
- eval_executor = ThreadPoolExecutor(max_workers=args.max_eval_workers, thread_name_prefix="league-eval-")
-
- agent = Agent(envs).to(device)
- optimizer = optim.Adam(agent.parameters(), lr=args.learning_rate, eps=1e-5)
- if args.anneal_lr:
- # https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/defaults.py#L20
- lr = lambda f: f * args.learning_rate
-
- # ALGO Logic: Storage for epoch data
- mapsize = 16 * 16
- action_space_shape = (mapsize, len(envs.action_plane_space.nvec))
- invalid_action_shape = (mapsize, envs.action_plane_space.nvec.sum())
-
- obs = torch.zeros((args.num_steps, args.num_envs) + envs.observation_space.shape).to(device)
- actions = torch.zeros((args.num_steps, args.num_envs) + action_space_shape).to(device)
- logprobs = torch.zeros((args.num_steps, args.num_envs)).to(device)
- rewards = torch.zeros((args.num_steps, args.num_envs)).to(device)
- dones = torch.zeros((args.num_steps, args.num_envs)).to(device)
- values = torch.zeros((args.num_steps, args.num_envs)).to(device)
- invalid_action_masks = torch.zeros((args.num_steps, args.num_envs) + invalid_action_shape).to(device)
- # TRY NOT TO MODIFY: start the game
- global_step = 0
- start_time = time.time()
- # Note how `next_obs` and `next_done` are used; their usage is equivalent to
- # https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/84a7582477fb0d5c82ad6d850fe476829dddd2e1/a2c_ppo_acktr/storage.py#L60
- next_obs = torch.Tensor(envs.reset()).to(device)
- next_done = torch.zeros(args.num_envs).to(device)
-
- # CRASH AND RESUME LOGIC:
- starting_update = 1
-
- if args.prod_mode and wandb.run.resumed:
- starting_update = run.summary.get("charts/update") + 1
- global_step = starting_update * args.batch_size
- api = wandb.Api()
- run = api.run(f"{run.entity}/{run.project}/{run.id}")
- model = run.file("agent.pt")
- model.download(f"models/{experiment_name}/")
- agent.load_state_dict(torch.load(f"models/{experiment_name}/agent.pt", map_location=device))
- agent.eval()
- print(f"resumed at update {starting_update}")
-
- print("Model's state_dict:")
- for param_tensor in agent.state_dict():
- print(param_tensor, "\t", agent.state_dict()[param_tensor].size())
- total_params = sum([param.nelement() for param in agent.parameters()])
- print("Model's total parameters:", total_params)
-
- # EVALUATION LOGIC:
- trueskill_writer = TrueskillWriter(
- args.prod_mode, writer, "gym-microrts-static-files/league.csv", "gym-microrts-static-files/league.csv"
- )
-
- for update in range(starting_update, args.num_updates + 1):
- # Annealing the rate if instructed to do so.
- if args.anneal_lr:
- frac = 1.0 - (update - 1.0) / args.num_updates
- lrnow = lr(frac)
- optimizer.param_groups[0]["lr"] = lrnow
-
- # TRY NOT TO MODIFY: prepare the execution of the game.
- for step in range(0, args.num_steps):
- # envs.render()
- global_step += 1 * args.num_envs
- obs[step] = next_obs
- dones[step] = next_done
- # ALGO LOGIC: put action logic here
- with torch.no_grad():
- invalid_action_masks[step] = torch.tensor(envs.get_action_mask()).to(device)
- action, logproba, _, _, vs = agent.get_action_and_value(
- next_obs, envs=envs, invalid_action_masks=invalid_action_masks[step], device=device
- )
- values[step] = vs.flatten()
-
- actions[step] = action
- logprobs[step] = logproba
- try:
- next_obs, rs, ds, infos = envs.step(action.cpu().numpy().reshape(envs.num_envs, -1))
- next_obs = torch.Tensor(next_obs).to(device)
- except Exception as e:
- e.printStackTrace()
- raise
- rewards[step], next_done = torch.Tensor(rs).to(device), torch.Tensor(ds).to(device)
-
- for info in infos:
- if "episode" in info.keys():
- print(f"global_step={global_step}, episodic_return={info['episode']['r']}")
- writer.add_scalar("charts/episodic_return", info["episode"]["r"], global_step)
- writer.add_scalar("charts/episodic_length", info["episode"]["l"], global_step)
- for key in info["microrts_stats"]:
- writer.add_scalar(f"charts/episodic_return/{key}", info["microrts_stats"][key], global_step)
- break
-
- # bootstrap reward if not done. reached the batch limit
- with torch.no_grad():
- last_value = agent.get_value(next_obs).reshape(1, -1)
- if args.gae:
- advantages = torch.zeros_like(rewards).to(device)
- lastgaelam = 0
- for t in reversed(range(args.num_steps)):
- if t == args.num_steps - 1:
- nextnonterminal = 1.0 - next_done
- nextvalues = last_value
- else:
- nextnonterminal = 1.0 - dones[t + 1]
- nextvalues = values[t + 1]
- delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
- advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
- returns = advantages + values
- else:
- returns = torch.zeros_like(rewards).to(device)
- for t in reversed(range(args.num_steps)):
- if t == args.num_steps - 1:
- nextnonterminal = 1.0 - next_done
- next_return = last_value
- else:
- nextnonterminal = 1.0 - dones[t + 1]
- next_return = returns[t + 1]
- returns[t] = rewards[t] + args.gamma * nextnonterminal * next_return
- advantages = returns - values
-
- # flatten the batch
- b_obs = obs.reshape((-1,) + envs.observation_space.shape)
- b_logprobs = logprobs.reshape(-1)
- b_actions = actions.reshape((-1,) + action_space_shape)
- b_advantages = advantages.reshape(-1)
- b_returns = returns.reshape(-1)
- b_values = values.reshape(-1)
- b_invalid_action_masks = invalid_action_masks.reshape((-1,) + invalid_action_shape)
-
- # Optimizing the policy and value network
- inds = np.arange(
- args.batch_size,
- )
- for i_epoch_pi in range(args.update_epochs):
- np.random.shuffle(inds)
- for start in range(0, args.batch_size, args.minibatch_size):
- end = start + args.minibatch_size
- minibatch_ind = inds[start:end]
- mb_advantages = b_advantages[minibatch_ind]
- if args.norm_adv:
- mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + 1e-8)
- _, newlogproba, entropy, _, new_values = agent.get_action_and_value(
- b_obs[minibatch_ind], b_actions.long()[minibatch_ind], b_invalid_action_masks[minibatch_ind], envs, device
- )
- ratio = (newlogproba - b_logprobs[minibatch_ind]).exp()
-
- # Stats
- approx_kl = (b_logprobs[minibatch_ind] - newlogproba).mean()
-
- # Policy loss
- pg_loss1 = -mb_advantages * ratio
- pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)
- pg_loss = torch.max(pg_loss1, pg_loss2).mean()
- entropy_loss = entropy.mean()
-
- # Value loss
- new_values = new_values.view(-1)
- if args.clip_vloss:
- v_loss_unclipped = (new_values - b_returns[minibatch_ind]) ** 2
- v_clipped = b_values[minibatch_ind] + torch.clamp(
- new_values - b_values[minibatch_ind], -args.clip_coef, args.clip_coef
- )
- v_loss_clipped = (v_clipped - b_returns[minibatch_ind]) ** 2
- v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)
- v_loss = 0.5 * v_loss_max.mean()
- else:
- v_loss = 0.5 * ((new_values - b_returns[minibatch_ind]) ** 2).mean()
-
- loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef
-
- optimizer.zero_grad()
- loss.backward()
- nn.utils.clip_grad_norm_(agent.parameters(), args.max_grad_norm)
- optimizer.step()
-
- if (update - 1) % args.save_frequency == 0:
- if not os.path.exists(f"models/{experiment_name}"):
- os.makedirs(f"models/{experiment_name}")
- torch.save(agent.state_dict(), f"models/{experiment_name}/agent.pt")
- torch.save(agent.state_dict(), f"models/{experiment_name}/{global_step}.pt")
- if args.prod_mode:
- wandb.save(f"models/{experiment_name}/agent.pt", base_path=f"models/{experiment_name}", policy="now")
- if eval_executor is not None:
- future = eval_executor.submit(
- run_evaluation,
- f"models/{experiment_name}/{global_step}.pt",
- f"runs/{experiment_name}/{global_step}.csv",
- args.eval_maps,
- )
- print(f"Queued models/{experiment_name}/{global_step}.pt")
- future.add_done_callback(trueskill_writer.on_evaluation_done)
-
- # TRY NOT TO MODIFY: record rewards for plotting purposes
- writer.add_scalar("charts/learning_rate", optimizer.param_groups[0]["lr"], global_step)
- writer.add_scalar("charts/update", update, global_step)
- writer.add_scalar("losses/value_loss", v_loss.detach().item(), global_step)
- writer.add_scalar("losses/policy_loss", pg_loss.detach().item(), global_step)
- writer.add_scalar("losses/entropy", entropy.detach().mean().item(), global_step)
- writer.add_scalar("losses/approx_kl", approx_kl.detach().item(), global_step)
- if args.kle_stop or args.kle_rollback:
- writer.add_scalar("debug/pg_stop_iter", i_epoch_pi, global_step)
- writer.add_scalar("charts/sps", int(global_step / (time.time() - start_time)), global_step)
- print("SPS:", int(global_step / (time.time() - start_time)))
-
- if eval_executor is not None:
- # shutdown pool of threads but make sure we finished scheduled evaluations
- eval_executor.shutdown(wait=True, cancel_futures=False)
- envs.close()
- writer.close()
+# http://proceedings.mlr.press/v97/han19a/han19a.pdf
+
+import argparse
+import os
+import random
+import subprocess
+import time
+from distutils.util import strtobool
+from typing import List
+
+import numpy as np
+import pandas as pd
+import torch
+import torch.nn as nn
+import torch.optim as optim
+from gym.spaces import MultiDiscrete
+from stable_baselines3.common.vec_env import VecEnvWrapper, VecMonitor, VecVideoRecorder
+from torch.distributions.categorical import Categorical
+from torch.utils.tensorboard import SummaryWriter
+
+from gym_microrts import microrts_ai
+from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
+
+
+def parse_args():
+ # fmt: off
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--exp-name', type=str, default=os.path.splitext(os.path.basename(__file__))[0],
+ help='the name of this experiment')
+ parser.add_argument('--gym-id', type=str, default="MicroRTSGridModeVecEnv",
+ help='the id of the gym environment')
+ parser.add_argument('--learning-rate', type=float, default=2.5e-4,
+ help='the learning rate of the optimizer')
+ parser.add_argument('--seed', type=int, default=1,
+ help='seed of the experiment')
+ parser.add_argument('--total-timesteps', type=int, default=50000000,
+ help='total timesteps of the experiments')
+ parser.add_argument('--torch-deterministic', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help='if toggled, `torch.backends.cudnn.deterministic=True` (for reproducibility)')
+ parser.add_argument('--cuda', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help='if toggled, cuda will be enabled by default')
+ parser.add_argument('--prod-mode', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='run the script in production mode and use wandb to log outputs')
+ parser.add_argument('--capture-video', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='whether to capture videos of the agent performances (check out `videos` folder)')
+ parser.add_argument('--wandb-project-name', type=str, default="gym-microrts",
+ help="the wandb's project name")
+ parser.add_argument('--wandb-entity', type=str, default=None,
+ help="the entity (team) of wandb's project")
+
+ # Algorithm specific arguments
+ parser.add_argument('--partial-obs', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='if toggled, the game will have partial observability')
+ parser.add_argument('--n-minibatch', type=int, default=4,
+ help='the number of minibatches')
+ parser.add_argument('--num-bot-envs', type=int, default=0,
+ help='the number of bot game environments; 16 bot envs means 16 games')
+ parser.add_argument('--num-selfplay-envs', type=int, default=24,
+ help='the number of self play envs; 16 self play envs means 8 games')
+ parser.add_argument('--num-steps', type=int, default=256,
+ help='the number of steps per game environment')
+ parser.add_argument('--gamma', type=float, default=0.99,
+ help='the discount factor gamma')
+ parser.add_argument('--gae-lambda', type=float, default=0.95,
+ help='the lambda for generalized advantage estimation')
+ parser.add_argument('--ent-coef', type=float, default=0.01,
+ help="coefficient of the entropy")
+ parser.add_argument('--vf-coef', type=float, default=0.5,
+ help="coefficient of the value function")
+ parser.add_argument('--max-grad-norm', type=float, default=0.5,
+ help='the maximum norm for the gradient clipping')
+ parser.add_argument('--clip-coef', type=float, default=0.1,
+ help="the surrogate clipping coefficient")
+ parser.add_argument('--update-epochs', type=int, default=4,
+ help="the K epochs to update the policy")
+ parser.add_argument('--kle-stop', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='If toggled, the policy updates will be early stopped w.r.t target-kl')
+ parser.add_argument('--kle-rollback', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='If toggled, the policy updates will roll back to previous policy if KL exceeds target-kl')
+ parser.add_argument('--target-kl', type=float, default=0.03,
+ help='the target KL divergence threshold used by --kle-stop and --kle-rollback')
+ parser.add_argument('--gae', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help='Use GAE for advantage computation')
+ parser.add_argument('--norm-adv', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help="Toggles advantages normalization")
+ parser.add_argument('--anneal-lr', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help="Toggle learning rate annealing for policy and value networks")
+ parser.add_argument('--clip-vloss', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help='Toggles whether or not to use a clipped loss for the value function, as per the paper.')
+ parser.add_argument('--num-models', type=int, default=100,
+ help='the number of models saved')
+ parser.add_argument('--max-eval-workers', type=int, default=4,
+ help='the maximum number of eval workers (skips evaluation when set to 0)')
+ parser.add_argument('--train-maps', nargs='+', default=["maps/16x16/basesWorkers16x16A.xml"],
+ help='the list of maps used during training')
+ parser.add_argument('--eval-maps', nargs='+', default=["maps/16x16/basesWorkers16x16A.xml"],
+ help='the list of maps used during evaluation')
+
+ args = parser.parse_args()
+ if not args.seed:
+ args.seed = int(time.time())
+ args.num_envs = args.num_selfplay_envs + args.num_bot_envs
+ args.batch_size = int(args.num_envs * args.num_steps)
+ args.minibatch_size = int(args.batch_size // args.n_minibatch)
+ args.num_updates = args.total_timesteps // args.batch_size
+ args.save_frequency = max(1, int(args.num_updates // args.num_models))
+ # fmt: on
+ return args
+
+
+class MicroRTSStatsRecorder(VecEnvWrapper):
+ def __init__(self, env, gamma=0.99) -> None:
+ super().__init__(env)
+ self.gamma = gamma
+
+ def reset(self):
+ obs = self.venv.reset()
+ self.raw_rewards = [[] for _ in range(self.num_envs)]
+ self.ts = np.zeros(self.num_envs, dtype=np.float32)
+ self.raw_discount_rewards = [[] for _ in range(self.num_envs)]
+ return obs
+
+ def step_wait(self):
+ obs, rews, dones, infos = self.venv.step_wait()
+ newinfos = list(infos[:])
+ for i in range(len(dones)):
+ self.raw_rewards[i] += [infos[i]["raw_rewards"]]
+ self.raw_discount_rewards[i] += [
+ (self.gamma ** self.ts[i])
+ * np.concatenate((infos[i]["raw_rewards"], infos[i]["raw_rewards"].sum()), axis=None)
+ ]
+ self.ts[i] += 1
+ if dones[i]:
+ info = infos[i].copy()
+ raw_returns = np.array(self.raw_rewards[i]).sum(0)
+ raw_names = [str(rf) for rf in self.rfs]
+ raw_discount_returns = np.array(self.raw_discount_rewards[i]).sum(0)
+ raw_discount_names = ["discounted_" + str(rf) for rf in self.rfs] + ["discounted"]
+ info["microrts_stats"] = dict(zip(raw_names, raw_returns))
+ info["microrts_stats"].update(dict(zip(raw_discount_names, raw_discount_returns)))
+ self.raw_rewards[i] = []
+ self.raw_discount_rewards[i] = []
+ self.ts[i] = 0
+ newinfos[i] = info
+ return obs, rews, dones, newinfos
+
+
+# ALGO LOGIC: initialize agent here:
+class CategoricalMasked(Categorical):
+ def __init__(self, probs=None, logits=None, validate_args=None, masks=None, mask_value=None):
+ # Mask out invalid actions by assigning them a large negative logit,
+ # so their probability after softmax is effectively zero.
+ if masks is not None:
+ logits = torch.where(masks.bool(), logits, mask_value)
+ super(CategoricalMasked, self).__init__(probs, logits, validate_args)
+
+
+class Transpose(nn.Module):
+ def __init__(self, permutation):
+ super().__init__()
+ self.permutation = permutation
+
+ def forward(self, x):
+ return x.permute(self.permutation)
+
+
+def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
+ torch.nn.init.orthogonal_(layer.weight, std)
+ torch.nn.init.constant_(layer.bias, bias_const)
+ return layer
+
+
+class Agent(nn.Module):
+ def __init__(self, envs, mapsize=16 * 16):
+ super(Agent, self).__init__()
+ self.mapsize = mapsize
+ h, w, c = envs.observation_space.shape
+ self.encoder = nn.Sequential(
+ Transpose((0, 3, 1, 2)),
+ layer_init(nn.Conv2d(c, 32, kernel_size=3, padding=1)),
+ nn.MaxPool2d(3, stride=2, padding=1),
+ nn.ReLU(),
+ layer_init(nn.Conv2d(32, 64, kernel_size=3, padding=1)),
+ nn.MaxPool2d(3, stride=2, padding=1),
+ nn.ReLU(),
+ )
+
+ self.actor = nn.Sequential(
+ layer_init(nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1)),
+ nn.ReLU(),
+ layer_init(nn.ConvTranspose2d(32, 78, 3, stride=2, padding=1, output_padding=1)),
+ Transpose((0, 2, 3, 1)),
+ )
+ self.critic = nn.Sequential(
+ nn.Flatten(),
+ layer_init(nn.Linear(64 * 4 * 4, 128)),
+ nn.ReLU(),
+ layer_init(nn.Linear(128, 1), std=1),
+ )
+ self.register_buffer("mask_value", torch.tensor(-1e8))
+
+ def get_action_and_value(self, x, action=None, invalid_action_masks=None, envs=None, device=None):
+ hidden = self.encoder(x)
+ logits = self.actor(hidden)
+ grid_logits = logits.reshape(-1, envs.action_plane_space.nvec.sum())
+ split_logits = torch.split(grid_logits, envs.action_plane_space.nvec.tolist(), dim=1)
+
+ if action is None:
+ invalid_action_masks = invalid_action_masks.view(-1, invalid_action_masks.shape[-1])
+ split_invalid_action_masks = torch.split(invalid_action_masks, envs.action_plane_space.nvec.tolist(), dim=1)
+ multi_categoricals = [
+ CategoricalMasked(logits=logits, masks=iam, mask_value=self.mask_value)
+ for (logits, iam) in zip(split_logits, split_invalid_action_masks)
+ ]
+ action = torch.stack([categorical.sample() for categorical in multi_categoricals])
+ else:
+ invalid_action_masks = invalid_action_masks.view(-1, invalid_action_masks.shape[-1])
+ action = action.view(-1, action.shape[-1]).T
+ split_invalid_action_masks = torch.split(invalid_action_masks, envs.action_plane_space.nvec.tolist(), dim=1)
+ multi_categoricals = [
+ CategoricalMasked(logits=logits, masks=iam, mask_value=self.mask_value)
+ for (logits, iam) in zip(split_logits, split_invalid_action_masks)
+ ]
+ logprob = torch.stack([categorical.log_prob(a) for a, categorical in zip(action, multi_categoricals)])
+ entropy = torch.stack([categorical.entropy() for categorical in multi_categoricals])
+ num_predicted_parameters = len(envs.action_plane_space.nvec)
+ logprob = logprob.T.view(-1, self.mapsize, num_predicted_parameters)
+ entropy = entropy.T.view(-1, self.mapsize, num_predicted_parameters)
+ action = action.T.view(-1, self.mapsize, num_predicted_parameters)
+ return action, logprob.sum(1).sum(1), entropy.sum(1).sum(1), invalid_action_masks, self.critic(hidden)
+
+ def get_value(self, x):
+ return self.critic(self.encoder(x))
+
+
+def run_evaluation(model_path: str, output_path: str, eval_maps: List[str]):
+ args = [
+ "python",
+ "league.py",
+ "--evals",
+ model_path,
+ "--update-db",
+ "false",
+ "--cuda",
+ "false",
+ "--output-path",
+ output_path,
+ "--model-type",
+ "ppo_gridnet",
+ "--maps",
+ *eval_maps,
+ ]
+ fd = subprocess.Popen(args)
+ print(f"Evaluating {model_path}")
+ return_code = fd.wait()
+ assert return_code == 0
+ return (model_path, output_path)
+
+
+class TrueskillWriter:
+ def __init__(self, prod_mode, writer, league_path: str, league_step_path: str):
+ self.prod_mode = prod_mode
+ self.writer = writer
+ self.trueskill_df = pd.read_csv(league_path)
+ self.trueskill_step_df = pd.read_csv(league_step_path)
+ self.trueskill_step_df["type"] = self.trueskill_step_df["name"]
+ self.trueskill_step_df["step"] = 0
+ # xxx(okachaiev): not sure we need this copy
+ self.preset_trueskill_step_df = self.trueskill_step_df.copy()
+
+ def on_evaluation_done(self, future):
+ if future.cancelled():
+ return
+ model_path, output_path = future.result()
+ league = pd.read_csv(output_path, index_col="name")
+ assert model_path in league.index
+ model_global_step = int(model_path.split("/")[-1][:-3])
+ self.writer.add_scalar("charts/trueskill", league.loc[model_path]["trueskill"], model_global_step)
+ print(f"global_step={model_global_step}, trueskill={league.loc[model_path]['trueskill']}")
+
+ # table visualization logic
+ if self.prod_mode:
+ trueskill_data = {
+ "name": league.loc[model_path].name,
+ "mu": league.loc[model_path]["mu"],
+ "sigma": league.loc[model_path]["sigma"],
+ "trueskill": league.loc[model_path]["trueskill"],
+ }
+ # `DataFrame.append` was removed in pandas 2.0; use `pd.concat` instead.
+ self.trueskill_df = pd.concat([self.trueskill_df, pd.DataFrame([trueskill_data])], ignore_index=True)
+ wandb.log({"trueskill": wandb.Table(dataframe=self.trueskill_df)})
+ trueskill_data["type"] = "training"
+ trueskill_data["step"] = model_global_step
+ self.trueskill_step_df = pd.concat([self.trueskill_step_df, pd.DataFrame([trueskill_data])], ignore_index=True)
+ preset_trueskill_step_df_clone = self.preset_trueskill_step_df.copy()
+ preset_trueskill_step_df_clone["step"] = model_global_step
+ self.trueskill_step_df = pd.concat([self.trueskill_step_df, preset_trueskill_step_df_clone], ignore_index=True)
+ wandb.log({"trueskill_step": wandb.Table(dataframe=self.trueskill_step_df)})
+
+
+if __name__ == "__main__":
+ args = parse_args()
+
+ print(f"Save frequency: {args.save_frequency}")
+
+ # TRY NOT TO MODIFY: setup the environment
+ experiment_name = f"{args.gym_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
+ if args.prod_mode:
+ import wandb
+
+ run = wandb.init(
+ project=args.wandb_project_name,
+ entity=args.wandb_entity,
+ # sync_tensorboard=True,
+ config=vars(args),
+ name=experiment_name,
+ monitor_gym=True,
+ save_code=True,
+ )
+ wandb.tensorboard.patch(save=False)
+ writer = SummaryWriter(f"runs/{experiment_name}")
+ writer.add_text(
+ "hyperparameters", "|param|value|\n|-|-|\n%s" % ("\n".join([f"|{key}|{value}|" for key, value in vars(args).items()]))
+ )
+
+ # TRY NOT TO MODIFY: seeding
+ device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
+
+ print(f"Device: {device}")
+
+ random.seed(args.seed)
+ np.random.seed(args.seed)
+ torch.manual_seed(args.seed)
+ torch.backends.cudnn.deterministic = args.torch_deterministic
+ envs = MicroRTSGridModeVecEnv(
+ num_selfplay_envs=args.num_selfplay_envs,
+ num_bot_envs=args.num_bot_envs,
+ partial_obs=args.partial_obs,
+ max_steps=2000,
+ render_theme=2,
+ ai2s=[microrts_ai.coacAI for _ in range(args.num_bot_envs - 6)]
+ + [microrts_ai.randomBiasedAI for _ in range(min(args.num_bot_envs, 2))]
+ + [microrts_ai.lightRushAI for _ in range(min(args.num_bot_envs, 2))]
+ + [microrts_ai.workerRushAI for _ in range(min(args.num_bot_envs, 2))],
+ map_paths=[args.train_maps[0]],
+ reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
+ cycle_maps=args.train_maps,
+ )
+ envs = MicroRTSStatsRecorder(envs, args.gamma)
+ envs = VecMonitor(envs)
+ if args.capture_video:
+ envs = VecVideoRecorder(
+ envs, f"videos/{experiment_name}", record_video_trigger=lambda x: x % 100000 == 0, video_length=2000
+ )
+ assert isinstance(envs.action_space, MultiDiscrete), "only MultiDiscrete action space is supported"
+
+ eval_executor = None
+ if args.max_eval_workers > 0:
+ from concurrent.futures import ThreadPoolExecutor
+
+ eval_executor = ThreadPoolExecutor(max_workers=args.max_eval_workers, thread_name_prefix="league-eval-")
+
+ agent = Agent(envs).to(device)
+ optimizer = optim.Adam(agent.parameters(), lr=args.learning_rate, eps=1e-5)
+ if args.anneal_lr:
+ # https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/defaults.py#L20
+ lr = lambda f: f * args.learning_rate
+
+ # ALGO Logic: Storage for epoch data
+ mapsize = 16 * 16
+ action_space_shape = (mapsize, len(envs.action_plane_space.nvec))
+ invalid_action_shape = (mapsize, envs.action_plane_space.nvec.sum())
+
+ obs = torch.zeros((args.num_steps, args.num_envs) + envs.observation_space.shape).to(device)
+ actions = torch.zeros((args.num_steps, args.num_envs) + action_space_shape).to(device)
+ logprobs = torch.zeros((args.num_steps, args.num_envs)).to(device)
+ rewards = torch.zeros((args.num_steps, args.num_envs)).to(device)
+ dones = torch.zeros((args.num_steps, args.num_envs)).to(device)
+ values = torch.zeros((args.num_steps, args.num_envs)).to(device)
+ invalid_action_masks = torch.zeros((args.num_steps, args.num_envs) + invalid_action_shape).to(device)
+ # TRY NOT TO MODIFY: start the game
+ global_step = 0
+ start_time = time.time()
+ # Note how `next_obs` and `next_done` are used; their usage is equivalent to
+ # https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/84a7582477fb0d5c82ad6d850fe476829dddd2e1/a2c_ppo_acktr/storage.py#L60
+ next_obs = torch.Tensor(envs.reset()).to(device)
+ next_done = torch.zeros(args.num_envs).to(device)
+
+ # CRASH AND RESUME LOGIC:
+ starting_update = 1
+
+ if args.prod_mode and wandb.run.resumed:
+ starting_update = run.summary.get("charts/update") + 1
+ global_step = starting_update * args.batch_size
+ api = wandb.Api()
+ run = api.run(f"{run.entity}/{run.project}/{run.id}")
+ model = run.file("agent.pt")
+ model.download(f"models/{experiment_name}/")
+ agent.load_state_dict(torch.load(f"models/{experiment_name}/agent.pt", map_location=device))
+ agent.eval()
+ print(f"resumed at update {starting_update}")
+
+ print("Model's state_dict:")
+ for param_tensor in agent.state_dict():
+ print(param_tensor, "\t", agent.state_dict()[param_tensor].size())
+ total_params = sum([param.nelement() for param in agent.parameters()])
+ print("Model's total parameters:", total_params)
+
+ # EVALUATION LOGIC:
+ trueskill_writer = TrueskillWriter(
+ args.prod_mode, writer, "gym-microrts-static-files/league.csv", "gym-microrts-static-files/league.csv"
+ )
+
+ for update in range(starting_update, args.num_updates + 1):
+ # Annealing the rate if instructed to do so.
+ if args.anneal_lr:
+ frac = 1.0 - (update - 1.0) / args.num_updates
+ lrnow = lr(frac)
+ optimizer.param_groups[0]["lr"] = lrnow
+
+ # TRY NOT TO MODIFY: prepare the execution of the game.
+ for step in range(0, args.num_steps):
+ # envs.render()
+ global_step += 1 * args.num_envs
+ obs[step] = next_obs
+ dones[step] = next_done
+ # ALGO LOGIC: put action logic here
+ with torch.no_grad():
+ invalid_action_masks[step] = torch.tensor(envs.get_action_mask()).to(device)
+ action, logproba, _, _, vs = agent.get_action_and_value(
+ next_obs, envs=envs, invalid_action_masks=invalid_action_masks[step], device=device
+ )
+ values[step] = vs.flatten()
+
+ actions[step] = action
+ logprobs[step] = logproba
+ try:
+ next_obs, rs, ds, infos = envs.step(action.cpu().numpy().reshape(envs.num_envs, -1))
+ next_obs = torch.Tensor(next_obs).to(device)
+ except Exception as e:
+ # JPype-wrapped Java exceptions expose printStackTrace(); plain Python exceptions do not
+ if hasattr(e, "printStackTrace"):
+ e.printStackTrace()
+ raise
+ rewards[step], next_done = torch.Tensor(rs).to(device), torch.Tensor(ds).to(device)
+
+ for info in infos:
+ if "episode" in info.keys():
+ print(f"global_step={global_step}, episodic_return={info['episode']['r']}")
+ writer.add_scalar("charts/episodic_return", info["episode"]["r"], global_step)
+ writer.add_scalar("charts/episodic_length", info["episode"]["l"], global_step)
+ for key in info["microrts_stats"]:
+ writer.add_scalar(f"charts/episodic_return/{key}", info["microrts_stats"][key], global_step)
+ break
+
+ # bootstrap value if not done (the rollout reached the step limit)
+ with torch.no_grad():
+ last_value = agent.get_value(next_obs).reshape(1, -1)
+ if args.gae:
+ advantages = torch.zeros_like(rewards).to(device)
+ lastgaelam = 0
+ for t in reversed(range(args.num_steps)):
+ if t == args.num_steps - 1:
+ nextnonterminal = 1.0 - next_done
+ nextvalues = last_value
+ else:
+ nextnonterminal = 1.0 - dones[t + 1]
+ nextvalues = values[t + 1]
+ delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
+ advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
+ returns = advantages + values
+ else:
+ returns = torch.zeros_like(rewards).to(device)
+ for t in reversed(range(args.num_steps)):
+ if t == args.num_steps - 1:
+ nextnonterminal = 1.0 - next_done
+ next_return = last_value
+ else:
+ nextnonterminal = 1.0 - dones[t + 1]
+ next_return = returns[t + 1]
+ returns[t] = rewards[t] + args.gamma * nextnonterminal * next_return
+ advantages = returns - values
+
+ # flatten the batch
+ b_obs = obs.reshape((-1,) + envs.observation_space.shape)
+ b_logprobs = logprobs.reshape(-1)
+ b_actions = actions.reshape((-1,) + action_space_shape)
+ b_advantages = advantages.reshape(-1)
+ b_returns = returns.reshape(-1)
+ b_values = values.reshape(-1)
+ b_invalid_action_masks = invalid_action_masks.reshape((-1,) + invalid_action_shape)
+
+ # Optimizing the policy and value network
+ inds = np.arange(args.batch_size)
+ for i_epoch_pi in range(args.update_epochs):
+ np.random.shuffle(inds)
+ for start in range(0, args.batch_size, args.minibatch_size):
+ end = start + args.minibatch_size
+ minibatch_ind = inds[start:end]
+ mb_advantages = b_advantages[minibatch_ind]
+ if args.norm_adv:
+ mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + 1e-8)
+ _, newlogproba, entropy, _, new_values = agent.get_action_and_value(
+ b_obs[minibatch_ind], b_actions.long()[minibatch_ind], b_invalid_action_masks[minibatch_ind], envs, device
+ )
+ ratio = (newlogproba - b_logprobs[minibatch_ind]).exp()
+
+ # Stats
+ approx_kl = (b_logprobs[minibatch_ind] - newlogproba).mean()
+
+ # Policy loss
+ pg_loss1 = -mb_advantages * ratio
+ pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)
+ pg_loss = torch.max(pg_loss1, pg_loss2).mean()
+ entropy_loss = entropy.mean()
+
+ # Value loss
+ new_values = new_values.view(-1)
+ if args.clip_vloss:
+ v_loss_unclipped = (new_values - b_returns[minibatch_ind]) ** 2
+ v_clipped = b_values[minibatch_ind] + torch.clamp(
+ new_values - b_values[minibatch_ind], -args.clip_coef, args.clip_coef
+ )
+ v_loss_clipped = (v_clipped - b_returns[minibatch_ind]) ** 2
+ v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)
+ v_loss = 0.5 * v_loss_max.mean()
+ else:
+ v_loss = 0.5 * ((new_values - b_returns[minibatch_ind]) ** 2).mean()
+
+ loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef
+
+ optimizer.zero_grad()
+ loss.backward()
+ nn.utils.clip_grad_norm_(agent.parameters(), args.max_grad_norm)
+ optimizer.step()
+
+ if (update - 1) % args.save_frequency == 0:
+ if not os.path.exists(f"models/{experiment_name}"):
+ os.makedirs(f"models/{experiment_name}")
+ torch.save(agent.state_dict(), f"models/{experiment_name}/agent.pt")
+ torch.save(agent.state_dict(), f"models/{experiment_name}/{global_step}.pt")
+ if args.prod_mode:
+ wandb.save(f"models/{experiment_name}/agent.pt", base_path=f"models/{experiment_name}", policy="now")
+ if eval_executor is not None:
+ future = eval_executor.submit(
+ run_evaluation,
+ f"models/{experiment_name}/{global_step}.pt",
+ f"runs/{experiment_name}/{global_step}.csv",
+ args.eval_maps,
+ )
+ print(f"Queued models/{experiment_name}/{global_step}.pt")
+ future.add_done_callback(trueskill_writer.on_evaluation_done)
+
+ # TRY NOT TO MODIFY: record rewards for plotting purposes
+ writer.add_scalar("charts/learning_rate", optimizer.param_groups[0]["lr"], global_step)
+ writer.add_scalar("charts/update", update, global_step)
+ writer.add_scalar("losses/value_loss", v_loss.detach().item(), global_step)
+ writer.add_scalar("losses/policy_loss", pg_loss.detach().item(), global_step)
+ writer.add_scalar("losses/entropy", entropy.detach().mean().item(), global_step)
+ writer.add_scalar("losses/approx_kl", approx_kl.detach().item(), global_step)
+ if args.kle_stop or args.kle_rollback:
+ writer.add_scalar("debug/pg_stop_iter", i_epoch_pi, global_step)
+ writer.add_scalar("charts/sps", int(global_step / (time.time() - start_time)), global_step)
+ print("SPS:", int(global_step / (time.time() - start_time)))
+
+ if eval_executor is not None:
+ # shut down the thread pool, waiting for scheduled evaluations to finish
+ eval_executor.shutdown(wait=True, cancel_futures=False)
+ envs.close()
+ writer.close()
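The rollout code above computes advantages via Generalized Advantage Estimation (GAE) with a reversed-time recursion over the stored `rewards`, `values`, and `dones`. A minimal standalone sketch of that same recursion in NumPy, with toy inputs (the function name and arguments are illustrative, not part of the training script):

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, last_done, gamma=0.99, lam=0.95):
    """GAE recursion mirroring the training loop:
    delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_{t+1}) - V(s_t)
    A_t = delta_t + gamma * lam * (1 - done_{t+1}) * A_{t+1}
    """
    T = len(rewards)
    advantages = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        if t == T - 1:
            nextnonterminal = 1.0 - last_done
            nextvalues = last_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = delta + gamma * lam * nextnonterminal * lastgaelam
    # returns are advantages plus the value baseline, as in the script
    return advantages, advantages + values
```

With `gamma = lam = 1` and zero values, the advantages reduce to suffix sums of the rewards, which makes the recursion easy to sanity-check by hand.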
diff --git a/experiments/ppo_gridnet_eval.py b/experiments/ppo_gridnet_eval.py
index def2e59e..1d9637fc 100644
--- a/experiments/ppo_gridnet_eval.py
+++ b/experiments/ppo_gridnet_eval.py
@@ -1,205 +1,205 @@
-# http://proceedings.mlr.press/v97/han19a/han19a.pdf
-
-import argparse
-import os
-import random
-import time
-from distutils.util import strtobool
-
-import numpy as np
-import torch
-from gym.spaces import MultiDiscrete
-from stable_baselines3.common.vec_env import VecMonitor, VecVideoRecorder
-from torch.utils.tensorboard import SummaryWriter
-
-from gym_microrts import microrts_ai # noqa
-
-
-def parse_args():
- # fmt: off
- parser = argparse.ArgumentParser()
- parser.add_argument('--exp-name', type=str, default=os.path.basename(__file__).rstrip(".py"),
- help='the name of this experiment')
- parser.add_argument('--gym-id', type=str, default="MicroRTSGridModeVecEnv",
- help='the id of the gym environment')
- parser.add_argument('--learning-rate', type=float, default=2.5e-4,
- help='the learning rate of the optimizer')
- parser.add_argument('--seed', type=int, default=1,
- help='seed of the experiment')
- parser.add_argument('--total-timesteps', type=int, default=1000000,
- help='total timesteps of the experiments')
- parser.add_argument('--torch-deterministic', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help='if toggled, `torch.backends.cudnn.deterministic=False`')
- parser.add_argument('--cuda', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
- help='if toggled, cuda will not be enabled by default')
- parser.add_argument('--prod-mode', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='run the script in production mode and use wandb to log outputs')
- parser.add_argument('--capture-video', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='whether to capture videos of the agent performances (check out `videos` folder)')
- parser.add_argument('--wandb-project-name', type=str, default="cleanRL",
- help="the wandb's project name")
- parser.add_argument('--wandb-entity', type=str, default=None,
- help="the entity (team) of wandb's project")
-
- # Algorithm specific arguments
- parser.add_argument('--partial-obs', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
- help='if toggled, the game will have partial observability')
- parser.add_argument('--num-steps', type=int, default=256,
- help='the number of steps per game environment')
- parser.add_argument("--agent-model-path", type=str, default="gym-microrts-static-files/agent_sota.pt",
- help="the path to the agent's model")
- parser.add_argument("--agent2-model-path", type=str, default="gym-microrts-static-files/agent_sota.pt",
- help="the path to the agent's model")
- parser.add_argument('--ai', type=str, default="",
- help='the opponent AI to evaluate against')
- parser.add_argument('--model-type', type=str, default=f"ppo_gridnet_large", choices=["ppo_gridnet_large", "ppo_gridnet"],
- help='the output path of the leaderboard csv')
- args = parser.parse_args()
- if not args.seed:
- args.seed = int(time.time())
- if args.ai:
- args.num_bot_envs, args.num_selfplay_envs = 1, 0
- else:
- args.num_bot_envs, args.num_selfplay_envs = 0, 2
- args.num_envs = args.num_selfplay_envs + args.num_bot_envs
- args.batch_size = int(args.num_envs * args.num_steps)
- args.num_updates = args.total_timesteps // args.batch_size
- # fmt: on
- return args
-
-
-if __name__ == "__main__":
- args = parse_args()
-
- if args.model_type == "ppo_gridnet_large":
- from ppo_gridnet_large import Agent, MicroRTSStatsRecorder
-
- from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
- else:
- from ppo_gridnet import Agent, MicroRTSStatsRecorder
-
- from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
-
- # TRY NOT TO MODIFY: setup the environment
- experiment_name = f"{args.gym_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
- if args.prod_mode:
- import wandb
-
- run = wandb.init(
- project=args.wandb_project_name,
- entity=args.wandb_entity,
- sync_tensorboard=True,
- config=vars(args),
- name=experiment_name,
- monitor_gym=True,
- save_code=True,
- )
- CHECKPOINT_FREQUENCY = 10
- writer = SummaryWriter(f"runs/{experiment_name}")
- writer.add_text(
- "hyperparameters", "|param|value|\n|-|-|\n%s" % ("\n".join([f"|{key}|{value}|" for key, value in vars(args).items()]))
- )
-
- # TRY NOT TO MODIFY: seeding
- device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
- random.seed(args.seed)
- np.random.seed(args.seed)
- torch.manual_seed(args.seed)
- torch.backends.cudnn.deterministic = args.torch_deterministic
-
- ais = []
- if args.ai:
- ais = [eval(f"microrts_ai.{args.ai}")]
- envs = MicroRTSGridModeVecEnv(
- num_bot_envs=len(ais),
- num_selfplay_envs=args.num_selfplay_envs,
- partial_obs=args.partial_obs,
- max_steps=5000,
- render_theme=2,
- ai2s=ais,
- map_paths=["maps/16x16/basesWorkers16x16A.xml"],
- reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
- )
- envs = MicroRTSStatsRecorder(envs)
- envs = VecMonitor(envs)
- if args.capture_video:
- envs = VecVideoRecorder(
- envs, f"videos/{experiment_name}", record_video_trigger=lambda x: x % 100000 == 0, video_length=2000
- )
- assert isinstance(envs.action_space, MultiDiscrete), "only MultiDiscrete action space is supported"
-
- agent = Agent(envs).to(device)
- agent2 = Agent(envs).to(device)
-
- # ALGO Logic: Storage for epoch data
- mapsize = 16 * 16
- invalid_action_shape = (mapsize, envs.action_plane_space.nvec.sum())
-
- # TRY NOT TO MODIFY: start the game
- global_step = 0
- start_time = time.time()
- # Note how `next_obs` and `next_done` are used; their usage is equivalent to
- # https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/84a7582477fb0d5c82ad6d850fe476829dddd2e1/a2c_ppo_acktr/storage.py#L60
- next_obs = torch.Tensor(envs.reset()).to(device)
- next_done = torch.zeros(args.num_envs).to(device)
-
- ## CRASH AND RESUME LOGIC:
- starting_update = 1
- agent.load_state_dict(torch.load(args.agent_model_path, map_location=device))
- agent.eval()
- if not args.ai:
- agent2.load_state_dict(torch.load(args.agent2_model_path, map_location=device))
- agent2.eval()
-
- print("Model's state_dict:")
- for param_tensor in agent.state_dict():
- print(param_tensor, "\t", agent.state_dict()[param_tensor].size())
- total_params = sum([param.nelement() for param in agent.parameters()])
- print("Model's total parameters:", total_params)
-
- for update in range(starting_update, args.num_updates + 1):
- # TRY NOT TO MODIFY: prepare the execution of the game.
- for step in range(0, args.num_steps):
- envs.render()
- global_step += 1 * args.num_envs
- # ALGO LOGIC: put action logic here
- with torch.no_grad():
- invalid_action_masks = torch.tensor(np.array(envs.get_action_mask())).to(device)
-
- if args.ai:
- action, logproba, _, _, vs = agent.get_action_and_value(
- next_obs, envs=envs, invalid_action_masks=invalid_action_masks, device=device
- )
- else:
- p1_obs = next_obs[::2]
- p2_obs = next_obs[1::2]
- p1_mask = invalid_action_masks[::2]
- p2_mask = invalid_action_masks[1::2]
-
- p1_action, _, _, _, _ = agent.get_action_and_value(
- p1_obs, envs=envs, invalid_action_masks=p1_mask, device=device
- )
- p2_action, _, _, _, _ = agent2.get_action_and_value(
- p2_obs, envs=envs, invalid_action_masks=p2_mask, device=device
- )
- action = torch.zeros((args.num_envs, p2_action.shape[1], p2_action.shape[2]))
- action[::2] = p1_action
- action[1::2] = p2_action
-
- try:
- next_obs, rs, ds, infos = envs.step(action.cpu().numpy().reshape(envs.num_envs, -1))
- next_obs = torch.Tensor(next_obs).to(device)
- except Exception as e:
- e.printStackTrace()
- raise
-
- for idx, info in enumerate(infos):
- if "episode" in info.keys():
- if args.ai:
- print("against", args.ai, info["microrts_stats"]["WinLossRewardFunction"])
- else:
- if idx % 2 == 0:
- print(f"player{idx % 2}", info["microrts_stats"]["WinLossRewardFunction"])
-
- envs.close()
- writer.close()
+# http://proceedings.mlr.press/v97/han19a/han19a.pdf
+
+import argparse
+import os
+import random
+import time
+from distutils.util import strtobool
+
+import numpy as np
+import torch
+from gym.spaces import MultiDiscrete
+from stable_baselines3.common.vec_env import VecMonitor, VecVideoRecorder
+from torch.utils.tensorboard import SummaryWriter
+
+from gym_microrts import microrts_ai # noqa
+
+
+def parse_args():
+ # fmt: off
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--exp-name', type=str, default=os.path.splitext(os.path.basename(__file__))[0],
+ help='the name of this experiment')
+ parser.add_argument('--gym-id', type=str, default="MicroRTSGridModeVecEnv",
+ help='the id of the gym environment')
+ parser.add_argument('--learning-rate', type=float, default=2.5e-4,
+ help='the learning rate of the optimizer')
+ parser.add_argument('--seed', type=int, default=1,
+ help='seed of the experiment')
+ parser.add_argument('--total-timesteps', type=int, default=1000000,
+ help='total timesteps of the experiments')
+ parser.add_argument('--torch-deterministic', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help='if toggled, sets `torch.backends.cudnn.deterministic=True` for reproducibility')
+ parser.add_argument('--cuda', type=lambda x: bool(strtobool(x)), default=True, nargs='?', const=True,
+ help='if toggled, CUDA will be used when available')
+ parser.add_argument('--prod-mode', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='run the script in production mode and use wandb to log outputs')
+ parser.add_argument('--capture-video', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='whether to capture videos of the agent performances (check out `videos` folder)')
+ parser.add_argument('--wandb-project-name', type=str, default="cleanRL",
+ help="the wandb's project name")
+ parser.add_argument('--wandb-entity', type=str, default=None,
+ help="the entity (team) of wandb's project")
+
+ # Algorithm specific arguments
+ parser.add_argument('--partial-obs', type=lambda x: bool(strtobool(x)), default=False, nargs='?', const=True,
+ help='if toggled, the game will have partial observability')
+ parser.add_argument('--num-steps', type=int, default=256,
+ help='the number of steps per game environment')
+ parser.add_argument("--agent-model-path", type=str, default="gym-microrts-static-files/agent_sota.pt",
+ help="the path to the agent's model")
+ parser.add_argument("--agent2-model-path", type=str, default="gym-microrts-static-files/agent_sota.pt",
+ help="the path to the agent's model")
+ parser.add_argument('--ai', type=str, default="",
+ help='the opponent AI to evaluate against')
+ parser.add_argument('--model-type', type=str, default="ppo_gridnet", choices=["ppo_gridnet_large", "ppo_gridnet"],
+ help='the model architecture to load for both agents')
+ args = parser.parse_args()
+ if not args.seed:
+ args.seed = int(time.time())
+ if args.ai:
+ args.num_bot_envs, args.num_selfplay_envs = 1, 0
+ else:
+ args.num_bot_envs, args.num_selfplay_envs = 0, 2
+ args.num_envs = args.num_selfplay_envs + args.num_bot_envs
+ args.batch_size = int(args.num_envs * args.num_steps)
+ args.num_updates = args.total_timesteps // args.batch_size
+ # fmt: on
+ return args
+
+
+if __name__ == "__main__":
+ args = parse_args()
+
+ if args.model_type == "ppo_gridnet_large":
+ from ppo_gridnet_large import Agent, MicroRTSStatsRecorder
+
+ from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
+ else:
+ from ppo_gridnet import Agent, MicroRTSStatsRecorder
+
+ from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
+
+ # TRY NOT TO MODIFY: setup the environment
+ experiment_name = f"{args.gym_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
+ if args.prod_mode:
+ import wandb
+
+ run = wandb.init(
+ project=args.wandb_project_name,
+ entity=args.wandb_entity,
+ sync_tensorboard=True,
+ config=vars(args),
+ name=experiment_name,
+ monitor_gym=True,
+ save_code=True,
+ )
+ CHECKPOINT_FREQUENCY = 10
+ writer = SummaryWriter(f"runs/{experiment_name}")
+ writer.add_text(
+ "hyperparameters", "|param|value|\n|-|-|\n%s" % ("\n".join([f"|{key}|{value}|" for key, value in vars(args).items()]))
+ )
+
+ # TRY NOT TO MODIFY: seeding
+ device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
+ random.seed(args.seed)
+ np.random.seed(args.seed)
+ torch.manual_seed(args.seed)
+ torch.backends.cudnn.deterministic = args.torch_deterministic
+
+ ais = []
+ if args.ai:
+ ais = [getattr(microrts_ai, args.ai)]  # safer than eval() on user-supplied input
+ envs = MicroRTSGridModeVecEnv(
+ num_bot_envs=len(ais),
+ num_selfplay_envs=args.num_selfplay_envs,
+ partial_obs=args.partial_obs,
+ max_steps=5000,
+ render_theme=2,
+ ai2s=ais,
+ map_paths=["maps/16x16/basesWorkers16x16A.xml"],
+ reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
+ )
+ envs = MicroRTSStatsRecorder(envs)
+ envs = VecMonitor(envs)
+ if args.capture_video:
+ envs = VecVideoRecorder(
+ envs, f"videos/{experiment_name}", record_video_trigger=lambda x: x % 100000 == 0, video_length=2000
+ )
+ assert isinstance(envs.action_space, MultiDiscrete), "only MultiDiscrete action space is supported"
+
+ agent = Agent(envs).to(device)
+ agent2 = Agent(envs).to(device)
+
+ # ALGO Logic: Storage for epoch data
+ mapsize = 16 * 16
+ invalid_action_shape = (mapsize, envs.action_plane_space.nvec.sum())
+
+ # TRY NOT TO MODIFY: start the game
+ global_step = 0
+ start_time = time.time()
+ # Note how `next_obs` and `next_done` are used; their usage is equivalent to
+ # https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/84a7582477fb0d5c82ad6d850fe476829dddd2e1/a2c_ppo_acktr/storage.py#L60
+ next_obs = torch.Tensor(envs.reset()).to(device)
+ next_done = torch.zeros(args.num_envs).to(device)
+
+ # load the pretrained agent(s) for evaluation
+ starting_update = 1
+ agent.load_state_dict(torch.load(args.agent_model_path, map_location=device))
+ agent.eval()
+ if not args.ai:
+ agent2.load_state_dict(torch.load(args.agent2_model_path, map_location=device))
+ agent2.eval()
+
+ print("Model's state_dict:")
+ for param_tensor in agent.state_dict():
+ print(param_tensor, "\t", agent.state_dict()[param_tensor].size())
+ total_params = sum(param.nelement() for param in agent.parameters())
+ print("Model's total parameters:", total_params)
+
+ for update in range(starting_update, args.num_updates + 1):
+ # TRY NOT TO MODIFY: prepare the execution of the game.
+ for step in range(0, args.num_steps):
+ envs.render()
+ global_step += 1 * args.num_envs
+ # ALGO LOGIC: put action logic here
+ with torch.no_grad():
+ invalid_action_masks = torch.tensor(np.array(envs.get_action_mask())).to(device)
+
+ if args.ai:
+ action, logproba, _, _, vs = agent.get_action_and_value(
+ next_obs, envs=envs, invalid_action_masks=invalid_action_masks, device=device
+ )
+ else:
+ p1_obs = next_obs[::2]
+ p2_obs = next_obs[1::2]
+ p1_mask = invalid_action_masks[::2]
+ p2_mask = invalid_action_masks[1::2]
+
+ p1_action, _, _, _, _ = agent.get_action_and_value(
+ p1_obs, envs=envs, invalid_action_masks=p1_mask, device=device
+ )
+ p2_action, _, _, _, _ = agent2.get_action_and_value(
+ p2_obs, envs=envs, invalid_action_masks=p2_mask, device=device
+ )
+ action = torch.zeros((args.num_envs, p2_action.shape[1], p2_action.shape[2]))
+ action[::2] = p1_action
+ action[1::2] = p2_action
+
+ try:
+ next_obs, rs, ds, infos = envs.step(action.cpu().numpy().reshape(envs.num_envs, -1))
+ next_obs = torch.Tensor(next_obs).to(device)
+ except Exception as e:
+ # JPype-wrapped Java exceptions expose printStackTrace(); plain Python exceptions do not
+ if hasattr(e, "printStackTrace"):
+ e.printStackTrace()
+ raise
+
+ for idx, info in enumerate(infos):
+ if "episode" in info.keys():
+ if args.ai:
+ print("against", args.ai, info["microrts_stats"]["WinLossRewardFunction"])
+ else:
+ if idx % 2 == 0:
+ print(f"player{idx % 2}", info["microrts_stats"]["WinLossRewardFunction"])
+
+ envs.close()
+ writer.close()
diff --git a/gym_microrts/envs/vec_env.py b/gym_microrts/envs/vec_env.py
index aed545fa..0ebd75ac 100644
--- a/gym_microrts/envs/vec_env.py
+++ b/gym_microrts/envs/vec_env.py
@@ -1,556 +1,556 @@
-import json
-import os
-import subprocess
-import sys
-import warnings
-import xml.etree.ElementTree as ET
-from itertools import cycle
-
-import gym
-import jpype
-import jpype.imports
-import numpy as np
-from jpype.imports import registerDomain
-from jpype.types import JArray, JInt
-from PIL import Image
-
-import gym_microrts
-
-MICRORTS_CLONE_MESSAGE = """
-WARNING: the repository does not include the microrts git submodule.
-Executing `git submodule update --init --recursive` to clone it now.
-"""
-
-MICRORTS_MAC_OS_RENDER_MESSAGE = """
-gym-microrts render is not available on MacOS. See https://github.com/jpype-project/jpype/issues/906
-
-It is however possible to record the videos via `env.render(mode='rgb_array')`.
-See https://github.com/vwxyzjn/gym-microrts/blob/b46c0815efd60ae959b70c14659efb95ef16ffb0/hello_world_record_video.py
-as an example.
-"""
-
-
-class MicroRTSGridModeVecEnv:
- metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 150}
- """
- [[0]x_coordinate*y_coordinate(x*y), [1]a_t(6), [2]p_move(4), [3]p_harvest(4),
- [4]p_return(4), [5]p_produce_direction(4), [6]p_produce_unit_type(z),
- [7]x_coordinate*y_coordinate(x*y)]
- Create a baselines VecEnv environment from a gym3 environment.
- :param env: gym3 environment to adapt
- """
-
- def __init__(
- self,
- num_selfplay_envs,
- num_bot_envs,
- partial_obs=False,
- max_steps=2000,
- render_theme=2,
- frame_skip=0,
- ai2s=[],
- map_paths=["maps/10x10/basesTwoWorkers10x10.xml"],
- reward_weight=np.array([0.0, 1.0, 0.0, 0.0, 0.0, 5.0]),
- cycle_maps=[],
- autobuild=True,
- jvm_args=[],
- ):
-
- self.num_selfplay_envs = num_selfplay_envs
- self.num_bot_envs = num_bot_envs
- self.num_envs = num_selfplay_envs + num_bot_envs
- assert self.num_bot_envs == len(ai2s), "for each environment, a microrts ai should be provided"
- self.partial_obs = partial_obs
- self.max_steps = max_steps
- self.render_theme = render_theme
- self.frame_skip = frame_skip
- self.ai2s = ai2s
- self.map_paths = map_paths
- if len(map_paths) == 1:
- self.map_paths = [map_paths[0] for _ in range(self.num_envs)]
- else:
- assert (
- len(map_paths) == self.num_envs
- ), "if multiple maps are provided, they should be provided for each environment"
- self.reward_weight = reward_weight
-
- self.microrts_path = os.path.join(gym_microrts.__path__[0], "microrts")
-
- # prepare training maps
- self.cycle_maps = list(map(lambda i: os.path.join(self.microrts_path, i), cycle_maps))
- self.next_map = cycle(self.cycle_maps)
-
- if not os.path.exists(f"{self.microrts_path}/README.md"):
- print(MICRORTS_CLONE_MESSAGE)
- os.system(f"git submodule update --init --recursive")
-
- if autobuild:
- print(f"removing {self.microrts_path}/microrts.jar...")
- if os.path.exists(f"{self.microrts_path}/microrts.jar"):
- os.remove(f"{self.microrts_path}/microrts.jar")
- print(f"building {self.microrts_path}/microrts.jar...")
- root_dir = os.path.dirname(gym_microrts.__path__[0])
- print(root_dir)
- subprocess.run(["bash", "build.sh", "&>", "build.log"], cwd=f"{root_dir}")
-
- # read map
- root = ET.parse(os.path.join(self.microrts_path, self.map_paths[0])).getroot()
- self.height, self.width = int(root.get("height")), int(root.get("width"))
-
- # launch the JVM
- if not jpype._jpype.isStarted():
- registerDomain("ts", alias="tests")
- registerDomain("ai")
- jars = [
- "microrts.jar",
- "lib/bots/Coac.jar",
- "lib/bots/Droplet.jar",
- "lib/bots/GRojoA3N.jar",
- "lib/bots/Izanagi.jar",
- "lib/bots/MixedBot.jar",
- "lib/bots/TiamatBot.jar",
- "lib/bots/UMSBot.jar",
- "lib/bots/mayariBot.jar", # "MindSeal.jar"
- ]
- for jar in jars:
- jpype.addClassPath(os.path.join(self.microrts_path, jar))
- jpype.startJVM(*jvm_args, convertStrings=False)
-
- # start microrts client
- from rts.units import UnitTypeTable
-
- self.real_utt = UnitTypeTable()
- from ai.reward import (
- AttackRewardFunction,
- ProduceBuildingRewardFunction,
- ProduceCombatUnitRewardFunction,
- ProduceWorkerRewardFunction,
- ResourceGatherRewardFunction,
- RewardFunctionInterface,
- WinLossRewardFunction,
- )
-
- self.rfs = JArray(RewardFunctionInterface)(
- [
- WinLossRewardFunction(),
- ResourceGatherRewardFunction(),
- ProduceWorkerRewardFunction(),
- ProduceBuildingRewardFunction(),
- AttackRewardFunction(),
- ProduceCombatUnitRewardFunction(),
- # CloserToEnemyBaseRewardFunction(),
- ]
- )
- self.start_client()
-
- # computed properties
- # [num_planes_hp(5), num_planes_resources(5), num_planes_player(3),
- # num_planes_unit_type(z), num_planes_unit_action(6)]
-
- self.num_planes = [5, 5, 3, len(self.utt["unitTypes"]) + 1, 6]
- if partial_obs:
- self.num_planes = [5, 5, 3, len(self.utt["unitTypes"]) + 1, 6, 2]
- self.observation_space = gym.spaces.Box(
- low=0.0, high=1.0, shape=(self.height, self.width, sum(self.num_planes)), dtype=np.int32
- )
-
- self.num_planes_len = len(self.num_planes)
- self.num_planes_prefix_sum = [0]
- for num_plane in self.num_planes:
- self.num_planes_prefix_sum.append(self.num_planes_prefix_sum[-1] + num_plane)
-
- self.action_space_dims = [6, 4, 4, 4, 4, len(self.utt["unitTypes"]), 7 * 7]
- self.action_space = gym.spaces.MultiDiscrete(np.array([self.action_space_dims] * self.height * self.width).flatten())
- self.action_plane_space = gym.spaces.MultiDiscrete(self.action_space_dims)
- self.source_unit_idxs = np.tile(np.arange(self.height * self.width), (self.num_envs, 1))
- self.source_unit_idxs = self.source_unit_idxs.reshape((self.source_unit_idxs.shape + (1,)))
-
- def start_client(self):
-
- from ai.core import AI
- from ts import JNIGridnetVecClient as Client
-
- self.vec_client = Client(
- self.num_selfplay_envs,
- self.num_bot_envs,
- self.max_steps,
- self.rfs,
- os.path.expanduser(self.microrts_path),
- self.map_paths,
- JArray(AI)([ai2(self.real_utt) for ai2 in self.ai2s]),
- self.real_utt,
- self.partial_obs,
- )
- self.render_client = (
- self.vec_client.selfPlayClients[0] if len(self.vec_client.selfPlayClients) > 0 else self.vec_client.clients[0]
- )
- # get the unit type table
- self.utt = json.loads(str(self.render_client.sendUTT()))
-
- def reset(self):
- responses = self.vec_client.reset([0] * self.num_envs)
- obs = [self._encode_obs(np.array(ro)) for ro in responses.observation]
-
- return np.array(obs)
-
- def _encode_obs(self, obs):
- obs = obs.reshape(len(obs), -1).clip(0, np.array([self.num_planes]).T - 1)
- obs_planes = np.zeros((self.height * self.width, self.num_planes_prefix_sum[-1]), dtype=np.int32)
- obs_planes_idx = np.arange(len(obs_planes))
- obs_planes[obs_planes_idx, obs[0]] = 1
-
- for i in range(1, self.num_planes_len):
- obs_planes[obs_planes_idx, obs[i] + self.num_planes_prefix_sum[i]] = 1
- return obs_planes.reshape(self.height, self.width, -1)
-
- def step_async(self, actions):
- actions = actions.reshape((self.num_envs, self.width * self.height, -1))
- actions = np.concatenate((self.source_unit_idxs, actions), 2) # specify source unit
- # valid actions
- actions = actions[np.where(self.source_unit_mask == 1)]
- action_counts_per_env = self.source_unit_mask.sum(1)
- java_actions = [None] * len(action_counts_per_env)
- action_idx = 0
- for outer_idx, action_count in enumerate(action_counts_per_env):
- java_valid_action = [None] * action_count
- for idx in range(action_count):
- java_valid_action[idx] = JArray(JInt)(actions[action_idx])
- action_idx += 1
- java_actions[outer_idx] = JArray(JArray(JInt))(java_valid_action)
- self.actions = JArray(JArray(JArray(JInt)))(java_actions)
-
- def step_wait(self):
- responses = self.vec_client.gameStep(self.actions, [0] * self.num_envs)
- reward, done = np.array(responses.reward), np.array(responses.done)
- obs = [self._encode_obs(np.array(ro)) for ro in responses.observation]
- infos = [{"raw_rewards": item} for item in reward]
- # check if it is in evaluation, if not, then change maps
- if len(self.cycle_maps) > 0:
- # check if an environment is done, if done, reset the client, and replace the observation
- for done_idx, d in enumerate(done[:, 0]):
- # bot envs settings
- if done_idx < self.num_bot_envs:
- if d:
- self.vec_client.clients[done_idx].mapPath = next(self.next_map)
- response = self.vec_client.clients[done_idx].reset(0)
- obs[done_idx] = self._encode_obs(np.array(response.observation))
- # selfplay envs settings
- else:
- if d and done_idx % 2 == 0:
- done_idx -= self.num_bot_envs # recalibrate the index
- self.vec_client.selfPlayClients[done_idx // 2].mapPath = next(self.next_map)
- self.vec_client.selfPlayClients[done_idx // 2].reset()
- p0_response = self.vec_client.selfPlayClients[done_idx // 2].getResponse(0)
- p1_response = self.vec_client.selfPlayClients[done_idx // 2].getResponse(1)
- obs[done_idx] = self._encode_obs(np.array(p0_response.observation))
- obs[done_idx + 1] = self._encode_obs(np.array(p1_response.observation))
- return np.array(obs), reward @ self.reward_weight, done[:, 0], infos
-
- def step(self, ac):
- self.step_async(ac)
- return self.step_wait()
-
- def getattr_depth_check(self, name, already_found):
- """
- Check if an attribute reference is being hidden in a recursive call to __getattr__
- :param name: (str) name of attribute to check for
- :param already_found: (bool) whether this attribute has already been found in a wrapper
- :return: (str or None) name of module whose attribute is being shadowed, if any.
- """
- if hasattr(self, name) and already_found:
- return "{0}.{1}".format(type(self).__module__, type(self).__name__)
- else:
- return None
-
- def render(self, mode="human"):
- if mode == "human":
- self.render_client.render(False)
- # give warning on macos because the render is not available
- if sys.platform == "darwin":
- warnings.warn(MICRORTS_MAC_OS_RENDER_MESSAGE)
- elif mode == "rgb_array":
- bytes_array = np.array(self.render_client.render(True))
- image = Image.frombytes("RGB", (640, 640), bytes_array)
- return np.array(image)[:, :, ::-1]
-
- def close(self):
- if jpype._jpype.isStarted():
- self.vec_client.close()
- jpype.shutdownJVM()
-
- def get_action_mask(self):
- """
- :return: Mask for action types and action parameters,
- of shape [num_envs, map height * width, action types + params]
- """
- # action_mask shape: [num_envs, map height, map width, 1 + action types + params]
- action_mask = np.array(self.vec_client.getMasks(0))
- # self.source_unit_mask shape: [num_envs, map height * map width * 1]
- self.source_unit_mask = action_mask[:, :, :, 0].reshape(self.num_envs, -1)
- action_type_and_parameter_mask = action_mask[:, :, :, 1:].reshape(self.num_envs, self.height * self.width, -1)
- return action_type_and_parameter_mask
-
-
-class MicroRTSBotVecEnv(MicroRTSGridModeVecEnv):
- metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 150}
-
- def __init__(
- self,
- ai1s=[],
- ai2s=[],
- partial_obs=False,
- max_steps=2000,
- render_theme=2,
- map_paths="maps/10x10/basesTwoWorkers10x10.xml",
- reward_weight=np.array([0.0, 1.0, 0.0, 0.0, 0.0, 5.0]),
- autobuild=True,
- jvm_args=[],
- ):
-
- self.ai1s = ai1s
- self.ai2s = ai2s
- assert len(ai1s) == len(ai2s), "for each environment, a microrts ai should be provided"
- self.num_envs = len(ai1s)
- self.partial_obs = partial_obs
- self.max_steps = max_steps
- self.render_theme = render_theme
- self.map_paths = map_paths
- self.reward_weight = reward_weight
-
- # read map
- self.microrts_path = os.path.join(gym_microrts.__path__[0], "microrts")
- if not os.path.exists(f"{self.microrts_path}/README.md"):
- print(MICRORTS_CLONE_MESSAGE)
- os.system(f"git submodule update --init --recursive")
-
- if autobuild:
- print(f"removing {self.microrts_path}/microrts.jar...")
- if os.path.exists(f"{self.microrts_path}/microrts.jar"):
- os.remove(f"{self.microrts_path}/microrts.jar")
- print(f"building {self.microrts_path}/microrts.jar...")
- root_dir = os.path.dirname(gym_microrts.__path__[0])
- print(root_dir)
- subprocess.run(["bash", "build.sh", "&>", "build.log"], cwd=f"{root_dir}")
-
- root = ET.parse(os.path.join(self.microrts_path, self.map_paths[0])).getroot()
- self.height, self.width = int(root.get("height")), int(root.get("width"))
-
- # launch the JVM
- if not jpype._jpype.isStarted():
- registerDomain("ts", alias="tests")
- registerDomain("ai")
- registerDomain("rts")
- jars = [
- "microrts.jar",
- "lib/bots/Coac.jar",
- "lib/bots/Droplet.jar",
- "lib/bots/GRojoA3N.jar",
- "lib/bots/Izanagi.jar",
- "lib/bots/MixedBot.jar",
- "lib/bots/TiamatBot.jar",
- "lib/bots/UMSBot.jar",
- "lib/bots/mayariBot.jar", # "MindSeal.jar"
- ]
- for jar in jars:
- jpype.addClassPath(os.path.join(self.microrts_path, jar))
- jpype.startJVM(*jvm_args, convertStrings=False)
-
- # start microrts client
- from rts.units import UnitTypeTable
-
- self.real_utt = UnitTypeTable()
- from ai.reward import (
- AttackRewardFunction,
- ProduceBuildingRewardFunction,
- ProduceCombatUnitRewardFunction,
- ProduceWorkerRewardFunction,
- ResourceGatherRewardFunction,
- RewardFunctionInterface,
- WinLossRewardFunction,
- )
-
- self.rfs = JArray(RewardFunctionInterface)(
- [
- WinLossRewardFunction(),
- ResourceGatherRewardFunction(),
- ProduceWorkerRewardFunction(),
- ProduceBuildingRewardFunction(),
- AttackRewardFunction(),
- ProduceCombatUnitRewardFunction(),
- # CloserToEnemyBaseRewardFunction(),
- ]
- )
- self.start_client()
-
- # computed properties
- # [num_planes_hp(5), num_planes_resources(5), num_planes_player(5),
- # num_planes_unit_type(z), num_planes_unit_action(6)]
-
- self.num_planes = [5, 5, 3, len(self.utt["unitTypes"]) + 1, 6]
- if partial_obs:
- self.num_planes = [5, 5, 3, len(self.utt["unitTypes"]) + 1, 6, 2]
- self.observation_space = gym.spaces.Discrete(2)
- self.action_space = gym.spaces.Discrete(2)
-
- def start_client(self):
-
- from ai.core import AI
- from ts import JNIGridnetVecClient as Client
-
- self.vec_client = Client(
- self.max_steps,
- self.rfs,
- os.path.expanduser(self.microrts_path),
- self.map_paths,
- JArray(AI)([ai1(self.real_utt) for ai1 in self.ai1s]),
- JArray(AI)([ai2(self.real_utt) for ai2 in self.ai2s]),
- self.real_utt,
- self.partial_obs,
- )
- self.render_client = self.vec_client.botClients[0]
- # get the unit type table
- self.utt = json.loads(str(self.render_client.sendUTT()))
-
- def reset(self):
- responses = self.vec_client.reset([0 for _ in range(self.num_envs)])
- raw_obs, reward, done, info = np.ones((self.num_envs, 2)), np.array(responses.reward), np.array(responses.done), {}
- return raw_obs
-
- def step_async(self, actions):
- self.actions = actions
-
- def step_wait(self):
- responses = self.vec_client.gameStep(self.actions, [0 for _ in range(self.num_envs)])
- raw_obs, reward, done = np.ones((self.num_envs, 2)), np.array(responses.reward), np.array(responses.done)
- infos = [{"raw_rewards": item} for item in reward]
- return raw_obs, reward @ self.reward_weight, done[:, 0], infos
-
-
-class MicroRTSGridModeSharedMemVecEnv(MicroRTSGridModeVecEnv):
- """
- Similar function to `MicroRTSGridModeVecEnv` but uses shared mem buffers for
- zero-copy data exchange between NumPy and JVM runtimes. Drastically improves
- performance of the environment with some limitations introduced to the API.
- Notably, all games should be performed on the same map.
- """
-
- def __init__(
- self,
- num_selfplay_envs,
- num_bot_envs,
- partial_obs=False,
- max_steps=2000,
- render_theme=2,
- frame_skip=0,
- ai2s=[],
- map_paths=["maps/10x10/basesTwoWorkers10x10.xml"],
- reward_weight=np.array([0.0, 1.0, 0.0, 0.0, 0.0, 5.0]),
- cycle_maps=[],
- ):
- if len(map_paths) > 1 and len(set(map_paths)) > 1:
- raise ValueError("Mem shared environment requires all games to be played on the same map.")
-
- super(MicroRTSGridModeSharedMemVecEnv, self).__init__(
- num_selfplay_envs,
- num_bot_envs,
- partial_obs,
- max_steps,
- render_theme,
- frame_skip,
- ai2s,
- map_paths,
- reward_weight,
- cycle_maps,
- )
-
- def _allocate_shared_buffer(self, nbytes):
- from java.nio import ByteOrder
- from jpype.nio import convertToDirectBuffer
-
- c_buffer = bytearray(nbytes)
- jvm_buffer = convertToDirectBuffer(c_buffer).order(ByteOrder.nativeOrder()).asIntBuffer()
- np_buffer = np.asarray(jvm_buffer, order="C")
- return jvm_buffer, np_buffer
-
- def start_client(self):
-
- from ai.core import AI
- from rts import GameState
- from ts import JNIGridnetSharedMemVecClient as Client
-
- self.num_feature_planes = GameState.numFeaturePlanes
- num_unit_types = len(self.real_utt.getUnitTypes())
- self.action_space_dims = [6, 4, 4, 4, 4, num_unit_types, (self.real_utt.getMaxAttackRange() * 2 + 1) ** 2]
- self.masks_dim = sum(self.action_space_dims)
- self.action_dim = len(self.action_space_dims)
-
- # pre-allocate shared buffers with JVM
- obs_nbytes = self.num_envs * self.height * self.width * self.num_feature_planes * 4
- obs_jvm_buffer, obs_np_buffer = self._allocate_shared_buffer(obs_nbytes)
- self.obs = obs_np_buffer.reshape((self.num_envs, self.height, self.width, self.num_feature_planes))
-
- action_mask_nbytes = self.num_envs * self.height * self.width * self.masks_dim * 4
- action_mask_jvm_buffer, action_mask_np_buffer = self._allocate_shared_buffer(action_mask_nbytes)
- self.action_mask = action_mask_np_buffer.reshape((self.num_envs, self.height * self.width, self.masks_dim))
-
- action_nbytes = self.num_envs * self.width * self.height * self.action_dim * 4
- action_jvm_buffer, action_np_buffer = self._allocate_shared_buffer(action_nbytes)
- self.actions = action_np_buffer.reshape((self.num_envs, self.height * self.width, self.action_dim))
-
- self.vec_client = Client(
- self.num_selfplay_envs,
- self.num_bot_envs,
- self.max_steps,
- self.rfs,
- os.path.expanduser(self.microrts_path),
- self.map_paths[0],
- JArray(AI)([ai2(self.real_utt) for ai2 in self.ai2s]),
- self.real_utt,
- self.partial_obs,
- obs_jvm_buffer,
- action_mask_jvm_buffer,
- action_jvm_buffer,
- 0,
- )
- self.render_client = (
- self.vec_client.selfPlayClients[0] if len(self.vec_client.selfPlayClients) > 0 else self.vec_client.clients[0]
- )
- # get the unit type table
- self.utt = json.loads(str(self.render_client.sendUTT()))
-
- def reset(self):
- self.vec_client.reset([0] * self.num_envs)
- return self.obs
-
- def step_async(self, actions):
- actions = actions.reshape((self.num_envs, self.width * self.height, self.action_dim))
- np.copyto(self.actions, actions)
-
- def step_wait(self):
- responses = self.vec_client.gameStep([0] * self.num_envs)
- reward, done = np.array(responses.reward), np.array(responses.done)
- infos = [{"raw_rewards": item} for item in reward]
- # check if it is in evaluation, if not, then change maps
- if len(self.cycle_maps) > 1:
- # check if an environment is done, if done, reset the client, and replace the observation
- for done_idx, d in enumerate(done[:, 0]):
- # bot envs settings
- if done_idx < self.num_bot_envs:
- if d:
- self.vec_client.clients[done_idx].mapPath = next(self.next_map)
- self.vec_client.clients[done_idx].reset(0)
- # self.obs[done_idx] = self._encode_obs(np.array(response.observation))
- # selfplay envs settings
- else:
- if d and done_idx % 2 == 0:
- done_idx -= self.num_bot_envs # recalibrate the index
- self.vec_client.selfPlayClients[done_idx // 2].mapPath = next(self.next_map)
- self.vec_client.selfPlayClients[done_idx // 2].reset()
- # self.vec_client.selfPlayClients[done_idx // 2].reset()
- # self.obs[done_idx] = self._encode_obs(np.array(p0_response.observation))
- # self.obs[done_idx + 1] = self._encode_obs(np.array(p1_response.observation))
- return self.obs, reward @ self.reward_weight, done[:, 0], infos
-
- def get_action_mask(self):
- self.vec_client.getMasks(0)
- return self.action_mask
+import json
+import os
+import subprocess
+import sys
+import warnings
+import xml.etree.ElementTree as ET
+from itertools import cycle
+
+import gym
+import jpype
+import jpype.imports
+import numpy as np
+from jpype.imports import registerDomain
+from jpype.types import JArray, JInt
+from PIL import Image
+
+import gym_microrts
+
+MICRORTS_CLONE_MESSAGE = """
+WARNING: the repository does not include the microrts git submodule.
+Executing `git submodule update --init --recursive` to clone it now.
+"""
+
+MICRORTS_MAC_OS_RENDER_MESSAGE = """
+gym-microrts render is not available on MacOS. See https://github.com/jpype-project/jpype/issues/906
+
+It is however possible to record the videos via `env.render(mode='rgb_array')`.
+See https://github.com/vwxyzjn/gym-microrts/blob/b46c0815efd60ae959b70c14659efb95ef16ffb0/hello_world_record_video.py
+as an example.
+"""
+
+
+class MicroRTSGridModeVecEnv:
+ metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 150}
+ """
+ [[0]x_coordinate*y_coordinate(x*y), [1]a_t(6), [2]p_move(4), [3]p_harvest(4),
+ [4]p_return(4), [5]p_produce_direction(4), [6]p_produce_unit_type(z),
+ [7]x_coordinate*y_coordinate(x*y)]
+ Create a baselines VecEnv environment from a gym3 environment.
+ :param env: gym3 environment to adapt
+ """
+
+ def __init__(
+ self,
+ num_selfplay_envs,
+ num_bot_envs,
+ partial_obs=False,
+ max_steps=2000,
+ render_theme=2,
+ frame_skip=0,
+ ai2s=[],
+ map_paths=["maps/10x10/basesTwoWorkers10x10.xml"],
+ reward_weight=np.array([0.0, 1.0, 0.0, 0.0, 0.0, 5.0]),
+ cycle_maps=[],
+ autobuild=True,
+ jvm_args=[],
+ ):
+
+ self.num_selfplay_envs = num_selfplay_envs
+ self.num_bot_envs = num_bot_envs
+ self.num_envs = num_selfplay_envs + num_bot_envs
+ assert self.num_bot_envs == len(ai2s), "for each environment, a microrts ai should be provided"
+ self.partial_obs = partial_obs
+ self.max_steps = max_steps
+ self.render_theme = render_theme
+ self.frame_skip = frame_skip
+ self.ai2s = ai2s
+ self.map_paths = map_paths
+ if len(map_paths) == 1:
+ self.map_paths = [map_paths[0] for _ in range(self.num_envs)]
+ else:
+ assert (
+ len(map_paths) == self.num_envs
+ ), "if multiple maps are provided, they should be provided for each environment"
+ self.reward_weight = reward_weight
+
+ self.microrts_path = os.path.join(gym_microrts.__path__[0], "microrts")
+
+ # prepare training maps
+ self.cycle_maps = list(map(lambda i: os.path.join(self.microrts_path, i), cycle_maps))
+ self.next_map = cycle(self.cycle_maps)
+
+ if not os.path.exists(f"{self.microrts_path}/README.md"):
+ print(MICRORTS_CLONE_MESSAGE)
+ os.system(f"git submodule update --init --recursive")
+
+ if autobuild:
+ print(f"removing {self.microrts_path}/microrts.jar...")
+ if os.path.exists(f"{self.microrts_path}/microrts.jar"):
+ os.remove(f"{self.microrts_path}/microrts.jar")
+ print(f"building {self.microrts_path}/microrts.jar...")
+ root_dir = os.path.dirname(gym_microrts.__path__[0])
+ # redirect build output to build.log ("&>" inside an argv list is passed to build.sh verbatim, not treated as a shell redirect)
+ with open(os.path.join(root_dir, "build.log"), "w") as build_log:
+ subprocess.run(["bash", "build.sh"], cwd=root_dir, stdout=build_log, stderr=subprocess.STDOUT)
+
+ # read map
+ root = ET.parse(os.path.join(self.microrts_path, self.map_paths[0])).getroot()
+ self.height, self.width = int(root.get("height")), int(root.get("width"))
+
+ # launch the JVM
+ if not jpype._jpype.isStarted():
+ registerDomain("ts", alias="tests")
+ registerDomain("ai")
+ jars = [
+ "microrts.jar",
+ "lib/bots/Coac.jar",
+ "lib/bots/Droplet.jar",
+ "lib/bots/GRojoA3N.jar",
+ "lib/bots/Izanagi.jar",
+ "lib/bots/MixedBot.jar",
+ "lib/bots/TiamatBot.jar",
+ "lib/bots/UMSBot.jar",
+ "lib/bots/mayariBot.jar", # "MindSeal.jar"
+ ]
+ for jar in jars:
+ jpype.addClassPath(os.path.join(self.microrts_path, jar))
+ jpype.startJVM(*jvm_args, convertStrings=False)
+
+ # start microrts client
+ from rts.units import UnitTypeTable
+
+ self.real_utt = UnitTypeTable()
+ from ai.reward import (
+ AttackRewardFunction,
+ ProduceBuildingRewardFunction,
+ ProduceCombatUnitRewardFunction,
+ ProduceWorkerRewardFunction,
+ ResourceGatherRewardFunction,
+ RewardFunctionInterface,
+ WinLossRewardFunction,
+ )
+
+ self.rfs = JArray(RewardFunctionInterface)(
+ [
+ WinLossRewardFunction(),
+ ResourceGatherRewardFunction(),
+ ProduceWorkerRewardFunction(),
+ ProduceBuildingRewardFunction(),
+ AttackRewardFunction(),
+ ProduceCombatUnitRewardFunction(),
+ # CloserToEnemyBaseRewardFunction(),
+ ]
+ )
+ self.start_client()
+
+ # computed properties
+ # [num_planes_hp(5), num_planes_resources(5), num_planes_player(3),
+ # num_planes_unit_type(z), num_planes_unit_action(6), num_planes_terrain(2)]
+
+ self.num_planes = [5, 5, 3, len(self.utt["unitTypes"]) + 1, 6, 2]
+ if partial_obs:
+ self.num_planes = [5, 5, 3, len(self.utt["unitTypes"]) + 1, 6, 2, 2] # 2 extra for visibility
+ self.observation_space = gym.spaces.Box(
+ low=0.0, high=1.0, shape=(self.height, self.width, sum(self.num_planes)), dtype=np.int32
+ )
+
+ self.num_planes_len = len(self.num_planes)
+ self.num_planes_prefix_sum = [0]
+ for num_plane in self.num_planes:
+ self.num_planes_prefix_sum.append(self.num_planes_prefix_sum[-1] + num_plane)
+
+ self.action_space_dims = [6, 4, 4, 4, 4, len(self.utt["unitTypes"]), 7 * 7]
+ self.action_space = gym.spaces.MultiDiscrete(np.array([self.action_space_dims] * self.height * self.width).flatten())
+ self.action_plane_space = gym.spaces.MultiDiscrete(self.action_space_dims)
+ self.source_unit_idxs = np.tile(np.arange(self.height * self.width), (self.num_envs, 1))
+ self.source_unit_idxs = self.source_unit_idxs.reshape((self.source_unit_idxs.shape + (1,)))
+
+ def start_client(self):
+
+ from ai.core import AI
+ from ts import JNIGridnetVecClient as Client
+
+ self.vec_client = Client(
+ self.num_selfplay_envs,
+ self.num_bot_envs,
+ self.max_steps,
+ self.rfs,
+ os.path.expanduser(self.microrts_path),
+ self.map_paths,
+ JArray(AI)([ai2(self.real_utt) for ai2 in self.ai2s]),
+ self.real_utt,
+ self.partial_obs,
+ )
+ self.render_client = (
+ self.vec_client.selfPlayClients[0] if len(self.vec_client.selfPlayClients) > 0 else self.vec_client.clients[0]
+ )
+ # get the unit type table
+ self.utt = json.loads(str(self.render_client.sendUTT()))
+
+ def reset(self):
+ responses = self.vec_client.reset([0] * self.num_envs)
+ obs = [self._encode_obs(np.array(ro)) for ro in responses.observation]
+
+ return np.array(obs)
+
+ def _encode_obs(self, obs):
+ obs = obs.reshape(len(obs), -1).clip(0, np.array([self.num_planes]).T - 1)
+ obs_planes = np.zeros((self.height * self.width, self.num_planes_prefix_sum[-1]), dtype=np.int32)
+ obs_planes_idx = np.arange(len(obs_planes))
+ obs_planes[obs_planes_idx, obs[0]] = 1
+
+ for i in range(1, self.num_planes_len):
+ obs_planes[obs_planes_idx, obs[i] + self.num_planes_prefix_sum[i]] = 1
+ return obs_planes.reshape(self.height, self.width, -1)
+
+ def step_async(self, actions):
+ actions = actions.reshape((self.num_envs, self.width * self.height, -1))
+ actions = np.concatenate((self.source_unit_idxs, actions), 2) # specify source unit
+ # valid actions
+ actions = actions[np.where(self.source_unit_mask == 1)]
+ action_counts_per_env = self.source_unit_mask.sum(1)
+ java_actions = [None] * len(action_counts_per_env)
+ action_idx = 0
+ for outer_idx, action_count in enumerate(action_counts_per_env):
+ java_valid_action = [None] * action_count
+ for idx in range(action_count):
+ java_valid_action[idx] = JArray(JInt)(actions[action_idx])
+ action_idx += 1
+ java_actions[outer_idx] = JArray(JArray(JInt))(java_valid_action)
+ self.actions = JArray(JArray(JArray(JInt)))(java_actions)
+
+ def step_wait(self):
+ responses = self.vec_client.gameStep(self.actions, [0] * self.num_envs)
+ reward, done = np.array(responses.reward), np.array(responses.done)
+ obs = [self._encode_obs(np.array(ro)) for ro in responses.observation]
+ infos = [{"raw_rewards": item} for item in reward]
+ # when map cycling is enabled (training), finished envs are reset on the next map
+ if len(self.cycle_maps) > 0:
+ # for each finished environment, reset its client and replace its observation
+ for done_idx, d in enumerate(done[:, 0]):
+ # bot envs settings
+ if done_idx < self.num_bot_envs:
+ if d:
+ self.vec_client.clients[done_idx].mapPath = next(self.next_map)
+ response = self.vec_client.clients[done_idx].reset(0)
+ obs[done_idx] = self._encode_obs(np.array(response.observation))
+ # selfplay envs settings
+ else:
+ if d and done_idx % 2 == 0:
+ done_idx -= self.num_bot_envs # recalibrate the index
+ self.vec_client.selfPlayClients[done_idx // 2].mapPath = next(self.next_map)
+ self.vec_client.selfPlayClients[done_idx // 2].reset()
+ p0_response = self.vec_client.selfPlayClients[done_idx // 2].getResponse(0)
+ p1_response = self.vec_client.selfPlayClients[done_idx // 2].getResponse(1)
+ obs[done_idx] = self._encode_obs(np.array(p0_response.observation))
+ obs[done_idx + 1] = self._encode_obs(np.array(p1_response.observation))
+ return np.array(obs), reward @ self.reward_weight, done[:, 0], infos
+
+ def step(self, ac):
+ self.step_async(ac)
+ return self.step_wait()
+
+ def getattr_depth_check(self, name, already_found):
+ """
+ Check if an attribute reference is being hidden in a recursive call to __getattr__
+ :param name: (str) name of attribute to check for
+ :param already_found: (bool) whether this attribute has already been found in a wrapper
+ :return: (str or None) name of module whose attribute is being shadowed, if any.
+ """
+ if hasattr(self, name) and already_found:
+ return "{0}.{1}".format(type(self).__module__, type(self).__name__)
+ else:
+ return None
+
+ def render(self, mode="human"):
+ if mode == "human":
+ self.render_client.render(False)
+ # give warning on macos because the render is not available
+ if sys.platform == "darwin":
+ warnings.warn(MICRORTS_MAC_OS_RENDER_MESSAGE)
+ elif mode == "rgb_array":
+ bytes_array = np.array(self.render_client.render(True))
+ image = Image.frombytes("RGB", (640, 640), bytes_array)
+ return np.array(image)[:, :, ::-1] # reverse channel order (RGB -> BGR)
+
+ def close(self):
+ if jpype._jpype.isStarted():
+ self.vec_client.close()
+ jpype.shutdownJVM()
+
+ def get_action_mask(self):
+ """
+ :return: Mask for action types and action parameters,
+ of shape [num_envs, map height * width, action types + params]
+ """
+ # action_mask shape: [num_envs, map height, map width, 1 + action types + params]
+ action_mask = np.array(self.vec_client.getMasks(0))
+ # self.source_unit_mask shape: [num_envs, map height * map width * 1]
+ self.source_unit_mask = action_mask[:, :, :, 0].reshape(self.num_envs, -1)
+ action_type_and_parameter_mask = action_mask[:, :, :, 1:].reshape(self.num_envs, self.height * self.width, -1)
+ return action_type_and_parameter_mask
+
+
+class MicroRTSBotVecEnv(MicroRTSGridModeVecEnv):
+ metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 150}
+
+ def __init__(
+ self,
+ ai1s=[],
+ ai2s=[],
+ partial_obs=False,
+ max_steps=2000,
+ render_theme=2,
+ map_paths="maps/10x10/basesTwoWorkers10x10.xml",
+ reward_weight=np.array([0.0, 1.0, 0.0, 0.0, 0.0, 5.0]),
+ autobuild=True,
+ jvm_args=[],
+ ):
+
+ self.ai1s = ai1s
+ self.ai2s = ai2s
+ assert len(ai1s) == len(ai2s), "for each environment, a microrts ai should be provided"
+ self.num_envs = len(ai1s)
+ self.partial_obs = partial_obs
+ self.max_steps = max_steps
+ self.render_theme = render_theme
+ self.map_paths = map_paths
+ self.reward_weight = reward_weight
+
+ # read map
+ self.microrts_path = os.path.join(gym_microrts.__path__[0], "microrts")
+ if not os.path.exists(f"{self.microrts_path}/README.md"):
+ print(MICRORTS_CLONE_MESSAGE)
+ os.system(f"git submodule update --init --recursive")
+
+ if autobuild:
+ print(f"removing {self.microrts_path}/microrts.jar...")
+ if os.path.exists(f"{self.microrts_path}/microrts.jar"):
+ os.remove(f"{self.microrts_path}/microrts.jar")
+ print(f"building {self.microrts_path}/microrts.jar...")
+ root_dir = os.path.dirname(gym_microrts.__path__[0])
+ # redirect build output to build.log ("&>" inside an argv list is passed to build.sh verbatim, not treated as a shell redirect)
+ with open(os.path.join(root_dir, "build.log"), "w") as build_log:
+ subprocess.run(["bash", "build.sh"], cwd=root_dir, stdout=build_log, stderr=subprocess.STDOUT)
+
+ root = ET.parse(os.path.join(self.microrts_path, self.map_paths[0])).getroot()
+ self.height, self.width = int(root.get("height")), int(root.get("width"))
+
+ # launch the JVM
+ if not jpype._jpype.isStarted():
+ registerDomain("ts", alias="tests")
+ registerDomain("ai")
+ registerDomain("rts")
+ jars = [
+ "microrts.jar",
+ "lib/bots/Coac.jar",
+ "lib/bots/Droplet.jar",
+ "lib/bots/GRojoA3N.jar",
+ "lib/bots/Izanagi.jar",
+ "lib/bots/MixedBot.jar",
+ "lib/bots/TiamatBot.jar",
+ "lib/bots/UMSBot.jar",
+ "lib/bots/mayariBot.jar", # "MindSeal.jar"
+ ]
+ for jar in jars:
+ jpype.addClassPath(os.path.join(self.microrts_path, jar))
+ jpype.startJVM(*jvm_args, convertStrings=False)
+
+ # start microrts client
+ from rts.units import UnitTypeTable
+
+ self.real_utt = UnitTypeTable()
+ from ai.reward import (
+ AttackRewardFunction,
+ ProduceBuildingRewardFunction,
+ ProduceCombatUnitRewardFunction,
+ ProduceWorkerRewardFunction,
+ ResourceGatherRewardFunction,
+ RewardFunctionInterface,
+ WinLossRewardFunction,
+ )
+
+ self.rfs = JArray(RewardFunctionInterface)(
+ [
+ WinLossRewardFunction(),
+ ResourceGatherRewardFunction(),
+ ProduceWorkerRewardFunction(),
+ ProduceBuildingRewardFunction(),
+ AttackRewardFunction(),
+ ProduceCombatUnitRewardFunction(),
+ # CloserToEnemyBaseRewardFunction(),
+ ]
+ )
+ self.start_client()
+
+ # computed properties
+ # [num_planes_hp(5), num_planes_resources(5), num_planes_player(3),
+ # num_planes_unit_type(z), num_planes_unit_action(6), num_planes_terrain(2)]
+
+ self.num_planes = [5, 5, 3, len(self.utt["unitTypes"]) + 1, 6, 2]
+ if partial_obs:
+ self.num_planes = [5, 5, 3, len(self.utt["unitTypes"]) + 1, 6, 2, 2] # 2 extra for visibility
+ # placeholder spaces: bot-vs-bot envs expose no learnable observation or action
+ self.observation_space = gym.spaces.Discrete(2)
+ self.action_space = gym.spaces.Discrete(2)
+
+ def start_client(self):
+
+ from ai.core import AI
+ from ts import JNIGridnetVecClient as Client
+
+ self.vec_client = Client(
+ self.max_steps,
+ self.rfs,
+ os.path.expanduser(self.microrts_path),
+ self.map_paths,
+ JArray(AI)([ai1(self.real_utt) for ai1 in self.ai1s]),
+ JArray(AI)([ai2(self.real_utt) for ai2 in self.ai2s]),
+ self.real_utt,
+ self.partial_obs,
+ )
+ self.render_client = self.vec_client.botClients[0]
+ # get the unit type table
+ self.utt = json.loads(str(self.render_client.sendUTT()))
+
+ def reset(self):
+ self.vec_client.reset([0 for _ in range(self.num_envs)])
+ # bot envs play AI vs AI, so there is no meaningful observation to return
+ return np.ones((self.num_envs, 2))
+
+ def step_async(self, actions):
+ self.actions = actions
+
+ def step_wait(self):
+ responses = self.vec_client.gameStep(self.actions, [0 for _ in range(self.num_envs)])
+ raw_obs, reward, done = np.ones((self.num_envs, 2)), np.array(responses.reward), np.array(responses.done)
+ infos = [{"raw_rewards": item} for item in reward]
+ return raw_obs, reward @ self.reward_weight, done[:, 0], infos
+
+
+class MicroRTSGridModeSharedMemVecEnv(MicroRTSGridModeVecEnv):
+ """
+ Similar function to `MicroRTSGridModeVecEnv` but uses shared mem buffers for
+ zero-copy data exchange between NumPy and JVM runtimes. Drastically improves
+ performance of the environment with some limitations introduced to the API.
+ Notably, all games should be performed on the same map.
+ """
+
+ def __init__(
+ self,
+ num_selfplay_envs,
+ num_bot_envs,
+ partial_obs=False,
+ max_steps=2000,
+ render_theme=2,
+ frame_skip=0,
+ ai2s=[],
+ map_paths=["maps/10x10/basesTwoWorkers10x10.xml"],
+ reward_weight=np.array([0.0, 1.0, 0.0, 0.0, 0.0, 5.0]),
+ cycle_maps=[],
+ ):
+ if len(map_paths) > 1 and len(set(map_paths)) > 1:
+ raise ValueError("Mem shared environment requires all games to be played on the same map.")
+
+ super(MicroRTSGridModeSharedMemVecEnv, self).__init__(
+ num_selfplay_envs,
+ num_bot_envs,
+ partial_obs,
+ max_steps,
+ render_theme,
+ frame_skip,
+ ai2s,
+ map_paths,
+ reward_weight,
+ cycle_maps,
+ )
+
+ def _allocate_shared_buffer(self, nbytes):
+ from java.nio import ByteOrder
+ from jpype.nio import convertToDirectBuffer
+
+ c_buffer = bytearray(nbytes)
+ jvm_buffer = convertToDirectBuffer(c_buffer).order(ByteOrder.nativeOrder()).asIntBuffer()
+ np_buffer = np.asarray(jvm_buffer, order="C")
+ return jvm_buffer, np_buffer
+
+ def start_client(self):
+
+ from ai.core import AI
+ from rts import GameState
+ from ts import JNIGridnetSharedMemVecClient as Client
+
+ self.num_feature_planes = GameState.numFeaturePlanes
+ num_unit_types = len(self.real_utt.getUnitTypes())
+ self.action_space_dims = [6, 4, 4, 4, 4, num_unit_types, (self.real_utt.getMaxAttackRange() * 2 + 1) ** 2]
+ self.masks_dim = sum(self.action_space_dims)
+ self.action_dim = len(self.action_space_dims)
+
+ # pre-allocate shared buffers with JVM
+ obs_nbytes = self.num_envs * self.height * self.width * self.num_feature_planes * 4
+ obs_jvm_buffer, obs_np_buffer = self._allocate_shared_buffer(obs_nbytes)
+ self.obs = obs_np_buffer.reshape((self.num_envs, self.height, self.width, self.num_feature_planes))
+
+ action_mask_nbytes = self.num_envs * self.height * self.width * self.masks_dim * 4
+ action_mask_jvm_buffer, action_mask_np_buffer = self._allocate_shared_buffer(action_mask_nbytes)
+ self.action_mask = action_mask_np_buffer.reshape((self.num_envs, self.height * self.width, self.masks_dim))
+
+ action_nbytes = self.num_envs * self.width * self.height * self.action_dim * 4
+ action_jvm_buffer, action_np_buffer = self._allocate_shared_buffer(action_nbytes)
+ self.actions = action_np_buffer.reshape((self.num_envs, self.height * self.width, self.action_dim))
+
+ self.vec_client = Client(
+ self.num_selfplay_envs,
+ self.num_bot_envs,
+ self.max_steps,
+ self.rfs,
+ os.path.expanduser(self.microrts_path),
+ self.map_paths[0],
+ JArray(AI)([ai2(self.real_utt) for ai2 in self.ai2s]),
+ self.real_utt,
+ self.partial_obs,
+ obs_jvm_buffer,
+ action_mask_jvm_buffer,
+ action_jvm_buffer,
+ 0,
+ )
+ self.render_client = (
+ self.vec_client.selfPlayClients[0] if len(self.vec_client.selfPlayClients) > 0 else self.vec_client.clients[0]
+ )
+ # get the unit type table
+ self.utt = json.loads(str(self.render_client.sendUTT()))
+
+ def reset(self):
+ self.vec_client.reset([0] * self.num_envs)
+ return self.obs
+
+ def step_async(self, actions):
+ actions = actions.reshape((self.num_envs, self.width * self.height, self.action_dim))
+ np.copyto(self.actions, actions)
+
+ def step_wait(self):
+ responses = self.vec_client.gameStep([0] * self.num_envs)
+ reward, done = np.array(responses.reward), np.array(responses.done)
+ infos = [{"raw_rewards": item} for item in reward]
+ # when map cycling is enabled (training), finished envs are reset on the next map
+ if len(self.cycle_maps) > 1:
+ # for each finished environment, reset its client; the shared buffers refresh the observation in place
+ for done_idx, d in enumerate(done[:, 0]):
+ # bot envs settings
+ if done_idx < self.num_bot_envs:
+ if d:
+ self.vec_client.clients[done_idx].mapPath = next(self.next_map)
+ self.vec_client.clients[done_idx].reset(0)
+ # self.obs[done_idx] = self._encode_obs(np.array(response.observation))
+ # selfplay envs settings
+ else:
+ if d and done_idx % 2 == 0:
+ done_idx -= self.num_bot_envs # recalibrate the index
+ self.vec_client.selfPlayClients[done_idx // 2].mapPath = next(self.next_map)
+ self.vec_client.selfPlayClients[done_idx // 2].reset()
+ # self.obs[done_idx] = self._encode_obs(np.array(p0_response.observation))
+ # self.obs[done_idx + 1] = self._encode_obs(np.array(p1_response.observation))
+ return self.obs, reward @ self.reward_weight, done[:, 0], infos
+
+ def get_action_mask(self):
+ self.vec_client.getMasks(0)
+ return self.action_mask
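The `_encode_obs` method in the patch one-hot encodes each integer feature column into concatenated binary planes, using a prefix sum over `num_planes` to place each feature's slice. A minimal NumPy sketch of that layout (the map size, plane sizes, and raw observation values below are invented for illustration, not the real UTT-derived values):

```python
import numpy as np

# illustrative plane sizes: hp(5), resources(5), owner(3), unit_type(4), action(6)
num_planes = [5, 5, 3, 4, 6]
prefix = np.concatenate(([0], np.cumsum(num_planes)))  # offsets of each feature's slice

h = w = 2  # tiny 2x2 "map"
# raw observation: one integer per feature per cell, shape (num features, h*w)
obs = np.array([
    [1, 0, 0, 4],  # hp index per cell
    [0, 0, 0, 4],  # resource index per cell
    [0, 1, 2, 0],  # owner index per cell
    [0, 3, 3, 1],  # unit-type index per cell
    [0, 0, 0, 0],  # action index per cell
])

planes = np.zeros((h * w, prefix[-1]), dtype=np.int32)
cells = np.arange(h * w)
for i in range(len(num_planes)):
    # set the one-hot bit inside feature i's slice for every cell at once
    planes[cells, obs[i] + prefix[i]] = 1

encoded = planes.reshape(h, w, -1)
assert encoded.shape == (2, 2, 23)
assert encoded.sum() == h * w * len(num_planes)  # exactly one bit per feature per cell
```

The prefix-sum bookkeeping is what lets the real environment keep variable-width slices (e.g. a unit-type slice whose width depends on the unit type table) in one flat plane axis.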
diff --git a/gym_microrts/microrts b/gym_microrts/microrts
index 1e3b6639..05f2ac7e 160000
--- a/gym_microrts/microrts
+++ b/gym_microrts/microrts
@@ -1 +1 @@
-Subproject commit 1e3b6639b05c188767b4098c36813319b71222db
+Subproject commit 05f2ac7e80cbb8398e2acee1c3e335fc85225f2f
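In `step_async` above, the flat per-cell action grid is first prefixed with each cell's own index (the source unit position) and then filtered down to cells that actually hold a commandable unit, via `source_unit_mask`. A self-contained sketch of that selection, with invented sizes and mask values:

```python
import numpy as np

num_envs, cells, action_dim = 2, 4, 7
actions = np.arange(num_envs * cells * action_dim).reshape(num_envs, cells, action_dim)

# prepend each cell's own index so the client knows the source unit
source_idx = np.tile(np.arange(cells), (num_envs, 1))[..., None]
actions = np.concatenate((source_idx, actions), axis=2)  # (num_envs, cells, 1 + action_dim)

# 1 where a cell holds a unit the player may command (invented mask)
source_unit_mask = np.array([[1, 0, 0, 1],
                             [0, 1, 0, 0]])

valid = actions[source_unit_mask == 1]  # one row per commanded unit, envs concatenated
counts = source_unit_mask.sum(1)        # how many of those rows belong to each env
assert valid.shape == (3, 1 + action_dim)
assert counts.tolist() == [2, 1]
```

Boolean indexing flattens across environments, which is why the real code walks `counts` afterwards to slice the rows back into one Java array per environment.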
diff --git a/tests/test_e2e.py b/tests/test_e2e.py
index 0746b538..1ccb3009 100644
--- a/tests/test_e2e.py
+++ b/tests/test_e2e.py
@@ -1,35 +1,35 @@
-import subprocess
-
-
-def test_ppo_gridnet():
-
- try:
- subprocess.run(
- "cd experiments; python ppo_gridnet.py --num-bot-envs 0 --num-selfplay-envs 2 --num-steps 16 --total-timesteps 32 --cuda False --max-eval-workers 0",
- shell=True,
- check=True,
- )
- except subprocess.CalledProcessError as grepexc:
- print("error code", grepexc.returncode, grepexc.output)
- assert grepexc.returncode in [0, 134]
-
-
-def test_ppo_gridnet_eval_selfplay():
- try:
- subprocess.run(
- "cd experiments; python ppo_gridnet_eval.py --num-steps 16 --total-timesteps 32 --cuda False",
- shell=True,
- check=True,
- )
- except subprocess.CalledProcessError as grepexc:
- print("error code", grepexc.returncode, grepexc.output)
- assert grepexc.returncode in [0, 134]
-
-
-def test_ppo_gridnet_eval_bot():
-
- subprocess.run(
- "cd experiments; python ppo_gridnet_eval.py --ai coacAI --num-steps 16 --total-timesteps 32 --cuda False",
- shell=True,
- check=True,
- )
+import subprocess
+
+
+def test_ppo_gridnet():
+
+ try:
+ subprocess.run(
+ "cd experiments; python ppo_gridnet.py --num-bot-envs 0 --num-selfplay-envs 2 --num-steps 16 --total-timesteps 32 --cuda False --max-eval-workers 0",
+ shell=True,
+ check=True,
+ )
+ except subprocess.CalledProcessError as grepexc:
+ print("error code", grepexc.returncode, grepexc.output)
+ assert grepexc.returncode in [0, 134]
+
+
+def test_ppo_gridnet_eval_selfplay():
+ try:
+ subprocess.run(
+ "cd experiments; python ppo_gridnet_eval.py --num-steps 16 --total-timesteps 32 --cuda False",
+ shell=True,
+ check=True,
+ )
+ except subprocess.CalledProcessError as grepexc:
+ print("error code", grepexc.returncode, grepexc.output)
+ assert grepexc.returncode in [0, 134]
+
+
+def test_ppo_gridnet_eval_bot():
+
+ subprocess.run(
+ "cd experiments; python ppo_gridnet_eval.py --ai coacAI --num-steps 16 --total-timesteps 32 --cuda False",
+ shell=True,
+ check=True,
+ )
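The `returncode in [0, 134]` checks above tolerate a known shutdown quirk rather than a training failure. As a sketch, the acceptance rule amounts to the following (the helper name is illustrative, not part of the repo):

```python
# A hypothetical helper (not in the repo) illustrating why the tests above
# accept exit code 134 alongside 0: POSIX shells report a process killed by
# signal N as 128 + N, and SIGABRT is signal 6, so a JVM that aborts during
# teardown exits with 128 + 6 = 134 even though the run itself completed.
def is_acceptable_exit(returncode: int) -> bool:
    """Treat a clean exit (0) or a SIGABRT-terminated exit (134) as success."""
    return returncode in (0, 134)

print(is_acceptable_exit(134))  # True
```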
diff --git a/tests/test_observation.py b/tests/test_observation.py
index 8e712269..6f4a69c7 100644
--- a/tests/test_observation.py
+++ b/tests/test_observation.py
@@ -1,80 +1,108 @@
-import numpy as np
-
-from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
-
-render = False
-
-
-def test_observation():
- envs = MicroRTSGridModeVecEnv(
- num_bot_envs=0,
- num_selfplay_envs=2,
- partial_obs=False,
- max_steps=5000,
- render_theme=2,
- ai2s=[],
- map_paths=["maps/16x16/basesWorkers16x16A.xml"],
- reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
- )
-
- # fmt: off
- next_obs = envs.reset()
- resource = np.array([
- 0., 1., 0., 0., 0., # 1 hp
- 0., 0., 0., 0., 1., # >= 4 resources
- 1., 0., 0., # no owner
- 0., 1., 0., 0., 0., 0., 0., 0., # unit type resource
- 1., 0., 0., 0., 0., 0. # currently not executing actions
- ]).astype(np.int32)
- p1_worker = np.array([
- 0., 1., 0., 0., 0., # 1 hp
- 1., 0., 0., 0., 0., # 0 resources
- 0., 1., 0., # player 1 owns it
- 0., 0., 0., 0., 1., 0., 0., 0., # unit type worker
- 1., 0., 0., 0., 0., 0. # currently not executing actions
- ]).astype(np.int32)
- p1_base = np.array([
- 0., 0., 0., 0., 1., # >= 4 hp (base has 10 hp)
- 1., 0., 0., 0., 0., # 0 resources
- 0., 1., 0., # player 1 owns it
- 0., 0., 1., 0., 0., 0., 0., 0., # unit type base
- 1., 0., 0., 0., 0., 0. # currently not executing actions
- ]).astype(np.int32)
- p2_worker = p1_worker.copy()
- p2_worker[10:13] = np.array([0., 0., 1.,]) # player 2 owns it
- p2_base = p1_base.copy()
- p2_base[10:13] = np.array([0., 0., 1.,]) # player 2 owns it
- empty_cell = np.array([
- 1., 0., 0., 0., 0., # 0 hp
- 1., 0., 0., 0., 0., # 0 resources
- 1., 0., 0., # no owner
- 1., 0., 0., 0., 0., 0., 0., 0., # unit type empty cell
- 1., 0., 0., 0., 0., 0. # currently not executing actions
- ]).astype(np.int32)
- # fmt: on
-
- # player 1's perspective
- np.testing.assert_array_equal(next_obs[0][0][0], resource)
- np.testing.assert_array_equal(next_obs[0][1][0], resource)
- np.testing.assert_array_equal(next_obs[0][1][1], p1_worker)
- np.testing.assert_array_equal(next_obs[0][2][2], p1_base)
- np.testing.assert_array_equal(next_obs[0][15][15], resource)
- np.testing.assert_array_equal(next_obs[0][14][15], resource)
- np.testing.assert_array_equal(next_obs[0][14][14], p2_worker)
- np.testing.assert_array_equal(next_obs[0][13][13], p2_base)
-
- # player 2's perspective (self play)
- np.testing.assert_array_equal(next_obs[1][0][0], resource)
- np.testing.assert_array_equal(next_obs[1][1][0], resource)
- np.testing.assert_array_equal(next_obs[1][1][1], p2_worker)
- np.testing.assert_array_equal(next_obs[1][2][2], p2_base)
- np.testing.assert_array_equal(next_obs[1][15][15], resource)
- np.testing.assert_array_equal(next_obs[1][14][15], resource)
- np.testing.assert_array_equal(next_obs[1][14][14], p1_worker)
- np.testing.assert_array_equal(next_obs[1][13][13], p1_base)
-
- feature_sum = 0
- for item in [resource, resource, p1_worker, p1_base, resource, resource, p2_worker, p2_base]:
- feature_sum += item.sum()
- feature_sum += empty_cell.sum() * (256 - 8)
- assert next_obs.sum() == feature_sum * 2 == 2560.0
+import numpy as np
+
+from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
+
+render = False
+
+
+def test_observation():
+ envs = MicroRTSGridModeVecEnv(
+ num_bot_envs=0,
+ num_selfplay_envs=2,
+ partial_obs=False,
+ max_steps=5000,
+ render_theme=2,
+ ai2s=[],
+ map_paths=["maps/16x16/basesWorkers16x16A.xml"],
+ reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
+ )
+
+ # fmt: off
+ next_obs = envs.reset()
+ resource = np.array([
+ 0., 1., 0., 0., 0., # 1 hp
+ 0., 0., 0., 0., 1., # >= 4 resources
+ 1., 0., 0., # no owner
+ 0., 1., 0., 0., 0., 0., 0., 0., # unit type resource
+ 1., 0., 0., 0., 0., 0., # currently not executing actions
+ 1., 0., # terrain: TERRAIN_NONE
+ ]).astype(np.int32)
+ p1_worker = np.array([
+ 0., 1., 0., 0., 0., # 1 hp
+ 1., 0., 0., 0., 0., # 0 resources
+ 0., 1., 0., # player 1 owns it
+ 0., 0., 0., 0., 1., 0., 0., 0., # unit type worker
+ 1., 0., 0., 0., 0., 0., # currently not executing actions
+ 1., 0., # terrain: TERRAIN_NONE
+ ]).astype(np.int32)
+ p1_base = np.array([
+ 0., 0., 0., 0., 1., # >= 4 hp (base has 10 hp)
+ 1., 0., 0., 0., 0., # 0 resources
+ 0., 1., 0., # player 1 owns it
+ 0., 0., 1., 0., 0., 0., 0., 0., # unit type base
+ 1., 0., 0., 0., 0., 0., # currently not executing actions
+ 1., 0., # terrain: TERRAIN_NONE
+ ]).astype(np.int32)
+ p2_worker = p1_worker.copy()
+ p2_worker[10:13] = np.array([0., 0., 1., ]) # player 2 owns it
+ p2_base = p1_base.copy()
+ p2_base[10:13] = np.array([0., 0., 1., ]) # player 2 owns it
+ empty_cell = np.array([
+ 1., 0., 0., 0., 0., # 0 hp
+ 1., 0., 0., 0., 0., # 0 resources
+ 1., 0., 0., # no owner
+ 1., 0., 0., 0., 0., 0., 0., 0., # unit type empty cell
+ 1., 0., 0., 0., 0., 0., # currently not executing actions
+ 1., 0., # terrain: TERRAIN_NONE
+ ]).astype(np.int32)
+ # fmt: on
+
+ # player 1's perspective
+ np.testing.assert_array_equal(next_obs[0][0][0], resource)
+ np.testing.assert_array_equal(next_obs[0][1][0], resource)
+ np.testing.assert_array_equal(next_obs[0][1][1], p1_worker)
+ np.testing.assert_array_equal(next_obs[0][2][2], p1_base)
+ np.testing.assert_array_equal(next_obs[0][15][15], resource)
+ np.testing.assert_array_equal(next_obs[0][14][15], resource)
+ np.testing.assert_array_equal(next_obs[0][14][14], p2_worker)
+ np.testing.assert_array_equal(next_obs[0][13][13], p2_base)
+
+ # player 2's perspective (self play)
+ np.testing.assert_array_equal(next_obs[1][0][0], resource)
+ np.testing.assert_array_equal(next_obs[1][1][0], resource)
+ np.testing.assert_array_equal(next_obs[1][1][1], p2_worker)
+ np.testing.assert_array_equal(next_obs[1][2][2], p2_base)
+ np.testing.assert_array_equal(next_obs[1][15][15], resource)
+ np.testing.assert_array_equal(next_obs[1][14][15], resource)
+ np.testing.assert_array_equal(next_obs[1][14][14], p1_worker)
+ np.testing.assert_array_equal(next_obs[1][13][13], p1_base)
+
+ feature_sum = 0
+ for item in [resource, resource, p1_worker, p1_base, resource, resource, p2_worker, p2_base]:
+ feature_sum += item.sum()
+ feature_sum += empty_cell.sum() * (256 - 8)
+ assert next_obs.sum() == feature_sum * 2 == 3072.0
+
+ # test observation with walls
+ envs = MicroRTSGridModeVecEnv(
+ num_bot_envs=0,
+ num_selfplay_envs=2,
+ partial_obs=False,
+ max_steps=5000,
+ render_theme=2,
+ ai2s=[],
+ map_paths=["maps/barricades24x24.xml"],
+ reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
+ )
+ # fmt: off
+ wall = np.array([
+ 1., 0., 0., 0., 0., # 0 hp
+ 1., 0., 0., 0., 0., # 0 resources
+ 1., 0., 0., # no owner
+ 1., 0., 0., 0., 0., 0., 0., 0., # no unit type (wall is terrain, not a unit)
+ 1., 0., 0., 0., 0., 0., # currently not executing actions
+ 0., 1., # terrain: TERRAIN_WALL
+ ]).astype(np.int32)
+ # fmt: on
+ next_obs = envs.reset()
+ np.testing.assert_array_equal(next_obs[0][6][6], wall)
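The `3072.0` expected sum in the updated test can be sanity-checked by hand. With the new 2-dim terrain plane, each cell's feature vector is a concatenation of six one-hot groups (5 hp + 5 resources + 3 owner + 8 unit type + 6 current action + 2 terrain = 29 entries), so every cell contributes exactly six ones regardless of what occupies it. A minimal sketch of the arithmetic (variable names are illustrative):

```python
# Sanity check for the expected observation sum in the test above.
# Each cell is a concatenation of six one-hot groups, so it always
# contributes exactly 6 ones no matter what occupies it.
feature_groups = [5, 5, 3, 8, 6, 2]  # hp, resources, owner, unit type, action, terrain
ones_per_cell = len(feature_groups)  # one hot slot per group
cells = 16 * 16                      # basesWorkers16x16A is a 16x16 map
players = 2                          # self-play: one observation per player
total = ones_per_cell * cells * players
print(total)  # 3072
```

The same reasoning explains the old expected value: before the terrain plane was added there were five one-hot groups per cell, giving 5 * 256 * 2 = 2560.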