OptiChain is an agent-based, reinforcement learning platform for supply chain optimization. It simulates factories, warehouses and markets using a custom Gymnasium environment built on SimPy, trains agents with Stable-Baselines3 (PPO), and provides scripts to test & evaluate learned policies.
- Custom Gymnasium environment: `ares_environment/supply_chain_env.py` (factory → warehouse → market flow)
- Agent training script: `train_agent.py` (Stable-Baselines3 PPO)
- Environment test script: `test_env.py` (checks and simple stepping)
- Evaluation script: `evaluate_agent.py` (compare trained agent with a constant baseline)
- Logs & models: `logs/` and `trained_models/`
```
OptiChain/ (local folder name: ARES-Supply-Chain-Optimization-main)
├── .gitignore
├── README.md                 # (you are editing this)
├── requirements.txt          # python dependencies
├── train_agent.py            # train RL agent (PPO)
├── test_env.py               # smoke tests and demo stepping through env
├── evaluate_agent.py         # evaluate saved agent vs baseline
├── ares_environment/         # custom environment implementation
│   ├── __init__.py
│   ├── supply_chain_env.py   # main Gym environment class
│   └── simulation_nodes.py   # Factory, Warehouse, Market classes and helpers
├── docs/                     # figures (training_graph.png, etc.)
├── logs/                     # training logs (TensorBoard friendly)
├── trained_models/           # saved model(s), e.g. ppo_ares_agent.zip
└── venv/                     # (included in the zip; recommend removing from repo)
```
⚠️ **Important:** `venv/` is included in the project archive but should not be tracked in GitHub. Remove it before pushing; see Tips below.
These exact commands will get the repository running locally (tested for a local CPU-based setup):
- Clone (or download & extract) the repo

  ```shell
  # if you haven't already
  git clone https://github.com/meanderinghuman/OptiChain.git
  cd OptiChain
  ```

- Create and activate a virtual environment
  ```shell
  python -m venv venv
  # mac/linux
  source venv/bin/activate
  # windows (PowerShell)
  venv\Scripts\Activate.ps1
  # windows (cmd)
  venv\Scripts\activate
  ```

- Install dependencies
  ```shell
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

  If you want GPU support, install a PyTorch build that matches your CUDA version (see the official PyTorch install instructions). The `requirements.txt` contains `torch`, but for CUDA you may prefer installing `torch` manually.
- (Optional) Run an environment smoke test, which confirms the custom Gym env follows the API

  ```shell
  python test_env.py
  ```

  You should see printed messages showing environment checking and a short 20-step demonstration with observations/rewards.
- Train the PPO agent (default short run)

  ```shell
  python train_agent.py
  ```

  - By default the script trains with `TIMESTEPS_TO_TRAIN = 50000`. Edit `train_agent.py` to increase `TIMESTEPS_TO_TRAIN` (e.g. `1_000_000`) for serious training.
  - Training writes live logs to `logs/<timestamp>/` and saves the final model as `trained_models/ppo_ares_agent.zip`.
- Monitor training with TensorBoard (open a separate terminal)

  ```shell
  tensorboard --logdir=logs/ --port=6006
  # then open http://localhost:6006 in your browser
  ```

- Evaluate the trained agent

  ```shell
  python evaluate_agent.py
  ```

  This script loads `trained_models/ppo_ares_agent.zip` and compares average reward against a simple constant-order baseline. It prints a summary; if the PPO agent wins, you'll see a success message.
Below are short, copyable snippets that match how the repo's scripts use the environment and models.
This is the same idea implemented by `test_env.py`.
```python
from ares_environment.supply_chain_env import SupplyChainEnv

env = SupplyChainEnv()

# Gymnasium-style reset returns (obs, info)
obs, info = env.reset()
print("Initial observation:", obs)

for i in range(20):
    # random action in the normalized action space [-1, 1]
    action = env.action_space.sample()
    # step returns: observation, reward, terminated, truncated, info
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"Step {i+1}: action={action} reward={reward:.2f} obs={obs}")
    if terminated or truncated:
        print("Episode ended")
        break

env.close()
```

**Observation space:** a 3-dimensional numpy array: `[factory_inventory, warehouse_inventory, market_demand]`

**Action space:** a single value in `[-1, 1]`, which the environment rescales to an integer order quantity via:

```python
order_quantity = int(((action[0] + 1) / 2) * env.max_order_quantity)
```

This keeps the agent's policy normalized and stable while allowing discrete order quantities internally.
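The rescaling above is easy to check in isolation. A minimal sketch, assuming for illustration a `max_order_quantity` of 100 (the actual value lives in the environment):

```python
def rescale_action(action0: float, max_order_quantity: int = 100) -> int:
    """Map a normalized action in [-1, 1] to an integer order quantity."""
    return int(((action0 + 1) / 2) * max_order_quantity)

# endpoints and midpoint of the normalized range
print(rescale_action(-1.0))  # 0   (order nothing)
print(rescale_action(0.0))   # 50  (half of max)
print(rescale_action(1.0))   # 100 (max order)
```

Note the mapping is linear, so small policy adjustments translate into proportionally small order changes.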
`train_agent.py` creates the environment, sets up a Stable-Baselines3 PPO model, trains it, and saves it. The core flow is:
```python
from stable_baselines3 import PPO
from ares_environment.supply_chain_env import SupplyChainEnv

env = SupplyChainEnv()

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    tensorboard_log="logs/",
)

model.learn(total_timesteps=50000, tb_log_name="PPO_ARES_v1")
model.save("trained_models/ppo_ares_agent.zip")
env.close()
```

Change `total_timesteps` to a larger value for production training (e.g. `1_000_000`+).
`evaluate_agent.py` demonstrates a comparison run: it loads the trained PPO model and runs multiple episodes for both the learned policy and a simple baseline.
The sketch:
```python
import numpy as np
from stable_baselines3 import PPO
from ares_environment.supply_chain_env import SupplyChainEnv

env = SupplyChainEnv()
model = PPO.load("trained_models/ppo_ares_agent.zip", env=env)

# baseline: always order 30 units (rescaled to the action space inside the file)
# run several episodes and compare average rewards
# final printout: average reward, PPO vs baseline
```

Run the file directly:

```shell
python evaluate_agent.py
```

To roll out the trained policy deterministically:

```python
from stable_baselines3 import PPO
from ares_environment.supply_chain_env import SupplyChainEnv

env = SupplyChainEnv()
model = PPO.load("trained_models/ppo_ares_agent.zip", env=env)

obs, info = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```

This is useful for producing policy rollouts for demo videos or plotting.
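The agent-vs-baseline comparison can be sketched without Stable-Baselines3 at all. Everything below is illustrative: the stub episode, the cost numbers, and the `reactive_policy` stand-in for the learned policy are assumptions, not the repo's actual dynamics:

```python
import random

def run_episode(policy, rng, steps=50):
    """Stub episode: reward is units sold minus a small holding cost (illustrative only)."""
    inventory, total_reward = 50, 0.0
    for _ in range(steps):
        inventory += policy(inventory)          # apply the ordering decision
        demand = rng.randint(20, 40)            # toy stochastic market demand
        sold = min(inventory, demand)
        inventory -= sold
        total_reward += sold - 0.1 * inventory  # revenue minus holding cost
    return total_reward

def average_reward(policy, episodes=10, seed=0):
    """Average episode reward for a policy, with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return sum(run_episode(policy, rng) for _ in range(episodes)) / episodes

constant_baseline = lambda inv: 30               # always order 30 units
reactive_policy = lambda inv: max(0, 35 - inv)   # stand-in for the learned policy

print(f"baseline: {average_reward(constant_baseline):.1f}")
print(f"reactive: {average_reward(reactive_policy):.1f}")
```

The real script follows the same shape, with `SupplyChainEnv` as the environment and `model.predict` as the learned policy.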
- `ares_environment/supply_chain_env.py` implements a Gymnasium-compatible environment that uses SimPy to simulate time-based shipping and demand processes.
- The environment maintains three core components: `Factory`, `Warehouse`, and `Market` (see `ares_environment/simulation_nodes.py`).
- The RL agent controls ordering decisions (action → order quantity). Rewards are shaped to encourage revenue while penalizing holding costs and unmet demand.
- The repo uses Stable-Baselines3's PPO as the default learning algorithm for reliability and reproducibility.
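The reward shaping described above can be illustrated with a toy, pure-Python computation. The price and cost constants here are assumptions for illustration only; the real environment defines its own values inside `supply_chain_env.py`:

```python
def step_reward(units_sold, warehouse_inventory, unmet_demand,
                unit_price=10.0, holding_cost=0.5, stockout_penalty=2.0):
    """Toy reward: revenue minus holding cost and unmet-demand penalty."""
    revenue = units_sold * unit_price
    holding = warehouse_inventory * holding_cost
    shortage = unmet_demand * stockout_penalty
    return revenue - holding - shortage

# selling 30 units with 20 left in the warehouse and 5 unsatisfied orders
print(step_reward(30, 20, 5))  # 300 - 10 - 10 = 280.0
```

The key design point is that both over-ordering (holding cost) and under-ordering (stockout penalty) reduce reward, so the agent is pushed toward matching demand.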
- **Remove `venv/` before pushing to GitHub:** it bloats the repo. Locally run:

  ```shell
  # remove the folder from disk and stop tracking it
  rm -rf venv
  git rm -r --cached venv
  # add venv to .gitignore
  echo "venv/" >> .gitignore
  ```

- **Large requirements file:** If `pip install -r requirements.txt` fails because of binary wheels (e.g., `torch`), install PyTorch from its official installer for your OS/CUDA combination, then re-run `pip install -r requirements.txt` with `--no-deps` if needed.
- **TensorBoard logs not visible?** Make sure you started `tensorboard` pointing at the `logs/` parent directory that contains the timestamped subfolders created during training.
- **Stable-Baselines3 version:** If you have an older SB3 or a Gym/Gymnasium mismatch, you may see API errors. The repository uses `gymnasium` and SB3; upgrade/downgrade packages accordingly.
- **If training is slow:** reduce `total_timesteps` when experimenting, or run on a machine with a GPU and install a CUDA-enabled PyTorch.
- Add a Flask-based dashboard to visualize live rollouts & KPIs
- Add configurable scenario files (demand profiles, shipping latency distributions)
- Integrate real datasets (CSV ingestion, external API connectors)
- Add automated unit tests for the environment dynamics
- Add hyperparameter tuning (Optuna / Ray Tune)
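The configurable-scenario idea above could take the shape of a small declarative config. Everything in this sketch is speculative; none of these keys exist in the repo today:

```python
# purely speculative scenario schema, shown as a plain dict
scenario = {
    "demand_profile": {"distribution": "poisson", "mean": 30},
    "shipping_latency": {"distribution": "uniform", "min_days": 1, "max_days": 4},
    "costs": {"holding_per_unit": 0.5, "stockout_penalty": 2.0},
}
print(sorted(scenario))  # ['costs', 'demand_profile', 'shipping_latency']
```

Loading such a dict at environment construction time would let demand profiles and shipping latencies vary between experiments without code changes.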
Contributions, issues and feature requests are welcome!
- Fork this repository
- Create a branch (`git checkout -b feature/awesome`)
- Commit your changes (`git commit -m 'Add feature'`)
- Push to the branch (`git push origin feature/awesome`)
- Open a Pull Request
This project is released under the MIT License. If you'd like to reach out: Siddharth Pal (meanderinghuman).