This repository contains the implementation of a Proximal Policy Optimization (PPO) framework for training Non-Player Characters (NPCs) in a web-based 3D voxel hide-and-seek game. The system demonstrates that browser-based 3D environments can serve as effective training platforms for multi-agent reinforcement learning, reaching stable convergence within a 600-iteration run comprising 143,961 training episodes.
The framework combines Ray RLlib's PPO implementation with a JavaScript-based THREE.js game environment via WebSocket communication, enabling sophisticated multi-agent RL training without requiring custom native simulation infrastructure.
We present a reinforcement learning framework for training NPCs in a browser-based 3D voxel hide-and-seek game using Proximal Policy Optimization. The system uses a Python training backend communicating with a JavaScript browser environment via WebSockets, allowing NPCs to interact with a real-time 3D world rendered in THREE.js. Our observation space includes position, orientation, velocity, and a 64-ray vision system that detects terrain and other agents within a 32-block radius. Through 143,961 training episodes over 600 iterations, the system achieved stable convergence with NPCs demonstrating adaptive hiding and seeking behaviors. Tournament evaluation across 100 games shows hider agents achieving a 68% win rate, validating emergent strategies including terrain exploitation, systematic coverage, and dynamic evasion tactics.
- PPO-based Multi-Agent Training: Separate policies for seeker and hider roles with asymmetric learning rates and entropy coefficients
- WebSocket Architecture: Python training backend integrated with JavaScript browser environment
- 64-Ray Vision System: Comprehensive environmental perception with 32-block detection radius
- 161-Dimensional Observation Space: Captures position, velocity, visual field, game context, and agent-specific information
- Continuous Action Space: 7-dimensional control including movement, rotation, and jumping (both spaces are sketched in code after this list)
- Stable Convergence: Achieved in approximately 96,000-108,000 training episodes
- Tournament Validation: 100-game evaluation demonstrating emergent strategic behaviors
- Open-Source Implementation: Complete training pipeline and browser environment
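For orientation, the snippet below is a minimal sketch, assuming Gymnasium-style space definitions, of what the 161-dimensional observation and 7-dimensional action spaces look like from the Python side. The actual encoding is defined in `state-encoder.js` and `minecraft_env.py`; the per-feature breakdown and the action bounds shown here are illustrative assumptions, not details taken from the repository.

```python
# Illustrative sketch only: the true encoding lives in
# browser-environment/src/state-encoder.js and python-backend/minecraft_env.py.
import numpy as np
from gymnasium import spaces

# 161-dimensional observation: position, orientation, velocity, 64-ray vision
# readings, game context, and agent-specific information (exact per-feature
# breakdown not reproduced here).
OBSERVATION_SPACE = spaces.Box(low=-np.inf, high=np.inf, shape=(161,), dtype=np.float32)

# 7-dimensional continuous action: movement, rotation, and jumping,
# assumed here to be normalized to [-1, 1].
ACTION_SPACE = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)
```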
- CPU: Multi-core processor (16+ cores recommended)
- GPU: NVIDIA GPU with CUDA support
- RAM: 32 GB minimum
- Storage: SSD with at least 50 GB free space
- Python 3.10
- Node.js 16+ (for development)
- Modern web browser (Chrome/Chromium recommended)
- Docker (optional, for containerized training)
git clone https://github.com/pstepanovum/browser-marl-hideseek.git
cd browser-marl-hideseek

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

cd browser-environment
npm install
npm run build

The system consists of three main components:

Browser Environment:
- THREE.js-based 3D voxel world
- Physics simulation at 60 FPS
- 64-ray vision system for agent perception (see the illustrative sketch after this list)
- Procedurally generated terrain with fixed seed
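The vision system itself is implemented in JavaScript (npc-vision-system.js); the Python snippet below is only an illustrative sketch of the underlying idea: casting 64 rays from the agent and truncating each reading at the 32-block detection radius. The even horizontal ray spacing and the [0, 1] normalization are assumptions, not details taken from the repository.

```python
import numpy as np

NUM_RAYS = 64          # rays cast per observation
MAX_DISTANCE = 32.0    # detection radius in blocks

def ray_directions(yaw: float) -> np.ndarray:
    """Unit direction vectors for 64 rays, assumed evenly spaced around the agent's yaw."""
    angles = yaw + np.linspace(0.0, 2.0 * np.pi, NUM_RAYS, endpoint=False)
    return np.stack([np.cos(angles), np.zeros(NUM_RAYS), np.sin(angles)], axis=1)

def encode_vision(hit_distances: np.ndarray) -> np.ndarray:
    """Clip raw ray-cast distances to the detection radius and scale to [0, 1]."""
    return np.clip(hit_distances, 0.0, MAX_DISTANCE) / MAX_DISTANCE
```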
Training Backend:
- Ray RLlib 2.50.0 PPO implementation
- PyTorch 2.0.1 neural networks
- Dual-policy configuration (seeker/hider)
- Generalized Advantage Estimation
WebSocket Bridge:
- Bidirectional communication between the Python training backend and the browser environment
- JSON-based message protocol (a minimal server sketch follows this list)
- Asynchronous message handling
- Minimal latency overhead
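The authoritative protocol is implemented in websocket_server.py and ppo-training-bridge.js; the sketch below shows, in hypothetical form, how an asynchronous JSON dispatch loop can be written with the Python websockets package. Message field names such as "type", "observation", and "action" are placeholders, not the repository's real schema, and the single-argument handler assumes a recent websockets release.

```python
# Hypothetical sketch of the Python side of the bridge; the real protocol
# lives in python-backend/websocket_server.py and src/ppo-training-bridge.js.
import asyncio
import json
import websockets

async def handle_client(ws):
    async for raw in ws:                       # asynchronous message handling
        msg = json.loads(raw)                  # JSON-based protocol
        if msg.get("type") == "observation":   # placeholder message type
            # ...hand the observation to the training loop, get an action back...
            action = [0.0] * 7                 # placeholder 7-dim action
            await ws.send(json.dumps({"type": "action", "action": action}))

async def main():
    async with websockets.serve(handle_client, "localhost", 8765):
        await asyncio.Future()                 # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```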
Key hyperparameters used in training (collected into a configuration sketch after this list):
- Learning rate: 3×10⁻⁴
- Discount factor (gamma): 0.99
- GAE lambda: 0.95
- Clip parameter: 0.2
- Entropy coefficient: 0.001 (seeker), 0.01 (hider)
- Train batch size: 57,600 timesteps
- Minibatch size: 512
- Training epochs: 10 per iteration
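For reference, these values map onto RLlib's PPOConfig builder roughly as sketched below. This is illustrative only: the authoritative settings live in config/default.yaml and train_ppo.py, argument names vary across RLlib releases and API stacks, and the environment name and policy-mapping rule shown here are assumptions.

```python
# Illustrative only; the real configuration lives in python-backend/config/default.yaml.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("HideSeekEnv")       # placeholder name for the WebSocket-backed env
    .training(
        lr=3e-4,                      # learning rate
        gamma=0.99,                   # discount factor
        lambda_=0.95,                 # GAE lambda
        clip_param=0.2,               # PPO clip parameter
        train_batch_size=57_600,      # timesteps per training batch
        minibatch_size=512,           # minibatch size
        num_epochs=10,                # training epochs per iteration
        entropy_coeff=0.01,           # hider value; the seeker's 0.001 would be a
                                      # per-policy override (mechanism varies by version)
    )
    .multi_agent(
        policies={"seeker", "hider"},  # dual-policy configuration
        policy_mapping_fn=lambda agent_id, *args, **kwargs: (
            "seeker" if "seeker" in str(agent_id) else "hider"
        ),
    )
)
```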
# Start browser environment
cd browser-environment
npm run serve
# In separate terminal, start training
cd python-backend
python train_ppo.py --config config/default.yaml

Training metrics are logged to:
- checkpoints/: Model checkpoints, saved every 10 iterations (see the loading sketch below)
- logs/: Training metrics and episode statistics
- plots/: Visualization of training progress
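A saved checkpoint can typically be restored for evaluation as sketched below, assuming the files in checkpoints/ are standard RLlib Algorithm checkpoints and the classic API stack is in use; the path and observation are placeholders, and the repository's own tournament evaluation may work differently.

```python
# Illustrative sketch; the tournament evaluation used in the paper may differ.
import numpy as np
from ray.rllib.algorithms.algorithm import Algorithm

# Placeholder path: point this at one of the checkpoints saved every 10 iterations.
algo = Algorithm.from_checkpoint("checkpoints/iteration_600")

# Query an action from the hider policy for a single 161-dim observation
# (classic API stack; the new RLlib API stack exposes RLModules instead).
obs = np.zeros(161, dtype=np.float32)
action = algo.compute_single_action(obs, policy_id="hider", explore=False)
print(action)
```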
- Stable convergence achieved by iteration 450 (108,000 episodes)
- Seeker policy: KL divergence 0.0097, entropy 10.37
- Hider policy: KL divergence 0.0112, entropy 22.34
- Average episode length: 237.5 steps (98.9% of maximum)
100-game tournament results:
- Hider win rate: 68.0%
- Seeker win rate: 32.0%
- Average game duration (hider wins): 240.4 steps
- Average game duration (seeker wins): 209.0 steps
Hider Behaviors:
- Terrain utilization (35% of games)
- Edge positioning (24% of games)
- Dynamic evasion (18% of games)
- Separation strategy (41% of games)
Seeker Behaviors:
- Systematic coverage (52% of games)
- High-ground scanning (28% of games)
- Rapid pursuit (32% of successful catches)
- Corner checking (19% of games)
.
├── python-backend/
│   ├── train_ppo.py                # Main training script
│   ├── minecraft_env.py            # Gym environment wrapper
│   ├── websocket_server.py         # WebSocket communication
│   ├── metrics_tracker.py          # Training metrics logging
│   └── config/                     # Configuration files
├── browser-environment/
│   ├── src/
│   │   ├── ppo-training-bridge.js  # WebSocket client
│   │   ├── state-encoder.js        # Observation encoding
│   │   ├── reward-system.js        # Reward calculation
│   │   ├── npc-vision-system.js    # Ray-casting vision
│   │   └── npc-physics.js          # Agent physics
│   ├── public/                     # Static assets
│   └── package.json
├── checkpoints/                    # Trained model checkpoints
├── paper/                          # Research paper (LaTeX)
├── requirements.txt                # Python dependencies
├── LICENSE                         # MIT License
└── README.md
If you use this code in your research, please cite:
@article{stepanov2024marl,
  title={Multi-Agent Reinforcement Learning for Hide-and-Seek in Browser-Based Voxel Environments},
  author={Stepanov, Pavel},
  journal={University of Miami, Department of Computer Science},
  year={2024}
}

- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Liang, E., Liaw, R., Nishihara, R., Moritz, P., Fox, R., Goldberg, K., ... & Stoica, I. (2018). RLlib: Abstractions for distributed reinforcement learning. In International Conference on Machine Learning (pp. 3053-3062).
- Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., & Mordatch, I. (2019). Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528.
- OpenAI. (2019). Emergent complexity and zero-shot transfer via unsupervised environment design. arXiv preprint arXiv:1901.01753.
- Smilkov, D., Thorat, N., Assogba, Y., Yuan, A., Kreeger, N., Yu, P., ... & Wattenberg, M. (2019). TensorFlow.js: Machine learning for the web and beyond. Proceedings of Machine Learning and Systems, 1, 309-321.
This project is licensed under the MIT License - see the LICENSE file for details.
Pavel Stepanov
Department of Computer Science
University of Miami
Email: pas273@miami.edu
This research was conducted as part of CSC411 at the University of Miami. Special thanks to the Department of Computer Science for providing computational resources and support.
For questions about the implementation, training process, or to report issues:
- Open an issue in the GitHub repository
- Refer to the detailed technical documentation in the paper/ directory
- Contact the author via email
Planned extensions include:
- Curriculum learning with progressive terrain complexity
- Block modification capabilities (placement/removal)
- Human-AI interaction modes
- Generalization studies across varied terrain
- Alternative RL algorithms and policy architectures (LSTM policies, MAPPO, SAC)
- Enhanced perception systems with attention mechanisms