You can see the live demo here.
- Quickstart
- Introduction
- Physics Simulation Engines
- Environment
- Algorithms
- Run locally
Explore the project easily and quickly through the following colab notebooks:
Grasp: Pick-and-place with a robotic hand - this demo notebook compares the first three algorithms and trains agents on the Grasp environment in Brax. At the end, it also shows a trained PPO agent interacting with the environment.
Step-by-step training with PPO - this notebook shows step-by-step training of a PPO agent on the Grasp environment in Brax.
The field of robotics has seen incredible advancements in recent years, with the development of increasingly sophisticated machines capable of performing a wide range of tasks. One area of particular interest is the ability for robots to manipulate objects in their environment, known as grasping. In this project, we have chosen to focus on a specific grasping task - training a robotic hand to pick up a moving ball object and place it in a specific target location using the Brax physics simulation engine.
Grasp: a robotic hand that picks up a moving ball and moves it to a specific target
The reason for choosing this project is twofold. Firstly, the ability for robots to grasp and manipulate objects is a fundamental skill that is crucial for many real-world applications, such as manufacturing, logistics, and service industries. Secondly, the use of a physics simulation engine allows us to train our robotic hand in a realistic and controlled environment, without the need for expensive hardware and the associated costs and safety concerns.
Reinforcement learning is a powerful tool for training robots to perform complex tasks, as it allows the robot to learn through trial and error. In this project, we will be using reinforcement learning techniques to train our robotic hand, and we hope to demonstrate the effectiveness of this approach in solving the grasping task.
The use of a physics simulation engine is essential for training a robotic hand to perform the grasping task, as it allows us to simulate the real-world physical interactions between the robot and the ball. Without a physics simulation engine, it would be difficult to accurately model the dynamics of the task, including the forces and torques required for the robotic hand to pick up the ball and move it to the target location.
In this project, we explored several different physics simulation engines, including:
Each of these engines has its own strengths and weaknesses, and we carefully considered the trade-offs between them before making a final decision.
Ultimately, we chose to use Brax due to its highly scalable and parallelizable architecture, which makes it well-suited for accelerated hardware (XLA backends such as GPUs and TPUs). This allows us to simulate the grasping task at a high level of realism and detail, while also taking advantage of the increased computational power of modern hardware to speed up the training process.
The grasping environment provided by Brax is a simple pick-and-place task, where a 4-fingered claw hand must pick up and move a ball to a target location. The environment is designed to simulate the physical interactions between the robotic hand and the ball, including the forces and torques required for the hand to grasp the ball and move it to the target location.
The hand picks up the ball and carries it to a series of red targets. Once the ball gets close to a red target, the target respawns at a different random location.
In the environment, the robotic hand is represented by a 4-fingered claw, which is capable of opening and closing to grasp the ball. The ball is placed in a random location at the beginning of each episode, and the target location is also randomly chosen. The goal of the robotic hand is to move the ball to the target location as quickly and efficiently as possible. For more details, check 4.2.2.
The environment observes three main bodies: the Hand, the Object, and the Target. The agent uses these observations to learn how to control the robotic hand and move the object to the target location.
- The Hand observation includes information about the state of the robotic hand, such as the position and orientation of the fingers, the joint angles, and the forces and torques applied to the hand. This information is used by the agent to control the hand and pick up the object.
- The Object observation includes information about the state of the object, such as its position, velocity, and orientation. This information is used by the agent to track the object and move it to the target location.
- The Target observation includes information about the target location, such as its position and orientation. This information is used by the agent to navigate the hand and the object to the target location.
When the object reaches the target location, the agent is rewarded. The agent is also given a penalty if the object falls or if the hand collides with any obstacle. The agent's goal is to maximize the reward, which means reaching the target location as quickly and efficiently as possible.
Overall, the observations provided by the Grasp environment are designed to give the agent the information it needs to learn how to control the robotic hand and move the object to the target location. The combination of the Hand, Object, and Target observations allows the agent to learn from the environment and improve its performance over time.
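As a rough illustration of how the three observation groups could be combined into a single input vector for the agent, here is a minimal sketch. The group names, field values, and sizes are hypothetical and are not taken from the Brax Grasp implementation.

```python
from typing import Dict, List


def flatten_observation(obs: Dict[str, List[float]]) -> List[float]:
    """Concatenate the Hand, Object, and Target groups in a fixed order."""
    order = ("hand", "object", "target")  # hypothetical group names
    flat: List[float] = []
    for key in order:
        flat.extend(obs[key])
    return flat


# Hypothetical values: three numbers per group, purely for illustration.
example_obs = {
    "hand": [0.0, 0.1, 0.2],
    "object": [1.0, 0.5, 0.0],
    "target": [2.0, 2.0, 0.0],
}
flat_obs = flatten_observation(example_obs)
```

Keeping the order fixed matters because a policy network expects each input position to carry the same meaning at every step.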
The action has 19 dimensions: the hand's position and the joints' angles. Each dimension is a continuous value normalized to [-1, 1].
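To make the normalization concrete, the sketch below clips a 19-dimensional action to [-1, 1] and maps each dimension linearly onto a physical range. The function name `scale_action` and the per-dimension limits are assumptions for illustration; the real limits come from the environment's configuration.

```python
from typing import List, Sequence, Tuple

ACTION_SIZE = 19  # dimensionality stated in the text


def scale_action(action: Sequence[float],
                 limits: Sequence[Tuple[float, float]]) -> List[float]:
    """Clip each value to [-1, 1], then map it linearly to [low, high]."""
    if len(action) != len(limits):
        raise ValueError("action and limits must have the same length")
    scaled = []
    for a, (low, high) in zip(action, limits):
        a = max(-1.0, min(1.0, a))  # enforce the normalized range
        scaled.append(low + 0.5 * (a + 1.0) * (high - low))
    return scaled


# Hypothetical limits: every dimension ranges over [-0.5, 0.5].
limits = [(-0.5, 0.5)] * ACTION_SIZE
raw_action = [0.0] * ACTION_SIZE
raw_action[0] = 2.0   # out of range, will be clipped to 1.0
raw_action[1] = -1.0  # lower bound of the normalized range
targets = scale_action(raw_action, limits)
```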
The reward function is calculated using the following equation:
where each small step toward completing the task is rewarded, while
$\text{target hit}$ earns the largest reward.
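The idea of rewarding small progress while reserving the largest reward for hitting the target can be sketched as follows. This is a toy illustration, not the environment's reward equation: the progress term, `TARGET_HIT_BONUS`, and `HIT_RADIUS` are all hypothetical.

```python
TARGET_HIT_BONUS = 5.0  # hypothetical weight for the target-hit term
HIT_RADIUS = 0.1        # hypothetical distance that counts as a hit


def shaped_reward(prev_dist: float, dist: float) -> float:
    """Small reward for moving the ball closer, large bonus on a hit."""
    progress = prev_dist - dist  # positive when the ball moved closer
    target_hit = 1.0 if dist < HIT_RADIUS else 0.0
    return progress + TARGET_HIT_BONUS * target_hit


step_reward = shaped_reward(prev_dist=1.0, dist=0.75)  # progress only
hit_reward = shaped_reward(prev_dist=0.2, dist=0.05)   # progress plus bonus
```

Dense terms like the progress term give the agent a learning signal long before it ever reaches the target.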
We use Brax's optimized algorithms: PPO, ES, ARS and SAC.
Proximal Policy Optimization (PPO) is a model-free, on-policy policy gradient reinforcement learning algorithm, developed at OpenAI in 2017. PPO strikes a balance between ease of implementation, sample complexity, and ease of tuning. At each step it computes an update that minimizes the cost function while keeping the deviation from the previous policy relatively small. Roughly speaking, it can be viewed as a clipped version of the A2C algorithm.
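The clipping idea can be shown with a minimal sketch of the PPO clipped surrogate objective. This is a plain-Python illustration on hand-picked numbers, not Brax's implementation; the function name and the default `clip_eps=0.2` are assumptions.

```python
import math
from typing import Sequence


def ppo_clip_objective(new_logp: Sequence[float],
                       old_logp: Sequence[float],
                       advantages: Sequence[float],
                       clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate objective (to be maximized)."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any gain from moving the ratio
        # outside [1 - eps, 1 + eps] in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The new policy doubles the probability of the sampled action.
new_lp, old_lp = [math.log(0.6)], [math.log(0.3)]
gain_positive = ppo_clip_objective(new_lp, old_lp, [1.0])   # capped by the clip
gain_negative = ppo_clip_objective(new_lp, old_lp, [-1.0])  # not capped
```

With a positive advantage the objective is capped at `1 + clip_eps` times the advantage, so the policy gains nothing from moving further; with a negative advantage the unclipped, more pessimistic term is kept.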
Evolution Strategy (ES) is a black-box optimization technique inspired by natural evolution. Random noise vectors are added to the network parameters and evaluated, and the highest-scoring parameter vectors are used to evolve the network. Unlike gradient-based training, it does not backpropagate a loss through the network. ES can be parallelized on an XLA backend (CPU/GPU/TPU) to speed up training.
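A minimal sketch of this perturb, score, and select loop is shown below on a toy fitness function. It uses a simple elite-averaging variant for readability and is not Brax's ES implementation; `fitness`, the population size, the elite count, and `sigma` are all assumptions.

```python
import random
from typing import List


def fitness(params: List[float]) -> float:
    """Toy score, highest (zero) when every parameter equals 3."""
    return -sum((p - 3.0) ** 2 for p in params)


def es_step(params: List[float], rng: random.Random,
            population: int = 50, elite: int = 10,
            sigma: float = 0.1) -> List[float]:
    """Perturb the parameters, keep the best candidates, average them."""
    candidates = []
    for _ in range(population):
        cand = [p + sigma * rng.gauss(0.0, 1.0) for p in params]
        candidates.append((fitness(cand), cand))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    best = [cand for _, cand in candidates[:elite]]
    return [sum(vals) / elite for vals in zip(*best)]


rng = random.Random(0)
params = [0.0, 0.0]
start_fitness = fitness(params)
for _ in range(200):
    params = es_step(params, rng)
final_fitness = fitness(params)
```

Each candidate is evaluated independently of the others, which is what makes this family of methods easy to parallelize.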
Augmented Random Search (ARS) is a random search method for training linear policies for continuous control problems. It operates directly on the policy weights: in each epoch, the agent perturbs its current policy N times and collects 2N rollouts using the perturbed policies. The rewards from these rollouts are used to update the current policy weights, and the process repeats until training completes. The algorithm is known to have high variance, so not all seeds obtain high rewards, but to our knowledge the original ARS work is in many ways state of the art on the continuous control benchmarks it targets.
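The update described above can be sketched on a toy problem as follows. Each of the N directions is evaluated with a positive and a negative perturbation (2N rollouts), and the best directions drive the update, scaled by the standard deviation of their returns. `rollout_return` stands in for a real episode return, and all hyperparameters are assumptions; this is not Brax's ARS code.

```python
import random
import statistics
from typing import List


def rollout_return(weights: List[float]) -> float:
    """Toy stand-in for an episode return, best at weights [1, -2]."""
    target = [1.0, -2.0]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def ars_step(weights: List[float], rng: random.Random, n_dirs: int = 8,
             top_b: int = 4, step_size: float = 0.05,
             noise: float = 0.1) -> List[float]:
    """One ARS update: 2 * n_dirs rollouts, then a weighted step."""
    results = []
    for _ in range(n_dirs):
        delta = [rng.gauss(0.0, 1.0) for _ in weights]
        r_plus = rollout_return([w + noise * d for w, d in zip(weights, delta)])
        r_minus = rollout_return([w - noise * d for w, d in zip(weights, delta)])
        results.append((r_plus, r_minus, delta))
    # Keep the directions whose better rollout scored highest.
    results.sort(key=lambda r: max(r[0], r[1]), reverse=True)
    top = results[:top_b]
    returns = [r for rp, rm, _ in top for r in (rp, rm)]
    sigma_r = statistics.pstdev(returns) or 1.0  # guard against zero spread
    new_weights = list(weights)
    for rp, rm, delta in top:
        for i, d in enumerate(delta):
            new_weights[i] += step_size / (top_b * sigma_r) * (rp - rm) * d
    return new_weights


rng = random.Random(0)
weights = [0.0, 0.0]
start_return = rollout_return(weights)
for _ in range(200):
    weights = ars_step(weights, rng)
final_return = rollout_return(weights)
```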
Soft Actor-Critic (SAC) is an off-policy, model-free reinforcement learning framework. The actor aims to maximize expected reward while also maximizing entropy, that is, to succeed at the task while acting as randomly as possible, which is why it is called soft. SAC is generally more sample-efficient than PPO because, as an off-policy method, it can reuse past experience.
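The entropy-regularized objective can be illustrated with a small sketch. SAC works with continuous actions, but a discrete two-action toy case keeps the arithmetic readable; the temperature `alpha=0.2` and the Q-values are assumptions.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_objective(probs: Sequence[float], q_values: Sequence[float],
                   alpha: float = 0.2) -> float:
    """Expected Q-value plus an entropy bonus weighted by alpha."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)


# Both actions are equally good, so the soft objective favors
# the policy that stays random over the deterministic one.
q_values = [1.0, 1.0]
deterministic = soft_objective([1.0, 0.0], q_values)
uniform = soft_objective([0.5, 0.5], q_values)
```

When two actions are equally valuable, the entropy term breaks the tie in favor of the more exploratory policy.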
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Clone the repository
git clone https://github.com/mohammadzainabbas/Reinforcement-Learning-CS.git
cd Reinforcement-Learning-CS/
- Create a new environment and install all dependencies
First, install mamba, a fast and efficient package manager for conda.
conda install mamba -n base -c conda-forge
Then create a new environment, install all dependencies, and activate it.
mamba env create -n reinforcement_learning -f docs/config/reinforcement_learning_env.yaml
conda activate reinforcement_learning
- Run the code
train_ppo.py - train the reinforcement learning agent using the PPO algorithm:
python src/train_ppo.py
You will get the following output files:
- ppo_training.png - Training progress plot
- result_with_ppo.html - Simulation of the trained agent (in HTML format)
- ppo_params - Trained parameters of the agent
train_sac.py - train the reinforcement learning agent using the SAC algorithm:
python src/train_sac.py
You will get the same output files as with the PPO algorithm.
generate_results.py - generate the results of the trained PPO agent:
python src/generate_results.py
You can see the live output here.
ppo_with_pytorch.py - implementation of PPO algorithm with PyTorch.


