Tonghe Zhang
Installation | Quick Start | Implementation Details | Add Dataset/Environment | Debug & Known Issues | License | Acknowledgement | Citation
This is the official implementation of "ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning".
If you like our work, it would be wonderful if you gave us a star!
- [2025/11/7] Updated the limitations section.
- [2025/11/5] Updated tips on hyperparameter tuning.
- [2025/11/2] We scaled up ReinFlow to fine-tune VLA models such as $\pi_0$ and $\pi_{0.5}$. The code and checkpoints for the LIBERO environment are available at RLinf-pi0. A technical report including results on LIBERO and ManiSkill/Simpler is available as $\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models (arXiv:2510.25889).
- [2025/09/18] The ReinFlow paper was accepted at NeurIPS 2025.
- [2025/08/18] All training metrics (losses, rewards, etc.) released on WandB to help you reproduce our results.
- [2025/07/30] Fixed the rendering bug in Robomimic. Now supports rendering at 1080p resolution.
- [2025/07/29] Added a tutorial on how to record videos during evaluation to the docs.
- [2025/06/14] Updated the project webpage with a detailed explanation of the algorithm design.
- [2025/05/28] Paper is posted on arXiv!
ReinFlow is a flexible policy gradient framework for fine-tuning flow matching policies at any denoising step.
How does it work?
First, train flow policies with imitation learning (behavior cloning); a minimal sketch of this pretraining objective follows the list below.
Then, fine-tune them with online reinforcement learning using ReinFlow!
Supports:
- 1-Rectified Flow
- Shortcut Models
- Any other policy defined by ODEs (in principle)
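To make the first stage concrete, here is a minimal, hypothetical sketch of the behavior-cloning objective for a 1-Rectified Flow policy. The names (`velocity_net`, `obs`, `expert_action`) are illustrative and do not refer to this repository's actual API; see the docs below for the real training scripts.

```python
import torch
import torch.nn.functional as F

def rectified_flow_bc_loss(velocity_net, obs, expert_action):
    """Behavior-cloning loss for a 1-Rectified Flow policy (illustrative only).

    `velocity_net(obs, x_t, t)` is a hypothetical module that predicts the
    velocity field; it is not this repository's actual API.
    """
    # Start point of the flow: pure Gaussian noise with the action's shape.
    x0 = torch.randn_like(expert_action)
    # Random interpolation time t in [0, 1), one per batch element.
    t = torch.rand(expert_action.shape[0], 1, device=expert_action.device)
    # Straight-line interpolation between noise and the expert action.
    x_t = (1.0 - t) * x0 + t * expert_action
    # For rectified flow, the regression target is the constant velocity
    # along the straight path from x0 to the expert action.
    target_v = expert_action - x0
    pred_v = velocity_net(obs, x_t, t)
    return F.mse_loss(pred_v, target_v)
```

After this pretraining stage, the same velocity network is what ReinFlow fine-tunes with online RL.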
Empirical Results: ReinFlow achieves strong performance across a variety of robotic tasks:
- Legged locomotion (OpenAI Gym)
- State-based manipulation (Franka Kitchen)
- Visual manipulation (Robomimic)
Key Innovation: ReinFlow trains a noise injection network end-to-end (see the sketch below):
- Makes policy probabilities tractable, even with very few denoising steps (e.g., 4, 2, or 1)
- Robust to discretization and Monte Carlo approximation errors
Learn more on our project website or check out the arXiv paper.
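The sketch below illustrates the noise-injection idea under simplifying assumptions (isotropic Gaussian noise, plain Euler integration, hypothetical `velocity_net`/`noise_net` modules). It is not the repository's actual sampler; it only shows why injecting learned noise at each denoising step yields tractable per-step log-probabilities that a policy-gradient method can use.

```python
import torch
from torch.distributions import Normal

def sample_action_with_logprob(velocity_net, noise_net, obs, action_dim, n_steps=4):
    """Few-step denoising rollout with injected Gaussian noise (illustrative only).

    Injecting learned noise at every Euler step turns the deterministic ODE
    rollout into a discrete-time Markov chain with Gaussian transitions, so
    the log-probability of the sampled denoising trajectory is tractable.
    `velocity_net` and `noise_net` are hypothetical modules; `noise_net` is
    assumed to output a strictly positive standard deviation.
    """
    batch = obs.shape[0]
    x = torch.randn(batch, action_dim, device=obs.device)  # start from noise
    dt = 1.0 / n_steps
    logprob = torch.zeros(batch, device=obs.device)
    for k in range(n_steps):
        t = torch.full((batch, 1), k * dt, device=obs.device)
        mean = x + velocity_net(obs, x, t) * dt   # deterministic Euler update
        std = noise_net(obs, x, t)                # learned, state-dependent noise scale
        step_dist = Normal(mean, std)
        x = step_dist.sample()                    # stochastic next iterate
        logprob = logprob + step_dist.log_prob(x).sum(dim=-1)
    return x, logprob  # final action and the trajectory's log-probability
```

In a PPO-style update, rollouts like this are typically collected without gradients and the log-probabilities re-evaluated at update time to form the policy-gradient objective.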
Please follow the steps in installation/reinflow-setup.md.
To fully reproduce our experiments, please refer to ReproduceExps.md.
To download our training data and reproduce the plots in the paper, please refer to ReproduceFigs.md.
Please refer to Implement.md for descriptions of key hyperparameters of FQL, DPPO, and ReinFlow.
Please refer to Custom.md.
Please refer to KnownIssues.md to see how to resolve errors you encounter.
After training flow policies with RL on multiple benchmarks (OpenAI Gym, Franka Kitchen, Robomimic, LIBERO, ManiSkill, MetaWorld) and scaling model size from 3M to 3B parameters, we found the following hyperparameters to be critical to RL's success, especially for visual manipulation with sparse rewards:
- SFT success rate. RL cannot easily train visual manipulation policies from scratch, so optimize your SFT success rate before starting RL. The stronger your SFT checkpoint, the easier RL will be.
- Noise level. When the SFT success rate is low, tune the noise down to [0.04, 0.10] or [0.05, 0.12] to avoid too many erroneous behaviors during early-stage exploration. When the SFT success rate is high, relaxing the noise range to [0.08, 0.16] is usually good practice.
- Entropy coefficient. Turn it off first. If the policy struggles to improve, adding a small coefficient such as 0.005 may help. When the policy is small and the problem is simple (dense reward, low-dimensional input), a larger entropy coefficient can be used; otherwise, be cautious about increasing this constant.
- Critic warmup. The stronger your SFT checkpoint, the more you need a critic warmup. Pick an appropriate critic network architecture and run some rounds of warmup before policy-gradient ascent. Make sure the critic loss decreases smoothly after the warmup phase, and keep a keen eye on the explained variance: it should quickly rise to a high level. Even without warmup, ReinFlow should eventually increase the success rate, but convergence is usually slower.
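As a concrete starting point, the dictionary below collects the values suggested above. The key names and the warmup value are illustrative only and do not correspond to the repository's actual configuration fields; consult Implement.md and the provided configs for the real options.

```python
# Illustrative starting values that follow the tips above. Key names are
# hypothetical and are not the repository's actual config fields.
finetune_hparams = {
    "denoising_steps": 4,               # ReinFlow supports very few steps (4, 2, or 1)
    "noise_std_range": [0.05, 0.12],    # lower noise when the SFT success rate is low
    # "noise_std_range": [0.08, 0.16],  # relax the range when the SFT checkpoint is strong
    "entropy_coef": 0.0,                # start with entropy regularization off ...
    # "entropy_coef": 0.005,            # ... add a small coefficient if the policy plateaus
    "critic_warmup_itr": 5,             # example value: warm up the critic before
                                        # policy-gradient updates, especially for strong SFT
}
```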
Based on community feedback, we have added a limitations section to highlight the shortcomings of our algorithm and note important caveats. We hope this discussion will inspire future research.
- ReinFlow may not be an optimal method to train RL agents from scratch. Our method is designed for fine-tuning purposes, not pre-training.
- Release $\pi_0$ and $\pi_{0.5}$ fine-tuning results.
- Release WandB metrics
- Release docs
- Release checkpoints
- Release codebase
This repository is released under the MIT license. See LICENSE. If you use our code, we would appreciate it if you included the license at the top of your scripts.
This repository was developed from multiple open-source projects. Major references include:
- TorchCFM, Tong et al.: Conditional flow-matching repository.
- Shortcut Models, Frans et al.: One Step Diffusion via Shortcut Models.
- DPPO, Ren et al.: DPPO official implementation.
We also thank our collaborators from the open-source RL infrastructure project RLinf for their generous support, which enabled scaling ReinFlow to models of up to 3 billion parameters across 320 highly randomized visual manipulation environments with thousands of object-scene-task-pose combinations.
For more references, please refer to Acknowledgement.md.
@misc{zhang2025reinflowfinetuningflowmatching,
title={ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning},
author={Tonghe Zhang and Chao Yu and Sichang Su and Yu Wang},
year={2025},
eprint={2505.22094},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2505.22094},
}


