ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

πŸ’ Paper accepted at NeurIPS 2025

Tonghe Zhang$^1$, Chao Yu$^{2,3}$, Sichang Su$^4$, Yu Wang$^2$

$^1$ Carnegie Mellon University $^2$ Tsinghua University $^3$ Beijing Zhongguancun Academy $^4$ University of Texas at Austin


Architecture Diagram

Shortcut Flow Can Shortcut Transport


Installation | Quick Start | Implementation Details | Add Dataset/Environment
Debug & Known Issues | License | Acknowledgement | Citation

This is the official implementation of "ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning".

If you like our work, it would be wonderful if you gave us a star ⭐!

📒 News

  • [2025/11/07] Updated the limitations section.
  • [2025/11/05] Updated tips on hyperparameter tuning.
  • [2025/11/02] 🔥 We scaled up ReinFlow to fine-tune VLA models such as $\pi_0$ and $\pi_{0.5}$.
    The code and checkpoints for the LIBERO environment are available at RLinf-pi0. A technical report including results on LIBERO and ManiSkill/Simpler is available as "$\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models" (arXiv:2510.25889).
  • [2025/09/18] The ReinFlow paper was accepted at NeurIPS 2025.
  • [2025/08/18] All training metrics (losses, rewards, etc.) released on WandB to help you reproduce our results.
  • [2025/07/30] Fixed the rendering bug in Robomimic. Rendering at 1080p resolution is now supported.
  • [2025/07/29] Added a tutorial on recording videos during evaluation to the docs.
  • [2025/06/14] Updated the project webpage with a detailed explanation of the algorithm design.
  • [2025/05/28] Paper posted on arXiv!

🚀 About ReinFlow

ReinFlow is a flexible policy gradient framework for fine-tuning flow matching policies at any denoising step.

How does it work?
👉 First, train flow policies using imitation learning (behavior cloning).
👉 Then, fine-tune them with online reinforcement learning using ReinFlow!

🧩 Supports:

  • ✅ 1-Rectified Flow
  • ✅ Shortcut Models
  • βœ… Any other policy defined by ODEs (in principle)

📈 Empirical Results: ReinFlow achieves strong performance across a variety of robotic tasks:

  • 🦵 Legged Locomotion (OpenAI Gym)
  • ✋ State-based manipulation (Franka Kitchen)
  • 👀 Visual manipulation (Robomimic)

🧠 Key Innovation: ReinFlow trains a noise injection network end-to-end:

  • ✅ Makes policy probabilities tractable, even with very few denoising steps (e.g., 4, 2, or 1); see the sketch below
  • ✅ Robust to discretization and Monte Carlo approximation errors
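
To make the noise-injection idea concrete, here is a minimal sketch of a stochastic denoising chain whose per-step transitions are Gaussian, so the action log-probability is exact and cheap to evaluate. The class name, network architecture, and noise bounds are illustrative assumptions, not the repository's implementation:

```python
import torch
import torch.nn as nn

class NoisyFlowPolicy(nn.Module):
    """Flow-matching policy with a learned noise-injection head (illustrative sketch)."""

    def __init__(self, obs_dim: int, act_dim: int, num_steps: int = 4, hidden: int = 256):
        super().__init__()
        self.num_steps = num_steps
        self.act_dim = act_dim
        # Pretrained velocity field v_theta(s, a_k, t); a hypothetical MLP stand-in.
        self.velocity = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, act_dim),
        )
        # Noise-injection head sigma_phi(s, a_k, t), trained during RL fine-tuning.
        self.noise_head = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, act_dim),
        )

    def sample(self, obs: torch.Tensor):
        """Run the stochastic denoising chain; return the action and its log-probability."""
        batch = obs.shape[0]
        a = torch.randn(batch, self.act_dim)                    # a_0 ~ N(0, I)
        dt = 1.0 / self.num_steps
        log_prob = torch.zeros(batch)
        for k in range(self.num_steps):
            t = torch.full((batch, 1), k * dt)
            inp = torch.cat([obs, a, t], dim=-1)
            mean = a + self.velocity(inp) * dt                  # deterministic Euler step
            std = self.noise_head(inp).clamp(-3.2, -1.8).exp()  # bounded noise scale (assumed range)
            dist = torch.distributions.Normal(mean, std)
            a = dist.rsample()                                  # inject learned Gaussian noise
            log_prob = log_prob + dist.log_prob(a).sum(-1)      # exact per-step Gaussian density
        return a, log_prob
```

Because each denoising step has a closed-form Gaussian density, the summed log-probability can be plugged directly into standard policy-gradient objectives (e.g., PPO-style likelihood ratios), and it remains exact no matter how few denoising steps are used.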

Learn more on our 🔗 project website or check out the arXiv paper.

🚀 Installation

Please follow the steps in installation/reinflow-setup.md.

🚀 Quick Start: Reproduce Our Results

To fully reproduce our experiments, please refer to ReproduceExps.md.

To download our training data and reproduce the plots in the paper, please refer to ReproduceFigs.md.

🚀 Implementation Details

Please refer to Implement.md for descriptions of key hyperparameters of FQL, DPPO, and ReinFlow.

🚀 Adding Your Own Dataset or Environment

Please refer to Custom.md.

🚀 Debug Aid and Known Issues

Please refer to KnownIssues.md to see how to resolve errors you encounter.

🚀 Tips on Hyperparameter Tuning

After training flow policies with RL across multiple benchmarks (OpenAI Gym, Franka Kitchen, Robomimic, LIBERO, ManiSkill, MetaWorld) and scaling model size from 3M to 3B parameters, we have found that the following hyperparameters are critical to RL's success, especially for visual manipulation with sparse rewards:

  • SFT success rate. RL cannot easily train visual manipulation policies from scratch, so optimize your SFT success rate before starting RL. The stronger your SFT checkpoint is, the easier RL will be.
  • Noise level. When the SFT success rate is low, reduce the noise range to [0.04, 0.10] or [0.05, 0.12] to avoid too many erroneous behaviors during early-stage exploration. When the SFT success rate is high, relaxing the noise range to [0.08, 0.16] usually works well.
  • Entropy coefficient. Turn it off first. If the policy struggles to improve, adding a small coefficient such as 0.005 may help. When the policy is small and the task is simple (dense reward, low-dimensional input), a larger entropy coefficient can be used; otherwise, be cautious about increasing this constant.
  • Critic warmup. The stronger your SFT checkpoint is, the more you benefit from critic warmup. Pick an appropriate critic architecture and run a few rounds of warmup before policy-gradient updates. Aim for a critic loss that decreases smoothly after the warmup phase, and keep a close eye on the explained variance, which should quickly rise to a high level. Even without warmup, ReinFlow should eventually improve the success rate, but convergence is usually slower. An illustrative starting configuration is sketched after this list.
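
As a hedged starting point, the snippet below translates these tips into plain Python values; the names and numbers are illustrative assumptions and do not correspond to the repository's actual configuration files:

```python
# Illustrative starting values distilled from the tips above. The dictionary keys
# and numbers are assumptions for discussion, not the repository's config schema.
reinflow_finetune_hparams = {
    # Std range of the injected noise: narrower for a weak SFT checkpoint,
    # wider (e.g. (0.08, 0.16)) for a strong one.
    "noise_std_range": (0.05, 0.12),
    # Start without an entropy bonus; try 0.005 only if improvement stalls.
    "entropy_coef": 0.0,
    # Warm up the critic before policy-gradient updates; use more warmup for
    # stronger SFT checkpoints, and watch critic loss and explained variance.
    "critic_warmup_iters": 5,
    # Few denoising steps keep sampling and log-probabilities cheap and exact.
    "num_denoising_steps": 4,
}
```

Treat these as defaults to adjust, guided by the critic's explained variance and the early success-rate curves.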

🚀 Limitations and Caveats

Based on community feedback, we have added a limitations section to highlight the shortcomings of our algorithm and note important caveats. We hope this discussion will inspire future research.

  • ReinFlow may not be an optimal method for training RL agents from scratch; it is designed for fine-tuning, not pre-training.

⭐ Todo

  • Release pi0, pi0.5 fine-tuning results
  • Release WandB metrics
  • Release docs
  • Release checkpoints
  • Release codebase

License

This repository is released under the MIT license; see LICENSE. If you use our code, we would appreciate it if you included the license at the beginning of your scripts.

Acknowledgement

This repository was developed on top of multiple open-source projects.

We also thank our collaborators from the open-source RL infrastructure project RLinf for their generous support, which enabled scaling ReinFlow to models of up to 3 billion parameters across 320 highly randomized visual manipulation environments with thousands of object-scene-task-pose combinations.

For more references, please refer to Acknowledgement.md.

Cite our work

@misc{zhang2025reinflowfinetuningflowmatching,
    title={ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning},
    author={Tonghe Zhang and Chao Yu and Sichang Su and Yu Wang},
    year={2025},
    eprint={2505.22094},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2505.22094},
}

Star History

Star History Chart
