We introduce DARE (dLLM Alignment and Reinforcement Executor), a flexible and efficient supervised fine-tuning (SFT) and reinforcement learning (RL) training framework designed specifically for diffusion large language models (dLLMs). DARE also integrates dLLMs into a comprehensive evaluation platform. It aims to be both flexible and user-friendly, offering:
- Easy extension of diverse RL algorithms for dLLMs
- Easy extension of extra benchmark evaluations for dLLMs
- Easy integration of popular and upcoming dLLM infrastructures and HuggingFace weights
DARE is a work in progress; we plan to support more models and algorithms for training and evaluation. We warmly welcome the research community to collaborate, give feedback, and share suggestions. Let's advance diffusion large language models together!
2026
- [2026-03-16]: Support bgpo and ebpo for LLaDA2.X.
- [2026-03-13]: Fix bugs in SDAR SGLang rollout and dp actor.
- [2026-03-12]: Support sequence parallel for the SDAR family, LLaDA2.0, and LLaDA2.1.
- [2026-03-10]: Support evaluation of SDAR-30B-A3B-Chat with SGLang and lmdeploy.
- [2026-03-09]: Support evaluation of LLaDA2.1-mini with SGLang.
- [2026-03-03]: Support mdpo for LLaDA and Dream.
- [2026-03-02]: Support SGLang for SDAR rl rollout.
- [2026-02-28]: Fix bugs and update features for LLaDA and Dream sequence parallel.
- [2026-02-27]: Support evaluation of SDAR with SGLang.
- [2026-02-26]: Update LLaDA cj-grpo and add Dream cj-grpo.
2025
- [2025-12-28]: Fix bugs and update features in dp_actor_algorithm.
- [2025-12-24]: Support online rl (online weight update of rollout) for SDAR.
- [2025-12-23]: Support vrpo (preference optimization) for Dream.
- [2025-12-16]: Support vrpo (preference optimization) for LLaDA.
- [2025-12-12]: Support sft and peft of SDAR.
- [2025-12-11]: Support evaluation of LLaDAMoE and LLaDA2.0-mini with SGLang.
- [2025-12-08]: Support the coupled-grpo, cj-grpo, and spg algorithms.
- [2025-12-03]: Support sequence parallel to enable longer generation ability for dLLMs.
- [2025-12-01]: Initialized the codebase of DARE (dLLM Alignment and Reinforcement Executor), including training (sft, peft, and rl) and faster evaluation.
- Overview
- Key Features
- Installation and Setup
- Training
- Evaluation
- Performance
- Supported Models
- Supported RL Algorithms
- Contact
- Citation
- Acknowledgments
- Accelerated Inference/Rollout for dLLMs
- Parallelism for dLLMs
- Support sequence parallel for LLaDA and Dream
- Attention Backend
- Support flash_attn, flash_attn_varlen, and flash_attn_with_kvcache backends for LLaDA
- Model Diversity
- Masked diffusion language models (e.g., LLaDA and Dream)
- Block diffusion language models (e.g., SDAR and LLaDA2.0)
- Comprehensive Evaluation for dLLMs
- Integrate faster dLLM evaluation into opencompass
- Upcoming Features
- Support Multi-Modal, Omni, etc.
Our training framework is built on top of verl, providing a robust foundation for supervised fine-tuning and reinforcement learning experiments, and our evaluation framework is built on top of opencompass, providing comprehensive and fast evaluation.
Note
Due to some irreconcilable dependency conflicts between packages, we strongly recommend using two separate virtual environments for training and evaluation, respectively.
Clone the DARE repo:

```bash
git clone https://github.com/yjyddq/DARE
```

Build the training virtual environment:
```bash
# Create and activate environment
conda create -n DARE python=3.10 -y
conda activate DARE

# Install dependencies
cd DARE
pip install -r requirements.txt
pip install flash-attn==2.8.3 --no-build-isolation
# or (recommended) install from a prebuilt wheel:
# wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# You can go to https://flashattn.dev/ to find the compiled FlashAttention .whl that
# matches your Python, PyTorch, and CUDA versions. After downloading the .whl, install it:
pip install flash_attn-2.8.x+cu12xtorch2.x-cp3xx-cp3xx-linux_x86_64.whl  # for example
```
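After either route, you can check that FlashAttention imports cleanly against your torch build (a quick sanity check, assuming both packages are installed):

```python
# Quick check that flash-attn was built against the installed torch/CUDA.
import torch
import flash_attn

print(torch.__version__, torch.version.cuda, flash_attn.__version__)
```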
Build the evaluation virtual environment:

```bash
# Create and activate environment
conda create --name opencompass python=3.10 -y
conda activate opencompass

# Install dependencies
cd DARE/opencompass
pip install -e .

# For HumanEval evaluation, install the additional dependency:
git clone https://github.com/open-compass/human-eval.git
cd human-eval && pip install -e .
cd ..

# For math evaluation, install the additional dependencies:
pip install math_verify latex2sympy2_extended

## Full installation (with support for more datasets)
# pip install "opencompass[full]"

## Environment with model acceleration frameworks
# pip install "opencompass[lmdeploy]"
# or
# pip install lmdeploy==0.10.1
```

After downloading LLaDA-8B-Instruct, replace the source files with our modified versions to enable several key features:
```bash
# Copy modified files to your LLaDA model directory
cp models/xxx/* <path_to_llada_model>/
```

Or you can move the model weights (.safetensors) to models/xxx/:
```bash
# Copy weights to the models/xxx/ directory
cp <path_to_llada_model>/*.safetensors models/xxx/
```

The same applies to Dream, SDAR, etc.
Note
Because the optimizations in the RL pipeline rely on various attention-computation backends, this step is indispensable.
Preprocessed datasets are located under data/preprocessed. Please refer to verl.utils.preprocess to organize your own datasets.
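For orientation, here is a minimal sketch of producing such a parquet file. The field names (data_source, prompt, reward_model) and the input file are illustrative assumptions; verl.utils.preprocess remains the authoritative reference.

```python
# Hedged sketch: convert a raw JSONL task file into a verl-style parquet file.
# The field names ("data_source", "prompt", "reward_model") and the input file
# are illustrative assumptions; verl.utils.preprocess is the real reference.
import json
import os

import pandas as pd

rows = []
with open("raw_math_train.jsonl") as f:  # hypothetical input file
    for line in f:
        ex = json.loads(line)
        rows.append({
            "data_source": "math",  # assumed tag for reward-function dispatch
            "prompt": [{"role": "user", "content": ex["question"]}],
            "reward_model": {"style": "rule", "ground_truth": ex["answer"]},
        })

os.makedirs("data/preprocessed/math", exist_ok=True)
pd.DataFrame(rows).to_parquet("data/preprocessed/math/train.parquet")
```

With the data in place, launch SFT: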
```bash
bash scripts/run_sft.sh  # or scripts/run_sft_peft.sh
```

Alternatively, use or write scripts under recipe/xxx/run_xxx.sh:
```bash
# peft for llada_8b_instruct
bash recipe/run_sft_peft_llada_8b_instruct.sh
# sft for dream_7b_instruct
bash recipe/run_sft_dream_7b_instruct.sh
# peft for sdar_8b_chat
bash recipe/run_sft_peft_sdar_8b_chat.sh
```

Run an example for online RL:

```bash
# online rl for llada_8b_instruct
bash recipe/run_d1_llada_8b_instruct.sh --task math  # use Fast-dLLM for rollout acceleration
# online rl for dream_7b_instruct
bash recipe/run_coupled_grpo_dream_7b_instruct.sh --task math  # use Fast-dLLM for rollout acceleration
# online rl for sdar_8b_chat
bash recipe/run_bgpo_sdar_8b_chat.sh --task math  # use the lmdeploy engine for rollout acceleration
```

Run an example for preference optimization. First download argilla/ultrafeedback-binarized-preferences-cleaned, then run scripts/preprocess_dpo_dataset.sh to save ultrafeedback.parquet under data/preprocessed/dpo/train and data/preprocessed/dpo/test.
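For reference, the preprocessing amounts to something like the sketch below; scripts/preprocess_dpo_dataset.sh is the authoritative version, and the split ratio and output layout here are assumptions.

```python
# Hedged sketch of the DPO data preparation; scripts/preprocess_dpo_dataset.sh
# is authoritative. The split ratio and output layout below are assumptions.
import os

from datasets import load_dataset

# The upstream dataset provides "prompt", "chosen", and "rejected" columns.
ds = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")
split = ds.train_test_split(test_size=0.05, seed=42)  # assumed split ratio

for name in ("train", "test"):
    os.makedirs(f"data/preprocessed/dpo/{name}", exist_ok=True)
    split[name].to_parquet(f"data/preprocessed/dpo/{name}/ultrafeedback.parquet")
```

Then launch preference optimization: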
```bash
# preference optimization for llada_8b_instruct
bash recipe/run_vrpo_llada_8b_instruct.sh --task ultrafeedback
# preference optimization for dream_7b_instruct
bash recipe/run_vrpo_dream_7b_instruct.sh --task ultrafeedback
```

After training, convert FSDP sharded checkpoints to the HuggingFace safetensors format:
```bash
# convert FSDP sharded checkpoints to HuggingFace safetensors format
bash scripts/convert_ckpt_to_hf.sh \
    --model_path models/LLaDA-8B-Instruct \
    --ckpt_path ./ckpts/<project>/<exp_name>/global_step_xxx/actor \
    --output_path ./converted_models/llada_8b_stepxxx \
    --model_name llada \
    --fsdp_strategy fsdp2 \
    --n_gpus 8
```
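As a quick sanity check (a sketch, not part of the pipeline), the converted directory should load like any HuggingFace checkpoint; dLLMs such as LLaDA ship custom modeling code, hence trust_remote_code=True:

```python
# Hedged sanity check: load the converted checkpoint with transformers.
# LLaDA ships custom modeling code, hence trust_remote_code=True.
import torch
from transformers import AutoModel, AutoTokenizer

path = "./converted_models/llada_8b_stepxxx"  # output of the conversion script
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True, torch_dtype=torch.bfloat16)
print(f"loaded {sum(p.numel() for p in model.parameters()):,} parameters")
```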
First, please follow opencompass for benchmark dataset preparation. Then specify the model path in opencompass/opencompass/configs/models/dllm/*. For example, llada_instruct_8b.py:

```python
from opencompass.models import LLaDAModel

models = [
    dict(
        type=LLaDAModel,
        abbr='llada-8b-instruct',
        path='/TO/YOUR/PATH',  # need to modify
        max_out_len=1024,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]
```

Evaluation of LLaDA-8B-Instruct on MMLU with the hf backend:
```bash
bash scripts/eval_llada.sh --task mmlu
```

Evaluation of SDAR-8B-Chat on MMLU with the lmdeploy backend:
```bash
bash scripts/eval_sdar_8b_chat.sh --task mmlu --engine lmdeploy
```

If you want to add more benchmarks, models, or custom datasets, please refer to the Evaluation Guideline; a hedged config sketch follows.
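For instance, a combined eval config might look like the sketch below. The gsm8k import path is an assumption based on recent opencompass releases; the models/dllm path mirrors the config directory mentioned above. Consult the Evaluation Guideline for the supported layout.

```python
# Hedged sketch of a combined opencompass config: pair a dLLM model entry with
# a benchmark. The gsm8k import path is an assumption; check your opencompass
# version for the exact module layout.
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    from opencompass.configs.models.dllm.llada_instruct_8b import models

datasets = gsm8k_datasets
```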
| Model | Params | Training Support | Evaluation Support | Inference Acceleration |
|---|---|---|---|---|
| LLaDA-8B-Base | 8B | sft/rl | ✅ | hf / Fast-dLLM |
| LLaDA-8B-Instruct | 8B | sft/rl | ✅ | hf / Fast-dLLM |
| LLaDA-1.5 | 8B | sft/rl | ✅ | hf / Fast-dLLM |
| Dream-7B-Instruct | 7B | sft/rl | ✅ | hf / Fast-dLLM |
| SDAR-1.7B-Chat | 1.7B | sft/rl | ✅ | lmdeploy / SGLang |
| SDAR-4B-Chat | 4B | sft/rl | ✅ | lmdeploy / SGLang |
| SDAR-8B-Chat | 8B | sft/rl | ✅ | lmdeploy / SGLang |
| SDAR-30B-A3B-Chat | 30B-A3B | sft | ✅ | lmdeploy / SGLang |
| LLaDA2.0-mini | 16B-A1B | sft/rl | ✅ | SGLang |
| LLaDA2.1-mini | 16B-A1B | sft/rl | ✅ | SGLang |
| Algorithm | arXiv | Source Code |
|---|---|---|
| d1 | 2504.12216 | dllm-reasoning/d1 |
| vrpo | 2505.19223 | ML-GSAI/LLaDA-1.5 (closed source) |
| coupled-grpo | 2506.20639 | apple/ml-diffucoder |
| mdpo | 2508.13148 | autonomousvision/mdpo |
| cj-grpo | 2509.23924 | yjyddq/EOSER-ASS-RL |
| spg | 2510.09541 | facebookresearch/SPG |
| bgpo | 2510.11683 | THU-KEG/BGPO |
| ebpo | 2602.08676 | inclusionAI/LLaDA2.X (closed source) |
Evaluation Result Reproduction
| Bench/Model | LLaDA-8B | Dream-7B | SDAR-8B-Chat | SDAR-30B-A3B | LLaDA2.0-mini | LLaDA2.1-mini |
|---|---|---|---|---|---|---|
| MMLU | 65.24 | 66.83 | 77.23 | 79.16 | 72.54 | 69.91 |
| MMLU-Pro | 36.82 | 31.89 | 56.49 | 25.59 | 57.10 | 57.52 |
| Hellaswag | 75.30 | 63.23 | 87.59 | 92.81 | 82.35 | 78.00 |
| ARC-C | 87.80 | 81.36 | 86.78 | 78.98 | 85.76 | 83.39 |
| GSM8k | 79.68 | 83.24 | 91.36 | 92.49 | 88.48 | 86.13 |
| MATH | 41.08 | 48.02 | 78.40 | 68.56 | 81.50 | 84.56 |
| GPQA | 31.82 | 26.77 | 41.40 | 36.36 | 34.34 | 34.34 |
| AIME24 | 2.08 | 0.83 | 13.33 | 13.33 | 16.67 | 26.67 |
| AIME25 | 0.42 | 0.00 | 16.67 | 6.67 | 23.33 | 26.67 |
| OlympiadBench | 9.70 | 12.22 | 24.93 | 32.90 | 38.82 | 40.31 |
| HumanEval | 46.34 | 78.05 | 79.88 | 84.15 | 81.10 | 81.10 |
| MBPP | 38.80 | 56.40 | 71.60 | 52.00 | 64.80 | 62.60 |
Algorithm Comparison (LLaDA-8B-Instruct)
| Bench/Algo | Baseline | d1 | Coupled-GRPO | VRPO | CJ-GRPO | SPG | BGPO |
|---|---|---|---|---|---|---|---|
| Mathematics | |||||||
| GSM8k | 76.5 | 83.7 | 85.3 | 81.9 | 85.6 | 83.5 | 82.3 |
| MATH | 34.6 | 40.6 | 41.0 | 35.8 | 39.2 | 40.6 | 40.0 |
| Coding | |||||||
| HumanEval | 46.9 | 47.6 | 45.1 | 52.4 | 45.1 | 48.8 | 45.1 |
| MBPP | 37.9 | 39.1 | 38.1 | 42.8 | 40.9 | 41.9 | 40.3 |
| Planning | |||||||
| Countdown | 16.8 | 10.7 | 77.9 | 21.5 | 65.2 | 10.1 | 10.0 |
| Sudoku | 26.2 | 31.8 | 21.3 | 29.0 | 25.0 | 27.9 | 42.6 |
Algorithm Comparison (Dream-7B-Instruct)
| Bench/Algo | Baseline | d1 | Coupled-GRPO | CJ-GRPO | SPG | BGPO |
|---|---|---|---|---|---|---|
| Mathematics | ||||||
| GSM8k | 77.2 | 82.5 | 80.3 | 85.7 | 59.4 | 83.9 |
| MATH | 39.6 | 49.7 | 40.4 | 50.7 | 25.2 | 48.9 |
| Coding | ||||||
| HumanEval | 57.9 | 60.7 | 61.6 | 58.5 | 17.7 | 56.7 |
| MBPP | 56.2 | 56.5 | 60.3 | 57.5 | 54.4 | 58.7 |
For any questions or collaboration inquiries, feel free to reach out to Jingyi Yang at yangjingyi946@gmail.com.
We look forward to your participation and contributions.
If you find our work useful, please consider citing:

```bibtex
@misc{yang2025dare,
  title={DARE: dLLM Alignment and Reinforcement Executor},
  author={Yang, Jingyi and Jiang, Yuxian and Hu, Xuhao and Shao, Jing},
  howpublished={\url{https://github.com/yjyddq/DARE}},
  year={2025}
}
@article{yang2025taming,
  title={Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step},
  author={Yang, Jingyi and Chen, Guanxu and Hu, Xuhao and Shao, Jing},
  journal={arXiv preprint arXiv:2509.23924},
  year={2025}
}
```

We thank the open-source community for their wonderful work and valuable contributions:
- Model Sources: GSAI-ML, inclusionAI, Dream-org, JetLM, and huggingface for providing or hosting model weights
- Algorithm: d1, CJ-GRPO, MDPO, Coupled-GRPO, SPG, BGPO, LLaDA2.X
- Training Framework: verl, BGPO, Dream, dFactory
- Inference Acceleration/Engine: Fast-dLLM, lmdeploy, SGLang
- Evaluation Framework: opencompass, LLaDA
- Environment Dependencies: verl, opencompass, flash-attn
