TFPI: Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

arXiv
¹ Hunyuan LLM Department, Tencent
² The Hong Kong University of Science and Technology
³ The University of Hong Kong

Overview

Thinking-Free Policy Initialization (TFPI) is a simple yet effective adaptation to Reinforcement Learning with Verifiable Reward (RLVR) that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkingFree operation, which explicitly discards the thinking content via a direct append, to reduce token usage during inference. Training with ThinkingFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks show that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI alone, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench with extremely low training compute.
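
The ThinkingFree operation is easy to reproduce outside the training code. Below is a minimal sketch (not the repository's implementation) of constructing a thinking-free input by directly appending an empty think block to the assistant turn, assuming a Qwen3/R1-style chat template with <|im_start|>/<|im_end|> special tokens and <think></think> reasoning tags:

def thinking_free(question: str) -> str:
    # Build a chat prompt and directly append an empty <think></think> block,
    # so the model skips the thinking content and answers immediately.
    return (
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n<think>\n\n</think>\n\n"
    )

print(thinking_free("What is 17 * 24?"))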

📝 News

  • [2025/12/22] We released the code.
  • [2025/11/7] We released the model checkpoints.
  • [2025/9/30] We released the paper!

🚀 Quick Start

Installation

1. Environment setup

conda create -n TFPI python=3.10 -y
conda activate TFPI

2. Requirements installation

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5.post1
pip install -e .
pip install vertexai
pip install sentence_transformers
pip install flash-attn==2.7.4.post1 --no-build-isolation
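
As a quick sanity check (a sketch, not part of the repository), the pinned packages can be verified from Python:

import torch
import vllm
import flash_attn

print("torch:", torch.__version__)              # expect 2.6.0
print("cuda available:", torch.cuda.is_available())
print("vllm:", vllm.__version__)                # expect 0.8.5.post1
print("flash-attn:", flash_attn.__version__)    # expect 2.7.4.post1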

Run Training

The training dataset consists of the training prompts from Polaris-53K.

First, download the training data and convert its format using the following Python script:

python scripts/download_train.py

The training data is saved to ./data/train/tfpi-polaris53k.parquet.
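
To sanity-check the converted file, a short snippet like the following can be used (a sketch; the exact column names depend on what scripts/download_train.py writes and are an assumption here):

import pandas as pd

df = pd.read_parquet("./data/train/tfpi-polaris53k.parquet")
print(len(df), "training prompts")
print(df.columns.tolist())  # column names vary with the conversion script
print(df.iloc[0])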

Next, adapt the training script ./scripts/train/qwen3-4b-tfpi.sh by setting your WandB key, model path, and dataset path.

Finally, run the following commands on the master node:

bash ./scripts/ray_start.sh # start ray
bash ./scripts/train/qwen3-4b-tfpi.sh # submit training

Run Evaluation

First, download the evaluation datasets using

hf download xx18/TFPI-EVA --repo-type=dataset --local-dir ./data/eval

All test datasets are downloaded to the folder data/eval.

For evaluation, run:

bash ./scripts/ray_start.sh # start ray, use pssh to run on multiple nodes if necessary
bash scripts/eval/start_generate.sh

The resulting metrics and evaluation outputs will be saved under the folder your_model_path/eval_results.
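
To quickly print the saved metrics, something like the following can be used (a sketch; the file layout and JSON format under eval_results are assumptions and may need adjusting):

import json
from pathlib import Path

eval_dir = Path("your_model_path/eval_results")  # replace with your model path
for metrics_file in sorted(eval_dir.rglob("*.json")):
    with open(metrics_file) as f:
        print(metrics_file, json.load(f))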

For IFEval, please refer to the official IFEval repository for evaluation.

🤗 Datasets and Models

We are open-sourcing our complete code and training details for the research community. All resulting checkpoints can be found in the TFPI Collection.

| Name | Link | Remarks |
| --- | --- | --- |
| Evaluation sets | TFPI-EVA | All evaluation datasets used in the TFPI paper, including AIME24, AIME25, BeyondAIME, LiveCodeBench, GPQA, and IFEval |
| Training set | Polaris-53K | - |
| 1.5B TFPI Stage 1 | TFPI-DeepSeek-Qwen-1.5B-Stage1 | Results in Table 1; Training Response Length 2048 |
| 1.5B TFPI Stage 2 | TFPI-DeepSeek-Qwen-1.5B-Stage2 | Results in Table 1; Training Response Length 4096 |
| 1.5B TFPI Stage 3 | TFPI-DeepSeek-Qwen-1.5B-Stage3 | Results in Table 1; Training Response Length 8192 |
| 1.5B TFPI Stage 3 + DAPO | TFPI-DeepSeek-Qwen-1.5B-Stage3_then_RL | Results in Table 7; Training Response Length 16K |
| 1.5B Direct RL checkpoint 1 | DirectRL_DeepSeek-Qwen-1.5B_baseline1 | Results in Table 1; Training Response Length 16K; Training Time = 3 stages of TFPI |
| 1.5B Direct RL checkpoint 2 | DirectRL_DeepSeek-Qwen-1.5B_baseline2 | Results in Table 7; Training Response Length 16K; Training Time = "TFPI + RL" |
| Qwen3-4B TFPI Stage 1 | TFPI-Qwen3-4B-Stage1 | Results in Table 1; Training Response Length 4096 |
| Qwen3-4B TFPI Stage 2 | TFPI-Qwen3-4B-Stage2 | Results in Table 1; Training Response Length 8192 |
| Qwen3-4B TFPI Stage 3 | TFPI-Qwen3-4B-Stage3 | Results in Table 1; Training Response Length 16K |
| Qwen3-4B TFPI Stage 3 + DAPO | TFPI-Qwen3-4B-Stage3_then_RL | Results in Table 2; Training Response Length 32K |
| Qwen3-4B Direct RL checkpoint 1 | DirectRL_Qwen3-4B_baseline1 | Results in Table 1; Training Response Length 32K; Training Time = 3 stages of TFPI |
| Qwen3-4B Direct RL checkpoint 2 | DirectRL_Qwen3-4B_baseline2 | Results in Table 2; Training Response Length 32K; Training Time = "TFPI + RL" |
| Qwen3-4B-Thinking-2507 Stage 3 | TFPI-Qwen3-4B-Thinking-2507-Stage3 | Results in Table 2; Training Response Length 16K |
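
The released checkpoints should be loadable like any other Hugging Face causal-LM checkpoint; the snippet below is a sketch, and the repo id is illustrative, so substitute the exact <org>/<name> from the TFPI Collection:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TFPI-Qwen3-4B-Stage3"  # illustrative; use the full repo id from the TFPI Collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))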

🤝 Acknowledgement

We are deeply grateful for the following GitHub repositories, as their valuable code and efforts have been incredibly helpful:

✏️ Citation


If you find TFPI useful for your research and applications, please cite using this BibTeX:

@article{xu2025tfpi,
  title={Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners},
  author={Xu, Xin and AI, Cliveb and Yang, Kai and Chen, Tianhao and Wang, Yang and Yang, Saiyong and Yang, Can},
  journal={arXiv preprint arXiv:2509.26226},
  year={2025}
}
