TFPI: Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
The Hong Kong University of Science and Technology
The University of Hong Kong
Thinking-Free Policy Initialization (TFPI) is a simple yet effective adaptation of Reinforcement Learning with Verifiable Reward (RLVR) that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkingFree operation, explicitly discarding the thinking content via a direct append, to reduce token usage during inference. Training with ThinkingFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks show that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI alone, we can train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench with extremely low training compute.
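To give an intuition for the ThinkingFree operation described above, here is a minimal sketch. It assumes a Qwen-style chat template where reasoning is wrapped in `<think>...</think>` tags; the function name `thinking_free` and the exact template strings are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a ThinkingFree-style input adaptation (illustrative only).
# Assumption: the model wraps its reasoning in <think>...</think> tags, so
# appending an empty think block after the user turn encourages it to skip
# the slow-thinking stage and answer directly.

def thinking_free(prompt: str) -> str:
    """Append an empty think block so the model answers without slow thinking."""
    # Template strings below follow a Qwen-style chat format; adjust to your model.
    return (
        "<|im_start|>user\n" + prompt + "<|im_end|>\n"
        "<|im_start|>assistant\n<think>\n\n</think>\n\n"
    )

if __name__ == "__main__":
    print(thinking_free("What is 17 * 24?"))
```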
- [2025/12/22] We released the code.
- [2025/11/7] We released the model checkpoints.
- [2025/9/30] We released the paper!
```bash
conda create -n TFPI python=3.10 -y
conda activate TFPI
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5.post1
pip install -e .
pip install vertexai
pip install sentence_transformers
pip install flash-attn==2.7.4.post1 --no-build-isolation
```

The training dataset consists of the training prompts in Polaris-53K.
First, download the training data and convert it to the required format using the following Python script:
```bash
python scripts/download_train.py
```

The training data is saved in `./data/train/tfpi-polaris53k.parquet`.
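If you want to sanity-check the converted file before training, a quick inspection like the one below can help. This assumes `pandas` and a parquet engine (e.g. `pyarrow`) are available in the environment; the printed columns depend on the actual schema produced by the conversion script and are not assumed here.

```python
# Quick sanity check of the converted training file (optional, illustrative).
import pandas as pd

df = pd.read_parquet("./data/train/tfpi-polaris53k.parquet")
print(df.shape)    # number of training prompts and columns
print(df.columns)  # schema depends on the conversion script
print(df.head(2))  # peek at the first few rows
```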
Next, adapt the training script `./scripts/train/qwen3-4b-tfpi.sh` by setting the WandB key, model path, and dataset path.
Finally, run the following commands at the master node:
```bash
bash ./scripts/ray_start.sh # start ray
bash ./scripts/train/qwen3-4b-tfpi.sh # submit training
```

First, download the evaluation datasets using:

```bash
hf download xx18/TFPI-EVA --repo-type=dataset --local-dir ./data/eval
```

All test datasets are downloaded to the folder `data/eval`.
For evaluation, use:

```bash
bash ./scripts/ray_start.sh # start ray, use pssh to run on multiple nodes if necessary
bash scripts/eval/start_generate.sh
```

The resulting metrics and evaluation outputs will be saved under the folder `your_model_path/eval_results`.
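To inspect those outputs programmatically, a small helper like the one below simply lists whatever files the evaluation run produced under that folder. The directory name follows the path mentioned above; replace `your_model_path` with your actual checkpoint directory.

```python
# List evaluation artifacts under <your_model_path>/eval_results (illustrative helper).
from pathlib import Path

eval_dir = Path("your_model_path") / "eval_results"  # replace with your checkpoint path
for path in sorted(eval_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(eval_dir), f"{path.stat().st_size} bytes")
```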
For IFEval, please refer to the official IFEval evaluation repo.
We are open-sourcing our complete code and training details for the research community. All resulting checkpoints can be found in the TFPI Collection.
| Name | Link | Remarks |
|---|---|---|
| Evaluation Sets | TFPI-EVA | All evaluation datasets used in the TFPI paper, including AIME24, AIME25, BeyondAIME, LiveCodeBench, GPQA, and IFEval |
| Training set | Polaris-53K | - |
| 1.5B TFPI Stage 1 | TFPI-DeepSeek-Qwen-1.5B-Stage1 | Results in Table 1; Training Response Length 2048 |
| 1.5B TFPI Stage 2 | TFPI-DeepSeek-Qwen-1.5B-Stage2 | Results in Table 1; Training Response Length 4096 |
| 1.5B TFPI Stage 3 | TFPI-DeepSeek-Qwen-1.5B-Stage3 | Results in Table 1; Training Response Length 8192 |
| 1.5B TFPI Stage 3 + DAPO | TFPI-DeepSeek-Qwen-1.5B-Stage3_then_RL | Results in Table 7; Training Response Length 16K |
| 1.5B Direct RL checkpoint 1 | DirectRL_DeepSeek-Qwen-1.5B_baseline1 | Results in Table 1; Training Response Length 16K; Training Time = 3 stages of TFPI |
| 1.5B Direct RL checkpoint 2 | DirectRL_DeepSeek-Qwen-1.5B_baseline2 | Results in Table 7; Training Response Length 16K; Training Time = "TFPI+RL" |
| Qwen3-4B TFPI Stage 1 | TFPI-Qwen3-4B-Stage1 | Results in Table 1; Training Response Length 4096 |
| Qwen3-4B TFPI Stage 2 | TFPI-Qwen3-4B-Stage2 | Results in Table 1; Training Response Length 8192 |
| Qwen3-4B TFPI Stage 3 | TFPI-Qwen3-4B-Stage3 | Results in Table 1; Training Response Length 16K |
| Qwen3-4B TFPI Stage 3 + DAPO | TFPI-Qwen3-4B-Stage3_then_RL | Results in Table 2; Training Response Length 32K |
| Qwen3-4B Direct RL checkpoint 1 | DirectRL_Qwen3-4B_baseline1 | Results in Table 1; Training Response Length 32K; Training Time = 3 stages of TFPI |
| Qwen3-4B Direct RL checkpoint 2 | DirectRL_Qwen3-4B_baseline2 | Results in Table 2; Training Response Length 32K; Training Time = "TFPI+RL" |
| Qwen3-4B-Thinking-2507 Stage 3 | TFPI-Qwen3-4B-Thinking-2507-Stage3 | Results in Table 2; Training Response Length 16K |
We are deeply grateful to the following GitHub repositories; their valuable code and efforts have been incredibly helpful:
If you find TFPI useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{xu2025tfpi,
  title={Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners},
  author={Xu, Xin and AI, Cliveb and Yang, Kai and Chen, Tianhao and Wang, Yang and Yang, Saiyong and Yang, Can},
  journal={arXiv preprint arXiv:2509.26226},
  year={2025}
}
```