TFPI: Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
The Hong Kong University of Science and Technology
The University of Hong Kong
Thinking-Free Policy Initialization (TFPI) is a simple yet effective adaptation of Reinforcement Learning with Verifiable Reward (RLVR) that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkingFree operation, explicitly discarding the thinking content via a direct append, to reduce token usage during inference. Training with ThinkingFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks show that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI alone, we can train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench with extremely low training compute.
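To give an intuition for the ThinkingFree operation described above, here is a minimal sketch. It assumes a Qwen-style chat template where reasoning is wrapped in `<think>...</think>` tags; the function name `thinking_free` and the exact template strings are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a ThinkingFree-style input adaptation (illustrative only).
# Assumption: the model wraps its reasoning in <think>...</think> tags, so
# appending an empty think block after the user turn encourages it to skip
# the slow-thinking stage and answer directly.

def thinking_free(prompt: str) -> str:
    """Append an empty think block so the model answers without slow thinking."""
    # Template strings below follow a Qwen-style chat format; adjust to your model.
    return (
        "<|im_start|>user\n" + prompt + "<|im_end|>\n"
        "<|im_start|>assistant\n<think>\n\n</think>\n\n"
    )

if __name__ == "__main__":
    print(thinking_free("What is 17 * 24?"))
```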
- [2025/12/22] We released the code.
- [2025/11/7] We released the model checkpoints.
- [2025/9/30] We released the paper!
```bash
conda create -n TFPI python=3.10 -y
conda activate TFPI
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5.post1
pip install -e .
pip install vertexai
pip install sentence_transformers
pip install flash-attn==2.7.4.post1 --no-build-isolation
```

The training dataset consists of the training prompts in Polaris-53K.
First, download the training data and convert it to the required format using the following Python script:
```bash
python scripts/download_train.py
```

The training data is saved in `./data/train/tfpi-polaris53k.parquet`.
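If you want to sanity-check the converted file before training, a quick inspection like the one below can help. This assumes `pandas` and a parquet engine (e.g. `pyarrow`) are available in the environment; the printed columns depend on the actual schema produced by the conversion script and are not assumed here.

```python
# Quick sanity check of the converted training file (optional, illustrative).
import pandas as pd

df = pd.read_parquet("./data/train/tfpi-polaris53k.parquet")
print(df.shape)    # number of training prompts and columns
print(df.columns)  # schema depends on the conversion script
print(df.head(2))  # peek at the first few rows
```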
Next, adapt the training script `./scripts/train/qwen3-4b-tfpi.sh` by setting the WandB key, model path, and dataset path.
Finally, run the following commands at the master node:
```bash
bash ./scripts/ray_start.sh # start ray
bash ./scripts/train/qwen3-4b-tfpi.sh # submit training
```

First, download the evaluation datasets using:

```bash
hf download xx18/TFPI-EVA --repo-type=dataset --local-dir ./data/eval
```

All test datasets are downloaded to the folder `data/eval`.
For evaluation, use:

```bash
bash ./scripts/ray_start.sh # start ray, use pssh to run on multiple nodes if necessary
bash scripts/eval/start_generate.sh
```

The resulting metrics and evaluation outputs will be saved under the folder `your_model_path/eval_results`.
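To inspect those outputs programmatically, a small helper like the one below simply lists whatever files the evaluation run produced under that folder. The directory name follows the path mentioned above; replace `your_model_path` with your actual checkpoint directory.

```python
# List evaluation artifacts under <your_model_path>/eval_results (illustrative helper).
from pathlib import Path

eval_dir = Path("your_model_path") / "eval_results"  # replace with your checkpoint path
for path in sorted(eval_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(eval_dir), f"{path.stat().st_size} bytes")
```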
For IFEval, please refer to the official IFEval evaluation repo.
We are open-sourcing our complete code and training details for the research community. All resulting checkpoints can be found in the TFPI Collection.
| Name | Link | Remarks |
|---|---|---|
| Evaluation Sets | TFPI-EVA | All evaluation datasets used in the TFPI paper, including AIME24, AIME25, BeyondAIME, LiveCodeBench, GPQA, and IFEval |
| Training set | Polaris-53K | - |
| 1.5B TFPI Stage 1 | TFPI-DeepSeek-Qwen-1.5B-Stage1 | Results in Table 1; Training Response Length 2048 |
| 1.5B TFPI Stage 2 | TFPI-DeepSeek-Qwen-1.5B-Stage2 | Results in Table 1; Training Response Length 4096 |
| 1.5B TFPI Stage 3 | TFPI-DeepSeek-Qwen-1.5B-Stage3 | Results in Table 1; Training Response Length 8192 |
| 1.5B TFPI Stage 3 + DAPO | TFPI-DeepSeek-Qwen-1.5B-Stage3_then_RL | Results in Table 7; Training Response Length 16K |
| 1.5B Direct RL checkpoint 1 | DirectRL_DeepSeek-Qwen-1.5B_baseline1 | Results in Table 1; Training Response Length 16K; Training Time = 3 stages of TFPI |
| 1.5B Direct RL checkpoint 2 | DirectRL_DeepSeek-Qwen-1.5B_baseline2 | Results in Table 7; Training Response Length 16K; Training Time = "TFPI+RL" |
| Qwen3-4B TFPI Stage 1 | TFPI-Qwen3-4B-Stage1 | Results in Table 1; Training Response Length 4096 |
| Qwen3-4B TFPI Stage 2 | TFPI-Qwen3-4B-Stage2 | Results in Table 1; Training Response Length 8192 |
| Qwen3-4B TFPI Stage 3 | TFPI-Qwen3-4B-Stage3 | Results in Table 1; Training Response Length 16K |
| Qwen3-4B TFPI Stage 3 + DAPO | TFPI-Qwen3-4B-Stage3_then_RL | Results in Table 2; Training Response Length 32K |
| Qwen3-4B Direct RL checkpoint 1 | DirectRL_Qwen3-4B_baseline1 | Results in Table 1; Training Response Length 32K; Training Time = 3 stages of TFPI |
| Qwen3-4B Direct RL checkpoint 2 | DirectRL_Qwen3-4B_baseline2 | Results in Table 2; Training Response Length 32K; Training Time = "TFPI+RL" |
| Qwen3-4B-Thinking-2507 Stage 3 | TFPI-Qwen3-4B-Thinking-2507-Stage3 | Results in Table 2; Training Response Length 16K |
We are deeply grateful to the following GitHub repositories; their valuable code and efforts have been incredibly helpful:
If you find TFPI useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{xu2025tfpi,
  title={Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners},
  author={Xu, Xin and AI, Cliveb and Yang, Kai and Chen, Tianhao and Wang, Yang and Yang, Saiyong and Yang, Can},
  journal={arXiv preprint arXiv:2509.26226},
  year={2025}
}
```