Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware “think-with-images” framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image-region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct its attention and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.
```bash
git clone https://github.com/bytedance/SIFThinker.git
cd SIFThinker/GRPO-SIF
conda create -n SIFThinker python=3.10 -y && conda activate SIFThinker
bash setup.sh
```

If the installed trl version conflicts with our repository, replace it with the local copy by running:

```bash
cp -rf ../package/trl /home/tiger/anaconda3/envs/SIFThinker/lib/python3.10/site-packages/
```
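As an optional sanity check (not part of the official setup), you can confirm which trl build Python actually resolves after the copy:

```python
# Optional check: confirm the trl version and install path that Python picks up.
import trl

print(trl.__version__)  # installed trl version
print(trl.__file__)     # should point into the SIFThinker conda environment's site-packages
```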
Some users may also need to install:

```bash
pip install httpx==0.23.0
apt install libgl1-mesa-glx
```

The dataset is available here.
The dataset, named SIF-50K, can be constructed with the Data Generation scripts below. It includes `SIF-50K-sampled-200.json` for SFT and `SIF-50K-sampled-200.json` for RL. Please place them under the `data/` folder.
We also provide scripts for reproducing the SIF-50K dataset. If you don't want to produce the dataset again, you can skip this section. You can run the data generation with the following steps:

- step1: Download VisCoT. Change the image folder and JSON URL at lines 313 and 484 in `data_generation/produce.py` to where you downloaded VisCoT.
- step2: Follow the instructions of DepthAnything to set up the environment (a rough sketch of the depth-augmentation idea follows after this section).
- step3: Add your API key and LLM config at lines 403, 489, and 490 in `data_generation/produce.py`.
- step4: Run the data generation script:

```bash
cd data_generation
python produce.py
```

Remark: The same procedure applies to TallyQA; refer to `data_generation/misc/produce_tallyqa.py`.
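For intuition only, here is a minimal sketch of how a bounding box could be paired with a depth statistic taken from a DepthAnything-style depth map; this is not the logic in `produce.py`, and the function name and output format are assumptions:

```python
# Minimal sketch: attach a coarse depth estimate to a [x1, y1, x2, y2] box using a depth map.
# `depth_map` is assumed to be an H x W numpy array of relative depth values.
import numpy as np

def depth_enhance_bbox(depth_map, bbox):
    """Return the box together with the median depth inside it (hypothetical helper)."""
    x1, y1, x2, y2 = [int(round(v)) for v in bbox]
    region = depth_map[y1:y2, x1:x2]
    return {"bbox": bbox, "depth": float(np.median(region)) if region.size else None}

# Example with a dummy 480 x 640 depth map.
dummy_depth = np.random.rand(480, 640)
print(depth_enhance_bbox(dummy_depth, [100, 120, 220, 260]))
```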
You can follow LLaMA-Factory for environment setup and SFT training. Our hyperparameters and settings are included in the SFT folder. Specifically:

- You can use the settings under `SFT/env` to set up the environment:
```bash
conda create -n SFT python=3.10 -y && conda activate SFT
cd SFT/env
pip install -e ".[torch,metrics]" --no-build-isolation
```

- Run the warm-up training as:
```bash
cd ..
llamafactory-cli train train_sft.yaml
```

In GRPO-SIF, the key modification lies in the reward function used during training.
Taking Qwen2.5-VL as an example, the reward function is defined in `GRPO-SIF/src/open-r1-multimodal/src/open_r1/vlm_modules/qwen_module.py`. The progressive-learning definition can be found in `SIFThinker/GRPO-SIF/src/open-r1-multimodal/src/open_r1/trainer/grpo_trainer.py`.
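For intuition only, here is a minimal sketch of the kind of rewards such a grounded-reasoning setup could combine (a format reward plus an IoU-based grounding reward). The function names, tag layout, and box format below are assumptions; the actual rewards in `qwen_module.py` may differ:

```python
# Illustrative sketch only; the real reward functions live in qwen_module.py and may differ.
import re

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / (union + 1e-6)

def format_reward(completion):
    """1.0 if the completion follows a <think>...</think><answer>...</answer> layout (assumed tags)."""
    return 1.0 if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.S) else 0.0

def grounding_reward(completion, gt_box):
    """Best IoU between any [x1, y1, x2, y2] box found in the completion and the ground-truth box."""
    pattern = r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]"
    boxes = [[float(v) for v in m] for m in re.findall(pattern, completion)]
    return max((iou(b, gt_box) for b in boxes), default=0.0)

# Example usage with a dummy completion and ground-truth box.
text = "<think>The mug is near [100, 120, 220, 260].</think><answer>on the left</answer>"
print(format_reward(text), grounding_reward(text, [110, 130, 230, 250]))
```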
You can run the GRPO-SIF training with the following steps:

- step1: Add your API key and LLM config at lines 167, 168, and 261 in `GRPO-SIF/src/open-r1-multimodal/src/open_r1/vlm_modules/qwen_module.py`.
- step2: We use `SIF-50K-sampled-200.json` for training. Please place the dataset under the `data/` folder beforehand.
- step3: Run the training script:

```bash
bash run_scripts/train_grpo_sif.sh
```

Remember to merge the weights after each training phase using the provided scripts:
```bash
llamafactory-cli export merge.yaml
```

You can choose either vLLM or Hugging Face for inference.

```bash
API_PORT=8020 llamafactory-cli api inference.yaml
```

Then, you can use `scripts/infer.py` to run inference.
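As an illustration (not the repository's `scripts/infer.py`), LLaMA-Factory's API server exposes an OpenAI-style endpoint, so a query against the port above could look like the following; the model name, image path, and prompt are placeholders:

```python
# Minimal sketch of querying the OpenAI-compatible endpoint served on port 8020.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8020/v1", api_key="EMPTY")

with open("example.jpg", "rb") as f:  # placeholder image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="SIFThinker",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Where is the red mug relative to the laptop?"},
        ],
    }],
)
print(response.choices[0].message.content)
```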
We follow VisCoT, SpatialBot, SAT, V*, CV-Bench, etc. to evaluate the results. Some modified evaluation scripts are provided in the `scripts/evaluation/` folder. (We use the vLLM server on port 8020 for inference.)
This repo also benefits from VLM-R1, Open-R1-Multimodal, Visual-CoT, LLaVA, SpatialBot, SAT, V*, OVD-Eval, trl, and Cambrian.
Thanks for their wonderful work.
If you find SIFThinker helpful for your work, please cite:
@article{chen2025sifthinker,
title={SIFThinker: Spatially-Aware Image Focus for Visual Reasoning},
author={Chen, Zhangquan and Zhao, Ruihui and Luo, Chuwei and Sun, Mingze and Yu, Xinlei and Kang, Yangyang and Huang, Ruqi},
journal={arXiv preprint arXiv:2508.06259},
year={2025}
}
