
Visual Spatial Tuning

Paper Project Page Weights

We introduce Visual Spatial Tuning (VST), a comprehensive framework designed to cultivate Vision-Language Models (VLMs) with human-like visuospatial abilities—from spatial perception to advanced reasoning.

Teaser Image


🔥 News

  • Training code has been updated and verified; please see the Train section below. Training is very efficient thanks to data packing.

💡 Key Highlights

  • VST-P: 4.1M samples across 19 skills, spanning single images, multi-image scenarios, and videos—boosting spatial perception in VLMs.
  • VST-R: 135K curated samples that teach models to reason in space, including step-by-step reasoning and rule-based data for reinforcement learning.
  • Progressive Training Pipeline: start with supervised fine-tuning to build foundational spatial perception, then reinforce spatial reasoning abilities via RL. VST achieves state-of-the-art results on spatial benchmarks (34.8% on MMSI-Bench, 61.2% on VSIBench) without compromising general capabilities.
  • Vision-Language-Action Models Enhanced: the VST paradigm significantly strengthens robotic learning.


📊 Dataset Overview

Dataset Image

🖼️ VST-Perception (VST-P)

  • 4.1M samples across 19 tasks for supervised fine-tuning.
  • Covers three primary vision scenarios: single-image, multi-image, and video.
  • VLMs tuned on VST-P show strong improvements in spatial perception:
    • ~20% boost on CVBench-3D
    • ~5% increase on BLINK
    • ~16% gain on VSIBench

🧠 VST-Reasoning (VST-R)

  • 135K samples, split into:
    • Reasoning steps (CoT): Teach models how to reason spatially.
    • Rule-checkable data: used in online RL to further enhance reasoning skills (see the reward sketch after this list).
  • VLMs tuned on VST-R demonstrate:
    • 8.9% improvement on MMSI-Bench
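
To make "rule-checkable" concrete, the sketch below shows a minimal rule-based reward of the kind typically used for online RL on such data: a format check for the <think> </think> template plus an exact-match answer check. The function names, matching rules, and weighting are illustrative assumptions, not the repository's actual reward implementation.

import re

def format_reward(response: str) -> float:
    # Assumed format rule: reasoning wrapped in <think> ... </think> before the answer,
    # matching THINK_SYSTEM_PROMPT in the cookbook below.
    return 1.0 if re.match(r"^<think>.+</think>.+$", response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # Assumed accuracy rule: exact match between the text after </think> and the annotation.
    answer = response.split("</think>")[-1].strip().lower()
    return 1.0 if answer == ground_truth.strip().lower() else 0.0

def rule_based_reward(response: str, ground_truth: str) -> float:
    # Hypothetical weighting of the two terms.
    return 0.1 * format_reward(response) + 0.9 * accuracy_reward(response, ground_truth)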

🏷️ Model Card

| Model Name | 🤗 HuggingFace |
| --- | --- |
| VST-3B-SFT | rayruiyang/VST-3B-SFT |
| VST-3B-RL | rayruiyang/VST-3B-RL |
| VST-7B-SFT | rayruiyang/VST-7B-SFT |
| VST-7B-RL | rayruiyang/VST-7B-RL |
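
All checkpoints are hosted on the Hugging Face Hub; the snippet below (an optional illustration, not part of the repository) downloads one of the checkpoints listed above into the local cache:

from huggingface_hub import snapshot_download

# Fetch the VST-7B-RL weights and print the local snapshot directory.
local_dir = snapshot_download(repo_id="rayruiyang/VST-7B-RL")
print(local_dir)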

📈 Spatial & General Benchmarks

| Models | CV3D | SR | MMSI | BLINK | VSI | MMStar | MMB | RealworldQA | MMMU | OCRB | AI2D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VST-3B-SFT | 84.4 | 54.1 | 30.2 | 59.1 | 57.9 | 58.0 | 80.9 | 68.4 | 45.2 | 83.7 | 82.5 |
| VST-3B-RL | 84.2 | 56.5 | 31.3 | 57.2 | 57.7 | 58.9 | 80.5 | 68.5 | 49.8 | 80.9 | 82.4 |
| VST-7B-SFT | 85.5 | 54.6 | 32.0 | 62.1 | 60.6 | 63.1 | 83.3 | 72.2 | 50.6 | 85.5 | 84.9 |
| VST-7B-RL | 86.5 | 60.1 | 34.8 | 62.6 | 61.2 | 63.5 | 83.0 | 68.5 | 49.4 | 86.1 | 83.5 |

📈 VSIBench

| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VST-3B-SFT | 57.9 | 69.3 | 45.4 | 71.8 | 62.4 | 59.0 | 46.0 | 38.7 | 70.2 |
| VST-3B-RL | 57.7 | 66.6 | 45.0 | 72.8 | 60.9 | 59.9 | 47.6 | 40.7 | 68.3 |
| VST-7B-SFT | 60.6 | 72.0 | 44.4 | 74.3 | 68.3 | 59.7 | 55.8 | 44.9 | 65.2 |
| VST-7B-RL | 61.2 | 71.6 | 43.8 | 75.5 | 69.2 | 60.0 | 55.6 | 44.3 | 69.2 |

📈 SUN RGBD 3D Object Detection

| Methods | AP@15 |
| --- | --- |
| Seed1.5-VL | 33.5 |
| Gemini-2.0-Pro | 32.5 |
| Gemini Robotics-ER | 48.3 |
| VST-3B-SFT | 37.3 |
| VST-3B-RL | 40.1 |
| VST-7B-SFT | 41.6 |
| VST-7B-RL | 44.2 |

⚡ Getting Started

pip install transformers
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils

Cookbook

Using 🤗 Transformers to Chat

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

THINK_SYSTEM_PROMPT = "You are a helpful assistant. You should first think about the reasoning process in the mind and then provide the user with the answer. The reasoning process is enclosed within <think> </think> tags, i.e. <think> reasoning process here </think> answer here."
think_mesg = {
    "role": "system",
    "content": [{"type": "text", "text": THINK_SYSTEM_PROMPT}],
}

enable_thinking = False

model_path = "rayruiyang/VST-7B-RL"

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained(model_path, min_pixels=256*28*28, max_pixels=1280*28*28)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/train2017/000000075668.jpg",
            },
            {"type": "text", "text": "Consider the real-world 3D locations of the objects. Is the 'no motorcycle' sign directly above the red bus?"},
        ],
    }
]

if enable_thinking:
    messages.insert(0, think_mesg)


# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
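
When enable_thinking is True, the model is prompted to wrap its reasoning in <think> </think> tags before the final answer (see THINK_SYSTEM_PROMPT above). The helper below is a minimal sketch of how the reasoning and the answer could be separated from output_text[0]; the function name is ours and not part of the repository.

def split_thinking(response: str):
    # Split "<think> reasoning </think> answer" into (reasoning, answer);
    # if there is no closing tag, treat the whole response as the answer.
    if "</think>" in response:
        reasoning, answer = response.split("</think>", 1)
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()

reasoning, answer = split_thinking(output_text[0])
print("Answer:", answer)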

Train

git clone https://github.com/Yangr116/VST
cd VST
# install veomni
git clone -b v0.1.3 https://github.com/ByteDance-Seed/VeOmni.git third_party/VeOmni
cd third_party/VeOmni
pip install -e .
# install requirements
cd ../..
pip install -r requirements.txt
# install flash-attn (recommended)
pip install flash-attn --no-build-isolation

NOTE: We use torch 2.5.1+cu124; other torch versions should also work.
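
As an optional sanity check of the environment (our illustration, not a repository script), you can verify the torch build, CUDA availability, and the flash-attn install:

import torch

print("torch:", torch.__version__)                  # e.g. 2.5.1+cu124
print("CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; attention falls back to the default implementation.")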

Please follow docs/train.md to prepare data and train models.

Evaluation

Please see docs/evaluation.md

📜 License

This project is licensed under the Apache License. See the LICENSE file for details.

The VST-3B model is fine-tuned from Qwen2.5VL-3B; its use is therefore also governed by the Qwen2.5VL-3B LICENSE.

Acknowledgement

Thanks to the following projects: Qwen2.5VL, VeOmni, EasyR1, and VLMEvalKit.

If you find VST useful for your research or applications, please ⭐ star the repo or cite our work:

@article{vst,
  title={Visual Spatial Tuning},
  author={Rui Yang and Ziyu Zhu and Yanwei Li and Jingjia Huang and Shen Yan and Siyuan Zhou and Zhe Liu and Xiangtai Li and Shuangye Li and Wenqian Wang and Yi Lin and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2511.05491},
  year={2025}
}