Vision-Language-Action (VLA) models often struggle with precise spatial grounding and robustness due to monolithic end-to-end designs. In this project, we introduce VP-VLA, which decouples high-level reasoning from low-level execution via a structured visual prompting interface, enabling more precise and reliable robotic manipulation.
VP-VLA demonstrates the following features:
- Dual-System Architecture: VP-VLA decomposes robotic manipulation into:
  - System 2 Planner (high-level reasoning)
  - System 1 Controller (low-level execution)
- Visual Prompt Interface: Instead of relying solely on text, VP-VLA converts language instructions into structured visual prompts (crosshairs and bounding boxes), enabling precise spatial grounding.
- Improved Spatial Precision & Robustness: By grounding actions in visual space, the framework significantly improves performance in:
  - Novel object scenarios
  - Out-of-distribution (OOD) spatial configurations
- General Multi-Stage Manipulation: VP-VLA supports complex, multi-step tasks via:
  - Task decomposition
  - Event-driven planning
  - Dynamic visual prompt updates
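The crosshair and bounding-box prompts above can be sketched as simple image overlays. This is a minimal illustration only: the code is not yet released, so the helper names, colors, and sizes below are our own assumptions, not the actual VP-VLA rendering code.

```python
import numpy as np

def draw_crosshair(img: np.ndarray, x: int, y: int, size: int = 5,
                   color=(255, 0, 0)) -> np.ndarray:
    """Mark a target point (e.g. a grasp point) with a crosshair."""
    out = img.copy()
    h, w = out.shape[:2]
    out[max(0, y - size):min(h, y + size + 1), x] = color  # vertical bar
    out[y, max(0, x - size):min(w, x + size + 1)] = color  # horizontal bar
    return out

def draw_bbox(img: np.ndarray, x0: int, y0: int, x1: int, y1: int,
              color=(0, 255, 0)) -> np.ndarray:
    """Outline a target region (e.g. a placement area) with a box."""
    out = img.copy()
    out[y0, x0:x1 + 1] = color  # top edge
    out[y1, x0:x1 + 1] = color  # bottom edge
    out[y0:y1 + 1, x0] = color  # left edge
    out[y0:y1 + 1, x1] = color  # right edge
    return out

# Overlay both prompt types on a blank 64x64 RGB observation.
obs = np.zeros((64, 64, 3), dtype=np.uint8)
prompted = draw_bbox(draw_crosshair(obs, 20, 20), 40, 40, 55, 55)
```

The controller then sees `prompted` (observation plus overlay) instead of raw language, so the spatial target is explicit in pixel space.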
[Mar 24th, 2026] 🔥 📖 Paper released! Code will be released within two weeks.
VP-VLA consists of two key components:
- System 2 Planner:
  - Decomposes instructions into subtasks
  - Identifies the target object and target location
  - Generates structured visual prompts
- System 1 Controller:
  - Takes the original observation and the visual prompt overlay
  - Produces continuous robot actions
Instead of solving everything in one forward pass, VP-VLA follows a staged pipeline: Language → Visual Prompts → Actions.
This transforms the problem into visuomotor tracking of explicit spatial cues, improving precision and interpretability.
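The staged pipeline can be sketched as two functions with a structured interface between them. This is a hedged sketch under our own assumptions: the class and function names (`VisualPrompt`, `plan`, `act`) are illustrative, and the released VP-VLA code may structure the planner/controller boundary differently.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VisualPrompt:
    """Structured spatial cue emitted by the System 2 planner."""
    kind: str     # "crosshair" (target point) or "bbox" (target region)
    coords: tuple # (x, y) for a crosshair, (x0, y0, x1, y1) for a bbox

def plan(instruction: str) -> List[VisualPrompt]:
    """System 2: decompose the instruction and emit visual prompts.
    A real planner would run a VLM; this toy version returns a fixed plan."""
    return [
        VisualPrompt("crosshair", (20, 20)),    # where to grasp
        VisualPrompt("bbox", (40, 40, 55, 55)), # where to place
    ]

def act(observation, prompts: List[VisualPrompt]) -> List[float]:
    """System 1: track the rendered prompts and output a continuous action.
    A real controller is a learned visuomotor policy; this stub just
    heads toward the first crosshair."""
    x, y = next(p.coords for p in prompts if p.kind == "crosshair")
    return [float(x), float(y), 0.0]  # toy (dx, dy, gripper) action

action = act(None, plan("put the red block in the bin"))
```

Because the controller only ever tracks explicit spatial cues, novel objects and OOD layouts reduce to the same tracking problem the controller was trained on.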
@article{wang2026vpvla,
title={VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models},
author={Wang, Zixuan and Chen, Yuxin and Liu, Yuqi and Ye, Jinhui and Chen, Pengguang and Lu, Changsheng and Liu, Shu and Jia, Jiaya},
journal={arXiv preprint arXiv:2603.22003},
year={2026}
}

We would like to thank the following repos for their great work:

