VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

Vision-Language-Action (VLA) models often struggle with precise spatial grounding and robustness due to monolithic end-to-end designs. In this project, we introduce VP-VLA, a dual-system framework that decouples high-level reasoning from low-level execution via a structured visual prompting interface, enabling more precise and reliable robotic manipulation.

VP-VLA_supp_video.mp4

Overview of VP-VLA

VP-VLA demonstrates the following features:

- Dual-System Architecture: VP-VLA decomposes robotic manipulation into:
  - System 2 Planner (high-level reasoning)
  - System 1 Controller (low-level execution)
- Visual Prompt Interface: Instead of relying solely on text, VP-VLA converts language instructions into structured visual prompts (crosshairs and bounding boxes), enabling precise spatial grounding.
- Improved Spatial Precision & Robustness: By grounding actions in visual space, the framework significantly improves performance in:
  - Novel object scenarios
  - Out-of-distribution (OOD) spatial configurations
- General Multi-Stage Manipulation: VP-VLA supports complex, multi-step tasks via:
  - Task decomposition
  - Event-driven planning
  - Dynamic visual prompt updates
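
The multi-stage behaviour listed above (task decomposition, event-driven planning, dynamic prompt updates) can be pictured with a small loop. This is only an illustrative sketch, not the released VP-VLA code: `Subtask`, `EventDrivenPlanner`, `run_episode`, and the `is_done` completion signal are hypothetical names, and this README does not specify what counts as an event. Here the planner advances to the next subtask, and therefore to a new prompt, only when a completion event fires.

```python
from typing import Callable, List, NamedTuple, Optional


class Subtask(NamedTuple):
    description: str
    prompt: str  # stand-in for a rendered visual prompt (hypothetical)


class EventDrivenPlanner:
    """Holds a decomposed task and swaps the active prompt only on events."""

    def __init__(self, subtasks: List[Subtask]) -> None:
        self._subtasks = subtasks
        self._index = 0
        self.prompt_updates = 0  # number of times the active prompt changed

    @property
    def current(self) -> Optional[Subtask]:
        if self._index < len(self._subtasks):
            return self._subtasks[self._index]
        return None  # every subtask has been completed

    def on_event(self, subtask_done: bool) -> Optional[Subtask]:
        # Assumption: the only event modelled here is "current subtask finished".
        if subtask_done and self.current is not None:
            self._index += 1
            self.prompt_updates += 1
        return self.current


def run_episode(
    planner: EventDrivenPlanner,
    is_done: Callable[[int, Subtask], bool],
    max_steps: int = 100,
) -> List[str]:
    """Return the prompt that a controller would have seen at each step."""
    seen: List[str] = []
    for step in range(max_steps):
        subtask = planner.current
        if subtask is None:
            break
        seen.append(subtask.prompt)  # a controller would act on this prompt
        planner.on_event(is_done(step, subtask))
    return seen
```

In this sketch the prompt stays fixed between events, so the controller sees a stable cue while it executes one subtask.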

News

[Mar 24th, 2026] 🔥 📖 Paper released! Code will be released within two weeks.

Model

VP-VLA consists of two key components:

System 2 Planner (High-Level Reasoning)

- Decomposes instructions into subtasks
- Identifies:
  - Target object
  - Target location
- Generates structured visual prompts
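
To make "structured visual prompts" concrete, the sketch below draws a crosshair and a bounding box onto a tiny image. It is a minimal illustration under stated assumptions, not the project's renderer: the image is a nested list of RGB tuples, `VisualPrompt` and `render_prompt` are hypothetical names, and the colours, the arm length, and the choice of which marker denotes the object versus the location are all made up for the example.

```python
from typing import List, NamedTuple, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # indexed as image[row][col]


class VisualPrompt(NamedTuple):
    crosshair: Tuple[int, int]      # (row, col) of the crosshair centre
    box: Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def blank_image(height: int, width: int, color: Pixel = (0, 0, 0)) -> Image:
    return [[color for _ in range(width)] for _ in range(height)]


def render_prompt(
    image: Image,
    prompt: VisualPrompt,
    arm: int = 2,
    cross_color: Pixel = (255, 0, 0),
    box_color: Pixel = (0, 255, 0),
) -> Image:
    """Return a copy of `image` with the crosshair and box outline drawn on it."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the original observation untouched

    # Crosshair: one vertical and one horizontal arm through the centre.
    r0, c0 = prompt.crosshair
    for d in range(-arm, arm + 1):
        if 0 <= r0 + d < height and 0 <= c0 < width:
            out[r0 + d][c0] = cross_color
        if 0 <= r0 < height and 0 <= c0 + d < width:
            out[r0][c0 + d] = cross_color

    # Bounding box: outline only, clipped to the image bounds.
    top, left, bottom, right = prompt.box
    for c in range(max(left, 0), min(right, width - 1) + 1):
        for r in (top, bottom):
            if 0 <= r < height:
                out[r][c] = box_color
    for r in range(max(top, 0), min(bottom, height - 1) + 1):
        for c in (left, right):
            if 0 <= c < width:
                out[r][c] = box_color
    return out
```

Returning a copy keeps the original observation available, which matters because the controller described below takes both the observation and the overlay.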

System 1 Controller (Low-Level Execution)

- Takes:
  - Original observation
  - Visual prompt overlay
- Produces:
  - Continuous robot actions
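
The input/output contract above can be illustrated with a toy stand-in for the learned controller. This sketch is an assumption-heavy illustration and not the VP-VLA policy: `cue_centroid` and `toy_controller` are hypothetical names, the cue is located by diffing the overlay against the observation, and the "action" is just a clipped 2D displacement in pixel coordinates.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # indexed as image[row][col]


def cue_centroid(observation: Image, overlay: Image) -> Tuple[float, float]:
    """Mean (row, col) of the pixels where the overlay differs from the observation."""
    if len(observation) != len(overlay) or len(observation[0]) != len(overlay[0]):
        raise ValueError("observation and overlay must have the same size")
    row_sum, col_sum, count = 0, 0, 0
    for r, (obs_row, ovl_row) in enumerate(zip(observation, overlay)):
        for c, (obs_px, ovl_px) in enumerate(zip(obs_row, ovl_row)):
            if obs_px != ovl_px:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        raise ValueError("overlay contains no visual prompt")
    return row_sum / count, col_sum / count


def toy_controller(
    observation: Image,
    overlay: Image,
    gripper_rc: Tuple[float, float],
    max_step: float = 1.0,
) -> Tuple[float, float]:
    """Continuous (d_row, d_col) action that moves the gripper toward the cue."""
    target_r, target_c = cue_centroid(observation, overlay)

    def clip(value: float) -> float:
        return max(-max_step, min(max_step, value))

    return clip(target_r - gripper_rc[0]), clip(target_c - gripper_rc[1])
```

A real controller would be a learned network producing robot actions; the point here is only that both images go in and a continuous action comes out.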

Key Idea

Instead of solving everything in one forward pass, VP-VLA does the following:

Language → Visual Prompts → Actions

This transforms the problem into visuomotor tracking of explicit spatial cues, improving precision and interpretability.
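
The Language → Visual Prompts → Actions flow can be sketched as two composed functions with an inspectable value in between. Everything here is hypothetical and simplified for illustration: `Prompt`, `plan`, and `control` are invented names, the planner is a string-matching stub rather than a vision-language model, the scene is a hand-written dictionary of positions, and the controller is a proportional step rather than a learned policy.

```python
from typing import Dict, NamedTuple, Tuple

Point = Tuple[float, float]


class Prompt(NamedTuple):
    target_object: Point    # where the object cue would be drawn
    target_location: Point  # where the goal cue would be drawn


def plan(instruction: str, scene: Dict[str, Point]) -> Prompt:
    """Stub planner: the first scene name mentioned is the object, the second the goal."""
    text = instruction.lower()
    mentioned = sorted((text.index(name), name) for name in scene if name in text)
    if len(mentioned) < 2:
        raise ValueError("instruction must name an object and a location in the scene")
    obj, loc = mentioned[0][1], mentioned[1][1]
    return Prompt(scene[obj], scene[loc])


def control(prompt: Prompt, gripper: Point, holding: bool, gain: float = 0.5) -> Point:
    """Stub controller: proportional step toward whichever cue is currently active."""
    goal = prompt.target_location if holding else prompt.target_object
    return gain * (goal[0] - gripper[0]), gain * (goal[1] - gripper[1])


scene = {"cup": (2.0, 4.0), "plate": (8.0, 4.0)}
prompt = plan("Put the cup on the plate", scene)  # intermediate, inspectable prompt
action = control(prompt, gripper=(0.0, 0.0), holding=False)
```

Because `prompt` is an explicit value between the two stages, it can be logged or visualised before any action is taken, which is the interpretability benefit the paragraph above refers to.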

Citation

@article{wang2026vpvla,
  title={VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models},
  author={Wang, Zixuan and Chen, Yuxin and Liu, Yuqi and Ye, Jinhui and Chen, Pengguang and Lu, Changsheng and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2603.22003},
  year={2026}
}

Acknowledgement

We would like to thank the following repos for their great work:

- This work is built upon starVLA.
- This work utilizes models from Qwen3-VL and SAM3.
