xukechun/BayesVLA

bayesvla

Seeing to Act, Prompting to Specify:
A Bayesian Factorization of Vision Language Action Policy

Kechun Xu · Zhenjie Zhu · Anzhe Chen · Shuqi Zhao · Qing Huang · Yifei Yang
Haojian Lu · Rong Xiong · Masayoshi Tomizuka · Yue Wang

Paper PDF Project Page Video

TL;DR: BayesVLA decomposes the policy into a vision-action prior and a language-conditioned likelihood. The vision-action prior leverages visual information for action generation (seeing to act), while the language-conditioned likelihood aligns these action priors with the language instruction (prompting to specify).

🏆 Highlights

🔍 Key Findings: modality imbalance in VLA data encourages a "visual shortcut", degrading generalization under varied language conditions

Key Insights:

  • Bayesian factorization into a VA prior and a VLA likelihood structurally addresses the data imbalance during fine-tuning. The prior focuses on action modeling and generation, while the likelihood focuses on language grounding and alignment.

$$ \pi(\mathbf{a}\mid\mathbf{v},\ell) \propto\ \pi^{p}(\mathbf{a}\mid\mathbf{v}) L(\ell\mid\mathbf{v},\mathbf{a}) $$

  • Information-theoretic analysis reveals that the small conditional entropy $H(\ell\mid\mathbf{v})$ is the key reason for shortcut learning on visual cues, motivating our self-built benchmarks with diverse language conditions.

  • Contact-aware architecture: a unified prior-likelihood formulation for both pre-contact and post-contact phases.
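The factorization $\pi(\mathbf{a}\mid\mathbf{v},\ell)\propto\pi^{p}(\mathbf{a}\mid\mathbf{v})\,L(\ell\mid\mathbf{v},\mathbf{a})$ above can be sketched numerically: sample candidate actions from the vision-action prior, score each with the language-conditioned likelihood, and renormalize. The sketch below uses toy stand-ins (a two-mode Gaussian prior and a Gaussian log-likelihood); the function names and distributions are illustrative, not the repo's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_actions(n):
    """Stand-in for the VA prior pi^p(a|v): a multimodal action distribution."""
    modes = np.array([-1.0, 1.0])            # two behavior modes seen in the data
    picks = rng.integers(0, 2, size=n)
    return modes[picks] + 0.1 * rng.normal(size=n)

def language_log_likelihood(actions, target=1.0):
    """Stand-in for log L(l|v,a): prefers actions matching the instruction."""
    return -0.5 * ((actions - target) / 0.2) ** 2

# pi(a|v,l) ∝ pi^p(a|v) * L(l|v,a): importance-weight the prior samples
actions = sample_prior_actions(1000)
log_w = language_log_likelihood(actions)
w = np.exp(log_w - log_w.max())              # stabilized softmax-style weights
w /= w.sum()

# The reweighted samples concentrate on the instructed mode (+1), even though
# the prior alone spreads probability over both modes.
posterior_mean = float(np.sum(w * actions))
```

The prior alone is bimodal (mean near 0); after reweighting by the language likelihood, the posterior mean lands near the instructed mode at +1, which is the "prompting to specify" step.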

🧩 Overview

paper_video_pub.mp4

Given VLA datasets with modality imbalance, BayesVLA models the VLA policy as a prior and a likelihood, trained with a two-stage procedure: in stage 1, we train a prior model that takes visual input and generates a multimodal action distribution. Building on the prior, in stage 2, we train the likelihood model to align the action priors with the language instruction.
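The two-stage procedure might be sketched as follows, on a toy imbalanced dataset where actions are largely explained by vision alone and only a few distinct instructions exist. Everything here (linear models, variable names) is a placeholder for intuition, not the repo's training code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset with modality imbalance: many (v, a) pairs, few distinct instructions l.
visions = rng.normal(size=(100, 4))
actions = visions @ rng.normal(size=(4, 2))   # actions explained by vision alone
instructions = rng.integers(0, 3, size=100)   # only 3 distinct instructions

def train_prior(visions, actions):
    """Stage 1: fit the VA prior pi^p(a|v) (here, a linear least-squares fit)."""
    W, *_ = np.linalg.lstsq(visions, actions, rcond=None)
    return W

def train_likelihood(visions, actions, instructions, W_prior):
    """Stage 2: freeze the prior; fit a per-instruction correction on its residual."""
    residual = actions - visions @ W_prior
    return {l: residual[instructions == l].mean(axis=0)
            for l in np.unique(instructions)}

W_prior = train_prior(visions, actions)                                  # seeing to act
corrections = train_likelihood(visions, actions, instructions, W_prior)  # prompting to specify
```

Because the prior already absorbs everything vision can predict, stage 2 is left with only the instruction-dependent part, which mirrors how the factorization shields language grounding from the imbalanced visual signal.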


📘 Usage

Data Preparation

  • Droid

    We use the processed data from cadence/droid_1.0.1 since it has camera extrinsics attached. Download it anywhere you like, and create a symbolic link to it at ./data_raw/droid_1.0.1.

    bash scripts/data_preprocessing/process_libero.sh
  • LIBERO

    Download the LIBERO dataset and create a symbolic link at ./data_raw/libero.

    bash scripts/data_preprocessing/process_libero.sh
  • Self-collected Datasets

    We upload the processed datasets, including Pick-Place collected in Isaac Sim, Articulated Object Manipulation collected in Isaac Lab, and ALOHA collected in the real world.

Pre-training (Optional)

Note that pretraining applies only to the post-contact phase.

  bash scripts/postcontact/pretrain.sh

Post-training

  • Pre-contact Phase
    bash scripts/precontact/finetune.sh --config finetune_pp_arti
  • Post-contact Phase
    # stage 0: va finetuning
    bash scripts/postcontact/finetune.sh --stage 0 --config finetune_pp_arti --va-name YOUR_VA_NAME
    # stage 1: vla finetuning
    bash scripts/postcontact/finetune.sh --stage 1 --config finetune_pp_arti --va-name YOUR_VA_NAME --vla-name YOUR_VLA_NAME

Evaluation

  • Launch the Pyro4 name server (similar in role to roscore).

    pyro4-ns

    By default the naming server runs on localhost:9090.

  • Launch remote service of your fine-tuned model:

    python -m infer_utils.remote_service \
      --precontact_ckpt PRECONTACT_CKPT_PATH \
      --postcontact_ckpt POSTCONTACT_CKPT_PATH \
      --uri CUSTOM_URI_NAME
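Once the service is registered, a client can resolve it through the name server and call it like a local object. A minimal sketch is below; note that `predict_action` is a hypothetical method name (check `infer_utils.remote_service` for the actual remote interface), while `Pyro4.Proxy` and the `PYRONAME:` URI scheme are standard Pyro4.

```python
def query_policy(uri_name="CUSTOM_URI_NAME", observation=None):
    """Query the remote policy service registered under `uri_name`.

    `predict_action` is a hypothetical method name, not confirmed from
    the repo; see infer_utils.remote_service for the real interface.
    """
    import Pyro4  # resolved via the name server started with `pyro4-ns`
    policy = Pyro4.Proxy(f"PYRONAME:{uri_name}")  # name server defaults to localhost:9090
    return policy.predict_action(observation)
```

The proxy is looked up lazily, so this function only needs the name server and the remote service to be running at call time.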

🤝 Acknowledgements

This project builds upon OpenPi and E2VLA. We thank these teams for their open-source contributions.

📚 Citation

If you find this work useful, please consider citing:

@article{xu2025bayesvla,
  title={Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy},
  author={Xu, Kechun and Zhu, Zhenjie and Chen, Anzhe and Zhao, Shuqi and Huang, Qing and Yang, Yifei and Lu, Haojian and Xiong, Rong and Tomizuka, Masayoshi and Wang, Yue},
  journal={arXiv preprint arXiv:2512.11218},
  year={2025}
}
