xukechun/BayesVLA

bayesvla

Seeing to Act, Prompting to Specify:
A Bayesian Factorization of Vision Language Action Policy

Kechun Xu · Zhenjie Zhu · Anzhe Chen · Shuqi Zhao · Qing Huang · Yifei Yang
Haojian Lu · Rong Xiong · Masayoshi Tomizuka · Yue Wang

Paper PDF Project Page Video

TL;DR: BayesVLA decomposes the policy into a vision-action prior and a language-conditioned likelihood. The vision-action prior leverages visual information for action generation (seeing to act), while the language-conditioned likelihood aligns these action priors with the language instruction (prompting to specify).

🏆 Highlights

🔍 Key Findings: modality imbalance in VLA data encourages a "visual shortcut", degrading generalization under varied language conditions

Key Insights:

  • Bayesian factorization into a VA prior and a VLA likelihood structurally addresses the data imbalance during fine-tuning. The prior focuses on action modeling and generation, while the likelihood focuses on language grounding and alignment.

$$ \pi(\mathbf{a}\mid\mathbf{v},\ell) \propto\ \pi^{p}(\mathbf{a}\mid\mathbf{v}) L(\ell\mid\mathbf{v},\mathbf{a}) $$

  • Information-theoretic analysis reveals that the small conditional entropy $H(\ell\mid\mathbf{v})$ is the key reason for shortcut learning on visual cues, motivating our self-built benchmarks with diverse language conditions.

  • Contact-aware architecture: a unified prior-likelihood formulation for both pre-contact and post-contact phases.
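The factorization $\pi(\mathbf{a}\mid\mathbf{v},\ell)\propto\pi^{p}(\mathbf{a}\mid\mathbf{v})\,L(\ell\mid\mathbf{v},\mathbf{a})$ above can be sketched numerically: sample candidate actions from the vision-action prior, score each with the language-conditioned likelihood, and renormalize. The sketch below uses toy stand-ins (a two-mode Gaussian prior and a Gaussian log-likelihood); the function names and distributions are illustrative, not the repo's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_actions(n):
    """Stand-in for the VA prior pi^p(a|v): a multimodal action distribution."""
    modes = np.array([-1.0, 1.0])            # two behavior modes seen in the data
    picks = rng.integers(0, 2, size=n)
    return modes[picks] + 0.1 * rng.normal(size=n)

def language_log_likelihood(actions, target=1.0):
    """Stand-in for log L(l|v,a): prefers actions matching the instruction."""
    return -0.5 * ((actions - target) / 0.2) ** 2

# pi(a|v,l) ∝ pi^p(a|v) * L(l|v,a): importance-weight the prior samples
actions = sample_prior_actions(1000)
log_w = language_log_likelihood(actions)
w = np.exp(log_w - log_w.max())              # stabilized softmax-style weights
w /= w.sum()

# The reweighted samples concentrate on the instructed mode (+1), even though
# the prior alone spreads probability over both modes.
posterior_mean = float(np.sum(w * actions))
```

The prior alone is bimodal (mean near 0); after reweighting by the language likelihood, the posterior mean lands near the instructed mode at +1, which is the "prompting to specify" step.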

🧩 Overview

paper_video_pub.mp4

Given VLA datasets with modality imbalance, BayesVLA models the VLA policy as a prior and a likelihood, trained with a two-stage procedure: in stage 1, we train a prior model that takes visual input and generates a multimodal action distribution. Building on the prior, in stage 2, we train the likelihood model to align the action priors with the language instruction.
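The two-stage procedure might be sketched as follows, on a toy imbalanced dataset where actions are largely explained by vision alone and only a few distinct instructions exist. Everything here (linear models, variable names) is a placeholder for intuition, not the repo's training code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset with modality imbalance: many (v, a) pairs, few distinct instructions l.
visions = rng.normal(size=(100, 4))
actions = visions @ rng.normal(size=(4, 2))   # actions explained by vision alone
instructions = rng.integers(0, 3, size=100)   # only 3 distinct instructions

def train_prior(visions, actions):
    """Stage 1: fit the VA prior pi^p(a|v) (here, a linear least-squares fit)."""
    W, *_ = np.linalg.lstsq(visions, actions, rcond=None)
    return W

def train_likelihood(visions, actions, instructions, W_prior):
    """Stage 2: freeze the prior; fit a per-instruction correction on its residual."""
    residual = actions - visions @ W_prior
    return {l: residual[instructions == l].mean(axis=0)
            for l in np.unique(instructions)}

W_prior = train_prior(visions, actions)                                  # seeing to act
corrections = train_likelihood(visions, actions, instructions, W_prior)  # prompting to specify
```

Because the prior already absorbs everything vision can predict, stage 2 is left with only the instruction-dependent part, which mirrors how the factorization shields language grounding from the imbalanced visual signal.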


📘 Usage

Data Preparation

  • Droid

    We use the processed data from cadence/droid_1.0.1 since it has camera extrinsics attached. Download it anywhere you like, and create a symbolic link to it at ./data_raw/droid_1.0.1.

    bash scripts/data_preprocessing/process_libero.sh
  • LIBERO

    Download the LIBERO dataset and create a symbolic link at ./data_raw/libero.

    bash scripts/data_preprocessing/process_libero.sh
  • Self-collected Datasets

    We upload the processed datasets, including Pick-Place collected in Isaac Sim, Articulated Object Manipulation collected in Isaac Lab, and ALOHA collected in the real world.

Pre-training (Optional)

Note that pretraining applies only to the post-contact phase.

  bash scripts/postcontact/pretrain.sh

Post-training

  • Pre-contact Phase
    bash scripts/precontact/finetune.sh --config finetune_pp_arti
  • Post-contact Phase
    # stage 0: va finetuning
    bash scripts/postcontact/finetune.sh --stage 0 --config finetune_pp_arti --va-name YOUR_VA_NAME
    # stage 1: vla finetuning
    bash scripts/postcontact/finetune.sh --stage 1 --config finetune_pp_arti --va-name YOUR_VA_NAME --vla-name YOUR_VLA_NAME

Evaluation

  • Launch the Pyro4 name server (similar in role to roscore).

    pyro4-ns

    By default the naming server runs on localhost:9090.

  • Launch remote service of your fine-tuned model:

    python -m infer_utils.remote_service \
      --precontact_ckpt PRECONTACT_CKPT_PATH \
      --postcontact_ckpt POSTCONTACT_CKPT_PATH \
      --uri CUSTOM_URI_NAME
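Once the service is registered, a client can resolve it through the name server and call it like a local object. A minimal sketch is below; note that `predict_action` is a hypothetical method name (check `infer_utils.remote_service` for the actual remote interface), while `Pyro4.Proxy` and the `PYRONAME:` URI scheme are standard Pyro4.

```python
def query_policy(uri_name="CUSTOM_URI_NAME", observation=None):
    """Query the remote policy service registered under `uri_name`.

    `predict_action` is a hypothetical method name, not confirmed from
    the repo; see infer_utils.remote_service for the real interface.
    """
    import Pyro4  # resolved via the name server started with `pyro4-ns`
    policy = Pyro4.Proxy(f"PYRONAME:{uri_name}")  # name server defaults to localhost:9090
    return policy.predict_action(observation)
```

The proxy is looked up lazily, so this function only needs the name server and the remote service to be running at call time.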

🤝 Acknowledgements

This project builds upon OpenPi and E2VLA. We thank these teams for their open-source contributions.

📚 Citation

If you find this work useful, please consider citing:

@article{xu2025bayesvla,
  title={Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy},
  author={Xu, Kechun and Zhu, Zhenjie and Chen, Anzhe and Zhao, Shuqi and Huang, Qing and Yang, Yifei and Lu, Haojian and Xiong, Rong and Tomizuka, Masayoshi and Wang, Yue},
  journal={arXiv preprint arXiv:2512.11218},
  year={2025}
}
