Multi-Modal Human Motion Alignment for Cross-Modal Retrieval, Recognition, and Generation
MotionBind extends LanguageBind to incorporate human motion, enabling cross-modal retrieval, zero-shot action recognition, and any-to-motion generation from text, video, or audio.
MotionBind consists of two main components:
MuTMoT is a transformer-based hierarchical encoder-decoder architecture that encodes motion sequences into compact embeddings aligned with a shared multi-modal space. It captures motion dynamics at multiple temporal resolutions, producing representations that reflect both fine-grained pose transitions and high-level action semantics.
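The idea of encoding a pose sequence at multiple temporal resolutions can be sketched as follows. This is a minimal, hedged illustration, not the MuTMoT architecture itself: the transformer layers are replaced by average pooling, and the projection matrix is a random stand-in for learned weights. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def multi_resolution_encode(motion, strides=(1, 4, 16), dim=8, seed=0):
    """Toy sketch: summarize a (T, D) pose sequence at several temporal
    resolutions, then project the summaries into one compact embedding.
    Stride 1 keeps fine-grained pose transitions; coarse strides capture
    action-level structure (illustrative stand-in for the real encoder)."""
    rng = np.random.default_rng(seed)
    T, D = motion.shape
    summaries = []
    for s in strides:
        n = T // s
        # Average-pool non-overlapping windows of length s, then mean over time.
        pooled = motion[: n * s].reshape(n, s, D).mean(axis=1)
        summaries.append(pooled.mean(axis=0))
    stacked = np.concatenate(summaries)           # (len(strides) * D,)
    W = rng.standard_normal((stacked.size, dim))  # random stand-in projection
    z = stacked @ W
    return z / np.linalg.norm(z)                  # unit-norm embedding
```

An embedding produced this way can be compared against text, video, or audio embeddings in the shared space with cosine similarity, which is what enables cross-modal retrieval and zero-shot recognition.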
REALM is a retrieval-augmented latent diffusion model that generates motion sequences conditioned on any modality (text, video, or audio). It operates in the compact latent space defined by the MuTMoT encoder and uses temporal conditioning with learnable frame tokens that dynamically attend to conditioning context throughout the denoising process.
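The retrieval and conditioning steps can be sketched as below. This is a hedged toy version under stated assumptions, not the REALM implementation: retrieval is plain cosine-similarity nearest neighbors over a latent memory bank, and the frame-token conditioning is a single unparameterized dot-product attention, standing in for learned cross-attention inside the denoiser. All names are illustrative.

```python
import numpy as np

def retrieve_latents(query, memory, k=2):
    """Fetch the k motion latents most similar (cosine) to a query
    embedding from any modality; `memory` is an (N, d) latent bank."""
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q
    idx = np.argsort(-sims)[:k]
    return idx, memory[idx]

def condition_frame_tokens(frame_tokens, retrieved):
    """Frame tokens attend over the retrieved context (single-head
    dot-product attention, no learned weights in this sketch) and are
    updated residually, as would happen at each denoising step."""
    d = frame_tokens.shape[1]
    scores = frame_tokens @ retrieved.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return frame_tokens + attn @ retrieved  # residual update
```

Because both steps operate in the compact latent space of the motion encoder, the same conditioning path works whether the query embedding came from text, video, or audio.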
@inproceedings{kinfu2025motionbind,
  title={MotionBind: Multi-Modal Human Motion Alignment for Retrieval, Recognition, and Generation},
  author={Kaleab A Kinfu and Rene Vidal},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=sUjwDdyspc}
}
