Contributors: Be2R Lab (ITMO), SBER Robotics Center.
Full Abstract: We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems that require depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it well suited to ego-centric applications and to harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through an adaptive robust kernel driven by high-level features, which handles both moving objects and movable static objects (e.g., furniture being moved in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, whereas existing solutions either operate offline, require depth data and/or odometry estimates, or lack robustness to dynamic scenes. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, making it a good fit for autonomous robotics and AR/VR applications and advancing practical spatial intelligence for embodied AI.
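The adaptive robust kernel is the core coupling mechanism: per-point geometric residuals are down-weighted when high-level features suggest the point belongs to something that moves. A minimal sketch of that idea is below; it assumes a Huber-style kernel and a sigmoid gating of feature similarity, and every name in it (`dynamic_likelihood`, `adaptive_huber_weights`, the prototype set) is illustrative rather than the released implementation.

```python
# Illustrative sketch only: down-weight geometric residuals for points
# whose high-level (e.g., DINO) features look "dynamic". The Huber form
# and the sigmoid gating are assumptions, not KM-ViPE's exact kernel.
import numpy as np

def dynamic_likelihood(feats, dyn_prototypes, tau=0.1):
    """Max cosine similarity to 'dynamic' concept embeddings, squashed to (0, 1).

    feats:          (N, D) L2-normalized per-point features
    dyn_prototypes: (K, D) L2-normalized embeddings of dynamic concepts
    """
    sim = feats @ dyn_prototypes.T                      # (N, K) cosine similarities
    return 1.0 / (1.0 + np.exp(-(sim.max(axis=1) - 0.5) / tau))

def adaptive_huber_weights(residuals, p_dyn, delta0=1.0):
    """Huber-style IRLS weights whose width shrinks with dynamic likelihood.

    A point that looks dynamic gets a smaller delta, so its residual
    saturates earlier and contributes less to the pose update.
    """
    delta = delta0 * (1.0 - p_dyn) + 1e-3               # adaptive kernel width
    r = np.abs(residuals)
    return np.where(r <= delta, 1.0, delta / r)

# Toy usage: 4 tracked points; the last two lie on a "moving" object,
# so their features sit close to the dynamic prototypes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos = feats[2:] + 0.01 * rng.normal(size=(2, 16))
protos /= np.linalg.norm(protos, axis=1, keepdims=True)
weights = adaptive_huber_weights(rng.normal(size=4), dynamic_likelihood(feats, protos))
print(weights)  # the two dynamic points typically receive much smaller weights
```

In a tightly coupled setting these weights would multiply the corresponding residual blocks inside the bundle-adjustment objective on every iteration, rather than being applied as a one-off mask.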
## Installation
```bash
# Build a new Docker image
make build

# Run the Docker image
make DATA_DIR={YOUR_DATA_DIR} run

# Inside the container, install the package in editable mode
pip install --no-build-isolation -e .
```
## Usage
Example usages:
```bash
# Running the full pipeline.
python run.py pipeline=default streams=raw_mp4_stream streams.base_path=YOUR_VIDEO_OR_DIR_PATH
# Running the pose-only pipeline without depth estimation.
python run.py pipeline=default streams=raw_mp4_stream streams.base_path=YOUR_VIDEO_OR_DIR_PATH pipeline.post.depth_align_model=null
```
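To process a large collection of clips (e.g., for data harvesting), the commands above can be wrapped in a small driver. The script below is a hypothetical example, not part of the repository; it reuses only the command-line overrides shown above, and the `VIDEO_DIR` path is a placeholder.

```python
# Hypothetical batch driver: run the KM-ViPE pipeline over every .mp4
# in a directory, one process per clip.
import subprocess
from pathlib import Path

VIDEO_DIR = Path("/data/videos")  # placeholder; point this at your clips

for video in sorted(VIDEO_DIR.glob("*.mp4")):
    subprocess.run(
        [
            "python", "run.py",
            "pipeline=default",
            "streams=raw_mp4_stream",
            f"streams.base_path={video}",
        ],
        check=True,  # abort on the first failing clip
    )
```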
## Acknowledgements
KM-ViPE is built on top of many great open-source research projects and codebases.

## Citation
If you find KM-ViPE useful in your research or application, please consider citing the following paper:
```bibtex
@misc{nasser2025kmvipeonlinetightlycoupled,
  title={KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM},
  author={Zaid Nasser and Mikhail Iumanov and Tianhao Li and Maxim Popov and Jaafar Mahmoud and Malik Mohrat and Ilya Obrubov and Ekaterina Derevyanka and Ivan Sosin and Sergey Kolyubin},
  year={2025},
  eprint={2512.01889},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.01889},
}
```
## License
This project downloads and installs additional third-party models and software. Note that these models and software are not distributed by NVIDIA; review their license terms before use. This source code is released under the Apache 2.0 License.
