This repository contains the code accompanying our ICCV 2023 paper XVO: Generalized Visual Odometry via Cross-Modal Self-Training. Please visit our project page for more details.
We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. Our key contributions are twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task.
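For intuition, here is a minimal sketch of how auxiliary prediction tasks can be combined with the pose regression objective during self-training. The head names, loss functions, and weights below are illustrative placeholders, not the exact implementation in this repository:

```python
import torch
import torch.nn as nn

def xvo_self_training_loss(pred, target, weights=None):
    """Illustrative multi-task objective: pose regression plus auxiliary
    segmentation / flow / depth / audio prediction heads.

    `pred` and `target` are dicts keyed by task name; all tensors are
    assumed to be batched. Loss choices and weights are placeholders.
    """
    weights = weights or {'pose': 1.0, 'seg': 0.5, 'flow': 0.5,
                          'depth': 0.5, 'audio': 0.5}
    losses = {
        # 6-DoF relative pose (translation + rotation) regression
        'pose': nn.functional.mse_loss(pred['pose'], target['pose']),
        # auxiliary cross-modal tasks supervised by pseudo-labels
        'seg': nn.functional.cross_entropy(pred['seg'], target['seg']),
        'flow': nn.functional.l1_loss(pred['flow'], target['flow']),
        'depth': nn.functional.l1_loss(pred['depth'], target['depth']),
        'audio': nn.functional.mse_loss(pred['audio'], target['audio']),
    }
    total = sum(weights[k] * v for k, v in losses.items())
    return total, losses
```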
We use the KITTI, Argoverse 2, and nuScenes datasets along with in-the-wild YouTube videos. Please refer to their websites for dataset setup.
| Datasets | Download Link |
|---|---|
| KITTI | The KITTI dataset can be downloaded from the official source here. All other datasets, after processing, will adhere to the same directory structure as the KITTI dataset. |
| Argoverse 2 | The Argoverse 2 dataset can be downloaded from the official source here. Once downloaded, the subset corresponding to the VO task can be extracted using the provided script located in the data directory. |
| nuScenes | The nuScenes dataset can be downloaded from the official source here. Once downloaded, the subset corresponding to the VO task can be extracted using the provided script located in the data directory. |
| YouTube | Approximately 30 hours of driving footage were selected from videos published on the YouTube channel J Utah, featuring a diverse range of driving scenarios. A more comprehensive list of driving videos from YouTube can be found here. One way to download and extract frames is sketched below the table. |
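Since no extraction script is bundled for the YouTube data, the following is a minimal sketch of one way to download a video and dump frames for training. The tools (`yt-dlp`, `ffmpeg`), the video IDs, the frame rate, and the output layout are assumptions; adapt them to match the KITTI-style directory structure described below.

```python
import subprocess
from pathlib import Path

# Hypothetical video IDs; replace with the actual videos you select.
VIDEO_IDS = ["VIDEO_ID_00", "VIDEO_ID_01"]
OUT_ROOT = Path("data/YouTube/sequences")

for idx, vid in enumerate(VIDEO_IDS):
    seq_dir = OUT_ROOT / f"{idx:02d}" / "image_2"  # KITTI-style folder name (assumed)
    seq_dir.mkdir(parents=True, exist_ok=True)
    mp4 = OUT_ROOT / f"{idx:02d}.mp4"
    # Download the video with yt-dlp.
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", str(mp4),
                    f"https://www.youtube.com/watch?v={vid}"], check=True)
    # Extract frames at 10 fps (assumed) as zero-padded PNGs.
    subprocess.run(["ffmpeg", "-i", str(mp4), "-vf", "fps=10",
                    str(seq_dir / "%06d.png")], check=True)
```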
The directory structure within the data folder is organized as follows:
data/
├── KITTI/
│   ├── sequences/
│   └── poses/
├── Argoverse 2/
│   ├── sequences/
│   └── poses/
├── nuScenes/
│   ├── segmentations/
│   ├── sequences/
│   └── poses/

The nuScenes dataset requires segmentation labels to support the cross-modal self-training process. Precomputed segmentations are available here. Alternatively, users may regenerate the segmentation annotations using recent state-of-the-art segmentation models.
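Before training, a quick sanity check of the layout can save debugging time later. The snippet below is only an illustrative check, not part of the repository:

```python
from pathlib import Path

# Expected subdirectories per dataset, mirroring the tree above.
EXPECTED = {
    "KITTI": ["sequences", "poses"],
    "Argoverse 2": ["sequences", "poses"],
    "nuScenes": ["segmentations", "sequences", "poses"],
}

data_root = Path("data")
for dataset, subdirs in EXPECTED.items():
    for sub in subdirs:
        path = data_root / dataset / sub
        print(f"{path}: {'ok' if path.is_dir() else 'MISSING'}")
```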
# create a new environment
conda create -n XVO python=3.9
conda activate XVO
# install pytorch
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c iopath iopath
# install pytorch3d
wget https://anaconda.org/pytorch3d/pytorch3d/0.7.5/download/linux-64/pytorch3d-0.7.5-py39_cu117_pyt201.tar.bz2
conda install ./pytorch3d-0.7.5-py39_cu117_pyt201.tar.bz2
rm pytorch3d-0.7.5-py39_cu117_pyt201.tar.bz2
# export CUDA 11.7
export CUDA_HOME=/usr/local/cuda-11.7
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
pip install PyYAML==6.0.2 timm==1.0.16 matplotlib==3.5.3 pandas==2.3.0 opencv-python==4.11.0.86 a-unet==0.0.16 mmcv-full==1.7.2 numpy==1.26.4 pillow==11.0.0 av2==0.2.1 nuscenes-devkit==1.1.11
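After installation, a quick sanity check (purely illustrative) confirms that PyTorch sees the GPU and that PyTorch3D imports correctly:

```python
import torch
import pytorch3d

# Verify the core dependencies and CUDA availability.
print("torch:", torch.__version__)
print("pytorch3d:", pytorch3d.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```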
Install the correlation package

The correlation package must be installed first:

cd model/correlation_package
python setup.py install
Preprocess the dataset

The labels are available in the poses directory. To regenerate the labels or review the corresponding implementation details, please refer to the code and execute the following command:

python3 preprocess.py
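For reference, KITTI ground-truth poses are stored as one flattened 3x4 camera-to-world [R|t] matrix per line. A minimal sketch of turning them into frame-to-frame relative poses, the typical supervision signal for direct VO regression, is shown below; the exact label format produced by preprocess.py may differ:

```python
import numpy as np

def load_kitti_poses(path):
    """Read a KITTI poses file: each line is a flattened 3x4 [R|t] matrix."""
    poses = []
    with open(path) as f:
        for line in f:
            mat = np.array(line.split(), dtype=np.float64).reshape(3, 4)
            pose = np.eye(4)
            pose[:3, :4] = mat
            poses.append(pose)
    return poses

def relative_poses(poses):
    """Relative transform from frame i to frame i+1: T_rel = inv(T_i) @ T_{i+1}."""
    return [np.linalg.inv(a) @ b for a, b in zip(poses[:-1], poses[1:])]
```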
Download initial weights
Download the initial weights to the init_weights directory. The initial weights can be found here.
Run training
Supervised Training on KITTI:
# update params.py
self.train_video = {'KITTI': ['00', '02', '08', '09'],}
self.multi_modal = False
self.checkpoint_path = 'saved_models/xvo_kitti_sl'

Cross-Modal Self-Training on nuScenes and YouTube:

# update params.py
self.train_video = {
    'NUSC': nusc_scene_map['singapore-hollandvillage'],
    'YouTube': ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11'],
}
self.multi_modal = True
self.checkpoint_path = 'saved_models/xvo_nusc_ytb_ssl'

and run:
python3 main.py
Download model checkpoints to the saved_models directory. Model checkpoints can be found here.
Supervised Training on KITTI. We evaluate our model on the remaining KITTI sequences. Two checkpoints were released, achieving the following metrics respectively:
- Checkpoint 1:
  - Translation Error (t_err): 3.35
  - Rotation Error (r_err): 1.61
  - Absolute Trajectory Error (ATE): 13.29
  - Scale Error (s_err): 0.04
- Checkpoint 2:
  - Translation Error (t_err): 3.22
  - Rotation Error (r_err): 1.65
  - Absolute Trajectory Error (ATE): 12.94
  - Scale Error (s_err): 0.04
# update test_utils.py
par.multi_modal = False
par.checkpoint_path = "saved_models/xvo_kitti_sl"
par.test_video = {'KITTI': {'KITTI': ['03', '04', '05', '06', '07', '10']}}
Cross-Modal Self-Training on nuScenes and YouTube (we evaluate on KITTI, Argoverse 2, and the unseen regions of nuScenes):
# update test_utils.py
par.multi_modal = False
par.checkpoint_path = "saved_models/xvo_nusc_ytb_ssl"
# Evaluate the model on each of these datasets in turn (uncomment one at a time).
# par.test_video = {'ARGO2': {'ARGO2': [str(i).zfill(3) for i in range(150)]}}
# par.test_video = {'NUSC': {'NUSC': nusc_scene_map['boston-seaport']+nusc_scene_map['singapore-queenstown']+nusc_scene_map['singapore-onenorth']}}
# par.test_video = {'KITTI': {'KITTI': ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10']}}
and run:
python3 test_utils.py
cd odom-eval
# update eval.py
eval_dirs = ['xvo_kitti_sl']
and run:
python3 eval.py
The VO evaluation tool is adapted from https://github.com/Huangying-Zhan/kitti-odom-eval.
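For reference, the ATE reported above measures the RMSE between predicted and ground-truth camera positions after rigidly aligning the two trajectories. A minimal sketch of such a computation (not the exact implementation in odom-eval) is:

```python
import numpy as np

def ate_rmse(gt_xyz, pred_xyz):
    """Absolute Trajectory Error: RMSE of position differences after a
    least-squares rigid (rotation + translation) alignment of the
    predicted trajectory to the ground truth.

    gt_xyz, pred_xyz: (N, 3) arrays of camera positions.
    """
    mu_gt, mu_pred = gt_xyz.mean(0), pred_xyz.mean(0)
    gt_c, pred_c = gt_xyz - mu_gt, pred_xyz - mu_pred
    # Optimal rotation via SVD of the cross-covariance matrix (Kabsch).
    U, _, Vt = np.linalg.svd(gt_c.T @ pred_c)
    S = np.eye(3)
    S[2, 2] = np.sign(np.linalg.det(U @ Vt))
    R = U @ S @ Vt
    aligned = (R @ pred_c.T).T + mu_gt
    return np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean())
```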
We find that incorporating audio and segmentation tasks as part of the semi-supervised learning process significantly improves ego-pose estimation on KITTI.
If you have any questions or comments, please feel free to contact us at leilai@bu.edu or sgzk@bu.edu.
Our work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

