Daniel Korth1,2,*, Xavier Anadon1,3,*, Marc Pollefeys1,4, Zuria Bauer1, Daniel Barath1,5
1ETH Zurich, 2Technical University of Munich, 3University of Zaragoza, 4Microsoft, 5HUN-REN SZTAKI
*Equal contribution
(Teaser video: `teaser.mp4`)
Work done during 2-month ETH Summer Research Fellowship (SSRF & RSF). Advised by Zuria Bauer and Daniel Barath.
tl;dr: RGB-D recording + camera poses -> SAM2 Video Tracking -> Lift Mask + Features to 3D -> Scene Graph.
Please check the project page for more details.
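The core step of the tl;dr pipeline is lifting 2D masks into 3D. As a minimal sketch (function name and layout hypothetical, not the repo's actual code), the masked pixels of one frame can be backprojected to world coordinates using the depth map, the intrinsics `K`, and the camera-to-world pose `T_cw`:

```python
import numpy as np

def lift_mask_to_world(depth, mask, K, T_cw):
    """Backproject masked pixels to 3D world points.

    depth: (H, W) metric depth; mask: (H, W) bool;
    K: (3, 3) intrinsics; T_cw: (4, 4) camera-to-world pose.
    """
    v, u = np.nonzero(mask)                  # pixel rows / cols inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]          # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]          # Y = (v - cy) * Z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous camera coords
    return (T_cw @ pts_cam.T).T[:, :3]       # (N, 3) world coordinates
```

Per-object features (e.g. CLIP) can then be attached to these points, and masks tracked across frames become nodes of the scene graph.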
```bash
conda create -n dsg python=3.10
conda activate dsg
pip install -e .

# install sam2
cd sam2
pip install -e .

# download sam2 checkpoints
sh scripts/download_sam.sh
```

We use Hydra for configuration management.
Before running any scripts, you need to configure the paths and dataset settings:
Set the `PROJECT_ROOT` environment variable to point to your project directory:

```bash
export PROJECT_ROOT=/path/to/your/dsg/project
```

Or add it to your shell profile (e.g., `~/.bashrc` or `~/.zshrc`):

```bash
echo "export PROJECT_ROOT=/path/to/your/dsg/project" >> ~/.bashrc
source ~/.bashrc
```

Our data structure follows the ZED extraction scripts, but you can use your own RGB-D data. If you use a different format, adjust the paths in `configs/paths/default.yaml` and `configs/video_tracking.yaml`.
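Scripts resolve their paths under `PROJECT_ROOT`. A small sketch of how such resolution might look (the `project_path` helper is hypothetical, not part of the repo), which fails early with a clear message when the variable is unset:

```python
import os
from pathlib import Path

def project_path(*parts):
    """Resolve a path under PROJECT_ROOT, failing early if it is unset."""
    root = os.environ.get("PROJECT_ROOT")
    if root is None:
        raise RuntimeError("PROJECT_ROOT is not set; export it as shown above")
    return Path(root).joinpath(*parts)  # e.g. project_path("configs", "paths")
```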
Default structure (from ZED extraction):

```
data/
  zed/
    your_recording_name/
      images/                    # Original RGB images
      poses.txt                  # Camera poses
      images_undistorted_crop/   # Undistorted RGB + depth images (after undistortion)
        left000000.png           # Undistorted left camera images
        left000001.png
        ...
        leftXXXXXX.png
        depth000000.png          # Undistorted depth images
        depth000001.png
        ...
        depthXXXXXX.png
        intrinsics.txt           # Camera intrinsics (after undistortion)
```
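Given the layout above, RGB and depth frames pair up by their shared frame index. A sketch of such a loader (hypothetical helper, not the repo's actual code):

```python
from pathlib import Path

def list_rgbd_frames(recording_dir):
    """Pair leftXXXXXX.png with depthXXXXXX.png by frame index."""
    img_dir = Path(recording_dir) / "images_undistorted_crop"
    frames = []
    for rgb in sorted(img_dir.glob("left*.png")):
        idx = rgb.stem.removeprefix("left")   # e.g. "000042"
        depth = img_dir / f"depth{idx}.png"
        if depth.exists():                    # skip frames without a depth image
            frames.append((rgb, depth))
    return frames
```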
If you already have RGB-D images and camera poses:

- Run SAM2 multitrack segmentation:

  ```bash
  # Process every 10th frame with max 100 frames
  python dsg/video_tracking.py recording=<recording_name> subsample=10 max_frames=100
  ```

  Check `configs/video_tracking.yaml` for all configurations.

- Visualize and build the scene graph:

  ```bash
  # Basic visualization
  python dsg/viz_rerun.py recording=<recording_name>

  # Advanced visualization with graph updates
  python dsg/viz_rerun_teaser.py recording=<recording_name>

  # Text-based feature retrieval
  python dsg/viz_clip_similarity.py recording=<recording_name>

  # Object reconstruction
  python dsg/viz_obj_reconstruction.py recording=<recording_name>
  ```
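Text-based feature retrieval (`viz_clip_similarity.py`) presumably scores each object by the cosine similarity between a CLIP text embedding and the object's aggregated image features. A minimal sketch of that ranking step (function name and feature vectors hypothetical):

```python
import numpy as np

def rank_objects(text_feat, obj_feats):
    """Sort object indices by cosine similarity to a text feature.

    text_feat: (D,) query embedding; obj_feats: (N, D) per-object features.
    """
    t = text_feat / np.linalg.norm(text_feat)
    o = obj_feats / np.linalg.norm(obj_feats, axis=1, keepdims=True)
    sims = o @ t                       # cosine similarity per object
    return np.argsort(-sims), sims     # best match first
```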
If you have a raw ZED recording:

- Record data with the ZED Mini camera and save it as an `.svo2` file
- Extract frames and poses:

  ```bash
  bash scripts/extract_zed.sh
  ```

- Follow the steps above.
Our work builds heavily on foundation models such as SAM, CLIP, and SALAD. We thank the authors for their work and open-source code.
```bibtex
@article{korth2025dynamic,
  author = {Korth, Daniel and Anadon, Xavier and Pollefeys, Marc and Bauer, Zuria and Barath, Daniel},
  title  = {Dynamic 3D Scene Graphs from RGB-D},
  year   = {2025},
}
```