Xiaoxue Chen¹,²*, Ziyi Xiong¹,²*, Yuantao Chen¹, Gen Li¹, Nan Wang¹,
Hongcheng Luo², Long Chen², Haiyang Sun²†, Bing Wang², Guang Chen², Hangjun Ye²,✉,
Hongyang Li³, Ya-Qin Zhang¹, Hao Zhao¹,⁴,✉
¹ AIR, Tsinghua University
² Xiaomi EV
³ The University of Hong Kong
⁴ Beijing Academy of Artificial Intelligence
* These authors contributed equally † Project leader
We introduce DGGT, a fully pose-free feedforward framework for reconstructing dynamic driving scenes directly from unposed RGB images. The model predicts camera poses, 3D Gaussian maps, and dynamic motion in a single pass, without per-scene optimization or camera calibration.
Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that existing formulations, which treat camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
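At a glance, the interface described above is: unposed, uncalibrated RGB frames in; per-frame cameras, Gaussian maps, dynamics, and lifespans out, all from one forward pass. The shape-level sketch below only makes that input/output contract concrete; every tensor shape and channel count is an assumption for illustration, not the model's actual signature.

```python
# Shape-level sketch of the single-pass contract described above.
# All shapes, channel counts, and names are assumptions, not this repo's API.
import torch

B, T, V, H, W = 1, 4, 1, 256, 512        # batch, frames, cameras, image size
images = torch.rand(B, T, V, 3, H, W)    # unposed RGB input, no intrinsics/extrinsics

# What a single feedforward pass is expected to produce, per frame:
camera_poses  = torch.eye(4).expand(B, T, V, 4, 4)   # predicted camera parameters
gaussian_maps = torch.zeros(B, T, V, H, W, 14)       # pixel-aligned Gaussian attributes (channel count assumed)
dynamics      = torch.zeros(B, T, V, H, W, 3)        # motion from the lightweight dynamic head (assumed)
lifespans     = torch.zeros(B, T, V, H, W, 2)        # visibility over time from the lifespan head (assumed)
```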
- [√] Release pre-trained checkpoints on Waymo, nuScenes, and Argoverse2.
- [√] Release the inference code of our model to facilitate further research and reproducibility.
- [ ] Release the training code (after the paper is accepted).
This codebase supports the Waymo Open Dataset, nuScenes, and Argoverse2. We provide instructions and scripts for downloading and preprocessing these datasets (a quick sanity check of the resulting image directories is sketched after the table):
| Dataset | Instruction |
|---|---|
| Waymo | Data Process Instruction |
| NuScenes | Data Process Instruction |
| Argoverse2 | Data Process Instruction |
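The per-scene image folders produced by preprocessing are what you later pass to inference via --image_dir. As a quick sanity check you can count the images per scene, as in the sketch below; the layout it assumes (one subfolder per scene under the image directory) is illustrative only, so follow the per-dataset instructions above for the actual structure.

```python
# Quick sanity check of a preprocessed image directory (layout assumed for
# illustration only -- see the per-dataset processing instructions above).
from pathlib import Path

image_dir = Path("/path/to/images")  # same value later passed via --image_dir

for scene in sorted(p for p in image_dir.iterdir() if p.is_dir()):
    n_images = sum(len(list(scene.rglob(ext))) for ext in ("*.jpg", "*.png"))
    print(f"{scene.name}: {n_images} images")
```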
- Create the conda environment

```bash
conda create -n dggt python=3.10
conda activate dggt
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1
pip install -r requirements.txt
```

- Compile pointops2

```bash
cd third_party/pointops2
python setup.py install
cd ../..
```

Download our pretrained inference model (trained on the Waymo Open Dataset, 1 view) checkpoint here to pretrained/model_latest_waymo.pth.
Download our pretrained diffusion model checkpoint here to pretrained/diffusion_model.pth.
Download the TAPIP3D model checkpoint here to pretrained/tracking_model.pth.
Other checkpoints will be coming soon.
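Before running inference, it may help to verify the environment and checkpoint locations. The snippet below is a minimal sketch that only checks the PyTorch/CUDA installation and that the files downloaded above are in place; adjust the paths if you stored them elsewhere.

```python
# Minimal environment/checkpoint sanity check (illustrative; paths follow the
# download instructions above -- adjust if you placed the files elsewhere).
from pathlib import Path
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

for ckpt in [
    "pretrained/model_latest_waymo.pth",
    "pretrained/diffusion_model.pth",
    "pretrained/tracking_model.pth",
]:
    path = Path(ckpt)
    status = f"{path.stat().st_size / 1e6:.1f} MB" if path.exists() else "MISSING"
    print(f"{ckpt}: {status}")
```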
You can test existing models on the Waymo Open Dataset.

```bash
python inference.py \
--image_dir /path/to/images \
--scene_names 3 5 7 \
--input_views 1 \
--intervals 2 \
--sequence_length 4 \
--start_idx 0 \
--mode 2 \
--ckpt_path /path/to/checkpoint.pth \
--output_path /path/to/output \
-images \
-depth \
-diffusion \
-metrics
```

- `--image_dir <path>`: The directory containing the input images (required).
- `--scene_names <names>`: The scene names to process, supporting formats like `3 5 7` or `"(3,7)"` (required).
- `--mode <mode>`: The processing mode; acceptable values are 1 (train), 2 (reconstruction), or 3 (interpolation) (required).
- `--ckpt_path <path>`: Path to the pre-trained model weights file (required).
- `--output_path <path>`: Directory where the output results will be saved (required).
- `--input_views <views>`: Number of input cameras, e.g., 1 or 3 (required).
- `--intervals <interval>`: Interval between interpolated frames when performing frame interpolation (mode 3); defaults to 2 (optional).
- `--sequence_length <length>`: Number of input frames to consider for each inference; defaults to 4 (optional).
- `--start_idx <index>`: Starting index of the frames to process; defaults to 0 (optional).
- `-images`: When specified, outputs rendered images for each frame (optional).
- `-depth`: When specified, outputs depth maps in .npy format for each frame (optional); a loading sketch follows this list.
- `-metrics`: When specified, outputs evaluation metrics (PSNR, SSIM, LPIPS) after processing (optional).
- `-diffusion`: When specified, uses the diffusion model to refine the rendered images (time-consuming) (optional).
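When `-depth` is enabled, per-frame depth maps are written as .npy arrays under the output directory. The sketch below shows one way to load and inspect them; the exact filenames written by the script are not specified here, so it simply globs for .npy files.

```python
# Load a depth map produced with the -depth flag (illustrative; the exact
# filenames under --output_path are not specified here, so glob and pick one).
from pathlib import Path
import numpy as np

output_dir = Path("/path/to/output")
depth_files = sorted(output_dir.rglob("*.npy"))
print(f"found {len(depth_files)} depth files")

if depth_files:
    depth = np.load(depth_files[0])
    print(depth_files[0].name, depth.shape, depth.dtype,
          "range:", float(depth.min()), "-", float(depth.max()))
```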
Quantitative Comparison under Trained and Zero-Shot Settings on nuScenes and Argoverse2 datasets.
You can evaluate the model in two complementary settings to demonstrate both generalization and adaptability:
You can use the model trained on Waymo to perform inference directly on the Argoverse2 or nuScenes datasets — without any retraining or pose calibration.
This setting highlights the model’s strong cross-dataset generalization and robustness to unseen driving domains.
Argoverse2/nuScenes

```bash
python inference.py \
--image_dir /path/to/argoverse_or_nuscenes_images \
--scene_names 3 5 7 \
--input_views 1 \
--sequence_length 4 \
--start_idx 0 \
--mode 2 \
--ckpt_path /path/to/waymo_checkpoint.pth \
--output_path /path/to/output \
-images \
-depth \
-metrics
```

You can also train the model on the target dataset (e.g., Argoverse2) and evaluate it on the same domain.
This setting measures the model’s in-domain adaptability, showing its capacity to achieve state-of-the-art reconstruction quality when optimized for the target environment.
Argoverse2/nuScenes

```bash
python inference.py \
--image_dir /path/to/argoverse_or_nuscenes_images \
--scene_names 3 5 7 \
--input_views 1 \
--sequence_length 4 \
--start_idx 0 \
--mode 2 \
--ckpt_path /path/to/argoverse_or_nuscenes_checkpoint.pth \
--output_path /path/to/output
```

Together, these two experiments verify that our model not only generalizes well across unseen scenes, but also scales effectively to achieve top performance when fine-tuned on new domains.
If you find this project useful, please consider citing:
```bibtex
@article{chenfeedforward,
  title={Feedforward 4D Reconstruction for Dynamic Driving Scenes using Unposed Images},
  author={Chen, Xiaoxue and Xiong, Ziyi and Chen, Yuantao and Li, Gen and Wang, Nan and Luo, Hongcheng and Chen, Long and Sun, Haiyang and Wang, Bing and Chen, Guang and others}
}
```
This project is licensed under the Apache License 2.0. Some files in this repository are derived from VGGT (facebookresearch/vggt) and are licensed under the VGGT upstream license. See NOTICE for details.