This repository serves as the official resource collection for the paper "Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems".
In this work, we establish a systematic taxonomy for the field, unifying terminology, scope, and evaluation benchmarks. We organize existing methodologies into three complementary paradigms based on information flow and abstraction level:
- Single-Modality Pre-Training
  The Bedrock of Perception. Focuses on extracting foundational features from individual sensor streams (Camera or LiDAR) via self-supervised learning techniques such as Contrastive Learning, Masked Modeling, and Forecasting. This paradigm establishes the fundamental representations for sensor-specific tasks (a minimal contrastive-learning sketch follows this list).
- Multi-Modality Pre-Training
  Bridging the Semantic-Geometric Gap. Leverages cross-modal synergy to fuse heterogeneous sensor data. This category includes LiDAR-Centric (distilling visual semantics into geometry), Camera-Centric (injecting geometric priors into vision), and Unified frameworks that jointly learn modality-agnostic representations.
- Open-World Perception and Planning
  The Frontier of Embodied Autonomy. Represents the evolution from passive perception to active decision-making. This paradigm encompasses Generative World Models (e.g., video/occupancy generation), Embodied Vision-Language-Action (VLA) models, and systems capable of Open-World reasoning.
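As a concrete illustration of the contrastive pretext task mentioned in the first paradigm, here is a minimal, hypothetical sketch (not taken from any surveyed method): two augmented views of the same LiDAR scan are embedded, and matching views are pulled together while other scans in the batch are pushed apart. The `encoder` and `augment` names are placeholders for an arbitrary point-cloud backbone and augmentation pipeline.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of view embeddings of shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries (matching views) are positives; all other pairs are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical usage: `encoder` maps a (B, N, 3) point cloud to (B, D) embeddings,
# and `augment` applies random rotation, cropping, and jitter.
# loss = info_nce(encoder(augment(points)), encoder(augment(points)))
```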
Paper Link
If you find this work helpful for your research, please consider citing our paper:
@article{wang2026forging,
title = {Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems},
author = {Song Wang and Lingdong Kong and Xiaolu Liu and Hao Shi and Wentong Li and Jianke Zhu and Steven C. H. Hoi},
journal = {arXiv preprint arXiv:2512.24385},
year = {2025}
}

- 1. Benchmarks & Datasets
- 2. Single-Modality Pre-Training
- 3. Multi-Modality Pre-Training
- 4. Open-World Perception and Planning
- 5. Acknowledgements
| Dataset | Venue | Sensor | Task | Website |
|---|---|---|---|---|
| KITTI | CVPR'12 | 2 Cam(RGB), 2 Cam(Gray), 1 LiDAR(64) | 3D Det, Stereo, Optical Flow, SLAM | |
| ApolloScape | TPAMI'19 | 2 Cam, 2 LiDAR | 3D Det, HD Map | |
| nuScenes | CVPR'20 | 6 Cam(RGB), 1 LiDAR(32), 5 Radar | 3D Det, Seg, Occ, Map | |
| SemanticKITTI | ICCV'19 | 4 Cam, 1 LiDAR(64) | 3D Det, Occ | |
| Waymo | CVPR'20 | 5 Cam(RGB), 5 LiDAR | Perception (Det, Seg, Track), Motion | |
| Argoverse | CVPR'19 | 7 Cam(RGB), 2 LiDAR(32) | 3D Tracking, Forecasting, Map | |
| Lyft L5 | CoRL'20 | 7 Cam(RGB), 3 LiDAR, 5 Radar | 3D Det, Motion Forecasting/Planning | |
| A*3D | ICRA'20 | 2 Cam, 1 LiDAR(64) | 3D Det | |
| KITTI-360 | TPAMI'22 | 4 Cam, 1 LiDAR(64) | 3D Det, Occ | |
| A2D2 | arXiv'20 | 6 Cam, 5 LiDAR(16) | 3D Det | |
| PandaSet | ITSC'21 | 6 Cam(RGB), 2 LiDAR(64) | 3D Det, LiDAR Seg | |
| Cirrus | ICRA'21 | 1 Cam, 2 LiDAR(64) | 3D Det | |
| ONCE | NeurIPS'21 | 7 Cam(RGB), 1 LiDAR(40) | 3D Det (Self-supervised/Semi-supervised) | |
| Shifts | arXiv'21 | - | 3D Det, HD Map | |
| nuPlan | arXiv'21 | 8 Cam, 5 LiDAR | 3D Det, HD Map, E2E Plan | |
| Argoverse2 | NeurIPS'21 | 7 Cam, 2 LiDAR(32) | 3D Det, Occ, HD Map, E2E Plan | |
| MONA | ITSC'22 | 3 Cam | 3D Det, HD Map | |
| Dual Radar | Sci. Data'25 | 1 Cam, 1 LiDAR(80), 2 Radar | 3D Det | |
| MAN TruckScenes | NeurIPS'24 | 4 Cam, 6 LiDAR(64), 6 Radar | 3D Det | |
| OmniHD-Scenes | arXiv'24 | 6 Cam, 1 LiDAR(128), 6 Radar | 3D Det, Occ, HD Map | |
| AevaScenes | 2025 | 6 Cam, 6 LiDAR | 3D Det, HD Map | |
| PhysicalAI-AV | 2025 | 7 Cam, 1 LiDAR, 11 Radar | E2E Plan | |
| Dataset | Venue | Sensor | Task | Website |
|---|---|---|---|---|
| Campus | ECCV'16 | 1 Cam | Target Forecasting/Tracking | |
| UAV123 | ECCV'16 | 1 Cam | UAV Tracking | |
| CarFusion | CVPR'18 | 22 Cam | 3D Vehicle Reconstruction | |
| UAVDT | ECCV'18 | 1 Cam | 2D Object Detection/Tracking | |
| DOTA | CVPR'18 | Multi-Source | 2D Object Detection | |
| VisDrone | TPAMI'21 | 1 Cam | 2D Object Detection/Tracking | |
| DOTA V2.0 | TPAMI'21 | Multi-Source | 2D Object Detection | |
| MOR-UAV | MM'20 | 1 Cam | Moving Object Recognition | |
| AU-AIR | ICRA'20 | 1 Cam | 2D Object Detection | |
| UAVid | ISPRS JPRS'20 | 1 Cam | Semantic Segmentation | |
| MOHR | Neuro'21 | 3 Cam | 2D Object Detection | |
| SensatUrban | CVPR'21 | 1 Cam | 2D Object Detection | |
| UAVDark135 | TMC'22 | 1 Cam | 2D Object Tracking | |
| MAVREC | CVPR'24 | 1 Cam | 2D Object Detection | |
| BioDrone | IJCV'24 | 1 Cam | 2D Object Tracking | |
| PDT | ECCV'24 | 1 Cam, 1 LiDAR | 2D Object Detection | |
| UAV3D | NeurIPS'24 | 5 Cam | 3D Object Detection/Tracking | |
| IndraEye | arXiv'24 | 1 Cam | 2D Object Detection/Semantic Segmentation | |
| UAVScenes | ICCV'25 | 1 Cam, 1 LiDAR | Semantic Segmentation, Visual Localization | |
| Dataset | Venue | Platform | Sensors | Website |
|---|---|---|---|---|
| RailSem19 | CVPRW'19 | Railway | 1× Camera | |
| FRSign | arXiv'20 | Railway | 2× Camera (Stereo) | |
| RAWPED | TVT'20 | Railway | 1× Camera | |
| SRLC | AutCon'21 | Railway | 1× LiDAR | |
| Rail-DB | MM'22 | Railway | 1× Camera | |
| RailSet | IPAS'22 | Railway | 1× Camera | |
| OSDaR23 | ICRAE'23 | Railway | 9× Camera, 6× LiDAR, 1× Radar | |
| Rail3D | Infra'24 | Railway | 4× Camera, 1× LiDAR | |
| WHU-Railway3D | TITS'24 | Railway | 1× LiDAR | |
| FloW | ICCV'21 | USV (Water) | 2× Camera, 1× 4D Radar | |
| DartMouth | IROS'21 | USV (Water) | 3× Camera, 1× LiDAR | |
| MODS | TITS'21 | USV (Water) | 2× Camera, 1× LiDAR | |
| SeaSAW | CVPRW'22 | USV (Water) | 5× Camera | |
| WaterScenes | T-ITS'24 | USV (Water) | 1× Camera, 1× 4D Radar | |
| MVDD13 | Appl. Ocean Res.'24 | USV (Water) | 1× Camera | |
| SeePerSea | TFR'25 | USV (Water) | 1× Camera, 1× LiDAR | |
| WaterVG | TITS'25 | USV (Water) | 1× Camera, 1× 4D Radar | |
| Han et al. | NMI'24 | Legged Robot | 1× Depth Camera | |
| Luo et al. | CVPR'25 | Legged Robot | 1× Panoramic Camera | |
| QuadOcc | arXiv'25 | Legged Robot | 1× Panoramic Camera, 1× LiDAR | |
| M3ED | CVPRW'23 | Multi-Robot | 3× Camera, 2× Event Camera, 1× LiDAR | |
| Pi3DET | ICCV'25 | Multi-Robot | 3× Camera, 2× Event Camera, 1× LiDAR | |
Methods utilizing Point Cloud Contrastive Learning, Masked Autoencoders (MAE), or Forecasting.
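To make the masked-modeling branch of this family concrete, below is a toy sketch under simplified assumptions (a single pooled code rather than per-token decoding, and hypothetical layer sizes); it hides a fraction of the input points and regresses their coordinates from the visible ones. It is illustrative only, not the recipe of any listed method.

```python
import torch
import torch.nn as nn

class MaskedPointAutoencoder(nn.Module):
    """Toy MAE-style pretext task: mask a fraction of points and reconstruct their
    coordinates from a pooled code of the visible points (illustrative only)."""
    def __init__(self, dim=128, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, points):                               # points: (B, N, 3)
        B, N, _ = points.shape
        n_vis = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=points.device).argsort(dim=1)
        gather = lambda idx: torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
        visible, masked = gather(perm[:, :n_vis]), gather(perm[:, n_vis:])
        code = self.encoder(visible).max(dim=1).values       # (B, dim) pooled scene code
        pred = self.decoder(code).unsqueeze(1)               # crude single prediction per scan
        return ((pred - masked) ** 2).mean()                 # loss only on the masked points
```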
Self-supervised learning from image sequences for driving/robotics; a minimal temporal-ordering sketch follows the table below.
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| INoD | Injected Noise Discriminator for Self-Supervised Representation | RA-L 2023 | |
| TempO | Self-Supervised Representation Learning From Temporal Ordering | RA-L 2024 | |
| LetsMap | Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping | ECCV 2024 | |
| NeRF-MAE | Masked AutoEncoders for Self-Supervised 3D Representation Learning | ECCV 2024 | |
| VisionPAD | A Vision-Centric Pre-training Paradigm for Autonomous Driving | arXiv 2024 | |
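The sketch below illustrates the kind of temporal-ordering pretext task referenced above; it assumes per-frame features from an arbitrary image backbone and hypothetical dimensions, and is not an implementation of any listed method.

```python
import torch
import torch.nn as nn

class TemporalOrderHead(nn.Module):
    """Toy pretext task: given features of a short clip, predict whether the
    frames are in their original temporal order or have been shuffled."""
    def __init__(self, feat_dim=256, clip_len=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim * clip_len, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, frame_feats):                  # frame_feats: (B, T, feat_dim)
        return self.classifier(frame_feats.flatten(1))

# Hypothetical usage: shuffle the frame order for half of the batch (label 0), keep
# the rest ordered (label 1), and train the backbone plus this head with cross-entropy.
```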
Enhancing LiDAR representations using Vision foundation models (Knowledge Distillation).
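A minimal sketch of this distillation idea, assuming known camera calibration and a frozen image backbone as the teacher (all names and shapes below are illustrative, not taken from any specific method):

```python
import torch
import torch.nn.functional as F

def distill_2d_to_3d(point_feats, img_feats, uv):
    """Pull per-point LiDAR features toward image features sampled at the points'
    projected pixel locations (illustrative 2D-to-3D feature distillation).

    point_feats: (N, D) features from the LiDAR (student) encoder
    img_feats:   (D, H, W) feature map from a frozen vision foundation model (teacher)
    uv:          (N, 2) projected pixel coordinates, normalized to [-1, 1]
    """
    # Bilinearly sample the teacher feature map at each point's projection.
    target = F.grid_sample(img_feats.unsqueeze(0), uv.view(1, 1, -1, 2),
                           align_corners=False)                           # (1, D, 1, N)
    target = F.normalize(target.squeeze(0).squeeze(1).t().detach(), dim=-1)  # (N, D)
    student = F.normalize(point_feats, dim=-1)
    return (1.0 - (student * target).sum(dim=-1)).mean()                  # cosine-distance loss
```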
Learning 3D Geometry from Camera inputs using LiDAR supervision.
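A common instantiation is sparse depth supervision: LiDAR points are projected into the image and used as targets for a predicted depth map. The sketch below assumes points already transformed into the camera frame and a standard pinhole model, and is illustrative only.

```python
import torch
import torch.nn.functional as F

def lidar_depth_loss(pred_depth, points_cam, K):
    """Sparse depth supervision from projected LiDAR returns (illustrative sketch).

    pred_depth: (H, W) depth map predicted by the camera branch
    points_cam: (N, 3) LiDAR points in the camera frame (z is depth)
    K:          (3, 3) pinhole camera intrinsics
    """
    H, W = pred_depth.shape
    z = points_cam[:, 2]
    uv = (points_cam @ K.t())[:, :2] / z.clamp(min=1e-6).unsqueeze(-1)   # pinhole projection
    u, v = uv[:, 0].round().long(), uv[:, 1].round().long()
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)            # keep in-image points
    return F.l1_loss(pred_depth[v[valid], u[valid]], z[valid])
```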
Joint optimization of multi-modal encoders for unified representations.
Incorporating additional modalities into pre-training frameworks for representation learning.
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| RadarContrast | Self-Supervised Contrastive Learning for Camera-to-Radar Knowledge Distillation | DCOSS-IoT 2024 | |
| AssociationNet | Radar Camera Fusion via Representation Learning in Autonomous Driving | CVPRW 2021 | |
| MVRAE | Multi-View Radar Autoencoder for Self-Supervised Automotive Radar Representation Learning | IV 2024 | |
| SSRLD | Self-Supervised Representation Learning for the Object Detection of Marine Radar | ICCAI 2022 | |
| U-MLPNet | Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception | Remote Sens 2024 | |
| 4D-ROLLS | 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision | arXiv 2025 | |
| SS-RODNet | Pre-Training for mmWave Radar Object Detection Through Masked Image Modeling | - | |
| Radical | Bootstrapping Autonomous Driving Radars with Self-Supervised Learning | CVPR 2024 | |
| RiCL | Leveraging Self-Supervised Instance Contrastive Learning for Radar Object Detection | arXiv 2024 | |
| RSLM | Radar Spectra-Language Model for Automotive Scene Parsing | RADAR 2024 | |
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| ECDP | Event Camera Data Pre-training | ICCV 2023 | |
| MEM | Masked Event Modeling: Self-Supervised Pretraining for Event Cameras | WACV 2024 | |
| DMM | Data-Efficient Event Camera Pre-training via Disentangled Masked Modeling | arXiv 2024 | |
| STP | Enhancing Event Camera Data Pretraining via Prompt-Tuning with Visual Models | - | |
| ECDDP | Event Camera Data Dense Pre-training | ECCV 2024 | |
| EventBind | EventBind: Learning a Unified Representation to Bind Them All for Event-Based Open-World Understanding | ECCV 2024 | |
| EventFly | EventFly: Event Camera Perception from Ground to the Sky | CVPR 2025 | |
We thank the authors of the referenced papers for their open-source contributions.
