
Forging Spatial Intelligence

A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems


Taxonomy of Spatial Intelligence
Figure 1: Taxonomy of Multi-Modal Representation Learning for Spatial Intelligence.

This repository serves as the official resource collection for the paper "Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems".

In this work, we establish a systematic taxonomy for the field, unifying terminology, scope, and evaluation benchmarks. We organize existing methodologies into three complementary paradigms based on information flow and abstraction level:

  • 📷 Single-Modality Pre-Training
    The Bedrock of Perception. Focuses on extracting foundational features from individual sensor streams (Camera or LiDAR) via self-supervised learning techniques such as Contrastive Learning, Masked Modeling, and Forecasting. This paradigm establishes the fundamental representations for sensor-specific tasks.
  • 🔄 Multi-Modality Pre-Training
    Bridging the Semantic-Geometric Gap. Leverages cross-modal synergy to fuse heterogeneous sensor data. This category includes LiDAR-Centric (distilling visual semantics into geometry), Camera-Centric (injecting geometric priors into vision), and Unified frameworks that jointly learn modality-agnostic representations.
  • 🌍 Open-World Perception and Planning
    The Frontier of Embodied Autonomy. Represents the evolution from passive perception to active decision-making. This paradigm encompasses Generative World Models (e.g., video/occupancy generation), Embodied Vision-Language-Action (VLA) models, and systems capable of Open-World reasoning.

📄 Paper Link


Citation

If you find this work helpful for your research, please consider citing our paper:

@article{wang2026forging,
    title   = {Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems},
    author  = {Song Wang and Lingdong Kong and Xiaolu Liu and Hao Shi and Wentong Li and Jianke Zhu and Steven C. H. Hoi},
    journal = {arXiv preprint arXiv:2512.24385},
    year    = {2025}
}

Table of Contents

1. Benchmarks & Datasets
2. Single-Modality Pre-Training
3. Multi-Modality Pre-Training
4. Open-World Perception and Planning
5. Acknowledgements

1. Benchmarks & Datasets

Vehicle-Based Datasets

| Dataset | Venue | Sensor | Task | Website |
| --- | --- | --- | --- | --- |
| KITTI | CVPR'12 | 2 Cam(RGB), 2 Cam(Gray), 1 LiDAR(64) | 3D Det, Stereo, Optical Flow, SLAM | Website |
| ApolloScape | TPAMI'19 | 2 Cam, 2 LiDAR | 3D Det, HD Map | Website |
| nuScenes | CVPR'20 | 6 Cam(RGB), 1 LiDAR(32), 5 Radar | 3D Det, Seg, Occ, Map | Website |
| SemanticKITTI | ICCV'19 | 4 Cam, 1 LiDAR(64) | 3D Det, Occ | Website |
| Waymo | CVPR'20 | 5 Cam(RGB), 5 LiDAR | Perception (Det, Seg, Track), Motion | Website |
| Argoverse | CVPR'19 | 7 Cam(RGB), 2 LiDAR(32) | 3D Tracking, Forecasting, Map | Website |
| Lyft L5 | CoRL'20 | 7 Cam(RGB), 3 LiDAR, 5 Radar | 3D Det, Motion Forecasting/Planning | Website |
| A*3D | ICRA'20 | 2 Cam, 1 LiDAR(64) | 3D Det | Website |
| KITTI-360 | TPAMI'22 | 4 Cam, 1 LiDAR(64) | 3D Det, Occ | Website |
| A2D2 | arXiv'20 | 6 Cam, 5 LiDAR(16) | 3D Det | Website |
| PandaSet | ITSC'21 | 6 Cam(RGB), 2 LiDAR(64) | 3D Det, LiDAR Seg | Website |
| Cirrus | ICRA'21 | 1 Cam, 2 LiDAR(64) | 3D Det | Website |
| ONCE | NeurIPS'21 | 7 Cam(RGB), 1 LiDAR(40) | 3D Det (Self-supervised/Semi-supervised) | Website |
| Shifts | arXiv'21 | | 3D Det, HD Map | Website |
| nuPlan | arXiv'21 | 8 Cam, 5 LiDAR | 3D Det, HD Map, E2E Plan | Website |
| Argoverse2 | NeurIPS'21 | 7 Cam, 2 LiDAR(32) | 3D Det, Occ, HD Map, E2E Plan | Website |
| MONA | ITSC'22 | 3 Cam | 3D Det, HD Map | Website |
| Dual Radar | Sci. Data'25 | 1 Cam, 1 LiDAR(80), 2 Radar | 3D Det | Website |
| MAN TruckScenes | NeurIPS'24 | 4 Cam, 6 LiDAR(64), 6 Radar | 3D Det | Website |
| OmniHD-Scenes | arXiv'24 | 6 Cam, 1 LiDAR(128), 6 Radar | 3D Det, Occ, HD Map | Website |
| AevaScenes | 2025 | 6 Cam, 6 LiDAR | 3D Det, HD Map | Website |
| PhysicalAI-AV | 2025 | 7 Cam, 1 LiDAR, 11 Radar | E2E Plan | Website |

Drone-Based Datasets

| Dataset | Venue | Sensor | Task | Website |
| --- | --- | --- | --- | --- |
| Campus | ECCV'16 | 1 Cam | Target Forecasting / Tracking | Website |
| UAV123 | ECCV'16 | 1 Cam | UAV Tracking | Website |
| CarFusion | CVPR'18 | 22 Cam | 3D Vehicle Reconstruction | Website |
| UAVDT | ECCV'18 | 1 Cam | 2D Object Detection / Tracking | Website |
| DOTA | CVPR'18 | Multi-Source | 2D Object Detection | Website |
| VisDrone | TPAMI'21 | 1 Cam | 2D Object Detection / Tracking | Website |
| DOTA V2.0 | TPAMI'21 | Multi-Source | 2D Object Detection | Website |
| MOR-UAV | MM'20 | 1 Cam | Moving Object Recognition | Website |
| AU-AIR | ICRA'20 | 1 Cam | 2D Object Detection | Website |
| UAVid | ISPRS JPRS'20 | 1 Cam | Semantic Segmentation | Website |
| MOHR | Neuro'21 | 3 Cam | 2D Object Detection | Website |
| SensatUrban | CVPR'21 | 1 Cam | Semantic Segmentation | Website |
| UAVDark135 | TMC'22 | 1 Cam | 2D Object Tracking | Website |
| MAVREC | CVPR'24 | 1 Cam | 2D Object Detection | Website |
| BioDrone | IJCV'24 | 1 Cam | 2D Object Tracking | Website |
| PDT | ECCV'24 | 1 Cam, 1 LiDAR | 2D Object Detection | Website |
| UAV3D | NeurIPS'24 | 5 Cam | 3D Object Detection / Tracking | Website |
| IndraEye | arXiv'24 | 1 Cam | 2D Object Detection / Semantic Segmentation | Website |
| UAVScenes | ICCV'25 | 1 Cam, 1 LiDAR | Semantic Segmentation, Visual Localization | Website |

Other Robotic Platforms

| Dataset | Venue | Platform | Sensors | Website |
| --- | --- | --- | --- | --- |
| RailSem19 | CVPRW'19 | Railway | 1× Camera | Website |
| FRSign | arXiv'20 | Railway | 2× Camera (Stereo) | Website |
| RAWPED | TVT'20 | Railway | 1× Camera | Website |
| SRLC | AutCon'21 | Railway | 1× LiDAR | |
| Rail-DB | MM'22 | Railway | 1× Camera | Website |
| RailSet | IPAS'22 | Railway | 1× Camera | |
| OSDaR23 | ICRAE'23 | Railway | 9× Camera, 6× LiDAR, 1× Radar | Website |
| Rail3D | Infra'24 | Railway | 4× Camera, 1× LiDAR | Website |
| WHU-Railway3D | TITS'24 | Railway | 1× LiDAR | Website |
| FloW | ICCV'21 | USV (Water) | 2× Camera, 1× 4D Radar | Website |
| DartMouth | IROS'21 | USV (Water) | 3× Camera, 1× LiDAR | Website |
| MODS | TITS'21 | USV (Water) | 2× Camera, 1× LiDAR | Website |
| SeaSAW | CVPRW'22 | USV (Water) | 5× Camera | Website |
| WaterScenes | TITS'24 | USV (Water) | 1× Camera, 1× 4D Radar | Website |
| MVDD13 | Appl. Ocean Res.'24 | USV (Water) | 1× Camera | Website |
| SeePerSea | TFR'25 | USV (Water) | 1× Camera, 1× LiDAR | Website |
| WaterVG | TITS'25 | USV (Water) | 1× Camera, 1× 4D Radar | Website |
| Han et al. | NMI'24 | Legged Robot | 1× Depth Camera | Website |
| Luo et al. | CVPR'25 | Legged Robot | 1× Panoramic Camera | Website |
| QuadOcc | arXiv'25 | Legged Robot | 1× Panoramic Camera, 1× LiDAR | Website |
| M3ED | CVPRW'23 | Multi-Robot | 3× Camera, 2× Event Camera, 1× LiDAR | Website |
| Pi3DET | ICCV'25 | Multi-Robot | 3× Camera, 2× Event Camera, 1× LiDAR | Website |

2. Single-Modality Pre-Training

LiDAR-Only

Methods utilizing Point Cloud Contrastive Learning, Masked Autoencoders (MAE), or Forecasting.
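Many of the contrastive entries below (PointContrast, SegContrast, BEVContrast, …) share an InfoNCE-style objective at their core: matched point or voxel features from two augmented views of a scene are pulled together, while all other pairings act as negatives. A minimal NumPy sketch under illustrative shapes; the `info_nce` helper is hypothetical, not any paper's reference code:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE: row i of `positives` is the positive for row i of `anchors`;
    every other row serves as a negative (PointContrast-style pairing)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))      # positives sit on the diagonal

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32))                   # toy per-segment embeddings
loss_aligned = info_nce(feats, feats)              # two identical "views": low loss
loss_random = info_nce(feats, rng.normal(size=(8, 32)))  # unrelated views: high loss
```

In the papers above, the two views come from augmented LiDAR scans and the embeddings from a 3D backbone; the temperature and the positive-pairing strategy (points, segments, or BEV cells) vary per method.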

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| PointContrast | Unsupervised Pre-training for 3D Point Cloud Understanding | ECCV 2020 | GitHub |
| DepthContrast | Self-supervised Pretraining of 3D Features on any Point-Cloud | ICCV 2021 | GitHub |
| GCC-3D | Exploring Geometry-Aware Contrast and Clustering Harmonization for Self-Supervised 3D Object Detection | ICCV 2021 | |
| ContrastiveSceneContexts | Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts | CVPR 2021 | GitHub |
| SegContrast | 3D Point Cloud Feature Representation Learning through Self-supervised Segment Discrimination | RA-L 2021 | GitHub |
| GroupContrast | GroupContrast: Semantic-Aware Self-Supervised Representation Learning for 3D Understanding | CVPR 2024 | GitHub |
| ProposalContrast | Unsupervised Pre-training for LiDAR-Based 3D Object Detection | ECCV 2022 | GitHub |
| Occupancy-MAE | Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders | T-IV 2023 | GitHub |
| ALSO | Automotive LiDAR Self-supervision by Occupancy Estimation | CVPR 2023 | GitHub |
| GD-MAE | Generative Decoder for MAE Pre-training on LiDAR Point Clouds | CVPR 2023 | GitHub |
| AD-PT | Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset | NeurIPS 2023 | GitHub |
| E-SSL | Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection | ECCV 2024 | |
| PatchContrast | Self-Supervised Pre-training for 3D Object Detection | CVPRW 2025 | |
| MV-JAR | MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training | CVPR 2023 | GitHub |
| CORE | CORE: Cooperative Reconstruction for Multi-Agent Perception | ICCV 2023 | GitHub |
| MAELi | Masked Autoencoder for Large-Scale LiDAR Point Clouds | WACV 2024 | |
| BEV-MAE | Bird's Eye View Masked Autoencoders for Point Cloud Pre-training | AAAI 2024 | GitHub |
| AD-L-JEPA | AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding Predictive Architecture for Autonomous Driving with LiDAR Data | AAAI 2026 | GitHub |
| UnO | Unsupervised Occupancy Fields for Perception and Forecasting | CVPR 2024 | |
| BEVContrast | Self-Supervision in BEV Space for Automotive LiDAR Point Clouds | 3DV 2024 | GitHub |
| 4DContrast | 4DContrast: Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding | ECCV 2022 | GitHub |
| Copilot4D | Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion | ICLR 2024 | |
| T-MAE | Temporal Masked Autoencoders for Point Cloud Representation Learning | ECCV 2024 | GitHub |
| PICTURE | Point Cloud Reconstruction Is Insufficient to Learn 3D Representations | ACM MM 2024 | |
| LSV-MAE | Rethinking Masked-Autoencoder-Based 3D Point Cloud Pretraining | IV 2024 | |
| UNIT | Unsupervised Online Instance Segmentation through Time | arXiv 2024 | GitHub |
| R-MAE | Sense Less, Generate More: Pre-training LiDAR Perception with Masked Autoencoders | arXiv 2024 | GitHub |
| TurboTrain | TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction | ICCV 2025 | GitHub |
| NOMAE | Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds | CVPR 2025 | |
| 4D Occ | Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting | CVPR 2023 | GitHub |
| GPICTURE | Mutual Information-Driven Self-Supervised Point Cloud Pre-training | KBS 2025 | |
| CooPre | CooPre: Cooperative Pretraining for V2X Cooperative Perception | IROS 2025 | GitHub |
| TREND | TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception | arXiv 2024 | |
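For the masked-modeling entries above (Occupancy-MAE, BEV-MAE, T-MAE, …), the pretext task itself is simple: hide part of the input, reconstruct it, and score only the hidden positions. A toy 1D stand-in, assuming a smooth field and using linear interpolation in place of a learned decoder:

```python
import numpy as np

# Toy masked-modeling pretext task: hide 60% of a smooth 1D "occupancy" signal,
# reconstruct the hidden values from the visible ones, and compute the
# reconstruction loss only at the masked positions.
rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 200)
signal = np.sin(x)                      # stand-in for a smooth scene field
mask = rng.random(x.size) < 0.6         # True = hidden from the "encoder"
recon = np.interp(x[mask], x[~mask], signal[~mask])  # interpolation "decoder"
loss = float(np.mean((recon - signal[mask]) ** 2))   # MSE on masked positions only
```

The real methods operate on sparse voxel or BEV grids with learned encoder-decoder networks, but the shape of the objective (reconstruction error restricted to masked regions) is the same.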

Camera-Only

Self-supervised learning from image sequences for driving/robotics.

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| INoD | Injected Noise Discriminator for Self-Supervised Representation | RA-L 2023 | GitHub |
| TempO | Self-Supervised Representation Learning From Temporal Ordering | RA-L 2024 | |
| LetsMap | Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping | ECCV 2024 | |
| NeRF-MAE | Masked AutoEncoders for Self-Supervised 3D Representation Learning | ECCV 2024 | GitHub |
| VisionPAD | A Vision-Centric Pre-training Paradigm for Autonomous Driving | arXiv 2024 | |

3. Multi-Modality Pre-Training

LiDAR-Centric Pre-Training

Enhancing LiDAR representations using Vision foundation models (Knowledge Distillation).
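A common core of these distillation methods (e.g. SLidR-style pipelines) is pairing each 3D point with the 2D foundation-model feature it projects onto, then minimizing a cosine distance between the LiDAR student and the frozen image teacher. A NumPy sketch; the `point2pixel` correspondences, feature dimensions, and `distill_loss` helper are illustrative assumptions, not any paper's reference code:

```python
import numpy as np

def distill_loss(point_feats, image_feats, point2pixel):
    """Image-to-LiDAR distillation sketch: pull each point's feature toward
    the 2D teacher feature at its projected pixel (1 - cosine similarity).
    `point2pixel[i]` is the (assumed precomputed) pixel index for point i."""
    target = image_feats[point2pixel]                  # (N, D) teacher features
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=1)))

rng = np.random.default_rng(1)
img_feats = rng.normal(size=(100, 64))    # flattened H*W grid of teacher features
idx = rng.integers(0, 100, size=50)       # point-to-pixel correspondences
student = img_feats[idx] + 0.01 * rng.normal(size=(50, 64))  # nearly aligned student
loss = distill_loss(student, img_feats, idx)
```

Actual methods add superpixel/superpoint pooling (SLidR), semantic tolerance (ST-SLidR), or temporal consistency (SuperFlow) on top of this basic alignment term.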

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| SLidR | Image-to-Lidar Self-Supervised Distillation | CVPR 2022 | GitHub |
| SimIPU | SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations | AAAI 2022 | GitHub |
| SSPC-Im | Self-supervised Pre-training of 3D Point Cloud Networks with Image Data | CoRL 2022 | |
| ST-SLidR | Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss | CVPR 2023 | |
| I2P-MAE | Learning 3D Representations from 2D Pre-trained Models via Image-to-Point MAE | CVPR 2023 | GitHub |
| TriCC | Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast | CVPR 2023 | |
| Seal | Segment Any Point Cloud Sequences by Distilling Vision FMs | NeurIPS 2023 | GitHub |
| PRED | Pre-training via Semantic Rendering on LiDAR Point Clouds | NeurIPS 2023 | |
| LiMA | Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations | ICCV 2025 | GitHub |
| ImageTo360 | 360° from a Single Camera: A Few-Shot Approach for LiDAR Segmentation | ICCVW 2023 | |
| ScaLR | Three Pillars Improving Vision Foundation Model Distillation for Lidar | CVPR 2024 | |
| CSC | Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception | CVPR 2024 | GitHub |
| GPC | Pre-Training LiDAR-Based 3D Object Detectors Through Colorization | ICLR 2024 | GitHub |
| Cross-Modal SSL | Cross-Modal Self-Supervised Learning with Effective Contrastive Units | IROS 2024 | GitHub |
| SuperFlow | 4D Contrastive Superflows are Dense 3D Representation Learners | ECCV 2024 | GitHub |
| Rel | Image-to-Lidar Relational Distillation for Autonomous Driving Data | ECCV 2024 | |
| HVDistill | Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation | IJCV 2024 | GitHub |
| RadarContrast | Self-Supervised Contrastive Learning for Camera-to-Radar Knowledge Distillation | DCOSS-IoT 2024 | |
| CM3D | Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection | CoRL 2024 | GitHub |
| OLIVINE | Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models | NeurIPS 2024 | GitHub |
| EUCA-3DP | Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining | arXiv 2024 | |
| GASP | GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving | arXiv 2025 | GitHub |
| BALViT | Label-Efficient LiDAR Scene Understanding with 2D-3D Vision Transformer Adapters | ICRAW 2025 | |

Camera-Centric Pre-Training

Learning 3D Geometry from Camera inputs using LiDAR supervision.
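The LiDAR supervision these methods rely on is typically obtained by projecting the point cloud into the camera to form sparse depth targets. A minimal pinhole-projection sketch, assuming points already transformed into the camera frame and a hypothetical intrinsics matrix `K`:

```python
import numpy as np

def lidar_to_sparse_depth(points_cam, K, h, w):
    """Project LiDAR points (in the camera frame) through intrinsics K to build
    a sparse depth map, the supervision signal used by many camera-centric
    pre-training methods. Standalone sketch, not a library API."""
    z = points_cam[:, 2]
    keep = z > 0                                    # only points in front of camera
    uvw = (K @ points_cam[keep].T).T                # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w))
    depth[v[inside], u[inside]] = z[keep][inside]   # last write wins per pixel
    return depth

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 10.0],    # projects to the principal point
                [1.0, 0.5, 20.0],
                [0.0, 0.0, -5.0]])   # behind the camera: discarded
depth = lidar_to_sparse_depth(pts, K, 480, 640)
```

Production pipelines additionally handle the LiDAR-to-camera extrinsics, per-pixel z-buffering for occlusions, and rolling-shutter/ego-motion compensation, which this sketch omits.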

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| DD3D | Is Pseudo-Lidar Needed for Monocular 3D Object Detection? | ICCV 2021 | GitHub |
| DEPT | Delving into the Pre-training Paradigm of Monocular 3D Object Detection | arXiv 2022 | |
| OccNet | Scene as Occupancy | ICCV 2023 | GitHub |
| GeoMIM | Towards Better 3D Knowledge Transfer via Masked Image Modeling | ICCV 2023 | GitHub |
| GAPretrain | Geometric-aware Pretraining for Vision-centric 3D Object Detection | arXiv 2023 | GitHub |
| UniScene | Multi-Camera Unified Pre-training via 3D Scene Reconstruction | RA-L 2024 | GitHub |
| SelfOcc | Self-Supervised Vision-Based 3D Occupancy Prediction | CVPR 2024 | GitHub |
| ViDAR | Visual Point Cloud Forecasting Enables Scalable Autonomous Driving | CVPR 2024 | GitHub |
| DriveWorld | 4D Pre-trained Scene Understanding via World Models | CVPR 2024 | |
| OccFeat | Self-supervised Occupancy Feature Prediction for Pretraining BEV Segmentation | CVPRW 2024 | |
| OccWorld | Learning a 3D Occupancy World Model for Autonomous Driving | ECCV 2024 | GitHub |
| MVS3D | Exploiting the Potential of Multi-Frame Stereo Depth Estimation Pre-training | IJCNN 2024 | |
| OccSora | 4D Occupancy Generation Models as World Simulators | arXiv 2024 | GitHub |
| MIM4D | Masked Modeling with Multi-View Video for Autonomous Driving | arXiv 2024 | GitHub |
| GaussianPretrain | A Simple Unified 3D Gaussian Representation for Visual Pre-training | arXiv 2024 | GitHub |
| S3PT | S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving | WACV 2025 | |
| UniFuture | Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception | arXiv 2025 | GitHub |
| GaussianOcc | GaussianOcc: Fully Self-Supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting | ICCV 2025 | GitHub |
| GaussianTR | GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding | CVPR 2025 | GitHub |
| DistillNeRF | DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features | NeurIPS 2024 | GitHub |

Unified Pre-Training

Joint optimization of multi-modal encoders for unified representations.

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| PonderV2 | Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm | arXiv 2023 | GitHub |
| UniPAD | A Universal Pre-training Paradigm for Autonomous Driving | CVPR 2024 | GitHub |
| UniM2AE | Multi-Modal Masked Autoencoders with Unified 3D Representation | ECCV 2024 | GitHub |
| ConDense | Consistent 2D/3D Pre-training for Dense and Sparse Features | ECCV 2024 | |
| NS-MAE | Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception | arXiv 2024 | GitHub |
| BEVWorld | A Multimodal World Simulator for Autonomous Driving via Unified BEV Latent Space | arXiv 2024 | GitHub |
| CLAP | CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning | arXiv 2024 | |
| GS3 | Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting | arXiv 2024 | |
| Hermes | Hermes: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation | ICCV 2025 | GitHub |
| LRS4Fusion | Self-Supervised Sparse Sensor Fusion for Long Range Perception | ICCV 2025 | GitHub |
| Gaussian2Scene | Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting | arXiv 2025 | |

Incorporating Additional Sensors: With Radar

Incorporating additional modalities into pre-training frameworks for representation learning.

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| RadarContrast | Self-Supervised Contrastive Learning for Camera-to-Radar Knowledge Distillation | DCOSS-IoT 2024 | |
| AssociationNet | Radar Camera Fusion via Representation Learning in Autonomous Driving | CVPRW 2021 | |
| MVRAE | Multi-View Radar Autoencoder for Self-Supervised Automotive Radar Representation Learning | IV 2024 | |
| SSRLD | Self-Supervised Representation Learning for the Object Detection of Marine Radar | ICCAI 2022 | |
| U-MLPNet | Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception | Remote Sens. 2024 | |
| 4D-ROLLS | 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision | arXiv 2025 | GitHub |
| SS-RODNet | Pre-Training for mmWave Radar Object Detection Through Masked Image Modeling | | SS-RODNet |
| Radical | Bootstrapping Autonomous Driving Radars with Self-Supervised Learning | CVPR 2024 | GitHub |
| RiCL | Leveraging Self-Supervised Instance Contrastive Learning for Radar Object Detection | arXiv 2024 | |
| RSLM | Radar Spectra-Language Model for Automotive Scene Parsing | RADAR 2024 | |

Incorporating Additional Sensors: With Event Camera

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| ECDP | Event Camera Data Pre-training | ICCV 2023 | GitHub |
| MEM | Masked Event Modeling: Self-Supervised Pretraining for Event Cameras | WACV 2024 | GitHub |
| DMM | Data-Efficient Event Camera Pre-training via Disentangled Masked Modeling | arXiv 2024 | |
| STP | Enhancing Event Camera Data Pretraining via Prompt-Tuning with Visual Models | - | |
| ECDDP | Event Camera Data Dense Pre-training | ECCV 2024 | GitHub |
| EventBind | EventBind: Learning a Unified Representation to Bind Them All for Event-Based Open-World Understanding | ECCV 2024 | GitHub |
| EventFly | EventFly: Event Camera Perception from Ground to the Sky | CVPR 2025 | |

4. Open-World Perception and Planning

Text-Grounded Understanding

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| CLIP2Scene | Towards Label-efficient 3D Scene Understanding by CLIP | CVPR 2023 | GitHub |
| OpenScene | 3D Scene Understanding with Open Vocabularies | CVPR 2023 | GitHub |
| CLIP-ZSPCS | Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation | MM 2023 | |
| CLIP-FO3D | Learning Free Open-world 3D Scene Representations from 2D Dense CLIP | ICCVW 2023 | |
| POP-3D | Open-Vocabulary 3D Occupancy Prediction from Images | NeurIPS 2023 | GitHub |
| VLM2Scene | Self-Supervised Image-Text-LiDAR Learning with Foundation Models | AAAI 2024 | GitHub |
| IntraCorr3D | Hierarchical Intra-Modal Correlation Learning for Label-Free 3D Semantic Segmentation | CVPR 2024 | |
| SAL | Better Call SAL: Towards Learning to Segment Anything in Lidar | ECCV 2024 | GitHub |
| Affinity3D | Propagating Instance-Level Semantic Affinity for Zero-Shot Semantic Segmentation | ACM MM 2024 | |
| UOV | 3D Unsupervised Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving | arXiv 2024 | GitHub |
| OVO | OVO: Open-Vocabulary Occupancy | arXiv 2023 | GitHub |
| LangOcc | LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering | 3DV 2025 | GitHub |
| VEON | VEON: Vocabulary-Enhanced Occupancy Prediction | ECCV 2024 | |
| LOcc | Language Driven Occupancy Prediction | ICCV 2025 | GitHub |
| UP-VL | Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving | ICCV 2023 | |
| ZPCS-MM | See More and Know More: Zero-Shot Point Cloud Segmentation via Multi-Modal Visual Data | ICCV 2023 | |
| CNS | Towards Label-Free Scene Understanding by Vision Foundation Models | NeurIPS 2023 | GitHub |
| 3DOV-VLD | 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation | ECCV 2024 | |
| CLIP^2 | CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data | CVPR 2023 | |
| AdaCo | AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction | AAAI 2025 | |
| TT-Occ | TT-Occ: Test-Time Compute for Self-Supervised Occupancy via Spatio-Temporal Gaussian Splatting | arXiv 2025 | |
| AutoOcc | AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting | ICCV 2025 | |

Unified World Representation for Action

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| OccWorld | OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving | ECCV 2024 | GitHub |
| GenAD | Generalized Predictive Model for Autonomous Driving | CVPR 2024 | GitHub |
| OccSora | OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving | arXiv 2024 | GitHub |
| OccLLaMA | OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving | arXiv 2024 | |
| OccVAR | OccVAR: Scalable 4D Occupancy Prediction via Next-Scale Prediction | - | |
| RenderWorld | RenderWorld: World Model with Self-Supervised 3D Label | ICRA 2025 | |
| Drive-OccWorld | Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving | AAAI 2025 | GitHub |
| LAW | Enhancing End-to-End Autonomous Driving with Latent World Model | ICLR 2025 | GitHub |
| FSF-Net | FSF-Net: Enhance 4D Occupancy Forecasting with Coarse BEV Scene Flow for Autonomous Driving | PR 2025 | |
| DriveX | DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving | arXiv 2025 | |
| SPOT | SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving | TPAMI 2025 | GitHub |
| WoTE | End-to-End Driving with Online Trajectory Evaluation via BEV World Model | ICCV 2025 | GitHub |
| FASTopoWM | FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models | arXiv 2025 | GitHub |
| OccTENS | OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction | arXiv 2025 | |
| OccVLA | OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision | arXiv 2025 | |
| World4Drive | World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model | ICCV 2025 | GitHub |

5. Acknowledgements

We thank the authors of the referenced papers for their open-source contributions.