This repository serves as the official resource collection for the paper "Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems".
In this work, we establish a systematic taxonomy for the field, unifying terminology, scope, and evaluation benchmarks. We organize existing methodologies into three complementary paradigms based on information flow and abstraction level:
- Single-Modality Pre-Training
  The Bedrock of Perception. Focuses on extracting foundational features from individual sensor streams (Camera or LiDAR) via self-supervised learning techniques such as Contrastive Learning, Masked Modeling, and Forecasting. This paradigm establishes the fundamental representations for sensor-specific tasks (a minimal contrastive-learning sketch follows this list).
- Multi-Modality Pre-Training
  Bridging the Semantic-Geometric Gap. Leverages cross-modal synergy to fuse heterogeneous sensor data. This category includes LiDAR-Centric (distilling visual semantics into geometry), Camera-Centric (injecting geometric priors into vision), and Unified frameworks that jointly learn modality-agnostic representations.
- Open-World Perception and Planning
  The Frontier of Embodied Autonomy. Represents the evolution from passive perception to active decision-making. This paradigm encompasses Generative World Models (e.g., video/occupancy generation), Embodied Vision-Language-Action (VLA) models, and systems capable of Open-World reasoning.
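As a concrete illustration of the contrastive pretext task mentioned in the first paradigm, here is a minimal, hypothetical sketch (not taken from any surveyed method): two augmented views of the same LiDAR scan are embedded, and matching views are pulled together while other scans in the batch are pushed apart. The `encoder` and `augment` names are placeholders for an arbitrary point-cloud backbone and augmentation pipeline.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of view embeddings of shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries (matching views) are positives; all other pairs are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical usage: `encoder` maps a (B, N, 3) point cloud to (B, D) embeddings,
# and `augment` applies random rotation, cropping, and jitter.
# loss = info_nce(encoder(augment(points)), encoder(augment(points)))
```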
Paper Link
If you find this work helpful for your research, please consider citing our paper:
@article{wang2026forging,
title = {Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems},
author = {Song Wang and Lingdong Kong and Xiaolu Liu and Hao Shi and Wentong Li and Jianke Zhu and Steven C. H. Hoi},
journal = {arXiv preprint arXiv:2512.24385},
year = {2025}
}

- 1. Benchmarks & Datasets
- 2. Single-Modality Pre-Training
- 3. Multi-Modality Pre-Training
- 4. Open-World Perception and Planning
- 5. Acknowledgements
| Dataset | Venue | Sensor | Task | Website |
|---|---|---|---|---|
| KITTI | CVPR'12 | 2 Cam(RGB), 2 Cam(Gray), 1 LiDAR(64) | 3D Det, Stereo, Optical Flow, SLAM | |
| ApolloScape | TPAMI'19 | 2 Cam, 2 LiDAR | 3D Det, HD Map | |
| nuScenes | CVPR'20 | 6 Cam(RGB), 1 LiDAR(32), 5 Radar | 3D Det, Seg, Occ, Map | |
| SemanticKITTI | ICCV'19 | 4 Cam, 1 LiDAR(64) | 3D Det, Occ | |
| Waymo | CVPR'20 | 5 Cam(RGB), 5 LiDAR | Perception (Det, Seg, Track), Motion | |
| Argoverse | CVPR'19 | 7 Cam(RGB), 2 LiDAR(32) | 3D Tracking, Forecasting, Map | |
| Lyft L5 | CoRL'20 | 7 Cam(RGB), 3 LiDAR, 5 Radar | 3D Det, Motion Forecasting/Planning | |
| A*3D | ICRA'20 | 2 Cam, 1 LiDAR(64) | 3D Det | |
| KITTI-360 | TPAMI'22 | 4 Cam, 1 LiDAR(64) | 3D Det, Occ | |
| A2D2 | arXiv'20 | 6 Cam, 5 LiDAR(16) | 3D Det | |
| PandaSet | ITSC'21 | 6 Cam(RGB), 2 LiDAR(64) | 3D Det, LiDAR Seg | |
| Cirrus | ICRA'21 | 1 Cam, 2 LiDAR(64) | 3D Det | |
| ONCE | NeurIPS'21 | 7 Cam(RGB), 1 LiDAR(40) | 3D Det (Self-supervised/Semi-supervised) | |
| Shifts | arXiv'21 | - | 3D Det, HD Map | |
| nuPlan | arXiv'21 | 8 Cam, 5 LiDAR | 3D Det, HD Map, E2E Plan | |
| Argoverse2 | NeurIPS'21 | 7 Cam, 2 LiDAR(32) | 3D Det, Occ, HD Map, E2E Plan | |
| MONA | ITSC'22 | 3 Cam | 3D Det, HD Map | |
| Dual Radar | Sci. Data'25 | 1 Cam, 1 LiDAR(80), 2 Radar | 3D Det | |
| MAN TruckScenes | NeurIPS'24 | 4 Cam, 6 LiDAR(64), 6 Radar | 3D Det | |
| OmniHD-Scenes | arXiv'24 | 6 Cam, 1 LiDAR(128), 6 Radar | 3D Det, Occ, HD Map | |
| AevaScenes | 2025 | 6 Cam, 6 LiDAR | 3D Det, HD Map | |
| PhysicalAI-AV | 2025 | 7 Cam, 1 LiDAR, 11 Radar | E2E Plan | |
| Dataset | Venue | Sensor | Task | Website |
|---|---|---|---|---|
| Campus | ECCV'16 | 1 Cam | Target Forecasting/Tracking | |
| UAV123 | ECCV'16 | 1 Cam | UAV Tracking | |
| CarFusion | CVPR'18 | 22 Cam | 3D Vehicle Reconstruction | |
| UAVDT | ECCV'18 | 1 Cam | 2D Object Detection/Tracking | |
| DOTA | CVPR'18 | Multi-Source | 2D Object Detection | |
| VisDrone | TPAMI'21 | 1 Cam | 2D Object Detection/Tracking | |
| DOTA V2.0 | TPAMI'21 | Multi-Source | 2D Object Detection | |
| MOR-UAV | MM'20 | 1 Cam | Moving Object Recognition | |
| AU-AIR | ICRA'20 | 1 Cam | 2D Object Detection | |
| UAVid | ISPRS JPRS'20 | 1 Cam | Semantic Segmentation | |
| MOHR | Neuro'21 | 3 Cam | 2D Object Detection | |
| SensatUrban | CVPR'21 | 1 Cam | 2D Object Detection | |
| UAVDark135 | TMC'22 | 1 Cam | 2D Object Tracking | |
| MAVREC | CVPR'24 | 1 Cam | 2D Object Detection | |
| BioDrone | IJCV'24 | 1 Cam | 2D Object Tracking | |
| PDT | ECCV'24 | 1 Cam, 1 LiDAR | 2D Object Detection | |
| UAV3D | NeurIPS'24 | 5 Cam | 3D Object Detection/Tracking | |
| IndraEye | arXiv'24 | 1 Cam | 2D Object Detection/Semantic Segmentation | |
| UAVScenes | ICCV'25 | 1 Cam, 1 LiDAR | Semantic Segmentation, Visual Localization | |
| Dataset | Venue | Platform | Sensors | Website |
|---|---|---|---|---|
| RailSem19 | CVPRW'19 | Railway | 1× Camera | |
| FRSign | arXiv'20 | Railway | 2× Camera (Stereo) | |
| RAWPED | TVT'20 | Railway | 1× Camera | |
| SRLC | AutCon'21 | Railway | 1× LiDAR | |
| Rail-DB | MM'22 | Railway | 1× Camera | |
| RailSet | IPAS'22 | Railway | 1× Camera | |
| OSDaR23 | ICRAE'23 | Railway | 9× Camera, 6× LiDAR, 1× Radar | |
| Rail3D | Infra'24 | Railway | 4× Camera, 1× LiDAR | |
| WHU-Railway3D | TITS'24 | Railway | 1× LiDAR | |
| FloW | ICCV'21 | USV (Water) | 2× Camera, 1× 4D Radar | |
| DartMouth | IROS'21 | USV (Water) | 3× Camera, 1× LiDAR | |
| MODS | TITS'21 | USV (Water) | 2× Camera, 1× LiDAR | |
| SeaSAW | CVPRW'22 | USV (Water) | 5× Camera | |
| WaterScenes | T-ITS'24 | USV (Water) | 1× Camera, 1× 4D Radar | |
| MVDD13 | Appl. Ocean Res.'24 | USV (Water) | 1× Camera | |
| SeePerSea | TFR'25 | USV (Water) | 1× Camera, 1× LiDAR | |
| WaterVG | TITS'25 | USV (Water) | 1× Camera, 1× 4D Radar | |
| Han et al. | NMI'24 | Legged Robot | 1× Depth Camera | |
| Luo et al. | CVPR'25 | Legged Robot | 1× Panoramic Camera | |
| QuadOcc | arXiv'25 | Legged Robot | 1× Panoramic Camera, 1× LiDAR | |
| M3ED | CVPRW'23 | Multi-Robot | 3× Camera, 2× Event Camera, 1× LiDAR | |
| Pi3DET | ICCV'25 | Multi-Robot | 3× Camera, 2× Event Camera, 1× LiDAR | |
Methods utilizing Point Cloud Contrastive Learning, Masked Autoencoders (MAE), or Forecasting.
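To make the masked-modeling branch of this family concrete, below is a toy sketch under simplified assumptions (a single pooled code rather than per-token decoding, and hypothetical layer sizes); it hides a fraction of the input points and regresses their coordinates from the visible ones. It is illustrative only, not the recipe of any listed method.

```python
import torch
import torch.nn as nn

class MaskedPointAutoencoder(nn.Module):
    """Toy MAE-style pretext task: mask a fraction of points and reconstruct their
    coordinates from a pooled code of the visible points (illustrative only)."""
    def __init__(self, dim=128, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, points):                               # points: (B, N, 3)
        B, N, _ = points.shape
        n_vis = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=points.device).argsort(dim=1)
        gather = lambda idx: torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
        visible, masked = gather(perm[:, :n_vis]), gather(perm[:, n_vis:])
        code = self.encoder(visible).max(dim=1).values       # (B, dim) pooled scene code
        pred = self.decoder(code).unsqueeze(1)               # crude single prediction per scan
        return ((pred - masked) ** 2).mean()                 # loss only on the masked points
```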
Self-supervised learning from image sequences for driving/robotics; a minimal temporal-ordering sketch follows the table below.
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| INoD | Injected Noise Discriminator for Self-Supervised Representation | RA-L 2023 | |
| TempO | Self-Supervised Representation Learning From Temporal Ordering | RA-L 2024 | |
| LetsMap | Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping | ECCV 2024 | |
| NeRF-MAE | Masked AutoEncoders for Self-Supervised 3D Representation Learning | ECCV 2024 | |
| VisionPAD | A Vision-Centric Pre-training Paradigm for Autonomous Driving | arXiv 2024 | |
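The sketch below illustrates the kind of temporal-ordering pretext task referenced above; it assumes per-frame features from an arbitrary image backbone and hypothetical dimensions, and is not an implementation of any listed method.

```python
import torch
import torch.nn as nn

class TemporalOrderHead(nn.Module):
    """Toy pretext task: given features of a short clip, predict whether the
    frames are in their original temporal order or have been shuffled."""
    def __init__(self, feat_dim=256, clip_len=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim * clip_len, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, frame_feats):                  # frame_feats: (B, T, feat_dim)
        return self.classifier(frame_feats.flatten(1))

# Hypothetical usage: shuffle the frame order for half of the batch (label 0), keep
# the rest ordered (label 1), and train the backbone plus this head with cross-entropy.
```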
Enhancing LiDAR representations using Vision foundation models (Knowledge Distillation).
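A minimal sketch of this distillation idea, assuming known camera calibration and a frozen image backbone as the teacher (all names and shapes below are illustrative, not taken from any specific method):

```python
import torch
import torch.nn.functional as F

def distill_2d_to_3d(point_feats, img_feats, uv):
    """Pull per-point LiDAR features toward image features sampled at the points'
    projected pixel locations (illustrative 2D-to-3D feature distillation).

    point_feats: (N, D) features from the LiDAR (student) encoder
    img_feats:   (D, H, W) feature map from a frozen vision foundation model (teacher)
    uv:          (N, 2) projected pixel coordinates, normalized to [-1, 1]
    """
    # Bilinearly sample the teacher feature map at each point's projection.
    target = F.grid_sample(img_feats.unsqueeze(0), uv.view(1, 1, -1, 2),
                           align_corners=False)                           # (1, D, 1, N)
    target = F.normalize(target.squeeze(0).squeeze(1).t().detach(), dim=-1)  # (N, D)
    student = F.normalize(point_feats, dim=-1)
    return (1.0 - (student * target).sum(dim=-1)).mean()                  # cosine-distance loss
```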
Learning 3D Geometry from Camera inputs using LiDAR supervision.
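A common instantiation is sparse depth supervision: LiDAR points are projected into the image and used as targets for a predicted depth map. The sketch below assumes points already transformed into the camera frame and a standard pinhole model, and is illustrative only.

```python
import torch
import torch.nn.functional as F

def lidar_depth_loss(pred_depth, points_cam, K):
    """Sparse depth supervision from projected LiDAR returns (illustrative sketch).

    pred_depth: (H, W) depth map predicted by the camera branch
    points_cam: (N, 3) LiDAR points in the camera frame (z is depth)
    K:          (3, 3) pinhole camera intrinsics
    """
    H, W = pred_depth.shape
    z = points_cam[:, 2]
    uv = (points_cam @ K.t())[:, :2] / z.clamp(min=1e-6).unsqueeze(-1)   # pinhole projection
    u, v = uv[:, 0].round().long(), uv[:, 1].round().long()
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)            # keep in-image points
    return F.l1_loss(pred_depth[v[valid], u[valid]], z[valid])
```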
Joint optimization of multi-modal encoders for unified representations.
Incorporating additional modalities into pre-training frameworks for representation learning.
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| RadarContrast | Self-Supervised Contrastive Learning for Camera-to-Radar Knowledge Distillation | DCOSS-IoT 2024 | |
| AssociationNet | Radar Camera Fusion via Representation Learning in Autonomous Driving | CVPRW 2021 | |
| MVRAE | Multi-View Radar Autoencoder for Self-Supervised Automotive Radar Representation Learning | IV 2024 | |
| SSRLD | Self-Supervised Representation Learning for the Object Detection of Marine Radar | ICCAI 2022 | |
| U-MLPNet | Learning Omni-Dimensional Spatio-Temporal Dependencies for Millimeter-Wave Radar Perception | Remote Sens 2024 | |
| 4D-ROLLS | 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision | arXiv 2025 | |
| SS-RODNet | Pre-Training for mmWave Radar Object Detection Through Masked Image Modeling | - | |
| Radical | Bootstrapping Autonomous Driving Radars with Self-Supervised Learning | CVPR 2024 | |
| RiCL | Leveraging Self-Supervised Instance Contrastive Learning for Radar Object Detection | arXiv 2024 | |
| RSLM | Radar Spectra-Language Model for Automotive Scene Parsing | RADAR 2024 | |
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| ECDP | Event Camera Data Pre-training | ICCV 2023 | |
| MEM | Masked Event Modeling: Self-Supervised Pretraining for Event Cameras | WACV 2024 | |
| DMM | Data-Efficient Event Camera Pre-training via Disentangled Masked Modeling | arXiv 2024 | |
| STP | Enhancing Event Camera Data Pretraining via Prompt-Tuning with Visual Models | - | |
| ECDDP | Event Camera Data Dense Pre-training | ECCV 2024 | |
| EventBind | EventBind: Learning a Unified Representation to Bind Them All for Event-Based Open-World Understanding | ECCV 2024 | |
| EventFly | EventFly: Event Camera Perception from Ground to the Sky | CVPR 2025 | |
We thank the authors of the referenced papers for their open-source contributions.
