# Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Chenxi Song1, Yanming Yang1, Tong Zhao1, Ruibo Li2, Chi Zhang1*
1AGI Lab, Westlake University    2Nanyang Technological University
*Corresponding Author
- [2026.02] Accepted by CVPR 2026!
- [2026.02] Code released! VGGT 3D warping, DepthCrafter 4D warping, and Wan2.1 & LongCat-Video inference are now available.
- [2025.09] arXiv preprint is available.
- [2025.09] Project page is online.
## Introduction
WorldForge is a training-free framework that unlocks the world-modeling potential of video diffusion models for controllable 3D/4D generation. By leveraging latent world priors in pretrained video diffusion models, WorldForge achieves precise trajectory control and photorealistic content generation, all without any additional training.
The core idea is a warping-and-repainting pipeline with three inference-time mechanisms (IRR, FLF, DSG) that jointly inject trajectory-aligned guidance while preserving visual fidelity. For details, please refer to our paper and project page.
## What's Released
| Module | Description |
|---|---|
| `vggt/` | VGGT-based 3D scene warping from a single image |
| `DepthCrafter/` | DepthCrafter-based 4D warping for dynamic video scenes |
| `wan_for_worldforge/` | WorldForge inference with Wan-2.1 (480p & 720p) |
| `longcat_for_worldforge/` | WorldForge inference with LongCat-Video (480p, distilled & upscaling) |
Tips:
- LongCat-Video supports a distilled mode (16 steps) with 480p-to-720p upscaling, offering faster generation. It works well on realistic scenes; for stylized or non-photorealistic content, Wan2.1 720p may yield better results.
- Wan2.1 720p (50 steps) generally delivers higher visual quality, at the cost of longer inference time.
## Installation

```shell
conda create -n worldforge python=3.11 -y
conda activate worldforge

# Choose your CUDA version from https://pytorch.org
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install xformers --index-url https://download.pytorch.org/whl/cu124

# flash-attn (use a pre-built wheel; source builds may have ABI issues)
# Torch 2.6 + CUDA 12 + Python 3.11:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl --no-deps
# Other combos: https://github.com/Dao-AILab/flash-attention/releases (pick cxx11abiFALSE)

conda install -c nvidia cuda-toolkit -y
export TORCH_CUDA_ARCH_LIST=$(python -c "import torch; c=torch.cuda.get_device_capability(); print(f'{c[0]}.{c[1]}')")
export MAX_JOBS=$(( $(nproc) / 2 ))
pip install --no-build-isolation "git+https://github.com/facebookresearch/pytorch3d.git@stable"
pip install --no-build-isolation git+https://github.com/cvg/LightGlue.git
```

- Wan-2.1: Download model weights from the official Wan-2.1 repository. You will need `Wan2.1-I2V-14B-480P` and/or `Wan2.1-I2V-14B-720P`.
- LongCat-Video: Download model weights from the official LongCat-Video repository. Both standard and distilled checkpoints are supported.

Set the model weight paths in the corresponding `run_test_case.sh` scripts before running.
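When picking a flash-attn wheel for a different torch/CUDA/Python combination, the filename follows a fixed pattern. The helper below is an illustrative sketch (not part of WorldForge or flash-attention) that assembles a wheel filename matching the naming scheme on the releases page:

```python
# Illustrative helper (not part of WorldForge): build the flash-attn
# pre-built wheel filename for a given version/CUDA/torch/Python combo,
# following the naming scheme used on the flash-attention releases page.
def flash_attn_wheel(version: str, cuda: str, torch_ver: str, py_tag: str,
                     cxx11abi: bool = False) -> str:
    abi = "TRUE" if cxx11abi else "FALSE"
    return (f"flash_attn-{version}+cu{cuda}torch{torch_ver}"
            f"cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl")

# Reproduces the wheel used in the install commands above:
print(flash_attn_wheel("2.7.4.post1", "12", "2.6", "cp311"))
```

Match the printed name against the assets on the releases page before installing; as noted above, pick the `cxx11abiFALSE` variant.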
## Quick Start

**VGGT 3D scene warping** (generate warped frames from single/few images):

```shell
bash vggt/run_test_case.sh
```

**DepthCrafter 4D dynamic warping** (warp a video sequence with a depth-guided trajectory):

```shell
bash DepthCrafter/run_test_case.sh
```

**Wan2.1 WorldForge inference** (high-quality video generation; edit `MODELS_DIR` in the script first):

```shell
bash wan_for_worldforge/run_test_case.sh
```

**LongCat-Video WorldForge inference** (fast generation with optional distillation & upscaling; edit `CHECKPOINT_DIR` first):

```shell
bash longcat_for_worldforge/run_test_case.sh
```

### LongCat Interactive Notebook
We also provide a Jupyter notebook (`longcat_for_worldforge/longcat_interactive.ipynb`) for continuous inference: models are loaded once and kept in VRAM, so you can run multiple experiments without reloading weights each time. See the parameter details in the sections below.
The overall pipeline is: warping → prompt → video inference. First generate warped frames and masks with a 3D vision model, then feed them into a video diffusion model for repainting.
```shell
cd vggt
python run_warp.py \
    --image_path <your_image_folder> \
    --output_path <output_folder> \
    --camera <camera_index> \
    --direction <direction> \
    --degree <angle> \
    --frame_single <num_frames> \
    --look_at_depth <depth>
```

Key parameters
| Parameter | Description |
|---|---|
| `--camera` | Which input view to use as the source (index starting from 0) |
| `--direction` | Camera motion direction: `left`, `right`, `up`, `down`, `forward`, `backward` |
| `--degree` | Rotation angle in degrees (or movement ratio for `forward`/`backward`) |
| `--frame_single` | Number of output frames to generate |
| `--look_at_depth` | Focus depth coefficient (1.0 = scene mean depth; smaller = closer focus) |
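To build intuition for how `--direction`, `--degree`, and `--look_at_depth` interact, here is a minimal sketch (assumed geometry, not WorldForge's actual implementation) of an orbiting camera: the camera starts at the origin looking down +z and orbits a pivot placed on the view axis at `look_at_depth` times the mean scene depth:

```python
import math

# Illustrative sketch (not WorldForge's actual code): map direction/degree/
# look_at_depth to a new camera position orbiting a look-at pivot.
def orbit_camera(direction: str, degree: float, look_at_depth: float,
                 mean_depth: float = 1.0) -> tuple[float, float, float]:
    pivot_z = look_at_depth * mean_depth          # point the camera keeps framing
    theta = math.radians(degree)
    if direction in ("left", "right"):            # yaw around the vertical axis
        sign = 1.0 if direction == "left" else -1.0
        x = pivot_z * math.sin(sign * theta)
        z = pivot_z - pivot_z * math.cos(theta)   # stay on the orbit circle
        return (x, 0.0, z)
    if direction in ("up", "down"):               # pitch around the horizontal axis
        sign = 1.0 if direction == "up" else -1.0
        y = pivot_z * math.sin(sign * theta)
        z = pivot_z - pivot_z * math.cos(theta)
        return (0.0, y, z)
    raise ValueError(f"unsupported direction: {direction}")

# degree = 0 leaves the camera at the origin
print(orbit_camera("left", 0.0, 1.0))   # (0.0, 0.0, 0.0)
```

A smaller `--look_at_depth` moves the pivot closer, so the same `--degree` sweeps a tighter arc, which matches the "smaller = closer focus" note in the table above.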
```shell
cd DepthCrafter
python warp_depthcrafter.py \
    --video_path <your_video_or_image_folder> \
    --output_path <output_folder> \
    --direction <direction> \
    --degree <angle> \
    --look_at_depth <depth> \
    --enable_edge_filter
```

Key parameters
| Parameter | Description |
|---|---|
| `--direction` | Camera motion direction: `left`, `right`, `up`, `down` |
| `--degree` | Warping angle in degrees |
| `--look_at_depth` | Focus depth multiplier (similar to VGGT) |
| `--zoom` | Optional zoom mode: `zoom_in`, `zoom_out`, `none` |
| `--stable` | Stable mode: complete camera motion in the first N frames, then hold |
| `--enable_edge_filter` | Reduce artifacts at depth boundaries (recommended) |
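The idea behind an edge filter like `--enable_edge_filter` can be sketched as follows (assumed behavior, not DepthCrafter's actual code): drop warped pixels that sit on strong depth discontinuities, since warping tends to smear artifacts across them:

```python
# Illustrative sketch (assumed behavior, not DepthCrafter's actual code):
# mask out pixels adjacent to large depth jumps before repainting.
def edge_mask(depth: list[list[float]], threshold: float = 0.1) -> list[list[bool]]:
    """True = keep pixel; False = depth edge, mask it out."""
    h, w = len(depth), len(depth[0])
    keep = [[True] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):       # right / down neighbors
                ny, nx = y + dy, x + dx
                if ny < h and nx < w and abs(depth[y][x] - depth[ny][nx]) > threshold:
                    keep[y][x] = keep[ny][nx] = False
    return keep

depth = [[1.0, 1.0, 5.0],
         [1.0, 1.0, 5.0]]    # sharp depth edge between columns 1 and 2
print(edge_mask(depth))      # [[True, False, False], [True, False, False]]
```

Masked pixels join the holes that the video model repaints in the next stage, which is why enabling the filter is recommended.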
The warping output (frames + masks) will be saved to the output folder. Pass this folder as `--video-ref` to the video model in the next step.
Add a text prompt for your scene to `prompts.py`. We recommend using an LLM (e.g., GPT, Gemini) to generate prompts; see the existing examples in `prompts.py` for the expected style. For static scenes, emphasize stillness ("completely frozen", "utterly motionless") and avoid words implying movement.
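As an example of the style described above, a static-scene entry might look like the following (a hypothetical entry; the actual structure of `prompts.py` and the scene key are assumptions):

```python
# Hypothetical prompts.py-style entry (the real file's structure may differ):
# map a scene name to a text prompt; for static scenes, stress stillness
# and avoid any wording that implies motion.
PROMPTS = {
    "static_room": (
        "A cozy living room, completely frozen in time, utterly motionless, "
        "no people, no moving objects, photorealistic, stable lighting."
    ),
}

print(PROMPTS["static_room"])
```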
Our method offers flexible parameter combinations. The run_test_case.sh scripts provide commonly used parameter grids. We recommend tuning parameters for your specific scene to find the best combination.
Key tunable parameters (see `run_test_case.sh` for both Wan and LongCat):
| Parameter | Description |
|---|---|
| `--resolution` | Output resolution: 480p or 720p |
| `--omega` | DSG guidance strength |
| `--guide-steps` | Number of initial denoising steps with guided fusion |
| `--resample-steps` | Resampling iterations per denoising step |
| `--resample-round` | Total steps for resample guidance (typically `guide-steps` + 0~1) |
| `--transition-distance` | Mask edge softening distance in pixels (0 = hard edge) |
| `--guidance-scale` | CFG scale (ignored when using the distilled model) |
| `--max-replace` | (LongCat only) Max FLF replacement channels per step |
| `--use_distill` | (LongCat only) Enable 16-step distilled mode for faster inference |
| `--enable-upscale` | (LongCat only) Upscale 480p output to 720p |
Wan2.1 and LongCat-Video have different model priors, so optimal parameter ranges may differ. We suggest starting from the defaults in `run_test_case.sh` and adjusting based on your results.
Quick tuning tips:

- Start by adjusting `--guide-steps` (IRR iterations): for Wan's 50-step sampling, 10–25 typically works well; for LongCat's 50 steps, try 20–30.
- `--resample-round` adds extra IRR passes without warping injection for quality refinement; keep it at 0 or 1, since higher values lose the trajectory.
- `--omega` controls DSG strength; 4 or 6 is recommended, and values that are too high cause artifacts.
- `--resample-steps` can be fixed at 2.
- `--transition-distance` softens mask edges; 15, 20, or 25 are good choices.
- `--guidance-scale` is usually fixed at 4.
- `--max-replace` (LongCat only) works best at 2 or 3 due to LongCat's different channel characteristics.

These are empirical guidelines; some scenes may benefit from values outside these ranges.
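To illustrate what a `--transition-distance`-style parameter does, here is a minimal 1D sketch (assumed behavior, not the actual WorldForge code) that softens a binary warp mask by ramping weights linearly over N pixels from the mask edge:

```python
# Illustrative sketch (not the actual WorldForge code): soften a binary
# mask edge with a linear ramp over `transition` pixels, 1D version.
def soften(mask_row: list[int], transition: int) -> list[float]:
    """Weight each pixel by its distance to the nearest masked (0) pixel."""
    if transition == 0:                     # hard edge: keep the binary mask
        return [float(v) for v in mask_row]
    zeros = [i for i, v in enumerate(mask_row) if v == 0]
    out = []
    for i, v in enumerate(mask_row):
        if v == 0 or not zeros:             # holes stay 0; no holes -> unchanged
            out.append(float(v))
            continue
        d = min(abs(i - z) for z in zeros)  # distance to nearest hole
        out.append(min(1.0, d / transition))
    return out

print(soften([0, 1, 1, 1, 1], transition=2))   # [0.0, 0.5, 1.0, 1.0, 1.0]
```

A larger `transition` widens the blend band between warped content and repainted regions, which is why values such as 15, 20, or 25 pixels avoid visible seams while 0 gives a hard edge.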
We encourage the community to try newer 3D vision models for the warping stage (e.g., DA3, PI3, etc.), and to port WorldForge to other video diffusion models (e.g., LTX-2) for even better results.
- [x] Paper released on arXiv
- [x] Project page available
- [x] Code released: VGGT warping, DepthCrafter warping, Wan2.1 inference, LongCat-Video inference
- [ ] Mega-SAM dynamic video warping
- [ ] UniDepth single-view warping
- [ ] Lang-SAM video editing framework
- [ ] Multi-GPU parallel inference acceleration
We welcome contributions, issues, and discussions! If you have ported WorldForge to other video diffusion models, feel free to share.
P.S. Stay tuned for our upcoming new work: a faster, simpler, and more unified world model architecture. Coming soon!
We thank the research community for their valuable contributions to video diffusion models and 3D/4D generation. Special thanks to the following open-source projects that inspired and supported our work:
- Wan-2.1 - Large-scale video generation model
- LongCat-Video - Efficient long video generation with distillation
- SVD (Stable Video Diffusion) - Video diffusion model by Stability AI
- VGGT - Visual Geometry Grounded Transformer
- ReCamMaster - Trajectory-controlled video generation
- TrajectoryCrafter - Trajectory-based video synthesis
- NVS-Solver - Novel view synthesis solution
- ViewExtrapolator - View extrapolation for 3D scenes
- DepthCrafter - Video sequence depth estimation
- Mega-SAM - Video depth and pose estimation
```bibtex
@misc{song2025worldforgeunlockingemergent3d4d,
      title={Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control},
      author={Chenxi Song and Yanming Yang and Tong Zhao and Ruibo Li and Chi Zhang},
      year={2025},
      url={https://arxiv.org/abs/2509.15130},
}
```

For questions and discussions, please feel free to open an issue or contact:
- Chenxi Song: songchenxi@westlake.edu.cn
- Chi Zhang: chizhang@westlake.edu.cn
