# Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Chenxi Song1, Yanming Yang1, Tong Zhao1, Ruibo Li2, Chi Zhang1*
1AGI Lab, Westlake University    2Nanyang Technological University
*Corresponding Author
- [2026.02] Accepted by CVPR 2026!
- [2026.02] Code released! VGGT 3D warping, DepthCrafter 4D warping, and Wan2.1 & LongCat-Video inference are now available.
- [2025.09] arXiv preprint is available.
- [2025.09] Project page is online.
## Introduction
WorldForge is a training-free framework that unlocks the world-modeling potential of video diffusion models for controllable 3D/4D generation. By leveraging latent world priors in pretrained video diffusion models, WorldForge achieves precise trajectory control and photorealistic content generation, all without any additional training.
The core idea is a warping-and-repainting pipeline with three inference-time mechanisms (IRR, FLF, DSG) that jointly inject trajectory-aligned guidance while preserving visual fidelity. For details, please refer to our paper and project page.
## What's Released
| Module | Description |
|---|---|
| `vggt/` | VGGT-based 3D scene warping from a single image |
| `DepthCrafter/` | DepthCrafter-based 4D warping for dynamic video scenes |
| `wan_for_worldforge/` | WorldForge inference with Wan-2.1 (480p & 720p) |
| `longcat_for_worldforge/` | WorldForge inference with LongCat-Video (480p, distilled & upscaling) |
Tips:
- LongCat-Video supports a distilled mode (16 steps) with 480p-to-720p upscaling, offering faster generation. It works well on realistic scenes; for stylized or non-photorealistic content, Wan2.1 720p may yield better results.
- Wan2.1 720p (50 steps) generally delivers higher visual quality, at the cost of longer inference time.
## Installation

```shell
conda create -n worldforge python=3.11 -y
conda activate worldforge

# Choose your CUDA version from https://pytorch.org
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install xformers --index-url https://download.pytorch.org/whl/cu124

# flash-attn (use a pre-built wheel; source builds may have ABI issues)
# Torch 2.6 + CUDA 12 + Python 3.11:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl --no-deps
# Other combos: https://github.com/Dao-AILab/flash-attention/releases (pick cxx11abiFALSE)

conda install -c nvidia cuda-toolkit -y
export TORCH_CUDA_ARCH_LIST=$(python -c "import torch; c=torch.cuda.get_device_capability(); print(f'{c[0]}.{c[1]}')")
export MAX_JOBS=$(( $(nproc) / 2 ))
pip install --no-build-isolation "git+https://github.com/facebookresearch/pytorch3d.git@stable"
pip install --no-build-isolation git+https://github.com/cvg/LightGlue.git
```

- Wan-2.1: Download model weights from the official Wan-2.1 repository. You will need `Wan2.1-I2V-14B-480P` and/or `Wan2.1-I2V-14B-720P`.
- LongCat-Video: Download model weights from the official LongCat-Video repository. Both standard and distilled checkpoints are supported.

Set the model weight paths in the corresponding `run_test_case.sh` scripts before running.
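When picking a flash-attn wheel for a different torch/CUDA/Python combination, the filename follows a fixed pattern. The helper below is an illustrative sketch (not part of WorldForge or flash-attention) that assembles a wheel filename matching the naming scheme on the releases page:

```python
# Illustrative helper (not part of WorldForge): build the flash-attn
# pre-built wheel filename for a given version/CUDA/torch/Python combo,
# following the naming scheme used on the flash-attention releases page.
def flash_attn_wheel(version: str, cuda: str, torch_ver: str, py_tag: str,
                     cxx11abi: bool = False) -> str:
    abi = "TRUE" if cxx11abi else "FALSE"
    return (f"flash_attn-{version}+cu{cuda}torch{torch_ver}"
            f"cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl")

# Reproduces the wheel used in the install commands above:
print(flash_attn_wheel("2.7.4.post1", "12", "2.6", "cp311"))
```

Match the printed name against the assets on the releases page before installing; as noted above, pick the `cxx11abiFALSE` variant.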
## Quick Start

**VGGT 3D scene warping** (generate warped frames from single/few images):

```shell
bash vggt/run_test_case.sh
```

**DepthCrafter 4D dynamic warping** (warp a video sequence with a depth-guided trajectory):

```shell
bash DepthCrafter/run_test_case.sh
```

**Wan2.1 WorldForge inference** (high-quality video generation; edit `MODELS_DIR` in the script first):

```shell
bash wan_for_worldforge/run_test_case.sh
```

**LongCat-Video WorldForge inference** (fast generation with optional distillation & upscaling; edit `CHECKPOINT_DIR` first):

```shell
bash longcat_for_worldforge/run_test_case.sh
```

### LongCat Interactive Notebook
We also provide a Jupyter notebook (`longcat_for_worldforge/longcat_interactive.ipynb`) for continuous inference: models are loaded once and kept in VRAM, so you can run multiple experiments without reloading weights each time. See the parameter details in the sections below.
The overall pipeline is: warping → prompt → video inference. First generate warped frames and masks with a 3D vision model, then feed them into a video diffusion model for repainting.
```shell
cd vggt
python run_warp.py \
    --image_path <your_image_folder> \
    --output_path <output_folder> \
    --camera <camera_index> \
    --direction <direction> \
    --degree <angle> \
    --frame_single <num_frames> \
    --look_at_depth <depth>
```

Key parameters
| Parameter | Description |
|---|---|
| `--camera` | Which input view to use as the source (index starting from 0) |
| `--direction` | Camera motion direction: `left`, `right`, `up`, `down`, `forward`, `backward` |
| `--degree` | Rotation angle in degrees (or movement ratio for `forward`/`backward`) |
| `--frame_single` | Number of output frames to generate |
| `--look_at_depth` | Focus depth coefficient (1.0 = scene mean depth; smaller = closer focus) |
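To build intuition for how `--direction`, `--degree`, and `--look_at_depth` interact, here is a minimal sketch (assumed geometry, not WorldForge's actual implementation) of an orbiting camera: the camera starts at the origin looking down +z and orbits a pivot placed on the view axis at `look_at_depth` times the mean scene depth:

```python
import math

# Illustrative sketch (not WorldForge's actual code): map direction/degree/
# look_at_depth to a new camera position orbiting a look-at pivot.
def orbit_camera(direction: str, degree: float, look_at_depth: float,
                 mean_depth: float = 1.0) -> tuple[float, float, float]:
    pivot_z = look_at_depth * mean_depth          # point the camera keeps framing
    theta = math.radians(degree)
    if direction in ("left", "right"):            # yaw around the vertical axis
        sign = 1.0 if direction == "left" else -1.0
        x = pivot_z * math.sin(sign * theta)
        z = pivot_z - pivot_z * math.cos(theta)   # stay on the orbit circle
        return (x, 0.0, z)
    if direction in ("up", "down"):               # pitch around the horizontal axis
        sign = 1.0 if direction == "up" else -1.0
        y = pivot_z * math.sin(sign * theta)
        z = pivot_z - pivot_z * math.cos(theta)
        return (0.0, y, z)
    raise ValueError(f"unsupported direction: {direction}")

# degree = 0 leaves the camera at the origin
print(orbit_camera("left", 0.0, 1.0))   # (0.0, 0.0, 0.0)
```

A smaller `--look_at_depth` moves the pivot closer, so the same `--degree` sweeps a tighter arc, which matches the "smaller = closer focus" note in the table above.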
```shell
cd DepthCrafter
python warp_depthcrafter.py \
    --video_path <your_video_or_image_folder> \
    --output_path <output_folder> \
    --direction <direction> \
    --degree <angle> \
    --look_at_depth <depth> \
    --enable_edge_filter
```

Key parameters
| Parameter | Description |
|---|---|
| `--direction` | Camera motion direction: `left`, `right`, `up`, `down` |
| `--degree` | Warping angle in degrees |
| `--look_at_depth` | Focus depth multiplier (similar to VGGT) |
| `--zoom` | Optional zoom mode: `zoom_in`, `zoom_out`, `none` |
| `--stable` | Stable mode: complete camera motion in the first N frames, then hold |
| `--enable_edge_filter` | Reduce artifacts at depth boundaries (recommended) |
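The idea behind an edge filter like `--enable_edge_filter` can be sketched as follows (assumed behavior, not DepthCrafter's actual code): drop warped pixels that sit on strong depth discontinuities, since warping tends to smear artifacts across them:

```python
# Illustrative sketch (assumed behavior, not DepthCrafter's actual code):
# mask out pixels adjacent to large depth jumps before repainting.
def edge_mask(depth: list[list[float]], threshold: float = 0.1) -> list[list[bool]]:
    """True = keep pixel; False = depth edge, mask it out."""
    h, w = len(depth), len(depth[0])
    keep = [[True] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):       # right / down neighbors
                ny, nx = y + dy, x + dx
                if ny < h and nx < w and abs(depth[y][x] - depth[ny][nx]) > threshold:
                    keep[y][x] = keep[ny][nx] = False
    return keep

depth = [[1.0, 1.0, 5.0],
         [1.0, 1.0, 5.0]]    # sharp depth edge between columns 1 and 2
print(edge_mask(depth))      # [[True, False, False], [True, False, False]]
```

Masked pixels join the holes that the video model repaints in the next stage, which is why enabling the filter is recommended.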
The warping output (frames + masks) will be saved to the output folder. Pass this folder as `--video-ref` to the video model in the next step.
Add a text prompt for your scene to `prompts.py`. We recommend using an LLM (e.g., GPT, Gemini) to generate prompts; see the existing examples in `prompts.py` for the expected style. For static scenes, emphasize stillness ("completely frozen", "utterly motionless") and avoid words implying movement.
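As an example of the style described above, a static-scene entry might look like the following (a hypothetical entry; the actual structure of `prompts.py` and the scene key are assumptions):

```python
# Hypothetical prompts.py-style entry (the real file's structure may differ):
# map a scene name to a text prompt; for static scenes, stress stillness
# and avoid any wording that implies motion.
PROMPTS = {
    "static_room": (
        "A cozy living room, completely frozen in time, utterly motionless, "
        "no people, no moving objects, photorealistic, stable lighting."
    ),
}

print(PROMPTS["static_room"])
```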
Our method offers flexible parameter combinations. The run_test_case.sh scripts provide commonly used parameter grids. We recommend tuning parameters for your specific scene to find the best combination.
Key tunable parameters (see `run_test_case.sh` for both Wan and LongCat):
| Parameter | Description |
|---|---|
| `--resolution` | Output resolution: 480p or 720p |
| `--omega` | DSG guidance strength |
| `--guide-steps` | Number of initial denoising steps with guided fusion |
| `--resample-steps` | Resampling iterations per denoising step |
| `--resample-round` | Total steps for resample guidance (typically `guide-steps` + 0~1) |
| `--transition-distance` | Mask edge softening distance in pixels (0 = hard edge) |
| `--guidance-scale` | CFG scale (ignored when using the distilled model) |
| `--max-replace` | (LongCat only) Max FLF replacement channels per step |
| `--use_distill` | (LongCat only) Enable 16-step distilled mode for faster inference |
| `--enable-upscale` | (LongCat only) Upscale 480p output to 720p |
Wan2.1 and LongCat-Video have different model priors, so optimal parameter ranges may differ. We suggest starting from the defaults in `run_test_case.sh` and adjusting based on your results.
Quick tuning tips:

- Start by adjusting `--guide-steps` (IRR iterations): for Wan's 50-step sampling, 10–25 typically works well; for LongCat's 50 steps, try 20–30.
- `--resample-round` adds extra IRR passes without warping injection for quality refinement; keep it at 0 or 1, since higher values lose the trajectory.
- `--omega` controls DSG strength; 4 or 6 is recommended, and values that are too high cause artifacts.
- `--resample-steps` can be fixed at 2.
- `--transition-distance` softens mask edges; 15, 20, or 25 are good choices.
- `--guidance-scale` is usually fixed at 4.
- `--max-replace` (LongCat only) works best at 2 or 3 due to LongCat's different channel characteristics.

These are empirical guidelines; some scenes may benefit from values outside these ranges.
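To illustrate what a `--transition-distance`-style parameter does, here is a minimal 1D sketch (assumed behavior, not the actual WorldForge code) that softens a binary warp mask by ramping weights linearly over N pixels from the mask edge:

```python
# Illustrative sketch (not the actual WorldForge code): soften a binary
# mask edge with a linear ramp over `transition` pixels, 1D version.
def soften(mask_row: list[int], transition: int) -> list[float]:
    """Weight each pixel by its distance to the nearest masked (0) pixel."""
    if transition == 0:                     # hard edge: keep the binary mask
        return [float(v) for v in mask_row]
    zeros = [i for i, v in enumerate(mask_row) if v == 0]
    out = []
    for i, v in enumerate(mask_row):
        if v == 0 or not zeros:             # holes stay 0; no holes -> unchanged
            out.append(float(v))
            continue
        d = min(abs(i - z) for z in zeros)  # distance to nearest hole
        out.append(min(1.0, d / transition))
    return out

print(soften([0, 1, 1, 1, 1], transition=2))   # [0.0, 0.5, 1.0, 1.0, 1.0]
```

A larger `transition` widens the blend band between warped content and repainted regions, which is why values such as 15, 20, or 25 pixels avoid visible seams while 0 gives a hard edge.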
We encourage the community to try newer 3D vision models for the warping stage (e.g., DA3, PI3, etc.), and to port WorldForge to other video diffusion models (e.g., LTX-2) for even better results.
- [x] Paper released on arXiv
- [x] Project page available
- [x] Code released: VGGT warping, DepthCrafter warping, Wan2.1 inference, LongCat-Video inference
- [ ] Mega-SAM dynamic video warping
- [ ] UniDepth single-view warping
- [ ] Lang-SAM video editing framework
- [ ] Multi-GPU parallel inference acceleration
We welcome contributions, issues, and discussions! If you have ported WorldForge to other video diffusion models, feel free to share.
P.S. Stay tuned for our upcoming new work: a faster, simpler, and more unified world model architecture. Coming soon!
We thank the research community for their valuable contributions to video diffusion models and 3D/4D generation. Special thanks to the following open-source projects that inspired and supported our work:
- Wan-2.1 - Large-scale video generation model
- LongCat-Video - Efficient long video generation with distillation
- SVD (Stable Video Diffusion) - Video diffusion model by Stability AI
- VGGT - Visual Geometry Grounded Transformer
- ReCamMaster - Trajectory-controlled video generation
- TrajectoryCrafter - Trajectory-based video synthesis
- NVS-Solver - Novel view synthesis solution
- ViewExtrapolator - View extrapolation for 3D scenes
- DepthCrafter - Video sequence depth estimation
- Mega-SAM - Video depth and pose estimation
```bibtex
@misc{song2025worldforgeunlockingemergent3d4d,
      title={Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control},
      author={Chenxi Song and Yanming Yang and Tong Zhao and Ruibo Li and Chi Zhang},
      year={2025},
      url={https://arxiv.org/abs/2509.15130},
}
```

For questions and discussions, please feel free to open an issue or contact:
- Chenxi Song: songchenxi@westlake.edu.cn
- Chi Zhang: chizhang@westlake.edu.cn
