Ze Yuan* · Xin Yu*† · Yangtian Sun · Yuan-Chen Guo · Yan-Pei Cao · Ding Liang · Xiaojuan Qi✉
The University of Hong Kong | VAST
*Equal contribution †Project lead ✉Corresponding author
- 2025-12-12: Code release v1.0.0! 🎉
This repository has been tested on a single NVIDIA A100 80GB GPU. For training, at least 8 GPUs with 80GB of memory each are recommended; we trained on 4 nodes for one week to obtain the results presented in the paper. You can refer to the project page for a quick demo.
```bash
conda create -n seqtex python=3.10 -y
conda activate seqtex
conda install -c nvidia/label/cuda-11.8.0 cuda-toolkit -y
conda install -c conda-forge gxx=11 gcc=11 -y
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
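Optionally, verify the environment before moving on. This is a minimal check that PyTorch sees the GPU and the expected CUDA 11.8 build:

```python
# Quick environment sanity check (optional).
import torch

print(torch.__version__)          # expected: 2.6.0+cu118
print(torch.version.cuda)         # expected: 11.8
print(torch.cuda.is_available())  # should be True on a machine with an NVIDIA GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```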
To launch training on a single node with 8 GPUs:

```bash
python launch.py --config configs/train.yaml --train \
--gpu 0,1,2,3,4,5,6,7 trainer.num_nodes=1 name="train" tag="overfit" \
data.scene_list=["data/indices/train_3d.jsonl","data/indices/train_image.jsonl"] \
data.train_indices=[[0,1],[0,1]] data.task_list=["img2tex","geo2mv"] \
data.eval_scene_list=["data/indices/train_3d.jsonl","data/indices/train_image.jsonl"] \
data.val_indices=[[0,1],[0,1]] data.eval_task_list=["img2tex","geo2mv"] \
data.extra_prompt_db=["data/indices/prompt_extension.json"] \
data.repeat=1000 trainer.val_check_interval=1.0
# To resume training, append the following options:
# system.weights=<main_ckpt_path>.ckpt system.ema_kwargs.cloud_or_local_key=<ema_path>.pth
```

SeqTex training uses FSDP by default; otherwise, it does not fit in 80GB of GPU memory.
```bash
python launch.py --config configs/test.yaml --test \
--gpu 0 trainer.num_nodes=1 name="test" tag="test_dtc" \
data.eval_scene_list=["data/indices/test_one.jsonl"] \
system.seqtex_transformer_name_or_path=VAST-AI/SeqTex-Transformer
```

Try the above command to perform a quick test of image-conditioned texture generation.
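If the transformer weights live on the Hugging Face Hub (the `VAST-AI/SeqTex-Transformer` id in the command above suggests so), you can optionally pre-download them into the local cache with `huggingface_hub`; this is a convenience sketch, not a required step:

```python
# Optional: pre-fetch the SeqTex transformer weights into the local Hugging Face cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="VAST-AI/SeqTex-Transformer")  # repo id from the command above
print("weights cached at:", local_dir)
```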
Alternatively, if you want more control, you can use SDXL to convert a text prompt into an image condition and then generate the texture map.
```bash
python launch.py --config configs/test.yaml --test \
--gpu 0 trainer.num_nodes=1 name="test" tag="test_dtc_sdxl" \
data.eval_scene_list=["data/indices/test_one.jsonl"] \
system.seqtex_transformer_name_or_path=VAST-AI/SeqTex-Transformer \
system.use_generated_img_cond=true
```

If everything goes well, you should get a result like the one shown here. You can use this as a sanity check.
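With `system.use_generated_img_cond=true`, the image condition is produced from your text prompt internally (via SDXL/FLUX, see the output listing below). If you prefer to create the image condition yourself and feed it to the image-conditioned command above, a minimal diffusers SDXL sketch (model id and prompt are placeholders) would be:

```python
# Standalone sketch: generate an image condition from a text prompt with SDXL via diffusers.
# SeqTex performs this step for you when system.use_generated_img_cond=true is set.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe("a wooden treasure chest with brass fittings", num_inference_steps=30).images[0]
image.save("img_cond.png")  # use as the image condition for texture generation
```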
Each result contains 3-4 files:
```
outputs_wan/test/test_dtc_sdxl@20251013-121842/save
├── it0-test-0_0-img_cond.png        # (text2tex ONLY) image condition from SDXL/FLUX
├── it0-test-0_0-mv-taskimg2tex.png  # please refer to [assets/explain.png](assets/explain.png)
├── it0-test-0_0-prompt.json         # the model id and text prompt used
└── it0-test-0_0-uv-taskimg2tex.png  # 3 rows: position map, normal map, and the generated texture map
```
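If you only need the texture map itself, you can crop the bottom row out of the `*-uv-task*.png` image. A minimal sketch, assuming the three rows are stacked vertically with equal height (the filename is taken from the listing above):

```python
# Crop the generated texture map (bottom row) out of the saved UV image.
# Assumes position map / normal map / texture map are stacked vertically in equal-height rows.
from PIL import Image

uv_img = Image.open("outputs_wan/test/test_dtc_sdxl@20251013-121842/save/it0-test-0_0-uv-taskimg2tex.png")
w, h = uv_img.size
texture = uv_img.crop((0, 2 * h // 3, w, h))  # bottom third = generated texture map
texture.save("texture_map.png")
```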
Two types of datasets are supported: a 3D dataset and an image dataset. Prepare the data in the formats below. Partial processing scripts can be found in our HF space demo (it has some known issues, e.g., UV unwrapping fails for some cases).

[Note]: Every 3D model with multiple parts should be merged into a single mesh with a single UV map and a single texture map. The texture map should preferably be an albedo map without lighting/shading. For image datasets, PBR-shaded images are also supported.

For 3D datasets, each directory contains a 3D model and a corresponding texture map. This is the primary data format and is used by the img2tex task.
```
data/examples
└── 99
    └── 999c2f55-0cf6-4d46-913a-7671085d03f6
        ├── model.glb
        └── model.jpeg   # preferably an albedo texture map
```
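To sanity-check that a prepared `model.glb` satisfies the single-mesh requirement from the note above, a rough trimesh sketch could look like the following (path taken from the example tree; `trimesh` may not be part of this repository's requirements):

```python
# Rough check that a GLB satisfies the single-mesh / single-UV requirement described above.
import trimesh

loaded = trimesh.load("data/examples/99/999c2f55-0cf6-4d46-913a-7671085d03f6/model.glb")
meshes = list(loaded.geometry.values()) if isinstance(loaded, trimesh.Scene) else [loaded]
assert len(meshes) == 1, f"expected a single merged mesh, found {len(meshes)} parts"

mesh = meshes[0]
uv = getattr(mesh.visual, "uv", None)  # None if the mesh carries no texture coordinates
assert uv is not None, "mesh has no UV coordinates"
print("vertices:", len(mesh.vertices), "uv coords:", uv.shape)
```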
For image datasets, each directory contains a set of images. This is an alternative data format used to improve the generalization ability of the model and is used by the geo2mv task.

```
data/examples_train_image
└── f6
    └── f63290551cac423883742cfdb8acc9ff
        ├── meta.json         # transform matrices for each image
        ├── color_0000.webp   # PBR-shaded images are also supported
        ├── ...
        ├── depth_0000.exr
        ├── ...
        └── normal_0005.webp
```
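For reference, one view of such a directory can be loaded roughly as follows (paths follow the example above; OpenCV needs the `OPENCV_IO_ENABLE_OPENEXR` environment variable set before import to read the `.exr` depth maps):

```python
# Load one view (color / depth / normal) and the per-view metadata from an image-dataset directory.
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"  # must be set before importing cv2 to read .exr files
import json
import cv2

root = "data/examples_train_image/f6/f63290551cac423883742cfdb8acc9ff"
with open(os.path.join(root, "meta.json")) as f:
    meta = json.load(f)  # per-image transform matrices; the exact schema depends on your export

color = cv2.imread(os.path.join(root, "color_0000.webp"), cv2.IMREAD_UNCHANGED)
depth = cv2.imread(os.path.join(root, "depth_0000.exr"), cv2.IMREAD_UNCHANGED)
normal = cv2.imread(os.path.join(root, "normal_0005.webp"), cv2.IMREAD_UNCHANGED)
print(color.shape, depth.shape, normal.shape)
```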
If SeqTex is used in your work, please cite:

```bibtex
@inproceedings{10.1145/3757377.3763863,
author = {Yuan, Ze and Yu, Xin and Sun, Yangtian and Guo, Yuan-Chen and Cao, Yan-Pei and Liang, Ding and Qi, Xiaojuan},
title = {SeqTex: Generate Mesh Textures in Video Sequence},
year = {2025},
isbn = {9798400721373},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://dl.acm.org/doi/10.1145/3757377.3763863},
doi = {10.1145/3757377.3763863},
booktitle = {Proceedings of the SIGGRAPH Asia 2025 Conference Papers},
articleno = {25},
numpages = {12},
keywords = {Video Diffusion Models, Diffusion Techniques, Texture Generation},
series = {SA Conference Papers '25}
}
```

We sincerely thank the following open-source projects:
- MV-Adapter for the coding framework.
- Wan2.1 for the 3D prior from video sequences.
- SDXL and FLUX for high-fidelity text-to-image generation.
- diffusers for the diffusion model implementation.
- lightning for saving our time.
Additionally, we thank the following people for their help:
- Special thanks to Toshihiro Hayashi for his valuable support and assistance in fixing bugs for our HF demo.
- We thank EasyMode-AI for their efforts in integrating SeqTex into ComfyUI. See here.
