- A fork of https://github.com/Stability-AI/generative-models, with focus on tokenizers
- Practical size implementation and training code for popular tokenizers, such as VQ, FSQ, LFQ, BSQ
 - Both stable diffusion unet and bsq vit backbone support
 - With pre-trained model and benchmark on ImageNet 256x256
 
 
- dependency in environment.yaml
conda env create --file=environment.yaml conda activate tokenizer
 
- from source
pip install . 
- It is recommend to list the dataset in advanced using
python scripts/create_dataset_list.py --root $PATH_TO_DATASET_FOLDER --ext $IMAGE_EXTENSION --out $PATH_TO_OUTFILE
 - It is not mandatory, just speed up training
 
- 
modify the yaml file according to your system, pay special attention to "trainer-device", "trainer-num_nodes", "data-train-params-root"
 - 
Gaussian VAE with stable diffusion UNet
python main.py --config sd3unet_gaussian_kl_0.64.yaml --wandb
 - 
FSQ with stable diffusion UNet
python main.py --config sd3unet_fsq_16.yaml --wandb
 - 
LFQ with stable diffusion UNet
python main.py --config sd3unet_lfq_16.yaml --wandb
 - 
check ./configs/ for more
 
- usage
python -m torch.distributed.launch --standalone --use-env \ --nproc-per-node=8 eval.py \ --bs=32 \ --base=$PATH_TO_YAML_CONFIG \ --ckpt=$PATH_TO_CKPT \ --dataset=$PATH_TO_DATASET_FOLDER
 
- All models are trained with ImageNet train set, on 8xA100 GPU for around 30 epochs, which takes around 24 hours
 - All models available in https://huggingface.co/xutongda/pytorch-image-tokenizer-models
 
| spec | config | model | PSNR | SSIM | LPIPS | rFID | 
|---|---|---|---|---|---|---|
| LFQ 2^16x1024 | sd3unet_lfq_16.yaml | sd3unet_lfq_16.ckpt | 22.65 | 0.635 | 0.141 | 3.523 | 
| FSQ 2^16x1024 | sd3unet_fsq_16.yaml | sd3unet_fsq_16.ckpt | 26.87 | 0.785 | 0.072 | 1.161 | 
| BSQ 2^16x1024 | sd3unet_bsq_16.yaml | sd3unet_bsq_16.ckpt | 25.62 | 0.754 | 0.086 | 1.080 | 
- main structure is a fork from: https://github.com/Stability-AI/generative-models
 - bsq, vit and evaluation from: https://github.com/zhaoyue-zephyrus/bsq-vit
 - lfq from: https://github.com/TencentARC/SEED-Voken
 - vq from: https://github.com/ai-forever/MoVQGAN
 - [VQ NIPS 17] Neural Discrete Representation Learning
 - [LFQ ICLR 24] Language Model Beats Diffusion: Tokenizer is key to visual generation
 - [FSQ ICLR 24] Finite Scalar Quantization: VQ-VAE Made Simple
 - [BSQ ICLR 25] Image and Video Tokenization with Binary Spherical Quantization