⭐ If this work is helpful for you, please help star this repo. Thanks! 🤗
1️⃣ Efficiency Bottleneck: VAR exhibits scale and spatial redundancy, causing high GPU memory consumption.
2️⃣ Our Solution: The proposed method enables generation without relying on a KV cache during inference, significantly reducing the memory footprint.
- 📄 Citation
- 📰 News
- 🛠️ Pipeline
- 🔥 Results
- 📦 Model Zoo
- ⚙️ Installation
- 🚀 Training & Evaluation
Please cite our work if it is helpful for your research:
```bibtex
@inproceedings{
zhang2026mvar,
title={{MVAR}: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning},
author={Jinhua Zhang and Wei Long and Minghao Han and Weiyi You and Shuhang Gu},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=mkr1ZrwgeJ}
}
```

- 2026-02-05: 🔧 Codebase and weights are now available.
- 2026-01-25: 🎉 MVAR is accepted by ICLR 2026.
- 2025-05-20: 📝 Our MVAR paper is released on arXiv.
MVAR introduces the Scale and Spatial Markovian Assumption:
- Scale Markovian: only the adjacent preceding scale is used as the condition for next-scale prediction.
- Spatial Markovian: the attention of each token is restricted to a localized neighborhood of size $k$ at the corresponding positions on the adjacent scale (see the sketch below).
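A minimal, single-head PyTorch sketch of this conditioning (illustrative only: the released models implement it with NATTEN's neighborhood attention, and the tensor layout below is an assumption):

```python
import torch
import torch.nn.functional as F

def markovian_neighborhood_attention(q, prev_feats, k=3):
    """q:          (B, C, H, W) queries at the current scale.
    prev_feats: (B, C, h, w) features from the adjacent preceding scale.
    k:          odd neighborhood size."""
    B, C, H, W = q.shape
    # Scale Markovian: condition only on the adjacent preceding scale,
    # upsampled to the current resolution for next-scale prediction.
    kv = F.interpolate(prev_feats, size=(H, W), mode="bicubic")
    # Spatial Markovian: each position attends only to its k x k neighborhood.
    neigh = F.unfold(kv, kernel_size=k, padding=k // 2)      # (B, C*k*k, H*W)
    neigh = neigh.view(B, C, k * k, H * W)
    qf = q.view(B, C, 1, H * W)
    attn = (qf * neigh).sum(dim=1, keepdim=True) / C ** 0.5  # (B, 1, k*k, H*W)
    out = (attn.softmax(dim=2) * neigh).sum(dim=2)           # (B, C, H*W)
    return out.view(B, C, H, W)
```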
MVAR achieves a 3.0× reduction in GPU memory footprint compared to VAR.
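A back-of-envelope illustration of where the saving comes from, assuming VAR's usual 10-scale schedule for 256×256 (token counts only, not measured memory):

```python
# Tokens generated at each scale under VAR's standard 256x256 schedule
# (the schedule itself is an assumption borrowed from the VAR paper).
scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
tokens = [s * s for s in scales]

# VAR: causal attention caches keys/values for the whole generated prefix,
# so at the final scale the KV cache spans every earlier scale.
var_kv_tokens = sum(tokens[:-1])   # 424

# MVAR: Scale Markovian conditioning needs only the adjacent scale.
mvar_ctx_tokens = tokens[-2]       # 169

print(var_kv_tokens, mvar_ctx_tokens)
```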
We provide various MVAR models, accessible via our Hugging Face repo.
| Model | FID ↓ | IS ↑ | sFID ↓ | Prec. ↑ | Recall ↑ | Params | HF Weights 🤗 |
|---|---|---|---|---|---|---|---|
| MVAR-d16 | 3.01 | 285.17 | 6.26 | 0.85 | 0.51 | 310M | link |
| MVAR-d16$^{*}$ | 3.37 | 295.35 | 6.10 | 0.86 | 0.48 | 310M | link |
| MVAR-d20$^{*}$ | 2.83 | 294.31 | 6.12 | 0.85 | 0.52 | 600M | link |
| MVAR-d24$^{*}$ | 2.15 | 298.85 | 5.62 | 0.84 | 0.56 | 1.0B | link |

Note: $^{*}$ indicates models fine-tuned from VAR weights on ImageNet.
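If you prefer to fetch checkpoints from Python instead of the CLI commands shown later, `huggingface_hub`'s standard `snapshot_download` works as well (the local paths below are just examples):

```python
# Programmatic alternative to the `hf download` commands further down;
# repo ids match those used elsewhere in this README.
from huggingface_hub import snapshot_download

snapshot_download("FoundationVision/var", local_dir="./pretrained/FoundationVision/var")
snapshot_download("CVLUESTC/MVAR", local_dir="./checkpoints")
```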
- Create conda environment:

  ```bash
  conda create -n mvar python=3.11 -y
  conda activate mvar
  ```

- Install PyTorch and dependencies:

  ```bash
  pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
      xformers==0.0.32.post2 \
      --index-url https://download.pytorch.org/whl/cu128
  pip install accelerate einops tqdm huggingface_hub pytz tensorboard \
      transformers typed-argument-parser thop matplotlib seaborn wheel \
      scipy packaging ninja openxlab lmdb pillow
  ```
- Install Neighborhood Attention (NATTEN). You can also use the `.whl` file provided in HuggingFace:

  ```bash
  pip install natten-0.21.1+torch280cu128-cp311-cp311-linux_x86_64.whl
  ```
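  A quick smoke test after installing (this assumes the package exposes `__version__`, as most do):

  ```bash
  python -c "import natten; print(natten.__version__)"
  ```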
- Prepare the ImageNet dataset:

  <details>
  <summary>Click to view expected directory structure</summary>

  ```
  /path/to/imagenet/
  ├── train/
  │   ├── n01440764/
  │   └── ...
  └── val/
      ├── n01440764/
      └── ...
  ```

  </details>
Before running MVAR, you must first download the necessary VAR weights.

1. Download VAR. You can use the huggingface-cli to download the entire model repository:

   ```bash
   # Install huggingface_hub if you haven't
   pip install huggingface_hub

   # Download models to local directory
   hf download FoundationVision/var --local-dir ./pretrained/FoundationVision/var
   ```

2. Download MVAR:

   ```bash
   # Download models to local directory
   hf download CVLUESTC/MVAR --local-dir ./checkpoints
   ```

Tip: install and compile flash-attn and xformers for faster attention computation. Our code will automatically use them if installed. See `models/basic_mvar.py#L17-L48`.
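For flash-attn specifically, the project's usual install command is below (treat it as an assumption that it fits your CUDA/PyTorch combination; prebuilt wheels may be preferable):

```bash
# Optional: flash-attn is picked up automatically by our code if importable
pip install flash-attn --no-build-isolation
```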
Given that our data augmentation consists of simple center cropping and random flipping, VQ-VAE latents and code indices can be pre-computed and saved to `CACHED_PATH` to reduce computational overhead during MVAR training:

```bash
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 main_cache.py \
  --img_size 256 --data_path ${IMAGENET_PATH} \
  --cached_path ${CACHED_PATH}/train_cache_mvar \
  --train
# --cached_path: use ${CACHED_PATH}/val_cache_mvar for the validation split
# --train: specify the training split
```

To train MVAR on ImageNet 256×256, pass `--use_cached=True` to use the pre-computed latents and code indices:
```bash
# Example for MVAR-d16
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=16 --bs=448 --ep=300 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME}

# Example for MVAR-d16 (fine-tuning from VAR)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=16 --bs=448 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True

# Example for MVAR-d20 (fine-tuning from VAR)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=20 --bs=192 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True

# Example for MVAR-d24 (fine-tuning from VAR)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=24 --bs=448 --ep=80 --fp16=1 --alng=1e-3 --wpe=0.1 \
  --data_path ${IMAGENET_PATH} --exp_name ${EXP_NAME} --finetune_from_var=True
```

6.1. Generate images:
```bash
python run_mvar_evaluate.py \
  --cfg 2.7 --top_p 0.99 --top_k 1200 --depth 16 \
  --mvar_ckpt ${MVAR_CKPT}
```

Suggested CFG settings per model:
- d16: cfg=2.7, top_p=0.99, top_k=1200
- d16$^{*}$: cfg=2.0, top_p=0.99, top_k=1200
- d20$^{*}$: cfg=1.5, top_p=0.96, top_k=900
- d24$^{*}$: cfg=1.4, top_p=0.96, top_k=900
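For example, sampling from the fine-tuned MVAR-d24 checkpoint with its suggested settings (an illustrative invocation reusing the flags shown above):

```bash
python run_mvar_evaluate.py \
  --cfg 1.4 --top_p 0.96 --top_k 900 --depth 24 \
  --mvar_ckpt ${MVAR_CKPT}
```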
6.2. Run evaluation:
We use OpenAI's FID evaluation toolkit and the 256×256 reference ground-truth npz file to evaluate FID, IS, Precision, and Recall. First, you can create an environment using Docker, consistent with the setup in OpenAI's FID evaluation toolkit:
```bash
docker run --rm -it \
  --gpus all \
  -v /data0/home/zhangjinhua/:/workspace/ \
  -v /data0/home/zhangjinhua:/data0/home/zhangjinhua \
  -w /workspace/ \
  nvcr.io/nvidia/tensorflow:25.02-tf2-py3 bash

# Verify GPU availability
python -c "import tensorflow as tf; print('GPU devices:', tf.config.list_physical_devices('GPU'))"
# Expected output:
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```
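`--sample_batch` expects an npz of generated samples. If your sampling run left a folder of PNGs, a minimal packing helper could look like this (the `arr_0` uint8 layout follows OpenAI's evaluator; the folder path is hypothetical):

```python
# Pack generated 256x256 RGB images into the npz layout read by evaluator.py.
import glob

import numpy as np
from PIL import Image

paths = sorted(glob.glob("samples/*.png"))  # hypothetical output directory
arr = np.stack([np.asarray(Image.open(p).convert("RGB"), dtype=np.uint8) for p in paths])
np.savez("sample_batch.npz", arr_0=arr)     # shape (N, 256, 256, 3)
```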
Then run the evaluation script:

```bash
python utils/evaluations/c2i/evaluator.py \
  --ref_batch VIRTUAL_imagenet256_labeled.npz \
  --sample_batch ${SAMPLE_BATCH}
```

If you have any questions, feel free to reach out at jinhua.zjh@gmail.com.