This is a repository for organizing papers, code, and other resources related to visual tokenizers.
A visual tokenizer is a mechanism that maps input visual signals (such as images or videos) into a set of compact, structured visual units (tokens), which may be continuous vectors, discrete indices, or a hybrid of both. A core requirement of a visual tokenizer is that the resulting tokens carry enough representational capacity for a corresponding decoder or generator to reconstruct the original visual input with high quality.
Note: This is not a universally accepted or formally established definition; different works in the literature may adopt varying definitions or criteria for what constitutes a visual tokenizer.
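To make the definition above concrete, the minimal sketch below illustrates the typical interface of a visual tokenizer: an encoder that maps an image to a grid of latent tokens, an optional vector-quantization step that converts those tokens into discrete codebook indices, and a decoder that reconstructs the input. This is a hypothetical toy example written for illustration only; the class and argument names are our own, training losses and straight-through gradients are omitted, and it is not the implementation of any paper listed below.

```python
# Toy sketch of a visual tokenizer (illustrative only, not from any listed paper):
# image -> continuous latent tokens -> (optional) discrete codebook indices -> reconstruction.
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    def __init__(self, in_channels=3, latent_dim=8, codebook_size=512, discrete=True):
        super().__init__()
        # Encoder: image -> grid of latent vectors (tokens), downsampled 4x.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 4, stride=2, padding=1),
        )
        # Decoder: tokens -> reconstructed image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, in_channels, 4, stride=2, padding=1),
        )
        self.discrete = discrete
        # Codebook used only by the discrete (VQ-style) variant.
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def encode(self, x):
        z = self.encoder(x)  # (B, D, H', W') continuous tokens
        if not self.discrete:
            return z, None
        # Nearest-codebook-entry lookup (straight-through gradients omitted for brevity).
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)        # (B*H'*W', D)
        dist = torch.cdist(flat, self.codebook.weight)     # distances to all codes
        idx = dist.argmin(dim=1)                           # discrete token indices
        zq = self.codebook(idx).reshape(b, h, w, d).permute(0, 3, 1, 2)
        return zq, idx.reshape(b, h, w)

    def decode(self, z):
        return self.decoder(z)

x = torch.randn(2, 3, 64, 64)                  # toy batch of images
tok = ToyVisualTokenizer(discrete=True)
z, idx = tok.encode(x)
print(z.shape, idx.shape, tok.decode(z).shape)  # tokens, indices, reconstruction
```

Passing `discrete=False` keeps the continuous (VAE-style) latents instead of quantizing them; most of the papers below refine one or both of these paths (latent design, quantization scheme, or decoder).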
If you have any suggestions (missing papers, new papers, or typos), please feel free to open a pull request. Simply letting us know the titles of relevant papers is also a great contribution; you can do so by opening an issue or contacting us directly via email.
- Denoising Vision Transformer Autoencoder with Spectral Self-Regularization (Nov 16, 2025. arXiv)
- Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models (Oct 21, 2025. arXiv)
- Adapting Self-Supervised Representations as a Latent Space for Efficient Generation (Oct 16, 2025. arXiv)
- Diffusion Transformers with Representation Autoencoders (Oct 13, 2025. arXiv)
- Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer (Oct 8, 2025. arXiv)
- DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder (Sep 29, 2025. arXiv)
- Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models (Sep 29, 2025. arXiv)
- HunyuanImage 3.0 Technical Report (Sep 28, 2025. arXiv)
- Seedream 4.0: Toward Next-generation Multimodal Image Generation (Sep 24, 2025. arXiv)
- NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale (Aug 14, 2025. arXiv)
- Qwen-Image Technical Report (Aug 4, 2025. arXiv)
- DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space (Aug 1, 2025. arXiv)
- Latent Denoising Makes Good Visual Tokenizers (Jul 21, 2025. arXiv)
- Seedance 1.0: Exploring the Boundaries of Video Generation Models (Jun 10, 2025. arXiv)
- VIVAT: Virtuous Improving VAE Training through Artifact Mitigation (Jun 9, 2025. arXiv)
- MAGI-1: Autoregressive Video Generation at Scale (May 19, 2025. arXiv)
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset (May 14, 2025. arXiv)
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers (Apr 14, 2025. arXiv)
- Wan: Open and Advanced Large-Scale Video Generative Models (Mar 26, 2025. arXiv)
- TULIP: Towards Unified Language-Image Pretraining (Mar 19, 2025. arXiv)
- LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models (Mar 18, 2025. arXiv)
- FlowTok: Flowing Seamlessly Across Text and Image Tokens (Mar 13, 2025. arXiv)
- Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k (Mar 12, 2025. arXiv)
- Improving the Diffusability of Autoencoders (Feb 20, 2025. arXiv)
- EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling (Feb 13, 2025. arXiv)
- Masked Autoencoders Are Effective Tokenizers for Diffusion Models (Feb 5, 2025. arXiv)
- Diffusion Autoencoders are Scalable Image Tokenizers (Jan 30, 2025. arXiv)
- CAT: Content-Adaptive Image Tokenization (Jan 6, 2025. arXiv)
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models (Jan 2, 2025. arXiv)
- LTX-Video: Realtime Video Latent Diffusion (Dec 30, 2024. arXiv)
- Open-Sora: Democratizing Efficient Video Production for All (Dec 29, 2024. arXiv)
- Large Motion Video Autoencoding with Cross-modal Video VAE (Dec 23, 2024. arXiv)
- SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer (Dec 14, 2024. arXiv)
- Multimodal Latent Language Modeling with Next-Token Diffusion (Dec 11, 2024. arXiv)
- HunyuanVideo: A Systematic Framework For Large Video Generative Models (Dec 3, 2024. arXiv)
- Open-Sora Plan: Open-Source Large Video Generation Model (Nov 28, 2024. arXiv)
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model (Nov 26, 2024. arXiv)
- REDUCIO! Generating 1024 $\times$ 1024 Video within 16 Seconds using Extremely Compressed Motion Latents (Nov 20, 2024. arXiv)
- Improved Video VAE for Latent Video Diffusion Model (Nov 10, 2024. arXiv)
- Allegro: Open the Black Box of Commercial-Level Video Generation Model (Oct 20, 2024. arXiv)
- Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models (Oct 14, 2024. arXiv)
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers (Oct 14, 2024. arXiv)
- Epsilon-VAE: Denoising as Visual Decoding (Oct 5, 2024. arXiv)
- OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model (Sep 2, 2024. arXiv)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Aug 20, 2024. arXiv)
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (Aug 12, 2024. arXiv)
- FLUX (Aug 1, 2024. BFL)
- Autoregressive Image Generation without Vector Quantization (Jun 17, 2024. arXiv)
- CV-VAE: A Compatible Video VAE for Latent Generative Video Models (May 30, 2024. arXiv)
- EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture (May 29, 2024. arXiv)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (Apr 22, 2024. arXiv)
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Mar 5, 2024. arXiv)
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (Jul 4, 2023. arXiv)
- Diffusion Models as Masked Autoencoders (Apr 6, 2023. arXiv)
- High-Resolution Image Synthesis with Latent Diffusion Models (Dec 20, 2021. arXiv)
- Diffusion Autoencoders: Toward a Meaningful and Decodable Representation (Nov 30, 2021. arXiv)
- Masked Autoencoders Are Scalable Vision Learners (Nov 11, 2021. arXiv)
- Simple and Effective VAE Training with Calibrated Decoders (Jun 23, 2020. arXiv)
- $\beta$-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework (Feb 6, 2017. OpenReview)
- Auto-Encoding Variational Bayes (Dec 20, 2013. arXiv)
- WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction (Aug 7, 2025. arXiv)
- X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again (Jul 29, 2025. arXiv)
- Quantize-then-Rectify: Efficient VQ-VAE Training (Jul 14, 2025. arXiv)
- MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization (Jul 10, 2025. arXiv)
- Hita: Holistic Tokenizer for Autoregressive Image Generation (Jul 3, 2025. arXiv)
- AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model (Jun 5, 2025. arXiv)
- Images are Worth Variable Length of Representations (Jun 4, 2025. arXiv)
- Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning (May 12, 2025. arXiv)
- TVC: Tokenized Video Compression with Ultra-Low Bitrate (Apr 22, 2025. arXiv)
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens (Apr 20, 2025. arXiv)
- GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation (Apr 11, 2025. arXiv)
- VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning (Apr 3, 2025. arXiv)
- MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization (Apr 1, 2025. arXiv)
- CODA: Repurposing Continuous VAEs for Discrete Tokenization (Mar 22, 2025. arXiv)
- Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization (Mar 14, 2025. arXiv)
- V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation (Mar 10, 2025. arXiv)
- UniTok: A Unified Tokenizer for Visual Generation and Understanding (Feb 27, 2025. arXiv)
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length (Feb 19, 2025. arXiv)
- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation (Feb 7, 2025. arXiv)
- VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model (Jan 21, 2025. arXiv)
- One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression (Jan 17, 2025. arXiv)
- Efficient Generative Modeling with Residual Vector Quantization-Based Tokens (Dec 13, 2024. arXiv)
- SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization (Dec 11, 2024. arXiv)
- ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance (Dec 9, 2024. arXiv)
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation (Dec 4, 2024. arXiv)
- Scalable Image Tokenization with Index Backpropagation Quantization (Dec 3, 2024. arXiv)
- Factorized Visual Tokenization and Generation (Nov 25, 2024. arXiv)
- Image Understanding Makes for A Good Tokenizer for Image Generation (Nov 7, 2024. arXiv)
- Adaptive Length Image Tokenization via Recurrent Allocation (Nov 4, 2024. arXiv)
- Addressing Representation Collapse in Vector Quantized Models with One Linear Layer (Nov 4, 2024. arXiv)
- Randomized Autoregressive Visual Generation (Nov 1, 2024. arXiv)
- LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior (Oct 28, 2024. arXiv)
- Emu3: Next-Token Prediction is All You Need (Sep 27, 2024. arXiv)
- MaskBit: Embedding-free Image Generation via Bit Tokens (Sep 24, 2024. arXiv)
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (Sep 6, 2024. arXiv)
- Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation (Sep 6, 2024. arXiv)
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (Aug 22, 2024. arXiv)
- Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99% (Jun 17, 2024. arXiv)
- An Image is Worth 32 Tokens for Reconstruction and Generation (Jun 11, 2024. arXiv)
- Image and Video Tokenization with Binary Spherical Quantization (Jun 11, 2024. arXiv)
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (Jun 10, 2024. arXiv)
- LG-VQ: Language-Guided Codebook Learning (May 23, 2024. arXiv)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (May 16, 2024. arXiv)
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (Apr 3, 2024. arXiv)
- HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes (Dec 31, 2023. arXiv)
- Sequential Modeling Enables Scalable Learning for Large Vision Models (Dec 1, 2023. arXiv)
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation (Oct 9, 2023. arXiv)
- Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers (Oct 9, 2023. arXiv)
- Finite Scalar Quantization: VQ-VAE Made Simple (Sep 27, 2023. arXiv)
- Online Clustered Codebook (Jul 27, 2023. arXiv)
- Planting a SEED of Vision in Large Language Model (Jul 16, 2023. arXiv)
- SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs (Jun 29, 2023. arXiv)
- Designing a Better Asymmetric VQGAN for StableDiffusion (Jun 7, 2023. arXiv)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation (May 23, 2023. arXiv)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization (May 19, 2023. arXiv)
- Binary Latent Diffusion (Apr 10, 2023. arXiv)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment (Feb 2, 2023. arXiv)
- Muse: Text-To-Image Generation via Masked Generative Transformers (Jan 2, 2023. arXiv)
- Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis (Dec 6, 2022. arXiv)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis (Nov 16, 2022. arXiv)
- Phenaki: Variable Length Video Generation From Open Domain Textual Description (Oct 5, 2022. arXiv)
- MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation (Sep 19, 2022. arXiv)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers (Aug 12, 2022. arXiv)
- DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder (Jun 1, 2022. arXiv)
- SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization (May 16, 2022. arXiv)
- CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers (Apr 28, 2022. arXiv)
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer (Apr 7, 2022. arXiv)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors (Mar 24, 2022. arXiv)
- Autoregressive Image Generation using Residual Quantization (Mar 3, 2022. arXiv)
- NÜWA-LIP: Language Guided Image Inpainting with Defect-free VQGAN (Feb 10, 2022. arXiv)
- MaskGIT: Masked Generative Image Transformer (Feb 8, 2022. arXiv)
- Vector Quantized Diffusion Model for Text-to-Image Synthesis (Nov 29, 2021. arXiv)
- Vector-quantized Image Modeling with Improved VQGAN (Oct 9, 2021. arXiv)
- CogView: Mastering Text-to-Image Generation via Transformers (May 26, 2021. arXiv)
- VideoGPT: Video Generation using VQ-VAE and Transformers (Apr 20, 2021. arXiv)
- Predicting Video with VQVAE (Mar 2, 2021. arXiv)
- Zero-Shot Text-to-Image Generation (Feb 24, 2021. arXiv)
- Taming Transformers for High-Resolution Image Synthesis (Dec 17, 2020. arXiv)
- Hierarchical Quantized Autoencoders (Feb 19, 2020. arXiv)
- Generating Diverse High-Fidelity Images with VQ-VAE-2 (Jun 2, 2019. arXiv)
- Neural Discrete Representation Learning (Nov 2, 2017. arXiv)
- AToken: A Unified Tokenizer for Vision (Sep 17, 2025. arXiv)
- OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better (Aug 13, 2025. arXiv)
- DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer (Jul 7, 2025. arXiv)
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation (May 8, 2025. arXiv)
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding (Apr 6, 2025. arXiv)
- Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens (Jan 13, 2025. arXiv)
- Cosmos World Foundation Model Platform for Physical AI (Jan 7, 2025. arXiv)
- VidTok: A Versatile and Open-Source Video Tokenizer (Dec 17, 2024. arXiv)
- Language-Guided Image Tokenization for Generation (Dec 8, 2024. arXiv)
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer (Oct 14, 2024. arXiv)
- OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation (Jun 13, 2024. arXiv)
This template is provided by Awesome-Unified-Multimodal-Models.