
# Awesome-Efficient-LLM

## Taxonomy and Papers


### Sparsity and Pruning

#### Unstructured Pruning

##### Pruning without Weight Update

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | A Simple and Effective Pruning Approach for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | ICLR 2024 | Link | Link |
| 2024 | COPAL: Continual Pruning in Large Language Generative Models | ICML 2024 | Link | N/A |
| 2024 | Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models | ICML 2024 | Link | Link |
| 2025 | BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation | ICML 2025 | Link | N/A |
| 2025 | SAFE: Finding Sparse and Flat Minima to Improve Pruning | ICML 2025 | Link | Link |
| 2025 | SwiftPrune: Hessian-Free Weight Pruning for Large Language Models | EMNLP 2025 Findings | Link | N/A |
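
The methods above share a template: score every weight with a closed-form saliency and zero out the lowest-scoring ones, leaving the survivors untouched. A minimal sketch in the spirit of Wanda's weight-times-activation metric (the global threshold, 50% target, and shapes are illustrative assumptions, not any paper's exact recipe):

```python
import torch

def prune_without_update(weight: torch.Tensor, acts: torch.Tensor, sparsity: float = 0.5):
    """weight: (out_features, in_features); acts: (n_samples, in_features) calibration inputs."""
    act_norm = acts.norm(p=2, dim=0)              # per-input-channel activation norm
    score = weight.abs() * act_norm               # activation-aware saliency per weight
    k = int(weight.numel() * sparsity)
    threshold = score.flatten().kthvalue(k).values
    mask = score > threshold                      # zero the k least salient weights
    return weight * mask

W = torch.randn(64, 128)
X = torch.randn(256, 128)
W_pruned = prune_without_update(W, X)
print(f"sparsity: {(W_pruned == 0).float().mean():.2f}")
```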

##### Pruning with Weight Update

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | Link | Link |
| 2023 | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ICLR 2024 | Link | Link |
| 2023 | The LLM Surgeon | ICLR 2024 | Link | Link |
| 2024 | Fast and Optimal Weight Update for Pruned Large Language Models | TMLR 2024 | Link | Link |
| 2024 | Pruning Foundation Models for High Accuracy without Retraining | EMNLP 2024 Findings | Link | Link |
| 2024 | SparseLLM: Towards Global Pruning for Pre-trained Language Models | NeurIPS 2024 | Link | Link |
| 2024 | ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | Shears: Unstructured Sparsity with Neural Low-rank Adapter Search | NAACL 2024 | Link | Link |
| 2025 | Wanda++: Pruning Large Language Models via Regional Gradients | ACL 2025 Findings | Link | Link |
| 2024 | Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization | ICLR 2025 | Link | Link |
| 2025 | Dynamic Low-Rank Sparse Adaptation for Large Language Models | ICLR 2025 | Link | Link |
| 2024 | Wasserstein Distances, Neuronal Entanglement, and Sparsity | ICLR 2025 | Link | Link |
| 2025 | Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision | ICML 2025 | Link | N/A |
| 2025 | An Efficient Pruner for Large Language Model with Theoretical Guarantee | ICML 2025 | Link | N/A |
| 2025 | DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration | NeurIPS 2025 | Link | Link |
| 2025 | Multi-Objective One-Shot Pruning for Large Language Models | NeurIPS 2025 | Link | N/A |
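
Update-based pruners go one step further: after masking, the surviving weights are re-fit so the layer's output on calibration data stays close to the dense output. A naive per-row least-squares sketch of that idea (SparseGPT and its successors use far more efficient Hessian-based solvers; shapes and the magnitude mask are illustrative):

```python
import torch

def prune_with_update(weight, acts, sparsity=0.5):
    """weight: (out, in); acts: (n, in) calibration inputs."""
    dense_out = acts @ weight.T                      # (n, out) reference outputs
    new_w = torch.zeros_like(weight)
    n_keep = weight.shape[1] - int(weight.shape[1] * sparsity)
    for i, row in enumerate(weight):
        keep = row.abs().topk(n_keep).indices        # magnitude mask per row
        # Least-squares refit of the surviving weights against the dense output.
        new_w[i, keep] = torch.linalg.pinv(acts[:, keep]) @ dense_out[:, i]
    return new_w

W, X = torch.randn(16, 64), torch.randn(256, 64)
W_s = prune_with_update(W, X)
err = (X @ W_s.T - X @ W.T).norm() / (X @ W.T).norm()
print(f"relative reconstruction error: {err:.4f}")
```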

##### Sparsity Rate Allocation

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | ICML 2024 | Link | Link |
| 2024 | ALS: Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment | NeurIPS 2024 | Link | Link |
| 2024 | Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | EvoPress: Accurate Dynamic Model Compression via Evolutionary Search | ICML 2025 | Link | Link |
| 2025 | Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective | ICML 2025 | Link | Link |
| 2025 | DLP: Dynamic Layerwise Pruning in Large Language Models | ICML 2025 | Link | Link |
| 2025 | Lua-LLM: Learning Unstructured-Sparsity Allocation for Large Language Models | NeurIPS 2025 | Link | N/A |
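
These works replace a uniform per-layer sparsity with a learned or heuristic budget per layer. A toy sketch of the idea using a crude outlier-fraction proxy, loosely inspired by OWL's outlier score (the 3-sigma rule, the spread parameter, and the synthetic layers are assumptions for illustration only):

```python
import torch

def allocate_sparsity(layer_weights, target=0.7, spread=0.1):
    # Importance proxy: fraction of weights that are "outliers" (> mean|w| + 3*std).
    scores = []
    for w in layer_weights:
        flat = w.abs().flatten()
        scores.append((flat > flat.mean() + 3 * flat.std()).float().mean().item())
    s = torch.tensor(scores)
    s = (s - s.mean()) / (s.std() + 1e-8)           # normalize scores across layers
    # More outliers -> more important -> less sparsity; mean stays near the target.
    return (target - spread * s.clamp(-1, 1)).tolist()

# Layers with progressively more injected outliers get lower sparsity budgets.
layers = [torch.randn(256, 256) + (torch.rand(256, 256) < 0.002 * i).float() * 8
          for i in range(4)]
print([f"{r:.3f}" for r in allocate_sparsity(layers)])
```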

##### Sparse plus Low-Rank Compression

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition | ICLR 2025 | Link | Link |
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2025 | 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs | NeurIPS 2025 | Link | Link |
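
The common decomposition here is W ≈ L + S with L low-rank and S sparse. A generic alternating sketch of the idea (robust-PCA-flavored; the listed papers use more principled solvers and calibration data, and the rank, keep fraction, and iteration count below are illustrative):

```python
import torch

def sparse_plus_low_rank(W, rank=8, keep_frac=0.05, iters=5):
    S = torch.zeros_like(W)
    for _ in range(iters):
        U, Sig, V = torch.svd_lowrank(W - S, q=rank)     # low-rank fit of the residual
        L = U @ torch.diag(Sig) @ V.T
        R = W - L                                        # sparse part: largest residuals
        k = int(R.numel() * keep_frac)
        thresh = R.abs().flatten().topk(k).values[-1]
        S = R * (R.abs() >= thresh)
    return L, S

W = torch.randn(128, 128)
L, S = sparse_plus_low_rank(W)
print(f"approx error: {(W - L - S).norm() / W.norm():.3f}")
```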

##### Calibration Dataset

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | On the Impact of Calibration Data in Post-training Quantization and Pruning | ACL 2024 | Link | Link |
| 2024 | Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning | EMNLP 2024 | Link | Link |
| 2024 | Beware of Calibration Data for Pruning Large Language Models | ICLR 2025 | Link | Link |

##### Evaluation of Pruned Models

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Compressing LLMs: The Truth is Rarely Pure and Never Simple | ICLR 2024 | Link | Link |
| 2025 | Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression | ICML 2025 | Link | Link |
| 2025 | Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs | EMNLP 2025 Findings | Link | N/A |

#### Semi-structured Pruning

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | WRP: Weight Recover Prune for Structured Sparsity | ACL 2024 | Link | Link |
| 2024 | Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | Pruning Large Language Models with Semi-Structural Adaptive Sparse Training | AAAI 2025 | Link | Link |
| 2024 | MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs | ICML 2025 | Link | Link |
| 2025 | PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models | NeurIPS 2025 | Link | Link |
| 2025 | TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks | NeurIPS 2025 | Link | Link |
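
N:M (typically 2:4) sparsity keeps exactly N nonzeros in every group of M consecutive weights, a pattern GPU sparse tensor cores can exploit directly. A magnitude-based sketch of the pattern; the learnable-mask and permutation methods above improve on *which* N entries survive:

```python
import torch

def nm_prune(weight, n=2, m=4):
    out, inp = weight.shape
    assert inp % m == 0, "input dim must be divisible by the group size"
    groups = weight.reshape(out, inp // m, m)
    # Keep the n largest-magnitude weights inside every group of m.
    idx = groups.abs().topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, idx, True)
    return (groups * mask).reshape(out, inp)

W = torch.randn(8, 16)
print(nm_prune(W))
```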

#### Structured Pruning

##### Head and Neuron Pruning

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | Fluctuation-based Adaptive Structured Pruning for Large Language Models | AAAI 2024 | Link | Link |
| 2023 | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | ICLR 2024 | Link | Link |
| 2024 | BlockPruner: Fine-grained Pruning for Large Language Models | ACL 2025 Findings | Link | Link |
| 2024 | Structured Optimal Brain Pruning for Large Language Models | EMNLP 2024 | Link | N/A |
| 2024 | Search for Efficient Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | SlimGPT: Layer-wise Structured Pruning for Large Language Models | NeurIPS 2024 | Link | N/A |
| 2024 | Compact Language Models via Pruning and Knowledge Distillation | NeurIPS 2024 | Link | Link |
| 2024 | DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization | NeurIPS 2025 | Link | Link |
| 2025 | Olica: Efficient Structured Pruning of Large Language Models without Retraining | ICML 2025 | Link | Link |
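
Unlike unstructured masks, these methods delete whole attention heads or MLP neurons, so the dense shapes shrink and no sparse kernels are needed. A minimal neuron-pruning sketch using a mean-activation saliency (an illustrative stand-in for the fluctuation- or gradient-based criteria in the papers above):

```python
import torch

def prune_neurons(up: torch.nn.Linear, down: torch.nn.Linear, x, keep: int):
    acts = torch.relu(up(x))                    # (n, hidden) calibration activations
    saliency = acts.abs().mean(dim=0)           # one score per hidden neuron
    idx = saliency.topk(keep).indices.sort().values
    new_up = torch.nn.Linear(up.in_features, keep)
    new_down = torch.nn.Linear(keep, down.out_features)
    # Physically remove pruned rows/columns; both matrices shrink.
    new_up.weight.data, new_up.bias.data = up.weight.data[idx], up.bias.data[idx]
    new_down.weight.data = down.weight.data[:, idx]
    new_down.bias.data = down.bias.data.clone()
    return new_up, new_down

up, down = torch.nn.Linear(64, 256), torch.nn.Linear(256, 64)
new_up, new_down = prune_neurons(up, down, torch.randn(128, 64), keep=128)
print(new_up.weight.shape, new_down.weight.shape)
```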

##### Layer Pruning

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Shortened LLaMA: A Simple Depth Pruning for Large Language Models | ICLR 2024 Workshop | Link | Link |
| 2024 | LaCo: Large Language Model Pruning via Layer Collapse | EMNLP 2024 Findings | Link | Link |
| 2024 | ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | ACL 2025 Findings | Link | Link |
| 2024 | Streamlining Redundant Layers to Compress Large Language Models | ICLR 2025 | Link | Link |
| 2024 | SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks | ICML 2024 | Link | Link |
| 2024 | Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging | EMNLP 2024 | Link | N/A |
| 2024 | TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs | ACL 2025 | Link | Link |
| 2025 | A Simple Linear Patch Revives Layer-Pruned Large Language Models | NeurIPS 2025 | Link | Link |
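
Depth pruning drops entire transformer blocks, typically those whose output barely rotates the residual stream. A sketch of cosine-similarity block scoring in the spirit of ShortGPT-style redundancy metrics (the toy residual blocks and calibration states below are illustrative):

```python
import torch

def block_redundancy(blocks, hidden):
    """blocks: list of callables h -> h; hidden: (tokens, dim) input states."""
    scores = []
    for block in blocks:
        out = block(hidden)
        cos = torch.nn.functional.cosine_similarity(hidden, out, dim=-1)
        scores.append(cos.mean().item())   # high similarity => block is safe to drop
        hidden = out
    return scores

layers = [torch.nn.Linear(64, 64) for _ in range(4)]
blocks = [lambda h, l=l: h + 0.1 * l(h) for l in layers]   # toy residual blocks
print(block_redundancy(blocks, torch.randn(32, 64)))
```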

##### Other Topics

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning | ACL 2024 Findings | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
| 2024 | APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference | ICML 2024 | Link | Link |
| 2024 | Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations | ACL 2024 Findings | Link | Link |
| 2024 | LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models | ICML 2024 | Link | Link |
| 2024 | Pruning as a Domain-specific LLM Extractor | NAACL 2024 Findings | Link | Link |
| 2024 | Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient | ACL 2025 | Link | Link |
| 2025 | One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models | ACL 2025 Findings | Link | N/A |
| 2024 | RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model | NAACL 2024 Findings | Link | N/A |
| 2024 | Finding Transformer Circuits with Edge Pruning | NeurIPS 2024 | Link | Link |
| 2024 | MoDeGPT: Modular Decomposition for Large Language Model Compression | ICLR 2025 | Link | Link |
| 2024 | The Unreasonable Ineffectiveness of the Deeper Layers | ICLR 2025 | Link | N/A |
| 2024 | PAT: Pruning-Aware Tuning for Large Language Models | AAAI 2025 | Link | Link |
| 2024 | Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy | EMNLP 2024 Findings | Link | Link |
| 2024 | LEMON: Reviving Stronger and Smaller LMs from Larger LMs with Linear Parameter Fusion | ACL 2024 | Link | N/A |
| 2024 | DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization | ACL 2025 | Link | Link |
| 2025 | You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning | ICLR 2025 | Link | Link |
| 2025 | LLaMaFlex: Many-in-one LLMs via Generalized Pruning and Weight Sharing | ICLR 2025 | Link | N/A |
| 2025 | Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing | ICLR 2025 | Link | Link |
| 2025 | Instruction-Following Pruning for Large Language Models | ICML 2025 | Link | N/A |
| 2025 | Let LLM Tell What to Prune and How Much to Prune | ICML 2025 | Link | Link |
| 2025 | Prompt-based Depth Pruning of Large Language Models | ICML 2025 | Link | Link |
| 2025 | IG-Pruning: Input-Guided Block Pruning for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | PIP: Perturbation-based Iterative Pruning for Large Language Models | EMNLP 2025 Findings | Link | N/A |
| 2025 | ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization | NeurIPS 2025 | Link | Link |
| 2025 | Restoring Pruned Large Language Models via Lost Component Compensation | NeurIPS 2025 | Link | Link |

#### Activation Sparsity

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2023 | Link | Link |
| 2023 | ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models | ICLR 2024 | Link | N/A |
| 2024 | CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models | COLM 2024 | Link | Link |
| 2024 | ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models | EMNLP 2024 | Link | Link |
| 2024 | Training-Free Activation Sparsity in Large Language Models | ICLR 2025 | Link | Link |
| 2024 | Sparsing Law: Towards Large Language Models with Greater Activation Sparsity | ICML 2025 | Link | Link |
| 2025 | La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation | ICML 2025 | Link | N/A |
| 2025 | R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference | ICLR 2025 | Link | Link |
| 2024 | Sirius: Contextual Sparsity with Correction for Efficient LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Learn To be Efficient: Build Structured Sparsity in Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | Weight-Aware Activation Sparsity with Constrained Bayesian Optimization Scheduling for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity | NeurIPS 2025 | Link | Link |
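
Contextual/activation sparsity exploits the fact that, for a given input, most MLP activations are near zero and the corresponding weight rows can be skipped. A training-free thresholding sketch (the fixed keep fraction is an illustrative stand-in for the calibrated thresholds of methods such as CATS; the module shapes are assumptions):

```python
import torch

class SparseMLP(torch.nn.Module):
    def __init__(self, dim=64, hidden=256, keep_frac=0.3):
        super().__init__()
        self.up, self.down = torch.nn.Linear(dim, hidden), torch.nn.Linear(hidden, dim)
        self.keep_frac = keep_frac

    def forward(self, x):
        a = torch.nn.functional.silu(self.up(x))
        k = int(a.shape[-1] * self.keep_frac)
        thresh = a.abs().topk(k, dim=-1).values[..., -1:]  # per-token threshold
        a = a * (a.abs() >= thresh)                        # zero small activations
        return self.down(a)              # rows for zeroed activations could be skipped

mlp = SparseMLP()
print(mlp(torch.randn(8, 64)).shape)
```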

#### Joint Sparsification and Quantization

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models | EMNLP 2024 Findings | Link | Link |
| 2024 | Effective Interplay between Sparsity and Quantization: From Theory to Practice | ICLR 2025 | Link | Link |
| 2024 | Compressing Large Language Models by Joint Sparsification and Quantization | ICML 2024 | Link | Link |
| 2024 | SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression | ICML 2025 | Link | Link |
| 2025 | Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs | arXiv 2025 | Link | Link |

### Quantization

#### LLM Quantization

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | Link | Link |
| 2025 | OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting | ICLR 2025 | Link | Link |
| 2025 | SpinQuant: LLM Quantization with Learned Rotations | ICLR 2025 | Link | Link |
| 2022 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | Link | Link |
| 2023 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys 2024 | Link | Link |
| 2024 | QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML 2024 | Link | Link |
| 2025 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MLSys 2025 | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | MLSys 2024 | Link | Link |
| 2024 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2023 | QuIP: 2-Bit Quantization of Large Language Models With Guarantees | NeurIPS 2023 | Link | Link |
| 2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 2022 | Link | Link |
| 2023 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | EMNLP 2023 | Link | Link |
| 2025 | GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration | ICML 2025 | Link | Link |
| 2024 | MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization | NeurIPS 2024 | Link | Link |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | ACL 2024 | Link | Link |
| 2024 | BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation | ACL 2024 | Link | Link |
| 2023 | OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models | AAAI 2024 (Oral) | Link | Link |
| 2024 | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | ICLR 2024 | Link | Link |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | Link | Link |
| 2024 | LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | ICLR 2024 | Link | Link |
| 2024 | OneBit: Towards Extremely Low-bit Large Language Models | NeurIPS 2024 | Link | Link |
| 2023 | LLM-FP4: 4-bit Floating-Point Quantized Transformers | EMNLP 2023 | Link | Link |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | ICML 2025 | Link | Link |
| 2024 | SqueezeLLM: Dense-and-Sparse Quantization | ICML 2024 | Link | Link |
| 2023 | RPTQ: Reorder-based Post-training Quantization for Large Language Models | N/A | Link | Link |
| 2024 | QQQ: Quality Quattuor-Bit Quantization for Large Language Models | ICLR | Link | Link |
| 2024 | Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs | N/A | Link | Link |
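
Most entries above build on the same round-to-nearest baseline: pick a scale per channel (or group), round, clamp. A minimal symmetric per-output-channel INT8 weight-quantization sketch; GPTQ-/AWQ-style methods improve on exactly this baseline by updating or rescaling weights to shrink the rounding error on calibration data:

```python
import torch

def quantize_int8(weight):
    """weight: (out, in). Returns int8 codes and per-row scales."""
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0  # symmetric range per row
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

W = torch.randn(64, 128)
q, s = quantize_int8(W)
print(f"mean abs error: {(dequantize(q, s) - W).abs().mean():.5f}")
```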

#### VLM Quantization

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Q-VLM: Post-training Quantization for Large Vision Language Models | NeurIPS 2024 | Link | Link |
| 2025 | MBQ: Modality-Balanced Quantization for Large Vision-Language Models | CVPR 2025 | Link | Link |
| 2025 | MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization | ACM MM 2025 | Link | Link |
| 2025 | CASP: Compression of Large Multimodal Models Based on Attention Sparsity | CVPR 2025 | Link | Link |

### Knowledge Distillation

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025 | Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression | CVPR 2025 | Link | Link |
| 2025 | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | ICLR 2025 | Link | Link |
| 2024 | PromptKD: Prompt-based Knowledge Distillation for Large Language Models | EMNLP 2024 | Link | Link |
| 2023 | AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression | ACL 2023 | Link | Link |
| 2023 | DiffKD: Diffusion-based Knowledge Distillation for Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | SCOTT: Self-Consistent Chain-of-Thought Distillation | ACL 2023 | Link | Link |
| 2023 | Distilling Script Knowledge from Large Language Models for Constrained Language Planning | ACL 2023 | Link | Link |
| 2023 | DOT: A Distillation-Oriented Trainer | ICCV 2023 | Link | Link |
| 2022 | TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ECCV 2022 | Link | Link |
| 2022 | DIST: Distilling Large Language Models with Small-Scale Data | NeurIPS 2022 | Link | Link |
| 2022 | Decoupled Knowledge Distillation | CVPR 2022 | Link | Link |
| 2021 | HRKD: Hierarchical Relation-based Knowledge Distillation | EMNLP 2021 | Link | Link |
| 2021 | Distilling Knowledge via Knowledge Review | CVPR 2021 | Link | Link |
| 2023 | Specializing Smaller Language Models towards Multi-Step Reasoning | ICML 2023 | Link | Link |
| 2023 | DISCO: Distilling Counterfactuals with Large Language Models | ACL 2023 | Link | Link |
| 2023 | Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind | NeurIPS 2023 | Link | Link |
| 2023 | PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation | EMNLP 2023 | Link | Link |
| 2024 | Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data | AAAI 2024 | Link | Link |
| 2023 | Democratizing Reasoning Ability: Tailored Learning from Large Language Model | EMNLP 2023 | Link | Link |
| 2023 | GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model | ACL 2023 | Link | Link |
| 2023 | Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | ACL 2023 | Link | Link |
| 2023 | Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models | EMNLP 2023 | Link | Link |
| 2020 | Few Sample Knowledge Distillation for Efficient Network Compression | CVPR 2020 | Link | Link |
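
The classic response-based recipe underlying many of these papers matches the student to the teacher's temperature-softened distribution. A standard Hinton-style distillation loss (the temperature and mixing weight are conventional hyperparameters, not values from any listed paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student log-probs
        F.softmax(teacher_logits / T, dim=-1),       # softened teacher targets
        reduction="batchmean",
    ) * (T * T)                                      # standard T^2 gradient rescaling
    ce = F.cross_entropy(student_logits, labels)     # ordinary supervised term
    return alpha * kd + (1 - alpha) * ce

s, t = torch.randn(8, 100), torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(distillation_loss(s, t, labels))
```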

### Low-Rank Decomposition

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Compressing Large Language Models using Low Rank and Low Precision Decomposition | NeurIPS 2024 | Link | Link |
| 2022 | Compressible-composable NeRF via Rank-residual Decomposition | NeurIPS 2022 | Link | Link |
| 2024 | Unified Low-rank Compression Framework for Click-through Rate Prediction | KDD 2024 | Link | Link |
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
| 2024 | Low-Rank Knowledge Decomposition for Medical Foundation Models | CVPR 2024 | Link | Link |
| 2024 | LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking | CVPR 2024 | Link | Link |
| 2021 | Decomposable-Net: Scalable Low-Rank Compression for Neural Networks | IJCAI 2021 | Link | Link |
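
The shared core is replacing a dense weight with a product of two thin factors, cutting both parameters and FLOPs when the rank is much smaller than the layer dimensions. A plain truncated-SVD sketch of factorizing a linear layer (activation-aware variants in the table whiten the weight with calibration statistics before the SVD):

```python
import torch

def low_rank_linear(layer: torch.nn.Linear, rank: int):
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = torch.nn.Linear(layer.in_features, rank, bias=False)
    B = torch.nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    # Split sqrt(S) between the two factors for balanced magnitudes.
    A.weight.data = torch.diag(S[:rank].sqrt()) @ Vh[:rank]     # (rank, in)
    B.weight.data = U[:, :rank] @ torch.diag(S[:rank].sqrt())   # (out, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return torch.nn.Sequential(A, B)

dense = torch.nn.Linear(512, 512)
compact = low_rank_linear(dense, rank=64)
x = torch.randn(4, 512)
print((compact(x) - dense(x)).norm() / dense(x).norm())
```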

### KV Cache Compression

| Year | Title | Venue | Paper | Code | Category |
|---|---|---|---|---|---|
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | NeurIPS 2023 | Link | Link | Token Eviction |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | NeurIPS 2023 | Link | Link | Token Eviction |
| 2023 | Efficient Streaming Language Models with Attention Sinks | ICLR 2024 | Link | Link | Token Eviction |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | NeurIPS 2024 | Link | Link | Token Eviction |
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | NeurIPS 2024 | Link | Link | Token Eviction |
| 2025 | InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation | N/A | Link | Link | Token Eviction |
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | ICLR 2024 | Link | Link | Token Eviction |
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | MLSys 2024 | Link | Link | Token Eviction |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | ICML 2024 | Link | Link | Token Eviction |
| 2024 | On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference | N/A | Link | Link | Token Eviction |
| 2025 | R-KV: Redundancy-aware KV Cache Compression for Reasoning Models | N/A | Link | Link | Token Eviction |
| 2025 | SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator | ICML 2025 | Link | Link | Token Eviction |
| 2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | ICLR 2025 | Link | N/A | Token Eviction |
| 2025 | Squeezed Attention: Accelerating Long Context Length LLM Inference | ACL 2025 | Link | Link | Token Eviction |
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | N/A | Link | Link | Budget Allocation |
| 2024 | VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration | ICLR 2025 | Link | N/A | Budget Allocation |
| 2025 | LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models | ICML 2025 | Link | Link | Budget Allocation |
| 2025 | CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences | ICLR 2025 | Link | Link | Budget Allocation |
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | NeurIPS 2024 | Link | Link | Cache Merging |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | Compressed Context Memory For Online Language Model Interaction | ICLR 2024 | Link | Link | Cache Merging |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | ICML 2024 | Link | N/A | Cache Merging |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | EMNLP 2024 Findings | Link | Link | Cache Merging |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models | ICLR 2025 | Link | Link | Cache Merging |
| 2025 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | ICCV 2025 | Link | Link | Cache Merging |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | ACL 2024 | Link | Link | Quantization |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | ICML 2024 | Link | Link | Quantization |
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | NeurIPS 2024 | Link | Link | Quantization |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | COLM 2024 | Link | Link | Quantization |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | NeurIPS 2024 | Link | Link | Quantization |
| 2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | ACL 2025 | Link | N/A | Quantization |
| 2024 | Palu: Compressing KV-Cache with Low-Rank Projection | ICLR 2025 | Link | Link | Low-Rank Projection |
| 2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | ICLR 2025 | Link | Link | Token Eviction |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | ACL 2024 | Link | N/A | Token Eviction |
| 2024 | SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation | ACL 2025 | Link | Link | Token Eviction |
| 2024 | AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations | ACL 2025 | Link | N/A | Quantization |
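
Token eviction, the largest category above, keeps only the cache entries that matter for future attention. A heavy-hitter-style sketch that scores tokens by accumulated attention mass (in the spirit of H2O; real systems also protect recent tokens and attention sinks, and the shapes below are illustrative):

```python
import torch

def evict_kv(keys, values, attn_weights, budget):
    """keys/values: (tokens, dim); attn_weights: (queries, tokens)."""
    if keys.shape[0] <= budget:
        return keys, values
    score = attn_weights.sum(dim=0)                  # accumulated attention per token
    keep = score.topk(budget).indices.sort().values  # keep heavy hitters, in order
    return keys[keep], values[keep]

K, V = torch.randn(100, 64), torch.randn(100, 64)
attn = torch.softmax(torch.randn(10, 100), dim=-1)
K2, V2 = evict_kv(K, V, attn, budget=32)
print(K2.shape)
```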

### Speculative Decoding

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting | NeurIPS 2024 | Link | Link |
| 2024 | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees | EMNLP 2024 | Link | Link |
| 2025 | Learning Harmonized Representations for Speculative Sampling | ICLR 2025 | Link | Link |
| 2025 | Parallel Speculative Decoding with Adaptive Draft Length | ICLR 2025 | Link | Link |
| 2025 | SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration | ICLR 2025 | Link | Link |
| 2025 | Pre-Training Curriculum for Multi-Token Prediction in Language Models | ACL 2025 | Link | Link |
| 2025 | Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree | ACL 2025 | Link | N/A |
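
All of these build on the draft-and-verify loop: a cheap model proposes several tokens, the target model checks them in one forward pass, and the longest agreeing prefix is accepted. A greedy-verification sketch (real implementations use rejection sampling to preserve the target distribution exactly; the toy logit functions standing in for language models are assumptions):

```python
import torch

def speculative_step(target, draft, prefix, k=4):
    """target/draft: 1-D token ids -> (len, vocab) logits; prefix: 1-D ids."""
    seq = prefix.clone()
    for _ in range(k):                  # 1. cheap autoregressive draft of k tokens
        nxt = draft(seq)[-1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, nxt])
    # 2. one target pass scores every drafted position in parallel
    target_pred = target(seq)[len(prefix) - 1 : -1].argmax(dim=-1)
    drafted = seq[len(prefix):]
    # 3. accept the longest agreeing prefix; on a mismatch, take the target's token
    agree = (target_pred == drafted).int().cumprod(dim=0)
    n = int(agree.sum())
    return torch.cat([prefix, drafted[:n], target_pred[n:n + 1]])

vocab, dim = 50, 16
emb = torch.randn(vocab, dim)
target = lambda ids: emb[ids] @ emb.T   # toy logit functions, not real LMs
draft = lambda ids: 0.9 * emb[ids] @ emb.T + 0.1 * torch.randn(len(ids), vocab)
print(speculative_step(target, draft, torch.tensor([3, 7, 1])))
```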

### Diffusion Models

#### Quantization

| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2025 | SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models | ICLR 2025 | T2I | Link | Link |
| 2025 | ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation | ICLR 2025 | Image Generation | Link | Link |
| 2023 | Post-training Quantization on Diffusion Models | CVPR 2023 | T2I, T2V | Link | Link |
| 2023 | Q-Diffusion: Quantizing Diffusion Models | ICCV 2023 | Image Generation | Link | Link |
| 2024 | Towards Accurate Post-training Quantization for Diffusion Models | CVPR 2024 | Image Generation | Link | Link |
| 2024 | EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models | ICLR 2024 | Image Generation | Link | Link |
| 2025 | Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers | CVPR 2025 | T2I, T2V | Link | Link |
| 2024 | TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models | CVPR 2024 | Image Generation | Link | Link |
| 2023 | Temporal Dynamic Quantization for Diffusion Models | NeurIPS 2023 | N/A | Link | Link |
| 2024 | PTQ4DiT: Post-training Quantization for Diffusion Transformers | NeurIPS 2024 | T2I | Link | Link |
| 2025 | Data-free Video Diffusion Transformers Quantization | N/A | T2V | Link | Link |
| 2025 | DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing | WACV 2025 | T2V | Link | Link |
| 2025 | Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs | N/A | T2T | Link | N/A |
| 2025 | SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration | ICLR 2025 | T2I, T2V | Link | Link |

#### Sparsity

| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2024 | DiTFastAttn: Attention Compression for Diffusion Transformer Models | NeurIPS 2024 | T2I, T2V | Link | Link |
| 2025 | Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity | ICML 2025 | T2V | Link | Link |
| 2025 | Radial Attention: Sparse Attention with Energy Decay for Long Video Generation | NeurIPS 2025 | T2V | Link | Link |
| 2025 | XAttention: Block Sparse Attention with Antidiagonal Scoring | ICML 2025 | T2T, T2V | Link | Link |
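
Block-sparse attention tiles the sequence into blocks, cheaply scores which key blocks each query block needs, and runs full attention only inside the selected pairs. An illustrative sketch (the block size, top-k, pooled scoring proxy, and the lack of causal masking are simplifications, not any listed paper's method):

```python
import torch

def block_sparse_attention(q, k, v, block=16, topk=2):
    """q, k, v: (seq, dim); seq must be divisible by the block size."""
    seq, dim = q.shape
    nb = seq // block
    qb, kb, vb = (t.reshape(nb, block, dim) for t in (q, k, v))
    # Cheap proxy: score key blocks with pooled query/key dot products.
    score = qb.mean(dim=1) @ kb.mean(dim=1).T        # (nb, nb)
    keep = score.topk(topk, dim=-1).indices          # selected key blocks per query block
    out = torch.zeros_like(qb)
    for i in range(nb):                              # full attention inside selected pairs
        ks = kb[keep[i]].reshape(-1, dim)
        vs = vb[keep[i]].reshape(-1, dim)
        attn = torch.softmax(qb[i] @ ks.T / dim ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out.reshape(seq, dim)

q = k = v = torch.randn(64, 32)
print(block_sparse_attention(q, k, v).shape)
```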

#### Caching & Reuse

| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2025 | Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model | CVPR 2025 | T2V | Link | Link |
| 2025 | From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers | ICCV 2025 | T2V | Link | Link |
| 2025 | Adaptive Caching for Faster Video Generation with Diffusion Transformers | N/A | T2V | Link | Link |
| 2024 | DeepCache: Accelerating Diffusion Models for Free | CVPR 2024 | T2I | Link | Link |
| 2024 | Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching | NeurIPS 2024 | T2I | Link | Link |
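
Cache-and-reuse methods exploit the similarity of intermediate features across adjacent diffusion timesteps: recompute a block only every few steps and reuse its output in between. A DeepCache-flavored sketch with a fixed refresh interval (the listed papers make the schedule adaptive or learned; the interval and toy block are assumptions):

```python
import torch

class CachedBlock(torch.nn.Module):
    def __init__(self, block, refresh=4):
        super().__init__()
        self.block, self.refresh = block, refresh
        self.step, self.cache = 0, None

    def forward(self, x):
        if self.cache is None or self.step % self.refresh == 0:
            self.cache = self.block(x)     # full recompute on refresh steps
        self.step += 1
        return self.cache                  # reused output between refreshes

block = CachedBlock(torch.nn.Linear(32, 32), refresh=4)
xs = [torch.randn(1, 32) for _ in range(8)]
outs = [block(x) for x in xs]              # only 2 real forward passes for 8 steps
print(torch.equal(outs[0], outs[3]), torch.equal(outs[3], outs[4]))
```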
