- Sparsity and Pruning
- Quantization
- Knowledge Distillation
- Low-Rank Decomposition
- KV Cache Compression
- Speculative Decoding
- Diffusion Models
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2023 | A Simple and Effective Pruning Approach for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | ICLR 2024 | Link | Link |
| 2024 | COPAL: Continual Pruning in Large Language Generative Models | ICML 2024 | Link | N/A |
| 2024 | Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models | ICML 2024 | Link | Link |
| 2025 | BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation | ICML 2025 | Link | N/A |
| 2025 | SAFE: Finding Sparse and Flat Minima to Improve Pruning | ICML 2025 | Link | Link |
| 2025 | SwiftPrune: Hessian-Free Weight Pruning for Large Language Models | EMNLP 2025 Findings | Link | N/A |
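Many of the metric-based approaches above score individual weights from a small calibration set and remove the lowest-scoring ones without retraining. As a concrete reference point, here is a minimal sketch of a Wanda-style score (weight magnitude scaled by the input activation norm) applied to one linear layer; the tensor shapes, calibration batch, and 50% sparsity target are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of a Wanda-style pruning score for one linear layer:
# score_ij = |W_ij| * ||X_j||_2, pruned row-wise at a target sparsity.
# W, X, and the 50% sparsity level are illustrative placeholders.
import torch

def wanda_prune(W: torch.Tensor, X: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """W: (out_features, in_features) weight; X: (tokens, in_features) calibration activations."""
    act_norm = X.norm(p=2, dim=0)              # ||X_j||_2 per input channel
    scores = W.abs() * act_norm                # elementwise metric, broadcast over rows
    k = int(W.shape[1] * sparsity)             # weights to remove per output row
    prune_idx = torch.topk(scores, k, dim=1, largest=False).indices
    mask = torch.ones_like(W)
    mask.scatter_(1, prune_idx, 0.0)
    return W * mask

W = torch.randn(4096, 4096)
X = torch.randn(512, 4096)                     # a small calibration batch
W_sparse = wanda_prune(W, X, sparsity=0.5)
print((W_sparse == 0).float().mean())          # ~0.5
```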
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | Link | Link |
| 2023 | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ICLR 2024 | Link | Link |
| 2023 | The LLM Surgeon | ICLR 2024 | Link | Link |
| 2024 | Fast and Optimal Weight Update for Pruned Large Language Models | TMLR 2024 | Link | Link |
| 2024 | Pruning Foundation Models for High Accuracy without Retraining | EMNLP 2024 Findings | Link | Link |
| 2024 | SparseLLM: Towards Global Pruning for Pre-trained Language Models | NeurIPS 2024 | Link | Link |
| 2024 | ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | Shears: Unstructured Sparsity with Neural Low-rank Adapter Search | NAACL 2024 | Link | Link |
| 2025 | Wanda++: Pruning Large Language Models via Regional Gradients | ACL 2025 Findings | Link | Link |
| 2024 | Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization | ICLR 2025 | Link | Link |
| 2025 | Dynamic Low-Rank Sparse Adaptation for Large Language Models | ICLR 2025 | Link | Link |
| 2024 | Wasserstein Distances, Neuronal Entanglement, and Sparsity | ICLR 2025 | Link | Link |
| 2025 | Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision | ICML 2025 | Link | N/A |
| 2025 | An Efficient Pruner for Large Language Model with Theoretical Guarantee | ICML 2025 | Link | N/A |
| 2025 | DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration | NeurIPS 2025 | Link | Link |
| 2025 | Multi-Objective One-Shot Pruning for Large Language Models | NeurIPS 2025 | Link | N/A |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2023 | Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | ICML 2024 | Link | Link |
| 2024 | ALS: Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment | NeurIPS 2024 | Link | Link |
| 2024 | Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | EvoPress: Accurate Dynamic Model Compression via Evolutionary Search | ICML 2025 | Link | Link |
| 2025 | Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective | ICML 2025 | Link | Link |
| 2025 | DLP: Dynamic Layerwise Pruning in Large Language Models | ICML 2025 | Link | Link |
| 2025 | Lua-LLM: Learning Unstructured-Sparsity Allocation for Large Language Models | NeurIPS 2025 | Link | N/A |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2024 | OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition | ICLR 2025 | Link | Link |
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2025 | 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs | NeurIPS 2025 | Link | Link |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2024 | On the Impact of Calibration Data in Post-training Quantization and Pruning | ACL 2024 | Link | Link |
| 2024 | Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning | EMNLP 2024 | Link | Link |
| 2024 | Beware of Calibration Data for Pruning Large Language Models | ICLR 2025 | Link | Link |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2023 | Compressing LLMs: The Truth is Rarely Pure and Never Simple | ICLR 2024 | Link | Link |
| 2025 | Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression | ICML 2025 | Link | Link |
| 2025 | Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs | EMNLP 2025 Findings | Link | N/A |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2024 | WRP: Weight Recover Prune for Structured Sparsity | ACL 2024 | Link | Link |
| 2024 | Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | Pruning Large Language Models with Semi-Structural Adaptive Sparse Training | AAAI 2025 | Link | Link |
| 2024 | MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs | ICML 2025 | Link | Link |
| 2025 | PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models | NeurIPS 2025 | Link | Link |
| 2025 | TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks | NeurIPS 2025 | Link | Link |
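The N:M papers above learn or search sparsity masks that hardware can accelerate directly (e.g. 2:4 sparsity on recent NVIDIA GPUs). A minimal magnitude-based baseline for the 2:4 pattern, assuming a toy weight matrix rather than any specific method above, looks roughly like this:

```python
# Minimal sketch of magnitude-based N:M (here 2:4) semi-structured pruning:
# in every contiguous group of 4 weights along the input dimension,
# keep the 2 largest-magnitude entries and zero the rest.
import torch

def prune_n_m(W: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    out_f, in_f = W.shape
    assert in_f % m == 0
    groups = W.reshape(out_f, in_f // m, m)
    keep_idx = groups.abs().topk(n, dim=-1).indices        # n largest per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep_idx, 1.0)
    return (groups * mask).reshape(out_f, in_f)

W = torch.randn(8, 16)
W_24 = prune_n_m(W)                                        # 50% sparsity in a 2:4 pattern
print((W_24 == 0).float().mean())
```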
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2023 | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | Fluctuation-based Adaptive Structured Pruning for Large Language Models | AAAI 2024 | Link | Link |
| 2023 | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | ICLR 2024 | Link | Link |
| 2024 | BlockPruner: Fine-grained Pruning for Large Language Models | ACL 2025 Findings | Link | Link |
| 2024 | Structured Optimal Brain Pruning for Large Language Models | EMNLP 2024 | Link | N/A |
| 2024 | Search for Efficient Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | SlimGPT: Layer-wise Structured Pruning for Large Language Models | NeurIPS 2024 | Link | N/A |
| 2024 | Compact Language Models via Pruning and Knowledge Distillation | NeurIPS 2024 | Link | Link |
| 2024 | DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization | NeurIPS 2025 | Link | Link |
| 2025 | Olica: Efficient Structured Pruning of Large Language Models without Retraining | ICML 2025 | Link | Link |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2024 | Shortened LLaMA: A Simple Depth Pruning for Large Language Models | ICLR 2024 Workshop | Link | Link |
| 2024 | LaCo: Large Language Model Pruning via Layer Collapse | EMNLP 2024 Findings | Link | Link |
| 2024 | ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | ACL 2025 Findings | Link | Link |
| 2024 | Streamlining Redundant Layers to Compress Large Language Models | ICLR 2025 | Link | Link |
| 2024 | SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks | ICML 2024 | Link | Link |
| 2024 | Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging | EMNLP 2024 | Link | N/A |
| 2024 | TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs | ACL 2025 | Link | Link |
| 2025 | A Simple Linear Patch Revives Layer-Pruned Large Language Models | NeurIPS 2025 | Link | Link |
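Several of the depth-pruning papers above (e.g. ShortGPT, SLEB) rank transformer blocks by how little they change the hidden states and then drop the most redundant ones. A simplified sketch of that redundancy-by-similarity idea, with toy hidden states standing in for calibration activations and illustrative sizes, might look like:

```python
# Simplified sketch of the redundancy-by-similarity idea behind depth pruning
# (e.g. ShortGPT-style "block influence"): a block whose output is nearly
# identical to its input contributes little and is a candidate for removal.
# Shapes, the number of layers, and the prune count are illustrative.
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """hidden_in/out: (tokens, d_model) states before/after one transformer block."""
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1)   # per-token similarity
    return float(1.0 - cos.mean())                             # low value => redundant block

# rank layers by influence and drop the least influential ones
num_layers, tokens, d = 12, 256, 64
states = [torch.randn(tokens, d) for _ in range(num_layers + 1)]
scores = [block_influence(states[i], states[i + 1]) for i in range(num_layers)]
drop = sorted(range(num_layers), key=lambda i: scores[i])[:4]  # e.g. prune 4 blocks
print(sorted(drop))
```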
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2023 | Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning | ACL 2024 Findings | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
| 2024 | APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference | ICML 2024 | Link | Link |
| 2024 | Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations | ACL 2024 Findings | Link | Link |
| 2024 | LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models | ICML 2024 | Link | Link |
| 2024 | Pruning as a Domain-specific LLM Extractor | NAACL 2024 Findings | Link | Link |
| 2024 | Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient | ACL 2025 | Link | Link |
| 2025 | One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models | ACL 2025 Findings | Link | N/A |
| 2024 | RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model | NAACL 2024 Findings | Link | N/A |
| 2024 | Finding Transformer Circuits with Edge Pruning | NeurIPS 2024 | Link | Link |
| 2024 | MoDeGPT: Modular Decomposition for Large Language Model Compression | ICLR 2025 | Link | Link |
| 2024 | The Unreasonable Ineffectiveness of the Deeper Layers | ICLR 2025 | Link | N/A |
| 2024 | PAT: Pruning-Aware Tuning for Large Language Models | AAAI 2025 | Link | Link |
| 2024 | Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy | EMNLP 2024 Findings | Link | Link |
| 2024 | LEMON: Reviving Stronger and Smaller LMs from Larger LMs with Linear Parameter Fusion | ACL 2024 | Link | N/A |
| 2024 | DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization | ACL 2025 | Link | Link |
| 2025 | You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning | ICLR 2025 | Link | Link |
| 2025 | LLaMaFlex: Many-in-one LLMs via Generalized Pruning and Weight Sharing | ICLR 2025 | Link | N/A |
| 2025 | Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing | ICLR 2025 | Link | Link |
| 2025 | Instruction-Following Pruning for Large Language Models | ICML 2025 | Link | N/A |
| 2025 | Let LLM Tell What to Prune and How Much to Prune | ICML 2025 | Link | Link |
| 2025 | Prompt-based Depth Pruning of Large Language Models | ICML 2025 | Link | Link |
| 2025 | IG-Pruning: Input-Guided Block Pruning for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | PIP: Perturbation-based Iterative Pruning for Large Language Models | EMNLP 2025 Findings | Link | N/A |
| 2025 | ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization | NeurIPS 2025 | Link | Link |
| 2025 | Restoring Pruned Large Language Models via Lost Component Compensation | NeurIPS 2025 | Link | Link |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2023 | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2023 | Link | Link |
| 2023 | ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models | ICLR 2024 | Link | N/A |
| 2024 | CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models | COLM 2024 | Link | Link |
| 2024 | ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models | EMNLP 2024 | Link | Link |
| 2024 | Training-Free Activation Sparsity in Large Language Models | ICLR 2025 | Link | Link |
| 2024 | Sparsing Law: Towards Large Language Models with Greater Activation Sparsity | ICML 2025 | Link | Link |
| 2025 | La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation | ICML 2025 | Link | N/A |
| 2025 | R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference | ICLR 2025 | Link | Link |
| 2024 | Sirius: Contextual Sparsity with Correction for Efficient LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Learn To be Efficient: Build Structured Sparsity in Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | Weight-Aware Activation Sparsity with Constrained Bayesian Optimization Scheduling for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity | NeurIPS 2025 | Link | Link |
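The contextual/activation-sparsity methods above exploit the fact that many intermediate activations are near zero at inference time, so the matching rows of the following projection can be skipped. A rough, training-free sketch of the thresholding idea (not any specific paper's method), with an illustrative sparsity target and MLP width:

```python
# Minimal sketch of threshold-based activation sparsity: zero out low-magnitude
# hidden activations so the next projection can skip the corresponding rows.
# The 70% sparsity target and toy MLP width are illustrative.
import torch

def sparsify_activations(h: torch.Tensor, target_sparsity: float = 0.7) -> torch.Tensor:
    """h: (tokens, hidden) intermediate activations of an MLP block."""
    threshold = torch.quantile(h.abs(), target_sparsity)     # calibration-style cutoff
    return torch.where(h.abs() >= threshold, h, torch.zeros_like(h))

h = torch.randn(32, 11008)                                    # e.g. a LLaMA-style MLP width
h_sparse = sparsify_activations(h)
print((h_sparse == 0).float().mean())                         # ~0.7
# downstream: rows of the down-projection matching zeroed channels can be skipped
```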
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2024 | SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models | EMNLP 2024 Findings | Link | Link |
| 2024 | Effective Interplay between Sparsity and Quantization: From Theory to Practice | ICLR 2025 | Link | Link |
| 2024 | Compressing large language models by joint sparsification and quantization | ICML 2024 | Link | Link |
| 2024 | SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression | ICML 2025 | Link | Link |
| 2025 | Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs | arXiv 2025 | Link | Link |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2023 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | Link | Link |
| 2025 | OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting | ICLR 2025 | Link | Link |
| 2025 | SpinQuant: LLM quantization with learned rotations | ICLR 2025 | Link | Link |
| 2022 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | Link | Link |
| 2023 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys 2024 | Link | Link |
| 2024 | QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML 2024 | Link | Link |
| 2025 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MLSys 2025 | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | MLSys 2024 | Link | Link |
| 2024 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2023 | QuIP: 2-Bit Quantization of Large Language Models With Guarantees | NeurIPS 2023 | Link | Link |
| 2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 2022 | Link | Link |
| 2023 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | EMNLP 2023 | Link | Link |
| 2025 | GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration | ICML 2025 | Link | Link |
| 2024 | MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization | NeurIPS 2024 | Link | Link |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | ACL 2024 | Link | Link |
| 2024 | BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation | ACL 2024 | Link | Link |
| 2023 | OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models | AAAI 2024 (Oral) | Link | Link |
| 2024 | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | ICLR 2024 | Link | Link |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | Link | Link |
| 2024 | LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | ICLR 2024 | Link | Link |
| 2024 | OneBit: Towards Extremely Low-bit Large Language Models | NeurIPS 2024 | Link | Link |
| 2023 | LLM-FP4: 4-bit Floating-Point Quantized Transformers | EMNLP 2023 | Link | Link |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | ICML 2025 | Link | Link |
| 2024 | SqueezeLLM: Dense-and-Sparse Quantization | ICML 2024 | Link | Link |
| 2023 | RPTQ: Reorder-based Post-training Quantization for Large Language Models | | Link | Link |
| 2024 | QQQ: Quality Quattuor-Bit Quantization for Large Language Models | ICLR | Link | Link |
| 2024 | Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs | | Link | Link |
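Most of the post-training quantization methods above improve on the plain round-to-nearest (RTN) baseline: pick a scale per output channel, round, and clamp. A minimal sketch of that baseline, with illustrative bit width and tensor shapes:

```python
# Minimal sketch of the round-to-nearest (RTN) baseline that PTQ methods
# typically improve upon: symmetric per-output-channel quantization of a
# weight matrix to `bits` integers, plus the dequantized reconstruction.
import torch

def quantize_rtn(W: torch.Tensor, bits: int = 4):
    """W: (out_features, in_features). Returns int codes and per-row scales."""
    qmax = 2 ** (bits - 1) - 1                              # e.g. 7 for 4-bit symmetric
    scale = W.abs().amax(dim=1, keepdim=True) / qmax        # one scale per output channel
    q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

W = torch.randn(4096, 4096)
q, s = quantize_rtn(W, bits=4)
err = (dequantize(q, s) - W).pow(2).mean()
print(err)                                                   # reconstruction MSE of plain RTN
```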
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2024 | Q-VLM: Post-training Quantization for Large Vision Language Models | NeurIPS 2024 | Link | Link |
| 2025 | MBQ: Modality-Balanced Quantization for Large Vision-Language Models | CVPR 2025 | Link | Link |
| 2025 | MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization | ACM MM 2025 | Link | Link |
| 2025 | CASP: Compression of Large Multimodal Models Based on Attention Sparsity | CVPR 2025 | Link | Link |
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2025 | Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression | CVPR 2025 | Link | Link |
| 2025 | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | ICLR 2025 | Link | Link |
| 2024 | PromptKD: Prompt-based Knowledge Distillation for Large Language Models | EMNLP 2024 | Link | Link |
| 2023 | AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression | ACL 2023 | Link | Link |
| 2023 | DiffKD: Diffusion-based Knowledge Distillation for Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | SCOTT: Self-Consistent Chain-of-Thought Distillation | ACL 2023 | Link | Link |
| 2023 | Distilling Script Knowledge from Large Language Models for Constrained Language Planning | ACL 2023 | Link | Link |
| 2023 | DOT: A Distillation-Oriented Trainer | ICCV 2023 | Link | Link |
| 2022 | TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ECCV 2022 | Link | Link |
| 2022 | DIST: Distilling Large Language Models with Small-Scale Data | NeurIPS 2022 | Link | Link |
| 2022 | Decoupled Knowledge Distillation | CVPR 2022 | Link | Link |
| 2021 | HRKD: Hierarchical Relation-based Knowledge Distillation | EMNLP 2021 | Link | Link |
| 2021 | Distilling Knowledge via Knowledge Review | CVPR 2021 | Link | Link |
| 2023 | Specializing Smaller Language Models towards Multi-Step Reasoning | ICML 2023 | Link | Link |
| 2023 | DISCO: Distilling Counterfactuals with Large Language Models | ACL 2023 | Link | Link |
| 2023 | Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind | NeurIPS 2023 | Link | Link |
| 2023 | PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation | EMNLP 2023 | Link | Link |
| 2024 | Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data | AAAI 2024 | Link | Link |
| 2023 | Democratizing Reasoning Ability: Tailored Learning from Large Language Model | EMNLP 2023 | Link | Link |
| 2023 | GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model | ACL 2023 | Link | Link |
| 2023 | Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | ACL 2023 | Link | Link |
| 2023 | Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models | EMNLP 2023 | Link | Link |
| 2020 | Few Sample Knowledge Distillation for Efficient Network Compression | CVPR 2020 | Link | Link |
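Many of the distillation papers above build on the classic response-based recipe: the student matches the teacher's temperature-softened output distribution in addition to the ground-truth labels. A minimal sketch with illustrative temperature, mixing weight, and random logits (not any specific paper's loss):

```python
# Minimal sketch of temperature-scaled logit distillation: a KL term between
# softened student and teacher distributions mixed with the usual CE loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 32000, requires_grad=True)   # (batch, vocab)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```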
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2024 | Compressing Large Language Models using Low Rank and Low Precision Decomposition | NeurIPS 2024 | Link | Link |
| 2022 | Compressible-composable NeRF via Rank-residual Decomposition | NeurIPS 2022 | Link | Link |
| 2024 | Unified Low-rank Compression Framework for Click-through Rate Prediction | KDD 2024 | Link | Link |
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
| 2024 | Low-Rank Knowledge Decomposition for Medical Foundation Models | CVPR 2024 | Link | Link |
| 2024 | LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking | CVPR 2024 | Link | Link |
| 2021 | Decomposable-Net: Scalable Low-Rank Compression for Neural Networks | IJCAI 2021 | Link | Link |
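The low-rank methods above ultimately replace a weight matrix with thin factors. A minimal truncated-SVD sketch (rank and matrix sizes are illustrative) shows the basic parameter-count trade-off:

```python
# Minimal sketch of truncated-SVD weight factorization: W (m x n) is replaced
# by two thin factors A (m x r) and B (r x n), cutting parameters when r << min(m, n).
import torch

def low_rank_factorize(W: torch.Tensor, r: int):
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]            # (m, r), singular values folded into the left factor
    B = Vh[:r, :]                   # (r, n)
    return A, B

W = torch.randn(4096, 4096)
A, B = low_rank_factorize(W, r=256)
rel_err = (A @ B - W).norm() / W.norm()
params_ratio = (A.numel() + B.numel()) / W.numel()          # ~12.5% of the original parameters
print(rel_err.item(), params_ratio)
```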
| Year | Title | Venue | Paper | Code | Category |
|---|---|---|---|---|---|
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | NeurIPS 2023 | Link | Link | Token Eviction |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | NeurIPS 2023 | Link | Link | Token Eviction |
| 2023 | Efficient Streaming Language Models with Attention Sinks | ICLR 2024 | Link | Link | Token Eviction |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | NeurIPS 2024 | Link | Link | Token Eviction |
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | NeurIPS 2024 | Link | Link | Token Eviction |
| 2025 | InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation | | Link | Link | Token Eviction |
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | ICLR 2024 | Link | Link | Token Eviction |
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | MLSys 2024 | Link | Link | Token Eviction |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | ICML 2024 | Link | Link | Token Eviction |
| 2024 | On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference | | Link | Link | Token Eviction |
| 2025 | R-KV: Redundancy-aware KV Cache Compression for Reasoning Models | | Link | Link | Token Eviction |
| 2025 | SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator | ICML 2025 | Link | Link | Token Eviction |
| 2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | ICLR 2025 | Link | N/A | Token Eviction |
| 2025 | Squeezed Attention: Accelerating Long Context Length LLM Inference | ACL 2025 | Link | Link | Token Eviction |
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | | Link | Link | Budget Allocation |
| 2024 | VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration | ICLR 2025 | Link | N/A | Budget Allocation |
| 2025 | LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models | ICML 2025 | Link | Link | Budget Allocation |
| 2025 | CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences | ICLR 2025 | Link | Link | Budget Allocation |
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | NeurIPS 2024 | Link | Link | Cache Merging |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | Compressed Context Memory For Online Language Model Interaction | ICLR 2024 | Link | Link | Cache Merging |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | ICML 2024 | Link | N/A | Cache Merging |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | EMNLP 2024 Findings | Link | Link | Cache Merging |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models | ICLR 2025 | Link | Link | Cache Merging |
| 2025 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | ICCV 2025 | Link | Link | Cache Merging |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | ACL 2024 | Link | Link | Quantization |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | ICML 2024 | Link | Link | Quantization |
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | NeurIPS 2024 | Link | Link | Quantization |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | COLM 2024 | Link | Link | Quantization |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | NeurIPS 2024 | Link | Link | Quantization |
| 2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | ACL 2025 | Link | N/A | Quantization |
| 2024 | Palu: Compressing KV-Cache with Low-Rank Projection | ICLR 2025 | Link | Link | Low Rank Projection |
| 2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | ICLR 2025 | Link | Link | Token Eviction |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | ACL 2024 | Link | N/A | Token Eviction |
| 2024 | SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation | ACL 2025 | Link | Link | Token Eviction |
| 2024 | AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations | ACL 2025 | Link | N/A | Quantization |
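Many of the token-eviction papers above (H2O, Scissorhands, SnapKV, ...) keep a small budget of "heavy-hitter" tokens plus a recent window and discard the rest of the KV cache. A simplified sketch of that selection step, with illustrative budgets and random attention weights rather than any specific paper's policy:

```python
# Simplified sketch of heavy-hitter style token eviction: keep the most recent
# tokens plus the tokens with the largest accumulated attention mass, and drop
# the rest of the KV cache. Budgets and the random attention are illustrative.
import torch

def evict_kv(keys, values, attn_weights, recent: int = 64, heavy: int = 64):
    """keys/values: (seq, d); attn_weights: (queries, seq) softmax attention rows."""
    seq_len = keys.shape[0]
    if seq_len <= recent + heavy:
        return keys, values
    scores = attn_weights.sum(dim=0)                         # accumulated attention per token
    scores[-recent:] = float("inf")                          # always keep the recent window
    keep = torch.topk(scores, recent + heavy).indices.sort().values
    return keys[keep], values[keep]

seq, d = 1024, 128
keys, values = torch.randn(seq, d), torch.randn(seq, d)
attn = torch.softmax(torch.randn(seq, seq), dim=-1)
k2, v2 = evict_kv(keys, values, attn)
print(k2.shape)                                              # (128, 128) after eviction
```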
| Year | Title | Venue | Paper | code |
|---|---|---|---|---|
| 2024 | Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting | NeurIPS 2024 | Kangaroo | code |
| 2024 | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees | EMNLP 2024 | EAGLE2 | code |
| 2025 | Learning Harmonized Representations for Speculative Sampling | ICLR 2025 | HASS | code |
| 2025 | Parallel Speculative Decoding with Adaptive Draft Length | ICLR 2025 | PEARL | code |
| 2025 | SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration | ICLR 2025 | SWIFT | code |
| 2025 | Pre-Training Curriculum for Multi-Token Prediction in Language Models | ACL 2025 | paper | code |
| 2025 | Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree | ACL 2025 | paper | N/A |
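The methods above share the same draft-and-verify loop: a cheap draft model proposes several tokens and the target model checks them in a single forward pass. The sketch below shows only the greedy variant (no rejection sampling), with toy stand-in models rather than a real LLM API; the vocabulary size, draft length, and `greedy_next` helper are illustrative assumptions.

```python
# Simplified sketch of greedy draft-and-verify speculative decoding:
# accept the longest prefix of drafted tokens that matches the target model's
# greedy choices, then take one token from the target model.
import torch

def greedy_next(logits_fn, tokens):
    """Greedy continuation at every position, given full-sequence logits."""
    return logits_fn(tokens).argmax(dim=-1)

def speculative_step(target_fn, draft_fn, prefix: list[int], k: int = 4) -> list[int]:
    # 1) draft k tokens autoregressively with the cheap model
    draft = list(prefix)
    for _ in range(k):
        draft.append(int(greedy_next(draft_fn, torch.tensor([draft]))[0, -1]))
    proposed = draft[len(prefix):]
    # 2) one target pass over prefix + proposals gives the target's choice at each step
    target_pred = greedy_next(target_fn, torch.tensor([draft]))[0]
    accepted = []
    for i, tok in enumerate(proposed):
        target_tok = int(target_pred[len(prefix) - 1 + i])   # target's token after position i-1
        if tok == target_tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)                      # replace first mismatch and stop
            return prefix + accepted
    accepted.append(int(target_pred[-1]))                    # all accepted: take one bonus token
    return prefix + accepted

# toy "models": fixed random logits over a 100-token vocabulary
vocab = 100
torch.manual_seed(0)
target_emb, draft_emb = torch.randn(vocab, vocab), torch.randn(vocab, vocab)
target_fn = lambda t: target_emb[t]                          # (batch, seq, vocab)
draft_fn = lambda t: draft_emb[t]
print(speculative_step(target_fn, draft_fn, [1, 2, 3], k=4))
```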
| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2025 | SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models | ICLR 2025 | T2I | Link | Link |
| 2025 | ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation | ICLR 2025 | Image Generation | Link | Link |
| 2023 | Post-training Quantization on Diffusion Models | CVPR 2023 | T2I, T2V | Link | Link |
| 2023 | Q-Diffusion: Quantizing Diffusion Models | ICCV 2023 | Image Generation | Link | Link |
| 2024 | Towards Accurate Post-training Quantization for Diffusion Models | CVPR 2024 | Image Generation | Link | Link |
| 2024 | EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models | ICLR 2024 | Image Generation | Link | Link |
| 2025 | Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers | CVPR 2025 | T2I, T2V | Link | Link |
| 2024 | TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models | CVPR 2024 | Image Generation | Link | Link |
| 2023 | Temporal Dynamic Quantization for Diffusion Models | NeurIPS 2023 | | Link | Link |
| 2024 | PTQ4DiT: Post-training Quantization for Diffusion Transformers | NeurIPS 2024 | T2I | Link | Link |
| 2025 | Data-free Video Diffusion Transformers Quantization | | T2V | Link | Link |
| 2025 | DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing | WACV 2025 | T2V | Link | Link |
| 2025 | Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs | | T2T | Link | N/A |
| 2025 | SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration | ICLR 2025 | T2I, T2V | Link | Link |
| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2024 | DiTFastAttn: Attention Compression for Diffusion Transformer Models | NeurIPS 2024 | T2I, T2V | Link | Link |
| 2025 | Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity | ICML 2025 | T2V | Link | Link |
| 2025 | Radial Attention: Sparse Attention with Energy Decay for Long Video Generation | NeurIPS 2025 | T2V | Link | Link |
| 2025 | XAttention: Block Sparse Attention with Antidiagonal Scoring | ICML 2025 | T2T, T2V | Link | Link |
| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2025 | Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model | CVPR 2025 | T2V | Link | Link |
| 2025 | From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers | ICCV 2025 | T2V | Link | Link |
| 2025 | Adaptive Caching for Faster Video Generation with Diffusion Transformers | | T2V | Link | Link |
| 2024 | DeepCache: Accelerating Diffusion Models for Free | CVPR 2024 | T2I | Link | Link |
| 2024 | Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching | NeurIPS 2024 | T2I | Link | Link |
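The caching papers above reuse expensive intermediate features across adjacent denoising steps instead of recomputing them at every step. A toy sketch of that schedule, with stand-in blocks and illustrative step counts (not any specific paper's caching rule):

```python
# Simplified sketch of feature caching for diffusion inference: expensive block
# outputs are recomputed only every `refresh_every` denoising steps and reused
# in between. `expensive_block`/`cheap_update` are toy stand-ins.
import torch

def expensive_block(x, t):                                   # stand-in for the deep U-Net / DiT blocks
    return torch.tanh(x + 0.1 * t)

def cheap_update(x, feat, t):                                # stand-in for the shallow layers
    return 0.9 * x + 0.1 * feat

def cached_denoise(x: torch.Tensor, steps: int = 50, refresh_every: int = 5) -> torch.Tensor:
    cache, recomputed = None, 0
    for t in range(steps):
        if t % refresh_every == 0 or cache is None:
            cache = expensive_block(x, t)                    # full computation
            recomputed += 1
        x = cheap_update(x, cache, t)                        # reuse cached deep features
    print(f"recomputed deep features {recomputed}/{steps} times")
    return x

out = cached_denoise(torch.randn(1, 4, 64, 64))
```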