
# Awesome-Efficient-LLM

## Taxonomy and Papers


### Sparsity and Pruning

#### Unstructured Pruning

##### Pruning without Weight Update

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | A Simple and Effective Pruning Approach for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | ICLR 2024 | Link | Link |
| 2024 | COPAL: Continual Pruning in Large Language Generative Models | ICML 2024 | Link | N/A |
| 2024 | Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models | ICML 2024 | Link | Link |
| 2025 | BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation | ICML 2025 | Link | N/A |
| 2025 | SAFE: Finding Sparse and Flat Minima to Improve Pruning | ICML 2025 | Link | Link |
| 2025 | SwiftPrune: Hessian-Free Weight Pruning for Large Language Models | EMNLP 2025 Findings | Link | N/A |
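
The methods above share a template: score every weight with a closed-form saliency and zero out the lowest-scoring ones, leaving the survivors untouched. A minimal sketch in the spirit of Wanda's weight-times-activation metric (the global threshold, 50% target, and shapes are illustrative assumptions, not any paper's exact recipe):

```python
import torch

def prune_without_update(weight: torch.Tensor, acts: torch.Tensor, sparsity: float = 0.5):
    """weight: (out_features, in_features); acts: (n_samples, in_features) calibration inputs."""
    act_norm = acts.norm(p=2, dim=0)              # per-input-channel activation norm
    score = weight.abs() * act_norm               # activation-aware saliency per weight
    k = int(weight.numel() * sparsity)
    threshold = score.flatten().kthvalue(k).values
    mask = score > threshold                      # zero the k least salient weights
    return weight * mask

W = torch.randn(64, 128)
X = torch.randn(256, 128)
W_pruned = prune_without_update(W, X)
print(f"sparsity: {(W_pruned == 0).float().mean():.2f}")
```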

##### Pruning with Weight Update

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | ICML 2023 | Link | Link |
| 2023 | Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs | ICLR 2024 | Link | Link |
| 2023 | The LLM Surgeon | ICLR 2024 | Link | Link |
| 2024 | Fast and Optimal Weight Update for Pruned Large Language Models | TMLR 2024 | Link | Link |
| 2024 | Pruning Foundation Models for High Accuracy without Retraining | EMNLP 2024 Findings | Link | Link |
| 2024 | SparseLLM: Towards Global Pruning for Pre-trained Language Models | NeurIPS 2024 | Link | Link |
| 2024 | ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | Shears: Unstructured Sparsity with Neural Low-rank Adapter Search | NAACL 2024 | Link | Link |
| 2025 | Wanda++: Pruning Large Language Models via Regional Gradients | ACL 2025 Findings | Link | Link |
| 2024 | Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization | ICLR 2025 | Link | Link |
| 2025 | Dynamic Low-Rank Sparse Adaptation for Large Language Models | ICLR 2025 | Link | Link |
| 2024 | Wasserstein Distances, Neuronal Entanglement, and Sparsity | ICLR 2025 | Link | Link |
| 2025 | Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision | ICML 2025 | Link | N/A |
| 2025 | An Efficient Pruner for Large Language Model with Theoretical Guarantee | ICML 2025 | Link | N/A |
| 2025 | DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration | NeurIPS 2025 | Link | Link |
| 2025 | Multi-Objective One-Shot Pruning for Large Language Models | NeurIPS 2025 | Link | N/A |
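
Update-based pruners go one step further: after masking, the surviving weights are re-fit so the layer's output on calibration data stays close to the dense output. A naive per-row least-squares sketch of that idea (SparseGPT and its successors use far more efficient Hessian-based solvers; shapes and the magnitude mask are illustrative):

```python
import torch

def prune_with_update(weight, acts, sparsity=0.5):
    """weight: (out, in); acts: (n, in) calibration inputs."""
    dense_out = acts @ weight.T                      # (n, out) reference outputs
    new_w = torch.zeros_like(weight)
    n_keep = weight.shape[1] - int(weight.shape[1] * sparsity)
    for i, row in enumerate(weight):
        keep = row.abs().topk(n_keep).indices        # magnitude mask per row
        # Least-squares refit of the surviving weights against the dense output.
        new_w[i, keep] = torch.linalg.pinv(acts[:, keep]) @ dense_out[:, i]
    return new_w

W, X = torch.randn(16, 64), torch.randn(256, 64)
W_s = prune_with_update(W, X)
err = (X @ W_s.T - X @ W.T).norm() / (X @ W.T).norm()
print(f"relative reconstruction error: {err:.4f}")
```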

##### Sparsity Rate Allocation

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity | ICML 2024 | Link | Link |
| 2024 | ALS: Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment | NeurIPS 2024 | Link | Link |
| 2024 | Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | EvoPress: Accurate Dynamic Model Compression via Evolutionary Search | ICML 2025 | Link | Link |
| 2025 | Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective | ICML 2025 | Link | Link |
| 2025 | DLP: Dynamic Layerwise Pruning in Large Language Models | ICML 2025 | Link | Link |
| 2025 | Lua-LLM: Learning Unstructured-Sparsity Allocation for Large Language Models | NeurIPS 2025 | Link | N/A |
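
These works replace a uniform per-layer sparsity with a learned or heuristic budget per layer. A toy sketch of the idea using a crude outlier-fraction proxy, loosely inspired by OWL's outlier score (the 3-sigma rule, the spread parameter, and the synthetic layers are assumptions for illustration only):

```python
import torch

def allocate_sparsity(layer_weights, target=0.7, spread=0.1):
    # Importance proxy: fraction of weights that are "outliers" (> mean|w| + 3*std).
    scores = []
    for w in layer_weights:
        flat = w.abs().flatten()
        scores.append((flat > flat.mean() + 3 * flat.std()).float().mean().item())
    s = torch.tensor(scores)
    s = (s - s.mean()) / (s.std() + 1e-8)           # normalize scores across layers
    # More outliers -> more important -> less sparsity; mean stays near the target.
    return (target - spread * s.clamp(-1, 1)).tolist()

# Layers with progressively more injected outliers get lower sparsity budgets.
layers = [torch.randn(256, 256) + (torch.rand(256, 256) < 0.002 * i).float() * 8
          for i in range(4)]
print([f"{r:.3f}" for r in allocate_sparsity(layers)])
```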

##### Sparse plus Low-Rank Compression

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition | ICLR 2025 | Link | Link |
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2025 | 1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs | NeurIPS 2025 | Link | Link |
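
The common decomposition here is W ≈ L + S with L low-rank and S sparse. A generic alternating sketch of the idea (robust-PCA-flavored; the listed papers use more principled solvers and calibration data, and the rank, keep fraction, and iteration count below are illustrative):

```python
import torch

def sparse_plus_low_rank(W, rank=8, keep_frac=0.05, iters=5):
    S = torch.zeros_like(W)
    for _ in range(iters):
        U, Sig, V = torch.svd_lowrank(W - S, q=rank)     # low-rank fit of the residual
        L = U @ torch.diag(Sig) @ V.T
        R = W - L                                        # sparse part: largest residuals
        k = int(R.numel() * keep_frac)
        thresh = R.abs().flatten().topk(k).values[-1]
        S = R * (R.abs() >= thresh)
    return L, S

W = torch.randn(128, 128)
L, S = sparse_plus_low_rank(W)
print(f"approx error: {(W - L - S).norm() / W.norm():.3f}")
```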

##### Calibration Dataset

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | On the Impact of Calibration Data in Post-training Quantization and Pruning | ACL 2024 | Link | Link |
| 2024 | Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning | EMNLP 2024 | Link | Link |
| 2024 | Beware of Calibration Data for Pruning Large Language Models | ICLR 2025 | Link | Link |

##### Evaluation of Pruned Models

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Compressing LLMs: The Truth is Rarely Pure and Never Simple | ICLR 2024 | Link | Link |
| 2025 | Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression | ICML 2025 | Link | Link |
| 2025 | Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs | EMNLP 2025 Findings | Link | N/A |

#### Semi-structured Pruning

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | WRP: Weight Recover Prune for Structured Sparsity | ACL 2024 | Link | Link |
| 2024 | Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | Pruning Large Language Models with Semi-Structural Adaptive Sparse Training | AAAI 2025 | Link | Link |
| 2024 | MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs | ICML 2025 | Link | Link |
| 2025 | PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models | NeurIPS 2025 | Link | Link |
| 2025 | TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks | NeurIPS 2025 | Link | Link |
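
N:M (typically 2:4) sparsity keeps exactly N nonzeros in every group of M consecutive weights, a pattern GPU sparse tensor cores can exploit directly. A magnitude-based sketch of the pattern; the learnable-mask and permutation methods above improve on *which* N entries survive:

```python
import torch

def nm_prune(weight, n=2, m=4):
    out, inp = weight.shape
    assert inp % m == 0, "input dim must be divisible by the group size"
    groups = weight.reshape(out, inp // m, m)
    # Keep the n largest-magnitude weights inside every group of m.
    idx = groups.abs().topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, idx, True)
    return (groups * mask).reshape(out, inp)

W = torch.randn(8, 16)
print(nm_prune(W))
```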

#### Structured Pruning

##### Head and Neuron Pruning

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | LLM-Pruner: On the Structural Pruning of Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | Fluctuation-based Adaptive Structured Pruning for Large Language Models | AAAI 2024 | Link | Link |
| 2023 | Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | ICLR 2024 | Link | Link |
| 2024 | BlockPruner: Fine-grained Pruning for Large Language Models | ACL 2025 Findings | Link | Link |
| 2024 | Structured Optimal Brain Pruning for Large Language Models | EMNLP 2024 | Link | N/A |
| 2024 | Search for Efficient Large Language Models | NeurIPS 2024 | Link | Link |
| 2024 | SlimGPT: Layer-wise Structured Pruning for Large Language Models | NeurIPS 2024 | Link | N/A |
| 2024 | Compact Language Models via Pruning and Knowledge Distillation | NeurIPS 2024 | Link | Link |
| 2024 | DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization | NeurIPS 2025 | Link | Link |
| 2025 | Olica: Efficient Structured Pruning of Large Language Models without Retraining | ICML 2025 | Link | Link |
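
Unlike unstructured masks, these methods delete whole attention heads or MLP neurons, so the dense shapes shrink and no sparse kernels are needed. A minimal neuron-pruning sketch using a mean-activation saliency (an illustrative stand-in for the fluctuation- or gradient-based criteria in the papers above):

```python
import torch

def prune_neurons(up: torch.nn.Linear, down: torch.nn.Linear, x, keep: int):
    acts = torch.relu(up(x))                    # (n, hidden) calibration activations
    saliency = acts.abs().mean(dim=0)           # one score per hidden neuron
    idx = saliency.topk(keep).indices.sort().values
    new_up = torch.nn.Linear(up.in_features, keep)
    new_down = torch.nn.Linear(keep, down.out_features)
    # Physically remove pruned rows/columns; both matrices shrink.
    new_up.weight.data, new_up.bias.data = up.weight.data[idx], up.bias.data[idx]
    new_down.weight.data = down.weight.data[:, idx]
    new_down.bias.data = down.bias.data.clone()
    return new_up, new_down

up, down = torch.nn.Linear(64, 256), torch.nn.Linear(256, 64)
new_up, new_down = prune_neurons(up, down, torch.randn(128, 64), keep=128)
print(new_up.weight.shape, new_down.weight.shape)
```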

##### Layer Pruning

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Shortened LLaMA: A Simple Depth Pruning for Large Language Models | ICLR 2024 Workshop | Link | Link |
| 2024 | LaCo: Large Language Model Pruning via Layer Collapse | EMNLP 2024 Findings | Link | Link |
| 2024 | ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | ACL 2025 Findings | Link | Link |
| 2024 | Streamlining Redundant Layers to Compress Large Language Models | ICLR 2025 | Link | Link |
| 2024 | SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks | ICML 2024 | Link | Link |
| 2024 | Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging | EMNLP 2024 | Link | N/A |
| 2024 | TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs | ACL 2025 | Link | Link |
| 2025 | A Simple Linear Patch Revives Layer-Pruned Large Language Models | NeurIPS 2025 | Link | Link |
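
Depth pruning drops entire transformer blocks, typically those whose output barely rotates the residual stream. A sketch of cosine-similarity block scoring in the spirit of ShortGPT-style redundancy metrics (the toy residual blocks and calibration states below are illustrative):

```python
import torch

def block_redundancy(blocks, hidden):
    """blocks: list of callables h -> h; hidden: (tokens, dim) input states."""
    scores = []
    for block in blocks:
        out = block(hidden)
        cos = torch.nn.functional.cosine_similarity(hidden, out, dim=-1)
        scores.append(cos.mean().item())   # high similarity => block is safe to drop
        hidden = out
    return scores

layers = [torch.nn.Linear(64, 64) for _ in range(4)]
blocks = [lambda h, l=l: h + 0.1 * l(h) for l in layers]   # toy residual blocks
print(block_redundancy(blocks, torch.randn(32, 64)))
```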

##### Other Topics

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning | ACL 2024 Findings | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
| 2024 | APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference | ICML 2024 | Link | Link |
| 2024 | Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations | ACL 2024 Findings | Link | Link |
| 2024 | LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models | ICML 2024 | Link | Link |
| 2024 | Pruning as a Domain-specific LLM Extractor | NAACL 2024 Findings | Link | Link |
| 2024 | Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient | ACL 2025 | Link | Link |
| 2025 | One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models | ACL 2025 Findings | Link | N/A |
| 2024 | RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model | NAACL 2024 Findings | Link | N/A |
| 2024 | Finding Transformer Circuits with Edge Pruning | NeurIPS 2024 | Link | Link |
| 2024 | MoDeGPT: Modular Decomposition for Large Language Model Compression | ICLR 2025 | Link | Link |
| 2024 | The Unreasonable Ineffectiveness of the Deeper Layers | ICLR 2025 | Link | N/A |
| 2024 | PAT: Pruning-Aware Tuning for Large Language Models | AAAI 2025 | Link | Link |
| 2024 | Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy | EMNLP 2024 Findings | Link | Link |
| 2024 | LEMON: Reviving Stronger and Smaller LMs from Larger LMs with Linear Parameter Fusion | ACL 2024 | Link | N/A |
| 2024 | DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization | ACL 2025 | Link | Link |
| 2025 | You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning | ICLR 2025 | Link | Link |
| 2025 | LLaMaFlex: Many-in-one LLMs via Generalized Pruning and Weight Sharing | ICLR 2025 | Link | N/A |
| 2025 | Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing | ICLR 2025 | Link | Link |
| 2025 | Instruction-Following Pruning for Large Language Models | ICML 2025 | Link | N/A |
| 2025 | Let LLM Tell What to Prune and How Much to Prune | ICML 2025 | Link | Link |
| 2025 | Prompt-based Depth Pruning of Large Language Models | ICML 2025 | Link | Link |
| 2025 | IG-Pruning: Input-Guided Block Pruning for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | PIP: Perturbation-based Iterative Pruning for Large Language Models | EMNLP 2025 Findings | Link | N/A |
| 2025 | ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization | NeurIPS 2025 | Link | Link |
| 2025 | Restoring Pruned Large Language Models via Lost Component Compensation | NeurIPS 2025 | Link | Link |

#### Activation Sparsity

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML 2023 | Link | Link |
| 2023 | ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models | ICLR 2024 | Link | N/A |
| 2024 | CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models | COLM 2024 | Link | Link |
| 2024 | ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models | EMNLP 2024 | Link | Link |
| 2024 | Training-Free Activation Sparsity in Large Language Models | ICLR 2025 | Link | Link |
| 2024 | Sparsing Law: Towards Large Language Models with Greater Activation Sparsity | ICML 2025 | Link | Link |
| 2025 | La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation | ICML 2025 | Link | N/A |
| 2025 | R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference | ICLR 2025 | Link | Link |
| 2024 | Sirius: Contextual Sparsity with Correction for Efficient LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Learn To be Efficient: Build Structured Sparsity in Large Language Models | NeurIPS 2024 | Link | Link |
| 2025 | Weight-Aware Activation Sparsity with Constrained Bayesian Optimization Scheduling for Large Language Models | EMNLP 2025 | Link | Link |
| 2025 | Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity | NeurIPS 2025 | Link | Link |
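
Contextual/activation sparsity exploits the fact that, for a given input, most MLP activations are near zero and the corresponding weight rows can be skipped. A training-free thresholding sketch (the fixed keep fraction is an illustrative stand-in for the calibrated thresholds of methods such as CATS; the module shapes are assumptions):

```python
import torch

class SparseMLP(torch.nn.Module):
    def __init__(self, dim=64, hidden=256, keep_frac=0.3):
        super().__init__()
        self.up, self.down = torch.nn.Linear(dim, hidden), torch.nn.Linear(hidden, dim)
        self.keep_frac = keep_frac

    def forward(self, x):
        a = torch.nn.functional.silu(self.up(x))
        k = int(a.shape[-1] * self.keep_frac)
        thresh = a.abs().topk(k, dim=-1).values[..., -1:]  # per-token threshold
        a = a * (a.abs() >= thresh)                        # zero small activations
        return self.down(a)              # rows for zeroed activations could be skipped

mlp = SparseMLP()
print(mlp(torch.randn(8, 64)).shape)
```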

#### Joint Sparsification and Quantization

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models | EMNLP 2024 Findings | Link | Link |
| 2024 | Effective Interplay between Sparsity and Quantization: From Theory to Practice | ICLR 2025 | Link | Link |
| 2024 | Compressing Large Language Models by Joint Sparsification and Quantization | ICML 2024 | Link | Link |
| 2024 | SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression | ICML 2025 | Link | Link |
| 2025 | Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs | arXiv 2025 | Link | Link |

### Quantization

#### LLM Quantization

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | ICLR 2023 | Link | Link |
| 2025 | OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting | ICLR 2025 | Link | Link |
| 2025 | SpinQuant: LLM Quantization with Learned Rotations | ICLR 2025 | Link | Link |
| 2022 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | ICML 2023 | Link | Link |
| 2023 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MLSys 2024 | Link | Link |
| 2024 | QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | ICML 2024 | Link | Link |
| 2025 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MLSys 2025 | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | NeurIPS 2024 | Link | Link |
| 2024 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | MLSys 2024 | Link | Link |
| 2024 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2023 | QuIP: 2-Bit Quantization of Large Language Models With Guarantees | NeurIPS 2023 | Link | Link |
| 2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 2022 | Link | Link |
| 2023 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | EMNLP 2023 | Link | Link |
| 2025 | GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration | ICML 2025 | Link | Link |
| 2024 | MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization | NeurIPS 2024 | Link | Link |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | ICLR 2024 | Link | Link |
| 2024 | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | ACL 2024 | Link | Link |
| 2024 | BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation | ACL 2024 | Link | Link |
| 2023 | OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models | AAAI 2024 (Oral) | Link | Link |
| 2024 | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | ICLR 2024 | Link | Link |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | NeurIPS 2022 | Link | Link |
| 2024 | LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | ICLR 2024 | Link | Link |
| 2024 | OneBit: Towards Extremely Low-bit Large Language Models | NeurIPS 2024 | Link | Link |
| 2023 | LLM-FP4: 4-bit Floating-Point Quantized Transformers | EMNLP 2023 | Link | Link |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | ICML 2025 | Link | Link |
| 2024 | SqueezeLLM: Dense-and-Sparse Quantization | ICML 2024 | Link | Link |
| 2023 | RPTQ: Reorder-based Post-training Quantization for Large Language Models | N/A | Link | Link |
| 2024 | QQQ: Quality Quattuor-Bit Quantization for Large Language Models | ICLR | Link | Link |
| 2024 | Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs | N/A | Link | Link |
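
Most entries above build on the same round-to-nearest baseline: pick a scale per channel (or group), round, clamp. A minimal symmetric per-output-channel INT8 weight-quantization sketch; GPTQ-/AWQ-style methods improve on exactly this baseline by updating or rescaling weights to shrink the rounding error on calibration data:

```python
import torch

def quantize_int8(weight):
    """weight: (out, in). Returns int8 codes and per-row scales."""
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0  # symmetric range per row
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

W = torch.randn(64, 128)
q, s = quantize_int8(W)
print(f"mean abs error: {(dequantize(q, s) - W).abs().mean():.5f}")
```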

#### VLM Quantization

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Q-VLM: Post-training Quantization for Large Vision Language Models | NeurIPS 2024 | Link | Link |
| 2025 | MBQ: Modality-Balanced Quantization for Large Vision-Language Models | CVPR 2025 | Link | Link |
| 2025 | MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization | ACM MM 2025 | Link | Link |
| 2025 | CASP: Compression of Large Multimodal Models Based on Attention Sparsity | CVPR 2025 | Link | Link |

### Knowledge Distillation

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025 | Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression | CVPR 2025 | Link | Link |
| 2025 | LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | ICLR 2025 | Link | Link |
| 2024 | PromptKD: Prompt-based Knowledge Distillation for Large Language Models | EMNLP 2024 | Link | Link |
| 2023 | AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression | ACL 2023 | Link | Link |
| 2023 | DiffKD: Diffusion-based Knowledge Distillation for Large Language Models | NeurIPS 2023 | Link | Link |
| 2023 | SCOTT: Self-Consistent Chain-of-Thought Distillation | ACL 2023 | Link | Link |
| 2023 | Distilling Script Knowledge from Large Language Models for Constrained Language Planning | ACL 2023 | Link | Link |
| 2023 | DOT: A Distillation-Oriented Trainer | ICCV 2023 | Link | Link |
| 2022 | TinyViT: Fast Pretraining Distillation for Small Vision Transformers | ECCV 2022 | Link | Link |
| 2022 | DIST: Distilling Large Language Models with Small-Scale Data | NeurIPS 2022 | Link | Link |
| 2022 | Decoupled Knowledge Distillation | CVPR 2022 | Link | Link |
| 2021 | HRKD: Hierarchical Relation-based Knowledge Distillation | EMNLP 2021 | Link | Link |
| 2021 | Distilling Knowledge via Knowledge Review | CVPR 2021 | Link | Link |
| 2023 | Specializing Smaller Language Models towards Multi-Step Reasoning | ICML 2023 | Link | Link |
| 2023 | DISCO: Distilling Counterfactuals with Large Language Models | ACL 2023 | Link | Link |
| 2023 | Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind | NeurIPS 2023 | Link | Link |
| 2023 | PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation | EMNLP 2023 | Link | Link |
| 2024 | Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data | AAAI 2024 | Link | Link |
| 2023 | Democratizing Reasoning Ability: Tailored Learning from Large Language Model | EMNLP 2023 | Link | Link |
| 2023 | GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model | ACL 2023 | Link | Link |
| 2023 | Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | ACL 2023 | Link | Link |
| 2023 | Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models | EMNLP 2023 | Link | Link |
| 2020 | Few Sample Knowledge Distillation for Efficient Network Compression | CVPR 2020 | Link | Link |
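
The classic response-based recipe underlying many of these papers matches the student to the teacher's temperature-softened distribution. A standard Hinton-style distillation loss (the temperature and mixing weight are conventional hyperparameters, not values from any listed paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student log-probs
        F.softmax(teacher_logits / T, dim=-1),       # softened teacher targets
        reduction="batchmean",
    ) * (T * T)                                      # standard T^2 gradient rescaling
    ce = F.cross_entropy(student_logits, labels)     # ordinary supervised term
    return alpha * kd + (1 - alpha) * ce

s, t = torch.randn(8, 100), torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(distillation_loss(s, t, labels))
```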

### Low-Rank Decomposition

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Compressing Large Language Models using Low Rank and Low Precision Decomposition | NeurIPS 2024 | Link | Link |
| 2022 | Compressible-composable NeRF via Rank-residual Decomposition | NeurIPS 2022 | Link | Link |
| 2024 | Unified Low-rank Compression Framework for Click-through Rate Prediction | KDD 2024 | Link | Link |
| 2025 | Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models | ICML 2025 | Link | Link |
| 2024 | SliceGPT: Compress Large Language Models by Deleting Rows and Columns | ICLR 2024 | Link | Link |
| 2024 | Low-Rank Knowledge Decomposition for Medical Foundation Models | CVPR 2024 | Link | Link |
| 2024 | LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking | CVPR 2024 | Link | Link |
| 2021 | Decomposable-Net: Scalable Low-Rank Compression for Neural Networks | IJCAI 2021 | Link | Link |
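
The shared core is replacing a dense weight with a product of two thin factors, cutting both parameters and FLOPs when the rank is much smaller than the layer dimensions. A plain truncated-SVD sketch of factorizing a linear layer (activation-aware variants in the table whiten the weight with calibration statistics before the SVD):

```python
import torch

def low_rank_linear(layer: torch.nn.Linear, rank: int):
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = torch.nn.Linear(layer.in_features, rank, bias=False)
    B = torch.nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    # Split sqrt(S) between the two factors for balanced magnitudes.
    A.weight.data = torch.diag(S[:rank].sqrt()) @ Vh[:rank]     # (rank, in)
    B.weight.data = U[:, :rank] @ torch.diag(S[:rank].sqrt())   # (out, rank)
    if layer.bias is not None:
        B.bias.data = layer.bias.data.clone()
    return torch.nn.Sequential(A, B)

dense = torch.nn.Linear(512, 512)
compact = low_rank_linear(dense, rank=64)
x = torch.randn(4, 512)
print((compact(x) - dense(x)).norm() / dense(x).norm())
```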

### KV Cache Compression

| Year | Title | Venue | Paper | Code | Category |
|---|---|---|---|---|---|
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | NeurIPS 2023 | Link | Link | Token Eviction |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | NeurIPS 2023 | Link | Link | Token Eviction |
| 2023 | Efficient Streaming Language Models with Attention Sinks | ICLR 2024 | Link | Link | Token Eviction |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | NeurIPS 2024 | Link | Link | Token Eviction |
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | NeurIPS 2024 | Link | Link | Token Eviction |
| 2025 | InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation | N/A | Link | Link | Token Eviction |
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | ICLR 2024 | Link | Link | Token Eviction |
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | MLSys 2024 | Link | Link | Token Eviction |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | ICML 2024 | Link | Link | Token Eviction |
| 2024 | On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference | N/A | Link | Link | Token Eviction |
| 2025 | R-KV: Redundancy-aware KV Cache Compression for Reasoning Models | N/A | Link | Link | Token Eviction |
| 2025 | SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator | ICML 2025 | Link | Link | Token Eviction |
| 2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | ICLR 2025 | Link | N/A | Token Eviction |
| 2025 | Squeezed Attention: Accelerating Long Context Length LLM Inference | ACL 2025 | Link | Link | Token Eviction |
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | N/A | Link | Link | Budget Allocation |
| 2024 | VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration | ICLR 2025 | Link | N/A | Budget Allocation |
| 2025 | LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models | ICML 2025 | Link | Link | Budget Allocation |
| 2025 | CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences | ICLR 2025 | Link | Link | Budget Allocation |
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | NeurIPS 2024 | Link | Link | Cache Merging |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | Compressed Context Memory For Online Language Model Interaction | ICLR 2024 | Link | Link | Cache Merging |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | ICML 2024 | Link | N/A | Cache Merging |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | EMNLP 2024 Findings | Link | Link | Cache Merging |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | ICML 2024 | Link | Link | Cache Merging |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models | ICLR 2025 | Link | Link | Cache Merging |
| 2025 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | ICCV 2025 | Link | Link | Cache Merging |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | ACL 2024 | Link | Link | Quantization |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | ICML 2024 | Link | Link | Quantization |
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | NeurIPS 2024 | Link | Link | Quantization |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | COLM 2024 | Link | Link | Quantization |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | NeurIPS 2024 | Link | Link | Quantization |
| 2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | ACL 2025 | Link | N/A | Quantization |
| 2024 | Palu: Compressing KV-Cache with Low-Rank Projection | ICLR 2025 | Link | Link | Low-Rank Projection |
| 2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | ICLR 2025 | Link | Link | Token Eviction |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | ACL 2024 | Link | N/A | Token Eviction |
| 2024 | SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation | ACL 2025 | Link | Link | Token Eviction |
| 2024 | AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations | ACL 2025 | Link | N/A | Quantization |
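
Token eviction, the largest category above, keeps only the cache entries that matter for future attention. A heavy-hitter-style sketch that scores tokens by accumulated attention mass (in the spirit of H2O; real systems also protect recent tokens and attention sinks, and the shapes below are illustrative):

```python
import torch

def evict_kv(keys, values, attn_weights, budget):
    """keys/values: (tokens, dim); attn_weights: (queries, tokens)."""
    if keys.shape[0] <= budget:
        return keys, values
    score = attn_weights.sum(dim=0)                  # accumulated attention per token
    keep = score.topk(budget).indices.sort().values  # keep heavy hitters, in order
    return keys[keep], values[keep]

K, V = torch.randn(100, 64), torch.randn(100, 64)
attn = torch.softmax(torch.randn(10, 100), dim=-1)
K2, V2 = evict_kv(K, V, attn, budget=32)
print(K2.shape)
```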

### Speculative Decoding

| Year | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024 | Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting | NeurIPS 2024 | Link | Link |
| 2024 | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees | EMNLP 2024 | Link | Link |
| 2025 | Learning Harmonized Representations for Speculative Sampling | ICLR 2025 | Link | Link |
| 2025 | Parallel Speculative Decoding with Adaptive Draft Length | ICLR 2025 | Link | Link |
| 2025 | SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration | ICLR 2025 | Link | Link |
| 2025 | Pre-Training Curriculum for Multi-Token Prediction in Language Models | ACL 2025 | Link | Link |
| 2025 | Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree | ACL 2025 | Link | N/A |
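
All of these build on the draft-and-verify loop: a cheap model proposes several tokens, the target model checks them in one forward pass, and the longest agreeing prefix is accepted. A greedy-verification sketch (real implementations use rejection sampling to preserve the target distribution exactly; the toy logit functions standing in for language models are assumptions):

```python
import torch

def speculative_step(target, draft, prefix, k=4):
    """target/draft: 1-D token ids -> (len, vocab) logits; prefix: 1-D ids."""
    seq = prefix.clone()
    for _ in range(k):                  # 1. cheap autoregressive draft of k tokens
        nxt = draft(seq)[-1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, nxt])
    # 2. one target pass scores every drafted position in parallel
    target_pred = target(seq)[len(prefix) - 1 : -1].argmax(dim=-1)
    drafted = seq[len(prefix):]
    # 3. accept the longest agreeing prefix; on a mismatch, take the target's token
    agree = (target_pred == drafted).int().cumprod(dim=0)
    n = int(agree.sum())
    return torch.cat([prefix, drafted[:n], target_pred[n:n + 1]])

vocab, dim = 50, 16
emb = torch.randn(vocab, dim)
target = lambda ids: emb[ids] @ emb.T   # toy logit functions, not real LMs
draft = lambda ids: 0.9 * emb[ids] @ emb.T + 0.1 * torch.randn(len(ids), vocab)
print(speculative_step(target, draft, torch.tensor([3, 7, 1])))
```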

### Diffusion Models

#### Quantization

| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2025 | SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models | ICLR 2025 | T2I | Link | Link |
| 2025 | ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation | ICLR 2025 | Image Generation | Link | Link |
| 2023 | Post-training Quantization on Diffusion Models | CVPR 2023 | T2I, T2V | Link | Link |
| 2023 | Q-Diffusion: Quantizing Diffusion Models | ICCV 2023 | Image Generation | Link | Link |
| 2024 | Towards Accurate Post-training Quantization for Diffusion Models | CVPR 2024 | Image Generation | Link | Link |
| 2024 | EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models | ICLR 2024 | Image Generation | Link | Link |
| 2025 | Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers | CVPR 2025 | T2I, T2V | Link | Link |
| 2024 | TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models | CVPR 2024 | Image Generation | Link | Link |
| 2023 | Temporal Dynamic Quantization for Diffusion Models | NeurIPS 2023 | N/A | Link | Link |
| 2024 | PTQ4DiT: Post-training Quantization for Diffusion Transformers | NeurIPS 2024 | T2I | Link | Link |
| 2025 | Data-free Video Diffusion Transformers Quantization | N/A | T2V | Link | Link |
| 2025 | DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing | WACV 2025 | T2V | Link | Link |
| 2025 | Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs | N/A | T2T | Link | N/A |
| 2025 | SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration | ICLR 2025 | T2I, T2V | Link | Link |

#### Sparsity

| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2024 | DiTFastAttn: Attention Compression for Diffusion Transformer Models | NeurIPS 2024 | T2I, T2V | Link | Link |
| 2025 | Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity | ICML 2025 | T2V | Link | Link |
| 2025 | Radial Attention: Sparse Attention with Energy Decay for Long Video Generation | NeurIPS 2025 | T2V | Link | Link |
| 2025 | XAttention: Block Sparse Attention with Antidiagonal Scoring | ICML 2025 | T2T, T2V | Link | Link |
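
Block-sparse attention tiles the sequence into blocks, cheaply scores which key blocks each query block needs, and runs full attention only inside the selected pairs. An illustrative sketch (the block size, top-k, pooled scoring proxy, and the lack of causal masking are simplifications, not any listed paper's method):

```python
import torch

def block_sparse_attention(q, k, v, block=16, topk=2):
    """q, k, v: (seq, dim); seq must be divisible by the block size."""
    seq, dim = q.shape
    nb = seq // block
    qb, kb, vb = (t.reshape(nb, block, dim) for t in (q, k, v))
    # Cheap proxy: score key blocks with pooled query/key dot products.
    score = qb.mean(dim=1) @ kb.mean(dim=1).T        # (nb, nb)
    keep = score.topk(topk, dim=-1).indices          # selected key blocks per query block
    out = torch.zeros_like(qb)
    for i in range(nb):                              # full attention inside selected pairs
        ks = kb[keep[i]].reshape(-1, dim)
        vs = vb[keep[i]].reshape(-1, dim)
        attn = torch.softmax(qb[i] @ ks.T / dim ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out.reshape(seq, dim)

q = k = v = torch.randn(64, 32)
print(block_sparse_attention(q, k, v).shape)
```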

#### Caching & Reuse

| Year | Title | Venue | Task | Paper | Code |
|---|---|---|---|---|---|
| 2025 | Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model | CVPR 2025 | T2V | Link | Link |
| 2025 | From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers | ICCV 2025 | T2V | Link | Link |
| 2025 | Adaptive Caching for Faster Video Generation with Diffusion Transformers | N/A | T2V | Link | Link |
| 2024 | DeepCache: Accelerating Diffusion Models for Free | CVPR 2024 | T2I | Link | Link |
| 2024 | Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching | NeurIPS 2024 | T2I | Link | Link |
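
Cache-and-reuse methods exploit the similarity of intermediate features across adjacent diffusion timesteps: recompute a block only every few steps and reuse its output in between. A DeepCache-flavored sketch with a fixed refresh interval (the listed papers make the schedule adaptive or learned; the interval and toy block are assumptions):

```python
import torch

class CachedBlock(torch.nn.Module):
    def __init__(self, block, refresh=4):
        super().__init__()
        self.block, self.refresh = block, refresh
        self.step, self.cache = 0, None

    def forward(self, x):
        if self.cache is None or self.step % self.refresh == 0:
            self.cache = self.block(x)     # full recompute on refresh steps
        self.step += 1
        return self.cache                  # reused output between refreshes

block = CachedBlock(torch.nn.Linear(32, 32), refresh=4)
xs = [torch.randn(1, 32) for _ in range(8)]
outs = [block(x) for x in xs]              # only 2 real forward passes for 8 steps
print(torch.equal(outs[0], outs[3]), torch.equal(outs[3], outs[4]))
```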
