A curated list for Efficient Large Language Models
14-stage Fusion Pipeline for LLM token compression: reversible compression, AST-aware code analysis, and intelligent content routing. Zero LLM inference cost. MIT licensed.
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
[ICML 2024] Pruner-Zero: Evolving Symbolic Pruning Metric from Scratch for LLMs
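The core idea behind metrics like the one Pruner-Zero searches for can be sketched in a few lines. The metric below (|W| x |grad|) is one illustrative candidate from the weight/gradient expression space, chosen here for simplicity; Pruner-Zero evolves such expressions automatically rather than fixing one by hand.

```python
import numpy as np

def prune_by_metric(weights, grads, sparsity=0.5):
    """Zero out the lowest-scoring weights under a symbolic pruning metric.

    The metric |W| * |grad| is an illustrative example, not the paper's
    evolved formula.
    """
    score = np.abs(weights) * np.abs(grads)
    k = int(score.size * sparsity)
    threshold = np.partition(score.ravel(), k)[k]  # k-th smallest score
    mask = score >= threshold                      # keep high-scoring weights
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
G = rng.normal(size=(4, 4))
pruned, mask = prune_by_metric(W, G, sparsity=0.5)
```

At 50% sparsity, half of the 16 weights are zeroed; the surviving weights are untouched.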
D^2-MoE: Delta Decompression for MoE-based LLMs Compression
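Delta decompression exploits the fact that experts in an MoE layer are often similar: each expert can be stored as a shared base matrix plus a low-rank delta. A minimal sketch with plain SVD (the actual D^2-MoE procedure is more involved):

```python
import numpy as np

def compress_experts(experts, rank=4):
    """Store a shared base plus a rank-truncated SVD of each expert's delta.

    Simplified sketch: W_i ~= base + U_i S_i V_i^T, with base the mean
    of all expert weights.
    """
    base = np.mean(experts, axis=0)
    factors = []
    for W in experts:
        U, S, Vt = np.linalg.svd(W - base, full_matrices=False)
        factors.append((U[:, :rank] * S[:rank], Vt[:rank]))  # (U*S, V^T)
    return base, factors

def decompress(base, factors, i):
    US, Vt = factors[i]
    return base + US @ Vt

# toy demo: 4 similar 8x8 experts built from a shared matrix plus noise
rng = np.random.default_rng(1)
shared = rng.normal(size=(8, 8))
experts = np.stack([shared + 0.1 * rng.normal(size=(8, 8)) for _ in range(4)])
base, factors = compress_experts(experts, rank=8)  # full rank: lossless
W0 = decompress(base, factors, 0)
```

With a truncated rank, storage drops from one full matrix per expert to one shared matrix plus two thin factors per expert, at the cost of a small reconstruction error.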
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The Truth Is Rarely Pure and Never Simple.
Papers on LLM compression.
LLM Inference on AWS Lambda
[CAAI AIR'24] Minimize Quantization Output Error with Bias Compensation
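The bias-compensation idea can be sketched as follows: after quantizing W to W_q, add a bias equal to the mean output error over a calibration set, which makes the compensated output unbiased on that set. This is a generic sketch of the concept, not the paper's exact procedure.

```python
import numpy as np

def quantize_sym(W, bits=4):
    # uniform symmetric quantization to the given bit-width
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale) * scale

def bias_compensation(W, W_q, X_calib):
    """Bias that cancels the mean output error on calibration inputs.

    With y = W x and y_q = W_q x + b, choosing b = E[(W - W_q) x]
    zeroes the average output error over the calibration set.
    """
    err = (W - W_q) @ X_calib.T   # per-sample output error, (out, samples)
    return err.mean(axis=1)       # average over calibration samples

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
X = rng.normal(size=(256, 32)) + 0.5   # calibration activations (nonzero mean)
W_q = quantize_sym(W, bits=4)
b = bias_compensation(W, W_q, X)
fixed = W_q @ X.T + b[:, None]
```

By construction, the mean residual of `fixed` against the full-precision output is zero over the calibration samples.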
QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. Supports Qwen, Olmo3, Llama, etc.
An implementation of MoDeGPT LLM compression from the ICLR 2025 paper Modular Decomposition for Large Language Model Compression.
Behavioral auditing & repair toolkit for LLMs. Measures 8 dimensions via confidence probes.
AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026): 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.
Near-optimal vector quantization for LLM KV cache compression. Python implementation of TurboQuant (ICLR 2026): PolarQuant + QJL for 3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
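As a baseline for what low-bit KV cache quantization looks like, here is a generic per-channel 3-bit uniform quantizer (not TurboQuant's PolarQuant/QJL transforms, which achieve far better accuracy at the same bit-width):

```python
import numpy as np

def quantize_kv_3bit(kv):
    """Per-channel asymmetric 3-bit quantization of a KV-cache tensor.

    Generic uniform-quantization sketch: each channel is mapped to
    8 levels (2^3) between its min and max.
    kv: (tokens, channels) float array.
    """
    lo = kv.min(axis=0)
    hi = kv.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 7.0, 1.0)  # 7 gaps between 8 levels
    q = np.clip(np.round((kv - lo) / scale), 0, 7).astype(np.uint8)
    return q, lo, scale

def dequantize_kv(q, lo, scale):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv = rng.normal(size=(64, 16)).astype(np.float32)
q, lo, scale = quantize_kv_3bit(kv)
rec = dequantize_kv(q, lo, scale)
```

Storing 3 bits per value instead of 16 gives roughly a 5.3x raw reduction before packing overheads; the per-channel `lo`/`scale` metadata is amortized across all tokens. The reconstruction error of each value is bounded by half the channel's step size.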
[ICLR 2026] When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models. Supports interpretation of Qwen, Llama, etc.
A standard PyTorch implementation of Google DeepMind's paper Language Modeling Is Compression, with no reliance on Haiku or JAX. Drawing on the original repository (https://github.com/google-deepmind/language_modeling_is_compression), this code reproduces the key results from the paper.
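The paper's central observation is that an arithmetic coder driven by a language model's next-token distribution emits about -log2 p(token) bits per token, so the compressed size equals the model's cumulative cross-entropy. A minimal sketch with a toy fixed distribution standing in for the model:

```python
import numpy as np

def code_length_bits(probs_per_step, tokens):
    """Ideal arithmetic-coding length of a token sequence in bits.

    Each token costs -log2 p(token) under the model's predictive
    distribution at that position; a real arithmetic coder gets
    within a couple of bits of this total.
    """
    return float(sum(-np.log2(p[t]) for p, t in zip(probs_per_step, tokens)))

# toy "model": a fixed unigram distribution over a 4-symbol alphabet
p = np.array([0.7, 0.1, 0.1, 0.1])
seq = [0, 0, 1, 0, 3]
bits = code_length_bits([p] * len(seq), seq)
```

Because the model assigns high probability to the frequent symbol, the total is below the 2 bits/symbol a uniform code would need; a better predictor compresses better, which is the paper's point.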
Token Price Estimation for LLMs
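Token price estimation reduces to multiplying prompt and completion token counts by the provider's per-direction rates. A minimal sketch; the rates below are placeholders, not any provider's actual pricing:

```python
def estimate_cost(prompt_tokens, completion_tokens, in_rate, out_rate):
    """Estimate a request's cost in USD from token counts.

    in_rate / out_rate are USD per million tokens (placeholder values
    in the example below, not real pricing).
    """
    return (prompt_tokens * in_rate + completion_tokens * out_rate) / 1_000_000

cost = estimate_cost(1200, 300, in_rate=3.0, out_rate=15.0)
```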
Research code for LLM Compression using Functional Algorithms, exploring stratified manifold learning, clustering, and compression techniques. Experiments span synthetic datasets (Swiss Roll, Manifold Singularities) and real-world text embeddings (DBpedia-14). The goal is to preserve semantic structure while reducing model complexity.
NYCU Edge AI Final Project Using SGLang