I want to implement acceleration techniques for Large Language Models (LLMs) myself, because I enjoy the challenge of bringing research papers into real-world applications.
If there are any technologies you'd like to develop or discuss, feel free to reach out. Thanks!
I'm excited to dive deeper into AI research!
- 2024/12/16: Add the Medusa-1 Training Script v2
- 2024/12/15: Add the Medusa-1 Training Script
- 2024/12/12: Update the KV Cache support for Speculative Decoding
- 2024/12/04: Add the Kangaroo Training Script v2
- 2024/11/26: Add the Kangaroo Training Script
- 2024/11/22: Update the Target Model Keep Generation Mechanism experiment
- 2024/11/18: Update the Self-Speculative Decoding experiment results of google--gemma-2-9b-it.
- 2024/11/12: Reviewing implementation challenges for Self-Speculative Decoding and evaluating model compatibility for improved efficiency.
- 2024/11/10: Initial setup for Self-Speculative Decoding completed; data pipeline in place for testing draft-and-verify.
- 2024/11/08: Speculative Decoding successfully implemented. Verified improved inference time with no noticeable accuracy degradation.
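As context for the entries above, the core draft-and-verify loop of greedy speculative decoding can be sketched as follows (toy stand-in models; every name here is illustrative, not this repo's actual API):

```python
def speculative_decode(target_step, draft_step, prefix, gamma=4, max_new_tokens=8):
    """Greedy speculative decoding sketch: the cheap draft model proposes
    `gamma` tokens, the target model checks them, and we keep the longest
    agreeing prefix plus one token from the target (the correction, or a
    bonus token when every draft token was accepted)."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:  # may overshoot by <= gamma
        # 1) Draft: autoregressively propose gamma candidate tokens.
        draft = []
        for _ in range(gamma):
            draft.append(draft_step(tokens + draft))
        # 2) Verify: in a real model this is one parallel forward pass over
        #    all draft positions; here it is emulated with a prefix loop.
        accepted = []
        for i, tok in enumerate(draft):
            if target_step(tokens + draft[:i]) == tok:
                accepted.append(tok)
            else:
                break
        # 3) Commit the accepted tokens, then one token from the target.
        tokens += accepted
        tokens.append(target_step(tokens))
    return tokens

# Toy deterministic "models": each maps a token sequence to the next token.
target_step = lambda toks: (toks[-1] + 1) % 10                          # counts 0,1,2,...
draft_step = lambda toks: 9 if toks[-1] == 2 else (toks[-1] + 1) % 10   # wrong after a 2

out = speculative_decode(target_step, draft_step, [0], gamma=4, max_new_tokens=6)
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8]: the draft's mistake at position 3 is
# rejected, the target corrects it, and later drafts are accepted in bulk.
```

With rejection sampling in place of the greedy equality check, the same loop reproduces the target model's output distribution exactly, which is what makes the speedup lossless.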
- Batched Speculative Decoding: Timeline to be determined.
- Prompt lookup decoding: Determine timeline after reviewing initial implementations.
- UAG Integration: Assess when to integrate after Medusa and Kangaroo are in place.
- 2024/11/08 | Complete Speculative Decoding following the paper Fast Inference from Transformers via Speculative Decoding
- 2024/11/15 | Implement Self-Speculative Decoding as per Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
    - LayerSkip model architecture
    - Bayesian Optimization for Layer Skip Selection (AR)
    - Adaptive Draft-Exiting Mechanism
    - Optimization
    - Bayesian Optimization for Layer Skip Selection (Speed)
    - gemma-2-9b-it experiment
 
- 2024/11/22 | Develop Kangaroo following Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
    - Kangaroo model
    - Training Script
    - Implement double early exits to improve speed.
 
- 2024/11/29 | Implement Medusa from Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
    - Medusa model
    - Training Script (Medusa-1)
    - Testing
 
- 2025/03 | Implement Hydra from Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
- 2025/03 | Implement Lookahead Decoding from Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- 2025/04 | Implement Eagle from EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- 2025/04 | Implement Eagle-2 from EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- 2025/04 | Implement Eagle-3 from EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
- TBD | Implement Batched Speculative Decoding from The Synergy of Speculative Decoding and Batching in Serving Large Language Models
- TBD | Implement prompt lookup decoding from the prompt-lookup-decoding GitHub repository
- TBD | Implement UAG (Universal Assisted Generation) from the Universal Assisted Generation blog post
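Of the backlog items above, prompt lookup decoding is the simplest to sketch: instead of a separate draft model, draft tokens are copied from an earlier occurrence of the current n-gram suffix in the context. A minimal sketch (function and parameter names are mine, not the referenced repo's):

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the last `ngram_size` tokens against
    earlier positions in the context; the tokens that followed the most
    recent earlier match become the draft, to be verified by the target model."""
    if len(tokens) < ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan earlier start positions, most recent first (excluding the suffix itself).
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow = tokens[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow
    return []  # no match: fall back to plain autoregressive decoding

# The context repeats "1 2 3", so the lookup proposes what followed it last time.
print(prompt_lookup_draft([1, 2, 3, 4, 5, 1, 2, 3]))  # -> [4, 5, 1, 2, 3]
```

This pays off most in input-grounded tasks (summarization, code editing) where the output copies long spans from the prompt, which is the motivation given in the prompt-lookup-decoding repository.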