[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
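The core idea is easy to sketch: quantize Q and K to INT8 with a scale factor, run the score matmul in low precision, and dequantize before the softmax. Below is a minimal PyTorch emulation of per-tensor INT8 attention; it is illustrative only, since the repo itself ships fused CUDA kernels with finer-grained (e.g. per-block) quantization that this sketch does not reproduce.

```python
import torch

def quantize_int8(x):
    # Per-tensor symmetric quantization: scale so that max |x| maps to 127.
    scale = x.abs().amax() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def quantized_attention(q, k, v):
    # Quantize Q and K to INT8; the low-precision matmul is emulated in
    # float here because plain PyTorch exposes no INT8 tensor-core GEMM.
    q8, sq = quantize_int8(q)
    k8, sk = quantize_int8(k)
    scores = (q8.float() @ k8.float().transpose(-1, -2)) * (sq * sk)
    scores = scores / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v  # V kept in full precision

q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))  # (batch, heads, seq, dim)
out = quantized_attention(q, k, v)
```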
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud, or on AI hardware.
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.
Implementation of ICCV 2025 paper "Growing a Twig to Accelerate Large Vision-Language Models".
A mixed-precision GEMM with quantization and reordering kernels.
The official repo for “Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models”.
Code for paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices"
AURA: Augmented Representation for Unified Accuracy-aware Quantization
Convert and run scikit-learn MLPs on Rockchip NPU.
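Conversion out of scikit-learn starts by pulling the fitted weights off the estimator. The sketch below shows that step plus a hand-rolled forward pass; the toy data, layer size, and `forward` helper are assumptions for illustration, and the Rockchip-specific RKNN conversion step itself is not shown.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy model standing in for whatever MLP is being deployed.
X = np.random.rand(200, 4)
y = (X.sum(axis=1) > 2).astype(int)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

def forward(x, coefs, intercepts):
    # Re-implements the fitted network: ReLU hidden layers, linear output
    # (pre-sigmoid logits), using only the exported weight arrays.
    for W, b in zip(coefs[:-1], intercepts[:-1]):
        x = np.maximum(x @ W + b, 0.0)
    return x @ coefs[-1] + intercepts[-1]

# mlp.coefs_ / mlp.intercepts_ are the plain NumPy arrays a converter
# would serialize for the target runtime.
logits = forward(X, mlp.coefs_, mlp.intercepts_)
```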
Code for paper "Joint Adaptive Resolution Selection and Conditional Early Exiting for Efficient Video Recognition on Edge Devices"
Code for paper "Deep Reinforcement Learning based Multi-task Automated Channel Pruning for DNNs"
Modified inference engine for quantized convolution using product quantization
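Product quantization turns the dot products inside a convolution into per-subspace table lookups. A minimal NumPy sketch of that substitution follows, assuming scikit-learn's KMeans for codebook training; the dimensions and variable names are illustrative, not this engine's actual code.

```python
import numpy as np
from sklearn.cluster import KMeans

d, m, ks = 64, 8, 16            # vector dim, subspaces, codewords per subspace
sub = d // m
W = np.random.randn(1000, d)    # stand-in for flattened conv filters

# Learn one codebook per subspace and encode each row as m codeword ids.
codebooks, codes = [], []
for j in range(m):
    km = KMeans(n_clusters=ks, n_init=4, random_state=0).fit(W[:, j*sub:(j+1)*sub])
    codebooks.append(km.cluster_centers_)   # (ks, sub)
    codes.append(km.labels_)                # (1000,)
codes = np.stack(codes, axis=1)             # (1000, m)

# A dot product with x now costs m table lookups instead of d multiplies.
x = np.random.randn(d)
luts = [cb @ x[j*sub:(j+1)*sub] for j, cb in enumerate(codebooks)]
approx = sum(luts[j][codes[:, j]] for j in range(m))   # ≈ W @ x, shape (1000,)
```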
Project for the "Symbolic and Evolutionary Artificial Intelligence" course at the University of Pisa.