[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
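The core idea is easy to sketch: quantize Q and K to INT8 with a scale factor, run the score matmul in low precision, and dequantize before the softmax. Below is a minimal PyTorch emulation of per-tensor INT8 attention; it is illustrative only, since the repo itself ships fused CUDA kernels with finer-grained (e.g. per-block) quantization that this sketch does not reproduce.

```python
import torch

def quantize_int8(x):
    # Per-tensor symmetric quantization: scale so that max |x| maps to 127.
    scale = x.abs().amax() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def quantized_attention(q, k, v):
    # Quantize Q and K to INT8; the low-precision matmul is emulated in
    # float here because plain PyTorch exposes no INT8 tensor-core GEMM.
    q8, sq = quantize_int8(q)
    k8, sk = quantize_int8(k)
    scores = (q8.float() @ k8.float().transpose(-1, -2)) * (sq * sk)
    scores = scores / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v  # V kept in full precision

q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))  # (batch, heads, seq, dim)
out = quantized_attention(q, k, v)
```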
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud, or on AI hardware.
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.
Implementation of ICCV 2025 paper "Growing a Twig to Accelerate Large Vision-Language Models".
A mixed-precision GEMM with quantization and reordering kernels.
The official repo for “Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models”.
Code for paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices"
AURA: Augmented Representation for Unified Accuracy-aware Quantization
Convert and run scikit-learn MLPs on Rockchip NPU.
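Conversion out of scikit-learn starts by pulling the fitted weights off the estimator. The sketch below shows that step plus a hand-rolled forward pass; the toy data, layer size, and `forward` helper are assumptions for illustration, and the Rockchip-specific RKNN conversion step itself is not shown.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy model standing in for whatever MLP is being deployed.
X = np.random.rand(200, 4)
y = (X.sum(axis=1) > 2).astype(int)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X, y)

def forward(x, coefs, intercepts):
    # Re-implements the fitted network: ReLU hidden layers, linear output
    # (pre-sigmoid logits), using only the exported weight arrays.
    for W, b in zip(coefs[:-1], intercepts[:-1]):
        x = np.maximum(x @ W + b, 0.0)
    return x @ coefs[-1] + intercepts[-1]

# mlp.coefs_ / mlp.intercepts_ are the plain NumPy arrays a converter
# would serialize for the target runtime.
logits = forward(X, mlp.coefs_, mlp.intercepts_)
```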
Code for paper "Joint Adaptive Resolution Selection and Conditional Early Exiting for Efficient Video Recognition on Edge Devices"
Code for paper "Deep Reinforcement Learning based Multi-task Automated Channel Pruning for DNNs"
Modified inference engine for quantized convolution using product quantization
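Product quantization turns the dot products inside a convolution into per-subspace table lookups. A minimal NumPy sketch of that substitution follows, assuming scikit-learn's KMeans for codebook training; the dimensions and variable names are illustrative, not this engine's actual code.

```python
import numpy as np
from sklearn.cluster import KMeans

d, m, ks = 64, 8, 16            # vector dim, subspaces, codewords per subspace
sub = d // m
W = np.random.randn(1000, d)    # stand-in for flattened conv filters

# Learn one codebook per subspace and encode each row as m codeword ids.
codebooks, codes = [], []
for j in range(m):
    km = KMeans(n_clusters=ks, n_init=4, random_state=0).fit(W[:, j*sub:(j+1)*sub])
    codebooks.append(km.cluster_centers_)   # (ks, sub)
    codes.append(km.labels_)                # (1000,)
codes = np.stack(codes, axis=1)             # (1000, m)

# A dot product with x now costs m table lookups instead of d multiplies.
x = np.random.randn(d)
luts = [cb @ x[j*sub:(j+1)*sub] for j, cb in enumerate(codebooks)]
approx = sum(luts[j][codes[:, j]] for j in range(m))   # ≈ W @ x, shape (1000,)
```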
Project for the "Symbolic and Evolutionary Artificial Intelligence" course at the University of Pisa.