Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores via the WMMA API and MMA PTX instructions.
Updated Sep 8, 2024 · Cuda
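As a sketch of the WMMA approach these HGEMM repositories build on: a minimal, unoptimized kernel in which each warp computes one 16x16 tile of C, loading fragments straight from global memory. It assumes row-major inputs with M, N, K divisible by 16 and a one-warp-per-block launch of `dim3(N/16, M/16)`; the kernel name and layout choices are illustrative, not taken from any repo above.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Naive tensor-core HGEMM: C (float, MxN) = A (half, MxK) * B (half, KxN).
// One warp per block; each warp owns one 16x16 output tile.
__global__ void hgemm_wmma_naive(const half *A, const half *B, float *C,
                                 int M, int N, int K) {
    int tileM = blockIdx.y;  // tile row index into C
    int tileN = blockIdx.x;  // tile column index into C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);  // accumulate in FP32

    for (int k = 0; k < K; k += 16) {
        // Load 16x16 tiles of A and B; the trailing argument is the
        // leading dimension (row stride) of each source matrix.
        wmma::load_matrix_sync(aFrag, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + tileN * 16, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // tensor-core MAC
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, cFrag,
                            N, wmma::mem_row_major);
}
```

The optimized variants in these repos layer shared-memory tiling, double buffering, and swizzled layouts on top of this same fragment API.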
CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
Benchmarks of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Multiple GEMM operators built with CUTLASS to support LLM inference.
FastCuda is a handwritten CUDA operator library featuring progressive GEMM and Reduce kernels, cuBLAS benchmarking, and C/C++/Python interfaces for learning, profiling, and performance optimization.
Uses tensor cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
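The MMA PTX path mentioned here drops below the WMMA abstraction and issues the tensor-core instruction directly via inline assembly, giving full control over register layout. A minimal sketch of the m16n8k16 FP16 variant with FP32 accumulation; the wrapper function name and parameter names are illustrative, and filling the operand registers (typically via `ldmatrix`) is left out:

```cuda
#include <cstdint>

// One warp-wide tensor-core multiply-accumulate: C += A * B for a
// 16x8 output tile, FP16 inputs, FP32 accumulators. The a/b words must
// already hold packed half2 values in the m16n8k16 register layout the
// PTX ISA prescribes; this wrapper only shows the instruction itself.
__device__ void mma_m16n8k16_f16f32(float c[4], const uint32_t a[4],
                                    const uint32_t b[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]));
}
```

Back-to-back HGEMM kernels chain two such multiplies, keeping the intermediate tile in registers or shared memory instead of writing it to global memory between the GEMMs.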
Code for DTC-SpMM (ASPLOS '24).
CUDA kernel optimization lab for GEMM, FlashAttention, quantization, and GPU performance learning.
CUDA kernels for LLM inference: FlashAttention forward, Tensor Core GEMM, PyTorch bindings, and benchmarkable reference implementations.
The lab assignments from CS4302 Parallel and Distributed Programming (2022 Fall) with my solutions
A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. The project blends performance engineering with cinematic storytelling, featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU's inner workings frame by frame.
Systematic CUDA kernel engineering from SGEMM fundamentals to reusable kernels, advanced optimization experiments, and lightweight inference components. https://lessup.github.io/cuda-kernel-academy/
Header-only C++/CUDA AI kernel library with OpenSpec-driven workflow and readable implementations of GEMM, attention, normalization, convolution, sparse ops, and Python bindings.
🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.
CUDA-native C++ Transformer inference engine with W8A16 quantization, KV cache management, and hand-tuned kernels.