📚 200+ Tensor/CUDA Cores kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (reaching 98%–100% of cuBLAS/FA2 TFLOPS 🎉🎉).
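For context on what WMMA-based HGEMM kernels like these build on, below is a minimal sketch of one output tile computed with the CUDA WMMA API. This is illustrative only, not the repository's actual kernels; it assumes row-major A, column-major B, and M, N, K that are multiples of 16.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A @ B.
// Assumes: A is row-major MxK, B is column-major KxN, C is row-major MxN,
// and M, N, K are multiples of 16. Requires compute capability 7.0+
// (compile with -arch=sm_70 or newer).
__global__ void wmma_hgemm(const half *A, const half *B, float *C,
                           int M, int N, int K) {
  // Tile coordinates: one warp per 16x16 output tile.
  int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
  int warpN = blockIdx.y * blockDim.y + threadIdx.y;
  if (warpM * 16 >= M || warpN * 16 >= N) return;

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
  wmma::fill_fragment(acc, 0.0f);

  // March along K, issuing one Tensor Core MMA per 16-wide slice.
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
    wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
  }
  wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                          wmma::mem_row_major);
}
```

Production kernels layer shared-memory staging, double buffering, and swizzled layouts on top of this basic pattern to approach cuBLAS throughput.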
🔥🔥🔥 A collection of awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, and High Performance Computing (HPC) projects.
Benchmarks the performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
GEMM-based and Winograd-based convolutions implemented with CUTLASS.
A study of CUTLASS.
Multiple GEMM operators built with CUTLASS to support LLM inference.
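To illustrate what constructing a GEMM operator with CUTLASS looks like, here is a minimal sketch using the device-level API with its canonical single-precision, column-major defaults. It is an assumed illustrative configuration, not this repository's actual operators; real LLM-inference kernels would typically use half_t inputs and arch::OpClassTensorOp to target Tensor Cores.

```cuda
#include <cutlass/gemm/device/gemm.h>

// Minimal CUTLASS device-level GEMM: D = alpha * A @ B + beta * C.
// All matrices are column-major; lda/ldb/ldc are leading dimensions.
cudaError_t cutlass_sgemm(int M, int N, int K,
                          float alpha, const float *A, int lda,
                          const float *B, int ldb,
                          float beta, float *C, int ldc) {
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using Gemm = cutlass::gemm::device::Gemm<float, ColumnMajor,   // A
                                           float, ColumnMajor,   // B
                                           float, ColumnMajor>;  // C

  Gemm gemm_op;
  // Arguments: problem size, TensorRefs for A/B/C, C again as the output D,
  // and the linear-combination epilogue parameters {alpha, beta}.
  Gemm::Arguments args({M, N, K},
                       {A, lda}, {B, ldb}, {C, ldc}, {C, ldc},
                       {alpha, beta});

  cutlass::Status status = gemm_op(args);
  return status == cutlass::Status::kSuccess ? cudaSuccess
                                             : cudaErrorUnknown;
}
```

The template parameters (element types, layouts, and optionally tile shapes and operator class) are how such projects stamp out a family of GEMM variants from one definition.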
A CUTLASS CuTe implementation of a head-dim-64 FlashAttention-2 TensorRT plugin for LightGlue. Runs on a Jetson Orin NX 8GB with TensorRT 8.5.2.
A PyTorch implementation of block-sparse operations.