feat: modify group-gemm stage number (#497)

jeejeelee · web-flow · commit 52dab1d4a4d7 · 2024-09-12T21:26:50.000-07:00
The current group-gemm configuration raises the following error on the NVIDIA 3090:

```shell
RuntimeError: cutlass group_gemm.initialize failed: Error Internal
```

Change the number of group-gemm stages from 8 to 4. This reduces the amount of dynamic shared memory (smem) the kernel requests, so that it can run on GPUs like the 3090.

I also ran a simple comparison on the A800: setting the stage count to 4 still slightly improves group-gemm performance.

Reference: https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/gemm_grouped_sm80.cu
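To illustrate why the stage count matters, here is a rough back-of-the-envelope sketch, not the exact accounting CUTLASS performs. It assumes a 128x128x32 threadblock tile and 2-byte elements, neither of which is shown in the hunk below, and it uses the common approximation that a multistage mainloop buffers one A tile and one B tile per stage, ignoring padding and epilogue storage. The per-block shared memory limits (99 KB for compute capability 8.6, 163 KB for 8.0) are taken from the CUDA programming guide; all helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Rough estimate of mainloop shared memory for a multistage GEMM:
// each stage buffers one A tile (M x K) and one B tile (K x N).
// Approximation only: ignores padding and epilogue storage.
constexpr std::size_t EstimateMainloopSmemBytes(std::size_t tile_m, std::size_t tile_n,
                                                std::size_t tile_k, std::size_t elem_bytes,
                                                std::size_t stages) {
  return stages * (tile_m + tile_n) * tile_k * elem_bytes;
}

// True if the estimated request fits within the per-block limit.
constexpr bool FitsInSmem(std::size_t bytes, std::size_t limit) { return bytes <= limit; }

// Maximum shared memory per thread block (CUDA programming guide).
constexpr std::size_t kSm86MaxSmemPerBlock = 99 * 1024;   // e.g. RTX 3090
constexpr std::size_t kSm80MaxSmemPerBlock = 163 * 1024;  // e.g. A100 / A800

// ASSUMED configuration, for illustration only (not taken from the diff).
constexpr std::size_t kTileM = 128;
constexpr std::size_t kTileN = 128;
constexpr std::size_t kTileK = 32;
constexpr std::size_t kElemBytes = 2;  // half precision

constexpr std::size_t kSmemStages8 =
    EstimateMainloopSmemBytes(kTileM, kTileN, kTileK, kElemBytes, 8);  // 128 KiB
constexpr std::size_t kSmemStages4 =
    EstimateMainloopSmemBytes(kTileM, kTileN, kTileK, kElemBytes, 4);  // 64 KiB

// Under these assumptions, 8 stages exceed the sm_86 limit but fit on sm_80,
// while 4 stages fit on both.
static_assert(!FitsInSmem(kSmemStages8, kSm86MaxSmemPerBlock), "8 stages: too large for sm_86");
static_assert(FitsInSmem(kSmemStages8, kSm80MaxSmemPerBlock), "8 stages: fits on sm_80");
static_assert(FitsInSmem(kSmemStages4, kSm86MaxSmemPerBlock), "4 stages: fits on sm_86");
```

If the real tile shape or element type differs, the byte counts change, but the estimate still scales linearly with the stage count.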
diff --git a/include/flashinfer/group_gemm/wrapper.cuh b/include/flashinfer/group_gemm/wrapper.cuh
@@ -85,7 +85,7 @@ cudaError_t CutlassSegmentGEMMWrapper(CutlassSegmentGEMMHandler* handler, DType*
         cutlass::gemm::GemmShape<16, 8, 16>,     // Instruction Shape
         cutlass::epilogue::thread::LinearCombination<DType, 8, float, float>,  // Epilogue
         cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle,     // Swizzling Operator
-        8                                                                      // Stages
+        4                                                                      // Stages
         >::GemmKernel;
 
     using EpilogueOutputOp = typename GemmKernel::Epilogue::OutputOp;