Feature: Optimize dngvd_op with CPU/GPU Branching Based on nstart #5919

jieli-matrix · 2025-02-21T11:05:39Z

Currently, dngvd_op in module_hsolver/kernels/rocm/dngvd_op.hip.cu uses the ROCm implementation for all input sizes (nstart). Performance analysis shows that the CPU implementation (dngvd_op.cpp) is faster for smaller nstart values.
This PR proposes adding a conditional branch within the dngvd_op<double, base_device::DEVICE_GPU>::operator() function to select the optimal implementation based on nstart:

If nstart > 234, use the existing ROCm implementation.
If nstart <= 234, call the CPU implementation (dngvd_op<double, base_device::DEVICE_CPU>).

This change requires:

Adding an if (nstart > 234) { ... } else { ... } block within the GPU operator().
Inside the else block, calling the CPU implementation with appropriate type casts.

This optimization is expected to improve performance, especially for smaller matrix sizes (nstart <= 234), where the CPU implementation is faster.
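
The branching described above can be sketched in plain C++ as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `Backend`, `solve_on_cpu`, `solve_on_gpu`, `kCrossover`, and `dispatch_dngvd` are hypothetical stand-ins for the real `dngvd_op` specializations, and 234 is simply the threshold quoted in this description.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two dngvd_op specializations. The real
// kernels take a device context and matrix pointers, which are omitted here.
enum class Backend { CPU, GPU };

// Crossover threshold quoted in the PR description (assumed to correspond
// to N_DCU in the diff).
constexpr int kCrossover = 234;

Backend solve_on_cpu(int /*nstart*/) { return Backend::CPU; }
Backend solve_on_gpu(int /*nstart*/) { return Backend::GPU; }

// Choose the backend from the problem size: large problems stay on the GPU,
// small ones fall back to the CPU path.
Backend dispatch_dngvd(int nstart) {
    assert(nstart > 0);
    if (nstart > kCrossover) {
        return solve_on_gpu(nstart);
    }
    return solve_on_cpu(nstart);
}
```

With this sketch, `dispatch_dngvd(234)` takes the CPU path and `dispatch_dngvd(235)` takes the GPU path, matching the `>` / `<=` split listed above.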

dyzheng · 2025-02-25T05:13:39Z

Please fix the compilation error.

source/module_hsolver/hsolver_pw.h

source/module_hsolver/hsolver_pw.cpp

source/module_hsolver/hsolver_pw.h

dyzheng · 2025-02-25T05:38:01Z

source/module_hsolver/kernels/rocm/dngvd_op.hip.cu

+    // copied from ../cuda/dngvd_op.cu, "dngvd_op"
+    assert(nstart == ldh);
+
+    if (nstart > N_DCU){


Please add a comment noting that this kernel has a crossover point between the CPU and DCU performance curves.

dyzheng · 2025-02-25T05:38:31Z

To be honest, N_DCU has been tested only for the "complex" kernel, so it may not be a good crossover point for both the "double" and "complex" kernels.
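
One possible way to address this concern is to keep a separate crossover value per scalar type rather than a single N_DCU. The following is only a sketch: the `CrossoverThreshold` trait and the `use_gpu` helper are hypothetical names, 234 is the threshold quoted in the PR description (assumed here to be the value of N_DCU), and the entry for `double` is a placeholder that reuses that value until the double kernel is benchmarked.

```cpp
#include <cassert>
#include <complex>

// Hypothetical trait: one CPU/DCU crossover size per scalar type.
// The primary template is left undefined so that a type without a
// measured threshold fails to compile instead of silently using a default.
template <typename T>
struct CrossoverThreshold;

// 234 is the value quoted in the PR description; according to the review
// it was tested with the complex kernel only.
template <>
struct CrossoverThreshold<std::complex<double>> {
    static constexpr int value = 234;
};

// Placeholder: reuses the complex value until the double kernel is measured.
template <>
struct CrossoverThreshold<double> {
    static constexpr int value = 234;
};

// Returns true when the problem is large enough to stay on the GPU/DCU.
template <typename T>
bool use_gpu(int nstart) {
    assert(nstart > 0);
    return nstart > CrossoverThreshold<T>::value;
}
```

Keeping the threshold in a per-type trait means a later benchmark of the double kernel only has to change one constant, without touching the branching logic.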

jieli-matrix and others added 7 commits February 12, 2025 15:11

add k continuity in hsolver

7b77479

fix FFT call

98fec60

fix device

e330b57

fix device for cpu & gpu

2156ecc

fix on gpu

b029c4e

Fix: do dngvd on DCU rather than CPU

87568f0

dcu optimize

a871050