
[QST] which is optimised way to iterate over the conv2d filter tensor #2146

Open
IzanCatalan opened this issue Mar 4, 2025 · 0 comments

@IzanCatalan

What is your question?
Hi, I want to iterate over the elements of every channel of each filter in the filter tensor of a conv2d fprop operation and perform some operations on them.

According to https://github.com/NVIDIA/cutlass/blob/main/media/docs/implicit_gemm_convolution.md, the filter tensor is stored with shape (K, R, S, C), i.e. K filters of R x S x C elements.
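To make sure I am reading that layout correctly, this is how I compute a linear offset into the packed K x R x S x C tensor (a sketch of my own, with C as the fastest-varying dimension; krsc_offset is just a name I made up, not a CUTLASS function):

__host__ __device__ inline long int krsc_offset(
    long int k, long int r, long int s, long int c,
    long int R, long int S, long int C) {
  // Element (k, r, s, c) of a packed KRSC tensor lives at ((k*R + r)*S + s)*C + c
  return ((k * R + r) * S + s) * C + c;
}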

I want to sum every element of each channel of each filter. To do that, I need to iterate over each of the K filters and, within each of its C channels, gather that channel's R*S elements and add them all up.

As I aim to perform this inside the kernel (

void operator()(Params const &params, SharedStorage &shared_storage) {
) using the GPU threads, I wanted to parallelize this iteration.

From my point of view, the best approach is for each GPU thread (in my case, a Tesla V100 GPU with 128 threads) to add up one channel's elements. Each thread then computes a sum of R*S elements. So if there are K*C channels, I need K*C distinct threads so that no thread iterates over another thread's data. If there are not enough threads, one thread could process two or more channels. The code should be similar to the following:

        // Shape of the filter tensor
        long int K = params.problem_size.K;  // Number of filters
        long int C = params.problem_size.C;  // Channels per filter
        long int R = params.problem_size.R;  // Height of the filter
        long int S = params.problem_size.S;  // Width of the filter
        long int HW = R * S;                 // Number of elements per channel of a filter
        long int total_channels = K * C;

        // Each thread processes multiple channels if there are more channels than threads
        for (long int channel_idx = thread_idx; channel_idx < total_channels; channel_idx += total_threads) {

          // Compute (filter_id, channel_id) for the given channel index
          long int n = channel_idx / C;  // Filter index
          long int c = channel_idx % C;  // Channel index within the filter

          // Starting offset of this thread's channel within the tensor
          long int base_index = n * (C * HW) + c;

          // Iterate over the R*S elements of the channel
          float sum = 0.0f;
          for (long int i = 0; i < HW; i++) {
            // Successive elements of the same channel are stored C elements apart in memory
            sum += params.ptr_B[base_index + i * C];
          }
        }
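For clarity, here is the same idea written as a self-contained standalone kernel, just to show what I mean outside of the CUTLASS operator() (only a sketch: filter_sum_kernel and channel_sums are hypothetical names of mine, and I assume the filter data is float and packed in KRSC order):

#include <cuda_runtime.h>

// Sketch: each thread sums the R*S elements of one (filter, channel) pair.
// Assumes a packed KRSC layout, i.e. element (k, r, s, c) is at ((k*R + r)*S + s)*C + c.
__global__ void filter_sum_kernel(const float *filters, float *channel_sums,
                                  long int K, long int C, long int R, long int S) {
  long int total_channels = K * C;
  long int HW = R * S;
  long int total_threads = (long int)gridDim.x * blockDim.x;

  // Grid-stride loop over (filter, channel) pairs
  for (long int channel_idx = (long int)blockIdx.x * blockDim.x + threadIdx.x;
       channel_idx < total_channels;
       channel_idx += total_threads) {
    long int k = channel_idx / C;  // Filter index
    long int c = channel_idx % C;  // Channel index within the filter
    long int base_index = k * (HW * C) + c;

    float sum = 0.0f;
    for (long int i = 0; i < HW; i++) {
      // Successive (r, s) positions of the same channel are C elements apart
      sum += filters[base_index + i * C];
    }
    channel_sums[channel_idx] = sum;
  }
}

I would launch it with something like filter_sum_kernel<<<(K * C + 127) / 128, 128>>>(filters, channel_sums, K, C, R, S); consecutive threads then touch consecutive channels of the same filter, so the loads in the inner loop should be coalesced.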

Am I doing this properly, or am I iterating over the tensor in a suboptimal way that degrades GPU performance?

@IzanCatalan IzanCatalan changed the title [QST] The most optimised way to iterate over the conv2d filter tensor [QST] which is optimised way to iterate over the conv2d filter tensor Mar 5, 2025