
[QST] which is optimised way to iterate over the conv2d filter tensor #2146

Open
IzanCatalan opened this issue Mar 4, 2025 · 0 comments

@IzanCatalan

What is your question?
Hi, I want to iterate over the elements of every channel of each filter in the filter tensor of a conv2d fprop operation and perform some operations on them.

According to https://github.com/NVIDIA/cutlass/blob/main/media/docs/implicit_gemm_convolution.md, the filter tensor is stored with shape (K, R, S, C), i.e. K filters of R x S x C elements.
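To make sure I am reading that layout correctly, this is how I compute a linear offset into the packed K x R x S x C tensor (a sketch of my own, with C as the fastest-varying dimension; krsc_offset is just a name I made up, not a CUTLASS function):

__host__ __device__ inline long int krsc_offset(
    long int k, long int r, long int s, long int c,
    long int R, long int S, long int C) {
  // Element (k, r, s, c) of a packed KRSC tensor lives at ((k*R + r)*S + s)*C + c
  return ((k * R + r) * S + s) * C + c;
}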

I want to sum every element of each channel of each filter. To do that, I need to iterate over each of the K filters and, within each of its C channels, gather that channel's R*S elements and add them all up.

As I aim to perform this inside the kernel (

void operator()(Params const &params, SharedStorage &shared_storage) {
) using the GPU threads, I wanted to parallelize this iteration.

From my point of view, the best approach is for each GPU thread (in my case, a Tesla V100 GPU with 128 threads) to add up one channel's elements. Each thread then computes a sum of R*S elements. So if there are K*C channels, I need K*C distinct threads so that no thread iterates over another thread's data. If there are not enough threads, one thread could process two or more channels. The code should be similar to the following:

        // Shape of the filter tensor
        long int K = params.problem_size.K;  // Number of filters
        long int C = params.problem_size.C;  // Channels per filter
        long int R = params.problem_size.R;  // Height of the filter
        long int S = params.problem_size.S;  // Width of the filter
        long int HW = R * S;                 // Number of elements per channel of a filter
        long int total_channels = K * C;

        // Each thread processes multiple channels if there are more channels than threads
        for (long int channel_idx = thread_idx; channel_idx < total_channels; channel_idx += total_threads) {

          // Compute (filter_id, channel_id) for the given channel index
          long int n = channel_idx / C;  // Filter index
          long int c = channel_idx % C;  // Channel index within the filter

          // Starting offset of this thread's channel within the tensor
          long int base_index = n * (C * HW) + c;

          // Iterate over the R*S elements of the channel
          float sum = 0.0f;
          for (long int i = 0; i < HW; i++) {
            // Successive elements of the same channel are stored C elements apart in memory
            sum += params.ptr_B[base_index + i * C];
          }
        }
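For clarity, here is the same idea written as a self-contained standalone kernel, just to show what I mean outside of the CUTLASS operator() (only a sketch: filter_sum_kernel and channel_sums are hypothetical names of mine, and I assume the filter data is float and packed in KRSC order):

#include <cuda_runtime.h>

// Sketch: each thread sums the R*S elements of one (filter, channel) pair.
// Assumes a packed KRSC layout, i.e. element (k, r, s, c) is at ((k*R + r)*S + s)*C + c.
__global__ void filter_sum_kernel(const float *filters, float *channel_sums,
                                  long int K, long int C, long int R, long int S) {
  long int total_channels = K * C;
  long int HW = R * S;
  long int total_threads = (long int)gridDim.x * blockDim.x;

  // Grid-stride loop over (filter, channel) pairs
  for (long int channel_idx = (long int)blockIdx.x * blockDim.x + threadIdx.x;
       channel_idx < total_channels;
       channel_idx += total_threads) {
    long int k = channel_idx / C;  // Filter index
    long int c = channel_idx % C;  // Channel index within the filter
    long int base_index = k * (HW * C) + c;

    float sum = 0.0f;
    for (long int i = 0; i < HW; i++) {
      // Successive (r, s) positions of the same channel are C elements apart
      sum += filters[base_index + i * C];
    }
    channel_sums[channel_idx] = sum;
  }
}

I would launch it with something like filter_sum_kernel<<<(K * C + 127) / 128, 128>>>(filters, channel_sums, K, C, R, S); consecutive threads then touch consecutive channels of the same filter, so the loads in the inner loop should be coalesced.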

Am I doing this properly, or am I iterating over the tensor in a suboptimal way that degrades GPU performance?

@IzanCatalan IzanCatalan changed the title [QST] The most optimised way to iterate over the conv2d filter tensor [QST] which is optimised way to iterate over the conv2d filter tensor Mar 5, 2025