What is your question?
Hi, I want to iterate over, and perform some operations on, the elements of every channel of each filter in the filter tensor of a conv2d fprop operation.
According to https://github.com/NVIDIA/cutlass/blob/main/media/docs/implicit_gemm_convolution.md, the filters are laid out inside the tensor in the shape K, R, S, C.
I want to sum every element of each channel of each filter: for each of the K filters and each of its C channels, I need to gather that channel's R*S elements and add them all together.
As I aim to perform this inside the kernel (cutlass/include/cutlass/conv/kernel/implicit_gemm_convolution.h, line 281 at commit 24f991e), I want to parallelize this iteration across the GPU's threads.
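To make the layout I am assuming explicit: with KRSC ordering, element (k, r, s, c) lives at linear offset k*R*S*C + (r*S + s)*C + c. A small helper (the name filter_offset is only for illustration, based on my reading of the docs) would be:

// Linear offset of element (k, r, s, c) in a KRSC filter tensor.
// Helper name and signature are illustrative only, not CUTLASS API.
long int filter_offset(long int k, long int r, long int s, long int c,
                       long int R, long int S, long int C) {
  return k * (R * S * C) + (r * S + s) * C + c;
}
// The R*S elements of one channel keep k and c fixed, so consecutive (r, s)
// positions are exactly C elements apart in memory.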
From my point of view, the best way is for each GPU thread (in my case, a Tesla V100 with 128 threads) to add up one channel's elements. Each thread then processes an addition of R*S elements, so with K*C channels I need K*C different threads so that no thread iterates over another thread's data. If there are not enough threads, one thread could process two channels. The code should be similar to the following:
// Shape of the filter tensor (KRSC layout)
long int K = params.problem_size.K; // Number of filters
long int C = params.problem_size.C; // Channels per filter
long int R = params.problem_size.R; // Height of the filter
long int S = params.problem_size.S; // Width of the filter
long int HW = R * S;                // Number of elements per channel of a filter
long int total_channels = K * C;

// Each thread processes multiple channels if there are more channels than threads
for (long int channel_idx = thread_idx; channel_idx < total_channels; channel_idx += total_threads) {
  // Decompose the linear channel index into (filter, channel)
  long int n = channel_idx / C; // Filter index
  long int c = channel_idx % C; // Channel index within the filter
  // Starting offset of this channel within the tensor
  long int base_index = n * (C * HW) + c;
  // Iterate over the R*S elements of the channel
  float sum = 0.0f;
  for (long int i = 0; i < HW; i++) {
    // Consecutive elements of a channel are C apart, so the i-th element is at offset i * C
    sum += params.ptr_B[base_index + i * C];
  }
  // ... use the per-channel sum here ...
}
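For context, outside of CUTLASS the same idea as a standalone CUDA kernel would look roughly like the sketch below; the kernel name, the raw float* filter pointer, the sums output array, and the 128-threads-per-block launch are my own assumptions for illustration, not CUTLASS API:

// Minimal standalone sketch: one sum per (filter, channel) pair of a KRSC filter tensor.
__global__ void channel_sums(const float* filter, float* sums,
                             long int K, long int C, long int R, long int S) {
  long int thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
  long int total_threads = (long int)gridDim.x * blockDim.x;
  long int HW = R * S;
  long int total_channels = K * C;
  // Grid-stride loop: each thread handles several channels if needed
  for (long int channel_idx = thread_idx; channel_idx < total_channels;
       channel_idx += total_threads) {
    long int n = channel_idx / C; // Filter index
    long int c = channel_idx % C; // Channel index within the filter
    long int base_index = n * (C * HW) + c;
    float sum = 0.0f;
    for (long int i = 0; i < HW; i++) {
      sum += filter[base_index + i * C];
    }
    sums[channel_idx] = sum;
  }
}
// Example launch (assumed configuration): channel_sums<<<num_blocks, 128>>>(d_filter, d_sums, K, C, R, S);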
Am I doing this properly, or am I iterating over the tensor in a suboptimal way that degrades GPU performance?