Use shared/local memory in FD stencil kernels #1746
Comments
We can use the same high-level design as the shared memory for the spectral element kernels.
I still don't understand this. Here are some questions:
This is my current understanding, at a high level. Suppose we want to compute …
Now, a more complex case, …
Now, if we had an extruded field, we would allocate one block per column, with as many threads as there are levels, and repeat the above. Is my understanding correct?
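If that is right, the index mapping inside the kernel would look roughly like the minimal CUDA.jl sketch below (hypothetical names, not the actual ClimaCore kernel): each block owns one column of one horizontal element, and each thread owns one vertical level.

```julia
using CUDA

# Minimal sketch (hypothetical, not the ClimaCore kernel): with
# blocks = (Nh, Nvblocks, Nq * Nq) and threads = (Nvthreads,), each block owns one
# column of one element and each thread owns one vertical level.
function column_index_sketch!(out)
    h    = blockIdx().x                               # horizontal element
    vblk = blockIdx().y                               # chunk of the column (usually just 1)
    ij   = blockIdx().z                               # horizontal node within the element
    lvl  = threadIdx().x + (vblk - 1) * blockDim().x  # vertical level for this thread
    if lvl <= size(out, 1)
        out[lvl, ij, h] = lvl                         # placeholder work at (level, node, element)
    end
    return nothing
end
```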
The shared memory is allocated once per `StencilBroadcasted` object.
We are recursing over the broadcasted object (it is a recursive object). We fill the shared memory with the operator's argument (for each `StencilBroadcasted`).
The result of which operation? The outermost layer of the expression, …
Correct
Correct
No, only the argument to the operator.
Put differently, if we ask: "Where is the result of the operation stored in …?" …
Most of these questions are really about how broadcasting and …
This case is exactly the same as the previous one.
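To ground this, here is a plain-Julia illustration (no ClimaCore; `div_op` is a placeholder unary function and the field arguments are plain vectors) of the lazy expression tree that broadcasting builds. In ClimaCore, the stencil operators appear as `StencilBroadcasted` nodes in the same kind of tree, and it is each such node that owns a shmem allocation.

```julia
using Base.Broadcast: broadcasted

# Plain Julia, no ClimaCore: `div_op` is a placeholder and a, b, c are plain vectors,
# just to inspect the lazy expression tree that broadcasting builds.
div_op(x) = x
a, b, c = rand(4), rand(4), rand(4)

bc = broadcasted(+, broadcasted(div_op, a), broadcasted(div_op, b), broadcasted(div_op, c))

bc.f              # + : the outermost layer of the expression
length(bc.args)   # 3 : one nested Broadcasted per div_op(...) argument
bc.args[1].f      # div_op : in ClimaCore this inner node would be a StencilBroadcasted,
                  # and it (not the outer +) is what gets a shmem allocation
```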
I'm really confused about what you're asking. Let's look at the code:

```julia
@inline function fd_stencil_partition(
    us::DataLayouts.UniversalSize,
    n_face_levels::Integer,
    n_max_threads::Integer = 256;
)
    (Nq, _, _, Nv, Nh) = DataLayouts.universal_size(us)
    Nvthreads = n_face_levels
    @assert Nvthreads <= maximum_allowable_threads()[1] "Number of vertical face levels cannot exceed $(maximum_allowable_threads()[1])"
    Nvblocks = cld(Nv, Nvthreads) # +1 may be needed to guarantee that shared memory is populated at the last cell face
    return (;
        threads = (Nvthreads,),
        blocks = (Nh, Nvblocks, Nq * Nq),
        Nvthreads,
    )
end
```

What's unclear about the launch configuration? cc @Sbozzolo
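For concreteness, plugging in some assumed sizes (hypothetical numbers, only to illustrate the shape of the launch configuration this function returns):

```julia
# Assumed sizes, only to illustrate the shape of the launch configuration:
Nq, Nv, Nh    = 4, 64, 30          # nodes per element edge, vertical levels, horizontal elements
n_face_levels = 64                 # one thread per face level
Nvthreads = n_face_levels
Nvblocks  = cld(Nv, Nvthreads)     # = 1: the whole column fits in a single block
threads = (Nvthreads,)             # (64,)
blocks  = (Nh, Nvblocks, Nq * Nq)  # (30, 1, 16): one block per (element, column chunk, node column)
```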
Local notes. Each `StencilBroadcasted` gets its own shmem allocation:

```julia
div(a) + div(b) + div(c)
Broadcasted(+, StencilBroadcasted(div, a), StencilBroadcasted(div, b), StencilBroadcasted(div, c)) # 3 shmem allocations

div(a + b + c)
StencilBroadcasted(div, Broadcasted(+, a, b, c)) # 1 shmem allocation
```
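A hedged sketch (stand-in types, not the ClimaCore API) of how those counts fall out of the expression tree: shared memory is allocated once per `StencilBroadcasted` node, so counting those nodes gives the number of allocations.

```julia
# Stand-in types (hypothetical; the real ones are Base.Broadcast.Broadcasted and
# ClimaCore's StencilBroadcasted).
struct SBc; op; args; end   # stencil operator node: owns one shmem allocation
struct BBc; op; args; end   # pointwise broadcast node: no shmem of its own

count_shmem(x) = 0                                   # leaves (fields) need no shmem
count_shmem(bc::BBc) = sum(count_shmem, bc.args)
count_shmem(bc::SBc) = 1 + sum(count_shmem, bc.args) # one allocation per stencil node

a, b, c = :a, :b, :c
count_shmem(BBc(+, (SBc(:div, (a,)), SBc(:div, (b,)), SBc(:div, (c,)))))  # -> 3
count_shmem(SBc(:div, (BBc(+, (a, b, c)),)))                              # -> 1
```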
We allocate …
We always launch …
For `(f[i+1] - f[i]) / dz[i]`: to fill shmem, we need to populate shmem with `f` at every level the block touches, including the extra entry `f[i+1]` needed at the last level.
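A minimal CUDA.jl sketch of that fill (hypothetical names and layout, not the ClimaCore kernel): each thread loads one entry of `f` into shared memory, the last thread also loads the extra face value, and after a sync the stencil is applied out of shared memory instead of global memory.

```julia
using CUDA

# Minimal sketch (hypothetical, not the ClimaCore kernel): one block per column,
# one thread per level, for the stencil (f[i+1] - f[i]) / dz[i].
function stencil_shmem_sketch!(out, f, dz, nlevels)
    i = threadIdx().x
    f_sh = CuDynamicSharedArray(Float64, nlevels + 1)
    if i <= nlevels
        f_sh[i] = f[i]                           # each thread loads its own level
        i == nlevels && (f_sh[i + 1] = f[i + 1]) # last thread also loads the extra face value
    end
    sync_threads()
    if i <= nlevels
        out[i] = (f_sh[i + 1] - f_sh[i]) / dz[i] # stencil reads shmem, not global memory
    end
    return nothing
end

# Hypothetical launch: the dynamic shared memory size is passed at launch time.
# nlevels = 64; f = CUDA.rand(nlevels + 1); dz = CUDA.ones(nlevels); out = similar(dz)
# @cuda threads = nlevels shmem = (nlevels + 1) * sizeof(Float64) stencil_shmem_sketch!(out, f, dz, nlevels)
```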
We should implement shared memory for our FD stencil kernels, as this will lower global memory traffic, reducing the memory bandwidth requirements of the kernels and improving their performance.
A count of different operators in ClimaAtmos: