Use shared/local memory in FD stencils kernels #1746

Open
Tracked by #2632
charleskawczynski opened this issue May 21, 2024 · 7 comments · May be fixed by #1763
Labels: enhancement (New feature or request)

@charleskawczynski (Member) commented May 21, 2024

We should implement shared memory for our FD stencil kernels: this will lower global memory traffic, reducing the memory bandwidth demands of the kernels and improving performance.

A count of different operators in ClimaAtmos:

  • ᶠinterp: 139
  • ᶠwinterp: 14
  • ᶠgradᵥ: 35
  • ᶠcurlᵥ: 1
  • ᶠupwind1: 17 + more
  • ᶠupwind3: 17 + more
  • ᶠfct_boris_book: 1
  • ᶠfct_zalesak: 1
  • ᶠlin_vanleer: 1
  • ᶜinterp: 65
  • ᶜdivᵥ: 51
  • ᶜgradᵥ: ~28
  • ᶜadvdivᵥ: 49
  • ᶜsubdivᵥ: 9
  • ᶜprecipdivᵥ: 17
  • ᶠright_bias: 15
  • ᶜleft_bias: 12
  • ᶜright_bias: 6
@charleskawczynski (Member, Author) commented Feb 25, 2025

We can use the same high-level design as the shared memory for the spectral element kernels:

  • Add work (the shared-memory buffer) to the stencil broadcasted object
  • When inside the stencil kernel, reconstruct the stencil broadcasted object, populating the shared memory for each operator
  • Recursively fill the shmem for all operators in the broadcasted object (reading from global memory into shmem)
  • Call getidx on the broadcasted object, which grabs the data from the shmem instead of global memory (see the sketch below)
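
A minimal sketch of that recursion, in hedged pseudocode (the names fill_shmem! and work, and the single-argument assumption, are illustrative; the actual ClimaCore implementation may differ):

    # Recursively fill shared memory for every operator in the (possibly
    # nested) broadcasted expression; leaves (fields, scalars) need no shmem.
    fill_shmem!(obj, idx) = nothing
    fill_shmem!(bc::Base.Broadcast.Broadcasted, idx) =
        foreach(arg -> fill_shmem!(arg, idx), bc.args)
    function fill_shmem!(sbc::StencilBroadcasted, idx)
        foreach(arg -> fill_shmem!(arg, idx), sbc.args) # children first
        # read from global memory (via getidx) into this operator's shmem:
        sbc.work[idx] = getidx(sbc.args[1], idx)
        return nothing
    end
    # getidx is then overloaded for StencilBroadcasted to read sbc.work
    # (shared memory) instead of re-reading global memory.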

@Sbozzolo (Member) commented

> We can use the same high-level design as the shared memory for the spectral element kernels:
>
>   • Add work to the stencil broadcasted object
>   • When inside the stencil kernel, we reconstruct the stencil broadcasted object, populating the shared memory for each operator
>   • recursively fill the shmem for all operators in the broadcasted object
>   • call getidx on the broadcasted object, which grabs the data from the shmem instead of global memory

I still don't understand this.

Here are some questions:

  • What's work?
  • Is shared memory allocated once or multiple times per broadcasted expression?
  • What are you recursing over, and what are you filling the shared memory with?
  • Where is the result of the operation stored?

This is my current understanding, at a high level. Suppose we want to compute div.(f), with f a vertical field.
What happens is:

  • An area of shared memory of the same size as the number of levels of f is allocated
  • f is read into shared memory
  • The result of div.(f) is stored back to shared memory
  • If I had g .= div.(f), copy the data from shared memory to g

Now, a more complex case, div.(f .* g):

  • An area of shared memory of the same size as the number of levels of g is allocated
  • g is read into shared memory
  • f is read from global memory, and the result of f * g is stored in the allocated shared memory
  • div is applied to the shared memory, and the result is stored back to shared memory
  • If I were copying to another field, I'd copy the results from shared memory

Now, if we had an extruded field, we would allocate one block per column, with as many threads as the number of levels, and repeat the steps above.

Is my understanding correct?

@charleskawczynski (Member, Author) commented Feb 25, 2025

> I still don't understand this.
>
> Here are some questions:
>
>   • What's work?

work is the shared memory. We could rename it, but that was the name used in the SEM broadcasted object (I was trying to keep things somewhat unified).
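
For illustration, the object might look roughly like this (a hedged sketch; the actual field layout may differ):

    struct StencilBroadcasted{Op, Args, Axes, Work}
        op::Op      # the stencil operator, e.g. DivergenceF2C()
        args::Args  # arguments: fields or nested (Stencil)Broadcasted objects
        axes::Axes  # the space the result lives on
        work::Work  # the shared-memory buffer, one per StencilBroadcasted
    end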

>   • Is shared memory allocated once or multiple times per broadcasted expression?

It's allocated once per StencilBroadcasted object.

>   • What are you recursing over, and what are you filling the shared memory with?

We are recursing over the broadcasted object (it is a recursive structure). We fill the shared memory with (for DivergenceF2C) Ju³ at each face. Different operators may store different things, since they have different arguments (and sometimes a different number of arguments). Concretely, the eltype of the shared memory is the operator_eltype.

>   • Where is the result of the operation stored?

The result of which operation? For the outer-most layer of the expression, see below.

> This is my current understanding, at a high level. Suppose we want to compute div.(f), with f a vertical field. What happens is:
>
>   • An area of shared memory of the same size as the number of levels of f is allocated

Correct

>   • f is read into shared memory

Correct

>   • The result of div.(f) is stored back to shared memory

No, only the argument to div is stored in shared memory. Everything else works exactly as it currently does. The workflow is really quite simple:

  • Allocate shmem (when shmem is supported)
  • Fill shmem with the operator argument, using getidx (when shmem is supported)
  • Retrieve the operator argument from shmem (done by overloading getidx when shmem is supported)

Put differently, if we ask "Where is the result of the operation stored in div.(f)?", there is no "store" in this expression; it's an allocating expression. If we instead write @. c = div(f), then we can say that the result is stored in c. The cell-center value of div(f) at each idx is computed via getidx (this is how the existing infrastructure works), and that result is then stored in c at the corresponding idx. I hope that helps.
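
A minimal sketch of that workflow, with hypothetical helper names (shmem_supported, local_face_index, apply_stencil); the real getidx in ClimaCore takes additional arguments (e.g. space and boundary-window information):

    function getidx(sbc::StencilBroadcasted, idx)
        arg = if shmem_supported(sbc.op)
            sbc.work[local_face_index(idx)] # step 3: read from shared memory
        else
            getidx(sbc.args[1], idx)        # fallback: read from global memory
        end
        return apply_stencil(sbc.op, arg, idx) # e.g. the div formula below
    end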

> If I had g .= div.(f), copy the data from shared memory to g

f is the only part that is put into, and retrieved from, shared memory, as that is where there are duplicate reads.
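
Concretely, for the div stencil shown in a later comment, each face value f[i] is needed by both neighboring cell centers (i-1 and i), so without shmem each face would be read from global memory twice per column; with shmem it is read once.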

Most of these questions are really about how broadcasting and getidx work; we should probably improve the documentation on that, but that should be a separate effort.

> Now, a more complex case, div.(f .* g):

This case is exactly the same as the previous one: f .* g is simply a broadcasted object, which can be thought of as a field.
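
In the notation of the local notes in the next comment, this lowers to (illustrative):

    div.(f .* g)
    StencilBroadcasted(div, Broadcasted(*, f, g)) # 1 shmem allocation
    # Filling the shmem calls getidx on Broadcasted(*, f, g), which reads f
    # and g from global memory and stores f[i] * g[i] in shmem at each face.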

> Now, if we had an extruded field, we would allocate one block per column, with as many threads as the number of levels, and repeat the steps above.

I'm really confused about what you're asking. Let's look at the code:

@inline function fd_stencil_partition(
    us::DataLayouts.UniversalSize,
    n_face_levels::Integer,
    n_max_threads::Integer = 256,
)
    (Nq, _, _, Nv, Nh) = DataLayouts.universal_size(us)
    Nvthreads = n_face_levels # one thread per vertical face level
    @assert Nvthreads <= maximum_allowable_threads()[1] "Number of vertical face levels cannot exceed $(maximum_allowable_threads()[1])"
    Nvblocks = cld(Nv, Nvthreads) # +1 may be needed to guarantee that shared memory is populated at the last cell face
    return (;
        threads = (Nvthreads,), # all face levels of a column share one block
        blocks = (Nh, Nvblocks, Nq * Nq), # one block per column
        Nvthreads,
    )
end
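
For a concrete (illustrative) example: with Nq = 4, Nv = 63 center levels, n_face_levels = 64, and Nh = 30, this returns threads = (64,) and blocks = (30, cld(63, 64), 16) = (30, 1, 16), i.e. one block per column (Nh * Nq * Nq columns in total) with one thread per face level.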

What's unclear about the launch configuration?

cc @Sbozzolo

@charleskawczynski (Member, Author) commented Feb 25, 2025

Local notes. Each StencilBroadcasted allocates shmem for itself:

div(a) + div(b) + div(c)
Broadcasted(+, StencilBroadcasted(div, a), StencilBroadcasted(div, b), StencilBroadcasted(div, c)) # 3 shmem allocations

div(a + b + c)
StencilBroadcasted(div, Broadcasted(+, a, b, c)) # 1 shmem allocation

@charleskawczynski (Member, Author) commented

We allocate nlevels worth of shmem, so some buffers are of size Nv (centers) and some of size Nv+1 (faces).

We always launch Nv+1 threads, in order to always populate all of the faces.

@charleskawczynski (Member, Author) commented

For div, we roughly have

(f[i+1] - f[i]) / dz[i]

To fill shmem, we need to populate it with f at every face i.
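
As a standalone illustration of the pattern (not ClimaCore's actual kernel), a single-column CUDA.jl sketch, assuming NF face levels, f of length NF, and dz and divf of length NF - 1:

    using CUDA

    function div_shmem_kernel!(divf, f, dz, ::Val{NF}) where {NF}
        face = threadIdx().x
        f_sh = CuStaticSharedArray(Float64, NF)
        f_sh[face] = f[face] # one global read per face, into shared memory
        sync_threads()
        if face < NF # NF - 1 cell centers
            # (f[i+1] - f[i]) / dz[i], with both faces served from shmem
            divf[face] = (f_sh[face + 1] - f_sh[face]) / dz[face]
        end
        return nothing
    end

    # Launch with one thread per face level, e.g.:
    # @cuda threads = NF div_shmem_kernel!(divf, f, dz, Val(NF))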

@charleskawczynski (Member, Author) commented

div.(f .* g) -> div.(h), where h[i] = f[i] * g[i] (the broadcasted f .* g plays the role of the field h when filling shmem).
