Use shared/local memory in FD stencils kernels #1746

Open
Tracked by #2632
charleskawczynski opened this issue May 21, 2024 · 7 comments · May be fixed by #1763
Labels: enhancement (New feature or request)

@charleskawczynski (Member) commented May 21, 2024

We should implement shared memory for our FD stencil kernels: this will lower global memory traffic, reducing the memory bandwidth demands of the kernels and improving performance.

A count of different operators in ClimaAtmos:

  • ᶠinterp: 139
  • ᶠwinterp: 14
  • ᶠgradᵥ: 35
  • ᶠcurlᵥ: 1
  • ᶠupwind1: 17 + more
  • ᶠupwind3: 17 + more
  • ᶠfct_boris_book: 1
  • ᶠfct_zalesak: 1
  • ᶠlin_vanleer: 1
  • ᶜinterp: 65
  • ᶜdivᵥ: 51
  • ᶜgradᵥ: ~28
  • ᶜadvdivᵥ: 49
  • ᶜsubdivᵥ: 9
  • ᶜprecipdivᵥ: 17
  • ᶠright_bias: 15
  • ᶜleft_bias: 12
  • ᶜright_bias: 6
@charleskawczynski (Member, Author) commented Feb 25, 2025

We can use the same high-level design as the shared memory for the spectral element kernels:

  • Add work (the shared-memory buffer) to the stencil broadcasted object
  • When inside the stencil kernel, reconstruct the stencil broadcasted object, populating the shared memory for each operator
  • Recursively fill the shmem for all operators in the broadcasted object (reading from global memory into shmem)
  • Call getidx on the broadcasted object, which grabs the data from the shmem instead of global memory (see the sketch below)
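
A minimal sketch of that recursion, in hedged pseudocode (the names fill_shmem! and work, and the single-argument assumption, are illustrative; the actual ClimaCore implementation may differ):

    # Recursively fill shared memory for every operator in the (possibly
    # nested) broadcasted expression; leaves (fields, scalars) need no shmem.
    fill_shmem!(obj, idx) = nothing
    fill_shmem!(bc::Base.Broadcast.Broadcasted, idx) =
        foreach(arg -> fill_shmem!(arg, idx), bc.args)
    function fill_shmem!(sbc::StencilBroadcasted, idx)
        foreach(arg -> fill_shmem!(arg, idx), sbc.args) # children first
        # read from global memory (via getidx) into this operator's shmem:
        sbc.work[idx] = getidx(sbc.args[1], idx)
        return nothing
    end
    # getidx is then overloaded for StencilBroadcasted to read sbc.work
    # (shared memory) instead of re-reading global memory.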

@Sbozzolo (Member) commented

> We can use the same high-level design as the shared memory for the spectral element kernels:
>
>   • Add work to the stencil broadcasted object
>   • When inside the stencil kernel, we reconstruct the stencil broadcasted object, populating the shared memory for each operator
>   • recursively fill the shmem for all operators in the broadcasted object
>   • call getidx on the broadcasted object, which grabs the data from the shmem instead of global memory

I still don't understand this.

Here are some questions:

  • What's work?
  • Is shared memory allocated once or multiple times per broadcasted expression?
  • What are you recursing over, and what are you filling the shared memory with?
  • Where is the result of the operation stored?

This is my current understanding, at a high level. Suppose we want to compute div.(f), with f a vertical field.
What happens is:

  • An area of shared memory of the same size as the number of levels of f is allocated
  • f is read into shared memory
  • The result of div.(f) is stored back to shared memory
  • If I had g .= div.(f), copy the data from shared memory to g

Now, a more complex case, div.(f .* g):

  • An area of shared memory of the same size as the number of levels of g is allocated
  • g is read into shared memory
  • f is read from global memory, and the result of f * g is stored in the allocated shared memory
  • div is applied to the shared memory, and the result is stored back to shared memory
  • If I were copying to another field, I'd copy the results from shared memory

Now, if we had an extruded field, we would allocate one block per column, with as many threads as the number of levels, and repeat the steps above.

Is my understanding correct?

@charleskawczynski (Member, Author) commented Feb 25, 2025

> I still don't understand this.
>
> Here are some questions:
>
>   • What's work?

work is the shared memory. We could rename it, but that was the name used in the SEM broadcasted object (I was trying to keep things somewhat unified).
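
For illustration, the object might look roughly like this (a hedged sketch; the actual field layout may differ):

    struct StencilBroadcasted{Op, Args, Axes, Work}
        op::Op      # the stencil operator, e.g. DivergenceF2C()
        args::Args  # arguments: fields or nested (Stencil)Broadcasted objects
        axes::Axes  # the space the result lives on
        work::Work  # the shared-memory buffer, one per StencilBroadcasted
    end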

>   • Is shared memory allocated once or multiple times per broadcasted expression?

It's allocated once per StencilBroadcasted object.

>   • What are you recursing over, and what are you filling the shared memory with?

We are recursing over the broadcasted object (it is a recursive structure). We fill the shared memory with (for DivergenceF2C) Ju³ at each face. Different operators may store different things, since they have different arguments (and sometimes a different number of arguments). Concretely, the eltype of the shared memory is the operator_eltype.

>   • Where is the result of the operation stored?

The result of which operation? For the outer-most layer of the expression, see below.

> This is my current understanding, at a high level. Suppose we want to compute div.(f), with f a vertical field. What happens is:
>
>   • An area of shared memory of the same size as the number of levels of f is allocated

Correct

>   • f is read into shared memory

Correct

>   • The result of div.(f) is stored back to shared memory

No, only the argument to div is stored in shared memory. Everything else works exactly as it currently does. The workflow is really quite simple:

  • Allocate shmem (when shmem is supported)
  • Fill shmem with the operator argument, using getidx (when shmem is supported)
  • Retrieve the operator argument from shmem (done by overloading getidx when shmem is supported)

Put differently, if we ask "Where is the result of the operation stored in div.(f)?", there is no "store" in this expression; it's an allocating expression. If we instead write @. c = div(f), then we can say that the result is stored in c. The cell-center value of div(f) at each idx is computed via getidx (this is how the existing infrastructure works), and that result is then stored in c at the corresponding idx. I hope that helps.
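
A minimal sketch of that workflow, with hypothetical helper names (shmem_supported, local_face_index, apply_stencil); the real getidx in ClimaCore takes additional arguments (e.g. space and boundary-window information):

    function getidx(sbc::StencilBroadcasted, idx)
        arg = if shmem_supported(sbc.op)
            sbc.work[local_face_index(idx)] # step 3: read from shared memory
        else
            getidx(sbc.args[1], idx)        # fallback: read from global memory
        end
        return apply_stencil(sbc.op, arg, idx) # e.g. the div formula below
    end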

> If I had g .= div.(f), copy the data from shared memory to g

f is the only part that is put into, and retrieved from, shared memory, as that is where there are duplicate reads.
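
Concretely, for the div stencil shown in a later comment, each face value f[i] is needed by both neighboring cell centers (i-1 and i), so without shmem each face would be read from global memory twice per column; with shmem it is read once.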

Most of these questions are really about how broadcasting and getidx work; we should probably improve the documentation on that, but that should be a separate effort.

> Now, a more complex case, div.(f .* g):

This case is exactly the same as the previous one: f .* g is simply a broadcasted object, which can be thought of as a field.
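
In the notation of the local notes in the next comment, this lowers to (illustrative):

    div.(f .* g)
    StencilBroadcasted(div, Broadcasted(*, f, g)) # 1 shmem allocation
    # Filling the shmem calls getidx on Broadcasted(*, f, g), which reads f
    # and g from global memory and stores f[i] * g[i] in shmem at each face.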

> Now, if we had an extruded field, we would allocate one block per column, with as many threads as the number of levels, and repeat the steps above.

I'm really confused about what you're asking. Let's look at the code:

@inline function fd_stencil_partition(
    us::DataLayouts.UniversalSize,
    n_face_levels::Integer,
    n_max_threads::Integer = 256,
)
    (Nq, _, _, Nv, Nh) = DataLayouts.universal_size(us)
    Nvthreads = n_face_levels # one thread per vertical face level
    @assert Nvthreads <= maximum_allowable_threads()[1] "Number of vertical face levels cannot exceed $(maximum_allowable_threads()[1])"
    Nvblocks = cld(Nv, Nvthreads) # +1 may be needed to guarantee that shared memory is populated at the last cell face
    return (;
        threads = (Nvthreads,), # all face levels of a column share one block
        blocks = (Nh, Nvblocks, Nq * Nq), # one block per column
        Nvthreads,
    )
end
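
For a concrete (illustrative) example: with Nq = 4, Nv = 63 center levels, n_face_levels = 64, and Nh = 30, this returns threads = (64,) and blocks = (30, cld(63, 64), 16) = (30, 1, 16), i.e. one block per column (Nh * Nq * Nq columns in total) with one thread per face level.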

What's unclear about the launch configuration?

cc @Sbozzolo

@charleskawczynski (Member, Author) commented Feb 25, 2025

Local notes. Each StencilBroadcasted allocates shmem for itself:

div(a) + div(b) + div(c)
Broadcasted(+, StencilBroadcasted(div, a), StencilBroadcasted(div, b), StencilBroadcasted(div, c)) # 3 shmem allocations

div(a + b + c)
StencilBroadcasted(div, Broadcasted(+, a, b, c)) # 1 shmem allocation

@charleskawczynski (Member, Author) commented

We allocate nlevels worth of shmem, so some buffers are of size Nv (centers) and some of size Nv+1 (faces).

We always launch Nv+1 threads, in order to always populate all of the faces.

@charleskawczynski (Member, Author) commented

For div, we roughly have

(f[i+1] - f[i]) / dz[i]

To fill shmem, we need to populate it with f at every face i.
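
As a standalone illustration of the pattern (not ClimaCore's actual kernel), a single-column CUDA.jl sketch, assuming NF face levels, f of length NF, and dz and divf of length NF - 1:

    using CUDA

    function div_shmem_kernel!(divf, f, dz, ::Val{NF}) where {NF}
        face = threadIdx().x
        f_sh = CuStaticSharedArray(Float64, NF)
        f_sh[face] = f[face] # one global read per face, into shared memory
        sync_threads()
        if face < NF # NF - 1 cell centers
            # (f[i+1] - f[i]) / dz[i], with both faces served from shmem
            divf[face] = (f_sh[face + 1] - f_sh[face]) / dz[face]
        end
        return nothing
    end

    # Launch with one thread per face level, e.g.:
    # @cuda threads = NF div_shmem_kernel!(divf, f, dz, Val(NF))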

@charleskawczynski (Member, Author) commented

div.(f .* g) -> div.(h), where h[i] = f[i] * g[i] (the broadcasted f .* g plays the role of the field h when filling shmem).
