
Use shared memory in DivergenceF2C stencil operators #2184

Merged
merged 1 commit into main from ck/fd_shmem on Mar 19, 2025

Conversation

@charleskawczynski (Member) commented Feb 12, 2025

This PR implements shared memory for the finite difference stencil DivergenceF2C. This is a step towards #1746, where the high-level design is discussed (it's basically the same design that we use for spectral element kernels).

Some notes on implementation and behavior details:

  • As with our SEM kernels, shared memory (shmem) is supported through a single layer of operator composition. That is, for a composed operator like div(grad(f)), the result of grad(f) is put into shared memory per thread (i.e., per point), but duplicate reads of f still occur (thread i reads f at i+1 and thread i+1 reads f at i); a toy illustration of this read pattern follows after this list. To improve on this, we can write combined operators (e.g., divgrad) whose shmem is managed jointly, which results in the minimal number of required reads/writes. Ideally we could automate this somehow, but I think that's the next step. One nice aspect of automating it is that we wouldn't need to introduce new terminology to users.
  • If the vertical resolution is too large (e.g., 1000 vertical levels), then we dispatch to a shmem-free version to support higher vertical resolution.
  • This PR only adds shared memory for an F2C operator; we should probably also add one C2F operator so that we exercise the cell center-filling shared memory branch (which is currently not exercised, and is therefore likely not correct).
  • I've also added some specializations for JContravariant3 to convince CUDA to emit fused instructions (found through Nsight Compute).
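
Below is a toy, CPU-only illustration (plain Julia, not ClimaCore code; all names are made up for the example) of the read pattern described in the first bullet for div(grad(f)): staging the gradient in a buffer removes its re-computation, but duplicate reads of f itself remain.

```julia
nc = 10                       # number of cell centers in a toy 1-D column
f  = rand(nc)                 # center values
Δz = 0.1
reads = Ref(0)
readf(i) = (reads[] += 1; f[i])                    # count every "global" read of f
gradf(i) = (readf(i) - readf(i - 1)) / Δz          # gradient at interior face i (i in 2:nc)

# 1) No shmem: every center re-evaluates the gradient at both of its faces,
#    so f is read 4 times per output point.
reads[] = 0
div_naive = [(gradf(i + 1) - gradf(i)) / Δz for i in 2:(nc - 1)]
println("reads of f without shmem: ", reads[])

# 2) Single layer of composition with shmem: each face gradient is computed once
#    into a buffer (standing in for shared memory); the divergence then reads only
#    the buffer, but f is still read twice per face.
reads[] = 0
shmem = [gradf(i) for i in 2:nc]                    # one entry per interior face
div_shmem = [(shmem[i] - shmem[i - 1]) / Δz for i in 2:(nc - 1)]
println("reads of f with shmem:    ", reads[])
```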

Direct kernel performance impact

This comment discusses the direct impact on kernels.

Overall performance impact

  • Here is a GPU target pipeline build of ClimaAtmos, and the latest nearby main. For moist Held-Suarez on 1 GPU:
  • main: SYPD 1.88
  • This branch: SYPD 2.056

Adding new operators

To add new operators, developers need to update two files (a simplified toy sketch of these hooks follows the list below):

  • ext/cuda/operators_fd_shmem.jl, where you need to define:
    • fd_operator_shmem - allocates shared memory
    • fd_operator_fill_shmem_interior! - defines how interior stencil shmem is populated
    • fd_operator_fill_shmem_left_boundary! - defines how left boundary stencil shmem is populated
    • fd_operator_fill_shmem_right_boundary! - defines how right boundary stencil shmem is populated
    • fd_operator_evaluate - defines how to get the result of getidx from shmem
  • ext/cuda/operators_fd_shmem_is_supported.jl, where you need to define fd_shmem_is_supported for the newly supported operator (and boundary conditions)
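
Here is a simplified, self-contained toy (plain Julia, with a CPU array standing in for CUDA shared memory) that mirrors the roles of the five hooks above for an F2C divergence; the toy_* names and signatures are illustrative only, not the actual interfaces in ext/cuda/operators_fd_shmem.jl:

```julia
struct ToyDivergenceF2C end

# stands in for fd_operator_shmem: allocate storage for one column of face values
toy_operator_shmem(::ToyDivergenceF2C, nfaces) = Vector{Float64}(undef, nfaces)

# stands in for fd_operator_fill_shmem_interior!: each thread (one per face) writes its face value
toy_fill_interior!(::ToyDivergenceF2C, shmem, f, i) = (shmem[i] = f[i])

# stand in for the left/right boundary fills, where a boundary condition would normally be applied
toy_fill_left_boundary!(::ToyDivergenceF2C, shmem, f)  = (shmem[1] = f[1])
toy_fill_right_boundary!(::ToyDivergenceF2C, shmem, f) = (shmem[end] = f[end])

# stands in for fd_operator_evaluate: getidx for cell center i reads only from shmem
toy_evaluate(::ToyDivergenceF2C, shmem, i, Δz) = (shmem[i + 1] - shmem[i]) / Δz

op    = ToyDivergenceF2C()
f     = collect(range(0.0, 1.0; length = 11))     # 11 face values -> 10 cell centers
Δz    = 0.1
shmem = toy_operator_shmem(op, length(f))
toy_fill_left_boundary!(op, shmem, f)
for i in 2:(length(f) - 1)
    toy_fill_interior!(op, shmem, f, i)
end
toy_fill_right_boundary!(op, shmem, f)
div_f2c = [toy_evaluate(op, shmem, i, Δz) for i in 1:(length(f) - 1)]
```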

@Sbozzolo (Member) commented:

I think the expectation of a 40% improvement relies on the assumption of complete L1 cache misses. If different threads within a block share faces and not much more data is needed, the effective number of reads to global memory will be, on average, less than 5, because some values will be cached in the L1 cache of the streaming multiprocessor (which has the same latency as shared memory).

@charleskawczynski force-pushed the ck/fd_shmem branch 4 times, most recently from ba4f5dc to 3f85c1b on February 17, 2025 05:30
@charleskawczynski (Member, Author) commented:

Even slightly complicated cases are showing a nice improvement (~2x), so it may just depend on additional factors (e.g., register pressure, or whether LLVM emits errors/traps).

@charleskawczynski (Member, Author) commented Feb 20, 2025

Here are the preliminary results for some of the relevant benchmarks (in test/Operators/finitedifference/benchmark_fd_ops_shared_memory.jl):

| benchmark | main (Float64) | shmem (Float64) | main (Float32) | shmem (Float32) |
|-----------|----------------|-----------------|----------------|-----------------|
| ᶜout1     | 245.218 μs     | 204.288 μs      | 116.570 μs     | 114.139 μs      |
| ᶜout2     | 287.298 μs     | 231.828 μs      | 178.039 μs     | 127.799 μs      |
| ᶜout3     | 578.095 μs     | 278.687 μs      | 412.717 μs     | 200.338 μs      |
| ᶜout4     | 320.777 μs     | 226.438 μs      | 268.588 μs     | 158.829 μs      |
| ᶜout5     | 335.117 μs     | 252.077 μs      | 290.848 μs     | 168.479 μs      |
| ᶜout6     | 568.145 μs     | 267.818 μs      | 497.586 μs     | 179.479 μs      |
| ᶜout7     | 439.927 μs     | 239.068 μs      | 388.677 μs     | 183.238 μs      |
| ᶜout8     | 440.407 μs     | 245.918 μs      | 382.127 μs     | 182.989 μs      |

Specializing on JContravariant3, which currently results in warp stalls due to unfused instructions, we can do slightly better:

| benchmark | main (Float64) | shmem (Float64) | main (Float32) | shmem (Float32) |
|-----------|----------------|-----------------|----------------|-----------------|
| ᶜout1     | 245.218 μs     | 146.589 μs      | 116.570 μs     | 99.249 μs       |
| ᶜout2     | 287.298 μs     | 170.128 μs      | 178.039 μs     | 112.099 μs      |
| ᶜout3     | 578.095 μs     | 210.158 μs      | 412.717 μs     | 177.068 μs      |
| ᶜout4     | 320.777 μs     | 169.309 μs      | 268.588 μs     | 142.969 μs      |
| ᶜout5     | 335.117 μs     | 191.898 μs      | 290.848 μs     | 156.409 μs      |
| ᶜout6     | 568.145 μs     | 199.199 μs      | 497.586 μs     | 162.038 μs      |
| ᶜout7     | 439.927 μs     | 187.088 μs      | 388.677 μs     | 151.149 μs      |
| ᶜout8     | 440.407 μs     | 186.219 μs      | 382.127 μs     | 151.468 μs      |

It's notable that some of these kernels can be further optimized. For example, ᶜout6 looks like `@. ᶜout6 = div_bcs(Geometry.WVector(ᶠwinterp(ϕ, ρ)))`, and we only apply shared memory to div_bcs, but we could also add shared memory support for ᶠwinterp. Finally, we could define a DivergenceF2CWInterp operator that combines the stencils into one, which should, in theory, give us something closer to ~100 μs (on Float32), a ~5x speedup. (A toy sketch of such a fused stencil is shown below.)
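
For illustration, here is a toy, CPU-only sketch of what such a fused stencil could compute in a single pass, with no intermediate face field materialized; the ρ-weighted interpolation used here is an assumption for the example (not necessarily ClimaCore's ᶠwinterp definition), and boundary conditions are omitted:

```julia
function fused_div_winterp(ϕ, ρ, Δz)
    nc  = length(ϕ)                      # cell centers
    out = similar(ϕ, nc - 2)             # interior centers only
    for i in 2:(nc - 1)
        # ρ-weighted interpolation of ϕ to the faces below and above center i
        w⁻ = (ρ[i - 1] * ϕ[i - 1] + ρ[i] * ϕ[i]) / (ρ[i - 1] + ρ[i])
        w⁺ = (ρ[i] * ϕ[i] + ρ[i + 1] * ϕ[i + 1]) / (ρ[i] + ρ[i + 1])
        # divergence back to the center, all in one pass
        out[i - 1] = (w⁺ - w⁻) / Δz
    end
    return out
end

ϕ, ρ, Δz = rand(10), rand(10) .+ 1.0, 0.1
fused_div_winterp(ϕ, ρ, Δz)
```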

@charleskawczynski force-pushed the ck/fd_shmem branch 4 times, most recently from bfa8df9 to 274f646 on February 21, 2025 14:39
@charleskawczynski marked this pull request as ready for review on February 21, 2025 16:14
@Sbozzolo (Member) commented:

The build that you ran in ClimaAtmos shows a significant regression in the "Benchmark: GPU prog edmf" test compared to main. So maybe that's something that you want to look into before we look at this?

I also leave my first impression here:

> I think this may change the limit of vertical resolution that we support (the 1,000-level test was lowered). Should we dispatch to a shmem-free version to support higher vertical resolution? I don't think we'll ever need more than 256, which is supported by this implementation.

I think that this should be addressed. We will not need more than 256 levels for a global simulation, but we might want it in other settings. For example, I used more than 256 levels to study self-convergence in ClimaTimeSteppers. I think that a restriction to 256 points in a finite difference method is very strong for most applications that are not global simulations. Moreover, this would further differentiate our CPU and GPU capabilities. (I think it'd be perfectly acceptable to have a slower path for when the number of points is more than 256, but users should still be able to run with such a configuration.)

@charleskawczynski (Member, Author) commented Feb 21, 2025

I'll of course make sure we address the performance before merging. Support for a larger number of vertical levels is now fixed (by dispatching to the shmem-free version).

Do you have any other comments?

@Sbozzolo (Member) commented:

> Do you have any other comments?

Yes, but it will be much more efficient to talk in person. So I'd suggest you first look at the problem with that job, and afterwards we can schedule a call to chat about it.

@charleskawczynski (Member, Author) commented:

> The build that you ran in ClimaAtmos shows a significant regression in the "Benchmark: GPU prog edmf" test compared to main. So maybe that's something that you want to look into before we look at this?

Which build are you comparing against?

@Sbozzolo (Member) commented:

xref: CliMA/ClimaAtmos.jl#2632

Notes from the Feb 24 meeting:

High level goal:

  • Shared memory is required to allow for larger finite difference operator kernels that would not otherwise fit
  • Performance benefits due to reduced reads

Trade-offs:

  • Code complexity and duplication
  • Developer time

Path forward:

  • Start with one operator (this PR)
  • Progressively add more and more operators.
  • Experiment with this by supporting three code paths: CPU, GPU without shared memory, GPU with shared memory.
  • Later down the line, experiment with unifying shared memory backends

High level design:

Concerns:

  • Dennis: we should not manually fuse operators; we should use lazy representations, as in MatrixFields

@charleskawczynski (Member, Author) commented:

> Experiment with this by supporting three code paths: CPU, GPU without shared memory, GPU with shared memory.

Just to reiterate this point: the amount of duplication for the "three" code paths is really quite small; copyto_stencil_kernel! is the only method we need to support the non-shmem version. (A hypothetical sketch of the path selection follows.)
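
For context, here is a hypothetical sketch of the path selection only. Aside from copyto_stencil_kernel!, which is mentioned above, the function names and the 256-level threshold are invented for illustration; this is not ClimaCore's actual dispatch logic.

```julia
const MAX_SHMEM_LEVELS = 256   # assumed per-block limit for the shmem path

launch_shmem_kernel!(out, bc)   = println("GPU, shared-memory path")
copyto_stencil_kernel!(out, bc) = println("GPU, shmem-free path")
copyto_column_cpu!(out, bc)     = println("CPU path")

function run_fd_broadcast!(out, bc; on_gpu::Bool, nlevels::Int, shmem_supported::Bool)
    if !on_gpu
        copyto_column_cpu!(out, bc)
    elseif shmem_supported && nlevels <= MAX_SHMEM_LEVELS
        launch_shmem_kernel!(out, bc)
    else
        # fallback keeps high-resolution columns working, just without shmem
        copyto_stencil_kernel!(out, bc)
    end
end

# e.g., a 1000-level column silently takes the shmem-free path
run_fd_broadcast!(nothing, nothing; on_gpu = true, nlevels = 1000, shmem_supported = true)
```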

@charleskawczynski removed the draft PR label on Feb 25, 2025
@charleskawczynski force-pushed the ck/fd_shmem branch 4 times, most recently from 01d0269 to 502ff05 on February 26, 2025 13:52
@charleskawczynski (Member, Author) commented:

I realized something recently about this PR. We don't need to define shmem operators for combinations of operators, because the reads are collocated: any global memory read performed by a thread is put into a register, and that register can be reused for any part of the broadcasted object. So, I'm now thinking that it's worth moving forward with this for at least a handful of operators.

Given this, fusing operations with this pattern should basically minimize the number of reads, which is great, but it could limit how many operators we can fuse, since shared memory is a limited resource and we would need shmem allocated per operator. (A rough estimate of that limit is sketched below.)
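
A rough back-of-the-envelope estimate of that limit, assuming the common 48 KiB of shared memory per block and one face-valued Float64 buffer per fused operator (both assumptions are for illustration; actual limits depend on the GPU and launch configuration):

```julia
nlevels   = 256
bytes_per = sizeof(Float64) * (nlevels + 1)   # 2056 bytes per face-valued buffer
max_ops   = (48 * 1024) ÷ bytes_per           # ≈ 23 operator buffers per block
println(max_ops)
```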

So, it's possible that a super-fused operator (ColumnStateShmem or something) could still be better in the limit of fusing many operations. See #1763 for an initial sketch, where I was applying this to the existing apply_stencil!. I should have defined a new operator in that PR to explicitly request that behavior; that would prevent performance regressions for our existing operators.

I think a super-fused operator is possible in principle, but I'd first need to count the number of metric terms needed from the local geometry to find out at what point it would benefit us.

@dennisYatunin (Member) left a comment:

There's definitely some duplicated code patterns here, but overall this is fine to merge in as an experimental feature. In the future, we should try to figure out a way to avoid duplicating the stencil definition for each vertical operator (by somehow generalizing the stencil coefficients in operator_matrices.jl, akin to what #782 was doing), and a way to combine the vertical and horizontal shared memory interfaces (allocate_shmem, resolve_shmem, etc). As long as this duplication is confined to the CUDA extension, it shouldn't be too hard to simplify things later.

Commit: Try dont_limit on recursive resolve_shmem methods

  • Fixes + more dont limit
  • Matrix field fixes
  • Matrix field fixes
  • DivergenceF2C fix
  • MatrixField fixes
  • Qualify DivergenceF2C
  • wip
  • Refactor + fixed space bug. All seems good.
  • More tests..
  • Fixes
  • Test updates
  • Fixes
  • Allow disabling shmem using broadcast style
  • Fix fused cuda operations in LG
  • Revert some unwanted pieces
  • More fixes
  • Format
  • wip, adding docs
  • Fixes
  • Fixes
  • Apply formatter + docs
  • Always call disable_shmem_style in else-branch
  • Fix git conflict
@charleskawczynski merged commit 869cc54 into main on Mar 19, 2025
20 of 23 checks passed
@charleskawczynski deleted the ck/fd_shmem branch on March 19, 2025 17:24