
Use shared memory in DivergenceF2C stencil operators #2184

Merged
merged 1 commit into main from ck/fd_shmem on Mar 19, 2025

Conversation

@charleskawczynski (Member) commented Feb 12, 2025

This PR implements shared memory for the finite difference stencil DivergenceF2C. This is a step towards #1746, where the high-level design is discussed (it's basically the same design that we use for spectral element kernels).

Some notes on implementation and behavior details:

  • As with our SEM kernels, shared memory (shmem) is supported through a single layer of operator composition. That is, for a composed operator like div(grad(f)), the result of grad(f) is put into shared memory per thread (i.e., per point), but duplicate reads of f still occur (thread i reads f at i+1 and thread i+1 reads f at i); a toy illustration of this read pattern follows after this list. To improve on this, we can write combined operators (e.g., divgrad) whose shmem is managed jointly, which results in the minimal number of required reads/writes. Ideally we could automate this somehow, but I think that's the next step. One nice aspect of automating it is that we wouldn't need to introduce new terminology to users.
  • If the vertical resolution is too large (e.g., 1000 vertical levels), then we dispatch to a shmem-free version to support higher vertical resolution.
  • This PR only adds shared memory for an F2C operator; we should probably also add one C2F operator so that we exercise the cell center-filling shared memory branch (which is currently not exercised, and is therefore likely not correct).
  • I've also added some specializations for JContravariant3 to convince CUDA to emit fused instructions (found through Nsight Compute).
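
Below is a toy, CPU-only illustration (plain Julia, not ClimaCore code; all names are made up for the example) of the read pattern described in the first bullet for div(grad(f)): staging the gradient in a buffer removes its re-computation, but duplicate reads of f itself remain.

```julia
nc = 10                       # number of cell centers in a toy 1-D column
f  = rand(nc)                 # center values
Δz = 0.1
reads = Ref(0)
readf(i) = (reads[] += 1; f[i])                    # count every "global" read of f
gradf(i) = (readf(i) - readf(i - 1)) / Δz          # gradient at interior face i (i in 2:nc)

# 1) No shmem: every center re-evaluates the gradient at both of its faces,
#    so f is read 4 times per output point.
reads[] = 0
div_naive = [(gradf(i + 1) - gradf(i)) / Δz for i in 2:(nc - 1)]
println("reads of f without shmem: ", reads[])

# 2) Single layer of composition with shmem: each face gradient is computed once
#    into a buffer (standing in for shared memory); the divergence then reads only
#    the buffer, but f is still read twice per face.
reads[] = 0
shmem = [gradf(i) for i in 2:nc]                    # one entry per interior face
div_shmem = [(shmem[i] - shmem[i - 1]) / Δz for i in 2:(nc - 1)]
println("reads of f with shmem:    ", reads[])
```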

Direct kernel performance impact

This comment discusses the direct impact on kernels.

Overall performance impact

  • Here is a GPU target pipeline build of ClimaAtmos, and the latest nearby main. For moist Held-Suarez on 1 GPU:
  • main: SYPD 1.88
  • This branch: SYPD 2.056

Adding new operators

To add new operators, developers need to update two files (a simplified toy sketch of these hooks follows the list below):

  • ext/cuda/operators_fd_shmem.jl, where you need to define:
    • fd_operator_shmem - allocates shared memory
    • fd_operator_fill_shmem_interior! - defines how interior stencil shmem is populated
    • fd_operator_fill_shmem_left_boundary! - defines how left boundary stencil shmem is populated
    • fd_operator_fill_shmem_right_boundary! - defines how right boundary stencil shmem is populated
    • fd_operator_evaluate - defines how to get the result of getidx from shmem
  • ext/cuda/operators_fd_shmem_is_supported.jl, where you need to define fd_shmem_is_supported for the newly supported operator (and boundary conditions)
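
Here is a simplified, self-contained toy (plain Julia, with a CPU array standing in for CUDA shared memory) that mirrors the roles of the five hooks above for an F2C divergence; the toy_* names and signatures are illustrative only, not the actual interfaces in ext/cuda/operators_fd_shmem.jl:

```julia
struct ToyDivergenceF2C end

# stands in for fd_operator_shmem: allocate storage for one column of face values
toy_operator_shmem(::ToyDivergenceF2C, nfaces) = Vector{Float64}(undef, nfaces)

# stands in for fd_operator_fill_shmem_interior!: each thread (one per face) writes its face value
toy_fill_interior!(::ToyDivergenceF2C, shmem, f, i) = (shmem[i] = f[i])

# stand in for the left/right boundary fills, where a boundary condition would normally be applied
toy_fill_left_boundary!(::ToyDivergenceF2C, shmem, f)  = (shmem[1] = f[1])
toy_fill_right_boundary!(::ToyDivergenceF2C, shmem, f) = (shmem[end] = f[end])

# stands in for fd_operator_evaluate: getidx for cell center i reads only from shmem
toy_evaluate(::ToyDivergenceF2C, shmem, i, Δz) = (shmem[i + 1] - shmem[i]) / Δz

op    = ToyDivergenceF2C()
f     = collect(range(0.0, 1.0; length = 11))     # 11 face values -> 10 cell centers
Δz    = 0.1
shmem = toy_operator_shmem(op, length(f))
toy_fill_left_boundary!(op, shmem, f)
for i in 2:(length(f) - 1)
    toy_fill_interior!(op, shmem, f, i)
end
toy_fill_right_boundary!(op, shmem, f)
div_f2c = [toy_evaluate(op, shmem, i, Δz) for i in 1:(length(f) - 1)]
```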

@Sbozzolo (Member) commented:

I think the expectation of a 40% improvement relies on the assumption of complete L1 cache misses. If different threads within a block share faces and not much more data is needed, the effective number of reads to global memory will be, on average, less than 5, because some values will be cached in the L1 cache of the streaming multiprocessor (which has the same latency as shared memory).

@charleskawczynski force-pushed the ck/fd_shmem branch 4 times, most recently from ba4f5dc to 3f85c1b on February 17, 2025 05:30
@charleskawczynski (Member, Author) commented:

Even slightly complicated cases are showing a nice improvement (~2x), so it may just depend on additional factors (e.g., register pressure, or whether LLVM emits errors/traps).

@charleskawczynski (Member, Author) commented Feb 20, 2025

Here are the preliminary results for some of the relevant benchmarks (in test/Operators/finitedifference/benchmark_fd_ops_shared_memory.jl):

| benchmark | main (Float64) | shmem (Float64) | main (Float32) | shmem (Float32) |
|-----------|----------------|-----------------|----------------|-----------------|
| ᶜout1     | 245.218 μs     | 204.288 μs      | 116.570 μs     | 114.139 μs      |
| ᶜout2     | 287.298 μs     | 231.828 μs      | 178.039 μs     | 127.799 μs      |
| ᶜout3     | 578.095 μs     | 278.687 μs      | 412.717 μs     | 200.338 μs      |
| ᶜout4     | 320.777 μs     | 226.438 μs      | 268.588 μs     | 158.829 μs      |
| ᶜout5     | 335.117 μs     | 252.077 μs      | 290.848 μs     | 168.479 μs      |
| ᶜout6     | 568.145 μs     | 267.818 μs      | 497.586 μs     | 179.479 μs      |
| ᶜout7     | 439.927 μs     | 239.068 μs      | 388.677 μs     | 183.238 μs      |
| ᶜout8     | 440.407 μs     | 245.918 μs      | 382.127 μs     | 182.989 μs      |

Specializing on JContravariant3, which currently results in warp stalls due to unfused instructions, we can do slightly better:

| benchmark | main (Float64) | shmem (Float64) | main (Float32) | shmem (Float32) |
|-----------|----------------|-----------------|----------------|-----------------|
| ᶜout1     | 245.218 μs     | 146.589 μs      | 116.570 μs     | 99.249 μs       |
| ᶜout2     | 287.298 μs     | 170.128 μs      | 178.039 μs     | 112.099 μs      |
| ᶜout3     | 578.095 μs     | 210.158 μs      | 412.717 μs     | 177.068 μs      |
| ᶜout4     | 320.777 μs     | 169.309 μs      | 268.588 μs     | 142.969 μs      |
| ᶜout5     | 335.117 μs     | 191.898 μs      | 290.848 μs     | 156.409 μs      |
| ᶜout6     | 568.145 μs     | 199.199 μs      | 497.586 μs     | 162.038 μs      |
| ᶜout7     | 439.927 μs     | 187.088 μs      | 388.677 μs     | 151.149 μs      |
| ᶜout8     | 440.407 μs     | 186.219 μs      | 382.127 μs     | 151.468 μs      |

It's notable that some of these kernels can be further optimized. For example, ᶜout6 looks like `@. ᶜout6 = div_bcs(Geometry.WVector(ᶠwinterp(ϕ, ρ)))`, and we only apply shared memory to div_bcs, but we could also add shared memory support for ᶠwinterp. Finally, we could define a DivergenceF2CWInterp operator that combines the stencils into one, which should, in theory, give us something closer to ~100 μs (on Float32), a ~5x speedup. (A toy sketch of such a fused stencil is shown below.)
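
For illustration, here is a toy, CPU-only sketch of what such a fused stencil could compute in a single pass, with no intermediate face field materialized; the ρ-weighted interpolation used here is an assumption for the example (not necessarily ClimaCore's ᶠwinterp definition), and boundary conditions are omitted:

```julia
function fused_div_winterp(ϕ, ρ, Δz)
    nc  = length(ϕ)                      # cell centers
    out = similar(ϕ, nc - 2)             # interior centers only
    for i in 2:(nc - 1)
        # ρ-weighted interpolation of ϕ to the faces below and above center i
        w⁻ = (ρ[i - 1] * ϕ[i - 1] + ρ[i] * ϕ[i]) / (ρ[i - 1] + ρ[i])
        w⁺ = (ρ[i] * ϕ[i] + ρ[i + 1] * ϕ[i + 1]) / (ρ[i] + ρ[i + 1])
        # divergence back to the center, all in one pass
        out[i - 1] = (w⁺ - w⁻) / Δz
    end
    return out
end

ϕ, ρ, Δz = rand(10), rand(10) .+ 1.0, 0.1
fused_div_winterp(ϕ, ρ, Δz)
```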

@charleskawczynski force-pushed the ck/fd_shmem branch 4 times, most recently from bfa8df9 to 274f646 on February 21, 2025 14:39
@charleskawczynski marked this pull request as ready for review on February 21, 2025 16:14
@Sbozzolo (Member) commented:

The build that you ran in ClimaAtmos shows a significant regression in the "Benchmark: GPU prog edmf" test compared to main. So maybe that's something that you want to look into before we look at this?

I also leave my first impression here:

> I think this may change the limit of vertical resolution that we support (the 1,000-level test was lowered). Should we dispatch to a shmem-free version to support higher vertical resolution? I don't think we'll ever need more than 256, which is supported by this implementation.

I think that this should be addressed. We will not need more than 256 levels for a global simulation, but we might want it in other settings. For example, I used more than 256 levels to study self-convergence in ClimaTimeSteppers. I think that a restriction to 256 points in a finite difference method is very strong for most applications that are not global simulations. Moreover, this would further differentiate our CPU and GPU capabilities. (I think it'd be perfectly acceptable to have a slower path for when the number of points is more than 256, but users should still be able to run with such a configuration.)

@charleskawczynski (Member, Author) commented Feb 21, 2025

I'll of course make sure we address the performance before merging. Support for a larger number of vertical levels is now fixed (by dispatching to the shmem-free version).

Do you have any other comments?

@Sbozzolo (Member) commented:

> Do you have any other comments?

Yes, but it will be much more efficient to talk in person. So I'd suggest you first look at the problem with that job, and afterwards we can schedule a call to chat about it.

@charleskawczynski (Member, Author) commented:

> The build that you ran in ClimaAtmos shows a significant regression in the "Benchmark: GPU prog edmf" test compared to main. So maybe that's something that you want to look into before we look at this?

Which build are you comparing against?

@Sbozzolo (Member) commented:

xref: CliMA/ClimaAtmos.jl#2632

Notes from the Feb 24 meeting:

High level goal:

  • Shared memory is required to allow for larger finite difference operator kernels that would not otherwise fit
  • Performance benefits due to reduced reads

Trade-offs:

  • Code complexity and duplication
  • Developer time

Path forward:

  • Start with one operator (this PR)
  • Progressively add more and more operators.
  • Experiment with this by supporting three code paths: CPU, GPU without shared memory, GPU with shared memory.
  • Later down the line, experiment with unifying shared memory backends

High level design:

Concerns:

  • Dennis: we should not manually fuse operators; we should use lazy representations, as in MatrixFields

@charleskawczynski (Member, Author) commented:

> Experiment with this by supporting three code paths: CPU, GPU without shared memory, GPU with shared memory.

Just to reiterate this point: the amount of duplication for the "three" code paths is really quite small; copyto_stencil_kernel! is the only method we need to support the non-shmem version. (A hypothetical sketch of the path selection follows.)
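
For context, here is a hypothetical sketch of the path selection only. Aside from copyto_stencil_kernel!, which is mentioned above, the function names and the 256-level threshold are invented for illustration; this is not ClimaCore's actual dispatch logic.

```julia
const MAX_SHMEM_LEVELS = 256   # assumed per-block limit for the shmem path

launch_shmem_kernel!(out, bc)   = println("GPU, shared-memory path")
copyto_stencil_kernel!(out, bc) = println("GPU, shmem-free path")
copyto_column_cpu!(out, bc)     = println("CPU path")

function run_fd_broadcast!(out, bc; on_gpu::Bool, nlevels::Int, shmem_supported::Bool)
    if !on_gpu
        copyto_column_cpu!(out, bc)
    elseif shmem_supported && nlevels <= MAX_SHMEM_LEVELS
        launch_shmem_kernel!(out, bc)
    else
        # fallback keeps high-resolution columns working, just without shmem
        copyto_stencil_kernel!(out, bc)
    end
end

# e.g., a 1000-level column silently takes the shmem-free path
run_fd_broadcast!(nothing, nothing; on_gpu = true, nlevels = 1000, shmem_supported = true)
```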

@charleskawczynski removed the draft PR label on Feb 25, 2025
@charleskawczynski force-pushed the ck/fd_shmem branch 4 times, most recently from 01d0269 to 502ff05 on February 26, 2025 13:52
@charleskawczynski (Member, Author) commented:

I realized something recently about this PR. We don't need to define shmem operators for combinations of operators, because the reads are collocated: any global memory read performed by a thread is put into a register, and that register can be reused for any part of the broadcasted object. So, I'm now thinking that it's worth moving forward with this for at least a handful of operators.

Given this, fusing operations with this pattern should basically minimize the number of reads, which is great, but it could limit how many operators we can fuse, since shared memory is a limited resource and we would need shmem allocated per operator. (A rough estimate of that limit is sketched below.)
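
A rough back-of-the-envelope estimate of that limit, assuming the common 48 KiB of shared memory per block and one face-valued Float64 buffer per fused operator (both assumptions are for illustration; actual limits depend on the GPU and launch configuration):

```julia
nlevels   = 256
bytes_per = sizeof(Float64) * (nlevels + 1)   # 2056 bytes per face-valued buffer
max_ops   = (48 * 1024) ÷ bytes_per           # ≈ 23 operator buffers per block
println(max_ops)
```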

So, it's possible that a super-fused operator (ColumnStateShmem or something) could still be better in the limit of fusing many operations. See #1763 for an initial sketch, where I was applying this to the existing apply_stencil!. I should have defined a new operator in that PR to explicitly request that behavior; that would prevent performance regressions for our existing operators.

I think a super-fused operator is possible in principle, but I'd first need to count the number of metric terms needed from the local geometry to find out at what point it would benefit us.

@dennisYatunin (Member) left a comment:

There's definitely some duplicated code patterns here, but overall this is fine to merge in as an experimental feature. In the future, we should try to figure out a way to avoid duplicating the stencil definition for each vertical operator (by somehow generalizing the stencil coefficients in operator_matrices.jl, akin to what #782 was doing), and a way to combine the vertical and horizontal shared memory interfaces (allocate_shmem, resolve_shmem, etc). As long as this duplication is confined to the CUDA extension, it shouldn't be too hard to simplify things later.

Commit: Try dont_limit on recursive resolve_shmem methods

  • Fixes + more dont limit
  • Matrix field fixes
  • Matrix field fixes
  • DivergenceF2C fix
  • MatrixField fixes
  • Qualify DivergenceF2C
  • wip
  • Refactor + fixed space bug. All seems good.
  • More tests..
  • Fixes
  • Test updates
  • Fixes
  • Allow disabling shmem using broadcast style
  • Fix fused cuda operations in LG
  • Revert some unwanted pieces
  • More fixes
  • Format
  • wip, adding docs
  • Fixes
  • Fixes
  • Apply formatter + docs
  • Always call disable_shmem_style in else-branch
  • Fix git conflict
@charleskawczynski merged commit 869cc54 into main on Mar 19, 2025
20 of 23 checks passed
@charleskawczynski deleted the ck/fd_shmem branch on March 19, 2025 17:24