Use shared memory in DivergenceF2C stencil operators #2184
Conversation
force-pushed from 82ff728 to f4b175f
I think the expectation of a 40% improvement relies on the assumption of complete L1 cache misses. If different threads within a block share faces and not much more data is needed, the effective number of reads from global memory will be fewer than 5 on average, because some values will be cached in the L1 cache of the streaming multiprocessor (which has the same latency as shared memory).
force-pushed from ba4f5dc to 3f85c1b
Even slightly complicated cases are showing a nice improvement (~2x), so it may just depend on additional factors (e.g., register pressure, or whether LLVM emits errors/traps).
force-pushed from eea802c to 3ba4ad9
force-pushed from 8d6a613 to eb6db43
Here are the preliminary results for some of the relevant benchmarks (in
Specializing on
It's notable that some of these kernels can be further optimized. For example
force-pushed from bfa8df9 to 274f646
The build that you ran in ClimaAtmos shows a significant regression in the "Benchmark: GPU prog edmf" test. I'll also leave my first impression here:
I think that this should be addressed. We will not need more than 256 levels for a global simulation, but we might want them in other settings. For example, I used more than 256 levels to study self-convergence in ClimaTimeSteppers. A restriction of 256 points on a finite difference method is very strong for most applications that are not global simulations. Moreover, this would further differentiate our CPU and GPU capabilities. (I think it'd be perfectly acceptable to have a slower path for when the number of points is more than 256, but users should still be able to run with such a configuration.)
I'll of course make sure we address the performance before merging. Support for a larger number of vertical levels is now fixed. Do you have any other comments?
Yes, but it will be much more efficient to talk in person, so I'd suggest you first look at the problem with that job, and afterwards we can schedule a call to chat about this.
Which build are you comparing against?
force-pushed from e81af4f to bbd5067
xref: CliMA/ClimaAtmos.jl#2632

Meeting 24 Feb notes:
High level goal:
Tradeoffs:
Path forward:
High level design:
Concerns:
Just to reiterate this point: the amount of duplication for the "three" code paths is really very small.
force-pushed from 01d0269 to 502ff05
I recently realized something about this PR: we don't need to define shmem operators for combinations of operators, because the read operation is collocated, and any global memory read performed by a thread is put into a register that can be reused for any part of the broadcasted object. So, I'm now thinking that it's worth moving forward with this for at least a handful of operators. Given this, fusing operations with this pattern should basically minimize the number of reads, which is great, but it could be limited in terms of how many operators we can fuse, since shared memory is a limited resource and we would need shmem allocated per operator. So, it's possible that a super fused operator (…). I think a super fused operator is possible in principle, but I'd first need to count the number of needed metric terms in the local geometry to find out at what point it will benefit us.
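Since this is a bit abstract, here is a toy, self-contained CUDA.jl sketch of the two points above. This is not ClimaCore code: the kernel name, indexing, and boundary handling are invented for illustration. It shows a thread's single global reads being held in registers and reused across a fused expression, while each stencil operator still gets its own shared-memory buffer.

```julia
using CUDA

# Toy sketch (not ClimaCore's implementation): two fused stencil operators in
# one kernel. Point 1: `fi`/`gi` are read from global memory once per thread
# and reused from registers, so no shmem operator is needed for the
# combination of operators. Point 2: each stencil operator gets its own shmem
# buffer, which is the resource that limits how many operators can be fused.
function fused_two_operators_kernel!(out, f, g, dz)
    FT = eltype(f)
    n = length(f)
    i = threadIdx().x                                  # launch with n + 1 threads
    buf_a = CuDynamicSharedArray(FT, n + 1)            # shmem for operator A
    buf_b = CuDynamicSharedArray(FT, n + 1, (n + 1) * sizeof(FT))  # shmem for operator B

    fi = i <= n ? f[i] : zero(FT)                      # one global read of f -> register
    gi = i <= n ? g[i] : zero(FT)                      # one global read of g -> register
    if 1 < i <= n
        buf_a[i] = fi - f[i - 1]                       # operator A's stencil input
        buf_b[i] = fi * gi - f[i - 1] * g[i - 1]       # operator B reuses the registers
    else
        buf_a[i] = zero(FT)                            # toy stand-in for boundary handling
        buf_b[i] = zero(FT)
    end
    sync_threads()

    if i <= n
        out[i] = (buf_a[i + 1] - buf_a[i] + buf_b[i + 1] - buf_b[i]) / dz
    end
    return nothing
end

# Example launch for a single column/block:
# n = length(f)
# @cuda threads = n + 1 shmem = 2 * (n + 1) * sizeof(eltype(f)) fused_two_operators_kernel!(out, f, g, dz)
```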
force-pushed from faff317 to 9ba1195
force-pushed from ec20f45 to 9f6a297
There are definitely some duplicated code patterns here, but overall this is fine to merge in as an experimental feature. In the future, we should try to figure out a way to avoid duplicating the stencil definition for each vertical operator (by somehow generalizing the stencil coefficients in operator_matrices.jl, akin to what #782 was doing), and a way to combine the vertical and horizontal shared memory interfaces (allocate_shmem, resolve_shmem, etc.). As long as this duplication is confined to the CUDA extension, it shouldn't be too hard to simplify things later.
force-pushed from 9f6a297 to c11a66d
This PR implements shared memory for the finite difference stencil operator DivergenceF2C. This is a step towards #1746, where the high-level design is discussed (it's basically the same design that we use for spectral element kernels).

Some notes on implementation and behavior details:

- div(grad(f)) will put the result of grad(f) into shared memory per thread (or, per point), but duplicate reads of f will still occur (thread i will read f at i+1, and thread i+1 will read f at i). To improve on this, we can write combined operators (e.g., divgrad) where shmem is managed, and this will result in a minimal number of required reads/writes. Ideally, we could automate this somehow, but I think that's the next step. One nice aspect of automating this is that we wouldn't need to introduce new terminology to users. A sketch of this pattern is included right after these notes.
- JContravariant3 ... to convince cuda to emit fused instructions (found through nsight compute).
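To make the first note concrete, here is a hedged, self-contained CUDA.jl sketch of the div(grad(f)) pattern. It is toy code, not the actual ClimaCore kernel: the kernel name, launch configuration, and boundary handling are invented for illustration. grad(f) is staged in shared memory, one face per thread, while the reads of f itself still hit global memory more than once per value.

```julia
using CUDA

# Hedged sketch of the pattern described above (not ClimaCore's kernel):
# each thread stores grad(f) at its face into shared memory, so DivergenceF2C
# can read its neighbor's face value from shmem; the reads of `f` itself are
# still duplicated across neighboring threads, which is what a combined
# `divgrad` operator would avoid.
function divgrad_shmem_kernel!(out, f, dz)
    ncenters = length(f)
    nfaces = ncenters + 1
    i = threadIdx().x                              # one thread per face
    gradf = CuDynamicSharedArray(eltype(f), nfaces)

    if 1 < i < nfaces
        gradf[i] = (f[i] - f[i - 1]) / dz          # f[i] is also read by thread i + 1
    elseif i == 1 || i == nfaces
        gradf[i] = zero(eltype(f))                 # toy stand-in for a boundary condition
    end
    sync_threads()

    if i <= ncenters
        out[i] = (gradf[i + 1] - gradf[i]) / dz    # DivergenceF2C from the shmem values
    end
    return nothing
end

# Example launch for a single column/block:
# @cuda threads = length(f) + 1 shmem = (length(f) + 1) * sizeof(eltype(f)) divgrad_shmem_kernel!(out, f, dz)
```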
Direct kernel performance impact
This comment discusses the direct impact on kernels.

Overall performance impact
Adding new operators
To add new operators, developers need to update two files (a sketch of the stubs follows this list):
- ext/cuda/operators_fd_shmem.jl, where you need to define:
  - fd_operator_shmem - allocates shared memory
  - fd_operator_fill_shmem_interior! - defines how interior stencil shmem is populated
  - fd_operator_fill_shmem_left_boundary! - defines how left boundary stencil shmem is populated
  - fd_operator_fill_shmem_right_boundary! - defines how right boundary stencil shmem is populated
  - fd_operator_evaluate - defines how to get the result of getidx from shmem
- ext/cuda/operators_fd_shmem_is_supported.jl, where you need to define fd_shmem_is_supported for the newly supported operator (and its boundary conditions)
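As a rough, hedged illustration of how these hooks fit together: the function names below come from the list above, but the operator type, argument lists, and bodies are assumptions for a made-up operator, not ClimaCore's actual signatures.

```julia
using CUDA

# Standalone toy stubs for a hypothetical operator `MyF2COp`; only the hook
# names come from the PR text -- every signature and body is an assumption.
struct MyF2COp end

# ext/cuda/operators_fd_shmem.jl analogue:
# allocate one shared-memory slot per face handled by the block
# (only valid when called from inside a CUDA kernel)
fd_operator_shmem(::MyF2COp, ::Type{FT}, nfaces) where {FT} =
    CUDA.CuDynamicSharedArray(FT, nfaces)

# store this thread's stencil input for an interior point
fd_operator_fill_shmem_interior!(::MyF2COp, shmem, idx, fval) = (shmem[idx] = fval)

# store the boundary-condition value at the left/right boundary instead
fd_operator_fill_shmem_left_boundary!(::MyF2COp, bc_val, shmem, idx) = (shmem[idx] = bc_val)
fd_operator_fill_shmem_right_boundary!(::MyF2COp, bc_val, shmem, idx) = (shmem[idx] = bc_val)

# combine shmem values into the result that getidx would otherwise compute
fd_operator_evaluate(::MyF2COp, shmem, idx, dz) = (shmem[idx + 1] - shmem[idx]) / dz

# ext/cuda/operators_fd_shmem_is_supported.jl analogue:
# opt the new operator (and its boundary conditions) into the shmem path
fd_shmem_is_supported(::MyF2COp) = true
```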