-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix distributed remapping bug #2169
Conversation
8dfe3ef
to
2cace51
Compare
It looks like that my earlier tests passed just by luck :( |
I think the best path forward with this is to boil down the reproducer (even if it means putting a flaky reproducer inside a loop). It's a bit of work, but it's often the best path to ensure progress is made on identifying the issue. |
b01d248
to
c613c90
Compare
c613c90
to
c65c44d
Compare
@@ -777,38 +754,26 @@ function _collect_interpolated_values!( | |||
if only_one_field | |||
ClimaComms.reduce!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the synchronization of the CUDA stream?
c65c44d
to
da3004b
Compare
43c5987
to
fc3099d
Compare
I spent many hours tracking down #2108 and could not find the root issue. I decided to take a different approach and simplify redefine `interpolate` in terms of `interpolate!`.
As suggested in JuliaParallel/MPI.jl#892 ``` remapper._interpolated_values[remapper.colons..., begin] ``` allocates a new copy, which can trip up CUDA's synchronization.
fc3099d
to
a3c659a
Compare
#2108 partially describes the odissey I went through to understand and fix this bug.
After discussing this with folks at MPI.jl (JuliaParallel/MPI.jl#892), a potential problem was identified:
in the
ClimaComms.reduce!
call allocates a new copy. This can mess up CUDA's synchronization, so that the correct data is not sent over. The simple solution is to useview
instead of slices.In addition to this, this PR changes
interpolate
to useinterpoalte!
, which leads much simpler code (the reason it was not done this way originally is becauseinterpolate!
is a more recent addition and behaved slightly differently).Closes #2108
Closes #2132