Fix distributed remapping bug #2169

Merged · 3 commits merged into main from gb/distributed on Feb 15, 2025

Conversation

@Sbozzolo (Member) commented Jan 30, 2025

#2108 partially describes the odyssey I went through to understand and fix this bug.

After discussing this with folks at MPI.jl (JuliaParallel/MPI.jl#892), a potential problem was identified:

remapper._interpolated_values[remapper.colons..., begin]

in the ClimaComms.reduce! call allocates a new copy. This can interfere with CUDA's synchronization, so that incorrect data ends up being sent. The simple fix is to use a view instead of a slice.
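
A minimal sketch of the difference (the `reduce!` argument layout here is an assumption for illustration, not necessarily ClimaComms's exact API):

```
# Before: bracket indexing materializes a fresh array. With CUDA, that
# copy can fall out of sync with the kernel that filled the buffer, so
# stale data may be handed to the collective.
src = remapper._interpolated_values[remapper.colons..., begin]

# After: `@view` aliases the original buffer instead of copying it, so
# the data passed to the reduction is exactly what was just computed.
src = @view remapper._interpolated_values[remapper.colons..., begin]

# Argument layout assumed for illustration.
ClimaComms.reduce!(remapper.comms_ctx, src, +)
```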

In addition, this PR redefines `interpolate` in terms of `interpolate!`, which leads to much simpler code (this was not done originally because `interpolate!` is a more recent addition and used to behave slightly differently).
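
As a sketch of that refactor (signatures simplified; `allocate_output` is a hypothetical helper), the out-of-place method just allocates a destination and delegates to the in-place one:

```
# Sketch only: ClimaCore's real signatures may differ. `allocate_output`
# is a hypothetical helper that sizes the destination buffer to whatever
# shape `interpolate!` expects.
function interpolate(remapper::Remapper, fields)
    dest = allocate_output(remapper, fields)
    interpolate!(dest, remapper, fields)
    return dest
end
```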

Closes #2108
Closes #2132

@Sbozzolo Sbozzolo force-pushed the gb/distributed branch 2 times, most recently from 8dfe3ef to 2cace51 Compare January 30, 2025 18:46
@Sbozzolo (Member, Author)

It looks like my earlier tests passed just by luck :(

@charleskawczynski (Member)

I think the best path forward here is to boil down the reproducer (even if that means putting a flaky reproducer inside a loop). It's a bit of work, but it's often the best way to ensure progress is made on identifying the issue.
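
A minimal sketch of that loop idea, with `reproduce_bug` as a hypothetical stand-in for whatever small remapping call exhibits the failure:

```
# Run a flaky reproducer many times so an intermittent failure becomes
# near-deterministic. `reproduce_bug` is hypothetical: it should return
# `true` when the remapped data comes back correct.
function hammer(reproduce_bug; iterations = 1000)
    for i in 1:iterations
        reproduce_bug() || error("failure reproduced on iteration $i")
    end
    @info "no failure in $iterations iterations"
end
```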

@Sbozzolo Sbozzolo force-pushed the gb/distributed branch 26 times, most recently from b01d248 to c613c90 Compare February 6, 2025 19:30
@@ -777,38 +754,26 @@ function _collect_interpolated_values!(
if only_one_field
ClimaComms.reduce!(

@Sbozzolo Sbozzolo enabled auto-merge February 12, 2025 20:19
@Sbozzolo Sbozzolo force-pushed the gb/distributed branch 2 times, most recently from 43c5987 to fc3099d Compare February 14, 2025 23:47
I spent many hours tracking down
#2108 and could not find the
root issue.

I decided to take a different approach and simply redefine
`interpolate` in terms of `interpolate!`.
As suggested in
JuliaParallel/MPI.jl#892
```
remapper._interpolated_values[remapper.colons..., begin]
```
allocates a new copy, which can trip up CUDA's synchronization.
@Sbozzolo Sbozzolo merged commit 6838f81 into main Feb 15, 2025
34 checks passed
@Sbozzolo Sbozzolo deleted the gb/distributed branch February 15, 2025 17:17