Issue 762 #763

Merged
merged 3 commits into from
Jul 29, 2022

@vehre vehre commented Jul 27, 2022

Summary of changes

Fix crash on certain platforms on finalize.

Rationale for changes

On finalize, OpenCoarrays was calling MPI_Win_detach on a management structure instead of on the token that had previously been passed to MPI_Win_attach. Detaching the token fixes the crash with Open MPI. The test case provided in the first commit cannot really test the fix, because the test suite does not check for crashing tests.

Additional info and certifications

This pull request (PR) is a:

  • Bug fix
  • Feature addition
  • Other, Please describe:

I certify that

  • I certify that:
    • I have reviewed and followed the contributing guidelines
    • I will wait at least 24 hours before self-approving the PR to give another
      OpenCoarrays developer a chance to review my proposed code
    • I have not introduced errant white space (no trailing white space or white space errors may
      be introduced)
    • I have added an explanation of what these changes do and why they should be included
    • I have checked to ensure there aren't other open Pull Requests for the same change
    • I have written new tests for these changes
    • I have successfully tested these changes locally
    • I have commented any non-trivial, non-obvious code changes
    • The commits are logically atomic, self consistent and coherent
    • The commit messages follow best practices
    • Test coverage is maintained or increased after this is merged

Code coverage data

coverage on master

@everythingfunctional everythingfunctional left a comment

I've played around with this for a bit now, and can confirm it solves the crashing with openmpi. I have no idea why the change fixes it for openmpi without breaking mpich, or why it originally worked with mpich but not openmpi, but it works now. 🤷‍♂️

I also figured out why I was getting incorrect results with mpich. cafrun was still using mpiexec from openmpi, even though caf did link the executable to mpich. Perhaps something to look into 🤷‍♂️? I still think the new test is valuable, so no sense in getting rid of it.

Now the bad news: we still crash with Intel MPI. I'd say we should still merge this and call it a win; we just aren't done yet.

While mpich and openmpi do not care whether memory allocated with
MPI_Alloc_mem is freed using free() or MPI_Free_mem, the Intel MPI library
crashes when the allocation and release calls do not match. Therefore use
MPI_Free_mem for all memory allocated using MPI_Alloc_mem.

vehre commented Jul 28, 2022

The Intel MPI crash should be fixed by commit 9d4afcb. At least it is on my Fedora 35 Linux system with Intel MPI 2021.6.

@everythingfunctional

I can confirm that this did solve the crashing with Intel MPI on Linux, and did improve the situation on Windows. Unfortunately, it does still crash on Windows, now with less severity it seems. The output from Windows is:

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\caf --show

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\caf" --show
C:/Users/brad/gcc/bin/gfortran.exe -I/c/Users/brad/Repositories/GitHub/sourceryinstitute/opencoarrays-install/include/OpenCoarrays-2.10.0-14-g9d4afcb_GNU-12.1.0 -fcoarray=lib ${@} /c/Users/brad/Repositories/GitHub/sourceryinstitute/opencoarrays-install/lib/libcaf_mpi.a -pthread C:/Program Files (x86)/Intel/oneAPI/mpi/latest/lib/release/impi.lib

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\cafrun --show

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" --show
C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n <number_of_images> /path/to/coarray_Fortran_program [arg4 [arg5 [...]]]

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\caf hello_coarrays.f90 -o hello_coarrays

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\caf" hello_coarrays.f90 -o hello_coarrays

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\cafrun -n 4 .\hello_coarrays.exe

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" -n 4 .\hello_coarrays.exe
           1           1
           3           3
           4           4
           2           2

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 6520 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1073740940 (c0000374)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 444 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 4796 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 8996 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================
Error: Command:
   `C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n 4 .\hello_coarrays.exe`
failed to run.

I'm still of the opinion that if solving this last issue is much more effort, we can go ahead and merge this and solve the last thing as a separate PR. Up to you though.
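For reading the report above: the exit status shown for rank 0, -1073740940 (c0000374), is the signed 32-bit view of the Windows status code 0xC0000374, which Windows documents as STATUS_HEAP_CORRUPTION. A small C sketch of that conversion follows; the macro and function names are illustrative only.

```c
#include <stdint.h>

/* Value of the Windows status code STATUS_HEAP_CORRUPTION.
 * The macro name is chosen for this sketch. */
#define DEMO_STATUS_HEAP_CORRUPTION 0xC0000374u

/* Reinterpret a signed 32-bit exit status as the unsigned status code.
 * Conversion to an unsigned type is well defined in C (modulo 2^32). */
uint32_t exit_status_to_code(int32_t status)
{
    return (uint32_t)status;
}

/* Returns 1 if the exit status corresponds to heap corruption, else 0. */
int is_heap_corruption(int32_t status)
{
    return exit_status_to_code(status) == DEMO_STATUS_HEAP_CORRUPTION;
}
```

With this reading, rank 0 ended with a heap corruption status, while the -1 (ffffffff) statuses of the other ranks carry no comparable detail.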

vehre commented Jul 29, 2022

I tried to debug this under Windows, but I see a different error:

f:\dd\vctools\crt\crtw32\misc\dbgheap.c(1322) : Assertion failed: _CrtIsValidHeapPointer(pUserData)

This seems to be runtime related, and I have no clue how to debug it on Windows. So let's merge the existing fixes and, if this is important, do another round.

@everythingfunctional everythingfunctional merged commit 9123d92 into main Jul 29, 2022
@everythingfunctional everythingfunctional deleted the issue-762 branch July 29, 2022 12:29