Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defect: allocate_as_barrier_proc test still fails intermittently #415

Closed
zbeekman opened this issue Jul 17, 2017 · 6 comments
Closed

Defect: allocate_as_barrier_proc test still fails intermittently #415

zbeekman opened this issue Jul 17, 2017 · 6 comments

Comments

@zbeekman
Copy link
Collaborator

Avg response time
Issue Stats

Defect/Bug Report

  • OpenCoarrays Version: master/c0bbc16178bf2dd65402f88d399952f0d131c53b
  • Fortran Compiler: GFortran 7.1
  • C compiler used for building lib: GCC 7.1
  • Installation method: Travis-CI tests
  • Output of uname -a: Good question, I should add this to .travis.yml... dockerized MacOS
  • MPI library being used: MPICH 3.2
  • Machine architecture and number of physical cores: virtual x86_64
  • Version of CMake: 3.4

Observed Behavior

Test allocate_as_barrier_proc fails with:

      Start  6: allocate_as_barrier_proc
      Test  #6: allocate_as_barrier_proc .........***Failed  Required regular expression not found.Regex=[Test passed.
]  1.12 sec
 Test failed.           0

(https://travis-ci.org/sourceryinstitute/OpenCoarrays/jobs/253477906#L3460)

Expected Behavior

Test passes

Steps to Reproduce

Run on CI and repeat test a lot with CTest. @afanfa or @vehre if you want me to show you how to do this, please let me know.

@zbeekman
Copy link
Collaborator Author

zbeekman commented Aug 9, 2017

Here is a link to a CI job that I won't rerun, that shows the error: https://travis-ci.org/sourceryinstitute/OpenCoarrays/jobs/262671942#L2947

@vehre
Copy link
Collaborator

vehre commented Aug 11, 2017

Well, this is difficult to reproduce. On Linux I have no problem when running the test multiple times nor when running it with increasing number of processes 2-64 (more won't work on my machine: overload).
Furthermore did I try to start the test in parallel, but even starting it 40-times with 8 processes each did not produce the error.
I am using mpich 3.2.6.fc25. They have some small additional patches, but nothing that touches inner MPI routines. So I am tempted to exclude the slightly different version from the cause.

@zbeekman
Copy link
Collaborator Author

Yes I agree that this is very difficult to debug. I mostly just want to track it and document it here. Unless some insight can be gained from inspecting the code responsible for implicit synchronization of subroutines that do allocation, we can just sit on this issue---at least until it's observed in the wild.

@zbeekman zbeekman added the ready label Aug 30, 2017
@zbeekman
Copy link
Collaborator Author

zbeekman commented Feb 9, 2018

This seems to happen more when compiling with clang, or using OpenMPI

@zbeekman
Copy link
Collaborator Author

zbeekman commented May 1, 2018

Here is a more helpful debug log, but may not have all of the required information: (See https://travis-ci.org/sourceryinstitute/OpenCoarrays/jobs/373497931#L1733)

      Test  #6: allocate_as_barrier_proc .........................................................***Failed  Required regular expression not found.Regex=[Test passed.
]  2.84 sec
1/8: Entering sync all.
3/8: Entering sync all.
7/8: Entering sync all.
2/8: Entering sync all.
6/8: Entering sync all.
5/8: Entering sync all.
8/8: Entering sync all.
4/8: Entering sync all.
1/8: _gfortran_caf_sync_all: MPI_Barrier = 0.
2/8: _gfortran_caf_sync_all: MPI_Barrier = 0.
2/8: Leaving sync all.
2/8: _gfortran_caf_get() src_vector = (nil), image_index = 1, offset = 4.
2/8: deregister(0x243dc50)
1/8: Leaving sync all.
1/8: deregister(0xf0dad0)
3/8: _gfortran_caf_sync_all: MPI_Barrier = 0.
3/8: Leaving sync all.
3/8: _gfortran_caf_get() src_vector = (nil), image_index = 2, offset = 4.
3/8: deregister(0x1584c50)
4/8: _gfortran_caf_sync_all: MPI_Barrier = 0.
5/8: _gfortran_caf_sync_all: MPI_Barrier = 0.
5/8: Leaving sync all.
5/8: _gfortran_caf_get() src_vector = (nil), image_index = 4, offset = 4.
5/8: deregister(0x1059b50)
4/8: Leaving sync all.
4/8: _gfortran_caf_get() src_vector = (nil), image_index = 3, offset = 4.
4/8: deregister(0x1fbfc50)
7/8: _gfortran_caf_sync_all: MPI_Barrier = 0.
7/8: Leaving sync all.
7/8: _gfortran_caf_get() src_vector = (nil), image_index = 6, offset = 4.
7/8: deregister(0x16a9c50)
8/8: _gfortran_caf_sync_all: MPI_Barrier = 0.
8/8: Leaving sync all.
8/8: _gfortran_caf_get() src_vector = (nil), image_index = 7, offset = 4.
8/8: deregister(0x16e6c50)
6/8: _gfortran_caf_sync_all: MPI_Barrier = 0.
6/8: Leaving sync all.
6/8: _gfortran_caf_get() src_vector = (nil), image_index = 5, offset = 4.
6/8: deregister(0x8e6c50)
1/8: finalize_internal(status_code = 0)
1/8: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
 Test failed.           0
2/8: finalize_internal(status_code = 0)
2/8: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
3/8: finalize_internal(status_code = 0)
3/8: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
5/8: finalize_internal(status_code = 0)
5/8: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
4/8: finalize_internal(status_code = 0)
4/8: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
7/8: finalize_internal(status_code = 0)
8/8: finalize_internal(status_code = 0)
7/8: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
8/8: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
6/8: finalize_internal(status_code = 0)
6/8: finalize_internal: Before MPI_Barrier (CAF_COMM_WORLD)
1/8: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 0
3/8: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 0
3/8: finalize(): Freeed all slave tokens.
1/8: finalize(): Freeed all slave tokens.
4/8: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 0
4/8: finalize(): Freeed all slave tokens.
5/8: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 0
5/8: finalize(): Freeed all slave tokens.
7/8: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 0
8/8: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 0
8/8: finalize(): Freeed all slave tokens.
7/8: finalize(): Freeed all slave tokens.
6/8: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 0
6/8: finalize(): Freeed all slave tokens.
2/8: finalize_internal: After MPI_Barrier (CAF_COMM_WORLD) = 0
2/8: finalize(): Freeed all slave tokens.
1/8: finalize_internal: before Win_unlock_all.
2/8: finalize_internal: before Win_unlock_all.
2/8: finalize_internal: before Win_free(stat_tok)
1/8: finalize_internal: before Win_free(stat_tok)
5/8: finalize_internal: before Win_unlock_all.
5/8: finalize_internal: before Win_free(stat_tok)
6/8: finalize_internal: before Win_unlock_all.
7/8: finalize_internal: before Win_unlock_all.
7/8: finalize_internal: before Win_free(stat_tok)
6/8: finalize_internal: before Win_free(stat_tok)
3/8: finalize_internal: before Win_unlock_all.
8/8: finalize_internal: before Win_unlock_all.
8/8: finalize_internal: before Win_free(stat_tok)
3/8: finalize_internal: before Win_free(stat_tok)
4/8: finalize_internal: before Win_unlock_all.
4/8: finalize_internal: before Win_free(stat_tok)
1/8: finalize_internal: before Comm_free(CAF_COMM_WORLD)
2/8: finalize_internal: before Comm_free(CAF_COMM_WORLD)
2/8: finalize_internal: after Comm_free(CAF_COMM_WORLD)
1/8: finalize_internal: after Comm_free(CAF_COMM_WORLD)
5/8: finalize_internal: before Comm_free(CAF_COMM_WORLD)
5/8: finalize_internal: after Comm_free(CAF_COMM_WORLD)
6/8: finalize_internal: before Comm_free(CAF_COMM_WORLD)
7/8: finalize_internal: before Comm_free(CAF_COMM_WORLD)
7/8: finalize_internal: after Comm_free(CAF_COMM_WORLD)
6/8: finalize_internal: after Comm_free(CAF_COMM_WORLD)
8/8: finalize_internal: before Comm_free(CAF_COMM_WORLD)
3/8: finalize_internal: before Comm_free(CAF_COMM_WORLD)
3/8: finalize_internal: after Comm_free(CAF_COMM_WORLD)
8/8: finalize_internal: after Comm_free(CAF_COMM_WORLD)
4/8: finalize_internal: before Comm_free(CAF_COMM_WORLD)
4/8: finalize_internal: after Comm_free(CAF_COMM_WORLD)
5/8: finalize_internal: Finalisation done!!!
7/8: finalize_internal: Finalisation done!!!
6/8: finalize_internal: Finalisation done!!!
3/8: finalize_internal: Finalisation done!!!
1/8: finalize_internal: Finalisation done!!!
2/8: finalize_internal: Finalisation done!!!
8/8: finalize_internal: Finalisation done!!!
4/8: finalize_internal: Finalisation done!!!

@zbeekman
Copy link
Collaborator Author

zbeekman commented Jul 1, 2018

This appears to be fixed on GFortran >= 7. I'm going to close this after adding having CMake disable this test

if( ( NOT CMAKE_Fortran_COMPILER_VERSION VERSION_LESS 7.0.0 ) OR OPENCOARRAYS_DEVELOPER )

Here is the debug output for GCC 6.4:

21/60 Test  #6: allocate_as_barrier_proc .........................................................***Failed  Required regular expression not found.Regex=[Test passed.
]  3.24 sec
5/8: _gfortran_caf_sync_all(1537) Entering sync all.
8/8: _gfortran_caf_sync_all(1537) Entering sync all.
1/8: _gfortran_caf_sync_all(1537) Entering sync all.
4/8: _gfortran_caf_sync_all(1537) Entering sync all.
7/8: _gfortran_caf_sync_all(1537) Entering sync all.
2/8: _gfortran_caf_sync_all(1537) Entering sync all.
3/8: _gfortran_caf_sync_all(1537) Entering sync all.
6/8: _gfortran_caf_sync_all(1537) Entering sync all.
1/8: _gfortran_caf_sync_all(1553) MPI_Barrier = 0.
2/8: _gfortran_caf_sync_all(1553) MPI_Barrier = 0.
2/8: _gfortran_caf_sync_all(1586) Leaving sync all.
2/8: _gfortran_caf_get(3393) src_vector = (nil), image_index = 1, offset = 4.
2/8: _gfortran_caf_deregister(1391) deregister(0x1d3eb90)
1/8: _gfortran_caf_sync_all(1586) Leaving sync all.
1/8: _gfortran_caf_deregister(1391) deregister(0x8fab10)
3/8: _gfortran_caf_sync_all(1553) MPI_Barrier = 0.
3/8: _gfortran_caf_sync_all(1586) Leaving sync all.
3/8: _gfortran_caf_get(3393) src_vector = (nil), image_index = 2, offset = 4.
3/8: _gfortran_caf_deregister(1391) deregister(0x165bc90)
5/8: _gfortran_caf_sync_all(1553) MPI_Barrier = 0.
5/8: _gfortran_caf_sync_all(1586) Leaving sync all.
5/8: _gfortran_caf_get(3393) src_vector = (nil), image_index = 4, offset = 4.
5/8: _gfortran_caf_deregister(1391) deregister(0x18c5c90)
6/8: _gfortran_caf_sync_all(1553) MPI_Barrier = 0.
6/8: _gfortran_caf_sync_all(1586) Leaving sync all.
6/8: _gfortran_caf_get(3393) src_vector = (nil), image_index = 5, offset = 4.
6/8: _gfortran_caf_deregister(1391) deregister(0xbe7b90)
4/8: _gfortran_caf_sync_all(1553) MPI_Barrier = 0.
4/8: _gfortran_caf_sync_all(1586) Leaving sync all.
4/8: _gfortran_caf_get(3393) src_vector = (nil), image_index = 3, offset = 4.
4/8: _gfortran_caf_deregister(1391) deregister(0x1542c90)
7/8: _gfortran_caf_sync_all(1553) MPI_Barrier = 0.
7/8: _gfortran_caf_sync_all(1586) Leaving sync all.
7/8: _gfortran_caf_get(3393) src_vector = (nil), image_index = 6, offset = 4.
7/8: _gfortran_caf_deregister(1391) deregister(0x23e2c90)
8/8: _gfortran_caf_sync_all(1553) MPI_Barrier = 0.
8/8: _gfortran_caf_sync_all(1586) Leaving sync all.
8/8: _gfortran_caf_get(3393) src_vector = (nil), image_index = 7, offset = 4.
8/8: _gfortran_caf_deregister(1391) deregister(0x20a4c90)
1/8: finalize_internal(930) (status_code = 0)
 Test failed.           0
2/8: finalize_internal(930) (status_code = 0)
2/8: finalize_internal(974) Before MPI_Barrier(CAF_COMM_WORLD)
1/8: finalize_internal(974) Before MPI_Barrier(CAF_COMM_WORLD)
5/8: finalize_internal(930) (status_code = 0)
5/8: finalize_internal(974) Before MPI_Barrier(CAF_COMM_WORLD)
6/8: finalize_internal(930) (status_code = 0)
6/8: finalize_internal(974) Before MPI_Barrier(CAF_COMM_WORLD)
3/8: finalize_internal(930) (status_code = 0)
4/8: finalize_internal(930) (status_code = 0)
3/8: finalize_internal(974) Before MPI_Barrier(CAF_COMM_WORLD)
4/8: finalize_internal(974) Before MPI_Barrier(CAF_COMM_WORLD)
7/8: finalize_internal(930) (status_code = 0)
7/8: finalize_internal(974) Before MPI_Barrier(CAF_COMM_WORLD)
8/8: finalize_internal(930) (status_code = 0)
8/8: finalize_internal(974) Before MPI_Barrier(CAF_COMM_WORLD)
1/8: finalize_internal(976) After MPI_Barrier(CAF_COMM_WORLD) = 0
1/8: finalize_internal(1016) Freed all slave tokens.
2/8: finalize_internal(976) After MPI_Barrier(CAF_COMM_WORLD) = 0
2/8: finalize_internal(1016) Freed all slave tokens.
5/8: finalize_internal(976) After MPI_Barrier(CAF_COMM_WORLD) = 0
5/8: finalize_internal(1016) Freed all slave tokens.
6/8: finalize_internal(976) After MPI_Barrier(CAF_COMM_WORLD) = 0
6/8: finalize_internal(1016) Freed all slave tokens.
3/8: finalize_internal(976) After MPI_Barrier(CAF_COMM_WORLD) = 0
3/8: finalize_internal(1016) Freed all slave tokens.
4/8: finalize_internal(976) After MPI_Barrier(CAF_COMM_WORLD) = 0
4/8: finalize_internal(1016) Freed all slave tokens.
7/8: finalize_internal(976) After MPI_Barrier(CAF_COMM_WORLD) = 0
7/8: finalize_internal(1016) Freed all slave tokens.
8/8: finalize_internal(976) After MPI_Barrier(CAF_COMM_WORLD) = 0
8/8: finalize_internal(1016) Freed all slave tokens.
1/8: finalize_internal(1048) before Win_unlock_all.
1/8: finalize_internal(1050) before Win_free(stat_tok)
5/8: finalize_internal(1048) before Win_unlock_all.
5/8: finalize_internal(1050) before Win_free(stat_tok)
2/8: finalize_internal(1048) before Win_unlock_all.
2/8: finalize_internal(1050) before Win_free(stat_tok)
6/8: finalize_internal(1048) before Win_unlock_all.
6/8: finalize_internal(1050) before Win_free(stat_tok)
3/8: finalize_internal(1048) before Win_unlock_all.
3/8: finalize_internal(1050) before Win_free(stat_tok)
4/8: finalize_internal(1048) before Win_unlock_all.
4/8: finalize_internal(1050) before Win_free(stat_tok)
7/8: finalize_internal(1048) before Win_unlock_all.
7/8: finalize_internal(1050) before Win_free(stat_tok)
8/8: finalize_internal(1048) before Win_unlock_all.
8/8: finalize_internal(1050) before Win_free(stat_tok)
1/8: finalize_internal(1052) before Comm_free(CAF_COMM_WORLD)
1/8: finalize_internal(1055) after Comm_free(CAF_COMM_WORLD)
5/8: finalize_internal(1052) before Comm_free(CAF_COMM_WORLD)
5/8: finalize_internal(1055) after Comm_free(CAF_COMM_WORLD)
3/8: finalize_internal(1052) before Comm_free(CAF_COMM_WORLD)
3/8: finalize_internal(1055) after Comm_free(CAF_COMM_WORLD)
2/8: finalize_internal(1052) before Comm_free(CAF_COMM_WORLD)
2/8: finalize_internal(1055) after Comm_free(CAF_COMM_WORLD)
4/8: finalize_internal(1052) before Comm_free(CAF_COMM_WORLD)
4/8: finalize_internal(1055) after Comm_free(CAF_COMM_WORLD)
7/8: finalize_internal(1052) before Comm_free(CAF_COMM_WORLD)
7/8: finalize_internal(1055) after Comm_free(CAF_COMM_WORLD)
6/8: finalize_internal(1052) before Comm_free(CAF_COMM_WORLD)
6/8: finalize_internal(1055) after Comm_free(CAF_COMM_WORLD)
8/8: finalize_internal(1052) before Comm_free(CAF_COMM_WORLD)
8/8: finalize_internal(1055) after Comm_free(CAF_COMM_WORLD)
4/8: finalize_internal(1082) Finalisation done!!!
1/8: finalize_internal(1082) Finalisation done!!!
5/8: finalize_internal(1082) Finalisation done!!!
2/8: finalize_internal(1082) Finalisation done!!!
7/8: finalize_internal(1082) Finalisation done!!!
3/8: finalize_internal(1082) Finalisation done!!!
6/8: finalize_internal(1082) Finalisation done!!!
8/8: finalize_internal(1082) Finalisation done!!!

@ghost ghost removed the ready label Jul 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants