Fix Singletons and Singleton Spawn #10688

Merged
merged 1 commit into from
Aug 23, 2022

jjhursey
Member

  • Fixes Singleton MPI initialization and spawn #10590
  • Singletons will not have a PMIx value for PMIX_LOCAL_PEERS,
    so make that key optional instead of required.
  • & is being parsed as an application argument by prte
    instead of as the shell background character.
    • Replace it with --daemonize, which is probably better anyway.
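
The difference between a required and an optional key can be sketched as follows. This is a minimal, self-contained illustration and not the code changed in this PR: `kv_entry_t`, `kv_lookup`, `init_required`, and `init_optional` are hypothetical names, the key is matched as a plain string, and the real PMIx API is not used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a PMIx key/value store; NOT the real PMIx API. */
typedef struct {
    const char *key;
    const char *value;
} kv_entry_t;

enum { LOOKUP_OK = 0, LOOKUP_NOT_FOUND = -1 };

/* Search the store for key; on success point *value at the stored string. */
static int kv_lookup(const kv_entry_t *store, size_t n,
                     const char *key, const char **value)
{
    assert(value != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(store[i].key, key) == 0) {
            *value = store[i].value;
            return LOOKUP_OK;
        }
    }
    return LOOKUP_NOT_FOUND;
}

/* Required semantics: a missing key is reported as an initialization error. */
static int init_required(const kv_entry_t *store, size_t n, const char **peers)
{
    return kv_lookup(store, n, "PMIX_LOCAL_PEERS", peers);
}

/* Optional semantics: a missing key falls back to "no local peers known"
 * (*peers == NULL) and initialization still succeeds. */
static int init_optional(const kv_entry_t *store, size_t n, const char **peers)
{
    if (kv_lookup(store, n, "PMIX_LOCAL_PEERS", peers) != LOOKUP_OK) {
        *peers = NULL;
    }
    return LOOKUP_OK;
}
```

With a store that lacks the key, as assumed here for a singleton, `init_required` returns `LOOKUP_NOT_FOUND` while `init_optional` succeeds and leaves `*peers` set to `NULL`.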

Signed-off-by: Joshua Hursey <[email protected]>
@jjhursey jjhursey marked this pull request as ready for review August 18, 2022 21:17
@jjhursey
Member Author

@awlauria We will need to sync the prrte submodule pointer to pick up openpmix/prrte#1443

@jjhursey jjhursey requested a review from awlauria August 18, 2022 21:17
@rhc54
Contributor

rhc54 commented Aug 19, 2022

FWIW: I have fixed the & confusion in the PMIx command line parser (along with other things) in openpmix/openpmix#2694. You need to update the PMIx submodule pointer once that has been committed, and add #10695, to fix #10691.

What a tangled web we weave!

@jsquyres
Member

@jjhursey When singletons are fully fixed, please merge open-mpi/ompi-scripts#62 so that singleton tests are added to the OMPI Jenkins CI.

@jjhursey
Member Author

Hold this PR for combined testing. I'm trying to align the 3 repos to get a current view of the state of this work.

I'm testing with Open MPI main:

shell$ git submodule status
+8cb6f58fe074efde0239aa4567854742223ab1a9 3rd-party/openpmix (v4.2.0-3-g8cb6f58f)
+b29abde61c618de28f6a4c181e8c28f68e332969 3rd-party/prrte (v3.0.0rc1-18-gb29abde6)

With these changes, things seem to be failing again :( I'm investigating.

@jjhursey
Member Author

Tested with Open MPI main:

shell$ git submodule status
+e3b925f82d2a59f58c60c7d7a7b5a71eda7d41ae ../../3rd-party/openpmix (v1.1.3-3598-ge3b925f8)
+12bb6c7dd6df522a38ed611c1fa4cf2dc9ea1761 ../../3rd-party/prrte (psrvr-v2.0.0rc1-4410-g12bb6c7d)

So something must be missing from the OpenPMIx and/or PRRTE release branches.

For Open MPI, since it uses the master branch of those two projects, this PR is fine to merge.

@jjhursey
Member Author

FYI: I pushed my set of tests in open-mpi/ompi-tests-public#20

@jjhursey
Member Author

jjhursey commented Aug 22, 2022

  • ✅ Open MPI main with OpenPMIx master and PRRTE master works.
  • 💥 Open MPI main with OpenPMIx v4.2 and PRRTE v3.0 does not work.
  • 💥 Open MPI main with OpenPMIx v4.2 and PRRTE master does not work.
  • ✅ Open MPI main with OpenPMIx master and PRRTE v3.0 works.

So it looks like an OpenPMIx issue - probably a missing commit from master. 👀
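
The deduction from the four results above can be sketched in code: a component is the likely culprit if the outcome is fully determined by which branch of that component was used. This is only an illustration of the reasoning; `result_t`, `matrix`, and `determines_outcome` are hypothetical names, and the table simply restates the four bullets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One row of the test matrix from the comment above. */
typedef struct {
    const char *pmix;   /* OpenPMIx branch */
    const char *prrte;  /* PRRTE branch */
    bool works;         /* did spawn work with Open MPI main? */
} result_t;

static const result_t matrix[] = {
    {"master", "master", true},
    {"v4.2",   "v3.0",   false},
    {"v4.2",   "master", false},
    {"master", "v3.0",   true},
};
static const size_t matrix_len = sizeof(matrix) / sizeof(matrix[0]);

/* Return true if the outcome is fully determined by one component:
 * every pair of rows that shares a branch of that component also
 * shares the same result. use_pmix selects OpenPMIx, else PRRTE. */
static bool determines_outcome(const result_t *rows, size_t n, bool use_pmix)
{
    assert(rows != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            const char *a = use_pmix ? rows[i].pmix : rows[i].prrte;
            const char *b = use_pmix ? rows[j].pmix : rows[j].prrte;
            if (strcmp(a, b) == 0 && rows[i].works != rows[j].works) {
                return false;
            }
        }
    }
    return true;
}
```

Calling `determines_outcome(matrix, matrix_len, true)` holds for OpenPMIx, while the same call with `false` (PRRTE) does not, since PRRTE master appears in both a passing and a failing row.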

Here is the error message from the spawn test:

shell$ ./simple_spawn ./simple_spawn
[f5n18:3003503] PMIX ERROR: PROC-ENTRY-NOT-FOUND in file server/pmix_server.c at line 3588
[f5n18:3003493] pml_ucx.c:191  Error: Failed to receive UCX worker address: Take next option (-46)
[f5n18:3003493] OPAL ERROR: Error in file dpm/dpm.c at line 480

@jjhursey
Member Author

OK, we resolved the issue with OpenPMIx v4.2 as recommended in openpmix/openpmix#2705 (comment). I posted #10700 with the change.

Once Open MPI main has the following merged then spawn should work correctly.

Once verified, we will PR these back to Open MPI v5.x.

@jjhursey jjhursey merged commit 2659da6 into open-mpi:main Aug 23, 2022
@jjhursey jjhursey deleted the fix-singleton-spawn branch August 23, 2022 15:50