
Mapping ranks evenly across nodes and across resources within nodes with MPI+OMP codes (--map-by Res:PE=n:span is broken) #13143

Open
drmichaeltcvx opened this issue Mar 14, 2025 · 60 comments
@drmichaeltcvx

Is your feature request related to a problem? Please describe.
I am requesting the following mapping capability: map ranks of MPI+OMP codes as evenly as possible across nodes, and within each node spread the ranks as evenly as possible over NUMA domains or L3 caches. This is a show-stopper issue for us.

Also, mapping by L3cache on AMD Zen4 systems (where 8 cores share an L3 cache) is incorrectly treated the same as mapping by memory (NUMA) domain.

Describe the solution you'd like
This is an HPC use case: memory-bandwidth (or L3-cache-capacity) limited codes benefit from spreading ranks evenly over as many memory controllers (or L3 caches) as possible. Examples of such codes are large numerical simulations, including finite-difference, finite-element, and finite-volume methods. They traverse large swaths of memory, easily overwhelming L3 capacity (which limits the effectiveness of cache blocking in loops), and are almost exclusively DRAM-bandwidth limited. They benefit most from spreading ranks over as many memory controllers as are available. In this scenario users want to launch MPI+OMP ranks onto a subset of the available cores so as not to exceed the bandwidth each controller can serve to the ranks mapped to it.

Unfortunately, Open MPI's mapping logic is missing the capability to spread ranks with OMP threads as evenly as possible over nodes and then over NUMA domains or L3 caches within nodes. When we use a "--map-by resource:span" or a "--map-by ppr:_N_:resource:span" clause, the mapping selects the resources in sequence and correctly allocates 1 core (slot) to each rank. However, for MPI+OMP codes we need to allocate as many cores to each rank as there are OMP threads. The syntax for this would be "--map-by resource:PE=n:span" (n OMP threads), but this bunches all ranks onto the first set of nodes that can satisfy the request instead of spreading the ranks evenly across nodes.

For instance, assume 2 nodes, each with 176 cores, 2 sockets, and 4 memory domains. If we want to map, say, 16 ranks evenly over NUMA domains we can use

mpirun $CLIMPIOPTS --map-by numa:span -np 16 hostname

which correctly maps the 16 ranks evenly across nodes and NUMA domains (the analogous --map-by socket:span run is shown below):
```
$ mpirun $CLIMPIOPTS --map-by socket:span -np 16 hostname
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 1 bound to socket 1[core 88[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 2 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 3 bound to socket 1[core 89[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 4 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 5 bound to socket 1[core 90[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 6 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 7 bound to socket 1[core 91[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 8 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]

[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 9 bound to socket 1[core 88[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]

[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 10 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]

[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 11 bound to socket 1[core 89[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]

[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 12 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]

[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 13 bound to socket 1[core 90[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]

[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 14 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 15 bound to socket 1[core 91[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]

```

However, when we have, say, 2 OMP threads per rank, using

mpirun $CLIMPIOPTS --map-by numa:PE=2:span -np 16 hostname
maps all ranks to the first node:

```
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 1 bound to socket 0[core 44[hwt 0]], socket 0[core 45[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 2 bound to socket 1[core 88[hwt 0]], socket 1[core 89[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 3 bound to socket 1[core 132[hwt 0]], socket 1[core 133[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 4 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 5 bound to socket 0[core 46[hwt 0]], socket 0[core 47[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 6 bound to socket 1[core 90[hwt 0]], socket 1[core 91[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 7 bound to socket 1[core 134[hwt 0]], socket 1[core 135[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 9 bound to socket 0[core 48[hwt 0]], socket 0[core 49[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 10 bound to socket 1[core 92[hwt 0]], socket 1[core 93[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 11 bound to socket 1[core 136[hwt 0]], socket 1[core 137[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 12 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 13 bound to socket 0[core 50[hwt 0]], socket 0[core 51[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 14 bound to socket 1[core 94[hwt 0]], socket 1[core 95[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 15 bound to socket 1[core 138[hwt 0]], socket 1[core 139[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././.]

```

Describe alternatives you've considered
The meaningful mapping of MPI+OMP codes, for even spreading over all resources, would be for the
--map-by numa:PE=_n_:span clause to allocate n cores per rank while reproducing the placement produced by the same clause with `PE=n` absent.
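A minimal sketch of the intended invocation (OMP_NUM_THREADS and the application name ./app are illustrative; the counts assume the 2-node, 4-NUMA-domain-per-node systems described above):

```
# Intended result: 16 ranks, 2 cores (OMP threads) each, spread evenly across
# both nodes and then across the NUMA domains within each node -- the same
# placement that "--map-by numa:span" produces for the 1-core-per-rank case.
export OMP_NUM_THREADS=2
mpirun --map-by numa:PE=2:span --bind-to core -np 16 ./app
```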

Additional context
This is a show-stopper for us, as we cannot control the placement of ranks in MPI+OMP codes so as to use as many memory controllers as evenly as possible.

@drmichaeltcvx
Author

Using --cpu-set <cpu number list> --bind-to core binds to all cores on the node and not to the specified CPUs in the list.
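An illustrative form of the failing invocation (the CPU list and the application are hypothetical):

```
# Request binding to an explicit set of cores
mpirun -np 4 --cpu-set 0,1,2,3 --bind-to core ./app
# Observed: each rank reports being bound to all cores on the node,
# not to cores 0-3
```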

@drmichaeltcvx drmichaeltcvx changed the title Mapping ranks evenly across nodes and across resources within nodes with MPI+OMP codes Mapping ranks evenly across nodes and across resources within nodes with MPI+OMP codes (--map-by Res:PE=n:span is broken) Mar 14, 2025
@rhc54
Contributor

rhc54 commented Mar 14, 2025

Yeah, that's not a new feature, just a bug. Could you please tell us what version of OMPI you are using?
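For instance, either of these reports it:

```
# Report the Open MPI version in use
mpirun --version
ompi_info | grep "Open MPI:"
```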

@rhc54
Contributor

rhc54 commented Mar 15, 2025

Using --cpu-set <cpu number list> --bind-to core binds to all cores on the node and not to the specified CPUs in the list.

This may just be an older version - here's what I get when testing with a complex topology on my system:

$ mpirun --prtemca hwloc_use_topo_file xxx.xml --runtime-options donotlaunch --display map-devel -n 4 --cpu-set 1,2,3,4 --bind-to core hostname

=================================   JOB MAP   =================================
Data for JOB prterun-Ralphs-iMac-41243@1 offset 0 Total slots allocated 48
Mapper requested: NULL  Last mapper: round_robin  Mapping policy: PE-LIST:NOOVERSUBSCRIBE  Ranking policy: SLOT
Binding policy: CORE  Cpu set: 1,2,3,4  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE
Num new daemons: 0	New daemon starting vpid INVALID
Num nodes: 1

Data for node: Ralphs-iMac	State: 3	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:MAPPED:SLOTS_GIVEN
                resolved from Ralphs-iMac.local
                resolved from Ralphs-iMac
        Daemon: [prterun-Ralphs-iMac-41243@0,0]	Daemon launched: True
            Num slots: 48	Slots in use: 4	Oversubscribed: FALSE
            Num slots allocated: 48	Max slots: 0	Num procs: 4
        Data for proc: [prterun-Ralphs-iMac-41243@1,0]
                Pid: 0	Local rank: 0	Node rank: 0	App rank: 0
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:1-4]
        Data for proc: [prterun-Ralphs-iMac-41243@1,1]
                Pid: 0	Local rank: 1	Node rank: 1	App rank: 1
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:1-4]
        Data for proc: [prterun-Ralphs-iMac-41243@1,2]
                Pid: 0	Local rank: 2	Node rank: 2	App rank: 2
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:1-4]
        Data for proc: [prterun-Ralphs-iMac-41243@1,3]
                Pid: 0	Local rank: 3	Node rank: 3	App rank: 3
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:1-4]

The above was with the head of the PRRTE v3.0 branch, which is in the upcoming OMPI v5.0.x release. Looks like it all worked as expected.

@rhc54
Contributor

rhc54 commented Mar 15, 2025

Likewise, the original concern with the "pe=2:span" modifiers seems to work as well for two nodes:

$ mpirun --prtemca hwloc_use_topo_file xxx.xml --runtime-options donotlaunch --display map-devel --map-by numa:PE=2:span   -np 16 hostname

=================================   JOB MAP   =================================
Data for JOB prterun-Ralphs-iMac-41412@1 offset 0 Total slots allocated 96
Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYNUMA:NOOVERSUBSCRIBE:SPAN  Ranking policy: SPAN
Binding policy: CORE:IF-SUPPORTED  Cpu set: N/A  PPR: N/A  Cpus-per-rank: 2  Cpu Type: CORE
Num new daemons: 0	New daemon starting vpid INVALID
Num nodes: 2

Data for node: nodeA0	State: 3	Flags: MAPPED:SLOTS_GIVEN
        Daemon: [prterun-Ralphs-iMac-41412@0,1]	Daemon launched: False
            Num slots: 48	Slots in use: 8	Oversubscribed: FALSE
            Num slots allocated: 48	Max slots: 48	Num procs: 8
        Data for proc: [prterun-Ralphs-iMac-41412@1,0]
                Pid: 0	Local rank: 0	Node rank: 0	App rank: 0
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:0-1]
        Data for proc: [prterun-Ralphs-iMac-41412@1,1]
                Pid: 0	Local rank: 1	Node rank: 1	App rank: 1
                State: INITIALIZED	App_context: 0
        	Binding: package[1][core:24-25]
        Data for proc: [prterun-Ralphs-iMac-41412@1,4]
                Pid: 0	Local rank: 2	Node rank: 2	App rank: 4
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:2-3]
        Data for proc: [prterun-Ralphs-iMac-41412@1,5]
                Pid: 0	Local rank: 3	Node rank: 3	App rank: 5
                State: INITIALIZED	App_context: 0
        	Binding: package[1][core:26-27]
        Data for proc: [prterun-Ralphs-iMac-41412@1,8]
                Pid: 0	Local rank: 4	Node rank: 4	App rank: 8
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:4-5]
        Data for proc: [prterun-Ralphs-iMac-41412@1,9]
                Pid: 0	Local rank: 5	Node rank: 5	App rank: 9
                State: INITIALIZED	App_context: 0
        	Binding: package[1][core:28-29]
        Data for proc: [prterun-Ralphs-iMac-41412@1,12]
                Pid: 0	Local rank: 6	Node rank: 6	App rank: 12
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:6-7]
        Data for proc: [prterun-Ralphs-iMac-41412@1,13]
                Pid: 0	Local rank: 7	Node rank: 7	App rank: 13
                State: INITIALIZED	App_context: 0
        	Binding: package[1][core:30-31]

Data for node: nodeA1	State: 3	Flags: MAPPED:SLOTS_GIVEN
        Daemon: [prterun-Ralphs-iMac-41412@0,2]	Daemon launched: False
            Num slots: 48	Slots in use: 8	Oversubscribed: FALSE
            Num slots allocated: 48	Max slots: 48	Num procs: 8
        Data for proc: [prterun-Ralphs-iMac-41412@1,2]
                Pid: 0	Local rank: 0	Node rank: 0	App rank: 2
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:0-1]
        Data for proc: [prterun-Ralphs-iMac-41412@1,3]
                Pid: 0	Local rank: 1	Node rank: 1	App rank: 3
                State: INITIALIZED	App_context: 0
        	Binding: package[1][core:24-25]
        Data for proc: [prterun-Ralphs-iMac-41412@1,6]
                Pid: 0	Local rank: 2	Node rank: 2	App rank: 6
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:2-3]
        Data for proc: [prterun-Ralphs-iMac-41412@1,7]
                Pid: 0	Local rank: 3	Node rank: 3	App rank: 7
                State: INITIALIZED	App_context: 0
        	Binding: package[1][core:26-27]
        Data for proc: [prterun-Ralphs-iMac-41412@1,10]
                Pid: 0	Local rank: 4	Node rank: 4	App rank: 10
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:4-5]
        Data for proc: [prterun-Ralphs-iMac-41412@1,11]
                Pid: 0	Local rank: 5	Node rank: 5	App rank: 11
                State: INITIALIZED	App_context: 0
        	Binding: package[1][core:28-29]
        Data for proc: [prterun-Ralphs-iMac-41412@1,14]
                Pid: 0	Local rank: 6	Node rank: 6	App rank: 14
                State: INITIALIZED	App_context: 0
        	Binding: package[0][core:6-7]
        Data for proc: [prterun-Ralphs-iMac-41412@1,15]
                Pid: 0	Local rank: 7	Node rank: 7	App rank: 15
                State: INITIALIZED	App_context: 0
        	Binding: package[1][core:30-31]

So I think this is just a case of updating to a newer OMPI release.

@rhc54
Contributor

rhc54 commented Mar 15, 2025

Also, L3cache on AMD Zen4 systems (with 8 cores sharing an L3 cache) is incorrectly treated the same as a memory domain mapping.

Don't know anything about that - could you perhaps provide your topology? You can attach the output of lstopo --output-format xml
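For example (the output filename is arbitrary):

```
# Dump the node topology to an XML file that can be attached here
lstopo --output-format xml topo.xml
```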

@drmichaeltcvx
Author

Yeah, that's not a new feature, just a bug. Could you please tell us what version of OMPI you are using?

Thanks for the prompt response!

I am using OMPI out of HPC-X:
$ cat $HPCX_DIR/VERSION
HPC-X v2.21
clusterkit-3312df7 1.14.462 (3312df7)
hcoll-1a4e38d 4.8.3230 (1a4e38d)
nccl_rdma_sharp_plugin-2a632df681125923c4b9e8d4426df76c0eb8db69 2.7 (2a632df)
ompi-4292ae09f9c961614e93bd0a86cf517806140380 gitclone (4292ae0)
sharp-7a20b6077f803fad04915e6727bc78ff86f7bb1c 3.9.0 (7a20b60)
ucc-master_remove_submodule_fix 1.4.0 (22c8c3c)
ucx-152bf42db308b4cb42739fd706ce9f8b8e87246b 1.18.0 (152bf42)
Linux: redhat8
OFED: doca_ofed
Build #: 814
gcc (GCC) 8.2.1 20180905 (Red Hat 8.2.1-3)
CUDA: V12.6.68

@drmichaeltcvx
Author

Likewise, the original concern with the "pe=2:span" modifiers seems to work as well for two nodes:

$ mpirun --prtemca hwloc_use_topo_file xxx.xml --runtime-options donotlaunch --display map-devel --map-by numa:PE=2:span -np 16 hostname

So I think this is just a case of updating to a newer OMPI release.

Should I assume that the above behavior only applies to OMPI v5.XX and beyond?

@drmichaeltcvx
Author

What is the recommendation for applying user-directed "complex" mappings, such as those mentioned in the issue description, with OMPI v4.1.x?

Actually, this mapping is common when we run on hardware platforms with large numbers of cores. We undersubscribe the compute resources in order to allow a higher and more even allocation of the available DRAM bandwidth to the ranks, as in the sketch below.
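A minimal sketch of the pattern that maps correctly today (one core per rank; the counts assume the 2-node, 4-NUMA-domain-per-node systems described above, and ./app is a placeholder):

```
# 2 ranks per NUMA domain, spread across all NUMA domains of both nodes,
# leaving the remaining cores idle so each rank gets more DRAM bandwidth
mpirun --map-by ppr:2:numa:span -np 16 ./app
```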

@drmichaeltcvx
Author

Also, L3cache on AMD Zen4 systems (with 8 cores sharing an L3 cache) is incorrectly treated the same as a memory domain mapping.

Don't know anything about that - could you perhaps provide your topology? You can attach the output of lstopo --output-format xml

The Zen4 and Zen3 architectures have 8 cores closely associated with each L3 cache sub-unit, whereas Zen2 has 4 cores per L3 sub-unit. The examples below are for Zen4, and each node has 176 cores.
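A quick way to confirm how many cores share each L3 (assuming the hwloc command-line tools are installed; the object index is arbitrary):

```
# Count the cores contained in the first L3 cache object
hwloc-calc --number-of core l3cache:0
```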

Here is --map-by L3cache with --bind-to core:

$ mpirun --oversubscribe --np 8  --hostfile  ~/hosts/Zen4-2.hosts --display-devel-map --report-bindings --map-by L3cache --bind-to core hostname 


 Data for JOB [40608,1] offset 0 Total slots allocated 352

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYL3CACHE:OVERSUBSCRIBE  Ranking policy: UNKNOWN
 Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 0
 	Num new daemons: 0	New daemon starting vpid INVALID
 	Num nodes: 1

 Data for node: ccnpusc4000001	State: 3	Flags: 11
 	Daemon: [[40608,0],0]	Daemon launched: True
 	Num slots: 176	Slots in use: 8	Oversubscribed: FALSE
 	Num slots allocated: 176	Max slots: 0
 	Num procs: 8	Next node_rank: 8
 	Data for proc: [[40608,1],0]
 		Pid: 0	Local rank: 0	Node rank: 0	App rank: 0
 		State: INITIALIZED	App_context: 0
 		Locale:  [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40608,1],1]
 		Pid: 0	Local rank: 1	Node rank: 1	App rank: 1
 		State: INITIALIZED	App_context: 0
 		Locale:  [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [././././././././././././././././././././././././././././././././././././././././././././B/././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40608,1],2]
 		Pid: 0	Local rank: 2	Node rank: 2	App rank: 2
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40608,1],3]
 		Pid: 0	Local rank: 3	Node rank: 3	App rank: 3
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40608,1],4]
 		Pid: 0	Local rank: 4	Node rank: 4	App rank: 4
 		State: INITIALIZED	App_context: 0
 		Locale:  [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40608,1],5]
 		Pid: 0	Local rank: 5	Node rank: 5	App rank: 5
 		State: INITIALIZED	App_context: 0
 		Locale:  [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././B/./././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40608,1],6]
 		Pid: 0	Local rank: 6	Node rank: 6	App rank: 6
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40608,1],7]
 		Pid: 0	Local rank: 7	Node rank: 7	App rank: 7
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././B/./././././././././././././././././././././././././././././././././././././././././.]
 Data for JOB [40608,1] offset 0 Total slots allocated 352

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYL3CACHE:OVERSUBSCRIBE  Ranking policy: UNKNOWN
 Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 0
 	Num new daemons: 0	New daemon starting vpid INVALID
 	Num nodes: 1

 Data for node: ccnpusc4000001	State: 1	Flags: 10
 	Daemon: [[40608,0],0]	Daemon launched: False
 	Num slots: 176	Slots in use: 0	Oversubscribed: FALSE
 	Num slots allocated: 176	Max slots: 0
 	Num procs: 8	Next node_rank: 8
 	Data for proc: [[40608,1],0]
 		Pid: 0	Local rank: 0	Node rank: 0	App rank: 0
 		State: INITIALIZED	App_context: 0
 		Locale:  [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40608,1],1]
 		Pid: 0	Local rank: 1	Node rank: 1	App rank: 1
 		State: INITIALIZED	App_context: 0
 		Locale:  [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40608,1],2]
 		Pid: 0	Local rank: 2	Node rank: 2	App rank: 2
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40608,1],3]
 		Pid: 0	Local rank: 3	Node rank: 3	App rank: 3
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 		Binding: UNBOUND
 	Data for proc: [[40608,1],4]
 		Pid: 0	Local rank: 4	Node rank: 4	App rank: 4
 		State: INITIALIZED	App_context: 0
 		Locale:  [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40608,1],5]
 		Pid: 0	Local rank: 5	Node rank: 5	App rank: 5
 		State: INITIALIZED	App_context: 0
 		Locale:  [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40608,1],6]
 		Pid: 0	Local rank: 6	Node rank: 6	App rank: 6
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40608,1],7]
 		Pid: 0	Local rank: 7	Node rank: 7	App rank: 7
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 		Binding: UNBOUND
[ccnpusc4000001.p.ussc.az.chevron.net:1835469] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835469] MCW rank 1 bound to socket 0[core 44[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././B/././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835469] MCW rank 2 bound to socket 1[core 88[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835469] MCW rank 3 bound to socket 1[core 132[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835469] MCW rank 4 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835469] MCW rank 5 bound to socket 0[core 45[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././B/./././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835469] MCW rank 6 bound to socket 1[core 89[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835469] MCW rank 7 bound to socket 1[core 133[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././B/./././././././././././././././././././././././././././././././././././././././././.]
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net

Here is --map-by L3cache with --bind-to L3cache:

mtml@ccnpusc4000001[pts/32]~ $ mpirun --oversubscribe --np 8  --hostfile  ~/hosts/Zen4-2.hosts --display-devel-map --report-bindings --map-by L3cache --bind-to L3cache hostname 


 Data for JOB [40340,1] offset 0 Total slots allocated 352

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYL3CACHE:OVERSUBSCRIBE  Ranking policy: UNKNOWN
 Binding policy: L3CACHE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 0
 	Num new daemons: 0	New daemon starting vpid INVALID
 	Num nodes: 1

 Data for node: ccnpusc4000001	State: 3	Flags: 11
 	Daemon: [[40340,0],0]	Daemon launched: True
 	Num slots: 176	Slots in use: 8	Oversubscribed: FALSE
 	Num slots allocated: 176	Max slots: 0
 	Num procs: 8	Next node_rank: 8
 	Data for proc: [[40340,1],0]
 		Pid: 0	Local rank: 0	Node rank: 0	App rank: 0
 		State: INITIALIZED	App_context: 0
 		Locale:  [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40340,1],1]
 		Pid: 0	Local rank: 1	Node rank: 1	App rank: 1
 		State: INITIALIZED	App_context: 0
 		Locale:  [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40340,1],2]
 		Pid: 0	Local rank: 2	Node rank: 2	App rank: 2
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40340,1],3]
 		Pid: 0	Local rank: 3	Node rank: 3	App rank: 3
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 	Data for proc: [[40340,1],4]
 		Pid: 0	Local rank: 4	Node rank: 4	App rank: 4
 		State: INITIALIZED	App_context: 0
 		Locale:  [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40340,1],5]
 		Pid: 0	Local rank: 5	Node rank: 5	App rank: 5
 		State: INITIALIZED	App_context: 0
 		Locale:  [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40340,1],6]
 		Pid: 0	Local rank: 6	Node rank: 6	App rank: 6
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 	Data for proc: [[40340,1],7]
 		Pid: 0	Local rank: 7	Node rank: 7	App rank: 7
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 		Binding: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 Data for JOB [40340,1] offset 0 Total slots allocated 352

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYL3CACHE:OVERSUBSCRIBE  Ranking policy: UNKNOWN
 Binding policy: L3CACHE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 0
 	Num new daemons: 0	New daemon starting vpid INVALID
 	Num nodes: 1

 Data for node: ccnpusc4000001	State: 1	Flags: 10
 	Daemon: [[40340,0],0]	Daemon launched: False
 	Num slots: 176	Slots in use: 0	Oversubscribed: FALSE
 	Num slots allocated: 176	Max slots: 0
 	Num procs: 8	Next node_rank: 8
 	Data for proc: [[40340,1],0]
 		Pid: 0	Local rank: 0	Node rank: 0	App rank: 0
 		State: INITIALIZED	App_context: 0
 		Locale:  [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40340,1],1]
 		Pid: 0	Local rank: 1	Node rank: 1	App rank: 1
 		State: INITIALIZED	App_context: 0
 		Locale:  [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40340,1],2]
 		Pid: 0	Local rank: 2	Node rank: 2	App rank: 2
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40340,1],3]
 		Pid: 0	Local rank: 3	Node rank: 3	App rank: 3
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 		Binding: UNBOUND
 	Data for proc: [[40340,1],4]
 		Pid: 0	Local rank: 4	Node rank: 4	App rank: 4
 		State: INITIALIZED	App_context: 0
 		Locale:  [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40340,1],5]
 		Pid: 0	Local rank: 5	Node rank: 5	App rank: 5
 		State: INITIALIZED	App_context: 0
 		Locale:  [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40340,1],6]
 		Pid: 0	Local rank: 6	Node rank: 6	App rank: 6
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
 		Binding: UNBOUND
 	Data for proc: [[40340,1],7]
 		Pid: 0	Local rank: 7	Node rank: 7	App rank: 7
 		State: INITIALIZED	App_context: 0
 		Locale:  [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
 		Binding: UNBOUND
[ccnpusc4000001.p.ussc.az.chevron.net:1835769] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]], socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]], socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]], socket 0[core 16[hwt 0]], socket 0[core 17[hwt 0]], socket 0[core 18[hwt 0]], socket 0[core 19[hwt 0]], socket 0[core 20[hwt 0]], socket 0[core 21[hwt 0]], socket 0[core 22[hwt 0]], socket 0[core 23[hwt 0]], socket 0[core 24[hwt 0]], socket 0[core 25[hwt 0]], socket 0[core 26[hwt 0]], socket 0[core 27[hwt 0]], socket 0[core 28[hwt 0]], socket 0[core 29[hwt 0]], socket 0[core 30[hwt 0]], socket 0[core 31[hwt 0]], socket 0[core 32[hwt 0]], socket 0[core 33[hwt 0]], socket 0[core 34[hwt 0]], socket 0[core 35[hwt 0]], socket 0[core 36[hwt 0]], socket 0[core 37[hwt 0]], socket 0[core 38[hwt 0]], socket 0[core 39[hw: [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835769] MCW rank 1 bound to socket 0[core 44[hwt 0]], socket 0[core 45[hwt 0]], socket 0[core 46[hwt 0]], socket 0[core 47[hwt 0]], socket 0[core 48[hwt 0]], socket 0[core 49[hwt 0]], socket 0[core 50[hwt 0]], socket 0[core 51[hwt 0]], socket 0[core 52[hwt 0]], socket 0[core 53[hwt 0]], socket 0[core 54[hwt 0]], socket 0[core 55[hwt 0]], socket 0[core 56[hwt 0]], socket 0[core 57[hwt 0]], socket 0[core 58[hwt 0]], socket 0[core 59[hwt 0]], socket 0[core 60[hwt 0]], socket 0[core 61[hwt 0]], socket 0[core 62[hwt 0]], socket 0[core 63[hwt 0]], socket 0[core 64[hwt 0]], socket 0[core 65[hwt 0]], socket 0[core 66[hwt 0]], socket 0[core 67[hwt 0]], socket 0[core 68[hwt 0]], socket 0[core 69[hwt 0]], socket 0[core 70[hwt 0]], socket 0[core 71[hwt 0]], socket 0[core 72[hwt 0]], socket 0[core 73[hwt 0]], socket 0[core 74[hwt 0]], socket 0[core 75[hwt 0]], socket 0[core 76[hwt 0]], socket 0[core 77[hwt 0]], socket 0[core 78[hwt 0]], socket 0[core 79[hwt 0]], socket 0[core 80[hwt 0]], socket 0[core 81[hwt 0]], socket 0[core 82[hwt 0]], socket 0[: [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835769] MCW rank 2 bound to socket 1[core 88[hwt 0]], socket 1[core 89[hwt 0]], socket 1[core 90[hwt 0]], socket 1[core 91[hwt 0]], socket 1[core 92[hwt 0]], socket 1[core 93[hwt 0]], socket 1[core 94[hwt 0]], socket 1[core 95[hwt 0]], socket 1[core 96[hwt 0]], socket 1[core 97[hwt 0]], socket 1[core 98[hwt 0]], socket 1[core 99[hwt 0]], socket 1[core 100[hwt 0]], socket 1[core 101[hwt 0]], socket 1[core 102[hwt 0]], socket 1[core 103[hwt 0]], socket 1[core 104[hwt 0]], socket 1[core 105[hwt 0]], socket 1[core 106[hwt 0]], socket 1[core 107[hwt 0]], socket 1[core 108[hwt 0]], socket 1[core 109[hwt 0]], socket 1[core 110[hwt 0]], socket 1[core 111[hwt 0]], socket 1[core 112[hwt 0]], socket 1[core 113[hwt 0]], socket 1[core 114[hwt 0]], socket 1[core 115[hwt 0]], socket 1[core 116[hwt 0]], socket 1[core 117[hwt 0]], socket 1[core 118[hwt 0]], socket 1[core 119[hwt 0]], socket 1[core 120[hwt 0]], socket 1[core 121[hwt 0]], socket 1[core 122[hwt 0]], socket 1[core 123[hwt 0]], socket 1[core 124[hwt 0]], socket 1[core 125[hwt 0]], socket 1[: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835769] MCW rank 3 bound to socket 1[core 132[hwt 0]], socket 1[core 133[hwt 0]], socket 1[core 134[hwt 0]], socket 1[core 135[hwt 0]], socket 1[core 136[hwt 0]], socket 1[core 137[hwt 0]], socket 1[core 138[hwt 0]], socket 1[core 139[hwt 0]], socket 1[core 140[hwt 0]], socket 1[core 141[hwt 0]], socket 1[core 142[hwt 0]], socket 1[core 143[hwt 0]], socket 1[core 144[hwt 0]], socket 1[core 145[hwt 0]], socket 1[core 146[hwt 0]], socket 1[core 147[hwt 0]], socket 1[core 148[hwt 0]], socket 1[core 149[hwt 0]], socket 1[core 150[hwt 0]], socket 1[core 151[hwt 0]], socket 1[core 152[hwt 0]], socket 1[core 153[hwt 0]], socket 1[core 154[hwt 0]], socket 1[core 155[hwt 0]], socket 1[core 156[hwt 0]], socket 1[core 157[hwt 0]], socket 1[core 158[hwt 0]], socket 1[core 159[hwt 0]], socket 1[core 160[hwt 0]], socket 1[core 161[hwt 0]], socket 1[core 162[hwt 0]], socket 1[core 163[hwt 0]], socket 1[core 164[hwt 0]], socket 1[core 165[hwt 0]], socket 1[core 166[hwt 0]], socket 1[core 167[hwt 0]], socket 1[core 168[hwt 0]], socket 1[core 169[hwt 0]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[ccnpusc4000001.p.ussc.az.chevron.net:1835769] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]], socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]], socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]], socket 0[core 16[hwt 0]], socket 0[core 17[hwt 0]], socket 0[core 18[hwt 0]], socket 0[core 19[hwt 0]], socket 0[core 20[hwt 0]], socket 0[core 21[hwt 0]], socket 0[core 22[hwt 0]], socket 0[core 23[hwt 0]], socket 0[core 24[hwt 0]], socket 0[core 25[hwt 0]], socket 0[core 26[hwt 0]], socket 0[core 27[hwt 0]], socket 0[core 28[hwt 0]], socket 0[core 29[hwt 0]], socket 0[core 30[hwt 0]], socket 0[core 31[hwt 0]], socket 0[core 32[hwt 0]], socket 0[core 33[hwt 0]], socket 0[core 34[hwt 0]], socket 0[core 35[hwt 0]], socket 0[core 36[hwt 0]], socket 0[core 37[hwt 0]], socket 0[core 38[hwt 0]], socket 0[core 39[hw: [B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835769] MCW rank 5 bound to socket 0[core 44[hwt 0]], socket 0[core 45[hwt 0]], socket 0[core 46[hwt 0]], socket 0[core 47[hwt 0]], socket 0[core 48[hwt 0]], socket 0[core 49[hwt 0]], socket 0[core 50[hwt 0]], socket 0[core 51[hwt 0]], socket 0[core 52[hwt 0]], socket 0[core 53[hwt 0]], socket 0[core 54[hwt 0]], socket 0[core 55[hwt 0]], socket 0[core 56[hwt 0]], socket 0[core 57[hwt 0]], socket 0[core 58[hwt 0]], socket 0[core 59[hwt 0]], socket 0[core 60[hwt 0]], socket 0[core 61[hwt 0]], socket 0[core 62[hwt 0]], socket 0[core 63[hwt 0]], socket 0[core 64[hwt 0]], socket 0[core 65[hwt 0]], socket 0[core 66[hwt 0]], socket 0[core 67[hwt 0]], socket 0[core 68[hwt 0]], socket 0[core 69[hwt 0]], socket 0[core 70[hwt 0]], socket 0[core 71[hwt 0]], socket 0[core 72[hwt 0]], socket 0[core 73[hwt 0]], socket 0[core 74[hwt 0]], socket 0[core 75[hwt 0]], socket 0[core 76[hwt 0]], socket 0[core 77[hwt 0]], socket 0[core 78[hwt 0]], socket 0[core 79[hwt 0]], socket 0[core 80[hwt 0]], socket 0[core 81[hwt 0]], socket 0[core 82[hwt 0]], socket 0[: [././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835769] MCW rank 6 bound to socket 1[core 88[hwt 0]], socket 1[core 89[hwt 0]], socket 1[core 90[hwt 0]], socket 1[core 91[hwt 0]], socket 1[core 92[hwt 0]], socket 1[core 93[hwt 0]], socket 1[core 94[hwt 0]], socket 1[core 95[hwt 0]], socket 1[core 96[hwt 0]], socket 1[core 97[hwt 0]], socket 1[core 98[hwt 0]], socket 1[core 99[hwt 0]], socket 1[core 100[hwt 0]], socket 1[core 101[hwt 0]], socket 1[core 102[hwt 0]], socket 1[core 103[hwt 0]], socket 1[core 104[hwt 0]], socket 1[core 105[hwt 0]], socket 1[core 106[hwt 0]], socket 1[core 107[hwt 0]], socket 1[core 108[hwt 0]], socket 1[core 109[hwt 0]], socket 1[core 110[hwt 0]], socket 1[core 111[hwt 0]], socket 1[core 112[hwt 0]], socket 1[core 113[hwt 0]], socket 1[core 114[hwt 0]], socket 1[core 115[hwt 0]], socket 1[core 116[hwt 0]], socket 1[core 117[hwt 0]], socket 1[core 118[hwt 0]], socket 1[core 119[hwt 0]], socket 1[core 120[hwt 0]], socket 1[core 121[hwt 0]], socket 1[core 122[hwt 0]], socket 1[core 123[hwt 0]], socket 1[core 124[hwt 0]], socket 1[core 125[hwt 0]], socket 1[: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/./././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:1835769] MCW rank 7 bound to socket 1[core 132[hwt 0]], socket 1[core 133[hwt 0]], socket 1[core 134[hwt 0]], socket 1[core 135[hwt 0]], socket 1[core 136[hwt 0]], socket 1[core 137[hwt 0]], socket 1[core 138[hwt 0]], socket 1[core 139[hwt 0]], socket 1[core 140[hwt 0]], socket 1[core 141[hwt 0]], socket 1[core 142[hwt 0]], socket 1[core 143[hwt 0]], socket 1[core 144[hwt 0]], socket 1[core 145[hwt 0]], socket 1[core 146[hwt 0]], socket 1[core 147[hwt 0]], socket 1[core 148[hwt 0]], socket 1[core 149[hwt 0]], socket 1[core 150[hwt 0]], socket 1[core 151[hwt 0]], socket 1[core 152[hwt 0]], socket 1[core 153[hwt 0]], socket 1[core 154[hwt 0]], socket 1[core 155[hwt 0]], socket 1[core 156[hwt 0]], socket 1[core 157[hwt 0]], socket 1[core 158[hwt 0]], socket 1[core 159[hwt 0]], socket 1[core 160[hwt 0]], socket 1[core 161[hwt 0]], socket 1[core 162[hwt 0]], socket 1[core 163[hwt 0]], socket 1[core 164[hwt 0]], socket 1[core 165[hwt 0]], socket 1[core 166[hwt 0]], socket 1[core 167[hwt 0]], socket 1[core 168[hwt 0]], socket 1[core 169[hwt 0]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B/B]
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net
ccnpusc4000001.p.ussc.az.chevron.net

@drmichaeltcvx
Copy link
Author

Here is the XML from lstopo --output-format xml

lstopo-Zen4.txt

@drmichaeltcvx
Copy link
Author

On the "--cpu-set ..." option, can I combine it with any of the rest mapping policies in any way ?

Overall, can you simplify the syntax a bit so that the two following cases are properly supported (with correct handling of OMP threads per rank) when nodes are under-subscribed:

  • Compact the ranks as close as possible to resources (cores, L3caches, numa, sockets) and

  • Spread the ranks as evenly as possible over resources (cores, L3caches, numa, sockets)

Intel MPI has these two env vars to describe how to select core subsets for ranks with more than one OMP thread, and how to arrange the relative placement of ranks over all available resources (a minimal sketch follows the list):

  • I_MPI_PIN_DOMAIN : how/where to select/bind the threads of each rank, and

  • I_MPI_PIN_ORDER : how to place ranks relative to one another.
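
For reference, a minimal sketch of that Intel MPI recipe (thread count, rank count and binary name are placeholders):

$ export OMP_NUM_THREADS=4          # threads per rank (placeholder)
$ export I_MPI_PIN_DOMAIN=omp       # one pinning domain of OMP_NUM_THREADS cores per rank
$ export I_MPI_PIN_ORDER=scatter    # spread the domains across the node's resources
$ mpirun -np 16 ./app               # Intel MPI launcher, placeholder values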

@rhc54
Copy link
Contributor

rhc54 commented Mar 17, 2025

Should I assume that the above behavior only applies to OMPI v5.XX and beyond?

It could be true for higher v4.x releases as well - I don't know. It appears you are using NVIDIA's package, which is based on some v4.1.x release (not sure which one). If you stay with that package, then you may need to wait for them to upgrade to v5. You'd have to ask them for potential release dates.

If you want to use OMPI directly, then you can just download/build one of their v5.0.x tarballs: https://www.open-mpi.org/software/ompi/v5.0/

What is the recommendation for applying user-directed "complex" mappings with ompi v4.1.x? As those mentioned on the issue description?

You could try the rank_file mapper: https://docs.prrte.org/en/latest/placement/rankfiles.html
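
For example, a rankfile for four 8-core ranks on two nodes might look roughly like this (nodeA/nodeB and the socket:core slot syntax are only illustrative; check the linked docs for the exact format your version accepts):

rank 0=nodeA slot=0:0-7
rank 1=nodeA slot=1:0-7
rank 2=nodeB slot=0:0-7
rank 3=nodeB slot=1:0-7

and be passed with something like mpirun -np 4 --rankfile myrankfile ./app.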

I usually just experiment with the mappings offline until I find something like what I want. You can do that with a cmd line like:
mpirun --prtemca hwloc_use_topo_file foo.xml --runtime-options donotlaunch --display map ...
where "foo.xml" is the topo file (like the one you provided) of the target system. The other options just tell mpirun not to try and launch the result, and to display the mapping/binding of the result. If you want a more detailed map like the one I pasted, then use "map-devel" instead.

Since you seem to know the chip pretty well, you could do something like --map-by ppr:N:l3:pe=X. This tells us to place N procs on each L3, binding each proc to X cpus.
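
As a hypothetical instance for the machine discussed here, if the topology exposed the expected 8-core L3/CCX domains (22 per 176-core node), one rank per L3 using all 8 of its cores across both nodes would be:

$ mpirun -np 44 --map-by ppr:1:l3:pe=8 ./app   # 1 rank per L3, 8 cores per rank, 22 L3s per node x 2 nodes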

Compact the ranks as close as possible to resources

Not quite sure I understand, but if you want to densely pack the chip, then usually something like --map-by core:pe=N is all you really need. Once you give a specific number of cpu's to assign to each process, we automatically bind you to those cpus, so no binding directive should be given. You can replace "core" with any of the object types. Note that we place a proc on the object, assigning N cpus to it, and then move to the next object to assign the next proc until we hit the end of the node. We then circle around to the first object to assign the next proc to it, if enough cpus remain. We do this until the node is full, and then move to the next node.
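
A minimal sketch of that compact case, with placeholder values (4 cores per rank, 8 ranks), following the behavior described above:

$ mpirun -np 8 --map-by core:pe=4 ./app   # each rank bound to 4 adjacent cores (0-3, 4-7, ...), filling a node before moving to the next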

Spread the ranks as evenly as possible over resources

Same as above, just add ":span" to the end of the map-by option.
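
And the corresponding spread case, same placeholder values:

$ mpirun -np 8 --map-by numa:pe=4:span ./app   # ranks distributed round-robin across the NUMA domains of all allocated nodes, 4 cores each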

There is an envar equivalent to these, but you probably should play a bit with the cmd line until you determine what you want.

I'll play a little with that topo file and see if I can (a) spot any problems in the mapper and (b) maybe come up with a mapping for you.

@rhc54
Copy link
Contributor

rhc54 commented Mar 18, 2025

Also, L3cache on AMD Zen4 systems (with 8 cores sharing an L3 cache) is incorrectly treated the same as a memory domain mapping.

I took a look at your topology, and it appears the L3Cache spans the entire package. I gather this doesn't match your expectations.

@bgoglin Any thoughts on this? The topology is attached above - this is the AMD Zen4 chip. This is the OMPI v4.1 series, so it would be HWLOC 2.0.1

@rhc54
Copy link
Contributor

rhc54 commented Mar 18, 2025

I next tested your given cmd line with the current head of the OMPI v5.0 branch, simulating two nodes with your provided topology:

$ mpirun --prtemca hwloc_use_topo_file ~/topologies/amd-Zen4.xml --runtime-options donotlaunch --display map --map-by numa:span --prtemca ras_simulator_num_nodes 2  -np 16  hostname

========================   JOB MAP   ========================
Data for JOB prterun-Ralphs-iMac-60759@1 offset 0 Total slots allocated 352
    Mapping policy: BYNUMA:NOOVERSUBSCRIBE:SPAN  Ranking policy: SPAN Binding policy: NUMA:IF-SUPPORTED
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: nodeA0	Num slots: 176	Max slots: 176	Num procs: 8
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 0 Bound: package[0][core:0-43]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 1 Bound: package[0][core:44-87]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 2 Bound: package[1][core:88-131]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 3 Bound: package[1][core:132-175]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 8 Bound: package[0][core:0-43]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 9 Bound: package[0][core:44-87]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 10 Bound: package[1][core:88-131]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 11 Bound: package[1][core:132-175]

Data for node: nodeA1	Num slots: 176	Max slots: 176	Num procs: 8
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 4 Bound: package[0][core:0-43]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 5 Bound: package[0][core:44-87]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 6 Bound: package[1][core:88-131]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 7 Bound: package[1][core:132-175]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 12 Bound: package[0][core:0-43]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 13 Bound: package[0][core:44-87]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 14 Bound: package[1][core:88-131]
        Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 15 Bound: package[1][core:132-175]

As you can see, it is correctly performing the span placement strategy. The procs are bound to NUMA because you specified --map-by numa, and we automatically bind to the mapped object by default.

I next added your desired binding of each proc to 2 cpus:

$ mpirun --prtemca hwloc_use_topo_file ~/topologies/amd-Zen4.xml --runtime-options donotlaunch --display map --map-by numa:pe=2:span --prtemca ras_simulator_num_nodes 2  -np 16  hostname

========================   JOB MAP   ========================
Data for JOB prterun-Ralphs-iMac-60839@1 offset 0 Total slots allocated 352
    Mapping policy: BYNUMA:NOOVERSUBSCRIBE:SPAN  Ranking policy: SPAN Binding policy: CORE:IF-SUPPORTED
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: 2  Cpu Type: CORE


Data for node: nodeA0	Num slots: 176	Max slots: 176	Num procs: 8
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 1 Bound: package[0][core:44-45]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 2 Bound: package[1][core:88-89]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 3 Bound: package[1][core:132-133]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 8 Bound: package[0][core:2-3]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 9 Bound: package[0][core:46-47]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 10 Bound: package[1][core:90-91]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 11 Bound: package[1][core:134-135]

Data for node: nodeA1	Num slots: 176	Max slots: 176	Num procs: 8
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 4 Bound: package[0][core:0-1]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 5 Bound: package[0][core:44-45]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 6 Bound: package[1][core:88-89]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 7 Bound: package[1][core:132-133]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 12 Bound: package[0][core:2-3]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 13 Bound: package[0][core:46-47]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 14 Bound: package[1][core:90-91]
        Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 15 Bound: package[1][core:134-135]

Note that it continues to correctly compute the span, but now binds each proc to only 2 cpus. Note also that the procs are not placed on top of each other, but instead are on adjacent pairs of cores.

So I think the mapper is working correctly, at least at the head of the v5 series. Not sure how far down from that you might be able to go, but you could try some intermediate versions (between where you are and head of v5) if you prefer.

I'd also suggest playing with the various mapping options. Note that you can independently control the mapping, the ranking, and the binding of the procs. The way it works is that we first compute the location of each proc, then we go back to rank them. So you could, for example, change this last mapping to make all the procs on the first node sequentially ranked, followed by all the procs on the second node, by simply adding --rank-by fill to the cmd line:

$ mpirun --prtemca hwloc_use_topo_file ~/topologies/amd-Zen4.xml --runtime-options donotlaunch --display map --map-by numa:pe=2:span --rank-by fill  --prtemca ras_simulator_num_nodes 2  -np 16  hostname

========================   JOB MAP   ========================
Data for JOB prterun-Ralphs-iMac-60915@1 offset 0 Total slots allocated 352
    Mapping policy: BYNUMA:NOOVERSUBSCRIBE:SPAN  Ranking policy: FILL Binding policy: CORE:IF-SUPPORTED
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: 2  Cpu Type: CORE


Data for node: nodeA0	Num slots: 176	Max slots: 176	Num procs: 8
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 1 Bound: package[0][core:2-3]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 2 Bound: package[0][core:44-45]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 3 Bound: package[0][core:46-47]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 4 Bound: package[1][core:88-89]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 5 Bound: package[1][core:90-91]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 6 Bound: package[1][core:132-133]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 7 Bound: package[1][core:134-135]

Data for node: nodeA1	Num slots: 176	Max slots: 176	Num procs: 8
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 8 Bound: package[0][core:0-1]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 9 Bound: package[0][core:2-3]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 10 Bound: package[0][core:44-45]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 11 Bound: package[0][core:46-47]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 12 Bound: package[1][core:88-89]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 13 Bound: package[1][core:90-91]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 14 Bound: package[1][core:132-133]
        Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 15 Bound: package[1][core:134-135]

Hope all that proves helpful

@bgoglin
Copy link
Contributor

bgoglin commented Mar 18, 2025

We have 2 NUMAs per package here, and one L3 per NUMA. The former is perfectly normal on AMD since they have NPS1/2/4 configs in the BIOS. The latter is more surprising since it's usually one L3 per 8-core CCX/CCD. However, this CPU is an "X" model where they have a huge additional L3 on die, and I am not sure how that one is shared. 192MB of L3 per package looks too small to me for such an X model. It seems it should be 1152MB although I can't find good AMD doc for this CPU.

So on the OMPI side, I think things are correct. hwloc is definitely way too old, but Zen3/4 support only required some CPUID changes, which is likely used here on Linux. The kernel could be too old too, but I'd hope RHEL8 has good support for Zen3.

If you want to debug this further (I'd like to clarify this too), build a recent hwloc, run "hwloc-gather-topology zen4" and send the generated tarball zen4.tar.b2 either to me or in a new hwloc issue at https://github.com/open-mpi/hwloc/issues/new/

@drmichaeltcvx
Copy link
Author

Using --cpu-set <cpu number list> --bind-to core binds to all cores on the node and not to the specified CPUs in the list

This may just be an older version - here's what I get when testing with a complex topology on my system:

$ mpirun --prtemca hwloc_use_topo_file xxx.xml --runtime-options donotlaunch --display map-devel -n 4 --cpu-set 1,2,3,4 --bind-to core hostname

================================= JOB MAP =================================
Data for JOB prterun-Ralphs-iMac-41243@1 offset 0 Total slots allocated 48
Mapper requested: NULL Last mapper: round_robin Mapping policy: PE-LIST:NOOVERSUBSCRIBE Ranking policy: SLOT
Binding policy: CORE Cpu set: 1,2,3,4 PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 1

Data for node: Ralphs-iMac State: 3 Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:MAPPED:SLOTS_GIVEN
resolved from Ralphs-iMac.local
resolved from Ralphs-iMac
Daemon: [prterun-Ralphs-iMac-41243@0,0] Daemon launched: True
Num slots: 48 Slots in use: 4 Oversubscribed: FALSE
Num slots allocated: 48 Max slots: 0 Num procs: 4
Data for proc: [prterun-Ralphs-iMac-41243@1,0]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED App_context: 0
Binding: package[0][core:1-4]
Data for proc: [prterun-Ralphs-iMac-41243@1,1]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
State: INITIALIZED App_context: 0
Binding: package[0][core:1-4]
Data for proc: [prterun-Ralphs-iMac-41243@1,2]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
State: INITIALIZED App_context: 0
Binding: package[0][core:1-4]
Data for proc: [prterun-Ralphs-iMac-41243@1,3]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
State: INITIALIZED App_context: 0
Binding: package[0][core:1-4]
The above was with the head of the PRRTE v3.0 branch, which is in the upcoming OMPI v5.0.x release. Looks like it all worked as expected.

I am using OMPI from HPC-X, which is currently stuck at OMPI v4.1. So it works as expected on v5.X? Good news.

What do you think about the suggestion to make these two basic modes (compact or spread over h/w resources) explicitly specifiable as CLI options?

@drmichaeltcvx
Copy link
Author

drmichaeltcvx commented Mar 18, 2025

Should I assume that the above behavior only applies to OMPI v5.XX and beyond?

It could be true for higher v4.x releases as well - I don't know. It appears you are using NVIDIA's package, which is based on some v4.1.x release (not sure which one). If you stay with that package, then you may need to wait for them to upgrade to v5. You'd have to ask them for potential release dates.

If you want to use OMPI directly, then you can just download/build one of their v5.0.x tarballs: https://www.open-mpi.org/software/ompi/v5.0/

What is the recommendation for applying user-directed "complex" mappings with ompi v4.1.x? As those mentioned on the issue description?

You could try the rank_file mapper: https://docs.prrte.org/en/latest/placement/rankfiles.html

I usually just experiment with the mappings offline until I find something like what I want. You can do that with a cmd line like: mpirun --prtemca hwloc_use_topo_file foo.xml --runtime-options donotlaunch --display map ... where "foo.xml" is the topo file (like the one you provided) of the target system. The other options just tell mpirun not to try and launch the result, and to display the mapping/binding of the result. If you want a more detailed map like the one I pasted, then use "map-devel" instead.

Since you seem to know the chip pretty well, you could do something like --map-by ppr:N:l3:pe=X. This tells us to place N procs on each L3, binding each proc to X cpus.

Compact the ranks as close as possible to resources

Not quite sure I understand, but if you want to densely pack the chip, then usually something like --map-by core:pe=N is all you really need. Once you give a specific number of cpu's to assign to each process, we automatically bind you to those cpus, so no binding directive should be given. You can replace "core" with any of the object types. Note that we place a proc on the object, assigning N cpus to it, and then move to the next object to assign the next proc until we hit the end of the node. We then circle around to the first object to assign the next proc to it, if enough cpus remain. We do this until the node is full, and then move to the next node.

Spread the ranks as evenly as possible over resources

Same as above, just add ":span" to the end of the map-by option.

There is an envar equivalent to these, but you probably should play a bit with the cmd line until you determine what you want.

I'll play a little with that topo file and see if I can (a) spot any problems in the mapper and (b) maybe come up with a mapping for you.

Thanks, I was also thinking after the fact about --map-by core to densely populate cores with ranks.

So under ompi v4.1, assuming 2 nodes, with 4 numa domains each and 44 cores per numa domain (176 cores / node), to evenly spread 88 ranks across 2 nodes, and then evenly spread ranks onto resources on each node (e.g., L3, numa, sockets, and provided that :PE=n also allocates n resource units / rank), I need to specify something like :

--np 88 --map-by ppr:11:numa:span

We need to ensure that ":PE=k" is now working correctly. Will that get back-ported to ompi v4.1?
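
As a worked sketch of that line with the per-rank core allocation added (the 4-thread value is a placeholder), under the v5 syntax verified above:

$ export OMP_NUM_THREADS=4
$ mpirun -np 88 --map-by ppr:11:numa:pe=4 ./app   # 11 ranks x 4 cores = 44 cores, i.e. exactly one 44-core NUMA domain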

@drmichaeltcvx
Copy link
Author

We have 2 NUMAs per package here, and one L3 per NUMA. The former is perfectly normal on AMD since they have NPS1/2/4 configs in the BIOS. The latter is more surprising since it's usually one L3 per 8-core CCX/CCD. However, this CPU is an "X" model where they have a huge additional L3 on die, and I am not sure how that one is shared. 192MB of L3 per package looks too small to me for such an X model. It seems it should be 1152MB although I can't find good AMD doc for this CPU.

So on the OMPI side, I think things are correct. hwloc is definitely way too old, but Zen3/4 support only required some CPUID changes, which is likely used here on Linux. The kernel could be too old too, but I'd hope RHEL8 has good support for Zen3.

If you want to debug this further (I'd like to clarify this too), build a recent hwloc, run "hwloc-gather-topology zen4" and send the generated tarball zen4.tar.b2 either to me or in a new hwloc issue at https://github.com/open-mpi/hwloc/issues/new/

zen4.output.txt
zen4.tar.bz2.txt
zen4.xml.txt

I just added suffix .txt to all files to allow github to pick them up.

Can you please update hwloc to correctly account for the 8 cores per CCD/CCX/L3 segment? Indeed, these are the 3D V-Cache parts with 96 MiB of L3 per CCD.

@drmichaeltcvx
Copy link
Author

drmichaeltcvx commented Mar 18, 2025

@bgoglin On Zen4s, 3D V-Cache supports 1152 MiB of L3 per socket:

https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series.html

@bgoglin
Copy link
Contributor

bgoglin commented Mar 18, 2025

Both Linux and x86 CPUID backends report incorrect information. Either both hwloc and Linux are wrong when parsing CPUID information (they'd miss something specific to Zen4 X models), or the CPU just reports incorrect information. I lean toward the latter since the L3 hierarchy seems wrong here (all CCDs are merged into a single one). I am going to ask some AMD people.

@drmichaeltcvx
Copy link
Author

These are units that the Azure cloud serves out. Do you guys have contacts with Microsoft for Azure?

@bgoglin
Copy link
Contributor

bgoglin commented Mar 18, 2025

Oh wait, are these machines bare metal hardware, or virtualized by Microsoft?
I don't have any useful contact at Microsoft, especially for hwloc.

@rhc54
Copy link
Contributor

rhc54 commented Mar 18, 2025

What do you think about the suggestion to make these two basic modes (compact or spread over h/w resources) explicitly specifiable as CLI options?

Not sure I understand - they already are specifiable as CLI options. Are you asking about a shorthand name for those modes? If so, I don't know how to effectively do that - what you might consider a reasonable "compact" placement might not match someone else's definition, so we'd wind up with a plethora of names for a range of different results. It's why we just give you the knobs and try to provide some (hopefully) reasonable defaults that work for most users.

So under ompi v4.1, assuming 2 nodes, with 4 numa domains each and 44 cores per numa domain (176 cores / node), to evenly spread 88 ranks across 2 nodes, and then evenly spread ranks onto resources on each node (e.g., L3, numa, sockets, and provided that :PE=n also allocates n resource units / rank), I need to specify something like :

--np 88 --map-by ppr:11:numa:span

Yes, at least in OMPI v5 I verified that it would produce what I think you are after, though you don't really need the "span" part unless you aren't filling the nodes. Just depends on how you want the results ranked (see my above examples).

We need to ensure that ":PE=k" is now working correctly. Will that get back-ported to ompi v4.1?

If there is a bug in v4.1, I very much doubt it would be fixed - you'd probably need to upgrade. I double-checked and your proposed cmd line with the ":pe=2" worked fine with v5. You might try it without the "span" on v4 and see if that works.

These are units that the Azure cloud serves out

Oh my - yeah, you'd probably need to talk to them about it, and I'm afraid I have no contacts there.

@bgoglin
Copy link
Contributor

bgoglin commented Mar 18, 2025

This page in French https://learn.microsoft.com/fr-fr/azure/virtual-machines/hx-series-overview shows the same issue with 176-core instances. One reason might be that with 176 cores and 2 NUMA per socket, you get 44 cores per NUMA, which cannot be evenly divided into 6 CCX per NUMA. Other instances (144rs, 96rs, 48rs, 24rs) show a correct lstopo output with 12x 96MB L3 per socket. Those numbers of cores can be divided into 6 CCX per NUMA. So my guess would be that the Azure hypervisor has no way to expose a sane L3 topology on 176-core instances and thus breaks things by reporting a single L3 per NUMA (with the size of a single real L3 instead of their aggregated sizes).

@drmichaeltcvx
Copy link
Author

@bgoglin Azure cloud sets aside 8+8 cores for the "Hypervisor" layer but the L3 caches are still available. There are actually 96 cores / chip

https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/high-performance-compute/hbv4-series?tabs=sizebasic

@bgoglin
Copy link
Contributor

bgoglin commented Mar 18, 2025

If you can try a smaller instance where the number of cores per CPU is a multiple of 12, I wouldn't be surprised if things work fine there. 88 cannot be divided by 12, and I'd bet that is the reason for all this L3 issue.

@bgoglin
Copy link
Contributor

bgoglin commented Mar 19, 2025

Yes, by default on Linux, we read loooots of files under /sys/devices/system/{cpu,node}. There's also a way to tell hwloc to directly ask the CPU for topology information (the CPUID instruction, which is what the Linux kernel does to populate the /sys files), but that's not used by default. Both report wrong info, which is why I think the hypervisor is exposing incorrect topology information from the virtual CPUs.

@drmichaeltcvx
Copy link
Author

drmichaeltcvx commented Mar 19, 2025 via email

@drmichaeltcvx
Copy link
Author

The question was rather to check lstopo on smaller instances to see if L3 is correctly exposed by Azure there. However, HBv3 is Milan-X, and they seem to hide 4 cores out of 64 per socket there too. So the issue will likely be the same (60 is not divisible by 8, hence the Azure hypervisor would likely break the L3 there too).

Well, they are supposed to expose the actual h/w correctly. But then hwloc should be able to identify and work around these annoying pitfalls. hwloc developers should get in touch with cloud providers to ensure that the virtualization does not end up exposing something incorrectly. Does hwloc currently make any assumptions outside what is exposed by sysfs and procfs?

@rhc54
Copy link
Contributor

rhc54 commented Mar 19, 2025

Hmmm...if the hypervisor is saying "this is the architecture", I'm not sure it is HWLOC's responsibility to run to the cloud provider and correct them. Ditto for the chip vendor. That sounds more like something the users should insist upon. You are the ones paying for the service - you are the ones who therefore have some clout (however much) to request changes/corrections.

I would suggest filing an Azure ticket pointing out that the hypervisor is exposing an architecture that isn't what you expect/need, and ask for an explanation and/or correction. My guess is that they will tell you why they do it, and that there really isn't a way to change it.

If you really want/need to precisely utilize a chip, cloud is probably not the right place - bare metal is almost certainly going to be a better option.

@bgoglin
Copy link
Contributor

bgoglin commented Mar 20, 2025

Also, if topology info is wrong in Linux /sys because the hypervisor reports wrong CPUID info, why would it be hwloc's responsibility to fix what it gets from /sys rather than Linux's to fix what it gets from CPUID and shows in /sys? There are too many bugs in hardware, BIOS, operating systems and now hypervisors, unfortunately. We only work around some hardware bugs that are easy to identify (things like https://github.com/open-mpi/hwloc/wiki/Linux-kernel-bugs) whose actual fix will take too much time to reach end users. Here we don't even have a clear explanation of where the problem comes from, and we cannot test on other kinds of instances to clearly set the scope of the fix. Once we get some feedback from AMD and/or Azure, it'll be much easier for them to fix and deploy than to have hwloc users wait for a new hwloc release to be installed everywhere.

@drmichaeltcvx
Copy link
Author

@rhc54 Sure, on our part as cloud users we should request that they fix their virtualization. Until then we are putting effort into working around these wrong views. Note that both Open MPI and Intel MPI are affected.

I am saying that since these architectures are public and "well-known", it would be quite nice on your part if hwloc (or other similar tools) had a mode (requested by the user at the command line) where they provide the correct view in known situations where the published architecture conflicts with the one presented by the virtualization.

It seems that so far only the L3 cache is misrepresented by the Hypervisor. You don't have to fix their view, of course, and I understand that you cannot provide for all known cases of the system providing wrong information.

Our HPC is the Azure cloud and we are fixing and dealing with these and similar issues. True bare metal is the way to go for HPC.

@rhc54
Copy link
Contributor

rhc54 commented Mar 20, 2025

Hmmm...I hear you, but I'm not sure that is a feasible request. For example, I'm pretty sure you are working with an HWLOC that was released well before the Zen4 chip you are working with came out. Given the wide variety of software update strategies out there, I'm not sure there is a practical, effective way to provide that service. 🤷‍♂

That said, HWLOC does maintain a public site containing the topologies for a wide range of chips: https://gitlab.inria.fr/hwloc/xmls/-/tree/master/xml. Not everything is there, but it might be of some help. I think it used to be available on a prettier web interface, but at least it's still there.

@drmichaeltcvx
Copy link
Author

@bgoglin As I mentioned to Ralph, we are already asking the cloud provider to fix their virtualization, and it is way too messy for hwloc to present the correct view in the different defective cases.

Unfortunately, with the proliferation of these high core count processor chips that are not represented properly by virtualization s/w, more and more users will have to manually provide their own ad-hoc workarounds.

@drmichaeltcvx
Copy link
Author

drmichaeltcvx commented Mar 20, 2025

@rhc54 Is it too difficult to provide a topology file for our case? I mean for the AMD Zen3 and Zen4 presenting the correct L3 view?

@drmichaeltcvx
Copy link
Author

I will keep you updated on our discussions with Azure about these issues.

@bgoglin
Copy link
Contributor

bgoglin commented Mar 20, 2025

I was actually looking at generating a topology file. I can easily generate the XML for the entire physical machine (by modifying the "synthetic description" of your current XML).

$ lstopo -i azure.xml -.synthetic
Package:2 L3Cache:2(size=100663296) [NUMANode(memory=202856591360)] L2Cache:44(size=1048576) L1dCache:1(size=32768) L1iCache:1(size=32768) Core:1 PU:1
$ lstopo -i "Package:2 Group:2 [NUMANode(memory=202856591360)] L3Cache:6(size=100663296) L2Cache:8(size=1048576) L1dCache:1(size=32768) L1iCache:1(size=32768) Core:1 PU:1" azure-full.xml

Removing cores from this is easy BUT it creates a hole in the CPU numbering that wouldn't match Azure's numbering. So I need to tweak the above line to force hwloc to put cores 0-87 in the first socket and 88-175 in the second socket, and the 16 hidden cores after those.

However I need to know which cores to remove among the 96. Are they all below a single L3 per socket? Are they under different L3?

Also this removes IO locality, but I tend to think it doesn't matter in your case.
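
Presumably the generated XML could then be fed into the same offline workflow described above to sanity-check mappings against the corrected L3 layout, e.g. (file name as in the lstopo line above, mapping options illustrative):

$ mpirun --prtemca hwloc_use_topo_file azure-full.xml --runtime-options donotlaunch --display map --map-by ppr:1:l3:pe=8 -np 44 hostname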

@drmichaeltcvx
Copy link
Author

@bgoglin Thanks for this! We have some bare metal machines with AMD Zen4s. I will extract the topology files from there and share here. I am also quite curious to see what the machine looks like (physical/logical core numbering, L3, etc) w/o Azure's "hypervisor"

I'll share topology files for virtualized and non-virtualized h/w.

@bgoglin
Copy link
Contributor

bgoglin commented Mar 20, 2025

If you have baremetal access to the same platform (Epyc 9V33X), that's interesting indeed. Otherwise I already have some dual-socket 96-core Epyc 9654 Genoa (Zen4) at https://gitlab.inria.fr/hwloc/xmls/-/blob/master/xml/AMD-Epyc-Zen4-2pa4nu3ca8co+1hsn.xml?ref_type=heads
Unfortunately that machine was configured with 4 NUMA per socket.

Right now I am a bit confused by AMD numbering. Zen4 CPUs usually end with digit 4, but 9V33X seems to be Genoa as well.

@drmichaeltcvx
Copy link
Author

drmichaeltcvx commented Mar 20, 2025

@bgoglin I saw both Zen3 and the Zen4 above where each L3Cache is correctly reported as being associated to an 8-core block.

Yes, the 9V33X name is "weird" since it is an AMD EPYC Zen4. I believe these are specialized versions meant for cloud environments.

They likely changed their numbering to let the 1st digit indicate the micro-architecture. In https://www.amd.com/en/products/processors/server/epyc/4th-generation-architecture.html
they are using "AMD EPYC 9004, 8004 ‘GENOA’, ‘SIENA’" for Zen4.

@drmichaeltcvx
Copy link
Author

This page in French https://learn.microsoft.com/fr-fr/azure/virtual-machines/hx-series-overview shows the same issue with 176-core instances. One reason might be that with 176 cores and 2 NUMA per socket, you get 44 cores per NUMA, which cannot be evenly divided into 6 CCX per NUMA. Other instances (144rs, 96rs, 48rs, 24rs) show a correct lstopo output with 12x 96MB L3 per socket. Those numbers of cores can be divided into 6 CCX per NUMA. So my guess would be that the Azure hypervisor has no way to expose a sane L3 topology on 176-core instances and thus breaks things by reporting a single L3 per NUMA (with the size of a single real L3 instead of their aggregated sizes).

That is a reasonable speculation. They should be reporting the CCDs with the restricted cores as 6-core CCDs though.

AMD EPYC processors can have 6-core CCDs in both Zen 3 and Zen 4 architectures. For example, the AMD EPYC 7443P (Zen3) has 24 cores distributed across 4 CCDs, with each CCD containing 6 cores. Similarly, the Zen 4 architecture also supports configurations with 6-core CCDs.

@drmichaeltcvx
Copy link
Author

BTW, thank you for providing this capability. I started looking into h/w topologies and optimization back in the late 90's on Unix. I understand how tedious it is, but it is a quite useful capability.

@bgoglin
Copy link
Contributor

bgoglin commented Mar 21, 2025

I got some news from AMD explaining all the reasoning behind this "wrong" topology. I don't know yet if I can share everything here, but they are working at disseminating it.

@drmichaeltcvx
Copy link
Author

drmichaeltcvx commented Mar 25, 2025

I got some news from AMD explaining all the reasoning behind this "wrong" topology. I don't know yet if I can share everything here, but they are working at disseminating it.

Thanks. I was notified that this issue has already reached both AMD and Azure engineering. We will touch base with AMD internally.

To me the big question is what the "proper" view of the underlying h/w (virtualized or not) should be. For instance, since Azure removes the first 2 cores out of each memory domain, should the physical numbering of the cores presented by hwloc correspond to the actual physical cores, or should we just pretend that the reserved cores are not there? My ask is that if a CCD has 6 cores we represent it as such; in the Azure case we have a CCD with 6 active cores whereas the rest are 8-core CCDs (for the HBv3 with 120 visible and HBv4 with 176 visible cores).

I support the idea that hwloc exposes the actual h/w structure so if someone cares to do their own selection of cores they know they have the view of the actual h/w structure.

Of course the L3 caches need to be correctly associated with their physical CCDs.

Azure engineering is actually open to suggestions for reasonable restorations of the exposed topology so that it better reflects the h/w structure.

@drmichaeltcvx
Copy link
Author

A related question: if we run under a Linux cgroup, should the physical core numbers shown by hwloc be the actual physical core numbers? I think they should.

@rhc54
Copy link
Contributor

rhc54 commented Mar 25, 2025

Not sure what hwloc exposes today, but keep in mind that OMPI only works with logical cpu numbers.

@bgoglin
Copy link
Contributor

bgoglin commented Mar 25, 2025

hwloc cannot change the physical IDs given by Linux/Azure because they are required when asking the OS to bind tasks to specific cores. For cgroups on Linux, the physical IDs are unchanged for the same reason (the binding mask you give to Linux is not relative to the cgroup, although that's an option for memory binding).

As long as they expose the corresponding topology info, physical IDs don't matter much. They are used internally for binding and shown in lstopo in case experts really want to see them, but they often have strange ordering, can be non-consecutive, etc. (that's why hwloc recommends the use of its logical IDs instead).
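
For anyone who wants to compare the two numberings on a given box, a quick sketch with lstopo (any recent hwloc):

$ lstopo --only core        # cores by hwloc logical index (L#)
$ lstopo --only core -p     # the same cores by OS/physical index (P#)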

There were some discussions in the past where the hypervisor (Xen at that time) would tell hwloc "oh btw Linux core #4 is actually physical core #5" but they never actually released their part of the code.

If Azure is actually open to changing things, they could optionally expose the real L3 topology, which means you would have different numbers of cores per L3. It would make scheduling/placement decisions harder, but that's something the user accepts if they click on the optional "show me the real topology even if it's asymmetric".

@drmichaeltcvx
Copy link
Author

Not sure what hwloc exposes today, but keep in mind that OMPI only works with logical cpu numbers.

That is a much larger discussion we may need to start. Ultimately, MPI should assist with proper rank-to-(physical)-resource placement to aid the user in optimizing run-time performance. Going back to my initial suggestion of having some syntactical shortcuts for requesting "compact" or "balanced" rank-to-resource placement: if MPI does not "see" the physical resource hierarchy, these rank placements are more or less meaningless.

This issue is exacerbated on high core count nodes, where we want to populate the cores in a fashion that makes maximal use of memory controllers and/or L3 cache capacities.

@bgoglin
Copy link
Contributor

bgoglin commented Mar 25, 2025

hwloc's logical numbering is built using the physical organization. OMPI uses logical IDs, as Ralph said, and hwloc's "tree" of resources to implement placement. I don't know if there's a shortcut for compact or balanced, but all these things are possible with the existing mapping/binding options. This topic has actually been the main reason why hwloc was created more than 15 years ago (we had the same need for thread placement and MPI wanted it for process placement). There's a lot of research on finding the best placement for different apps.

Physical IDs are a different thing. Two neighbor cores can be numbered with physical IDs 24 and 47, or 2 and 3, it doesn't matter as long as you know that those two cores are neighbors and share this cache, NUMA, or whatever.

@drmichaeltcvx
Copy link
Author

hwloc cannot change the physical IDs given by Linux/Azure because they are required when asking the OS to bind tasks to specific cores. For cgroups on Linux, the physical IDs are unchanged for the same reason (the binding mask you give to Linux is not relative to the cgroup, although that's an option for memory binding).

As long as they expose the corresponding topology info, physical IDs don't matter much. They are used internally for binding and shown in lstopo in case experts really want to see them, but they often have strange ordering, can be non-consecutive, etc. (that's why hwloc recommends the use of its logical IDs instead).

There were some discussions in the past where the hypervisor (Xen at that time) would tell hwloc "oh btw Linux core #4 is actually physical core #5" but they never actually released their part of the code.

If Azure is actually open to changing things, they could optionally expose the real L3 topology, which means you would have different numbers of cores per L3. It would make scheduling/placement decisions harder, but that's something the user accepts if they click on the optional "show me the real topology even if it's asymmetric".

Well, in our HPC environment before launching the ranks, we inspect the underlying h/w and try to place the ranks (usually these are MPI+OMP or MPI + OMP + GPUs) at locations that can use maximally the DRAM BW/L3 capacities. So far Azure h/w is configured to have the same number of cores / mem domain and I believe that it exposes them consistently. Let's take the example of 44 or 30 active cores / mem domain. Often it is beneficial to have > 1 rank per mem domain but currently "--map-by numa" allocates ranks to initial sets of physical cores thus leaving L3 caches at the end of the domain unused. That's just an example of having an easy way to ask MPI to "please spread ranks evenly over resources".

I think in general then MPI should have the correct view of the physical h/w so that the user or MPI itself can support balanced use of under-subscribed h/w resources.

@drmichaeltcvx
Copy link
Author

hwloc's logical numbering is built using the physical organization. OMPI uses logical IDs, as Ralph said, and hwloc's "tree" of resources to implement placement. I don't know if there's a shortcut for compact or balanced, but all these things are possible with the existing mapping/binding options. This topic has actually been the main reason why hwloc was created more than 15 years ago (we had the same need for thread placement and MPI wanted it for process placement). There's a lot of research on finding the best placement for different apps.

Physical IDs are a different thing. Two neighbor cores can be numbered with physical IDs 24 and 47, or 2 and 3, it doesn't matter as long as you know that those two cores are neighbors and share this cache, NUMA, or whatever.

As long as the logical numbering is based on a reasonable physical location ordering (yes, physical core numbers will be misleading if they do not correspond to physical proximity), this is acceptable in most cases. In essence we are interested in exposing the affinity of cores to L3 caches and memory controllers. All placement decisions, I believe, are made with respect to these affinities. There could then be mapping (placement) options that make decisions wrt these affinities and the number of ranks and cores per rank.

In my codes I end up needing to say "use N ranks, having n OMP threads, and map each rank onto k physically contiguous cores (k >= n)", and place ranks as "compactly", "spread", or "balanced" as possible wrt these resource affinities.
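
For illustration, one way this might be expressed with the existing knobs (a sketch only; the numbers are placeholders and it assumes 8-core L3 domains are exposed correctly):

$ export OMP_NUM_THREADS=6                     # n threads per rank
$ export OMP_PLACES=cores
$ export OMP_PROC_BIND=close
$ mpirun -np 16 --map-by l3:pe=8:span ./app    # k=8 contiguous cores per rank (k >= n), ranks spread across L3s and nodes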

I think that the ability to specify these requirements should remove the logical or physical core numbers from the discussion.

Any feedback on this suggestion?

@rhc54
Copy link
Contributor

rhc54 commented Mar 25, 2025

I think you are conflating several topics here and it is getting out-of-hand. We standardized on logical cpus, which doesn't mean you cannot place ranks anywhere - you just specify it in terms of logical cpus if you want to tell us specific ones to use. The mapper doesn't care - when we map-by numa (or whatever) we look at the physical layout and map accordingly. When asked to bind to multiple cores, we bind to adjacent physical cores - but that is done under-the-covers. You can see that in the map output when asked to display it.

So it is only the user interface side that is locked to logical, and that is because it gets way too complicated to keep dealing with "did they say logical or physical, what if the physical core they specify isn't available, etc". Much easier for the user to just stick with logical and move on.
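For reference, the computed map/bindings can be printed at launch time; the option spelling differs between series (--report-bindings in the 4.x series, --display map as the 5.x form), so treat this as a sketch with a placeholder application:

```
mpirun -np 4 --map-by numa:PE=4 --bind-to core --report-bindings ./app
mpirun -np 4 --map-by numa:PE=4 --bind-to core --display map ./app
```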

I'm fairly opposed to your notion of putting the mapping in terms of some named tag. You have an idea of what you mean by the tag, but I guarantee others would disagree with that interpretation. So it becomes a game of trying to match expectations with names, and that just becomes a nightmare to manage. We give you the controls - you can easily specify your interpretation of "spread" or whatever, and it doesn't impact anyone else. I fail to see the need to hardcode your particular layout to a name in the runtime.
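For instance, a fully explicit placement can be spelled out through the rankfile mechanism; the sketch below uses hypothetical hostnames and logical core ranges, and the exact grammar and option name should be checked against the mpirun man page for the version in use:

```
# myrankfile -- hypothetical hosts and logical core ranges
rank 0=node01 slot=0-10
rank 1=node01 slot=44-54
rank 2=node02 slot=0-10
rank 3=node02 slot=44-54

# launch (option spelling depends on the OMPI series):
#   mpirun --rankfile myrankfile -np 4 ./app               # 4.x style
#   mpirun --map-by rankfile:file=myrankfile -np 4 ./app   # 5.x style
```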

@bgoglin
Contributor

bgoglin commented Mar 25, 2025

I think that, in general, MPI should have a correct view of the physical h/w so that the user, or MPI itself, can support balanced use of under-subscribed h/w resources.

MPI implementations have a correct view of the physical h/w, except when Azure hides it, as in this specific case.

Well, in our HPC environment, before launching the ranks we inspect the underlying h/w and try to place the ranks (usually MPI+OMP or MPI+OMP+GPU codes) at locations that make maximal use of the DRAM BW / L3 capacities. So far the Azure h/w is configured with the same number of cores per mem domain, and I believe it exposes them consistently. Take the example of 44 or 30 active cores per mem domain. It is often beneficial to have > 1 rank per mem domain, but currently "--map-by numa" assigns each rank to the initial set of physical cores in its domain, leaving the L3 caches at the end of the domain unused. That's just an example of wanting an easy way to ask MPI to "please spread ranks evenly over resources".

The issue is that there are looooots of cases. It's hard to make a shortcut when different people will want different shortcuts. In your case, you want to map by NUMA and scatter internally. Some people will want to map by NUMA but pack compactly internally (e.g. to maximize cache sharing). I am not going to comment further on this because I don't know enough about what OMPI can do.
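To illustrate the point with hedged examples (invented counts, a Zen4-style node with 8 cores per L3, placeholder application): the same rank count can be laid out with opposite intra-NUMA intent.

```
# (a) compact inside each NUMA domain: ranks packed onto adjacent cores,
#     sharing as few L3s as possible (maximizes cache sharing between ranks)
mpirun -np 8 --map-by ppr:4:numa:PE=2 --bind-to core --report-bindings ./app
# (b) scattered inside each NUMA domain: one rank per L3 cache
#     (maximizes L3 capacity and memory BW available to each rank)
mpirun -np 8 --map-by ppr:1:l3cache:PE=2 --bind-to core --report-bindings ./app
```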

As long as the logical numbering is based on a reasonable physical-location ordering (yes, physical core numbers will be misleading if they do not correspond to physical proximity), this is acceptable in most cases.

Physical core numbers have a long history of NOT corresponding to physical proximity. Such correspondence was pretty much NEVER the case in the past, because HW vendors assumed most users wanted BW and the OS was bad at scattering on its own. Things have changed since then; we now see quite a lot of cases where the numbering is obvious (except for SMT, where the non-first hardware threads are often pushed to the end of the numbering). But you should not assume anything like this, or your code will break on some uncommon platforms.

@drmichaeltcvx
Author

drmichaeltcvx commented Mar 25, 2025

hwloc is a high-quality library. I am not claiming hwloc needs to support all kinds of exceptions or weird scenarios. Vendors should make the affinity associations readily available.

In my codes I end up needing to say "use N ranks, each with n OMP threads, and map each rank onto k physically contiguous cores (k >= n)", and to place ranks as "compactly", "spread out", or "balanced" as possible with respect to these resource affinities.

hwloc and OpenMPI have contributed a great deal towards that end. However, the capability described above is highly desirable when we optimize for run-time performance. Making placement decisions while ignoring physical affinity information makes optimization efforts unnecessarily difficult.

@rhc54
Contributor

rhc54 commented Mar 25, 2025

I think we can safely table this discussion at this point. We are not going to create "named" placement strategies for the reasons we have given several times. Nobody is ignoring physical information - we just have the user tell us things in logical space. I'm glad Azure is cooperating. I think we've beaten this issue into the ground, so I (at least) shall now move on.

@drmichaeltcvx
Author

hwloc's logical numbering is built using the physical organization. OMPI uses logical IDs, as Ralph said, together with hwloc's "tree" of resources, to implement placement. I don't know if there's a shortcut for compact or balanced, but all of these things are possible with the existing mapping/binding options.

Sure, but I had to go through the current discussion to learn that some of the mapping/binding defects had been addressed in OpenMPI v5.x.

This topic has actually been the main reason why hwloc was created more than 15 years ago (we had the same need for thread placement, and MPI wanted it for process placement). There's a lot of research on finding the best placement for different apps.

Physical IDs are a different thing. Two neighbor cores can be numbered with physical IDs 24 and 47, or 2 and 3; it doesn't matter, as long as you know that those two cores are neighbors and share that cache, NUMA node, or whatever.

I totally agree. Physical numbering does not necessarily expose affinity/proximity.
