Mapping ranks evenly across nodes and across resources within nodes with MPI+OMP codes (--map-by Res:PE=n:span is broken) #13143
Using |
Yeah, that's not a new feature, just a bug. Could you please tell us what version of OMPI you are using? |
This may just be an older version - here's what I get when testing with a complex topology on my system: $ mpirun --prtemca hwloc_use_topo_file xxx.xml --runtime-options donotlaunch --display map-devel -n 4 --cpu-set 1,2,3,4 --bind-to core hostname
================================= JOB MAP =================================
Data for JOB prterun-Ralphs-iMac-41243@1 offset 0 Total slots allocated 48
Mapper requested: NULL Last mapper: round_robin Mapping policy: PE-LIST:NOOVERSUBSCRIBE Ranking policy: SLOT
Binding policy: CORE Cpu set: 1,2,3,4 PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 1
Data for node: Ralphs-iMac State: 3 Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:MAPPED:SLOTS_GIVEN
resolved from Ralphs-iMac.local
resolved from Ralphs-iMac
Daemon: [prterun-Ralphs-iMac-41243@0,0] Daemon launched: True
Num slots: 48 Slots in use: 4 Oversubscribed: FALSE
Num slots allocated: 48 Max slots: 0 Num procs: 4
Data for proc: [prterun-Ralphs-iMac-41243@1,0]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED App_context: 0
Binding: package[0][core:1-4]
Data for proc: [prterun-Ralphs-iMac-41243@1,1]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
State: INITIALIZED App_context: 0
Binding: package[0][core:1-4]
Data for proc: [prterun-Ralphs-iMac-41243@1,2]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
State: INITIALIZED App_context: 0
Binding: package[0][core:1-4]
Data for proc: [prterun-Ralphs-iMac-41243@1,3]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
State: INITIALIZED App_context: 0
Binding: package[0][core:1-4] The above was with the head of the PRRTE v3.0 branch, which is in the upcoming OMPI v5.0.x release. Looks like it all worked as expected. |
Likewise, the original concern with the "pe=2:span" modifiers seems to work as well for two nodes: $ mpirun --prtemca hwloc_use_topo_file xxx.xml --runtime-options donotlaunch --display map-devel --map-by numa:PE=2:span -np 16 hostname
================================= JOB MAP =================================
Data for JOB prterun-Ralphs-iMac-41412@1 offset 0 Total slots allocated 96
Mapper requested: NULL Last mapper: round_robin Mapping policy: BYNUMA:NOOVERSUBSCRIBE:SPAN Ranking policy: SPAN
Binding policy: CORE:IF-SUPPORTED Cpu set: N/A PPR: N/A Cpus-per-rank: 2 Cpu Type: CORE
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 2
Data for node: nodeA0 State: 3 Flags: MAPPED:SLOTS_GIVEN
Daemon: [prterun-Ralphs-iMac-41412@0,1] Daemon launched: False
Num slots: 48 Slots in use: 8 Oversubscribed: FALSE
Num slots allocated: 48 Max slots: 48 Num procs: 8
Data for proc: [prterun-Ralphs-iMac-41412@1,0]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED App_context: 0
Binding: package[0][core:0-1]
Data for proc: [prterun-Ralphs-iMac-41412@1,1]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
State: INITIALIZED App_context: 0
Binding: package[1][core:24-25]
Data for proc: [prterun-Ralphs-iMac-41412@1,4]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 4
State: INITIALIZED App_context: 0
Binding: package[0][core:2-3]
Data for proc: [prterun-Ralphs-iMac-41412@1,5]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 5
State: INITIALIZED App_context: 0
Binding: package[1][core:26-27]
Data for proc: [prterun-Ralphs-iMac-41412@1,8]
Pid: 0 Local rank: 4 Node rank: 4 App rank: 8
State: INITIALIZED App_context: 0
Binding: package[0][core:4-5]
Data for proc: [prterun-Ralphs-iMac-41412@1,9]
Pid: 0 Local rank: 5 Node rank: 5 App rank: 9
State: INITIALIZED App_context: 0
Binding: package[1][core:28-29]
Data for proc: [prterun-Ralphs-iMac-41412@1,12]
Pid: 0 Local rank: 6 Node rank: 6 App rank: 12
State: INITIALIZED App_context: 0
Binding: package[0][core:6-7]
Data for proc: [prterun-Ralphs-iMac-41412@1,13]
Pid: 0 Local rank: 7 Node rank: 7 App rank: 13
State: INITIALIZED App_context: 0
Binding: package[1][core:30-31]
Data for node: nodeA1 State: 3 Flags: MAPPED:SLOTS_GIVEN
Daemon: [prterun-Ralphs-iMac-41412@0,2] Daemon launched: False
Num slots: 48 Slots in use: 8 Oversubscribed: FALSE
Num slots allocated: 48 Max slots: 48 Num procs: 8
Data for proc: [prterun-Ralphs-iMac-41412@1,2]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 2
State: INITIALIZED App_context: 0
Binding: package[0][core:0-1]
Data for proc: [prterun-Ralphs-iMac-41412@1,3]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 3
State: INITIALIZED App_context: 0
Binding: package[1][core:24-25]
Data for proc: [prterun-Ralphs-iMac-41412@1,6]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 6
State: INITIALIZED App_context: 0
Binding: package[0][core:2-3]
Data for proc: [prterun-Ralphs-iMac-41412@1,7]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 7
State: INITIALIZED App_context: 0
Binding: package[1][core:26-27]
Data for proc: [prterun-Ralphs-iMac-41412@1,10]
Pid: 0 Local rank: 4 Node rank: 4 App rank: 10
State: INITIALIZED App_context: 0
Binding: package[0][core:4-5]
Data for proc: [prterun-Ralphs-iMac-41412@1,11]
Pid: 0 Local rank: 5 Node rank: 5 App rank: 11
State: INITIALIZED App_context: 0
Binding: package[1][core:28-29]
Data for proc: [prterun-Ralphs-iMac-41412@1,14]
Pid: 0 Local rank: 6 Node rank: 6 App rank: 14
State: INITIALIZED App_context: 0
Binding: package[0][core:6-7]
Data for proc: [prterun-Ralphs-iMac-41412@1,15]
Pid: 0 Local rank: 7 Node rank: 7 App rank: 15
State: INITIALIZED App_context: 0
Binding: package[1][core:30-31] So I think this is just a case of updating to a newer OMPI release. |
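The interleaved app ranks in the map above follow directly from the span policy. As a rough illustration (a hypothetical sketch, not PRRTE's actual code), the placement can be reproduced by treating all NUMA domains across both nodes as one flat list, assigning procs round-robin, and packing each proc's 2 cores adjacently within its domain (core bases 0 and 24 taken from the output above):

```python
# Hypothetical sketch (not PRRTE source): reproduce the placement shown in the
# map above for `--map-by numa:PE=2:span` with 2 nodes x 2 NUMA domains.
NODES = ["nodeA0", "nodeA1"]
NUMA_PER_NODE = 2
CORE_BASE = {0: 0, 1: 24}   # first core of each NUMA domain (from the map output)
PE = 2                      # cores bound per process
NP = 16

# "span" treats all NUMA domains across all nodes as one flat list and
# assigns processes to them round-robin; ranks follow the same order.
domains = [(node, numa) for node in NODES for numa in range(NUMA_PER_NODE)]

placement = {}              # rank -> (node, numa, (first_core, last_core))
counts = {d: 0 for d in domains}
for rank in range(NP):
    d = domains[rank % len(domains)]
    k = counts[d]           # how many procs already sit on this domain
    counts[d] += 1
    base = CORE_BASE[d[1]] + PE * k
    placement[rank] = (d[0], d[1], (base, base + PE - 1))

for rank in sorted(placement):
    node, numa, cores = placement[rank]
    print(f"rank {rank:2d} -> {node} package[{numa}] core:{cores[0]}-{cores[1]}")
```

This reproduces, for example, app rank 9 landing on nodeA0 package[1] cores 28-29, and nodeA0 holding app ranks 0, 1, 4, 5, 8, 9, 12, 13 as shown in the map.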
Don't know anything about that - could you perhaps provide your topology? You can attach the output of |
Thanks for the prompt response! I am using OMPI out of HPC_X: |
Should I assume that the above behavior only applies to OMPI v5.XX and beyond? |
What is the recommendation for applying user-directed "complex" mappings with OMPI v4.1.x, such as those mentioned in the issue description? This kind of mapping is actually common when we run on h/w platforms with a large number of cores. We undersubscribe the compute resources in order to allow a higher and more even allocation of the available DRAM BW to ranks. |
The Zen4 and Zen3 architectures have 8 cores closely associated with each L3 cache sub-unit, whereas Zen2 has 4 cores associated with each L3 sub-unit. The examples below are for Zen4, and each node has 176 cores. Here is map-by L3cache and bind-to core.
Here is map-by L3cache and bind-to L3cache.
|
Here is the XML from |
Overall, can you simplify the syntax a bit so that the two following cases are properly supported (with correct handling of OMP threads per rank) when nodes are under-subscribed:
IntelMPI has these two env vars to describe how to select core subsets for ranks using > 1 OMP thread, and how to arrange the relative placement of ranks among themselves over all available resources:
|
It could be true for higher v4.x releases as well - I don't know. It appears you are using NVIDIA's package, which is based on some v4.1.x release (not sure which one). If you stay with that package, then you may need to wait for them to upgrade to v5. You'd have to ask them for potential release dates. If you want to use OMPI directly, then you can just download/build one of their v5.0.x tarballs: https://www.open-mpi.org/software/ompi/v5.0/
You could try the rank_file mapper: https://docs.prrte.org/en/latest/placement/rankfiles.html I usually just experiment with the mappings offline until I find something like what I want. You can do that with a cmd line like: Since you seem to know the chip pretty well, you could do something like
Not quite sure I understand, but if you want to densely pack the chip, then usually something like
Same as above, just add ":span" to the end of the map-by option. There is an envar equivalent to these, but you probably should play a bit with the cmd line until you determine what you want. I'll play a little with that topo file and see if I can (a) spot any problems in the mapper and (b) maybe come up with a mapping for you. |
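For reference, the rank_file mapper mentioned above takes a plain-text file assigning each rank to a host and a package:core slot (syntax per the linked PRRTE docs; the hostnames and core numbers below are purely illustrative for a 2-cores-per-rank layout):

```
rank 0=nodeA0 slot=0:0-1
rank 1=nodeA0 slot=1:0-1
rank 2=nodeA1 slot=0:0-1
rank 3=nodeA1 slot=1:0-1
```

Here `slot=1:0-1` means package 1, cores 0-1. The option spelling varies by version (PRRTE/v5 documents `--map-by rankfile:file=<path>`, while the v4 series used a `--rankfile <path>` option), so check the docs for the release you are running.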
I took a look at your topology, and it appears the L3Cache spans the entire package. I gather this doesn't match your expectations. @bgoglin Any thoughts on this? The topology is attached above - this is the AMD Zen4 chip. This is the OMPI v4.1 series, so it would be HWLOC 2.0.1 |
I next tested your given cmd line with the current head of the OMPI v5.0 branch, simulating two nodes with your provided topology: $ mpirun --prtemca hwloc_use_topo_file ~/topologies/amd-Zen4.xml --runtime-options donotlaunch --display map --map-by numa:span --prtemca ras_simulator_num_nodes 2 -np 16 hostname
======================== JOB MAP ========================
Data for JOB prterun-Ralphs-iMac-60759@1 offset 0 Total slots allocated 352
Mapping policy: BYNUMA:NOOVERSUBSCRIBE:SPAN Ranking policy: SPAN Binding policy: NUMA:IF-SUPPORTED
Cpu set: N/A PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
Data for node: nodeA0 Num slots: 176 Max slots: 176 Num procs: 8
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 0 Bound: package[0][core:0-43]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 1 Bound: package[0][core:44-87]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 2 Bound: package[1][core:88-131]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 3 Bound: package[1][core:132-175]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 8 Bound: package[0][core:0-43]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 9 Bound: package[0][core:44-87]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 10 Bound: package[1][core:88-131]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 11 Bound: package[1][core:132-175]
Data for node: nodeA1 Num slots: 176 Max slots: 176 Num procs: 8
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 4 Bound: package[0][core:0-43]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 5 Bound: package[0][core:44-87]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 6 Bound: package[1][core:88-131]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 7 Bound: package[1][core:132-175]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 12 Bound: package[0][core:0-43]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 13 Bound: package[0][core:44-87]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 14 Bound: package[1][core:88-131]
Process jobid: prterun-Ralphs-iMac-60759@1 App: 0 Process rank: 15 Bound: package[1][core:132-175] As you can see, it is correctly performing the span placement strategy. The procs are bound to NUMA because you specified mapping by NUMA without an explicit binding directive. I next added your desired binding of each proc to 2 cpus: $ mpirun --prtemca hwloc_use_topo_file ~/topologies/amd-Zen4.xml --runtime-options donotlaunch --display map --map-by numa:pe=2:span --prtemca ras_simulator_num_nodes 2 -np 16 hostname
======================== JOB MAP ========================
Data for JOB prterun-Ralphs-iMac-60839@1 offset 0 Total slots allocated 352
Mapping policy: BYNUMA:NOOVERSUBSCRIBE:SPAN Ranking policy: SPAN Binding policy: CORE:IF-SUPPORTED
Cpu set: N/A PPR: N/A Cpus-per-rank: 2 Cpu Type: CORE
Data for node: nodeA0 Num slots: 176 Max slots: 176 Num procs: 8
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 1 Bound: package[0][core:44-45]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 2 Bound: package[1][core:88-89]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 3 Bound: package[1][core:132-133]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 8 Bound: package[0][core:2-3]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 9 Bound: package[0][core:46-47]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 10 Bound: package[1][core:90-91]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 11 Bound: package[1][core:134-135]
Data for node: nodeA1 Num slots: 176 Max slots: 176 Num procs: 8
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 4 Bound: package[0][core:0-1]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 5 Bound: package[0][core:44-45]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 6 Bound: package[1][core:88-89]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 7 Bound: package[1][core:132-133]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 12 Bound: package[0][core:2-3]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 13 Bound: package[0][core:46-47]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 14 Bound: package[1][core:90-91]
Process jobid: prterun-Ralphs-iMac-60839@1 App: 0 Process rank: 15 Bound: package[1][core:134-135] Note that it continues to correctly compute the span, but now binds each proc to only 2 cpus. Note also that the procs are not placed on top of each other, but instead are on adjacent pairs of cores. So I think the mapper is working correctly, at least at the head of the v5 series. Not sure how far down from that you might be able to go, but you could try some intermediate versions (between where you are and the head of v5) if you prefer. I'd also suggest playing with the various mapping options. Note that you can independently control the mapping, the ranking, and the binding of the procs. The way it works is that we first compute the location of each proc, then we go back to rank them. So you could, for example, change this last mapping to make all the procs on the first node sequentially ranked, followed by all the procs on the second node, by simply adding "--rank-by fill": $ mpirun --prtemca hwloc_use_topo_file ~/topologies/amd-Zen4.xml --runtime-options donotlaunch --display map --map-by numa:pe=2:span --rank-by fill --prtemca ras_simulator_num_nodes 2 -np 16 hostname
======================== JOB MAP ========================
Data for JOB prterun-Ralphs-iMac-60915@1 offset 0 Total slots allocated 352
Mapping policy: BYNUMA:NOOVERSUBSCRIBE:SPAN Ranking policy: FILL Binding policy: CORE:IF-SUPPORTED
Cpu set: N/A PPR: N/A Cpus-per-rank: 2 Cpu Type: CORE
Data for node: nodeA0 Num slots: 176 Max slots: 176 Num procs: 8
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 0 Bound: package[0][core:0-1]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 1 Bound: package[0][core:2-3]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 2 Bound: package[0][core:44-45]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 3 Bound: package[0][core:46-47]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 4 Bound: package[1][core:88-89]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 5 Bound: package[1][core:90-91]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 6 Bound: package[1][core:132-133]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 7 Bound: package[1][core:134-135]
Data for node: nodeA1 Num slots: 176 Max slots: 176 Num procs: 8
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 8 Bound: package[0][core:0-1]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 9 Bound: package[0][core:2-3]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 10 Bound: package[0][core:44-45]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 11 Bound: package[0][core:46-47]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 12 Bound: package[1][core:88-89]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 13 Bound: package[1][core:90-91]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 14 Bound: package[1][core:132-133]
Process jobid: prterun-Ralphs-iMac-60915@1 App: 0 Process rank: 15 Bound: package[1][core:134-135] Hope all that proves helpful |
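To make the span-vs-fill distinction in the two maps above concrete, here is a hypothetical sketch (not PRRTE's implementation) that computes the same proc locations once and then ranks them both ways, using the geometry from these maps (2 nodes, 4 NUMA domains of 44 cores each, pe=2, 16 procs):

```python
# Hypothetical sketch (not PRRTE source): same mapping, two ranking policies.
# Geometry assumed from the map output: 2 nodes x 4 NUMA, 44 cores per NUMA, pe=2.
NODES = ["nodeA0", "nodeA1"]
BASES = [0, 44, 88, 132]        # first core of each NUMA domain
PE, NP = 2, 16

flat = [(n, d) for n in range(2) for d in range(4)]

# Mapping step: place procs round-robin across all domains ("span" mapping).
slots = []                      # (node, domain, k) = k-th proc on that domain
counts = {x: 0 for x in flat}
for _ in range(NP):
    nd = flat[len(slots) % len(flat)]
    slots.append((nd[0], nd[1], counts[nd]))
    counts[nd] += 1

def cores(slot):
    _, d, k = slot
    base = BASES[d] + PE * k
    return (base, base + PE - 1)

# --rank-by span: ranks follow the round-robin mapping order itself.
rank_span = {r: s for r, s in enumerate(slots)}

# --rank-by fill: ranks go node by node, filling each domain completely first.
rank_fill = {r: s for r, s in enumerate(sorted(slots))}

print("span:", [(r, NODES[s[0]], cores(s)) for r, s in rank_span.items()][:4])
print("fill:", [(r, NODES[s[0]], cores(s)) for r, s in rank_fill.items()][:4])
```

With span ranking, rank 4 lands on the second node (as in the `numa:pe=2:span` map); with fill ranking, ranks 0-7 stay on the first node (as in the `--rank-by fill` map).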
We have 2 NUMAs per package here, and one L3 per NUMA. The former is perfectly normal on AMD since they have NPS1/2/4 configs in the BIOS. The latter is more surprising since it's usually one L3 per 8-core CCX/CCD. However, this CPU is an "X" model with a huge additional L3 on die, and I am not sure how that one is shared. 192MB of L3 per package looks too small to me for such an X model. It seems it should be 1152MB, although I can't find good AMD doc for this CPU. So on the OMPI side, I think things are correct. hwloc is definitely way too old, but Zen3/4 support only required some CPUID changes, which is likely used here on Linux. The kernel could be too old too, but I'd hope RHEL8 has good support for Zen3. If you want to debug this further (I'd like to clarify this too), build a recent hwloc, run "hwloc-gather-topology zen4" and send the generated tarball zen4.tar.bz2 either to me or in a new hwloc issue at https://github.com/open-mpi/hwloc/issues/new/ |
I am using OMPI off HPC_X which is currently stuck at OMPI v4.1. So it works as expected on v5.X? Good news. What do you think about the suggestion to get these two basic modes (compact or spread over h/w resources) become explicitly specifiable as CLI options? |
Thanks, I was also thinking about that after the fact. So under OMPI v4.1, assuming 2 nodes with 4 NUMA domains each and 44 cores per NUMA domain (176 cores / node), to evenly spread 88 ranks across the 2 nodes, and then evenly spread the ranks onto resources within each node (e.g., L3, NUMA, sockets, and provided that :PE=n also allocates n resource units / rank), I need to specify something like:
We need to ensure that ":PE=k" is now working correctly. Will that get back-ported to ompi v4.1? |
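For what it's worth, the arithmetic behind that layout checks out (figures assumed from the description above: 2 nodes, 4 NUMA domains per node, 44 cores per domain, 88 ranks):

```python
# Sanity check of the undersubscription arithmetic (figures from the comment above).
nodes, numa_per_node, cores_per_numa, ranks = 2, 4, 44, 88

ranks_per_node = ranks // nodes                    # 44 ranks per node
ranks_per_numa = ranks_per_node // numa_per_node   # 11 ranks per NUMA domain
max_pe = cores_per_numa // ranks_per_numa          # up to 4 cores per rank

print(ranks_per_node, ranks_per_numa, max_pe)
```

So with 11 ranks per NUMA domain, `:PE=n` can go as high as 4 before the domains are fully subscribed.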
zen4.output.txt — I just added the suffix .txt to all files so GitHub would pick them up. Can you please update hwloc to correctly account for the 8 cores per CCD/CCX/L3 segment? Indeed, these are the 3D V-Cache parts with 96 MiB per CCD. |
@bgoglin On Zen4s, 3D V-Cache provides up to 1152 MiB of L3 per socket: https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series.html |
Both the Linux and x86 CPUID backends report incorrect information. Either both hwloc and Linux are wrong when parsing CPUID information (they'd miss something specific to Zen4 X models), or the CPU just reports incorrect information. I lean toward the latter, since the L3 hierarchy seems wrong here (all CCDs are merged into a single one). I am going to ask some AMD people. |
These are units that the Azure cloud serves out. Do you guys have contacts with Microsoft for Azure? |
Oh wait, are these machines bare metal hardware, or virtualized by Microsoft? |
Not sure I understand - they already are specifiable as CLI options. Are you asking about a shorthand name for those modes? If so, I don't know how to effectively do that - what you might consider a reasonable "compact" placement might not match someone else's definition, so we'd wind up with a plethora of names for a range of different results. It's why we just give you the knobs and try to provide some (hopefully) reasonable defaults that work for most users.
Yes, at least in OMPI v5 I verified that it would produce what I think you are after, though you don't really need the "span" part unless you aren't filling the nodes. Just depends on how you want the results ranked (see my above examples).
If there is a bug in v4.1, I very much doubt it would be fixed - you'd probably need to upgrade. I double-checked and your proposed cmd line with the ":pe=2" worked fine with v5. You might try it without the "span" on v4 and see if that works.
Oh my - yeah, you'd probably need to talk to them about it, and I'm afraid I have no contacts there. |
This page in French https://learn.microsoft.com/fr-fr/azure/virtual-machines/hx-series-overview shows the same issue with 176-core instances. One reason might be that with 176 cores and 2 NUMA per socket, you get 44 cores per NUMA, which cannot be divided into 6 CCX per NUMA. Other instances (144rs, 96rs, 48rs, 24rs) show a correct lstopo output with 12x 96MB L3 per socket. Those numbers of cores can be divided into 6 CCX per NUMA. So my guess would be that the Azure hypervisor has no way to expose a sane L3 topology on 176-core instances and thus breaks things by reporting a single L3 per NUMA (with the size of a single real L3 instead of their aggregated sizes). |
@bgoglin Azure cloud sets aside 8+8 cores for the "Hypervisor" layer but the L3 caches are still available. There are actually 96 cores / chip |
If you can try a smaller instance where the number of cores per CPU is a multiple of 12, I wouldn't be surprised if things work fine there. 88 cannot be divided by 12; I'd bet that's the reason for this whole L3 issue. |
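The divisibility argument is easy to check (instance core counts taken from the earlier comment; 2 sockets per instance assumed):

```python
# Check which Azure instance sizes yield a per-socket core count divisible by 12
# (i.e. splittable into whole 6-CCX-per-NUMA groups). Sizes from the comments above.
for total in (176, 144, 96, 48, 24):
    per_socket = total // 2          # 2 sockets assumed
    print(total, per_socket, per_socket % 12 == 0)
```

Only the 176-core instance (88 cores per socket) fails the test, matching the observation that the smaller instances expose a correct L3 topology.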
Yes, by default on Linux, we read loooots of files under /sys/devices/system/{cpu,node}. There's also a way to tell hwloc to directly ask the CPU for topology information (the CPUID instruction, which is what the Linux kernel uses to populate the /sys files), but that's not used by default. Both report wrong info, which is why I think the hypervisor is exposing incorrect topology information from the virtual CPUs. |
I will try to see if I can engage the Azure people here.
|
Well, they are supposed to expose the actual h/w correctly. But then |
Hmmm...if the hypervisor is saying "this is the architecture", I'm not sure it is HWLOC's responsibility to run to the cloud provider and correct them. Ditto for the chip vendor. That sounds more like something the users should insist upon. You are the ones paying for the service - you are the ones who therefore have some clout (however much) to request changes/corrections. I would suggest filing an Azure ticket pointing out that the hypervisor is exposing an architecture that isn't what you expect/need, and ask for an explanation and/or correction. My guess is that they will tell you why they do it, and that there really isn't a way to change it. If you really want/need to precisely utilize a chip, cloud is probably not the right place - bare metal is almost certainly going to be a better option. |
Also, if the topology info in Linux /sys is wrong because the hypervisor reports wrong CPUID info, why would it be hwloc's responsibility to fix what it gets from /sys, and not Linux's to fix what it gets from CPUID and shows in /sys? There are too many bugs in hardware, BIOS, operating systems, and now hypervisors, unfortunately. We only work around some hardware bugs that are easy to identify (things like https://github.com/open-mpi/hwloc/wiki/Linux-kernel-bugs) and whose actual fix would take too much time to reach end users. Here we don't even have a clear explanation of where the problem comes from, and we cannot test on other kinds of instances to clearly set the scope of a fix. Once we get some feedback from AMD and/or Azure, it'll be much easier for them to fix and deploy than to have hwloc users wait for a new hwloc release to be installed everywhere. |
@rhc54 Sure, on our part as cloud users we should request that they fix their virtualization. Until then, we are putting in effort to work around these wrong views. Note that both Open MPI and Intel MPI are affected. I am saying that, since these architectures are public and "well-known", it would be quite nice on your part if they could be accounted for. It seems that so far only the L3 cache is misrepresented by the hypervisor. You don't have to fix their view, of course, and I understand that you cannot provide for all known cases of the system reporting wrong information. Our HPC is the Azure cloud, and we are dealing with these and similar issues. True bare metal is the way to go for HPC. |
Hmmm...I hear you, but I'm not sure that is a feasible request. For example, I'm pretty sure you are working with an HWLOC that was released well before the Zen4 chip you are working with came out. Given the wide variety of software update strategies out there, I'm not sure there is a practical, effective way to provide that service. 🤷♂ That said, HWLOC does maintain a public site containing the topologies for a wide range of chips: https://gitlab.inria.fr/hwloc/xmls/-/tree/master/xml. Not everything is there, but it might be of some help. I think it used to be available on a prettier web interface, but at least it's still there. |
@bgoglin As I mentioned to Ralph, we are already asking the cloud provider to fix their virtualization, but it is way too messy for them. Unfortunately, with the proliferation of these large-core-count processor chips that are not represented properly by virtualization s/w, more and more users will have to manually provide their own ad-hoc workarounds. |
@rhc54 Is it too difficult to provide a topology file for our case? I mean one for the AMD Zen3 and Zen4 that presents the correct L3 view? |
I will keep you updated on our discussions with Azure about these issues. |
I was actually looking at generating a topology file. I can easily generate the XML for the entire physical machine (by modifying the "synthetic description" of your current XML).
Removing cores from this is easy, BUT it creates a hole in the CPU numbering that wouldn't match Azure's numbering. So I need to tweak the above line to force hwloc to put cores 0-87 in the first socket and 88-175 in the second socket, and the 16 hidden cores after 176. However, I need to know which cores to remove among the 96. Are they all below a single L3 per socket? Are they under different L3s? Also, this removes IO locality, but I tend to think it doesn't matter in your case. |
@bgoglin Thanks for this! We have some bare metal machines with AMD Zen4s. I will extract the topology files from there and share them here. I am also quite curious to see what the machine looks like (physical/logical core numbering, L3, etc.) without Azure's "hypervisor". I'll share topology files for the virtualized and non-virtualized h/w. |
If you have baremetal access to the same platform (Epyc 9V33X), that's interesting indeed. Otherwise I already have some dual-socket 96-core Epyc 9654 Genoa (Zen4) at https://gitlab.inria.fr/hwloc/xmls/-/blob/master/xml/AMD-Epyc-Zen4-2pa4nu3ca8co+1hsn.xml?ref_type=heads Right now I am a bit confused by AMD numbering. Zen4 CPUs usually end with digit 4, but 9V33X seems to be Genoa as well. |
@bgoglin In both the Zen3 and Zen4 topologies above, each L3Cache is correctly reported as being associated with an 8-core block. Yes, the 9V33X name is "weird" since it is an AMD EPYC Zen4. I believe these are specialized versions meant for cloud environments. They likely changed their numbering to let the 1st digit indicate the micro-architecture. In https://www.amd.com/en/products/processors/server/epyc/4th-generation-architecture.html |
That is reasonable speculation. They should be reporting the CCDs with the restricted cores as 6-core CCDs, though. AMD EPYC processors can have 6-core CCDs in both the Zen 3 and Zen 4 architectures. For example, the AMD EPYC 7443P (Zen3) has 24 cores distributed across 4 CCDs, with each CCD containing 6 cores. Similarly, the Zen 4 architecture also supports configurations with 6-core CCDs. |
BTW, thank you for providing this capability. I started looking into h/w topologies and optimization back in the late '90s on Unix. I understand how tedious it is, but it is quite a useful capability. |
I got some news from AMD explaining all the reasoning behind this "wrong" topology. I don't know yet if I can share everything here, but they are working on disseminating it. |
Thanks. I was notified that this issue has already reached both AMD and Azure engineering. We will touch base with AMD internally. To me, the big question is what the "proper" view of the underlying h/w (virtualized or not) should be. For instance, since Azure removes the first 2 cores from each memory domain, should the physical numbering of the cores be presented accordingly? I support the idea that the exposed view should reflect the real structure. Of course, the L3 caches need to be correctly associated with their physical CCDs. Azure engineering is actually open to suggestions for reasonable restorations of the exposed view to better reflect the h/w structure. |
A related question is what happens if we run under a Linux cgroup. |
Not sure what hwloc exposes today, but keep in mind that OMPI only works with logical cpu numbers. |
hwloc cannot change the physical IDs given by Linux/Azure because they are required when asking the OS to bind tasks to specific cores. For cgroups on Linux, the physical IDs are unchanged for the same reason (the binding mask you give to Linux is not relative to the cgroup, although that is an option for memory binding). As long as the corresponding topology info is exposed, physical IDs don't matter much. They are used internally for binding and shown in lstopo in case experts really want to see them, but they often have strange ordering, can be non-consecutive, etc. (that's why hwloc recommends using its logical IDs instead). There were some discussions in the past where the hypervisor (Xen at that time) would tell hwloc "oh, btw, Linux core #4 is actually physical core #5", but they never actually released their part of the code. If Azure is actually open to changing things, they could optionally expose the real L3 topology, which means you would have different numbers of cores per L3. It would make scheduling/placement decisions harder, but that's something a user accepts by clicking an optional "show me the real topology even if it's asymmetric". |
That is a much larger discussion we may need to start. Ultimately, MPI should assist with proper rank-to-(physical-)resource placement to help users optimize run-time performance. Going back to my initial suggestion of having some syntactic shortcuts for requesting "compact" or "balanced" rank-to-resource placement: if MPI does not "see" the physical resource hierarchy, these rank placements are more or less meaningless. This issue is exacerbated on high-core-count nodes, where we want to populate the cores in a fashion that maximally leverages the memory controllers and/or L3 cache capacities. |
hwloc's logical numbering is built using the physical organization. OMPI uses logical IDs, as Ralph said, and hwloc's "tree" of resources, to implement placement. I don't know if there's a shortcut for compact or balanced, but all these things are possible with existing mapping/binding options. This topic has actually been the main reason why hwloc was created more than 15 years ago (we had the same need for thread placement, and MPI wanted it for process placement). There's a lot of research on finding the best placement for different apps. Physical IDs are a different thing. Two neighbor cores can be numbered with physical IDs 24 and 47, or 2 and 3; it doesn't matter as long as you know that those two cores are neighbors and share this cache, NUMA, or whatever. |
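To make the logical-vs-physical distinction concrete, here is a toy sketch (the IDs are invented for illustration, not taken from any real machine): two cores are neighbors because they sit under the same L3, regardless of how arbitrary their physical numbering is.

```python
# Toy topology: each core has a physical ID (OS-assigned, arbitrary) and an
# L3 cache it belongs to. hwloc-style logical IDs come from walking the tree,
# so logically adjacent cores are adjacent in the hardware hierarchy.
cores = [
    {"phys": 24, "l3": 0},  # physical IDs deliberately non-consecutive
    {"phys": 47, "l3": 0},
    {"phys": 2,  "l3": 1},
    {"phys": 3,  "l3": 1},
]

# Logical ID = index after ordering cores by their position in the hierarchy
# (here simply by L3 membership).
logical = sorted(range(len(cores)), key=lambda i: cores[i]["l3"])

def share_l3(a, b):
    """True if logical cores a and b sit under the same L3 cache."""
    return cores[logical[a]]["l3"] == cores[logical[b]]["l3"]

# Logical cores 0 and 1 are neighbors even though their physical IDs are 24 and 47.
print(share_l3(0, 1), share_l3(1, 2))  # True False
```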
Well, in our HPC environment, before launching the ranks we inspect the underlying h/w and try to place the ranks (usually MPI+OMP, or MPI+OMP+GPUs) at locations that can maximally use the DRAM BW / L3 capacities. So far Azure h/w is configured to have the same number of cores per mem domain, and I believe that it exposes them consistently. Let's take the example of 44 or 30 active cores per mem domain. Often it is beneficial to have > 1 rank per mem domain, but currently `--map-by numa:PE=n:span` is broken. I think in general MPI should have the correct view of the physical h/w so that the user, or MPI itself, can support balanced use of under-subscribed h/w resources. |
As long as the logical numbering is based on a reasonable physical-locality ordering (yes, physical core numbers will be misleading if they do not correspond to physical proximity), this is in most cases acceptable. In essence we are interested in exposing the affinity of cores to L3 caches and memory controllers; all placement decisions, I believe, are made with respect to these affinities. There could then be mapping (placement) options that make decisions wrt these affinities and the number of ranks and cores per rank. In my codes I end up needing to say "use N ranks, having n OMP threads, and map each rank onto k physically contiguous cores, k >= n", and place ranks as "compactly", "spread", or "balanced" as possible wrt these resource affinities. I think the ability to specify these requirements should remove logical or physical core numbers from the discussion. Any feedback on this suggestion? |
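The semantics I have in mind ("N ranks, k contiguous cores each, placed compactly or spread over resource domains") can be prototyped in a few lines. This is a hypothetical sketch of the desired behavior, not OMPI code; the function name and policy names are mine.

```python
def place(n_ranks, cores_per_rank, domains, policy="spread"):
    """Assign each rank a contiguous block of cores inside one domain.
    domains: one list of core IDs per NUMA/L3 domain.
    policy: "compact" fills a domain before moving on; "spread" round-robins
    ranks across domains so each domain hosts ~n_ranks/len(domains) ranks.
    """
    used = [0] * len(domains)          # cores already handed out per domain
    mapping = []
    d = 0
    for _ in range(n_ranks):
        if policy == "spread":
            # next domain in round-robin order that still has room
            while used[d] + cores_per_rank > len(domains[d]):
                d = (d + 1) % len(domains)
            pick = d
            d = (d + 1) % len(domains)
        else:  # compact
            pick = next(i for i, dom in enumerate(domains)
                        if used[i] + cores_per_rank <= len(dom))
        start = used[pick]
        mapping.append(domains[pick][start:start + cores_per_rank])
        used[pick] += cores_per_rank
    return mapping

# 4 ranks x 2 cores over two 4-core NUMA domains:
doms = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(place(4, 2, doms, "spread"))   # [[0, 1], [4, 5], [2, 3], [6, 7]]
print(place(4, 2, doms, "compact"))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```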
I think you are conflating several topics here and it is getting out-of-hand. We standardized on logical cpus, which doesn't mean you cannot place ranks anywhere - you just specify it in terms of logical cpus if you want to tell us specific ones to use. The mapper doesn't care - when we map-by numa (or whatever) we look at the physical layout and map accordingly. When asked to bind to multiple cores, we bind to adjacent physical cores - but that is done under-the-covers. You can see that in the map output when asked to display it. So it is only the user interface side that is locked to logical, and that is because it gets way too complicated to keep dealing with "did they say logical or physical, what if the physical core they specify isn't available, etc". Much easier for the user to just stick with logical and move on. I'm fairly opposed to your notion of putting the mapping in terms of some named tag. You have an idea of what you mean by the tag, but I guarantee others would disagree with that interpretation. So it becomes a game of trying to match expectations with names, and that just becomes a nightmare to manage. We give you the controls - you can easily specify your interpretation of "spread" or whatever, and it doesn't impact anyone else. I fail to see the need to hardcode your particular layout to a name in the runtime. |
MPI implems have a correct view of the physical h/w. Except when Azure hides it in this specific case.
The issue is that there are looooots of cases. It's hard to make a shortcut when different people will want different shortcuts. In your case, you want to map by NUMA and scatter internally. Some people will want to map by NUMA but compact internally (e.g. to maximize cache sharing). I am not going to comment further on this because I don't know enough about what OMPI can do.
Physical core numbers have a long history of NOT corresponding to physical proximity. This was pretty much NEVER the case in the past because HW vendors assumed most users wanted BW and the OS was bad at doing a scatter. Things have changed since then, we see quite a lot of cases where the numbering is obvious (except for SMT when non-first threads are often moved to the end of the numbering), but you should not assume anything like this, or your code will break on some uncommon platforms. |
I think we can safely table this discussion at this point. We are not going to create "named" placement strategies for the reasons we have given several times. Nobody is ignoring physical information - we just have the user tell us things in logical space. I'm glad Azure is cooperating. I think we've beaten this issue into the ground, so I (at least) shall now move on. |
Sure, but I had to go through the current discussion to know that some of the mapping/binding defects were addressed in OpenMPI v5.x
I totally agree. Physical numbering does not necessarily expose affinity/proximity. |
Is your feature request related to a problem? Please describe.
I am requesting the following mapping objective: map ranks of MPI+OMP codes as evenly as possible across nodes and spread ranks within nodes as evenly as possible over NUMA domains or L3 caches. This is a show-stopper issue for us.
Also, `L3cache` mapping on AMD Zen4 systems (with 8 cores sharing an L3 cache) is incorrectly treated the same as memory-domain mapping.

Describe the solution you'd like
This is an HPC use case: codes limited by memory BW (or L3 cache capacity) benefit from spreading the ranks evenly over as many memory controllers (L3 caches) as possible. Examples of such codes are large numerical simulations, including finite-difference, finite-element, or finite-volume methods. They traverse large swaths of memory, easily overwhelming L3 capacities (thus limiting the effectiveness of cache blocking in loops), and are almost exclusively DRAM-BW limited. They benefit most from spreading ranks over as many memory controllers as available. In this scenario users want to launch MPI+OMP ranks onto a subset of the available cores so as not to exceed the BW each controller can serve to the ranks mapped to it.
Unfortunately, OpenMPI's mapping logic is missing the capability of spreading the ranks of MPI+OMP codes as evenly as possible over nodes, and then over NUMA domains or L3 caches within nodes. When we use a `--map-by resource:span` or a `--map-by ppr:N:resource:span` clause, the mapper selects the resources in sequence and correctly allocates 1 core (slot) to each rank. However, for MPI+OMP codes we need to allocate as many cores to each rank as there are OMP threads. The syntax for this would be `--map-by resource:PE=n:span` (for n OMP threads), but this bunches all ranks onto the first set of nodes that can satisfy the request instead of spreading them evenly across nodes.

For instance, assume 2 nodes, each with 176 cores, 2 sockets, and 4 memory domains. If we want to map, say, 16 ranks evenly over NUMA domains we can use

mpirun $CLIMPIOPTS --map-by numa:span -np 16 hostname

which correctly maps 16 ranks evenly across nodes and NUMA domains:
```
$ mpirun $CLIMPIOPTS --map-by socket:span -np 16 hostname
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 1 bound to socket 1[core 88[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 2 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 3 bound to socket 1[core 89[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 4 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 5 bound to socket 1[core 90[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 6 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:103408] MCW rank 7 bound to socket 1[core 91[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 8 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 9 bound to socket 1[core 88[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 10 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 11 bound to socket 1[core 89[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 12 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 13 bound to socket 1[core 90[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 14 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000001.p.ussc.az.chevron.net:99972] MCW rank 15 bound to socket 1[core 91[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
```
However, when we have, say, 2 OMP threads per rank, using

mpirun $CLIMPIOPTS --map-by numa:PE=2:span -np 16 hostname

maps all ranks to the first node:
```
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 1 bound to socket 0[core 44[hwt 0]], socket 0[core 45[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 2 bound to socket 1[core 88[hwt 0]], socket 1[core 89[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 3 bound to socket 1[core 132[hwt 0]], socket 1[core 133[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 4 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 5 bound to socket 0[core 46[hwt 0]], socket 0[core 47[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 6 bound to socket 1[core 90[hwt 0]], socket 1[core 91[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 7 bound to socket 1[core 134[hwt 0]], socket 1[core 135[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 9 bound to socket 0[core 48[hwt 0]], socket 0[core 49[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 10 bound to socket 1[core 92[hwt 0]], socket 1[core 93[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 11 bound to socket 1[core 136[hwt 0]], socket 1[core 137[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 12 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 13 bound to socket 0[core 50[hwt 0]], socket 0[core 51[hwt 0]]: [././././././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 14 bound to socket 1[core 94[hwt 0]], socket 1[core 95[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././B/B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[ccnpusc4000000.p.ussc.az.chevron.net:111912] MCW rank 15 bound to socket 1[core 138[hwt 0]], socket 1[core 139[hwt 0]]: [./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][././././././././././././././././././././././././././././././././././././././././././././././././././B/B/./././././././././././././././././././././././././././././././././././.]
```
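For reference, the placement I expect from `numa:PE=2:span` can be written down explicitly. This sketch encodes my assumed semantics (fill nodes with equal rank counts, round-robin ranks over each node's NUMA domains, and only then pack additional ranks within a domain); the function and its parameters are illustrative, not OMPI internals.

```python
def span_pe(n_ranks, pe, n_nodes, numas_per_node, cores_per_numa):
    """Expected 'numa:PE=pe:span' placement: equal rank counts per node,
    round-robin over NUMA domains within a node, pe contiguous cores per rank."""
    ranks_per_node = n_ranks // n_nodes          # assume it divides evenly
    placements = []
    for r in range(n_ranks):
        node, local = divmod(r, ranks_per_node)  # block of ranks per node
        numa = local % numas_per_node            # round-robin over domains
        slot = local // numas_per_node           # then pack within a domain
        first_core = numa * cores_per_numa + slot * pe
        placements.append((node, numa, list(range(first_core, first_core + pe))))
    return placements

# 16 ranks, 2 cores each, on 2 nodes x 4 NUMA domains x 44 cores (the example above):
p = span_pe(16, 2, 2, 4, 44)
print([sum(1 for n, _, _ in p if n == 0),
       sum(1 for n, _, _ in p if n == 1)])   # [8, 8] -- even across nodes
print(p[:3])  # [(0, 0, [0, 1]), (0, 1, [44, 45]), (0, 2, [88, 89])]
```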
Describe alternatives you've considered
The meaningful mapping of MPI+OMP codes for even spreading over all resources should let the `--map-by numa:PE=n:span` clause allocate n cores per rank while otherwise replicating the placement produced by the same clause with `PE=n` absent.

Additional context
This is a show stopper for us, as we cannot control the placement of the ranks in MPI+OMP codes so as to evenly use as many memory controllers as available.
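Until span with PE behaves as requested, one conceivable stopgap is generating an explicit rankfile (mpirun accepts one via `--rankfile`). The sketch below emits the even 2-cores-per-rank layout for the 2-node example above; the hostnames are placeholders, and the exact slot syntax accepted by a given OMPI version should be checked against its mpirun man page.

```python
def make_rankfile(hosts, ranks_per_node, pe, numas_per_node, cores_per_numa):
    """Emit 'rank R=HOST slot=a-b' lines that spread ranks over NUMA domains,
    giving each rank pe contiguous logical cores."""
    lines = []
    rank = 0
    for host in hosts:
        for local in range(ranks_per_node):
            numa = local % numas_per_node        # round-robin over domains
            slot = local // numas_per_node       # then pack within a domain
            first = numa * cores_per_numa + slot * pe
            lines.append(f"rank {rank}={host} slot={first}-{first + pe - 1}")
            rank += 1
    return "\n".join(lines)

# Placeholder hostnames; pass the result with: mpirun --rankfile rf.txt ...
print(make_rankfile(["node0", "node1"], 8, 2, 4, 44))
```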