Hi
I'm trying to run mmdetection on Slurm using a pyxis container and I'm getting this error:
dist.init_process_group(backend=backend)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Temporary failure in name resolution
I investigated the problem and I think it is caused by the scontrol command, which is not available inside the pyxis container. I tried a simple script with this line:
addr = subprocess.getoutput(
f'scontrol show hostname {node_list} | head -n1')
which results in: /bin/dash: 1: scontrol: not found
What is the purpose of this line?
Is there an alternative that does not use the scontrol command?
minooei changed the title from "scontrol: not found in slurm" to "scontrol: not found in slurm using pyxis container" on Dec 8, 2020
Hi @minooei
Thanks for pointing this out!
Would you like to create a PR to fix it?
As for scontrol, we use it on our own Slurm cluster, but other Slurm clusters may be configured differently. Your proposal looks good to me.
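For anyone hitting the same error, here is a minimal sketch of one way to avoid scontrol. In the snippet from the question, `scontrol show hostname` expands a compact node list such as `node[01-04]` into one hostname per line and `head -n1` keeps the first one. Judging from the traceback, that name is then presumably used as the master address. Note that `subprocess.getoutput` returns the shell's error text instead of raising, so when scontrol is missing, `addr` likely ends up holding the "not found" message, which would explain the name resolution failure. The helper name `first_slurm_host` is hypothetical, and the sketch assumes the node list uses the common single-bracket form; it does not guarantee that the resulting hostname resolves inside the container.

```python
import os
import re


def first_slurm_host(node_list: str) -> str:
    """Return the first hostname in a compact Slurm node list.

    Hypothetical helper, not part of mmdetection or Slurm. Handles forms
    such as 'node01', 'node01,node02' and 'node[01-04,07],gpu3'.
    """
    # Cut at the first comma that is outside square brackets.
    depth = 0
    end = len(node_list)
    for i, ch in enumerate(node_list):
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        elif ch == ',' and depth == 0:
            end = i
            break
    first = node_list[:end]
    # Replace each bracket group with its first index: '[01-04,07]' -> '01'.
    return re.sub(r'\[([^\],-]+)[^\]]*\]', r'\1', first)


# Assumption: SLURM_NODELIST is set inside the job, as in the original script.
node_list = os.environ.get('SLURM_NODELIST', '')
addr = first_slurm_host(node_list)
```

The value of `addr` could then replace the output of the scontrol pipeline. Parsing the environment variable in Python keeps the script independent of the Slurm client tools being installed in the container image.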