scontrol: not found in slurm using pyxis container #700

Closed
minooei opened this issue Dec 8, 2020 · 3 comments

@minooei
Contributor

minooei commented Dec 8, 2020

Hi,
I'm trying to run mmdetection on Slurm inside a pyxis container and I get this error:

  dist.init_process_group(backend=backend)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Temporary failure in name resolution

I investigated the problem, and I think it is related to the scontrol command, which is not available in the pyxis container. I tried a simple script with this line:

addr = subprocess.getoutput(
        f'scontrol show hostname {node_list} | head -n1')

which results in: /bin/dash: 1: scontrol: not found

What is the purpose of this line?
Is there an alternative that does not use the scontrol command?
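
For context, the scontrol line above appears to take the first hostname from the Slurm node list so it can serve as the master address. One possible alternative is sketched below, under assumptions: the helper name `first_host_from_nodelist` is hypothetical, and the parser only handles simple node lists such as `node[01-03,05]` or `gpu1,gpu2` (one bracket group per entry). It is not a full replacement for `scontrol show hostname`.

```python
import re


def first_host_from_nodelist(node_list):
    """Return the first hostname of a simple Slurm node list.

    Hypothetical helper. Assumes at most one bracket group in the first
    entry, e.g. 'node[01-03,05]'; more complex hostlist syntax is not handled.
    """
    match = re.match(r'^([^\[,]+)(?:\[([^\]]+)\])?([^,\[]*)', node_list.strip())
    if match is None:
        raise ValueError(f'cannot parse node list: {node_list!r}')
    prefix, ranges, suffix = match.groups()
    if ranges is None:
        # Plain hostname such as 'gpu1' (possibly followed by ',gpu2').
        return prefix
    # Take the first item of the bracket group and the start of its range.
    first_index = ranges.split(',')[0].split('-')[0]
    return prefix + first_index + suffix
```

With such a helper, the address could be derived from the node list string without spawning a subprocess inside the container.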

@minooei minooei changed the title scontrol: not found in slurm scontrol: not found in slurm using pyxis container Dec 8, 2020
@minooei
Contributor Author

minooei commented Dec 8, 2020

After commenting out this line, my code works as intended.

Maybe it's better to check os.environ['MASTER_ADDR'] before the assignment, like line 62:

    if 'MASTER_ADDR' not in os.environ:
        os.environ['MASTER_ADDR'] = addr
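
The check proposed here could be combined with the scontrol lookup so that scontrol is only called when MASTER_ADDR is not already set. The following is a minimal sketch, not the code from the eventual PR; the function name `ensure_master_addr` and the `env` parameter are hypothetical and exist only to make the example self-contained.

```python
import os
import shutil
import subprocess


def ensure_master_addr(node_list, env=None):
    """Return MASTER_ADDR, resolving it via scontrol only if it is unset.

    Hypothetical helper: `env` defaults to os.environ and is a parameter
    only so the behavior can be checked without touching the real environment.
    """
    env = os.environ if env is None else env
    if 'MASTER_ADDR' in env:
        # Respect an address provided by the launcher or the user.
        return env['MASTER_ADDR']
    if shutil.which('scontrol') is None:
        raise RuntimeError(
            'scontrol not found; set MASTER_ADDR before starting the job')
    result = subprocess.run(
        ['scontrol', 'show', 'hostname', node_list],
        stdout=subprocess.PIPE, universal_newlines=True, check=True)
    hosts = result.stdout.splitlines()
    if not hosts:
        raise RuntimeError('scontrol returned no hostnames')
    env['MASTER_ADDR'] = hosts[0]  # first node acts as the master
    return hosts[0]
```

In a container without scontrol, exporting MASTER_ADDR before launching would then be enough for the job to start.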

@xvjiarui
Collaborator

Hi @minooei,
Thanks for pointing this out!
Would you like to create a PR to fix it?
As for scontrol, we use it on our own Slurm cluster. However, other Slurm clusters may be configured differently. Your proposal looks good to me.

@minooei
Contributor Author

minooei commented Dec 11, 2020

Yes, sure. I submitted a PR.
