ValueError: host not found: Name or service not known in _env_rendezvous_handler #1542
Comments
Hi! Thanks for your contribution! Great first issue!
@newwhitecheng I am also facing the same issue. Do you have any update or solution?
My best guess is that NCCL is not installed properly, but I'm not certain, since I don't have root access to install it.
@newwhitecheng @mmiakashs Single node works fine, but multi-node results in the exact same error. I've set I've put:
and got:
I'm trying to use 2 nodes with 1 GPU each. Btw, when I don't set |
Found the problem: there was a '0' added to the node name. It's even visible in the output I posted above. I don't know why it happened; I'll look into it and find the line responsible for the '0' in the master_addr.
I updated to PL 0.8.0 and the additional '0' in the node name is no longer there.
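For anyone hitting the same "host not found" error, here is a minimal sketch of a pre-flight check, assuming the rendezvous address is taken from the `MASTER_ADDR` environment variable. The helper name `master_addr_resolves` is made up for illustration and is not part of PyTorch or Lightning. It only checks that the name resolves on the current node, which is enough to catch a mangled hostname like the one with the extra '0'.

```python
import os
import socket


def master_addr_resolves(addr: str) -> bool:
    """Return True if `addr` resolves to at least one address on this node.

    Hypothetical helper for illustration only.
    """
    try:
        socket.getaddrinfo(addr, None)
    except socket.gaierror:
        # Same class of failure as "Name or service not known".
        return False
    return True


if __name__ == "__main__":
    # Assumption: MASTER_ADDR was exported by the launch script.
    addr = os.environ.get("MASTER_ADDR")
    if addr is None:
        print("MASTER_ADDR is not set")
    elif master_addr_resolves(addr):
        print(f"MASTER_ADDR={addr!r} resolves")
    else:
        print(f"MASTER_ADDR={addr!r} does NOT resolve; check the node name")
```

Running this on every node before starting training should show whether the address each process sees is a real hostname.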
@isekulic is it working now?
@mmiakashs yes, this part seems OK now. However, I'm facing a separate issue: My current setup is quite cluster-dependent, but it might be helpful anyway, so I'll post the Slurm script:
To clarify, my cluster has only 1 GPU per node. Relevant part of Trainer:
I hope it helps. Make sure to check |
Sounds solved! We can reopen otherwise.
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Code sample
Expected behavior
I was following the README in the basic_examples folder. The single-node example runs fine, but the multi-node example fails with this error.
Environment