Unable to launch multiple GPU nodes #2578
Comments
Hi! Thanks for your contribution, great first issue!
@williamFalcon was the assertion maybe supposed to go somewhere else?
@mortonjt does simply removing the assertion solve the problem for you? (Sorry, I can't test it myself since I don't have multi-node.)
I've commented out the offending line, and now it's been hanging for 16 min (the single-node version is able to boot within 1 min).
For multi-node you have to set the master port yourself, since the port can't be random: all the nodes need to know where to connect. For example: `MASTER_PORT=1234 MASTER_ADDRESS=some.ip python main.py ...`
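A minimal sketch (not from the thread) of one way to pin these values from inside the launch script instead of on the command line; the hostname and port below are placeholders, and `MASTER_ADDR`/`MASTER_PORT` are the environment variables PyTorch's `env://` rendezvous reads:

```python
# Hedged sketch: pin the rendezvous address and port before the Trainer starts.
# The hostname and port are placeholders; use the rank-0 node and a free port.
import os

os.environ.setdefault("MASTER_ADDR", "node001.cluster")  # rank-0 node's hostname/IP
os.environ.setdefault("MASTER_PORT", "1234")             # same free port on every node
```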
This fails first because python needs to be called by `srun`. The second failure is seemingly because of how Lightning parses the SLURM environment variables; I'm not sure of the most portable way to extract hosts from the SLURM node list.
Also, it seems that multi-node testing is not supported, right?
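A hedged sketch (not part of the thread) of one fairly portable way to do the host extraction mentioned above, assuming `scontrol` is on the PATH and the scheduler exports `SLURM_JOB_NODELIST`:

```python
# Hedged sketch: expand a compressed SLURM node list such as "node[001-004]"
# into individual hostnames and treat the first one as the master address.
import os
import subprocess

nodelist = os.environ["SLURM_JOB_NODELIST"]
hosts = subprocess.check_output(
    ["scontrol", "show", "hostnames", nodelist], text=True
).splitlines()

master_addr = hosts[0]  # rank-0 / master node
print(master_addr, len(hosts))
```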
Even with the edits suggested by @blackwer, the problem persists.
I am facing the same problem. Even if I use MASTER_PORT or comment out the assertion line, the problem remains.
Are there any updates on this? I'm having the same problem.
Just as a heads up, I took @blackwer's solution and now have a multi-GPU example working on SLURM.
🐛 Bug
I'm having trouble launching multiple GPU nodes with pytorch-lightning-0.8.5-dev. I'm getting the following error
To Reproduce
I've set up my model similar to the following:
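A hypothetical minimal sketch of this kind of setup (the module, data, and node/GPU counts are placeholders, and it assumes the pytorch-lightning 0.8.x-era `Trainer` arguments):

```python
# Hedged sketch, not the reporter's actual code: a toy LightningModule trained
# across 2 nodes x 4 GPUs with the DDP backend (pytorch-lightning 0.8.x-style API).
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
        return DataLoader(data, batch_size=32)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=4,                      # GPUs per node
        num_nodes=2,                 # number of SLURM nodes
        distributed_backend="ddp",   # DistributedDataParallel across nodes
    )
    trainer.fit(LitModel())
```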
Environment
Output of `python collect_env_details.py`
Additional context
Only 4 out of 8 GPUs are recognized.
I'm curious why the assert statement is there.