Can I run multiple DDP jobs on a single node? #697
Comments
I believe the relevant code is:

```python
# if user gave a port number, use that one instead
try:
    default_port = os.environ['MASTER_PORT']
except Exception:
    os.environ['MASTER_PORT'] = str(default_port)
```

so there is support for setting the master port via an environment variable. I believe that should solve the port conflict. For using different GPUs on the same node, that should amount to passing different device ids to the `gpus` argument of the `Trainer`.
Thanks - I was aware of the above, but it is nice to have it confirmed. I thought perhaps PL had a more PL-style solution. I will try a few approaches and share results here.
Yes, resetting the env var MASTER_PORT to something higher than 12910 did the trick. This can be done successfully from the command line or similar; the value is not lost as the procs get spawned. `trainer.proc_rank` is 0-based, so it does not follow the GPU index. There are a few I/O issues, and I haven't sorted out how to gather loss and related statistics from each of the jobs yet.
I think this is going to be a pain no matter what you do without something like SLURM or Docker to isolate GPUs from each other. You might want to try hiding some GPUs from some jobs using the `CUDA_VISIBLE_DEVICES` environment variable.
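A sketch of that idea done from inside the script: the variable has to be set before CUDA is initialized, and the visible devices are re-indexed from 0. `MyModel`, the device ids, and the port are again illustrative:

```python
import os

# Must run before torch/CUDA is initialized, so place this at the very
# top of the second job's training script.
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6'  # physical GPUs for this job
os.environ['MASTER_PORT'] = '12911'           # distinct rendezvous port

from pytorch_lightning import Trainer
from my_project import MyModel  # hypothetical LightningModule

# With CUDA_VISIBLE_DEVICES set, the visible devices are re-indexed
# from 0, so the job addresses them as 0, 1, 2 regardless of physical ids.
trainer = Trainer(gpus=[0, 1, 2], distributed_backend='ddp')
trainer.fit(MyModel())
```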
@sneiman on a single machine you should probably just use DP. Not sure DDP was designed for single-machine use?
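For comparison, a sketch of the DP variant, which stays in a single process and therefore never contends for a master port (same hypothetical `MyModel`, and the same assumption about the `distributed_backend` argument name):

```python
from pytorch_lightning import Trainer
from my_project import MyModel  # hypothetical LightningModule

# DP replicates the model inside one process; concurrent jobs on the
# same node only need disjoint gpu lists, with no MASTER_PORT juggling.
trainer = Trainer(gpus=[0, 1, 2], distributed_backend='dp')
trainer.fit(MyModel())
```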
FWIW - this does work, and is recommended by PyTorch (from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
In my experience it is substantially faster than DP. I didn't do careful testing, but maybe 25%? It is the main memory hog, though.
I am running on a 14-core, 7-GPU machine: Ubuntu 18.04.2 LTS, Python 3.6.8, Lightning 0.5.3.2, no virtual environment, no SLURM.
I have moved a tried-and-true model to DDP. It works great in all scenarios, including DDP as a single invocation.
I cannot successfully start a second one, unfortunately. I get the following failure:
This second job is running on different GPUs and has a different log path.
After a brief investigation, it seems to me that the second job is trying to use the same master address as the first. I did not see any way to alter this with pytorch-lightning, though it seems straightforward in pytorch.
My questions are:
Can I run multiple simultaneous ddp jobs on the same node with different GPUs?
If so, how?
Thanks