can i run multiple ddp jobs on single node #697

Closed
sneiman opened this issue Jan 17, 2020 · 6 comments · Fixed by #1010
Labels
bug (Something isn't working) · feature (Is an improvement or enhancement)

Comments

@sneiman
Contributor

sneiman commented Jan 17, 2020

I am running on a 14-core, 7-GPU machine: Ubuntu 18.04.2 LTS, Python 3.6.8, lightning 0.5.3.2, no virtual environment, no SLURM.

I have moved a tried-and-true model to ddp. It works great in all scenarios, including ddp as a single invocation.

I cannot successfully start a second one, unfortunately. I get the following failure:

  File "/home/seth/.local/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use

This second job is running on different GPUs and has a different log path.
After a brief investigation, it seems to me that the second job is trying to use the same master address as the first. I did not see any way to alter this with pytorch-lightning, though it seems straightforward in pytorch.

My questions are:
Can I run multiple simultaneous ddp jobs on the same node with different GPUs?
If so, how?

Thanks

sneiman added the question label Jan 17, 2020
@shoarora
Contributor

I believe the init_ddp_connection() function is what handles setting the master address.

# if user gave a port number, use that one instead
try:
    default_port = os.environ['MASTER_PORT']
except Exception:
    os.environ['MASTER_PORT'] = str(default_port)

so there is support for setting the master port via the MASTER_PORT environment variable. I believe that should solve the "Address already in use" issue.

For using different GPUs on the same node, that should amount to passing different device IDs to the gpus kwarg in Trainer.
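
For example, here is a minimal sketch of what the second job could look like (not from this thread; it assumes the Trainer of that era accepts the gpus and distributed_backend kwargs, and MyLightningModule and the port number are made up):

import os
import pytorch_lightning as pl

# Pick a rendezvous port the first job is not using, before the Trainer
# spawns its DDP processes (init_ddp_connection reads MASTER_PORT, as quoted above).
os.environ['MASTER_PORT'] = '12911'   # hypothetical free port

model = MyLightningModule()           # hypothetical LightningModule
trainer = pl.Trainer(
    gpus=[4, 5, 6],                   # device IDs not claimed by the first job
    distributed_backend='ddp',
)
trainer.fit(model)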

@sneiman
Contributor Author

sneiman commented Jan 17, 2020

Thanks - I was aware of the above, but it is nice to have it confirmed. I thought perhaps PL had a more PL-style solution. I will try a few approaches and share results here.

@sneiman
Contributor Author

sneiman commented Jan 18, 2020

Yes, resetting the env var MASTER_PORT to something higher than 12910 did the trick. This can be done from the command line or similar - the setting is not lost as the procs get spawned.

trainer.proc_rank is 0-based, so it does not follow the GPU index. There are a few I/O issues, and I haven't sorted out how to gather loss and related statistics from each of the jobs yet.

@neggert
Contributor

neggert commented Jan 21, 2020

I think this is going to be a pain no matter what you do without something like Slurm or Docker to isolate GPUs from each other. You might want to try hiding some GPUs from some jobs using the CUDA_VISIBLE_DEVICES environment variable. I think that, plus adjusting MASTER_PORT, should do what you want.
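
For example (a sketch with made-up values; both variables have to be set before CUDA or the rendezvous store is initialized, and the same thing can be done by exporting them in the shell before launching the training script):

import os

# Second job only: mask off the GPUs the first job is using and pick a free
# rendezvous port. The values here are illustrative.
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6'   # this process only sees physical GPUs 4-6
os.environ['MASTER_PORT'] = '12911'            # must differ from the first job's port

# With the mask in place, device indices become relative, so gpus=3 (or
# gpus=[0, 1, 2]) in the Trainer now maps to physical GPUs 4, 5 and 6.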

@williamFalcon
Contributor

@sneiman on a single machine you probably should just use DP. I'm not sure DDP was designed for single-machine use?

@sneiman
Contributor Author

sneiman commented Jan 21, 2020

FWIW - this does work, and is recommended by PyTorch (from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html):

Thus, even for single machine training, where your data is small enough to fit on a single machine, DistributedDataParallel is expected to be faster than DataParallel. DistributedDataParallel also replicates models upfront instead of on each iteration and gets Global Interpreter Lock out of the way.

In my experience it is substantially faster. I didn't do careful testing, but maybe 25%? It is a main-memory hog, though.

Borda added the bug and feature labels and removed the question label Dec 23, 2020