can i run multiple ddp jobs on single node #697

Closed
sneiman opened this issue Jan 17, 2020 · 6 comments · Fixed by #1010
Labels
bug (Something isn't working) · feature (Is an improvement or enhancement)

Comments

@sneiman
Contributor

sneiman commented Jan 17, 2020

I am running on a 14-core, 7-GPU machine: Ubuntu 18.04.2 LTS, Python 3.6.8, lightning 0.5.3.2, no virtual environment, no SLURM.

I have moved a tried-and-true model to ddp. It works great in all scenarios, including ddp as a single invocation.

I cannot successfully start a second one, unfortunately. I get the following failure:

  File "/home/seth/.local/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use

This second job is running on different GPUs and has a different log path.
After a brief investigation, it seems to me that the second job is trying to use the same master address as the first. I did not see any way to alter this with pytorch-lightning, though it seems straightforward in pytorch.

My questions are:
Can I run multiple simultaneous ddp jobs on the same node with different GPUs?
If so, how?

Thanks

sneiman added the question label Jan 17, 2020
@shoarora
Contributor

I believe the init_ddp_connection() function is what handles setting the master address.

# if user gave a port number, use that one instead
try:
    default_port = os.environ['MASTER_PORT']
except Exception:
    os.environ['MASTER_PORT'] = str(default_port)

so there is support for setting the master port via the MASTER_PORT environment variable. I believe that should solve the "Address already in use" issue.

For using different GPUs on the same node, that should amount to passing different device IDs to the gpus kwarg in Trainer.
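
For example, here is a minimal sketch of what the second job could look like (not from this thread; it assumes the Trainer of that era accepts the gpus and distributed_backend kwargs, and MyLightningModule and the port number are made up):

import os
import pytorch_lightning as pl

# Pick a rendezvous port the first job is not using, before the Trainer
# spawns its DDP processes (init_ddp_connection reads MASTER_PORT, as quoted above).
os.environ['MASTER_PORT'] = '12911'   # hypothetical free port

model = MyLightningModule()           # hypothetical LightningModule
trainer = pl.Trainer(
    gpus=[4, 5, 6],                   # device IDs not claimed by the first job
    distributed_backend='ddp',
)
trainer.fit(model)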

@sneiman
Contributor Author

sneiman commented Jan 17, 2020

Thanks - I was aware of the above, but it is nice to have it confirmed. I thought perhaps PL had a more PL-style solution. I will try a few approaches and share results here.

@sneiman
Contributor Author

sneiman commented Jan 18, 2020

Yes, resetting the env var MASTER_PORT to something higher than 12910 did the trick. This can be done from the command line or similar - the setting is not lost as the procs get spawned.

trainer.proc_rank is 0-based, so it does not follow the GPU index. There are a few I/O issues, and I haven't sorted out how to gather loss and related statistics from each of the jobs yet.

@neggert
Contributor

neggert commented Jan 21, 2020

I think this is going to be a pain no matter what you do without something like Slurm or Docker to isolate GPUs from each other. You might want to try hiding some GPUs from some jobs using the CUDA_VISIBLE_DEVICES environment variable. I think that, plus adjusting MASTER_PORT, should do what you want.
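
For example (a sketch with made-up values; both variables have to be set before CUDA or the rendezvous store is initialized, and the same thing can be done by exporting them in the shell before launching the training script):

import os

# Second job only: mask off the GPUs the first job is using and pick a free
# rendezvous port. The values here are illustrative.
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6'   # this process only sees physical GPUs 4-6
os.environ['MASTER_PORT'] = '12911'            # must differ from the first job's port

# With the mask in place, device indices become relative, so gpus=3 (or
# gpus=[0, 1, 2]) in the Trainer now maps to physical GPUs 4, 5 and 6.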

@williamFalcon
Contributor

@sneiman on a single machine you probably should just use DP. I'm not sure DDP was designed for single-machine use?

@sneiman
Contributor Author

sneiman commented Jan 21, 2020

FWIW - this does work, and is recommended by PyTorch (from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html):

Thus, even for single machine training, where your data is small enough to fit on a single machine, DistributedDataParallel is expected to be faster than DataParallel. DistributedDataParallel also replicates models upfront instead of on each iteration and gets Global Interpreter Lock out of the way.

In my experience it is substantially faster. I didn't do careful testing, but maybe 25%? It is a main-memory hog, though.

Borda added the bug and feature labels and removed the question label Dec 23, 2020