benchmark subprocess vs spawn #5772
Using ddp_spawn (with multiple GPUs on one node), I observed that: a) the batch size cannot be set as large as with ddp; b) under ddp, GPU memory usage can go as high as 90% (21 of 24 GB on each GPU), but with ddp_spawn I always get a CUDA out-of-memory error once usage exceeds about 12 of 24 GB per GPU; c) with ddp_spawn, my training tends to crash after 7-8 epochs (num_workers=3); d) overall mid-training GPU utilization is lower, around 65% with ddp_spawn vs. roughly 85% with ddp.
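For context, a minimal sketch of the kind of comparison described above, assuming a Lightning 1.x-era API (`gpus=` / `accelerator=`; newer releases use `devices=` / `strategy=`). The model, dataset, batch size, and GPU count are placeholders, since the actual training script was not shared:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Placeholder model; the real training script was not disclosed."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_loader(num_workers: int = 3) -> DataLoader:
    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    return DataLoader(data, batch_size=64, num_workers=num_workers)


if __name__ == "__main__":
    # Variant A: subprocess-launched DDP
    # (reported above: ~85% GPU utilization, up to ~21 of 24 GB usable per GPU)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=10)

    # Variant B: spawn-based DDP
    # (reported above: OOM beyond ~12 of 24 GB, ~65% utilization,
    #  crashes after 7-8 epochs with num_workers=3)
    # trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn", max_epochs=10)

    trainer.fit(ToyModel(), make_loader(num_workers=3))
```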
@BlockWaving Thanks for this information. Do you have a script to reproduce these benchmarks?
@justusschock I can't disclose the detailed script due to company policy, but you can see the Trainer setup snippets in the new thread I opened today. I've been forced to use ddp_spawn instead of ddp because of the replica errors with ddp mode. This afternoon the training with ddp_spawn got stuck again at epoch 2, at 84%. Please feel free to let me know if you have further questions.
@justusschock check out #5894
Hi @justusschock, do we have any updates on this?
A while back we replaced ddp .spawn with a subprocess-based launch due to issues with .spawn and spawning multiple worker processes in the dataloader: #2029
Are there still performance issues using spawn? If these are fixed, we can change the messaging in our docs (https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel-spawn)
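For readers following along, a rough conceptual contrast between the two launch mechanisms discussed above. This is not Lightning's actual internal code, and the function names are made up for illustration:

```python
import subprocess
import sys

import torch.multiprocessing as mp


def train_fn(rank: int, world_size: int) -> None:
    # Placeholder per-process training entry point.
    print(f"rank {rank} of {world_size}: running training loop")


def launch_with_spawn(world_size: int) -> None:
    # ddp_spawn-style: a single parent process forks `world_size` workers via
    # torch.multiprocessing. Everything handed to the workers must be picklable,
    # and each worker's DataLoader starts its own num_workers processes on top.
    mp.spawn(train_fn, args=(world_size,), nprocs=world_size)


def launch_with_subprocess(world_size: int) -> None:
    # ddp-style: re-execute the training script once per additional device, so
    # each rank starts from a clean interpreter (rank 0 is the current process).
    procs = [
        subprocess.Popen([sys.executable, sys.argv[0], f"--rank={rank}"])
        for rank in range(1, world_size)
    ]
    for p in procs:
        p.wait()
```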