benchmark subprocess vs spawn #5772
Using ddp_spawn (with multiple GPUs on one node), I observed that: a) the batch size cannot be set as large as with ddp; b) under ddp, GPU memory usage can go as high as 90% (21 of 24 GB on each GPU), but with ddp_spawn I always get a CUDA out-of-memory error once usage exceeds about 12 of 24 GB per GPU; c) with ddp_spawn, my training tends to crash after 7-8 epochs (num_workers=3); d) overall mid-training GPU utilization is lower, around 65% with ddp_spawn vs. roughly 85% with ddp.
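For context, a minimal sketch of the kind of comparison described above, assuming a Lightning 1.x-era API (`gpus=` / `accelerator=`; newer releases use `devices=` / `strategy=`). The model, dataset, batch size, and GPU count are placeholders, since the actual training script was not shared:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Placeholder model; the real training script was not disclosed."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_loader(num_workers: int = 3) -> DataLoader:
    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    return DataLoader(data, batch_size=64, num_workers=num_workers)


if __name__ == "__main__":
    # Variant A: subprocess-launched DDP
    # (reported above: ~85% GPU utilization, up to ~21 of 24 GB usable per GPU)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=10)

    # Variant B: spawn-based DDP
    # (reported above: OOM beyond ~12 of 24 GB, ~65% utilization,
    #  crashes after 7-8 epochs with num_workers=3)
    # trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn", max_epochs=10)

    trainer.fit(ToyModel(), make_loader(num_workers=3))
```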
@BlockWaving Thanks for this information. Do you have a script to reproduce these benchmarks?
@justusschock I can't disclose the detailed script due to company policy, but you can see the Trainer setup snippets in the new thread I opened today. I've been forced to use ddp_spawn instead of ddp because of the replica errors with ddp mode. This afternoon the training with ddp_spawn got stuck again at epoch 2, at 84%. Please feel free to let me know if you have further questions.
@justusschock check out #5894
Hi @justusschock, do we have any updates on this?
A while back we replaced ddp .spawn with a subprocess-based launch due to issues with .spawn and spawning multiple worker processes in the dataloader: #2029
Are there still performance issues using spawn? If these are fixed, we can change the messaging in our docs (https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#distributed-data-parallel-spawn)
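For readers following along, a rough conceptual contrast between the two launch mechanisms discussed above. This is not Lightning's actual internal code, and the function names are made up for illustration:

```python
import subprocess
import sys

import torch.multiprocessing as mp


def train_fn(rank: int, world_size: int) -> None:
    # Placeholder per-process training entry point.
    print(f"rank {rank} of {world_size}: running training loop")


def launch_with_spawn(world_size: int) -> None:
    # ddp_spawn-style: a single parent process forks `world_size` workers via
    # torch.multiprocessing. Everything handed to the workers must be picklable,
    # and each worker's DataLoader starts its own num_workers processes on top.
    mp.spawn(train_fn, args=(world_size,), nprocs=world_size)


def launch_with_subprocess(world_size: int) -> None:
    # ddp-style: re-execute the training script once per additional device, so
    # each rank starts from a clean interpreter (rank 0 is the current process).
    procs = [
        subprocess.Popen([sys.executable, sys.argv[0], f"--rank={rank}"])
        for rank in range(1, world_size)
    ]
    for p in procs:
        p.wait()
```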