Using the Trainer class more than once fails with "Address already in use" with the DDP backend #2537
Comments
Your problem seems similar to #401.
You mean something like
    os.environ['MASTER_PORT'] = "44513"
    train()
    os.environ['MASTER_PORT'] = "44514"
    train()
I don't get the
Check master: PR #2512 chooses the port randomly. I think that would solve your issue, but I'm not sure :)
Using the latest master (and latest PyTorch) I now get yet another, different error:
Are you sure that the socket used for ddp is properly closed when the
I also meet
Not really a solution but maybe some hints towards the bug: I noticed that this seems to be an issue with
This also seems to be the root cause of an issue when trying to do LR find on distributed compute.
Fixed it in #2790
Thanks! Just to confirm, is the crux of the fix here the
Yes! The PR solves many interconnected issues, but one case is the following situation:
The solution is to find a new free port and connect all processes to it. (This applies only to single-node training.)
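For readers wondering what "find a new free port" can look like in practice, here is a minimal sketch. It is not the code from #2790; the helper names `find_free_port` and `set_fresh_master_port` are made up for illustration. It relies only on the standard trick of binding a socket to port 0 so the OS picks an unused port, and on the `MASTER_PORT` environment variable mentioned earlier in this thread.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port on localhost that is currently unused."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 tells the OS to pick any free port.
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def set_fresh_master_port() -> int:
    """Hypothetical helper: store a free port in MASTER_PORT and return it."""
    port = find_free_port()
    os.environ["MASTER_PORT"] = str(port)
    return port


if __name__ == "__main__":
    print("MASTER_PORT set to", set_fresh_master_port())
```

Note that the socket is closed before the port is reused, so another process could in principle grab the port in between. All processes of the run also have to see the same value, which is why this kind of approach is limited to a single node.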
@JanSellner @dthiagarajan Just a quick follow-up on why this issue got closed. For your use case it means:
It is a tradeoff between these two backends; both have their advantages and disadvantages, as outlined in the docs. (And ignore my previous post here, #2537 (comment).)
Hi, I am still stuck here. So the final solution is to change 'ddp' to 'ddp_spawn' mode? I got stuck when using ddp to fit k-fold trainers.
Exactly.
I often meet this issue when using
🐛 Bug
It is not possible to create and use the Trainer class more than once with the DDP backend, since the program crashes the second time with RuntimeError: Address already in use.
To Reproduce
Here is a minimal code example which reproduces the issue:
Expected behavior
It should be possible to use the Trainer object multiple times. In my case, it breaks my k-fold validation loop since I create a new Trainer for each fold.
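To make the k-fold pattern concrete, here is a rough sketch of the loop shape. This is not the reporter's reproduction code: `kfold_indices`, `run_kfold` and the `fit_fold` callback are hypothetical names, and the callback stands in for "create a new Trainer and fit it on this fold", which is the step that fails on the second iteration under DDP.

```python
from typing import Callable, List, Tuple


def kfold_indices(n_samples: int, n_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into n_folds contiguous (train, val) index pairs."""
    # The first (n_samples % n_folds) folds get one extra sample.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    splits = []
    start = 0
    for size in sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        splits.append((train, val))
        start = stop
    return splits


def run_kfold(n_samples: int, n_folds: int,
              fit_fold: Callable[[int, List[int], List[int]], float]) -> List[float]:
    """Call fit_fold once per fold and collect whatever it returns."""
    results = []
    for fold, (train_idx, val_idx) in enumerate(kfold_indices(n_samples, n_folds)):
        # In the setup described above, a fresh Trainer would be created here
        # and fitted on this fold's data (assumption: one Trainer per fold).
        results.append(fit_fold(fold, train_idx, val_idx))
    return results
```

With a real training function plugged in as `fit_fold`, every iteration after the first is where the "Address already in use" error would show up.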
Environment