ddp: trainer.test failure #2133
I am also getting this issue in a cross-validation (CV) loop.
I think I have a functioning workaround. As the message describes, the problem is that the processes are initialized a second time, but a simple check can avoid this error. Changing this line:

https://github.com/PyTorchLightning/pytorch-lightning/blob/7245e48153909d9de8458b1f5b8b2bc740d80104/pytorch_lightning/trainer/distrib_data_parallel.py#L429

to this:

```python
if not torch.distributed.is_initialized():
    model.init_ddp_connection(self.proc_rank, self.world_size, self.is_slurm_managing_tasks)
```

seems to get my CV loop to work. Happy to open a PR if the workaround looks ok.
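For readers outside Lightning, here is a minimal sketch of the same idempotent-init pattern in plain PyTorch. The helper name and the env-var defaults are illustrative assumptions, not part of Lightning's API:

```python
import os
import torch.distributed as dist

def init_process_group_once(rank: int, world_size: int) -> None:
    # Guarding on is_initialized() makes repeated trainer runs (e.g. the
    # folds of a cross-validation loop) reuse the existing process group
    # instead of raising "trying to initialize the default process group twice".
    if not dist.is_initialized():
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # illustrative defaults
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group(
            backend="nccl",
            rank=rank,
            world_size=world_size,
        )
```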
Maybe this isn't as straightforward as I thought. After some time, one of my DataLoader processes aborts and gives this error:

I'm not sure if this is related to the default timeout defined here: https://github.com/pytorch/pytorch/blob/master/torch/distributed/constants.py
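If that default (`timedelta(minutes=30)` in `torch/distributed/constants.py`) is indeed the culprit, one possible mitigation, sketched here as an untested assumption rather than a confirmed fix, is to pass an explicit timeout when the process group is created:

```python
import datetime
import torch.distributed as dist

# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set (env:// init).
# The two-hour value is illustrative, not a recommendation. Note that the
# gloo backend honors this timeout directly; for NCCL it is only enforced
# when blocking-wait error handling is enabled via NCCL_BLOCKING_WAIT=1.
dist.init_process_group(
    backend="gloo",
    timeout=datetime.timedelta(hours=2),
)
```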
I'm also having this same issue on the latest version!
I am also affected by this issue.
Maybe william fixed this in #2512.
Fixed in 0.8.5!
Versions:

With or without fp16, `trainer.test(model)` fails with the error discussed above (the default process group being initialized a second time).
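For reference, a minimal sketch of the failing pattern (the module is a placeholder, and `distributed_backend` was the Trainer flag in the 0.8.x versions this thread concerns):

```python
import pytorch_lightning as pl

model = MyLightningModule()  # placeholder for any LightningModule
trainer = pl.Trainer(gpus=2, distributed_backend="ddp")
trainer.fit(model)
trainer.test(model)  # before 0.8.5 this triggered a second DDP init and failed
```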