ddp: trainer.test failure #2133
I am also getting this issue in a cross-validation (CV) loop.
I think I have a functioning workaround. As the message describes, the problem is that the processes are initialized a second time, but a simple check can avoid this error. Changing this line:

https://github.com/PyTorchLightning/pytorch-lightning/blob/7245e48153909d9de8458b1f5b8b2bc740d80104/pytorch_lightning/trainer/distrib_data_parallel.py#L429

to this:

```python
if not torch.distributed.is_initialized():
    model.init_ddp_connection(self.proc_rank, self.world_size, self.is_slurm_managing_tasks)
```

seems to get my CV loop to work. Happy to open a PR if the workaround looks ok.
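For readers outside Lightning, here is a minimal sketch of the same idempotent-init pattern in plain PyTorch. The helper name and the env-var defaults are illustrative assumptions, not part of Lightning's API:

```python
import os
import torch.distributed as dist

def init_process_group_once(rank: int, world_size: int) -> None:
    # Guarding on is_initialized() makes repeated trainer runs (e.g. the
    # folds of a cross-validation loop) reuse the existing process group
    # instead of raising "trying to initialize the default process group twice".
    if not dist.is_initialized():
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # illustrative defaults
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group(
            backend="nccl",
            rank=rank,
            world_size=world_size,
        )
```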
Maybe this isn't as straightforward as I thought. After some time, one of my DataLoader processes aborts and gives this error:

I'm not sure if this is related to the default timeout defined here: https://github.com/pytorch/pytorch/blob/master/torch/distributed/constants.py
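If that default (`timedelta(minutes=30)` in `torch/distributed/constants.py`) is indeed the culprit, one possible mitigation, sketched here as an untested assumption rather than a confirmed fix, is to pass an explicit timeout when the process group is created:

```python
import datetime
import torch.distributed as dist

# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set (env:// init).
# The two-hour value is illustrative, not a recommendation. Note that the
# gloo backend honors this timeout directly; for NCCL it is only enforced
# when blocking-wait error handling is enabled via NCCL_BLOCKING_WAIT=1.
dist.init_process_group(
    backend="gloo",
    timeout=datetime.timedelta(hours=2),
)
```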
I'm also having this same issue on the latest version!
I am also affected by this issue.
Maybe william fixed this in #2512.
Fixed in 0.8.5!
Versions:

With or without fp16, `trainer.test(model)` fails with the error discussed above (the default process group being initialized a second time).
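For reference, a minimal sketch of the failing pattern (the module is a placeholder, and `distributed_backend` was the Trainer flag in the 0.8.x versions this thread concerns):

```python
import pytorch_lightning as pl

model = MyLightningModule()  # placeholder for any LightningModule
trainer = pl.Trainer(gpus=2, distributed_backend="ddp")
trainer.fit(model)
trainer.test(model)  # before 0.8.5 this triggered a second DDP init and failed
```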