
Single node DDP: "Default process group is not initialized" #2254

Closed
s-rog opened this issue Jun 19, 2020 · 16 comments · Fixed by #2257
Labels
bug Something isn't working help wanted Open to be worked on

Comments

@s-rog (Contributor) commented Jun 19, 2020

🐛 Bug

Unable to start single-node DDP training on 0.8.0.

To Reproduce

I was going to run the gpu_template, but ran into #2235. Both methods of running the template result in the same error:

$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp_spawn
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in fit
    self.barrier('fit_prepare_data')
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1261, in barrier
    torch_distrib.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1484, in barrier
    _check_default_pg()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 187, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
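The assertion fires because torch_distrib.barrier() is reached before torch.distributed.init_process_group() has run for this process. A minimal sketch of a defensive guard along the lines of the eventual fix (the function name here is my own, not Lightning's):

```python
import torch.distributed as dist

def barrier_if_initialized():
    """Synchronize ranks only when a default process group exists.

    Calling dist.barrier() before dist.init_process_group() raises
    "Default process group is not initialized"; this guard makes the
    call a harmless no-op in a plain single-process run.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

# Safe even when no process group was ever created:
barrier_if_initialized()
```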
@s-rog s-rog added bug Something isn't working help wanted Open to be worked on labels Jun 19, 2020
@williamFalcon (Contributor) commented Jun 19, 2020

Can you post code to reproduce? Just a minimal example that breaks.

BTW, the GPU template is fixed...

@s-rog (Contributor, Author) commented Jun 19, 2020

Done, let me post my env as well.

@williamFalcon (Contributor)

Ok wait... I think I see it. One sec.

@s-rog (Contributor, Author) commented Jun 19, 2020

I just tested the merged changes with both ddp and ddp_spawn and again got this:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 891, in fit
    self.ddp_train(task, model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 907, in fit
    self.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 441, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'

@williamFalcon (Contributor)

Try again, that was a typo.

@s-rog (Contributor, Author) commented Jun 19, 2020

Cheers, works now!

@armancohan

Still having the "Default process group is not initialized" issue when using trainer.test().

@wukailu commented Jun 23, 2020

> Still having the "Default process group is not initialized" issue when using trainer.test().

I still have this bug as well. One temporary workaround is creating a new single-GPU trainer to do the test.

Like:

trainer = Trainer(gpus=1, deterministic=True, logger=logger)
trainer.model = model
trainer.test()

@armancohan

Right, I know it works on a single GPU. I have a large test set and ideally want faster inference using multiple GPUs.

@zackcarson commented Jul 2, 2020

Can we re-open this issue? I am still having the "Default process group is not initialized" issue when I call trainer.test() with ddp (with any number of GPUs, even 1). I'm using the latest release from yesterday.

@armancohan

+1, doesn't look like the issue is resolved yet.

@williamFalcon williamFalcon reopened this Jul 2, 2020
@jxchen01 commented Jul 4, 2020

Having the same problem... I also tried to downgrade PL to an older version, like 0.7.5, and use the older version to do the inference. But a model trained and saved with 0.8.x does not seem to be directly compatible with the older version.

@channingxiao

Version 0.8.4, trained with ddp: got "Default process group is not initialized" when running trainer.test().

@williamFalcon (Contributor)

Could you try master? This is fixed there.

@zackcarson

Just tried it, it works fine now! Thank you!

@jxchen01

@williamFalcon Trying 0.8.5

Trained with ddp, and testing with ddp, but got the following error message:

AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.

Any idea?

Thanks!
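For what it's worth, that DDP assertion means no parameter in the module requires a gradient, e.g. a fully frozen model loaded only for inference. A standalone sketch of a pre-check (my own illustration, not Lightning internals):

```python
import torch.nn as nn

# A fully frozen model is exactly what triggers the
# DistributedDataParallel assertion quoted above.
model = nn.Linear(4, 2)
for p in model.parameters():
    p.requires_grad_(False)

# DDP requires at least one parameter with requires_grad=True;
# checking up front gives a clearer failure than the assertion.
trainable = any(p.requires_grad for p in model.parameters())
print(trainable)  # False
```

If the check prints False, either unfreeze at least one parameter or run inference without the DDP wrapper.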

7 participants