
Single node DDP: "Default process group is not initialized" #2254

Closed
s-rog opened this issue Jun 19, 2020 · 16 comments · Fixed by #2257
Labels
bug Something isn't working help wanted Open to be worked on

Comments

@s-rog (Contributor) commented Jun 19, 2020

🐛 Bug

Unable to start single-node DDP training on 0.8.0.

To Reproduce

I was going to run the gpu_template, but ran into #2235. Both methods of running the template result in the same error:

$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp_spawn
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in fit
    self.barrier('fit_prepare_data')
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1261, in barrier
    torch_distrib.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1484, in barrier
    _check_default_pg()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 187, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
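The assertion fires because torch_distrib.barrier() is reached before torch.distributed.init_process_group() has run for this process. A minimal sketch of a defensive guard along the lines of the eventual fix (the function name here is my own, not Lightning's):

```python
import torch.distributed as dist

def barrier_if_initialized():
    """Synchronize ranks only when a default process group exists.

    Calling dist.barrier() before dist.init_process_group() raises
    "Default process group is not initialized"; this guard makes the
    call a harmless no-op in a plain single-process run.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

# Safe even when no process group was ever created:
barrier_if_initialized()
```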
@s-rog s-rog added bug Something isn't working help wanted Open to be worked on labels Jun 19, 2020
@williamFalcon (Contributor) commented Jun 19, 2020

Can you post code to reproduce? Just a minimal example that breaks.

BTW, the GPU template is fixed...

@s-rog (Contributor, Author) commented Jun 19, 2020

Done, let me post my env as well.

@williamFalcon (Contributor)

Ok wait... I think I see it. One sec.

@s-rog (Contributor, Author) commented Jun 19, 2020

I just tested the merged changes with both ddp and ddp_spawn and again got this:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 891, in fit
    self.ddp_train(task, model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 907, in fit
    self.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 441, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 479, in ddp_train
    self.setup()
TypeError: setup() missing 1 required positional argument: 'stage'

@williamFalcon (Contributor)

Try again, that was a typo.

@s-rog (Contributor, Author) commented Jun 19, 2020

Cheers, works now!

@armancohan

Still having the "Default process group is not initialized" issue when using trainer.test().

@wukailu commented Jun 23, 2020

> Still having the "Default process group is not initialized" issue when using trainer.test().

I still have this bug as well. One temporary workaround is creating a new single-GPU trainer to do the test.

Like:

trainer = Trainer(gpus=1, deterministic=True, logger=logger)
trainer.model = model
trainer.test()

@armancohan

Right, I know it works on a single GPU. I have a large test set and ideally want faster inference using multiple GPUs.

@zackcarson commented Jul 2, 2020

Can we re-open this issue? I am still having the "Default process group is not initialized" issue when I call trainer.test() with ddp (with any number of GPUs, even 1). I'm using the latest release from yesterday.

@armancohan

+1, doesn't look like the issue is resolved yet.

@williamFalcon williamFalcon reopened this Jul 2, 2020
@jxchen01 commented Jul 4, 2020

Having the same problem... I also tried to downgrade PL to an older version, like 0.7.5, and use the older version to do the inference. But a model trained and saved with 0.8.x does not seem to be directly compatible with the older version.

@channingxiao

Version 0.8.4, trained with ddp: got "Default process group is not initialized" when running trainer.test().

@williamFalcon (Contributor)

Could you try master? This is fixed there.

@zackcarson

Just tried it, it works fine now! Thank you!

@jxchen01

@williamFalcon Trying 0.8.5

Trained with ddp, and testing with ddp, but got the following error message:

AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.

Any idea?

Thanks!
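For what it's worth, that DDP assertion means no parameter in the module requires a gradient, e.g. a fully frozen model loaded only for inference. A standalone sketch of a pre-check (my own illustration, not Lightning internals):

```python
import torch.nn as nn

# A fully frozen model is exactly what triggers the
# DistributedDataParallel assertion quoted above.
model = nn.Linear(4, 2)
for p in model.parameters():
    p.requires_grad_(False)

# DDP requires at least one parameter with requires_grad=True;
# checking up front gives a clearer failure than the assertion.
trainable = any(p.requires_grad for p in model.parameters())
print(trainable)  # False
```

If the check prints False, either unfreeze at least one parameter or run inference without the DDP wrapper.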

7 participants