distributed training crashes with dp (list comprehension issue from torch?) #1861
Comments
Hi! Thanks for your contribution, great first issue!
I was experiencing this problem the other day. It's somewhat related to PyTorch. If you look at the function that gathers the outputs from the GPU devices, https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py#L47:

```python
def gather_map(outputs):
    out = outputs[0]
    if isinstance(out, torch.Tensor):
        return Gather.apply(target_device, dim, *outputs)
    if out is None:
        return None
    if isinstance(out, dict):
        if not all((len(out) == len(d) for d in outputs)):
            raise ValueError('All dicts must have the same number of keys')
        return type(out)(((k, gather_map([d[k] for d in outputs]))
                          for k in out))
    return type(out)(map(gather_map, zip(*outputs)))
```

you'll see that it only supports tensors, or dictionaries that contain tensors. The problem for me was that my result dict contained plain Python numbers (inside the logs), which fall through to the `zip(*outputs)` branch:

```python
results = {
    "loss": loss,
    "log": all_logs,
    "progress_bar": progress_logs,
}
```
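To make the failure mode concrete, here is a minimal, hypothetical CPU-only repro (not from the original report) that triggers the same TypeError through torch's public gather helper:

```python
# Hypothetical minimal repro: gather_map handles tensors and dicts of
# tensors, but a bare Python float falls through to zip(*outputs).
from torch.nn.parallel.scatter_gather import gather

# Simulated per-GPU outputs, as DataParallel would collect them; the
# leaf values are plain floats instead of tensors.
outputs = [{"acc": 0.9}, {"acc": 0.8}]

gather(outputs, target_device=0)
# TypeError: zip argument #1 must support iteration
```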
So I worked around it by recursively converting everything back to tensors on the target device:

```python
def _fix_dp_return_type(self, result, device):
    if isinstance(result, torch.Tensor):
        return result.to(device)
    if isinstance(result, dict):
        return {k: self._fix_dp_return_type(v, device) for k, v in result.items()}
    # Must be a plain number then; wrap it in a tensor on the target device
    return torch.Tensor([result]).to(device)
```

I hope there's a better fix for this :)
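As for where one might call it: a hypothetical usage sketch (the forward/loss details below are purely illustrative) would be to run it on the result dict just before returning from `training_step`:

```python
# Hypothetical usage sketch: convert the result dict right before
# returning it from training_step, so every leaf is a tensor on the
# loss's device before DataParallel gathers across GPUs.
def training_step(self, batch, batch_idx):
    x, y = batch
    loss = self.loss_fn(self(x), y)  # illustrative forward + loss
    results = {
        "loss": loss,
        "log": {"train_loss": loss.item()},  # bare float: would break gather
        "progress_bar": {"train_loss": loss.item()},
    }
    return self._fix_dp_return_type(results, loss.device)
```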
Hmmm, I am just returning loss and log, so I have to convert the loss to a tensor and move it to the device?
Feels like this is something that should be covered by the parallel tools in Lightning, though I guess...
I agree. One way to fix it is to override the default gather function.
@nsarang maybe submit a PR with this patch? @ananyahjha93
@williamFalcon Alright. Are you referring to #1895?
@nsarang you can override the gather function and create a separate PR for it.
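A rough sketch of what such an override could look like (the class and helper names below are illustrative, not Lightning's actual internals):

```python
import numbers

import torch
from torch.nn.parallel import DataParallel


class NumberFriendlyDataParallel(DataParallel):
    """Illustrative DataParallel subclass that coerces bare Python
    numbers in each device's output into tensors before gathering."""

    def gather(self, outputs, output_device):
        outputs = [self._tensorify(o, output_device) for o in outputs]
        return super().gather(outputs, output_device)

    def _tensorify(self, obj, device):
        if isinstance(obj, torch.Tensor):
            return obj
        if isinstance(obj, dict):
            return {k: self._tensorify(v, device) for k, v in obj.items()}
        if isinstance(obj, numbers.Number):
            # Wrap bare numbers so gather_map never hits its zip() branch.
            return torch.tensor([obj], device=device)
        return obj
```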
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Where do you call _fix_dp_return_type to fix the issue?
@Skylixia
Yes, I am on the latest version. I also get "Validation: 0it [00:00, ?it/s]" as output, and used:
@Skylixia
🐛 Bug
I ran the distributed GPU template and got an error with data parallel, coming in particular from scatter_gather in torch.nn.parallel.
To Reproduce
Steps to reproduce the behavior:
1. install packages
2. git clone from master
3. run the basic example GPU job with the dp distributed backend
```
Validation sanity check: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
    self.dp_train(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1001, in run_pretrain_routine
    False)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 277, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 424, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
```
Code sample
Run:
```
python3 gpu_template.py --gpus 2 --distributed_backend dp
```
Expected behavior
The distributed demo job should run without errors.
Environment
- CUDA:
    - GPU:
        - GeForce RTX 2080 Ti
        - GeForce RTX 2080 Ti
    - available: True
    - version: 10.2
- Packages:
    - numpy: 1.18.4
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.7.6
    - tensorboard: 2.2.1
    - tqdm: 4.46.0
- System:
    - OS: Linux
    - architecture:
        - 64bit
    - processor: x86_64
    - python: 3.7.6
    - version: #201812030624 SMP Mon Dec 3 11:25:55 UTC 2018
Additional context
Running with ddp instead works:
```
python3 gpu_template.py --gpus 2 --distributed_backend ddp
```