fix selecting GPUs using CUDA_VISIBLE_DEVICES #2739

Merged · 2 commits · Aug 2, 2020
3 changes: 1 addition & 2 deletions pytorch_lightning/trainer/distrib_data_parallel.py
@@ -551,8 +551,7 @@ def ddp_train(self, process_idx, mp_queue, model, is_master=False, proc_offset=0
         gpu_idx = process_idx
         if is_master:
             # source of truth is cuda for gpu idx
-            gpus = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
-            gpu_idx = int(gpus[self.local_rank])
+            gpu_idx = self.local_rank
Contributor:
This won't work: if you have access to GPUs 4,5,6,7 and you request "2,3", you're actually asking for physical GPUs 6 and 7.
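A minimal sketch (not part of the PR) of the renumbering being described here; the device mask is an assumed example:

```python
import os

# assumed mask: the process may only see physical GPUs 4-7
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")

# inside the process, CUDA renumbers the visible devices from 0,
# so logical index i refers to physical GPU visible[i]
def physical_gpu(logical_idx):
    return int(visible[logical_idx])

print(physical_gpu(2), physical_gpu(3))  # -> 6 7: requesting "2,3" lands on GPUs 6 and 7
```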

@ibeltagy (Contributor Author) · Jul 28, 2020:
In my view, having PL run your model on GPUs 6,7 is the expected behavior in this case.

@ibeltagy (Contributor Author) · Jul 28, 2020:
This fixes another problem with ddp: if gpus=3 and CUDA_VISIBLE_DEVICES=4,5,6,7, ddp runs only two jobs, on GPUs 5 and 6, and the job meant for GPU 4 won't work.
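A hypothetical walk-through of that failure, assuming gpus=3 and CUDA_VISIBLE_DEVICES=4,5,6,7 as in the comment (not code from the PR):

```python
import os

# assumed setup from this comment: Trainer(gpus=3), four visible devices
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
gpus = os.environ["CUDA_VISIBLE_DEVICES"].split(",")

for local_rank in range(3):          # ddp spawns ranks 0..2 for gpus=3
    old_idx = int(gpus[local_rank])  # old code: 4, 5, 6 -- physical ids
    new_idx = local_rank             # fixed code: 0, 1, 2 -- renumbered ids
    # torch.cuda.set_device expects the renumbered indices (0..3 here),
    # so the old values point at the wrong or nonexistent visible devices
    print(f"rank {local_rank}: old={old_idx}, fixed={new_idx}")
```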

Contributor:
I agree, but here's what happens:

gpus available: 0, 1, 2, 3, 4, 5
index:          0, 1, 2, 3, 4, 5
gpus[2] = 2

When you set CUDA_VISIBLE_DEVICES, your numbering changes:
CUDA_VISIBLE_DEVICES='2,4,5'
now your indexes are 0, 1, 2

So once you set visible devices, the mapping changes:
gpus[0] = 2
gpus[2] = 5

Contributor:
Can you share code that breaks, so I can reproduce and verify? Your fix might solve this problem, but it's likely to break other DDP settings.


         self.root_gpu = gpu_idx
         torch.cuda.set_device(self.root_gpu)
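A hypothetical usage sketch of the post-fix behavior; the Trainer arguments follow the Lightning API of this era and are assumed, not taken from the PR:

```python
# run as: CUDA_VISIBLE_DEVICES=6,7 python train.py
from pytorch_lightning import Trainer

# ask for 2 devices; with the fix, each ddp process calls
# torch.cuda.set_device(local_rank), i.e. indices 0 and 1 within the
# visible set, which map back to physical GPUs 6 and 7
trainer = Trainer(gpus=2, distributed_backend="ddp")
```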