Slurm ntasks-per-node is ignored #102
@neggert `nb_requested_gpus` is the world size:

```python
self.nb_requested_gpus = len(self.data_parallel_device_ids) * self.nb_gpu_nodes
```

`ntasks-per-node` should be 2 for your slurm job (per the warning). Try running again? In your case `self.nb_requested_gpus = len(self.data_parallel_device_ids) * self.nb_gpu_nodes` equals 4. Otherwise, an alternative might be to set `self.nb_requested_gpus = len(self.data_parallel_device_ids)`:

```python
self.nb_slurm_tasks = 0
try:
    self.nb_slurm_tasks = int(os.environ['SLURM_NTASKS_PER_NODE'])
    self.is_slurm_managing_tasks = self.nb_slurm_tasks == self.nb_requested_gpus
except Exception:
    # likely not on slurm, so set the slurm-managed flag to false
    self.is_slurm_managing_tasks = False
```
Okay, thanks. Will take another look; I must have misunderstood something.
A general question about DDP and ntasks-per-node: should ntasks-per-node match the number of GPUs requested, or the num_workers in your dataloader, or whichever is bigger? Say I need 4 GPUs but have num_workers=8 in my dataloader, and I request --ntasks-per-node=4. Will I have any problem? I already know that training never really proceeds with ntasks-per-node=1.
Describe the bug

When running with DDP, Lightning throws this warning:

I made the suggested change, but I still get the warning. Digging into the code a bit, it looks like this warning goes away when `$SLURM_NTASKS` matches `trainer.nb_requested_gpus`. If I'm understanding the code correctly, this should be changed to check `$SLURM_NTASKS_PER_NODE`, since `trainer.nb_requested_gpus` is the number of GPUs per node.

I'm happy to make the change if you agree that this is the correct fix.
To Reproduce

Submit a job with `test_tube.SlurmCluster`.
Expected behavior

The warning should go away and Lightning should use the SLURM-created tasks.