Slurm ntasks-per-node is ignored #102

Closed
neggert opened this issue Aug 12, 2019 · 4 comments
Labels: bug (Something isn't working)

Comments

neggert (Contributor) commented Aug 12, 2019

Describe the bug

When running with DDP, Lightning throws this warning:

UserWarning: 
You requested 2 GPUs but launched 1 slurm tasks.
We will launch 2 processes for you.
We recommend you let slurm manage the processes by setting: --ntasks-per-node=2
If you're not using SLURM, ignore this message!

I made the suggested change, but I still get the warning. Digging into the code a bit, it looks like the warning only goes away when $SLURM_NTASKS matches trainer.nb_requested_gpus. If I'm understanding the code correctly, the check should use $SLURM_NTASKS_PER_NODE instead, since trainer.nb_requested_gpus is the number of GPUs per node.

I'm happy to make the change if you agree that this is the correct fix.
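
For reference, here's the comparison as I read it (a rough sketch only; the name nb_requested_gpus is paraphrased from my reading of the Trainer source, so treat the names as approximate):

    import os

    nb_requested_gpus = 2  # e.g. 2 GPUs per node in my job

    # current behaviour (as I read it): compares the *global* SLURM task count
    nb_slurm_tasks = int(os.environ.get('SLURM_NTASKS', 0))
    is_slurm_managing_tasks = nb_slurm_tasks == nb_requested_gpus

    # proposed: compare against the *per-node* task count instead
    nb_slurm_tasks_per_node = int(os.environ.get('SLURM_NTASKS_PER_NODE', 0))
    is_slurm_managing_tasks = nb_slurm_tasks_per_node == nb_requested_gpus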

To Reproduce
Submit a job with test_tube.SlurmCluster:

    from test_tube import SlurmCluster

    # args comes from test_tube's HyperOptArgumentParser (the parsed hyperparams)
    cluster = SlurmCluster(
        hyperparam_optimizer=args,
        log_path="./logs"
    )

    cluster.per_experiment_nb_gpus = 2
    cluster.per_experiment_nb_nodes = 2
    cluster.per_experiment_nb_cpus = 16
    cluster.add_slurm_cmd(cmd="ntasks-per-node", value=str(cluster.per_experiment_nb_gpus), comment="1 task per gpu, for ddp")
    cluster.job_time = "1:00:00"
    cluster.gpu_type = "p100"
    cluster.memory_mb_per_node = 300000

    cluster.optimize_parallel_cluster_gpu(train, nb_trials=1, job_name="tml")

Expected behavior
The warning should go away, and Lightning should use the SLURM-created tasks.

neggert added the bug label on Aug 12, 2019
williamFalcon (Contributor) commented:

@neggert nb_requested_gpus is the world_size

self.nb_requested_gpus = len(self.data_parallel_device_ids) * self.nb_gpu_nodes

williamFalcon (Contributor) commented:

ntasks-per-node should be 2 for your slurm job (per the warning).

Try running again?

in your case

self.nb_requested_gpus = len(self.data_parallel_device_ids) * self.nb_gpu_nodes

equals 4.

And self.nb_slurm_tasks = int(os.environ['SLURM_NTASKS']) also equals 4. So the warning shouldn't be triggered.
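
Spelling that arithmetic out for the job in the reproduction above (illustrative values only; the names just mirror the attributes discussed here):

    nodes = 2
    gpus_per_node = 2
    ntasks_per_node = gpus_per_node             # from --ntasks-per-node=2

    slurm_ntasks = nodes * ntasks_per_node      # what $SLURM_NTASKS reports -> 4
    nb_requested_gpus = gpus_per_node * nodes   # len(device_ids) * nb_gpu_nodes -> 4

    assert slurm_ntasks == nb_requested_gpus    # so the existing check should pass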

Otherwise an alternative might be to say:

            self.nb_requested_gpus = len(self.data_parallel_device_ids)
            self.nb_slurm_tasks = 0
            try:
                self.nb_slurm_tasks = int(os.environ['SLURM_NTASKS_PER_NODE'])
                self.is_slurm_managing_tasks = self.nb_slurm_tasks == self.nb_requested_gpus
            except Exception:
                # likely not on slurm, so set the slurm managed flag to false
                self.is_slurm_managing_tasks = False

neggert (Contributor, Author) commented Aug 12, 2019

Okay, thanks. Will take another look; I must have misunderstood something.

zhanwenchen commented:

A general question about DDP and ntasks-per-node: should ntasks-per-node match the number of GPUs requested, the num_workers in your dataloader, or whichever is bigger? Say I need 4 GPUs but have num_workers=8 in my dataloader, and I request --ntasks-per-node=4. Will I have any problem? I already know that training never really proceeds with ntasks-per-node=1.
