Slurm ntasks-per-node is ignored #102

Closed
neggert opened this issue Aug 12, 2019 · 4 comments
Labels: bug (Something isn't working)

Comments

neggert (Contributor) commented Aug 12, 2019

Describe the bug

When running with DDP, Lightning throws this warning:

UserWarning: 
You requested 2 GPUs but launched 1 slurm tasks.
We will launch 2 processes for you.
We recommend you let slurm manage the processes by setting: --ntasks-per-node=2
If you're not using SLURM, ignore this message!

I made the suggested change, but I still get the warning. Digging into the code a bit, it looks like the warning only goes away when $SLURM_NTASKS matches trainer.nb_requested_gpus. If I'm understanding the code correctly, the check should use $SLURM_NTASKS_PER_NODE instead, since trainer.nb_requested_gpus is the number of GPUs per node.

I'm happy to make the change if you agree that this is the correct fix.
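
For reference, here's the comparison as I read it (a rough sketch only; the name nb_requested_gpus is paraphrased from my reading of the Trainer source, so treat the names as approximate):

    import os

    nb_requested_gpus = 2  # e.g. 2 GPUs per node in my job

    # current behaviour (as I read it): compares the *global* SLURM task count
    nb_slurm_tasks = int(os.environ.get('SLURM_NTASKS', 0))
    is_slurm_managing_tasks = nb_slurm_tasks == nb_requested_gpus

    # proposed: compare against the *per-node* task count instead
    nb_slurm_tasks_per_node = int(os.environ.get('SLURM_NTASKS_PER_NODE', 0))
    is_slurm_managing_tasks = nb_slurm_tasks_per_node == nb_requested_gpus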

To Reproduce
Submit a job with test_tube.SlurmCluster:

    from test_tube import SlurmCluster

    # args comes from test_tube's HyperOptArgumentParser (the parsed hyperparams)
    cluster = SlurmCluster(
        hyperparam_optimizer=args,
        log_path="./logs"
    )

    cluster.per_experiment_nb_gpus = 2
    cluster.per_experiment_nb_nodes = 2
    cluster.per_experiment_nb_cpus = 16
    cluster.add_slurm_cmd(cmd="ntasks-per-node", value=str(cluster.per_experiment_nb_gpus), comment="1 task per gpu, for ddp")
    cluster.job_time = "1:00:00"
    cluster.gpu_type = "p100"
    cluster.memory_mb_per_node = 300000

    cluster.optimize_parallel_cluster_gpu(train, nb_trials=1, job_name="tml")

Expected behavior
The warning should go away, and Lightning should use the SLURM-created tasks.

neggert added the bug label on Aug 12, 2019
williamFalcon (Contributor) commented:

@neggert nb_requested_gpus is the world_size

self.nb_requested_gpus = len(self.data_parallel_device_ids) * self.nb_gpu_nodes

williamFalcon (Contributor) commented:

ntasks-per-node should be 2 for your slurm job (per the warning).

Try running again?

in your case

self.nb_requested_gpus = len(self.data_parallel_device_ids) * self.nb_gpu_nodes

equals 4.

And self.nb_slurm_tasks = int(os.environ['SLURM_NTASKS']) also equals 4. So the warning shouldn't be triggered.
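
Spelling that arithmetic out for the job in the reproduction above (illustrative values only; the names just mirror the attributes discussed here):

    nodes = 2
    gpus_per_node = 2
    ntasks_per_node = gpus_per_node             # from --ntasks-per-node=2

    slurm_ntasks = nodes * ntasks_per_node      # what $SLURM_NTASKS reports -> 4
    nb_requested_gpus = gpus_per_node * nodes   # len(device_ids) * nb_gpu_nodes -> 4

    assert slurm_ntasks == nb_requested_gpus    # so the existing check should pass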

Otherwise an alternative might be to say:

            self.nb_requested_gpus = len(self.data_parallel_device_ids)
            self.nb_slurm_tasks = 0
            try:
                self.nb_slurm_tasks = int(os.environ['SLURM_NTASKS_PER_NODE'])
                self.is_slurm_managing_tasks = self.nb_slurm_tasks == self.nb_requested_gpus
            except Exception:
                # likely not on slurm, so set the slurm managed flag to false
                self.is_slurm_managing_tasks = False

neggert (Contributor, Author) commented Aug 12, 2019

Okay, thanks. Will take another look; I must have misunderstood something.

zhanwenchen commented:

A general question about DDP and ntasks-per-node: should ntasks-per-node match the number of GPUs requested, the num_workers in your dataloader, or whichever is bigger? Say I need 4 GPUs but have num_workers=8 in my dataloader, and I request --ntasks-per-node=4. Will I have any problem? I already know that training never really proceeds with ntasks-per-node=1.
