
ddp_cpu crashing on SLURM cluster because of save_spawn_weights() #1890

Closed · Dunrar opened this issue May 19, 2020 · 0 comments · Fixed by #2029
Labels: bug (Something isn't working), help wanted (Open to be worked on)


Dunrar commented May 19, 2020

🐛 Bug

I'm seeing a problem seemingly similar to issues #1335 and #1637 on current master when using ddp_cpu on my university's SLURM cluster. Every job in a job array, on every node, fails at a certain epoch (not the same epoch for each job, but in the same range). Unlike those earlier issues, the crash happens not when load_spawn_weights() is invoked, but when save_spawn_weights() is.
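
For context, this is roughly how the failing jobs set up the Trainer (simplified; the real script, experiment_core.py, builds the model and hyperparameters first, and the exact argument values here are placeholders):

import pytorch_lightning as pl

# Simplified launch sketch. The relevant part is the ddp_cpu backend,
# which spawns `num_processes` workers via torch.multiprocessing.spawn()
# and writes the temporary spawn weights under the run's log directory
# (src/logs/ in my case).
trainer = pl.Trainer(
    distributed_backend="ddp_cpu",  # CPU-only DDP via process spawning
    num_processes=1,                # matches "initializing proc_rank 0 world 1" below
)
trainer.fit(model)                  # `model` is the LightningModule (RecurrentModel)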

Error

slurmstepd-breitach: error: Unable to create TMPDIR [/tmp/user/30335]: Permission denied
slurmstepd-breitach: error: Setting TMPDIR to /tmp
psutil is not installed. You will not be able to abort this experiment from the UI.
psutil is not installed. Hardware metrics will not be collected.
NeptuneLogger will work in online mode
GPU available: True, used: False
No environment variable for node rank defined. Set as 0.
MASTER_ADDR environment variable is not defined. Set as localhost
initializing proc_rank 0 world 1
Set SLURM handle signals.

  | Name                  | Type           | Params
-----------------------------------------------------
0 | model                 | RecurrentModel | 5     
1 | model.recurrent_model | RNN            | 4     
2 | model.fc              | Linear         | 1     
/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "experiment_core.py", line 40, in <module>
    main(hyperparams, parser)
  File "experiment_core.py", line 25, in main
    trainer.fit(model)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 856, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 391, in ddp_train
    self.save_spawn_weights(model)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 401, in save_spawn_weights
    self.save_checkpoint(path)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 265, in save_checkpoint
    self._atomic_save(checkpoint, filepath)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 256, in _atomic_save
    torch.save(checkpoint, tmp_path)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 369, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 234, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 215, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/sch/schillmann/Bachelor-Thesis/Code/Memory-Network-Memory-Horizons/src/logs/__temp_weight_ddp_end.ckpt.part'
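
Based on the traceback, _atomic_save() first writes the checkpoint to a "<filepath>.part" temporary file and then renames it into place, and the FileNotFoundError suggests the parent directory of that path (src/logs/ here) is missing at the moment rank 0 calls save_spawn_weights(), e.g. because several array jobs share or clean up that directory. A minimal sketch of the kind of guard I would expect to avoid this (the helper name is mine, not Lightning's actual code):

import os
import torch

def atomic_save_with_dir_guard(checkpoint: dict, filepath: str) -> None:
    """Sketch: write to a temporary '.part' file, then rename it into place,
    creating the parent directory first so torch.save() cannot fail with
    FileNotFoundError when the target directory does not exist."""
    os.makedirs(os.path.dirname(filepath) or ".", exist_ok=True)
    tmp_path = str(filepath) + ".part"
    torch.save(checkpoint, tmp_path)
    os.replace(tmp_path, filepath)  # atomic rename on POSIX filesystems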

Expected behavior

All jobs in the array finish on all nodes without errors.
