
ddp_cpu crashing on SLURM cluster because of save_spawn_weights() #1890

Closed · Dunrar opened this issue May 19, 2020 · 0 comments · Fixed by #2029
Labels: bug (Something isn't working), help wanted (Open to be worked on)


Dunrar commented May 19, 2020

🐛 Bug

I'm seeing a problem seemingly similar to issues #1335 and #1637 on current master when using ddp_cpu on my university's SLURM cluster. Every job in a job array, on every node, fails at a certain epoch (not the same epoch for each job, but in the same range). Unlike those earlier issues, the crash happens not when load_spawn_weights() is invoked, but when save_spawn_weights() is.
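
For context, this is roughly how the failing jobs set up the Trainer (simplified; the real script, experiment_core.py, builds the model and hyperparameters first, and the exact argument values here are placeholders):

import pytorch_lightning as pl

# Simplified launch sketch. The relevant part is the ddp_cpu backend,
# which spawns `num_processes` workers via torch.multiprocessing.spawn()
# and writes the temporary spawn weights under the run's log directory
# (src/logs/ in my case).
trainer = pl.Trainer(
    distributed_backend="ddp_cpu",  # CPU-only DDP via process spawning
    num_processes=1,                # matches "initializing proc_rank 0 world 1" below
)
trainer.fit(model)                  # `model` is the LightningModule (RecurrentModel)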

Error

slurmstepd-breitach: error: Unable to create TMPDIR [/tmp/user/30335]: Permission denied
slurmstepd-breitach: error: Setting TMPDIR to /tmp
psutil is not installed. You will not be able to abort this experiment from the UI.
psutil is not installed. Hardware metrics will not be collected.
NeptuneLogger will work in online mode
GPU available: True, used: False
No environment variable for node rank defined. Set as 0.
MASTER_ADDR environment variable is not defined. Set as localhost
initializing proc_rank 0 world 1
Set SLURM handle signals.

  | Name                  | Type           | Params
-----------------------------------------------------
0 | model                 | RecurrentModel | 5     
1 | model.recurrent_model | RNN            | 4     
2 | model.fc              | Linear         | 1     
/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Traceback (most recent call last):
  File "experiment_core.py", line 40, in <module>
    main(hyperparams, parser)
  File "experiment_core.py", line 25, in main
    trainer.fit(model)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 856, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 391, in ddp_train
    self.save_spawn_weights(model)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 401, in save_spawn_weights
    self.save_checkpoint(path)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 265, in save_checkpoint
    self._atomic_save(checkpoint, filepath)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 256, in _atomic_save
    torch.save(checkpoint, tmp_path)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 369, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 234, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 215, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/sch/schillmann/Bachelor-Thesis/Code/Memory-Network-Memory-Horizons/src/logs/__temp_weight_ddp_end.ckpt.part'
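
Based on the traceback, _atomic_save() first writes the checkpoint to a "<filepath>.part" temporary file and then renames it into place, and the FileNotFoundError suggests the parent directory of that path (src/logs/ here) is missing at the moment rank 0 calls save_spawn_weights(), e.g. because several array jobs share or clean up that directory. A minimal sketch of the kind of guard I would expect to avoid this (the helper name is mine, not Lightning's actual code):

import os
import torch

def atomic_save_with_dir_guard(checkpoint: dict, filepath: str) -> None:
    """Sketch: write to a temporary '.part' file, then rename it into place,
    creating the parent directory first so torch.save() cannot fail with
    FileNotFoundError when the target directory does not exist."""
    os.makedirs(os.path.dirname(filepath) or ".", exist_ok=True)
    tmp_path = str(filepath) + ".part"
    torch.save(checkpoint, tmp_path)
    os.replace(tmp_path, filepath)  # atomic rename on POSIX filesystems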

Expected behavior

All jobs in the array finish on all nodes without errors.
