prepare_data called multiple times per node for slurm and elastic training #1878

tullie · 2020-05-18T23:00:34Z

🐛 Bug

Slurm and elastic training create the training processes per node outside of the lightning context. This means that when the fit function calls prepare_data, the assumption that it's only being called on proc 0 is broken and it gets called for each process.

This is an issue for computational reasons (e.g. downloading a whole dataset once per process) and for training stability if the data preparation process isn't deterministic.

See calling code here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/7c7e50ca4702a5b35bc1b80d44bca7606552093a/pytorch_lightning/trainer/trainer.py#L825

To Reproduce

Steps to reproduce the behavior:

1. Add print statements to prepare_data
2. Train a lightning model with either slurm or elastic training
3. See that it's being called multiple times.
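
For the print statements in the first step, something like the following helper makes it easier to tell the calls apart. This is only a sketch: `process_tag` is a hypothetical name, and it assumes the launcher exports `LOCAL_RANK`/`RANK` (torchelastic) or `SLURM_LOCALID`/`SLURM_PROCID` (Slurm) into each process.

```python
import os
import socket


def process_tag(env=None):
    """Return a short label identifying the calling process (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: torchelastic sets LOCAL_RANK/RANK, Slurm sets SLURM_LOCALID/SLURM_PROCID.
    # "?" is printed when neither launcher variable is present.
    local_rank = env.get("LOCAL_RANK", env.get("SLURM_LOCALID", "?"))
    global_rank = env.get("RANK", env.get("SLURM_PROCID", "?"))
    return (
        f"host={socket.gethostname()} pid={os.getpid()} "
        f"local_rank={local_rank} global_rank={global_rank}"
    )


# Inside the LightningModule under test:
#     def prepare_data(self):
#         print(f"prepare_data called: {process_tag()}", flush=True)
```

If the bug is present, the output should contain one line per process on a node rather than a single line with local_rank=0.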

Expected behavior

Expected prepare_data to only be called once per node.
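
To illustrate the expected behavior, here is a minimal sketch of a per-node guard. It is not the Lightning implementation; `is_local_rank_zero` and `prepare_data_once_per_node` are hypothetical names, and the sketch assumes the local rank can be read from `LOCAL_RANK` (torchelastic) or `SLURM_LOCALID` (Slurm).

```python
import os


def is_local_rank_zero(env=None):
    """True if this process is the first one on its node (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: the launcher exports one of these variables per process.
    for var in ("LOCAL_RANK", "SLURM_LOCALID"):
        if var in env:
            return int(env[var]) == 0
    # No launcher variable found: treat it as a plain single-process run.
    return True


def prepare_data_once_per_node(prepare_fn, env=None):
    """Run prepare_fn only on local rank 0; return whether it ran."""
    if is_local_rank_zero(env):
        prepare_fn()
        return True
    return False
```

Note that the other processes on the node would still need to wait (e.g. on a barrier) until local rank 0 has finished before they read the data; the sketch does not cover that.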

edenlightning · 2020-06-08T11:01:00Z

@Borda any idea how to fix?

tullie added the bug (Something isn't working) and help wanted (Open to be worked on) labels May 18, 2020

ananthsub mentioned this issue Jun 12, 2020

Call prepare_data once per node in DDP (torchelastic) #2163

Closed

5 tasks

williamFalcon mentioned this issue Jun 13, 2020

enable prepare_data from correct processes - clarify local vs global rank #2166

Merged

williamFalcon closed this as completed in #2166 Jun 13, 2020
