DDP Bug with Model Checkpoint parsing #2299

Closed · Laksh1997 opened this issue Jun 20, 2020 · 22 comments · Fixed by #2388
Labels: bug (Something isn't working), help wanted (Open to be worked on)

@Laksh1997
🐛 Bug

My script works on CPU, single GPU, and dp.

I need ddp for 16-bit training; also, even on a single machine, ddp is faster.

Here is my ModelCheckpoint code:

def setup_model_checkpoint(config):
    kwargs = config["model_checkpoint_kwargs"]
    metrics = kwargs.pop("metrics", ["val_loss"])
    if isinstance(metrics, str):
        metrics = [metrics]

    # Build a filename template, e.g. "checkpoints/{epoch}-{val_loss:.2f}-{auc:.2f}"
    fp = "checkpoints/{epoch}"
    for metric in metrics:
        fp += "-{"
        fp += str(metric)
        fp += ":.2f}"

    return ModelCheckpoint(filepath=fp, **kwargs)

In my case it would generate a checkpoint like checkpoints/epoch=4-val_loss=0.60-auc=0.85, for example.
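For reference, a quick check of the format string the helper builds (a minimal sketch reusing the loop above; the metric names are just examples):

metrics = ["val_loss", "auc"]
fp = "checkpoints/{epoch}"
for metric in metrics:
    fp += "-{" + str(metric) + ":.2f}"
print(fp)  # -> checkpoints/{epoch}-{val_loss:.2f}-{auc:.2f}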

I even tried it with just "checkpoints" as the filepath and it's the same issue.

The issue is the following:

The same traceback (interleaved in the logs; timestamps omitted here) is raised by three of the four DDP processes:

  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 891, in fit
    self.ddp_train(task, model)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 530, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in run_pretrain_routine
    self.configure_checkpoint_callback()
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_config.py", line 60, in configure_checkpoint_callback
    "checkpoints"
  File "/home/user/miniconda/envs/py36/lib/python3.6/posixpath.py", line 94, in join
    genericpath._check_arg_types('join', a, *p)
  File "/home/user/miniconda/envs/py36/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'

Environment

  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): Conda
  • Build command you used (if compiling from source):
  • Python version: 3.6.5
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 4 x V100
  • Any other relevant information: PyTorch Lightning 0.8.0

Additional context

Laksh1997 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 20, 2020
@Laksh1997 (Author)

Ahh, it may be because I do kwargs.pop("metrics"), which means that for the other processes it's a NoneType.

I've copy.deepcopy'd kwargs. Let's see if that fixes it!

@Laksh1997 (Author)

Turns out the above didn't help.

@Laksh1997 (Author)

The error seems to be here (trainer/callback_config.py, line 60):

    def configure_checkpoint_callback(self):
        """
        Weight path set in this priority:
        Checkpoint_callback's path (if passed in).
        User provided weights_saved_path
        Otherwise use os.getcwd()
        """
        ckpt_path = self.default_root_dir
        if self.checkpoint_callback:
            # init a default one
            if self.logger is not None:
                save_dir = (getattr(self.logger, 'save_dir', None) or
                            getattr(self.logger, '_save_dir', None) or
                            self.default_root_dir)

                # weights_save_path overrides anything
                if self.weights_save_path is not None:
                    save_dir = self.weights_save_path

                version = self.logger.version if isinstance(
                    self.logger.version, str) else f'version_{self.logger.version}'
                ckpt_path = os.path.join(
                    save_dir,
                    self.logger.name,
                    version,
                    "checkpoints"
                )

@Laksh1997 (Author)

One of the arguments in the last os.path.join is a NoneType. But it only happens on ddp2. Confusing...

Even more confusing: exactly 3 (not 4) processes log this error, so one process seems to be fine but not the other 3!
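For what it's worth, that TypeError is easy to reproduce in isolation; a minimal sketch (assuming one of the path components, e.g. the logger name, comes back as None on those processes):

import os

# Passing None as any component raises the exact error from the traceback:
os.path.join("checkpoints", None, "version_0", "checkpoints")
# TypeError: join() argument must be str or bytes, not 'NoneType'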

@Laksh1997 (Author)

I have a suspicion self.default_root_dir is None.

I am now explicitly passing in a default root dir.

@Laksh1997 (Author)

The above didn't work. I now think it's something to do with the logger, specifically logger.name.
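A hypothetical sanity check (a sketch, not something from this thread; it assumes the LOCAL_RANK environment variable set by the ddp backend and the wandb_logger object built below) would be to print what each process actually sees for the attributes used to build the checkpoint path:

import os

# Hypothetical debug snippet: dump the logger attributes on each DDP process
rank = os.environ.get("LOCAL_RANK", "0")
print(f"rank={rank} name={wandb_logger.name!r} "
      f"version={wandb_logger.version!r} "
      f"save_dir={getattr(wandb_logger, 'save_dir', None)!r}")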

@Laksh1997 (Author)

Here is my logger code:

def setup_wandb_logging(total_cfg: Dict):
    """
    Helper function to set up WandB logging.
    Parameters
    ----------
    total_cfg: A dictionary containing all possible config for a training run! (Model + Data + Training config)

    Returns
    -------
    a WandbLogger ready to log in the training run.
    """
    wandb_logger = WandbLogger(
        name=total_cfg["name"],
        version=total_cfg["name"],
        save_dir="checkpoints",
        offline=False,
        anonymous=False,
        project=total_cfg["project"],
        tags=None,
        experiment=None,
    )
    wandb_logger.log_hyperparams(total_cfg)
    return wandb_logger

which I then pass into the trainer as Trainer(logger=wandb_logger, **kwargs).

@Laksh1997 (Author)

Response from wandb:

Hi Laksh - Try using wandb.init(reinit=True) and wandb.join() at the end of each run.
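My reading of their suggestion, sketched against the wandb 0.9.x API used here (the project name is a placeholder):

import wandb

run = wandb.init(project="my-project", reinit=True)  # re-initialise the run in each process
# ... training ...
wandb.join()  # flush and close the run before the process exits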

@williamFalcon (Contributor)

Looking at this... so it looks to be specifically wandb-related?

@williamFalcon (Contributor)

Can you put up a Colab that reproduces this issue?

@Laksh1997 (Author)

@williamFalcon Yeah it looks like it. I have tried to contact the wandb team but they've given me limited response so far. Sure, let me put together a basic colab.

@vanpelt (Contributor) commented Jun 24, 2020

Hey guys, we can look into this. @Laksh1997 can you share a basic colab to reproduce?

@Laksh1997 (Author)

Hi @vanpelt, thanks! I'm working on a Colab right now. I'll need to get on a multi-GPU machine to confirm the error.

I have tried 4 x V100 (p3.8xlarge) and 16 x K80 (p2.16xlarge) and I get the same error only with ddp (it works fine with dp).

@aced125 commented Jun 24, 2020

@vanpelt @williamFalcon Here's a working notebook.

https://colab.research.google.com/drive/1dTgKDU4S8Oy8g9AvXZLEoE5_aDyzBLb_?usp=sharing

To run on multi-GPU you will need to copy the code (probably only ~150 lines) into a script.

The key thing to note: in the hparams, change your wandb name and project to whatever you want.

Also, remember to turn on multi-GPU (set gpus: 4 and distributed_backend="ddp" in trainer_kwargs in the hparams).
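For reference, the equivalent Trainer call looks roughly like this (a sketch; argument names per pytorch-lightning 0.8.x, with model and wandb_logger as defined in the notebook):

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=4, distributed_backend="ddp", logger=wandb_logger)
trainer.fit(model)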

Let me know if any issues

@Laksh1997 (Author)

@vanpelt @williamFalcon Update: I just tried this on PyTorch 1.6 nightly, and the error persists.

@Laksh1997 (Author)

@vanpelt @williamFalcon Have you by any chance had any luck with the issue?

@williamFalcon (Contributor)

looking in a few hours. want to get this fix into 0.8.2

@Laksh1997 (Author)

Thanks so much!

@williamFalcon (Contributor)

ok... running your exact code on a single GPU:


  | Name  | Type                | Params
----------------------------------------------
0 | model | EncoderDecoderModel | 245 M 
Epoch 1:   0%|                                                                                                                                                       | 0/828 [00:00<?, ?it/s]/opt/conda/lib/python3.7/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:761.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Epoch 1:  16%|██████████████████▏                                                                                              | 133/828 [00:50<04:23,  2.64it/s, loss=6.848, v_num=2nn4c31m]^Cwandb: Ctrl-c pressed.
wandb: Program failed with code 255. Press ctrl-c to abort syncing.
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
  warnings.warn(*args, **kwargs)
Epoch 1:  16%|██████████████████▏                                                                                              | 133/828 [00:50<04:23,  2.63it/s, loss=6.848, v_num=2nn4c31m]

wandb: Waiting for W&B process to finish, PID 29584
wandb: Run summary:
wandb:         _step 135
wandb:    _timestamp 1593277979.7518523
wandb:      _runtime 72.95207810401917
wandb:          loss 6.491087913513184
wandb:   global_step 100
wandb:         epoch 0
wandb: Syncing 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:                                                                                
wandb: Synced MY-WANDB-NAME: https://app.wandb.ai/stk/MY-WANDB-PROJECT/runs/2nn4c31m

Trying ddp now

@williamFalcon (Contributor)

ok fixed!

/opt/conda/lib/python3.7/site-packages/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
name: MY-WANDB-NAME
project: MY-WANDB-PROJECT
train_bs: 4
val_bs: 4
num_workers: 4
max_length: 160
num_datapoints: 100000
optimizer: Ranger
optimizer_kwargs:
  lr: 0.0003
  alpha: 0.5
  betas:
  - 0.95
  - 0.999
  eps: 1.0e-05
  weight_decay: 0.001
schedulers_kwargs:
  num_warmup_steps: 1000
trainer_kwargs:
  gpus: 2
  gradient_clip_val: 0.5
  accumulate_grad_batches: 4
  min_epochs: 5
  max_epochs: 100
  precision: 32
  distributed_backend: ddp

initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
wandb: Tracking run with wandb version 0.9.1
wandb: Run data is saved locally in wandb/run-20200627_174833-2sdoruup
wandb: Syncing run MY-WANDB-NAME
wandb: ⭐️ View project at https://app.wandb.ai/stk/MY-WANDB-PROJECT
wandb: 🚀 View run at https://app.wandb.ai/stk/MY-WANDB-PROJECT/runs/2sdoruup
wandb: Run `wandb off` to turn off syncing.


  | Name  | Type                | Params
----------------------------------------------
0 | model | EncoderDecoderModel | 245 M 
wandb: Tracking run with wandb version 0.9.1
wandb: Run data is saved locally in wandb/run-20200627_174834-3gkane0x
wandb: Syncing run MY-WANDB-NAME
wandb: ⭐️ View project at https://app.wandb.ai/stk/MY-WANDB-PROJECT
wandb: 🚀 View run at https://app.wandb.ai/stk/MY-WANDB-PROJECT/runs/3gkane0x
wandb: Run `wandb off` to turn off syncing.

Epoch 1:   0%|                                                                                                                                                       | 0/414 [00:00<?, ?it/s]/opt/conda/lib/python3.7/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:761.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
/opt/conda/lib/python3.7/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:761.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Epoch 1:   7%|███████▉                                                                                                          | 29/414 [00:11<02:36,  2.46it/s, loss=9.714, v_num=2sdoruup]

@williamFalcon (Contributor) commented Jun 27, 2020

Two things:

  1. We had a bug where something was trying to access a property that only exists on rank zero; that is now fixed.
  2. Your prepare_data is not correct.

prepare_data is only ever called on the root GPU... this means an assignment like self.something = a will only happen on GPU 0. So, when you try to access that attribute, it will break on the other GPUs.

We fixed this by introducing

def setup(self, step):

In setup, you can assign whatever you want.

Here are all the details on how to prepare data
https://pytorch-lightning.readthedocs.io/en/stable/lightning-module.html#data-preparation


The fix is very simple in the HF example:

# old
def prepare_data(self):
    self.x = train_split

# new
def setup(self, step):
    self.x = train_split

And use prepare_data only for downloads and other one-off work:

def prepare_data(self):
    self.download()
    tokenize()
    etc()

    self.dont_assign = it_wont_work_on_other_gpus  # assignments here are NOT visible on the other GPUs
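Putting the two hooks together, a minimal self-contained sketch (hypothetical helper and path names; the step argument follows the 0.8.x setup signature):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class MyModule(pl.LightningModule):
    def prepare_data(self):
        # Runs on the root GPU only: downloads / one-off disk work.
        # Do NOT assign to self here; the other processes won't see it.
        download_dataset_to_disk("dataset.pt")  # hypothetical helper

    def setup(self, step):
        # Runs on every process: safe to assign state used later.
        data = torch.load("dataset.pt")         # hypothetical path
        self.train_split = TensorDataset(data)  # visible on every GPU

    def train_dataloader(self):
        return DataLoader(self.train_split, batch_size=32)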

FYI @sshleifer

@Laksh1997 here's the fixed code
https://gist.github.com/williamFalcon/645019619bdd897d135d232556bcf27d

@Laksh1997 (Author)

@williamFalcon Thank you so much! Amazing!

So if the data is particularly light to download, we might as well just put all the code in setup?

Also what is the step argument for?
