DDP Bug with Model Checkpoint parsing #2299

Closed · Laksh1997 opened this issue Jun 20, 2020 · 22 comments · Fixed by #2388
Labels: bug (Something isn't working), help wanted (Open to be worked on)

@Laksh1997
🐛 Bug

My script works on CPU, single GPU, and dp.

I need ddp for 16-bit training; also, even on a single machine, ddp is faster.

Here is my ModelCheckpoint code:

def setup_model_checkpoint(config):
    kwargs = config["model_checkpoint_kwargs"]
    metrics = kwargs.pop("metrics", ["val_loss"])
    if isinstance(metrics, str):
        metrics = [metrics]

    # Build a filename template, e.g. "checkpoints/{epoch}-{val_loss:.2f}-{auc:.2f}"
    fp = "checkpoints/{epoch}"
    for metric in metrics:
        fp += "-{"
        fp += str(metric)
        fp += ":.2f}"

    return ModelCheckpoint(filepath=fp, **kwargs)

In my case it would generate a checkpoint like checkpoints/epoch=4-val_loss=0.60-auc=0.85, for example.
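For reference, a quick check of the format string the helper builds (a minimal sketch reusing the loop above; the metric names are just examples):

metrics = ["val_loss", "auc"]
fp = "checkpoints/{epoch}"
for metric in metrics:
    fp += "-{" + str(metric) + ":.2f}"
print(fp)  # -> checkpoints/{epoch}-{val_loss:.2f}-{auc:.2f}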

I even tried it with just "checkpoints" as the filepath and it's the same issue.

The issue is the following:

The same traceback (interleaved in the logs; timestamps omitted here) is raised by three of the four DDP processes:

  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 891, in fit
    self.ddp_train(task, model)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 530, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in run_pretrain_routine
    self.configure_checkpoint_callback()
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_config.py", line 60, in configure_checkpoint_callback
    "checkpoints"
  File "/home/user/miniconda/envs/py36/lib/python3.6/posixpath.py", line 94, in join
    genericpath._check_arg_types('join', a, *p)
  File "/home/user/miniconda/envs/py36/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'

Environment

  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): Conda
  • Build command you used (if compiling from source):
  • Python version: 3.6.5
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 4 x V100
  • Any other relevant information: PyTorch Lightning 0.8.0

Additional context

Laksh1997 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 20, 2020
@Laksh1997 (Author)

Ahh, it may be because I do kwargs.pop("metrics"), which means that for the other processes it's a NoneType.

I've copy.deepcopy'd kwargs. Let's see if that fixes it!

@Laksh1997 (Author)

Turns out the above didn't help.

@Laksh1997 (Author)

The error seems to be here (trainer/callback_config.py, line 60):

    def configure_checkpoint_callback(self):
        """
        Weight path set in this priority:
        Checkpoint_callback's path (if passed in).
        User provided weights_saved_path
        Otherwise use os.getcwd()
        """
        ckpt_path = self.default_root_dir
        if self.checkpoint_callback:
            # init a default one
            if self.logger is not None:
                save_dir = (getattr(self.logger, 'save_dir', None) or
                            getattr(self.logger, '_save_dir', None) or
                            self.default_root_dir)

                # weights_save_path overrides anything
                if self.weights_save_path is not None:
                    save_dir = self.weights_save_path

                version = self.logger.version if isinstance(
                    self.logger.version, str) else f'version_{self.logger.version}'
                ckpt_path = os.path.join(
                    save_dir,
                    self.logger.name,
                    version,
                    "checkpoints"
                )

@Laksh1997 (Author)

One of the arguments in the last os.path.join is a NoneType. But it only happens on ddp2. Confusing...

Even more confusing: exactly 3 (not 4) processes log this error, so one process seems to be fine but not the other 3!
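For what it's worth, that TypeError is easy to reproduce in isolation; a minimal sketch (assuming one of the path components, e.g. the logger name, comes back as None on those processes):

import os

# Passing None as any component raises the exact error from the traceback:
os.path.join("checkpoints", None, "version_0", "checkpoints")
# TypeError: join() argument must be str or bytes, not 'NoneType'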

@Laksh1997 (Author)

I have a suspicion self.default_root_dir is None.

I am now explicitly passing in a default root dir.

@Laksh1997 (Author)

The above didn't work. I now think it's something to do with the logger, specifically logger.name.
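A hypothetical sanity check (a sketch, not something from this thread; it assumes the LOCAL_RANK environment variable set by the ddp backend and the wandb_logger object built below) would be to print what each process actually sees for the attributes used to build the checkpoint path:

import os

# Hypothetical debug snippet: dump the logger attributes on each DDP process
rank = os.environ.get("LOCAL_RANK", "0")
print(f"rank={rank} name={wandb_logger.name!r} "
      f"version={wandb_logger.version!r} "
      f"save_dir={getattr(wandb_logger, 'save_dir', None)!r}")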

@Laksh1997 (Author)

Here is my logger code:

def setup_wandb_logging(total_cfg: Dict):
    """
    Helper function to set up WandB logging.
    Parameters
    ----------
    total_cfg: A dictionary containing all possible config for a training run! (Model + Data + Training config)

    Returns
    -------
    a WandbLogger ready to log in the training run.
    """
    wandb_logger = WandbLogger(
        name=total_cfg["name"],
        version=total_cfg["name"],
        save_dir="checkpoints",
        offline=False,
        anonymous=False,
        project=total_cfg["project"],
        tags=None,
        experiment=None,
    )
    wandb_logger.log_hyperparams(total_cfg)
    return wandb_logger

which I then pass into the trainer as Trainer(logger=wandb_logger, **kwargs).

@Laksh1997 (Author)

Response from wandb:

Hi Laksh - Try using wandb.init(reinit=True) and wandb.join() at the end of each run.
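My reading of their suggestion, sketched against the wandb 0.9.x API used here (the project name is a placeholder):

import wandb

run = wandb.init(project="my-project", reinit=True)  # re-initialise the run in each process
# ... training ...
wandb.join()  # flush and close the run before the process exits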

@williamFalcon (Contributor)

Looking at this... so it looks to be specifically wandb-related?

@williamFalcon (Contributor)

Can you put up a Colab that reproduces this issue?

@Laksh1997 (Author)

@williamFalcon Yeah it looks like it. I have tried to contact the wandb team but they've given me limited response so far. Sure, let me put together a basic colab.

@vanpelt (Contributor) commented Jun 24, 2020

Hey guys, we can look into this. @Laksh1997 can you share a basic colab to reproduce?

@Laksh1997 (Author)

Hi @vanpelt, thanks! I'm working on a Colab right now. I'll need to get on a multi-GPU machine to confirm the error.

I have tried 4 x V100 (p3.8xlarge) and 16 x K80 (p2.16xlarge) and I get the same error only with ddp (it works fine with dp).

@aced125 commented Jun 24, 2020

@vanpelt @williamFalcon Here's a working notebook.

https://colab.research.google.com/drive/1dTgKDU4S8Oy8g9AvXZLEoE5_aDyzBLb_?usp=sharing

To run on multi-GPU you will need to copy the code (probably only ~150 lines) into a script.

The key thing to note: in the hparams, change your wandb name and project to whatever you want.

Also, remember to turn on multi-GPU (set gpus: 4 and distributed_backend="ddp" in trainer_kwargs in the hparams).
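For reference, the equivalent Trainer call looks roughly like this (a sketch; argument names per pytorch-lightning 0.8.x, with model and wandb_logger as defined in the notebook):

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=4, distributed_backend="ddp", logger=wandb_logger)
trainer.fit(model)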

Let me know if any issues

@Laksh1997 (Author)

@vanpelt @williamFalcon Update: I just tried this on PyTorch 1.6 nightly, and the error persists.

@Laksh1997 (Author)

@vanpelt @williamFalcon Have you by any chance had any luck with the issue?

@williamFalcon (Contributor)

looking in a few hours. want to get this fix into 0.8.2

@Laksh1997 (Author)

Thanks so much!

@williamFalcon (Contributor)

ok... running your exact code on a single GPU:


  | Name  | Type                | Params
----------------------------------------------
0 | model | EncoderDecoderModel | 245 M 
Epoch 1:   0%|                                                                                                                                                       | 0/828 [00:00<?, ?it/s]/opt/conda/lib/python3.7/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:761.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Epoch 1:  16%|██████████████████▏                                                                                              | 133/828 [00:50<04:23,  2.64it/s, loss=6.848, v_num=2nn4c31m]^Cwandb: Ctrl-c pressed.
wandb: Program failed with code 255. Press ctrl-c to abort syncing.
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
  warnings.warn(*args, **kwargs)
Epoch 1:  16%|██████████████████▏                                                                                              | 133/828 [00:50<04:23,  2.63it/s, loss=6.848, v_num=2nn4c31m]

wandb: Waiting for W&B process to finish, PID 29584
wandb: Run summary:
wandb:         _step 135
wandb:    _timestamp 1593277979.7518523
wandb:      _runtime 72.95207810401917
wandb:          loss 6.491087913513184
wandb:   global_step 100
wandb:         epoch 0
wandb: Syncing 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:                                                                                
wandb: Synced MY-WANDB-NAME: https://app.wandb.ai/stk/MY-WANDB-PROJECT/runs/2nn4c31m

Trying ddp now

@williamFalcon (Contributor)

ok fixed!

/opt/conda/lib/python3.7/site-packages/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
name: MY-WANDB-NAME
project: MY-WANDB-PROJECT
train_bs: 4
val_bs: 4
num_workers: 4
max_length: 160
num_datapoints: 100000
optimizer: Ranger
optimizer_kwargs:
  lr: 0.0003
  alpha: 0.5
  betas:
  - 0.95
  - 0.999
  eps: 1.0e-05
  weight_decay: 0.001
schedulers_kwargs:
  num_warmup_steps: 1000
trainer_kwargs:
  gpus: 2
  gradient_clip_val: 0.5
  accumulate_grad_batches: 4
  min_epochs: 5
  max_epochs: 100
  precision: 32
  distributed_backend: ddp

initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
wandb: Tracking run with wandb version 0.9.1
wandb: Run data is saved locally in wandb/run-20200627_174833-2sdoruup
wandb: Syncing run MY-WANDB-NAME
wandb: ⭐️ View project at https://app.wandb.ai/stk/MY-WANDB-PROJECT
wandb: 🚀 View run at https://app.wandb.ai/stk/MY-WANDB-PROJECT/runs/2sdoruup
wandb: Run `wandb off` to turn off syncing.


  | Name  | Type                | Params
----------------------------------------------
0 | model | EncoderDecoderModel | 245 M 
wandb: Tracking run with wandb version 0.9.1
wandb: Run data is saved locally in wandb/run-20200627_174834-3gkane0x
wandb: Syncing run MY-WANDB-NAME
wandb: ⭐️ View project at https://app.wandb.ai/stk/MY-WANDB-PROJECT
wandb: 🚀 View run at https://app.wandb.ai/stk/MY-WANDB-PROJECT/runs/3gkane0x
wandb: Run `wandb off` to turn off syncing.

Epoch 1:   0%|                                                                                                                                                       | 0/414 [00:00<?, ?it/s]/opt/conda/lib/python3.7/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:761.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
/opt/conda/lib/python3.7/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:761.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Epoch 1:   7%|███████▉                                                                                                          | 29/414 [00:11<02:36,  2.46it/s, loss=9.714, v_num=2sdoruup]

@williamFalcon (Contributor) commented Jun 27, 2020

Two things:

  1. We had a bug where something was trying to access a property that only exists on rank zero; that is now fixed.
  2. Your prepare_data is not correct.

prepare_data is only ever called on the root GPU... this means an assignment like self.something = a will only happen on GPU 0. So, when you try to access that attribute, it will break on the other GPUs.

We fixed this by introducing

def setup(self, step):

In setup, you can assign whatever you want.

Here are all the details on how to prepare data
https://pytorch-lightning.readthedocs.io/en/stable/lightning-module.html#data-preparation


The fix is very simple in the HF example:

# old
def prepare_data(self):
    self.x = train_split

# new
def setup(self, step):
    self.x = train_split

And use prepare_data only for downloads and other one-off work:

def prepare_data(self):
    self.download()
    tokenize()
    etc()

    self.dont_assign = it_wont_work_on_other_gpus  # assignments here are NOT visible on the other GPUs
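Putting the two hooks together, a minimal self-contained sketch (hypothetical helper and path names; the step argument follows the 0.8.x setup signature):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class MyModule(pl.LightningModule):
    def prepare_data(self):
        # Runs on the root GPU only: downloads / one-off disk work.
        # Do NOT assign to self here; the other processes won't see it.
        download_dataset_to_disk("dataset.pt")  # hypothetical helper

    def setup(self, step):
        # Runs on every process: safe to assign state used later.
        data = torch.load("dataset.pt")         # hypothetical path
        self.train_split = TensorDataset(data)  # visible on every GPU

    def train_dataloader(self):
        return DataLoader(self.train_split, batch_size=32)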

FYI @sshleifer

@Laksh1997 here's the fixed code
https://gist.github.com/williamFalcon/645019619bdd897d135d232556bcf27d

@Laksh1997 (Author)

@williamFalcon Thank you so much! Amazing!

So if the data is particularly light to download, we might as well just put all the code in setup?

Also what is the step argument for?
