DDP Bug with Model Checkpoint parsing #2299
Ahh, it may be because of what I do. I've
Turns out the above didn't help.
The error seems to be here (trainer/callback_config.py, line 60):
One of the arguments in the last call seems to be the problem. More confusing: exactly 3 (not 4) processes log this error, so one process seems to be fine but not the other 3!
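For reference, here is a minimal Python sketch (my own reproduction attempt, not Lightning's actual code or traceback) of the failure mode I suspect: os.path.join raising a TypeError because one of the path components it is given is None.

```python
import os

# Minimal sketch (an assumption on my part, not Lightning's actual code):
# if one of the components used to build the default checkpoint path is
# None -- e.g. an unset logger attribute on a non-root DDP process --
# os.path.join raises a TypeError in Python 3.
def build_ckpt_path(save_dir, name, version):
    return os.path.join(save_dir, name, version, "checkpoints")

try:
    build_ckpt_path("lightning_logs", None, "version_0")
except TypeError as e:
    print("TypeError:", e)
```

The function names and arguments here are hypothetical; the point is only that a single None component is enough to crash path construction on whichever processes never had the attribute set.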
Following a suspicion, I am now explicitly passing in a default root dir.
The above didn't work. I now think it's something to do with the logger, specifically:
Here is my logger code:
which I then pass into the trainer as:
Response from wandb:
looking at this... so it looks to be specifically wandb related?
can you put up a colab that creates this issue?
@williamFalcon Yeah, it looks like it. I have tried to contact the wandb team but they've given me a limited response so far. Sure, let me put together a basic colab.
Hey guys, we can look into this. @Laksh1997 can you share a basic colab to reproduce?
Hi @vanpelt, thanks! I'm working on a colab right now. I will need to go on a multi-GPU machine to confirm the error. I have tried 4 x V100 (p3.8xlarge) and 16 x K80 (p2.16xlarge) and I get the same error, but only on ddp (it works fine on dp).
@vanpelt @williamFalcon Here's a working notebook: https://colab.research.google.com/drive/1dTgKDU4S8Oy8g9AvXZLEoE5_aDyzBLb_?usp=sharing To run on multi-GPU you will need to copy the code (probably only 150 lines) into a script. The key things to note: in the hparams, change the wandb name and project to whatever you want. Also, remember to turn on multi-GPU (set gpus: 4 and distributed_backend="ddp"). Let me know if there are any issues.
@vanpelt @williamFalcon Update - I just tried this on the PyTorch 1.6 nightly, and the error persists.
@vanpelt @williamFalcon Have you by any chance had any luck with the issue?
looking in a few hours. want to get this fix into 0.8.2
Thanks so much!
ok... running your exact code on a single GPU:
Trying ddp now
ok fixed!
2 things:
1. prepare_data is only ever called on the root GPU... this means anything you assign there won't exist on the other processes. We fixed this by introducing setup; in setup, you can assign whatever you want. Here are all the details on how to prepare data. The fix is very simple in the HF example:
2. Use prepare_data only for downloads.
FYI @sshleifer @Laksh1997 here's the fixed code
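To make the prepare_data/setup distinction above concrete, here is a plain-Python simulation (no Lightning imports; the world size and hook names only mirror Lightning's, this is not its code) of why assigning state in a root-only hook breaks the other DDP processes while a per-process setup hook works:

```python
# Plain-Python simulation (assumption: 4 "processes"; the hook names mirror
# Lightning's prepare_data/setup, but this is an illustration, not Lightning).
class Module:
    def prepare_data(self):
        # Called on the root process only: good for downloads,
        # bad for assigning state that every process needs.
        self.vocab = {"hello": 0, "world": 1}

    def setup(self, stage):
        # Called on every process: a safe place to assign state.
        self.dataset = ["hello", "world"]

def run_ddp(world_size):
    modules = [Module() for _ in range(world_size)]
    modules[0].prepare_data()      # root GPU only
    for m in modules:
        m.setup("fit")             # every process
    has_vocab = [hasattr(m, "vocab") for m in modules]
    has_dataset = [hasattr(m, "dataset") for m in modules]
    return has_vocab, has_dataset

has_vocab, has_dataset = run_ddp(4)
print(has_vocab)    # [True, False, False, False]
print(has_dataset)  # [True, True, True, True]
```

This also matches the "3 of 4 processes crash" symptom reported above: only the root process ends up with the attribute, and the other three fail when they try to use it.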
@williamFalcon Thank you so much! Amazing! So if the data is particularly light to download, we might as well just put all the code in setup? Also, what is the step argument for?
🐛 Bug
My script works with CPU, single-GPU, and dp.
I need ddp to do 16-bit training. Also, even on a single machine, ddp is faster.
Here is my ModelCheckpoint code:
In my case it would generate a checkpoint such as checkpoints/epoch=4_val_loss=0.6_auc=0.85. Although I even tried it with just checkpoints, and it's the same issue. The issue is the following:
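For what it's worth, a filename like the one above can be produced by simple template substitution over the logged metrics; the sketch below is an illustration of that expansion (the template syntax and helper are mine, not Lightning's actual parsing code):

```python
# Sketch of how a ModelCheckpoint-style filepath template could expand into
# the filename shown above. This is an illustrative helper, not Lightning's
# implementation: each {metric} placeholder becomes "metric=value".
def format_checkpoint_name(template, metrics):
    name = template
    for key, value in metrics.items():
        name = name.replace("{" + key + "}", f"{key}={value}")
    return name

metrics = {"epoch": 4, "val_loss": 0.6, "auc": 0.85}
path = format_checkpoint_name("checkpoints/{epoch}_{val_loss}_{auc}", metrics)
print(path)  # checkpoints/epoch=4_val_loss=0.6_auc=0.85
```

The relevant point for this bug is that every placeholder value must exist on every DDP process at format time; a missing value on a non-root process would break the expansion.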
Environment
- How you installed PyTorch (conda, pip, source): Conda
Additional context