TriviaQA LR scheduler code issue #37

Open
apoorv2904 opened this issue May 18, 2020 · 4 comments

apoorv2904 commented May 18, 2020

Hi,

For single-GPU training with the TriviaQA script, the learning rate drops to 0 within the first epoch.

Possible reason: with a batch size of 1, global_step in pytorch_lightning increases with every batch of size 1 returned by the data_loader, so it does not correspond to the number of optimizer steps. The LR scheduler, however, was written in terms of the accumulated-gradient batch size, and thus the learning rate decays to 0 within the first epoch.
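For illustration, here is a minimal sketch of the mismatch (the step counts, optimizer, and scheduler below are made-up placeholders, not the values used in the script):

import torch

accumulate_grad_batches = 32   # effective batch assembled from 32 batches of size 1
total_optimizer_steps = 1000   # schedule length, counted in optimizer steps

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Linear decay to 0 over total_optimizer_steps
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: max(0.0, 1.0 - step / total_optimizer_steps),
)

# Buggy behaviour (pytorch-lightning 0.6.0): global_step advances once per batch,
# so the scheduler is stepped per batch and the lr reaches 0 after 1000 batches,
# i.e. after only 1000 / 32 ≈ 31 optimizer steps -- well inside the first epoch.
for batch_idx in range(2000):
    scheduler.step()
    if optimizer.param_groups[0]["lr"] == 0.0:
        print(f"lr hit 0 at batch {batch_idx}")
        break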

Thanks.
Apoorv

ibeltagy (Collaborator) commented May 20, 2020

Very good catch. Thanks, @apoorv2904. This is a bug in pytorch-lightning==0.6.0, and it has been fixed in later releases (Lightning-AI/pytorch-lightning#832). I would suggest updating to a more recent version of PTL (say 0.7.5, not the most recent 0.7.6, which has a higher chance of bugs). If I am not mistaken, everything should work the same except loading a checkpoint, which requires resume_from_checkpoint (https://github.com/ibeltagy/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L115). If you can make that change and submit a PR, that would be very much appreciated.
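For reference, a minimal sketch of resuming from a checkpoint after the upgrade (the checkpoint path and the gpus argument below are placeholders, not values from the script):

import pytorch_lightning as pl

# Hypothetical path; point this at a .ckpt file saved by an earlier run.
checkpoint_path = "checkpoints/triviaqa.ckpt"

# In pytorch-lightning 0.7.x, resuming is handled by the Trainer itself
# via the resume_from_checkpoint argument.
trainer = pl.Trainer(gpus=1, resume_from_checkpoint=checkpoint_path)
# trainer.fit(model)  # model is the TriviaQA LightningModule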

Fan-Luo commented Sep 16, 2020

Hi,

When I updated pytorch-lightning from 0.6.0 to 0.7.5, I got:

ERROR: mkl-random 1.0.1 requires cython, which is not installed.
ERROR: torchvision 0.4.2 has requirement torch==1.3.1, but you'll have torch 1.6.0 which is incompatible.
ERROR: longformer 0.1 has requirement pytorch-lightning==0.6.0, but you'll have pytorch-lightning 0.7.5 which is incompatible.
ERROR: thinc 6.12.1 has requirement msgpack<0.6.0,>=0.5.6, but you'll have msgpack 0.6.1 which is incompatible.
ERROR: spacy 2.0.16 has requirement regex==2018.01.10, but you'll have regex 2020.7.14 which is incompatible.

When running trainer.fit(model), I also got:

miniconda3/envs/hotpotqa/lib/python3.6/site-packages/pytorch_lightning/core/hooks.py in backward(self, trainer, loss, optimizer, optimizer_idx)
    146
    147         if self.trainer.use_native_amp:
--> 148             self.trainer.scaler.scale(loss).backward()
    149
    150         # TODO: remove in v0.8.0

AttributeError: 'Trainer' object has no attribute 'scaler'

Any comment/suggestion?

Thanks

ibeltagy (Collaborator) commented:

I am assuming you are using pytorch v1.6; that's why pytorch-lightning is using native amp.

  • I would recommend upgrading pytorch-lightning to version 0.8.5; I recently tested it and it has fewer bugs than previous versions.

  • You are getting this error because our optimizer_step here doesn't work nicely with native amp. Removing our optimizer_step will fix the problem.

  • With optimizer_step gone, you need to update your configure_optimizers to return the following, and PTL will take care of the scheduler:

return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
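For example, a minimal sketch of a configure_optimizers written this way (AdamW, Hugging Face's get_linear_schedule_with_warmup, and the self.args.* hyperparameters are assumptions for illustration; the TriviaQA script may build its optimizer and scheduler differently):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def configure_optimizers(self):
    optimizer = AdamW(self.parameters(), lr=self.args.lr)
    # Placeholder step counts; in practice derive them from the dataset size,
    # accumulate_grad_batches, and the number of epochs.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=self.args.warmup_steps,
        num_training_steps=self.args.total_steps,
    )
    # "interval": "step" tells PTL to call scheduler.step() after every
    # optimizer step, so gradient accumulation is handled correctly.
    return [optimizer], [{"scheduler": scheduler, "interval": "step"}]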

Fan-Luo commented Sep 27, 2020

Thank you for your reply.
What I did was simply add the one-line fix to pytorch_lightning/trainer/training_loop.py from the post (Lightning-AI/pytorch-lightning#832) you mentioned.
