TriviaQA LR scheduler code issue #37

Open
apoorv2904 opened this issue May 18, 2020 · 4 comments

apoorv2904 commented May 18, 2020

Hi,

For single-GPU training with the TriviaQA script, the learning rate drops to 0 within the first epoch.

Possible reason: with a batch size of 1, global_step in pytorch_lightning increases with every batch of size 1 returned by the data_loader, so it does not correspond to the number of optimizer steps. The LR scheduler, however, was written in terms of the accumulated-gradient batch size, and thus the learning rate decays to 0 within the first epoch.
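For illustration, here is a minimal sketch of the mismatch (the step counts, optimizer, and scheduler below are made-up placeholders, not the values used in the script):

import torch

accumulate_grad_batches = 32   # effective batch assembled from 32 batches of size 1
total_optimizer_steps = 1000   # schedule length, counted in optimizer steps

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Linear decay to 0 over total_optimizer_steps
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: max(0.0, 1.0 - step / total_optimizer_steps),
)

# Buggy behaviour (pytorch-lightning 0.6.0): global_step advances once per batch,
# so the scheduler is stepped per batch and the lr reaches 0 after 1000 batches,
# i.e. after only 1000 / 32 ≈ 31 optimizer steps -- well inside the first epoch.
for batch_idx in range(2000):
    scheduler.step()
    if optimizer.param_groups[0]["lr"] == 0.0:
        print(f"lr hit 0 at batch {batch_idx}")
        break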

Thanks.
Apoorv

ibeltagy (Collaborator) commented May 20, 2020

Very good catch. Thanks, @apoorv2904. This is a bug in pytorch-lightning==0.6.0, and it has been fixed in later releases (Lightning-AI/pytorch-lightning#832). I would suggest updating to a more recent version of PTL (say 0.7.5, not the most recent 0.7.6, which has a higher chance of bugs). If I am not mistaken, everything should work the same except loading a checkpoint, which requires resume_from_checkpoint (https://github.com/ibeltagy/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L115). If you can make that change and submit a PR, that would be very much appreciated.
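For reference, a minimal sketch of resuming from a checkpoint after the upgrade (the checkpoint path and the gpus argument below are placeholders, not values from the script):

import pytorch_lightning as pl

# Hypothetical path; point this at a .ckpt file saved by an earlier run.
checkpoint_path = "checkpoints/triviaqa.ckpt"

# In pytorch-lightning 0.7.x, resuming is handled by the Trainer itself
# via the resume_from_checkpoint argument.
trainer = pl.Trainer(gpus=1, resume_from_checkpoint=checkpoint_path)
# trainer.fit(model)  # model is the TriviaQA LightningModule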

Fan-Luo commented Sep 16, 2020

Hi,

When I updated pytorch-lightning from 0.6.0 to 0.7.5, I got:

ERROR: mkl-random 1.0.1 requires cython, which is not installed.
ERROR: torchvision 0.4.2 has requirement torch==1.3.1, but you'll have torch 1.6.0 which is incompatible.
ERROR: longformer 0.1 has requirement pytorch-lightning==0.6.0, but you'll have pytorch-lightning 0.7.5 which is incompatible.
ERROR: thinc 6.12.1 has requirement msgpack<0.6.0,>=0.5.6, but you'll have msgpack 0.6.1 which is incompatible.
ERROR: spacy 2.0.16 has requirement regex==2018.01.10, but you'll have regex 2020.7.14 which is incompatible.

When running trainer.fit(model), I also got:

miniconda3/envs/hotpotqa/lib/python3.6/site-packages/pytorch_lightning/core/hooks.py in backward(self, trainer, loss, optimizer, optimizer_idx)
    146
    147         if self.trainer.use_native_amp:
--> 148             self.trainer.scaler.scale(loss).backward()
    149
    150         # TODO: remove in v0.8.0

AttributeError: 'Trainer' object has no attribute 'scaler'

Any comment/suggestion?

Thanks

ibeltagy (Collaborator) commented:

I am assuming you are using pytorch v1.6; that's why pytorch-lightning is using native amp.

  • I would recommend upgrading pytorch-lightning to version 0.8.5; I recently tested it and it has fewer bugs than previous versions.

  • You are getting this error because our optimizer_step here doesn't work nicely with native amp. Removing our optimizer_step will fix the problem.

  • With optimizer_step gone, you need to update your configure_optimizers to return the following, and PTL will take care of the scheduler:

return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
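For example, a minimal sketch of a configure_optimizers written this way (AdamW, Hugging Face's get_linear_schedule_with_warmup, and the self.args.* hyperparameters are assumptions for illustration; the TriviaQA script may build its optimizer and scheduler differently):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def configure_optimizers(self):
    optimizer = AdamW(self.parameters(), lr=self.args.lr)
    # Placeholder step counts; in practice derive them from the dataset size,
    # accumulate_grad_batches, and the number of epochs.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=self.args.warmup_steps,
        num_training_steps=self.args.total_steps,
    )
    # "interval": "step" tells PTL to call scheduler.step() after every
    # optimizer step, so gradient accumulation is handled correctly.
    return [optimizer], [{"scheduler": scheduler, "interval": "step"}]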

Fan-Luo commented Sep 27, 2020

Thank you for your reply.
What I did was simply add the one-line fix to pytorch_lightning/trainer/training_loop.py from the post (Lightning-AI/pytorch-lightning#832) you mentioned.
