
Fixes resuming checkpoints rerunning last epoch #866

Merged

Conversation

MattPainter01
Contributor

Fixes #850

@MattPainter01 MattPainter01 requested a review from a team February 16, 2020 14:25
@pep8speaks

pep8speaks commented Feb 16, 2020

Hello @MattPainter01! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-19 15:57:39 UTC

@@ -307,8 +307,8 @@ def restore(self, checkpoint_path, on_gpu):
     def dump_checkpoint(self):
 
         checkpoint = {
-            'epoch': self.current_epoch,
-            'global_step': self.global_step
+            'epoch': self.current_epoch + 1,
+            'global_step': self.global_step + 1,
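For orientation, the hunk above applies the increment at save time so that the restore path can use the stored counters as-is. A minimal standalone sketch of that pairing (illustrative names and values only; this is not the PR's exact code, and treating global_step the same way as epoch is an assumption here):

    def dump_checkpoint(current_epoch, global_step):
        # Store the counters the *next* run should start from.
        return {'epoch': current_epoch + 1, 'global_step': global_step + 1}

    def restore_training_state(state, checkpoint):
        # Use the stored values directly, so the already-finished epoch is not re-run.
        state['current_epoch'] = checkpoint['epoch']
        state['global_step'] = checkpoint['global_step']

    state = {'current_epoch': 4, 'global_step': 499}         # hypothetical end of epoch 4
    restore_training_state(state, dump_checkpoint(**state))
    assert state['current_epoch'] == 5                        # resumes with epoch 5, not epoch 4 again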
Contributor

The problem is that a checkpoint can be saved not only at the end of a training epoch. For example, if you set val_check_interval=0.1 and training is interrupted after 15% of the training batches, the latest checkpoint was written at the 10% mark; resuming from it will continue from the second epoch even though only 10% of the first epoch was actually processed.
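To make the scenario concrete, a configuration along these lines produces mid-epoch checkpoints (val_check_interval is a real Trainer argument; the interruption timing and the checkpoint behaviour described in the comments are assumptions for illustration):

    from pytorch_lightning import Trainer

    # Validate (and run any validation-triggered checkpointing) every 10% of
    # the training epoch, so checkpoints land in the middle of an epoch.
    trainer = Trainer(val_check_interval=0.1)

    # If the job dies after ~15% of epoch 0, the newest checkpoint was written
    # at the 10% mark; a checkpoint that stores 'epoch': current_epoch + 1
    # would make the resumed run start at epoch 1.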

Contributor Author

That's a good point, I'll look into better ways of dealing with this

Contributor Author

Looking at it, I think resuming mid-epoch is really a new feature, which should go in a new PR and get more discussion there. I've updated the test to allow for testing mid-epoch checkpointing / resuming, but commented out the mid-epoch checkpoints so that it passes with the current resume method. When mid-epoch resuming is properly implemented we can use the full test.

For now I've added a warning when a mid-epoch checkpoint is loaded, to alert the user that resuming from it will be unreliable.

@MattPainter01 MattPainter01 changed the title Fixes resuming checkpoints rerunning last epoch Fixes resuming checkpoints rerunning last epoch [wip] Feb 16, 2020
Comment on lines 391 to 392
    # Deals with peculiarity of different global step for odd vs even num_training_batches
    if abs((self.global_step + 1) % self.num_training_batches) > 1:
Contributor Author

Is this known?

Member

the abs looks suspicious...

Contributor Author

@MattPainter01 MattPainter01 Feb 17, 2020

The check would simply be != 0 if the global step matched for odd and even numbers of training steps, so if that's worked out then we shouldn't need it. Unless it's intended, in which case I think just (self.global_step + 1) % self.num_training_batches is sufficient, since the global step matches num_training_batches in the even case.

Thinking about it, I should probably just remove the abs; it's fine without.
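In other words, once the global step lines up for odd and even batch counts, the boundary test reduces to a plain remainder check. A tiny standalone illustration with hypothetical numbers (not the PR's code):

    global_step = 499           # hypothetical: value as stored before the + 1 at save time
    num_training_batches = 250  # hypothetical

    # A non-zero remainder means the step does not land on an epoch boundary,
    # i.e. the checkpoint was written mid-epoch; no abs() is needed because
    # neither operand can be negative.
    print((global_step + 1) % num_training_batches != 0)    # False: 500 % 250 == 0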

Contributor Author

I've removed the abs and made sure it handles accumulated batches properly. Can you think of anything else that might change the global step?

Currently the only test that throws this warning is test_restore_models/test_dp_resume, since it changes the percentage of train data used when resuming in a new trainer. Not much we can do about that.
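Roughly, gradient accumulation means the global step advances once per accumulate_grad_batches training batches, so the boundary check has to compare against optimiser steps per epoch rather than the raw batch count. A standalone sketch under that assumption (hypothetical numbers and names, not the PR's exact code):

    import math

    def is_mid_epoch(global_step, num_training_batches, accumulate_grad_batches=1):
        # The global step advances once per `accumulate_grad_batches` training batches,
        # so compare against optimiser steps per epoch, not the raw batch count.
        steps_per_epoch = math.ceil(num_training_batches / accumulate_grad_batches)
        # Non-zero remainder: the saved step does not sit on an epoch boundary.
        return global_step % steps_per_epoch != 0

    print(is_mid_epoch(global_step=500, num_training_batches=1000, accumulate_grad_batches=2))  # False
    print(is_mid_epoch(global_step=30,  num_training_batches=1000, accumulate_grad_batches=2))  # True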

@MattPainter01 MattPainter01 changed the title Fixes resuming checkpoints rerunning last epoch [wip] Fixes resuming checkpoints rerunning last epoch Feb 17, 2020
Member

@Borda Borda left a comment

LGTM 🚀 just check the update on callbacks from #776

@Borda Borda added the bug (Something isn't working) and ready (PRs ready to be merged) labels Feb 18, 2020
@Borda Borda added this to the 0.6.1 milestone Feb 18, 2020
@williamFalcon
Contributor

@MattPainter01 welcome!
awesome addition.
Had a few test failures on GPUs

@williamFalcon
Contributor

[screenshot of the GPU test failures]

@MattPainter01
Contributor Author

MattPainter01 commented Feb 19, 2020

My bad, I hadn't run the ddp tests, some of which have 0 training batches. I've put in a check to skip the warning when there are no training batches. It passes all the tests on my machine now, except the slurm tests, which I can't run.
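The fix described here amounts to guarding the warning when a process sees no training batches at all. A small standalone sketch of that guard (hypothetical function and values, not the PR's exact code):

    import warnings

    def maybe_warn_mid_epoch(global_step, steps_per_epoch):
        # With 0 training batches (as in some of the ddp test setups) there is no
        # epoch boundary to compare against and the modulo would divide by zero,
        # so the warning is skipped entirely.
        if steps_per_epoch != 0 and global_step % steps_per_epoch != 0:
            warnings.warn("Checkpoint was saved mid-epoch; resuming from it may be unreliable.")

    maybe_warn_mid_epoch(global_step=7, steps_per_epoch=0)   # no warning, nothing to check
    maybe_warn_mid_epoch(global_step=7, steps_per_epoch=50)  # warns: step 7 is mid-epoch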

@Borda Borda requested a review from a team February 20, 2020 09:15
@williamFalcon williamFalcon merged commit 6e7dc9c into Lightning-AI:master Feb 22, 2020
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Apr 3, 2020
* Properly restore current epoch and global step on resume

* Add test

* Move increment to saving rather than loading

* Fix other tests that refer to current epoch

* Formatting

* Add warning for mid-epoch resuming

* Formatting

* Fix warning check for accumulated batches

* Add variable to init

* Formatting

* Add check for 0 training steps

* Make check more readable
Labels
bug (Something isn't working), ready (PRs ready to be merged)
Development

Successfully merging this pull request may close these issues.

Epoch end checkpoint restarts previous epoch
5 participants