
Fixes resuming checkpoints rerunning last epoch #866

Merged

Conversation

MattPainter01
Contributor

Fixes #850

@MattPainter01 MattPainter01 requested a review from a team February 16, 2020 14:25
@pep8speaks

pep8speaks commented Feb 16, 2020

Hello @MattPainter01! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-19 15:57:39 UTC

@@ -307,8 +307,8 @@ def restore(self, checkpoint_path, on_gpu):
     def dump_checkpoint(self):
 
         checkpoint = {
-            'epoch': self.current_epoch,
-            'global_step': self.global_step
+            'epoch': self.current_epoch + 1,
+            'global_step': self.global_step + 1,
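For orientation, the hunk above applies the increment at save time so that the restore path can use the stored counters as-is. A minimal standalone sketch of that pairing (illustrative names and values only; this is not the PR's exact code, and treating global_step the same way as epoch is an assumption here):

    def dump_checkpoint(current_epoch, global_step):
        # Store the counters the *next* run should start from.
        return {'epoch': current_epoch + 1, 'global_step': global_step + 1}

    def restore_training_state(state, checkpoint):
        # Use the stored values directly, so the already-finished epoch is not re-run.
        state['current_epoch'] = checkpoint['epoch']
        state['global_step'] = checkpoint['global_step']

    state = {'current_epoch': 4, 'global_step': 499}         # hypothetical end of epoch 4
    restore_training_state(state, dump_checkpoint(**state))
    assert state['current_epoch'] == 5                        # resumes with epoch 5, not epoch 4 again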
Contributor

The problem is that a checkpoint can be saved not only at the end of a training epoch. For example, if you set val_check_interval=0.1 and training is interrupted after 15% of the training batches, the latest checkpoint was written at the 10% mark; resuming from it will continue from the second epoch even though only 10% of the first epoch was actually processed.
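To make the scenario concrete, a configuration along these lines produces mid-epoch checkpoints (val_check_interval is a real Trainer argument; the interruption timing and the checkpoint behaviour described in the comments are assumptions for illustration):

    from pytorch_lightning import Trainer

    # Validate (and run any validation-triggered checkpointing) every 10% of
    # the training epoch, so checkpoints land in the middle of an epoch.
    trainer = Trainer(val_check_interval=0.1)

    # If the job dies after ~15% of epoch 0, the newest checkpoint was written
    # at the 10% mark; a checkpoint that stores 'epoch': current_epoch + 1
    # would make the resumed run start at epoch 1.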

Contributor Author

That's a good point, I'll look into better ways of dealing with this

Contributor Author

Looking at it, I think resuming mid-epoch is really a new feature, which should go in a new PR and get more discussion there. I've updated the test to allow for testing mid-epoch checkpointing / resuming, but commented out the mid-epoch checkpoints so that it passes with the current resume method. When mid-epoch resuming is properly implemented we can use the full test.

For now I've added a warning when a mid-epoch checkpoint is loaded, to alert the user that resuming from it will be unreliable.

@MattPainter01 MattPainter01 changed the title Fixes resuming checkpoints rerunning last epoch Fixes resuming checkpoints rerunning last epoch [wip] Feb 16, 2020
Comment on lines 391 to 392
    # Deals with peculiarity of different global step for odd vs even num_training_batches
    if abs((self.global_step + 1) % self.num_training_batches) > 1:
Contributor Author

Is this known?

Member

the abs looks suspicious...

Contributor Author

@MattPainter01 MattPainter01 Feb 17, 2020

The check would simply be != 0 if the global step matched for odd and even numbers of training steps, so if that's worked out then we shouldn't need it. Unless it's intended, in which case I think just (self.global_step + 1) % self.num_training_batches is sufficient, since the global step matches num_training_batches in the even case.

Thinking about it, I should probably just remove the abs; it's fine without.
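In other words, once the global step lines up for odd and even batch counts, the boundary test reduces to a plain remainder check. A tiny standalone illustration with hypothetical numbers (not the PR's code):

    global_step = 499           # hypothetical: value as stored before the + 1 at save time
    num_training_batches = 250  # hypothetical

    # A non-zero remainder means the step does not land on an epoch boundary,
    # i.e. the checkpoint was written mid-epoch; no abs() is needed because
    # neither operand can be negative.
    print((global_step + 1) % num_training_batches != 0)    # False: 500 % 250 == 0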

Contributor Author

I've removed the abs and made sure it handles accumulated batches properly. Can you think of anything else that might change the global step?

Currently the only test that throws this warning is test_restore_models/test_dp_resume, since it changes the percentage of train data used when resuming in a new trainer. Not much we can do about that.
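Roughly, gradient accumulation means the global step advances once per accumulate_grad_batches training batches, so the boundary check has to compare against optimiser steps per epoch rather than the raw batch count. A standalone sketch under that assumption (hypothetical numbers and names, not the PR's exact code):

    import math

    def is_mid_epoch(global_step, num_training_batches, accumulate_grad_batches=1):
        # The global step advances once per `accumulate_grad_batches` training batches,
        # so compare against optimiser steps per epoch, not the raw batch count.
        steps_per_epoch = math.ceil(num_training_batches / accumulate_grad_batches)
        # Non-zero remainder: the saved step does not sit on an epoch boundary.
        return global_step % steps_per_epoch != 0

    print(is_mid_epoch(global_step=500, num_training_batches=1000, accumulate_grad_batches=2))  # False
    print(is_mid_epoch(global_step=30,  num_training_batches=1000, accumulate_grad_batches=2))  # True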

@MattPainter01 MattPainter01 changed the title Fixes resuming checkpoints rerunning last epoch [wip] Fixes resuming checkpoints rerunning last epoch Feb 17, 2020
Member

@Borda Borda left a comment

LGTM 🚀 just check the update on callbacks from #776

@Borda Borda added the bug (Something isn't working) and ready (PRs ready to be merged) labels Feb 18, 2020
@Borda Borda added this to the 0.6.1 milestone Feb 18, 2020
@williamFalcon
Contributor

@MattPainter01 welcome!
awesome addition.
Had a few test failures on GPUs

@williamFalcon
Contributor

[screenshot of the GPU test failures]

@MattPainter01
Contributor Author

MattPainter01 commented Feb 19, 2020

My bad, I hadn't run the ddp tests, some of which have 0 training batches. I've put in a check to skip the warning when there are no training batches. It passes all the tests on my machine now, except the slurm tests, which I can't run.
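The fix described here amounts to guarding the warning when a process sees no training batches at all. A small standalone sketch of that guard (hypothetical function and values, not the PR's exact code):

    import warnings

    def maybe_warn_mid_epoch(global_step, steps_per_epoch):
        # With 0 training batches (as in some of the ddp test setups) there is no
        # epoch boundary to compare against and the modulo would divide by zero,
        # so the warning is skipped entirely.
        if steps_per_epoch != 0 and global_step % steps_per_epoch != 0:
            warnings.warn("Checkpoint was saved mid-epoch; resuming from it may be unreliable.")

    maybe_warn_mid_epoch(global_step=7, steps_per_epoch=0)   # no warning, nothing to check
    maybe_warn_mid_epoch(global_step=7, steps_per_epoch=50)  # warns: step 7 is mid-epoch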

@Borda Borda requested a review from a team February 20, 2020 09:15
@williamFalcon williamFalcon merged commit 6e7dc9c into Lightning-AI:master Feb 22, 2020
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Apr 3, 2020
* Properly restore current epoch and global step on resume

* Add test

* Move increment to saving rather than loading

* Fix other tests that refer to current epoch

* Formatting

* Add warning for mid-epoch resuming

* Formatting

* Fix warning check for accumulated batches

* Add variable to init

* Formatting

* Add check for 0 training steps

* Make check more readable
Labels
bug (Something isn't working), ready (PRs ready to be merged)
Development

Successfully merging this pull request may close these issues.

Epoch end checkpoint restarts previous epoch
5 participants