Epoch counting is off by one in multiple instances #3032

Closed
AAnoosheh opened this issue Aug 18, 2020 · 3 comments · Fixed by #3061
Labels: bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)

Comments

AAnoosheh commented Aug 18, 2020

🐛 Bug

Two issues occur:

  1. The final epoch does not save a checkpoint during training.
  2. Resuming from a checkpoint N will start the epochs at N+2.

Expected behavior

  1. Final checkpoint should save a .ckpt file, as usual.
  2. Should resume from epoch N+1.
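
For reference, a minimal sketch of a run that should exhibit the first point, assuming a checkpoint callback that saves every epoch (confirmed further down in the thread); the module, data, and epoch count are placeholders rather than anything from this report:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    """Placeholder module for illustration only."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
        return DataLoader(data, batch_size=16)


# Save a checkpoint every epoch (no top-k filtering).
checkpoint_cb = pl.callbacks.ModelCheckpoint(save_top_k=-1, verbose=True)

# Expected after training for 4 epochs: epoch=0.ckpt through epoch=3.ckpt all on disk.
# Reported: the checkpoint for the final epoch is missing.
trainer = pl.Trainer(max_epochs=4, checkpoint_callback=checkpoint_cb)
trainer.fit(TinyModel())
```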

Environment

* CUDA:
	- GPU:
		- Tesla V100-DGXS-16GB
		- Tesla V100-DGXS-16GB
		- Tesla V100-DGXS-16GB
		- Tesla V100-DGXS-16GB
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.18.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.6.0
	- pytorch-lightning: 0.9.0rc12
	- tensorboard:       2.2.1
	- tqdm:              4.46.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		-
	- processor:         x86_64
	- python:            3.7.7
	- version:           #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
AAnoosheh added the bug and help wanted labels on Aug 18, 2020
ananyahjha93 self-assigned this on Aug 18, 2020
ananyahjha93 added the priority: 0 label on Aug 18, 2020
awaelchli (Contributor) commented

@AAnoosheh Honestly, I do not understand how the PR you linked relates to the bug you reported. Did you mean to link another issue?

The final epoch does not save a checkpoint during training.

I don't experience this. The epoch number is 0-indexed, and by default only the best checkpoints are saved. Could one of these be the reason you think this is a bug?

How can I reproduce the second issue?

edenlightning added this to the 0.9.0 milestone on Aug 18, 2020
AAnoosheh (Author) commented Aug 18, 2020

Sorry, I should have clarified: I use the following to save a checkpoint every epoch:

pl.callbacks.ModelCheckpoint(save_top_k=-1, verbose=True)

The second issue occurs when resuming via Trainer(resume_from_checkpoint=some_ckpt_file).

I assume some change was made to 0-index the epochs, which were previously 1-indexed, and there is now a mismatch.
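
Put together, the resume run described here amounts to something like the following sketch; the checkpoint path and max_epochs value are placeholders, not taken from the report:

```python
import pytorch_lightning as pl

# Same every-epoch checkpointing as in the first run.
checkpoint_cb = pl.callbacks.ModelCheckpoint(save_top_k=-1, verbose=True)

# Resume from the last checkpoint written by the previous run (path is a placeholder).
# Expected: training continues at epoch N+1; reported: it starts at N+2.
trainer = pl.Trainer(
    max_epochs=8,
    checkpoint_callback=checkpoint_cb,
    resume_from_checkpoint="lightning_logs/version_0/checkpoints/epoch=3.ckpt",
)
# trainer.fit(model)  # `model` would be the same LightningModule used for the first run
```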

EDIT:
I also have no idea how a PR was linked in my comment. Those numbers came from the auto-generated issue template, not from anything I added.

ananyahjha93 (Contributor) commented Aug 19, 2020

@AAnoosheh when you run pl.callbacks.ModelCheckpoint(save_top_k=-1, verbose=True), all of the checkpoints are saved; however, we do not save the last one as 'last.ckpt'. Also, the checkpoints are numbered from 0, so if you run for 4 epochs, the last checkpoint saved will be 'epoch=3.ckpt', and when you resume, it resumes from the expected 5th epoch.

Updating tests and code for this
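
As a sketch of the counting being described (plain arithmetic only, not Lightning's actual internals):

```python
# Sketch of the intended bookkeeping only, not the actual Lightning code.
max_epochs = 4
epochs_run = list(range(max_epochs))     # [0, 1, 2, 3] -> last file is "epoch=3.ckpt"
last_saved_epoch = epochs_run[-1]        # 3

# Intended resume point: the next epoch (the 5th overall, 0-indexed as 4).
expected_resume_epoch = last_saved_epoch + 1   # 4
# The reported off-by-one would instead start at last_saved_epoch + 2, i.e. 5.
print(f"last checkpoint: epoch={last_saved_epoch}.ckpt -> resume at epoch {expected_resume_epoch}")
```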
