Epoch counting is off by one in multiple instances #3032

Closed
AAnoosheh opened this issue Aug 18, 2020 · 3 comments · Fixed by #3061
Labels: bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)

Comments

AAnoosheh commented Aug 18, 2020

🐛 Bug

Two issues occur:

  1. The final epoch does not save a checkpoint during training.
  2. Resuming from a checkpoint N will start the epochs at N+2.

Expected behavior

  1. Final checkpoint should save a .ckpt file, as usual.
  2. Should resume from epoch N+1.
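
For reference, a minimal sketch of a run that should exhibit the first point, assuming a checkpoint callback that saves every epoch (confirmed further down in the thread); the module, data, and epoch count are placeholders rather than anything from this report:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    """Placeholder module for illustration only."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
        return DataLoader(data, batch_size=16)


# Save a checkpoint every epoch (no top-k filtering).
checkpoint_cb = pl.callbacks.ModelCheckpoint(save_top_k=-1, verbose=True)

# Expected after training for 4 epochs: epoch=0.ckpt through epoch=3.ckpt all on disk.
# Reported: the checkpoint for the final epoch is missing.
trainer = pl.Trainer(max_epochs=4, checkpoint_callback=checkpoint_cb)
trainer.fit(TinyModel())
```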

Environment

* CUDA:
	- GPU:
		- Tesla V100-DGXS-16GB
		- Tesla V100-DGXS-16GB
		- Tesla V100-DGXS-16GB
		- Tesla V100-DGXS-16GB
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.18.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.6.0
	- pytorch-lightning: 0.9.0rc12
	- tensorboard:       2.2.1
	- tqdm:              4.46.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		-
	- processor:         x86_64
	- python:            3.7.7
	- version:           #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
AAnoosheh added the bug and help wanted labels on Aug 18, 2020
ananyahjha93 self-assigned this on Aug 18, 2020
ananyahjha93 added the priority: 0 label on Aug 18, 2020
awaelchli (Contributor) commented

@AAnoosheh Honestly, I do not understand how the PR you linked relates to the bug you reported. Did you mean to link another issue?

The final epoch does not save a checkpoint during training.

I don't experience this. The epoch number is 0-indexed, and by default only the best checkpoints are saved. Could one of these be the reason you think this is a bug?

How can I reproduce the second issue?

edenlightning added this to the 0.9.0 milestone on Aug 18, 2020
AAnoosheh (Author) commented Aug 18, 2020

Sorry, I should have clarified: I use the following to save a checkpoint every epoch:

pl.callbacks.ModelCheckpoint(save_top_k=-1, verbose=True)

The second issue occurs when resuming via Trainer(resume_from_checkpoint=some_ckpt_file).

I assume some change was made to 0-index the epochs, which were previously 1-indexed, and there is now a mismatch.
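
Put together, the resume run described here amounts to something like the following sketch; the checkpoint path and max_epochs value are placeholders, not taken from the report:

```python
import pytorch_lightning as pl

# Same every-epoch checkpointing as in the first run.
checkpoint_cb = pl.callbacks.ModelCheckpoint(save_top_k=-1, verbose=True)

# Resume from the last checkpoint written by the previous run (path is a placeholder).
# Expected: training continues at epoch N+1; reported: it starts at N+2.
trainer = pl.Trainer(
    max_epochs=8,
    checkpoint_callback=checkpoint_cb,
    resume_from_checkpoint="lightning_logs/version_0/checkpoints/epoch=3.ckpt",
)
# trainer.fit(model)  # `model` would be the same LightningModule used for the first run
```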

EDIT:
I also have no idea how a PR was linked in my comment. Those numbers came from the auto-generated issue template, not from anything I added.

ananyahjha93 (Contributor) commented Aug 19, 2020

@AAnoosheh when you run pl.callbacks.ModelCheckpoint(save_top_k=-1, verbose=True), all of the checkpoints are saved; however, we do not save the last one as 'last.ckpt'. Also, the checkpoints are numbered from 0, so if you run for 4 epochs, the last checkpoint saved will be 'epoch=3.ckpt', and when you resume, it resumes from the expected 5th epoch.

Updating tests and code for this
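
As a sketch of the counting being described (plain arithmetic only, not Lightning's actual internals):

```python
# Sketch of the intended bookkeeping only, not the actual Lightning code.
max_epochs = 4
epochs_run = list(range(max_epochs))     # [0, 1, 2, 3] -> last file is "epoch=3.ckpt"
last_saved_epoch = epochs_run[-1]        # 3

# Intended resume point: the next epoch (the 5th overall, 0-indexed as 4).
expected_resume_epoch = last_saved_epoch + 1   # 4
# The reported off-by-one would instead start at last_saved_epoch + 2, i.e. 5.
print(f"last checkpoint: epoch={last_saved_epoch}.ckpt -> resume at epoch {expected_resume_epoch}")
```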
