EarlyStopping reinitializes to .wait=0 even with Trainer resume_from_checkpoint #1463
Comments
Hi! Thanks for your contribution, great first issue!
I think there should be a check here so that the counter is not reset when loading from a checkpoint.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@jeremyjordan is this being added to #1504?
@williamFalcon yes, this is fixed and there is a test to prevent regressions.
🐛 Bug
When using the Trainer's resume_from_checkpoint together with the EarlyStopping callback, the callback's patience progress (i.e. self.wait) is loaded from the checkpoint but then reset by its on_train_start method, making the checkpoint restoration moot.
Also, EarlyStopping's .best is not saved or restored at all, which makes restoration even less useful.
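For reference, the failure mode boils down to a pattern like the one sketched below. This is a simplified illustration, not the actual pytorch-lightning source; everything here other than the wait attribute and the on_train_start hook is an assumption.

```python
# Simplified sketch of the failure mode described above,
# NOT the actual pytorch-lightning source.
class EarlyStopping:
    def __init__(self, patience=3):
        self.patience = patience
        self.wait = 0

    def on_train_start(self, trainer, pl_module):
        # This hook runs after checkpoint state has been restored,
        # so any restored value of self.wait is unconditionally discarded here.
        self.wait = 0
```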
To Reproduce
Steps to reproduce the behavior:
Install using
pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@master --upgrade
And then interrupt training with KeyboardInterrupt once early_stopping.wait > 0. Load the corresponding checkpoint (let's say it's model_ckpt/_ckpt_epoch_5.ckpt) and resume with resume_from_checkpoint. After resuming, the early_stopping callback reports wait back at 0. And for self.best, it's not even saved; do I need to write that code myself? A condensed reproduction sketch follows.
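This sketch assumes a hypothetical MyLightningModule and the Trainer arguments of the pytorch-lightning version used here (early_stop_callback and resume_from_checkpoint); the exact wiring may differ on other versions.

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

model = MyLightningModule()  # placeholder for your own LightningModule

early_stopping = EarlyStopping(monitor="val_loss", patience=5, verbose=True)

# First run: interrupt with Ctrl+C (KeyboardInterrupt) once early_stopping.wait > 0.
trainer = pl.Trainer(early_stop_callback=early_stopping)
trainer.fit(model)

# Second run: resume from the saved checkpoint.
# early_stopping.wait starts from 0 again, and .best is not restored at all.
trainer = pl.Trainer(
    early_stop_callback=early_stopping,
    resume_from_checkpoint="model_ckpt/_ckpt_epoch_5.ckpt",
)
trainer.fit(model)
```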
Expected behavior
The checkpoint value of self.wait should be preserved rather than reset, and self.best should be saved to and loaded from the checkpoint as well. A rough sketch of what this could look like is below.
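This is only an illustration of the desired behavior, not the actual change from #1504; apart from wait, best, and on_train_start, the method names and the resume check are assumptions.

```python
# Illustrative sketch only, NOT the actual patch from #1504.
class EarlyStopping:
    def state_dict(self):
        # Persist the early-stopping counters alongside the rest of the checkpoint.
        return {"wait": self.wait, "best": self.best}

    def load_state_dict(self, state):
        self.wait = state["wait"]
        self.best = state["best"]

    def on_train_start(self, trainer, pl_module):
        # Only reset when starting a fresh run, not when resuming from a checkpoint.
        if getattr(trainer, "resume_from_checkpoint", None) is None:
            self.wait = 0
            self.best = float("inf")  # assuming "min" mode for the monitored metric
```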
Environment
This was run on Google Colab.
https://colab.research.google.com/drive/1ZdiFf6ksNpgsqOdSKM6lMO0yIhqpnTHD
Additional context
From reading the tutorials it is confusing which of the model's member variables Lightning saves into checkpoints: the docs imply a wide range of things is saved, but what actually gets saved is quite specific.
Also confusingly, there are several ways to restore a checkpoint (the model's load_from_checkpoint method, the Trainer's resume_from_checkpoint parameter, and test_tube). These are not well documented (at least I did not find the relevant page before searching GitHub), and I have no idea whether I used the right one; my understanding of the first two is sketched below.
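This reflects my own understanding only, with MyLightningModule and the checkpoint path as placeholders.

```python
import pytorch_lightning as pl

# Restore only the model weights and hyperparameters from a checkpoint file.
model = MyLightningModule.load_from_checkpoint("model_ckpt/_ckpt_epoch_5.ckpt")

# Restore the full training state (optimizer, epoch, callbacks) and keep training.
trainer = pl.Trainer(resume_from_checkpoint="model_ckpt/_ckpt_epoch_5.ckpt")
trainer.fit(model)
```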