Checkpoint gives error #526
I'm pretty sure that ...
Really? It defaults to True, but it is still saving all the models from all epochs.
but it may be changed in #128 :)
I used the newest version of pytorch-lightning, and got this error:

```
.../lib/python3.7/site-packages/pytorch_lightning/trainer/trainer_i...
-- Process 0 terminated with the following error:
Traceback (most recent call last):
...
```

Thanks for helping!
@Jiequannnnnnnnnn can you report it as a bug with a reproducible example...?
Sorry... I tried, but I'm not sure how to add a bug label. What does a reproducible example mean in this case?
you probably can't change a label, you need to create a new issue... by example I mean sample code which gives you this error, plus what library version you used...
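For instance, a version report like the sketch below, plus the smallest script that still raises the traceback, is usually enough; nothing here is specific to this issue:

```python
# Minimal sketch of the environment info to paste into a bug report:
# both packages expose __version__.
import torch
import pytorch_lightning

print("torch:", torch.__version__)
print("pytorch_lightning:", pytorch_lightning.__version__)
```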
@Borda added label.
@williamFalcon unfortunately, we do not have the rights/permissions to do it :]
Same issue as #525?
No, it is not the same issue.
I made a file that could reproduce more of the errors (except for the last Traceback).
Also, I am using the newest version of lightning.
If I comment out the checkpoint_callback in the trainer, training ends at the 8th epoch; I'm not sure why this happens.
Just covering the obvious things first: the main issue may be the save path. Can you make sure that the parent directories exist? Can you turn off DDP and see if it works?
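A quick sketch of that check, reusing the callback arguments quoted in this issue; `save_dir` is a placeholder, and the import path is the documented one, which may differ in older releases:

```python
import os
from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical save location -- adjust to your setup. Creating the
# parent directory up front rules out a missing-path failure inside
# the checkpoint callback.
save_dir = os.path.join(os.getcwd(), 'checkpoints')
os.makedirs(save_dir, exist_ok=True)

checkpoint_callback = ModelCheckpoint(
    filepath=save_dir,
    save_best_only=True,
    verbose=True,
    monitor='val_loss',
    mode='min',
)
```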
@Jiequannnnnnnnnn still having issues? If so, I will take a look at your code.
Yeah, still having the issue. Couldn't fix it. Thanks!
@Jiequannnnnnnnnn still issues? Should be fixed on master.
I have also run into this problem. Did you ever solve it?
Hi,
I wonder if we can save only the best model with the lowest validation error, and not save the other checkpoints.
I took a look at checkpoint_callback's save_best_only (below); it seems this saves the best model at every epoch (because the file name changes at every epoch). So I wonder if we can save only the best over the whole training process. Thanks!

```python
import os
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    filepath=os.getcwd(),
    save_best_only=True,
    verbose=True,
    monitor='val_loss',
    mode='min',
    prefix='',
)
```
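Until the callback supports a single rolling best file directly, one workaround is to prune old checkpoints yourself after each epoch. A minimal sketch of that idea; `BestOnlyKeeper` and its `update` method are hypothetical helpers, not lightning API:

```python
import os


class BestOnlyKeeper:
    """Keep exactly one checkpoint file on disk: the best seen so far.

    Hypothetical helper (not part of pytorch-lightning): call update()
    once per epoch with the monitored value and the path of the
    checkpoint that epoch just wrote; non-best files are deleted and
    the best one is renamed to a fixed path.
    """

    def __init__(self, best_path, mode='min'):
        self.best_path = best_path
        self.mode = mode
        self.best = float('inf') if mode == 'min' else float('-inf')

    def update(self, value, epoch_ckpt_path):
        improved = value < self.best if self.mode == 'min' else value > self.best
        if improved:
            self.best = value
            # os.replace overwrites the previous best (same filesystem).
            os.replace(epoch_ckpt_path, self.best_path)
        else:
            os.remove(epoch_ckpt_path)  # drop the checkpoint that did not improve


# Usage sketch with dummy files standing in for real checkpoints.
open('_ckpt_epoch_1.ckpt', 'w').close()
keeper = BestOnlyKeeper(best_path='best.ckpt', mode='min')
keeper.update(0.9, '_ckpt_epoch_1.ckpt')  # moved to best.ckpt

open('_ckpt_epoch_2.ckpt', 'w').close()
keeper.update(1.1, '_ckpt_epoch_2.ckpt')  # deleted; 0.9 is still the best
```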