Checkpointing with SLURM #2278
Comments
Hi! Thanks for your contribution, great first issue!
Is this with 0.8.1?
It was with 0.8.0. I upgraded to 0.8.1 and the problem persists. Also, my manual checkpointing with …
Could be related to #2231; I experienced the same problems in my setup.
My hypothesis is that something is going wrong with determining the rank of the process when running on SLURM, so the logger calls are never executed and some outputs, for example, are never shown; otherwise training executes regularly. This could be due to an incorrect (non-zero) value in …
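One way to test that hypothesis is to print the rank-related variables SLURM sets for each task. The sketch below is a plain environment-variable check, not Lightning API; run inside the job, it should show exactly one task with `SLURM_PROCID=0`:

```python
# Diagnostic sketch: print the rank-related environment variables SLURM sets
# for this task, to check whether rank 0 is being detected as expected.
import os
import socket

for var in ('SLURM_JOB_ID', 'SLURM_NTASKS', 'SLURM_PROCID', 'SLURM_LOCALID', 'SLURM_NODEID'):
    print(f'{socket.gethostname()}: {var}={os.environ.get(var)}')
```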
You might want to try again after #1504 is merged.
Can you check to see if the weights are being saved under …? My guess is that they're being saved, but for some reason …
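A quick way to check is to search for checkpoint files from the job's working directory. This is a sketch that assumes the default `.ckpt` extension and the current directory as the search root; adjust both to wherever your `ModelCheckpoint` is pointed:

```python
# Sketch: list any checkpoint files written under the current working directory.
# The search root and the .ckpt extension are assumptions; adjust to your setup.
from pathlib import Path

for ckpt in sorted(Path('.').rglob('*.ckpt')):
    print(ckpt, ckpt.stat().st_size, 'bytes')
```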
Is this fixed by #2341?
Yes.
What is your question?
I have a PyTorch Lightning script with checkpointing that runs well on my desktop, but when I run it on our cluster with SLURM, the checkpoints do not get saved.
Code
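A minimal sketch of the kind of setup described, with a toy LightningModule and a ModelCheckpoint callback. Model, data, paths, and hyperparameters are placeholders, and the `filepath` argument follows the 0.8.x-era API:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'val_loss': torch.nn.functional.mse_loss(self(x), y)}

    def validation_epoch_end(self, outputs):
        # Non-special keys returned here become callback metrics that
        # ModelCheckpoint can monitor.
        return {'val_loss': torch.stack([o['val_loss'] for o in outputs]).mean()}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(16, 8), torch.randn(16, 1)), batch_size=16)


# Checkpoints should appear under ./checkpoints/ on the rank-0 process.
checkpoint_callback = ModelCheckpoint(
    filepath='checkpoints/{epoch}-{val_loss:.2f}',  # 0.8.x used `filepath`; newer versions use dirpath/filename
    monitor='val_loss',
    save_top_k=1,
)

# On the cluster, arguments such as gpus=..., num_nodes=..., and
# distributed_backend='ddp' (0.8.x names) would typically be added here.
trainer = pl.Trainer(max_epochs=5, checkpoint_callback=checkpoint_callback)
trainer.fit(ToyModel())
```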
What have you tried?
I run it on the cluster with the following code:
What's your environment?