Log training metrics for each epoch #914
Comments
Hey, thanks for your contribution! Great first issue!
How about this:
I'm also trying to log training metrics at the end of each epoch, and tried it as follows (based on the MNIST example in https://colab.research.google.com/drive/1F_RNcHzTfFuQf-LeKvSlud6x7jXYkG31#scrollTo=x-34xKCI40yW):
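(The actual snippet and traceback were not captured in this thread. The sketch below is a plausible reconstruction, not the original code, assuming training_end was expected to behave like validation_end and receive the whole epoch's outputs:)

```python
import torch
import torch.nn.functional as F

# methods inside a pl.LightningModule subclass (other methods omitted)
def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    return {'loss': loss}

def training_end(self, outputs):
    # assumes `outputs` is a list of per-batch dicts, as in validation_end;
    # training_end is actually called per batch, which is the source of the
    # confusion discussed below
    train_loss_mean = torch.stack([x['loss'] for x in outputs]).mean()
    return {'loss': train_loss_mean, 'log': {'train_loss': train_loss_mean}}
```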
However, the following error is thrown at
Weird thing is that it works for
@awaelchli That works for me, thanks!
@polars05 From my understanding,
yeah, training_end is a bit confusing. we discussed this and decided to change the name/modify it. training_end aggregates outputs of a batch on dp. We need a true training_end and something for the dp aggregation. @jeremyjordan @ethanwharris i think you guys had opinions about this? should we try to sneak this into this next release?
@jbschiratti, thanks for the clarifications! I've modified my system as per what @awaelchli suggested to you and that works for me as well. However, in the process, I observed that under
For reference, here's my system:
My guess is that we do not have the flexibility to name the dict key in
In addition, if we did the following instead in
then both
I could simply do:
but it might be cleaner for the user to not have to create two separate dicts?
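For context, the kind of return value being discussed looks roughly like this (a sketch assuming the metric is named train_loss; not the exact code from the comment above):

```python
import torch.nn.functional as F

# inside a pl.LightningModule subclass
def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    # 'loss' is what the optimizer uses; the nested 'log' dict is what gets
    # written to the logger (e.g. TensorBoard), i.e. the two separate dicts
    return {'loss': loss, 'log': {'train_loss': loss}}
```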
@williamFalcon I agree that the behavior for
The current
@polars05 that is correct! you can see how the values returned are processed here; on line 153 you can see that there's an explicit check for a
Just wondering what is the best way to do something at the end of each training epoch, before validation starts?
@versatran01 Maybe for your use case the on_epoch_end() hook is enough? It doesn't get the metrics from training though. If you need something that is conceptually similar to validation_end for training, then this does not exist yet. And this is also what @jbschiratti is asking for in this issue.
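For reference, a minimal sketch of that workaround, assuming the per-batch losses are accumulated manually in a hypothetical self.train_losses list (initialized to [] in __init__), since on_epoch_end receives no outputs argument:

```python
import torch
import torch.nn.functional as F

# inside a pl.LightningModule subclass
def training_step(self, batch, batch_idx):
    x, y = batch
    loss = F.cross_entropy(self.forward(x), y)
    self.train_losses.append(loss.detach())  # manual accumulation (hypothetical attribute)
    return {'loss': loss}

def on_epoch_end(self):
    # called at the end of every epoch, but with no `outputs` argument
    train_loss_mean = torch.stack(self.train_losses).mean()
    self.logger.experiment.add_scalar('training_loss', train_loss_mean,
                                      global_step=self.current_epoch)
    self.train_losses.clear()
```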
@awaelchli Thanks for the suggestion. I just saw some huge updates to the callback system, will give it a try.
@williamFalcon @jeremyjordan Changing the current
training_end makes sense. maybe a better name for collect_dp_batches is:
updated in 0.7.1
@williamFalcon I had a look at the latest release and I think that my issue still stands. Unless I am mistaken, there is still no equivalent of
At each epoch, one can average validation loss across all batches by doing:
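(A sketch of the usual pattern from the linked template, assuming validation_step returns a dict containing a val_loss tensor; not necessarily the exact code meant here.)

```python
import torch

# inside a pl.LightningModule subclass
def validation_end(self, outputs):
    # `outputs` is the list of dicts returned by validation_step, one per batch
    val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()
    # the 'log' dict is written to the logger once, at the end of validation
    return {'val_loss': val_loss_mean, 'log': {'val_loss': val_loss_mean}}
```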
But as of today, there is still no way to do the same in the training loop! I thought about the function
Shall I open a new issue?
@jbschiratti you're right, i don't see where this is implemented. reopened the issue.
@jeremyjordan I pushed a "proof of concept". See this commit. This is certainly not perfect but tests (the ones which were not skipped) are OK and it is what I'm expecting from pytorch-lightning's API: the behavior of
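Roughly the kind of hook being proposed, mirroring validation_epoch_end (a sketch based on this discussion, not the actual commit, and assuming training_step returns a dict with a 'loss' key):

```python
import torch

# inside a pl.LightningModule subclass
def training_epoch_end(self, outputs):
    # `outputs` collects what training_step returned for every batch of the epoch
    train_loss_mean = torch.stack([x['loss'] for x in outputs]).mean()
    # the 'log' dict is written to the logger once per epoch
    return {'log': {'train_loss': train_loss_mean}}
```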
@williamFalcon Shall I start a PR with what I already did?
@jbschiratti this is awesome. yes please, submit the PR!
Hi @jbschiratti, just noticed a difference between training_epoch_end and validation_epoch_end:

```python
def training_epoch_end(self, outputs):
    # works
    return outputs

def validation_epoch_end(self, outputs):
    # fails, unless next two lines are uncommented
    # loss_mean = outputs['val_loss'].mean().item()
    # outputs["val_loss"] = loss_mean
    return outputs
```

Is this as intended? Or a bug? Thanks!
Currently, I am able to log training metrics to TensorBoard using:
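(The snippet itself was not captured in this thread; the sketch below stands in for it and is only an assumption, showing one common way to log per batch directly from training_step:)

```python
import torch.nn.functional as F

# inside a pl.LightningModule subclass
def training_step(self, batch, batch_idx):
    x, y = batch
    loss = F.cross_entropy(self.forward(x), y)
    # writes the loss to TensorBoard after every batch
    self.logger.experiment.add_scalar('training_loss', loss, global_step=self.global_step)
    return {'loss': loss}
```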
This logs training metrics (loss, for instance) after each batch. I would like to be able to average these metrics across all batches and log them to TensorBoard only once, at the end of each epoch. This is what the validation_end method does in your example: https://github.com/PyTorchLightning/pytorch-lightning/blob/446a1e23d7fe3b2e07f1a5887fe819d0dfa7d4e0/pl_examples/basic_examples/lightning_module_template.py#L145. I first thought about writing my own training_end method. But this method is called after each batch instead of being called at the end of an epoch (as I would have thought). The method on_epoch_end seems interesting but does not receive an outputs argument as training_end does. Basically, in my model, I would like to write something like self.logger.experiment.add_scalar('training_loss', train_loss_mean, global_step=self.current_epoch), but I do not know where to put this line.