
TrainResult/EvalResult does not log properly with on_step=True and on_epoch=True #2972

Closed
sykrn opened this issue Aug 14, 2020 · 4 comments · Fixed by #2986
Labels
bug Something isn't working help wanted Open to be worked on

Comments

sykrn commented Aug 14, 2020

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Here is the minimal code in Colab: here

OR:

Code sample

import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms

import pytorch_lightning as pl
from pytorch_lightning import TrainResult, EvalResult
from pytorch_lightning.metrics.functional import accuracy

class MNISTModel(pl.LightningModule):

    def __init__(self):
        super(MNISTModel, self).__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        acc = accuracy(y_hat, y)
        result = TrainResult(minimize=loss)
        result.log('tr_loss', loss, prog_bar=True, on_step=True, on_epoch=True)
        result.log('tr_acc', acc, prog_bar=True, on_step=True, on_epoch=True)
        return result

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        acc = accuracy(y_hat, y)
        result = EvalResult(checkpoint_on=loss, early_stop_on=loss)
        result.log('val_loss', loss, prog_bar=True, on_step=True, on_epoch=True)
        result.log('val_acc', acc, prog_bar=True, on_step=True, on_epoch=True)
        return result


    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=0.02)

train_loader = DataLoader(
    MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()),
    shuffle=True, batch_size=32)
val_loader = DataLoader(
    MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor()),
    batch_size=32)

mnist_model = MNISTModel()
trainer = pl.Trainer(gpus=1, progress_bar_refresh_rate=20, max_epochs=5)
trainer.fit(mnist_model, train_loader, val_loader)

Expected behavior

The step_val_loss graph in TensorBoard should contain $n_{\text{batches}} \times n_{\text{epochs}}$ points (one per step), but instead it shows only about as many points as there are epochs (just a few of them).
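For concreteness, here is a back-of-the-envelope count of how many points each series should contain. This is a plain-Python sketch (no Lightning needed); it assumes the standard 60,000-image MNIST training split and the batch_size/max_epochs used in the repro above.

```python
# With on_step=True, the step-level series should have one point per
# optimizer step, i.e. batches_per_epoch * max_epochs points in total.
# With on_epoch=True, the epoch-level series should instead have one
# aggregated point per epoch.
import math

train_size, batch_size, max_epochs = 60_000, 32, 5
batches_per_epoch = math.ceil(train_size / batch_size)   # 1875
expected_step_points = batches_per_epoch * max_epochs    # 9375
expected_epoch_points = max_epochs                       # 5

print(expected_step_points, expected_epoch_points)       # 9375 5
```

So a step-level chart that shows only around five points is behaving like the epoch-level one, which is the mismatch reported here.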

Environment

The script above was run with PL version 0.9.0rc12; I used the master version, installed via:

!pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@master --upgrade

Additional context

In another experiment, I found that step_tr_loss is also not logged properly (it behaves as if only on_epoch=True were in effect, with different values).

I hope someone can help with this problem, or point out a logical error in my code. I always upgrade PL to the latest master. :D

@sykrn sykrn added bug Something isn't working help wanted Open to be worked on labels Aug 14, 2020
@justusschock (Member)

cc @williamFalcon

@francoisruty

I have a similar problem

def training_step(self, batch, batch_nb):
    loss = ...
    result = TrainResult(minimize=loss)
    result.log('loss', loss, prog_bar=False, on_step=True, on_epoch=True)
    return result

yields nothing in the TensorBoard log dir except one data point at step 49; it's really weird
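One thing worth checking for the single point at step 49 (an assumption on my part, not confirmed in this thread): Trainers of this era only wrote logger rows every N global steps (an interval defaulting to 50), so a short run would produce exactly one point, at zero-based step 49. A minimal sketch of that sampling policy:

```python
# Hypothetical reconstruction of an every-N-steps row-logging policy,
# not Lightning's actual code. With a 0-based global step counter and
# interval 50, the first logged step is 49 -- matching the single data
# point reported above.
def logged_steps(total_steps: int, interval: int = 50):
    return [s for s in range(total_steps) if (s + 1) % interval == 0]

print(logged_steps(60))    # [49]
print(logged_steps(120))   # [49, 99]
```

If that is the cause, a run shorter than the interval would log nothing at all at step granularity.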

@williamFalcon (Contributor)

ummmm... that's weird. i'll check this out


sykrn commented Aug 15, 2020

To be more specific, here is a screenshot of the eval step (from my code above). Both the accuracy and the loss from EvalResult show the same problem.

[screenshot: step-level EvalResult metrics]

Compare with step_tr_loss (the correct one):

[screenshot: step_tr_loss]

In another case:

I also found the opposite inconsistency: a TrainResult that should record step values logged only a few of them, while the EvalResult logged the step values correctly (the flipped version of the case I posted here).

Sometimes it also did not log anything at all (yet another case).

williamFalcon added a commit that referenced this issue Aug 15, 2020
* add val step arg to metrics

* add step metrics
ameliatqy pushed a commit to ameliatqy/pytorch-lightning that referenced this issue Aug 17, 2020