
Add support to log hparams and metrics to tensorboard? #1228

Closed
mRcSchwering opened this issue Mar 24, 2020 · 50 comments · Fixed by #1630 or #1647
Labels
discussion (In a discussion stage) · help wanted (Open to be worked on) · question (Further information is requested) · won't fix (This will not be worked on)

Comments

@mRcSchwering

How can I log metrics (e.g. validation loss of best epoch) together with the set of hyperparameters?

I have looked through the docs and through the code.
It seems like an obvious thing, so maybe I'm just not getting it.

Currently, the only way that I found was to extend the logger class:

from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.utilities import rank_zero_only  # may live elsewhere in older PL versions
from torch.utils.tensorboard.summary import hparams


class MyTensorBoardLogger(TensorBoardLogger):

    def __init__(self, *args, **kwargs):
        super(MyTensorBoardLogger, self).__init__(*args, **kwargs)

    def log_hyperparams(self, *args, **kwargs):
        pass

    @rank_zero_only
    def log_hyperparams_metrics(self, params: dict, metrics: dict) -> None:
        params = self._convert_params(params)
        exp, ssi, sei = hparams(params, metrics)
        writer = self.experiment._get_file_writer()
        writer.add_summary(exp)
        writer.add_summary(ssi)
        writer.add_summary(sei)
        # some alternative should be added
        self.tags.update(params)

And then I'm writing the hparams with metrics in a callback:

    def on_train_end(self, trainer, module):
        module.logger.log_hyperparams_metrics(module.hparams, {'val_loss': self.best_val_loss})
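
For context, the surrounding callback might look something like this (a minimal sketch; the best-loss bookkeeping via trainer.callback_metrics and the 'val_loss' key are assumptions, not taken from the original code):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback

class HparamsMetricsCallback(Callback):
    """Track the best validation loss and write it with the hparams at the end of training."""

    def __init__(self):
        self.best_val_loss = float('inf')

    def on_validation_end(self, trainer, module):
        # assumes the module reports 'val_loss' so that it shows up in callback_metrics
        val_loss = trainer.callback_metrics.get('val_loss')
        if val_loss is not None:
            self.best_val_loss = min(self.best_val_loss, float(val_loss))

    def on_train_end(self, trainer, module):
        module.logger.log_hyperparams_metrics(
            module.hparams, {'val_loss': self.best_val_loss})

# wiring it up (MyTensorBoardLogger is the subclass defined above)
trainer = Trainer(
    logger=MyTensorBoardLogger('lightning_logs'),
    callbacks=[HparamsMetricsCallback()])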

But that doesn't seem right.
Is there a better way to write some metric together with the hparams as well?

Environment

  • OS: Ubuntu 18.04
  • conda 4.8.3
  • pytorch-lightning==0.7.1
  • torch==1.4.0
mRcSchwering added the question (Further information is requested) label on Mar 24, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

Borda added the duplicate (This issue or pull request already exists) label on Mar 25, 2020
@Borda
Member

Borda commented Mar 25, 2020

it seems to be duplicated, pls continue in #1225

Borda closed this as completed on Mar 25, 2020
@mRcSchwering
Author

mRcSchwering commented Mar 25, 2020

Actually, #1225 is not related. That issue is about providing a Namespace as hparams. Here, it's about logging a metric such as validation accuracy together with a set of hparams.

@awaelchli
Contributor

at which point would you like to log that? a) at each training step b) at the end of training c) something else?

@awaelchli
Contributor

Did you try this: In training_step for example:

self.hparams.my_custom_metric = 3.14
self.logger.log_hyperparams(self.hparams)

@mRcSchwering
Author

mRcSchwering commented Mar 26, 2020

Yes, I tried it. It seems that updating the hyperparams (writing them a second time) doesn't work. That's why I override the original log_hyperparams to do nothing, and then only call my own implementation, log_hyperparams_metrics, at the very end of the training. (log_hyperparams is called automatically by the PyTorch Lightning framework at some point during the start of the training.)

@mRcSchwering
Author

I want to achieve b).
E.g. I run 10 random hparam sampling rounds and then want to know which hparam set gave the best validation loss.

Borda added the help wanted (Open to be worked on) and discussion (In a discussion stage) labels and removed the duplicate (This issue or pull request already exists) label on Apr 9, 2020
@Borda
Member

Borda commented Apr 9, 2020

Maybe I am missing something about your use case... you are running a hyperparameter search with just one logger for all results? Not sure storing everything in a single logger run is a good idea 🐰

@mRcSchwering
Author

Usually I am training and validating a model with different sets of hyperparameters in a loop. After each round the final output is often something like "val-loss". This would be the validation loss of the best epoch achieved in this particular round.
Eventually, I have a number of "best validation losses" and each of these represents a certain set of hyper parameters.

After the last training round I look at the various sets of hyperparameters and compare them to their associated best validation losses. TensorBoard already provides tools for that: Visualize the results in TensorBoard's HParams plugin.

In your TensorBoardLogger you are already using the hparams function to summarize the hyperparameters, so you are almost there. This function can also take metrics as a second argument. However, in the current implementation you always pass {}. That's why I had to override your original implementation. Furthermore, you write this summary once at the beginning of the round, but the metrics are only known at the very end.

@mRcSchwering
Author

I wonder, how do you compare different hyperparameter sets? Maybe there is a functionality that I didn't find...

@SpontaneousDuck
Contributor

I am also seeing this same issue. No matter what I write with log_hyperparams, no data is output to TensorBoard. I only see a line in TensorBoard for my log, with no data for each run. The input I am using is a dict with values filled in. I tried both before and after trainer.fit(), with no results.

@SpontaneousDuck
Contributor

So it appears that Trainer.fit() calls run_pretrain_routine, which checks whether the model has an hparams attribute. Since this attribute is usually already defined in __init__, even if it is set to None by default, Lightning will still write the empty hyperparameters to the logger. In the case of TensorBoard, this causes all subsequent writes to the hyperparameters to be ignored. You can solve this in one of two ways:

  1. Call delattr(model, "hparams") on your LightningModule before trainer.fit() to ensure the hyperparameters are not automatically written out (a minimal sketch follows at the end of this comment).
  2. Use @mRcSchwering's code above. This does not write anything initially, since it replaces the original log_hyperparams with a no-op.
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.utilities import rank_zero_only  # may live elsewhere in older PL versions
from torch.utils.tensorboard.summary import hparams


class MyTensorBoardLogger(TensorBoardLogger):

    def __init__(self, *args, **kwargs):
        super(MyTensorBoardLogger, self).__init__(*args, **kwargs)

    def log_hyperparams(self, *args, **kwargs):
        pass

    @rank_zero_only
    def log_hyperparams_metrics(self, params: dict, metrics: dict) -> None:
        params = self._convert_params(params)
        exp, ssi, sei = hparams(params, metrics)
        writer = self.experiment._get_file_writer()
        writer.add_summary(exp)
        writer.add_summary(ssi)
        writer.add_summary(sei)
        # some alternative should be added
        self.tags.update(params)

Tip: I was only able to get metric values to display when they matched the tag name of a scalar I had logged previously in the log.
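
For option 1, a minimal sketch of where the delattr call would go (MyLightningModule and tb_logger are placeholders, not from this thread):

# MyLightningModule and tb_logger are placeholder names for illustration only
model = MyLightningModule(hparams)

# Drop the attribute so the pre-train routine does not write empty hparams;
# log them yourself later, together with the metrics (e.g. via the logger above).
delattr(model, "hparams")

trainer = Trainer(logger=tb_logger)
trainer.fit(model)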

@williamFalcon
Contributor

umm. so weird. my tb logs hparams automatically when they are attached to the module (self.hparams = hparams).

and metrics are logged normally through the log key.

did you try that? what exactly are you trying to do?

@SpontaneousDuck
Contributor

What log key are you using? I was never able to get metrics to show up in the hyperparameters window without using the above method. I can log the metrics to the scalars tab but not the hparams tab. I am going for something like this in my hparams tab:
[screenshot: HParams tab showing the hyperparameter columns alongside accuracy and loss metric columns]
I was able to get this using the above code. Setting the hparams attribute gave me the first column with the hyperparameters but I could not figure out how to add the accuracy and loss columns without the modified logger. Thanks!

@thhung

thhung commented Apr 18, 2020

Maybe a full simple example in Colab could be easier for this discussion?

@reactivetype

reactivetype commented Apr 25, 2020

@SpontaneousDuck I tried your fix with pytorch-lightning 0.7.4rc7 but it did not work for me.
I got the hparams in tb but the metrics are not enabled.

@reactivetype

and metrics are logged normally through the log key.

Can you please give a simple example?

@williamFalcon
Contributor

@mRcSchwering

  1. If the cluster (or you) kills your job, how would you log the parameters?
  2. What if I need to know the params AS the job is running, so that I can see the effect each has on the loss curve?

These are the reasons we do this at the beginning of training... but I agree this is not optimal. So, this seems to be a fundamental design problem with TensorBoard, unless I'm missing something obvious?

FYI:
@awaelchli , @justusschock

@mRcSchwering
Author

@williamFalcon indeed, both cases won't work with my solution.
Currently, I just wrote a function that writes the set of hparams with the best loss metric into a CSV and updates it after every round.

It's not really nice because I have all the epoch-wise metrics nicely organized in the tensorboard, but the hparams overview is in a separate .csv without all the tensorboard features.
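
The CSV helper itself is not shown in the thread; a minimal sketch of what such a function might look like (it assumes the same set of hparam keys every round):

import csv
import os

def append_hparams_row(csv_path: str, hparams: dict, best_val_loss: float) -> None:
    """Append one row with this round's hyperparameters and its best validation loss."""
    # assumes every round produces the same hparam keys, so the header stays valid
    row = {**hparams, 'best_val_loss': best_val_loss}
    write_header = not os.path.exists(csv_path)
    with open(csv_path, 'a', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)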

@justusschock
Member

Just a question: If you log them multiple times, would they be overwritten on tensorboard? Because then we could probably also log them each epoch.

@mRcSchwering
Author

mRcSchwering commented Apr 27, 2020

From some comments I get the feeling we might be talking about 2 things here.
There are 2 levels of metrics here. Let me clarify with an example:

Say I have a model which has the hyperparameter learning rate, it's a classifier, I want to train it for 100 epochs, I want to try out 2 different learning rates (say 1e-4 and 1e-5).

On the one hand, I want to describe the training progress. I will do that with a training/validation loss and accuracy, which I write down for every epoch. Let me call these metrics epoch-wise metrics.

On the other hand, I want to have an overview of which learning rate worked better.
So maybe 1e-4 had the lowest validation loss at epoch 20 with 0.6, and 1e-5 at epoch 80 with 0.5.
This summary might get updates every epoch, but it only summarizes the best achieved epoch of every training run. In this summary I can see what influence the learning rate has on the training. Let me call this run metrics.

What I was talking about the whole time is the second case, the run metrics.
The first case (epoch-wise metrics) works fine for me.
What I was trying to get is the second case.
In TensorBoard there is an extra tab for this (4_visualize_the_results_in_tensorboards_hparams_plugin).

@justusschock
Member

Okay, so just to be sure: for your run metrics, you would like to log metrics and hparams in each step?

I think that's a bit overkill to add as default behaviour. If it's just about the learning rate, maybe #1498 could help you there. Otherwise I'm afraid I can't think of a generic implementation here that wouldn't add much overhead for most users :/

@mRcSchwering
Author

Hi, unfortunately it doesn't work yet.
I tried out the change and logged hparams like this:

current_loss = float(module.val_loss.cpu().numpy())
if current_loss < self.best_loss:
    self.best_loss = current_loss
    metrics = {'hparam/loss': self.best_loss}
    module.logger.log_hyperparams(params=module.hparams, metrics=metrics)

The parameter appears as a column in the hparams tab but it doesn't get a value.

[screenshot: HParams tab with the new metric column present but empty]

@mRcSchwering
Author

Btw, if I use the high level API from tensorboard, everything works fine:

from torch.utils.tensorboard import SummaryWriter

with SummaryWriter(log_dir=log_dir) as w:
    w.add_hparams(module.hparams, metrics)

Is there a reason why you don't use the high-level API in the TensorBoard logger?

@justusschock
Member

Sorry, that was my mistake. I thought this was all handled by pytorch itself. Can you maybe try #1647 ?

@mRcSchwering
Author

So theoretically it works. log_hyperparams can log metrics now.
However, every call to log_hyperparams creates a new event file. So, I would still have to use my hack in order to make sure log_hyperparams is only called once.

@williamFalcon
Contributor

but how do you do this if you kill training prematurely or the cluster kills your job prematurely?

your solution assumes the user ALWAYS gets to training_end

@SpontaneousDuck
Contributor

SpontaneousDuck commented Apr 28, 2020

The current log_hyperparams does follow the default SummaryWriter behaviour now (creating a separate TensorBoard file for each call to log_hyperparams). The problem we are seeing here is that the default behaviour does not match our use case or give us enough flexibility. PyTorch Lightning saw this problem, which is why it did not use this implementation in TensorBoardLogger: it breaks the link between all the other metrics you logged for the training session, so you end up with one file with all your training logs and a separate one with just hyperparameters.

It also looks like the Keras way of doing this is writing the hyperparameters at the beginning of training and then writing the metrics and a status message at the end. Not sure if that is possible in PyTorch right now though...

The best solution to this, I believe, is just allowing the user more control. The code @mRcSchwering wrote above mostly does this; having metrics be an optional parameter would solve it. If you call our modified TensorBoardLogger with metrics logging, then as long as the tags of the metrics you want to display match your other logs, you can write the hyperparameters with dummy metrics at the beginning of training and TensorBoard will automatically update the metrics with the most recently logged data. Example steps below (a combined sketch follows the steps):

  1. tb_logger.log_hyperparams_metrics(parameters, {"val/accuracy": 0, "val/loss": 1})
  2. In pl.LightningModule, log metrics with matching tags:
    def validation_epoch_end(self, outputs):
        ...
        return {"log": {"val/accuracy": accuracy, "val/loss": loss}}
    
  3. Tensorboard will pull the most recent value for these metrics from training. This gets around the issue of early stopping and still allows the user to log metrics.
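
Putting the steps together, a combined sketch of this workflow (sample_hparams, MyModel, and the metric tags are placeholders; MyTensorBoardLogger is the subclass shown earlier in the thread):

import pytorch_lightning as pl

for i in range(10):
    params = sample_hparams()  # hypothetical random-search sampler
    tb_logger = MyTensorBoardLogger('lightning_logs', name='trial_{}'.format(i))

    # Step 1: write the hparams once, with dummy values for the metrics to display.
    tb_logger.log_hyperparams_metrics(params, {"val/accuracy": 0, "val/loss": 1})

    # Steps 2-3: train as usual; the module logs real values under the same tags,
    # and TensorBoard's HParams tab shows the most recently logged value of each.
    trainer = pl.Trainer(logger=tb_logger, max_epochs=100)
    trainer.fit(MyModel(params))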

@mRcSchwering
Author

mRcSchwering commented Apr 28, 2020

I found that if the many files written by log_hyperparams are in the same directory (and have the same set of hyperparameters), TensorBoard also correctly interprets them as a metric that was updated.
Here is an implementation using callbacks for reporting metrics, a module that collects results, and the logger workaround.

@reactivetype

Here is an implementation using callbacks for reporting metrics, a module that collects results, and the logger workaround.

Would it be possible to extend your pattern to collect Trainer.test results? If you define on_test_start and test_epoch_end in MetricsAndBestLossOnEpochEnd, would it create new event files? Would the test metrics appear in the same row as the best/val log and simply have columns for the test metrics?

@mRcSchwering
Author

mRcSchwering commented May 2, 2020

@reactivetype actually, now I just got what @SpontaneousDuck meant. Here is an example. At the beginning I write all the hyperparameter metrics that I want; in my case I use this logger. I do this with a module base class which writes a hyperparameter metric best/val-loss at the beginning of the training run.
During the training I can update this by just adding the appropriate key to the log dictionary of the return value. E.g.

    def validation_epoch_end(self, train_steps: List[dict]) -> dict:
        loss = self._get_loss(train_steps)
        log = {
            'loss/val': loss,
            'best/val-loss': self._update_best_loss(loss)}
        return {'val_loss': loss, 'log': log}

Then, it doesn't matter where you do this. You could also do this in the test_epoch_end. Everything is written into one file.
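
_update_best_loss is not shown in the comment; a plausible implementation, assuming the base module initialises self._best_val_loss to infinity:

    def _update_best_loss(self, loss) -> float:
        """Return the lowest validation loss seen so far in this run."""
        # assumes self._best_val_loss = float('inf') was set in the base module's __init__
        loss = float(loss)
        if loss < self._best_val_loss:
            self._best_val_loss = loss
        return self._best_val_loss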

@reactivetype

reactivetype commented May 6, 2020

During the training I can update this by just adding the appropriate key to the log dictionary of the return key

Does this mean we can further simplify the callback pattern in your examples?

@mRcSchwering
Author

During the training I can update this by just adding the appropriate key to the log dictionary of the return key

Does this mean we can further simplify the callback pattern in your examples?

Yes, I'm not using callbacks anymore; everything is inherited from a base module class. I usually prefer callbacks, but in this case the on_epoch_end callback function doesn't get the predictions and targets of the epoch, so I would have to write them onto a module attribute. The epoch-end hook on the module, however, has all the targets and predictions as an argument.

@wasserth

Are there any plans to include this in Lightning? It would be really nice to be able to use metrics inside of TensorBoard hparams without any hacking.

williamFalcon reopened this on May 19, 2020
@williamFalcon
Contributor

yes, let's officially make this a fix! @mRcSchwering want to submit a PR?

williamFalcon changed the title from "How to log hparams with metrics to tensorboard?" to "Add support to log hparams and metrics to tensorboard?" on May 19, 2020
@Borda
Member

Borda commented May 26, 2020

@mRcSchwering so it is solved, right? 🦝

@mRcSchwering
Author

mRcSchwering commented May 26, 2020

@williamFalcon I just took a look at it. Wouldn't the pre-train routine be the place where the initial metrics have to be logged?

One could add another attribute to the LightningModule which would be added as the metrics in that call.
That would also mean that in both the pre-train routine and the LoggerCollection one would have to check the logger type (only adding the metrics if it is the TensorBoard logger).
That solution looks messier to me than just extending the logger (as discussed above).

Is there a reason why log_hyperparams is called in the pre-train routine and not in on_train_start?
That would make changing the log_hyperparams behavior by the user more transparent.

@mRcSchwering
Author

@Borda yes, with #1228 (comment) it is currently possible to do it.

@cramdoulfa

Thanks all for your contributions in solving this issue that I have also been struggling with.

Would anyone be able to summarize what the current recommended approach is and maybe edit the documentation? https://pytorch-lightning.readthedocs.io/en/latest/loggers.html

My current strategy is still quite hacky.

@MilesCranmer

MilesCranmer commented Jul 4, 2020

For those trying to solve this, here's another proposed solution: #1778 (comment)
I think the solution in #1228 (comment) is still a cleaner way of solving this, but not sure

@gwiener

gwiener commented Jul 16, 2020

Following the idea from @SpontaneousDuck I found the following way to bypass the problem without modifying the framework code: add a dummy call to log_hyperparams with a metrics placeholder before Trainer.run_pretrain_routine is called, for example:

class MyModule(LightningModule):
    # set up 'test_loss' metric before fit routine starts
    def on_fit_start(self):
        metric_placeholder = {'test_loss': 0}
        self.logger.log_hyperparams(self.hparams, metrics=metric_placeholder)
    
    # at some method later
    def test_epoch_end(self, outputs):
        metrics_log = {'test_loss': something}
        return {'log': metrics_log}

The last metric will show up in the TensorBoard HPARAMS table as desired, although the metric graph will include the initial dummy point.

@neil-tan

Thanks @gwiener for the quick workaround.
Though, with this method, dummy points are added every time the training is restored from checkpoints - still looking for a solution here.

@TrentBrick

TrentBrick commented Aug 17, 2020

Likewise still looking for a solution here. And the solutions provided above did not work for me.

@justusschock
Member

we're discussing this in #2974

@stale

stale bot commented Oct 22, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix (This will not be worked on) label on Oct 22, 2020
The stale bot closed this as completed on Oct 29, 2020