
Add support to log hparams and metrics to tensorboard? #1228

Closed
mRcSchwering opened this issue Mar 24, 2020 · 50 comments · Fixed by #1630 or #1647
Labels
discussion (In a discussion stage) · help wanted (Open to be worked on) · question (Further information is requested) · won't fix (This will not be worked on)

Comments

@mRcSchwering

How can I log metrics (e.g. validation loss of best epoch) together with the set of hyperparameters?

I have looked through the docs and through the code.
It seems like an obvious thing, so maybe I'm just not getting it.

Currently, the only way that I found was to extend the logger class:

from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.utilities import rank_zero_only  # may live elsewhere in older PL versions
from torch.utils.tensorboard.summary import hparams


class MyTensorBoardLogger(TensorBoardLogger):

    def __init__(self, *args, **kwargs):
        super(MyTensorBoardLogger, self).__init__(*args, **kwargs)

    def log_hyperparams(self, *args, **kwargs):
        pass

    @rank_zero_only
    def log_hyperparams_metrics(self, params: dict, metrics: dict) -> None:
        params = self._convert_params(params)
        exp, ssi, sei = hparams(params, metrics)
        writer = self.experiment._get_file_writer()
        writer.add_summary(exp)
        writer.add_summary(ssi)
        writer.add_summary(sei)
        # some alternative should be added
        self.tags.update(params)

And then I'm writing the hparams with metrics in a callback:

    def on_train_end(self, trainer, module):
        module.logger.log_hyperparams_metrics(module.hparams, {'val_loss': self.best_val_loss})
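
For context, the surrounding callback might look something like this (a minimal sketch; the best-loss bookkeeping via trainer.callback_metrics and the 'val_loss' key are assumptions, not taken from the original code):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback

class HparamsMetricsCallback(Callback):
    """Track the best validation loss and write it with the hparams at the end of training."""

    def __init__(self):
        self.best_val_loss = float('inf')

    def on_validation_end(self, trainer, module):
        # assumes the module reports 'val_loss' so that it shows up in callback_metrics
        val_loss = trainer.callback_metrics.get('val_loss')
        if val_loss is not None:
            self.best_val_loss = min(self.best_val_loss, float(val_loss))

    def on_train_end(self, trainer, module):
        module.logger.log_hyperparams_metrics(
            module.hparams, {'val_loss': self.best_val_loss})

# wiring it up (MyTensorBoardLogger is the subclass defined above)
trainer = Trainer(
    logger=MyTensorBoardLogger('lightning_logs'),
    callbacks=[HparamsMetricsCallback()])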

But that doesn't seem right.
Is there a better way to write some metric together with the hparams as well?

Environment

  • OS: Ubuntu 18.04
  • conda 4.8.3
  • pytorch-lightning==0.7.1
  • torch==1.4.0
mRcSchwering added the question (Further information is requested) label on Mar 24, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

Borda added the duplicate (This issue or pull request already exists) label on Mar 25, 2020
@Borda
Member

Borda commented Mar 25, 2020

it seems to be duplicated, pls continue in #1225

Borda closed this as completed on Mar 25, 2020
@mRcSchwering
Author

mRcSchwering commented Mar 25, 2020

Actually, #1225 is not related. That issue is about providing a Namespace as hparams. Here, it's about logging a metric such as validation accuracy together with a set of hparams.

@awaelchli
Contributor

at which point would you like to log that? a) at each training step b) at the end of training c) something else?

@awaelchli
Contributor

Did you try this: In training_step for example:

self.hparams.my_custom_metric = 3.14
self.logger.log_hyperparams(self.hparams)

@mRcSchwering
Author

mRcSchwering commented Mar 26, 2020

Yes, I tried it. It seems that updating the hyperparams (writing them a second time) doesn't work. That's why I override the original log_hyperparams to do nothing, and then only call my own implementation, log_hyperparams_metrics, at the very end of the training. (log_hyperparams is called automatically by the PyTorch Lightning framework at some point during the start of the training.)

@mRcSchwering
Author

I want to achieve b).
E.g. I run 10 random hparam sampling rounds and then want to know which hparam set gave the best validation loss.

Borda added the help wanted (Open to be worked on) and discussion (In a discussion stage) labels and removed the duplicate (This issue or pull request already exists) label on Apr 9, 2020
@Borda
Member

Borda commented Apr 9, 2020

Maybe I am missing something about your use case... you are running a hyperparameter search with just one logger for all results? Not sure storing everything in a single logger run is a good idea 🐰

@mRcSchwering
Author

Usually I am training and validating a model with different sets of hyperparameters in a loop. After each round the final output is often something like "val-loss". This would be the validation loss of the best epoch achieved in this particular round.
Eventually, I have a number of "best validation losses" and each of these represents a certain set of hyper parameters.

After the last training round I look at the various sets of hyperparameters and compare them to their associated best validation losses. TensorBoard already provides tools for that: Visualize the results in TensorBoard's HParams plugin.

In your TensorBoardLogger you are already using the hparams function to summarize the hyperparameters, so you are almost there. This function can also take metrics as a second argument. However, in the current implementation you always pass {}. That's why I had to override your original implementation. Furthermore, you write this summary once at the beginning of the round, but the metrics are only known at the very end.

@mRcSchwering
Author

I wonder, how do you compare different hyperparameter sets? Maybe there is a functionality that I didn't find...

@SpontaneousDuck
Contributor

I am also seeing this same issue. No matter what I write with log_hyperparams, no data is output to TensorBoard. I only see a line in TensorBoard for my log, with no data for each run. The input I am using is a dict with values filled in. I tried both before and after trainer.fit(), with no results.

@SpontaneousDuck
Contributor

So it appears that Trainer.fit() calls run_pretrain_routine, which checks whether the model has an hparams attribute. Since this attribute is usually already defined in __init__, even if it is set to None by default, Lightning will still write the empty hyperparameters to the logger. In the case of TensorBoard, this causes all subsequent writes to the hyperparameters to be ignored. You can solve this in one of two ways:

  1. Call delattr(model, "hparams") on your LightningModule before trainer.fit() to ensure the hyperparameters are not automatically written out (a minimal sketch follows at the end of this comment).
  2. Use @mRcSchwering's code above. This does not write anything initially, since it replaces the original log_hyperparams with a no-op.
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.utilities import rank_zero_only  # may live elsewhere in older PL versions
from torch.utils.tensorboard.summary import hparams


class MyTensorBoardLogger(TensorBoardLogger):

    def __init__(self, *args, **kwargs):
        super(MyTensorBoardLogger, self).__init__(*args, **kwargs)

    def log_hyperparams(self, *args, **kwargs):
        pass

    @rank_zero_only
    def log_hyperparams_metrics(self, params: dict, metrics: dict) -> None:
        params = self._convert_params(params)
        exp, ssi, sei = hparams(params, metrics)
        writer = self.experiment._get_file_writer()
        writer.add_summary(exp)
        writer.add_summary(ssi)
        writer.add_summary(sei)
        # some alternative should be added
        self.tags.update(params)

Tip: I was only able to get metric values to display when they matched the tag name of a scalar I had logged previously in the log.
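
For option 1, a minimal sketch of where the delattr call would go (MyLightningModule and tb_logger are placeholders, not from this thread):

# MyLightningModule and tb_logger are placeholder names for illustration only
model = MyLightningModule(hparams)

# Drop the attribute so the pre-train routine does not write empty hparams;
# log them yourself later, together with the metrics (e.g. via the logger above).
delattr(model, "hparams")

trainer = Trainer(logger=tb_logger)
trainer.fit(model)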

@williamFalcon
Contributor

umm. so weird. my tb logs hparams automatically when they are attached to the module (self.hparams = hparams).

and metrics are logged normally through the log key.

did you try that? what exactly are you trying to do?

@SpontaneousDuck
Contributor

What log key are you using? I was never able to get metrics to show up in the hyperparameters window without using the above method. I can log the metrics to the scalars tab but not the hparams tab. I am going for something like this in my hparams tab:
[screenshot: HParams tab showing the hyperparameter columns alongside accuracy and loss metric columns]
I was able to get this using the above code. Setting the hparams attribute gave me the first column with the hyperparameters but I could not figure out how to add the accuracy and loss columns without the modified logger. Thanks!

@thhung

thhung commented Apr 18, 2020

Maybe a full simple example in Colab could be easier for this discussion?

@reactivetype

reactivetype commented Apr 25, 2020

@SpontaneousDuck I tried your fix with pytorch-lightning 0.7.4rc7 but it did not work for me.
I got the hparams in tb but the metrics are not enabled.

@reactivetype

and metrics are logged normally through the log key.

Can you please give a simple example?

@williamFalcon
Contributor

@mRcSchwering

  1. If the cluster (or you) kills your job, how would you log the parameters?
  2. What if I need to know the params AS the job is running, so that I can see the effect each has on the loss curve?

These are the reasons we do this at the beginning of training... but I agree this is not optimal. So, this seems to be a fundamental design problem with TensorBoard, unless I'm missing something obvious?

FYI:
@awaelchli , @justusschock

@mRcSchwering
Author

@williamFalcon indeed, both cases won't work with my solution.
Currently, I just wrote a function that writes the set of hparams with the best loss metric into a CSV and updates it after every round.

It's not really nice because I have all the epoch-wise metrics nicely organized in the tensorboard, but the hparams overview is in a separate .csv without all the tensorboard features.
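
The CSV helper itself is not shown in the thread; a minimal sketch of what such a function might look like (it assumes the same set of hparam keys every round):

import csv
import os

def append_hparams_row(csv_path: str, hparams: dict, best_val_loss: float) -> None:
    """Append one row with this round's hyperparameters and its best validation loss."""
    # assumes every round produces the same hparam keys, so the header stays valid
    row = {**hparams, 'best_val_loss': best_val_loss}
    write_header = not os.path.exists(csv_path)
    with open(csv_path, 'a', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)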

@justusschock
Member

Just a question: If you log them multiple times, would they be overwritten on tensorboard? Because then we could probably also log them each epoch.

@mRcSchwering
Author

mRcSchwering commented Apr 27, 2020

From some comments I get the feeling we might be talking about 2 things here.
There are 2 levels of metrics here. Let me clarify with an example:

Say I have a model which has the hyperparameter learning rate, it's a classifier, I want to train it for 100 epochs, I want to try out 2 different learning rates (say 1e-4 and 1e-5).

On the one hand, I want to describe the training progress. I will do that with a training/validation loss and accuracy, which I write down for every epoch. Let me call these metrics epoch-wise metrics.

On the other hand, I want to have an overview of which learning rate worked better.
So maybe 1e-4 had the lowest validation loss at epoch 20 with 0.6, and 1e-5 at epoch 80 with 0.5.
This summary might get updates every epoch, but it only summarizes the best achieved epoch of every training run. In this summary I can see what influence the learning rate has on the training. Let me call this run metrics.

What I was talking about the whole time is the second case, the run metrics.
The first case (epoch-wise metrics) works fine for me.
What I was trying to get is the second case.
In TensorBoard there is an extra tab for this (4_visualize_the_results_in_tensorboards_hparams_plugin).

@justusschock
Member

Okay, so just to be sure: for your run metrics, you would like to log metrics and hparams in each step?

I think that's a bit overkill to add as default behaviour. If it's just about the learning rate, maybe #1498 could help you there. Otherwise I'm afraid I can't think of a generic implementation here that wouldn't add much overhead for most users :/

@mRcSchwering
Author

Hi, unfortunately it doesn't work yet.
I tried out the change and logged hparams like this:

current_loss = float(module.val_loss.cpu().numpy())
if current_loss < self.best_loss:
    self.best_loss = current_loss
    metrics = {'hparam/loss': self.best_loss}
    module.logger.log_hyperparams(params=module.hparams, metrics=metrics)

The parameter appears as a column in the hparams tab but it doesn't get a value.

[screenshot: HParams tab with the new metric column present but empty]

@mRcSchwering
Author

Btw, if I use the high level API from tensorboard, everything works fine:

from torch.utils.tensorboard import SummaryWriter

with SummaryWriter(log_dir=log_dir) as w:
    w.add_hparams(module.hparams, metrics)

Is there a reason why you don't use the high-level API in the TensorBoard logger?

@justusschock
Member

Sorry, that was my mistake. I thought this was all handled by pytorch itself. Can you maybe try #1647 ?

@mRcSchwering
Author

So theoretically it works. log_hyperparams can log metrics now.
However, every call to log_hyperparams creates a new event file. So, I would still have to use my hack in order to make sure log_hyperparams is only called once.

@williamFalcon
Contributor

but how do you do this if you kill training prematurely or the cluster kills your job prematurely?

your solution assumes the user ALWAYS gets to training_end

@SpontaneousDuck
Contributor

SpontaneousDuck commented Apr 28, 2020

The current log_hyperparams does follow the default SummaryWriter behaviour now (creating a separate TensorBoard file for each call to log_hyperparams). The problem we are seeing here is that the default behaviour does not match our use case or give us enough flexibility. PyTorch Lightning saw this problem, which is why it did not use this implementation in TensorBoardLogger: it breaks the link between all the other metrics you logged for the training session, so you end up with one file with all your training logs and a separate one with just hyperparameters.

It also looks like the Keras way of doing this is writing the hyperparameters at the beginning of training and then writing the metrics and a status message at the end. Not sure if that is possible in PyTorch right now though...

The best solution to this, I believe, is just allowing the user more control. The code @mRcSchwering wrote above mostly does this; having metrics be an optional parameter would solve it. If you call our modified TensorBoardLogger with metrics logging, then as long as the tags of the metrics you want to display match your other logs, you can write the hyperparameters with dummy metrics at the beginning of training and TensorBoard will automatically update the metrics with the most recently logged data. Example steps below (a combined sketch follows the steps):

  1. tb_logger.log_hyperparams_metrics(parameters, {"val/accuracy": 0, "val/loss": 1})
  2. In pl.LightningModule, log metrics with matching tags:
    def validation_epoch_end(self, outputs):
        ...
        return {"log": {"val/accuracy": accuracy, "val/loss": loss}}
    
  3. Tensorboard will pull the most recent value for these metrics from training. This gets around the issue of early stopping and still allows the user to log metrics.
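
Putting the steps together, a combined sketch of this workflow (sample_hparams, MyModel, and the metric tags are placeholders; MyTensorBoardLogger is the subclass shown earlier in the thread):

import pytorch_lightning as pl

for i in range(10):
    params = sample_hparams()  # hypothetical random-search sampler
    tb_logger = MyTensorBoardLogger('lightning_logs', name='trial_{}'.format(i))

    # Step 1: write the hparams once, with dummy values for the metrics to display.
    tb_logger.log_hyperparams_metrics(params, {"val/accuracy": 0, "val/loss": 1})

    # Steps 2-3: train as usual; the module logs real values under the same tags,
    # and TensorBoard's HParams tab shows the most recently logged value of each.
    trainer = pl.Trainer(logger=tb_logger, max_epochs=100)
    trainer.fit(MyModel(params))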

@mRcSchwering
Author

mRcSchwering commented Apr 28, 2020

I found that if the many files written by log_hyperparams are in the same directory (and have the same set of hyperparameters), TensorBoard also correctly interprets them as a metric that was updated.
Here is an implementation using callbacks for reporting metrics, a module that collects results, and the logger workaround.

@reactivetype

Here is an implementation using callbacks for reporting metrics, a module that collects results, and the logger workaround.

Would it be possible to extend your pattern to collect Trainer.test results? If you define on_test_start and test_epoch_end in MetricsAndBestLossOnEpochEnd, would it create new event files? Would the test metrics appear in the same row as the best/val log and simply have columns for the test metrics?

@mRcSchwering
Author

mRcSchwering commented May 2, 2020

@reactivetype actually, now I just got what @SpontaneousDuck meant. Here is an example. At the beginning I write all the hyperparameter metrics that I want; in my case I use this logger. I do this with a module base class which writes a hyperparameter metric best/val-loss at the beginning of the training run.
During the training I can update this by just adding the appropriate key to the log dictionary of the return value. E.g.

    def validation_epoch_end(self, train_steps: List[dict]) -> dict:
        loss = self._get_loss(train_steps)
        log = {
            'loss/val': loss,
            'best/val-loss': self._update_best_loss(loss)}
        return {'val_loss': loss, 'log': log}

Then, it doesn't matter where you do this. You could also do this in the test_epoch_end. Everything is written into one file.
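
_update_best_loss is not shown in the comment; a plausible implementation, assuming the base module initialises self._best_val_loss to infinity:

    def _update_best_loss(self, loss) -> float:
        """Return the lowest validation loss seen so far in this run."""
        # assumes self._best_val_loss = float('inf') was set in the base module's __init__
        loss = float(loss)
        if loss < self._best_val_loss:
            self._best_val_loss = loss
        return self._best_val_loss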

@reactivetype

reactivetype commented May 6, 2020

During the training I can update this by just adding the appropriate key to the log dictionary of the return key

Does this mean we can further simplify the callback pattern in your examples?

@mRcSchwering
Author

During the training I can update this by just adding the appropriate key to the log dictionary of the return key

Does this mean we can further simplify the callback pattern in your examples?

Yes, I'm not using callbacks anymore; everything is inherited from a base module class. I usually prefer callbacks, but in this case the on_epoch_end callback function doesn't get the predictions and targets of the epoch, so I would have to write them onto a module attribute. The epoch-end hook on the module, however, has all the targets and predictions as an argument.

@wasserth

Are there any plans to include this in Lightning? It would be really nice to be able to use metrics inside of TensorBoard hparams without any hacking.

williamFalcon reopened this on May 19, 2020
@williamFalcon
Contributor

yes, let's officially make this a fix! @mRcSchwering want to submit a PR?

williamFalcon changed the title from "How to log hparams with metrics to tensorboard?" to "Add support to log hparams and metrics to tensorboard?" on May 19, 2020
@Borda
Member

Borda commented May 26, 2020

@mRcSchwering so it is solved, right? 🦝

@mRcSchwering
Author

mRcSchwering commented May 26, 2020

@williamFalcon I just took a look at it. Wouldn't the pre-train routine be the place where the initial metrics have to be logged?

One could add another attribute to the LightningModule which would be added as the metrics in that call.
That would also mean that in both the pre-train routine and the LoggerCollection one would have to check the logger type (only adding the metrics if it is the TensorBoard logger).
That solution looks messier to me than just extending the logger (as discussed above).

Is there a reason why log_hyperparams is called in the pre-train routine and not in on_train_start?
That would make changing the log_hyperparams behavior by the user more transparent.

@mRcSchwering
Author

@Borda yes, with #1228 (comment) it is currently possible to do it.

@cramdoulfa

Thanks all for your contributions in solving this issue that I have also been struggling with.

Would anyone be able to summarize what the current recommended approach is and maybe edit the documentation? https://pytorch-lightning.readthedocs.io/en/latest/loggers.html

My current strategy is still quite hacky.

@MilesCranmer

MilesCranmer commented Jul 4, 2020

For those trying to solve this, here's another proposed solution: #1778 (comment)
I think the solution in #1228 (comment) is still a cleaner way of solving this, but not sure

@gwiener

gwiener commented Jul 16, 2020

Following the idea from @SpontaneousDuck I found the following way to bypass the problem without modifying the framework code: add a dummy call to log_hyperparams with a metrics placeholder before Trainer.run_pretrain_routine is called, for example:

class MyModule(LightningModule):
    # set up 'test_loss' metric before fit routine starts
    def on_fit_start(self):
        metric_placeholder = {'test_loss': 0}
        self.logger.log_hyperparams(self.hparams, metrics=metric_placeholder)
    
    # at some method later
    def test_epoch_end(self, outputs):
        metrics_log = {'test_loss': something}
        return {'log': metrics_log}

The last metric will show up in the TensorBoard HPARAMS table as desired, although the metric graph will include the initial dummy point.

@neil-tan

Thanks @gwiener for the quick workaround.
Though, with this method, dummy points are added every time the training is restored from checkpoints - still looking for a solution here.

@TrentBrick

TrentBrick commented Aug 17, 2020

Likewise still looking for a solution here. And the solutions provided above did not work for me.

@justusschock
Member

we're discussing this in #2974

@stale

stale bot commented Oct 22, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix (This will not be worked on) label on Oct 22, 2020
The stale bot closed this as completed on Oct 29, 2020