
Tensorboard log_hyperparams(params, metrics) seems not to have effect #1778

Closed
karapostK opened this issue May 11, 2020 · 18 comments
Assignees
Labels
bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@karapostK

🐛 Bug

Calling self.logger.log_hyperparams(hparams_dict, metrics_dict) in test_epoch_end doesn't have the desired effect. It should create entries in the HParams section with the specified hyperparameters and metrics, but the section shows nothing instead.
Looking at the code, this seems to be caused by self.hparams and the pre-logging of the hyperparameters at the start of training. Because of that earlier log, later calls to log_hyperparams can no longer log the hyperparameters AND the metrics properly: they clash with the previous log, so nothing is shown.

To Reproduce

Try to log metrics with self.logger.log_hyperparams.

Code sample

def test_epoch_end(self, outputs):
    avg_ndcg = np.concatenate([x['ndcg'] for x in outputs]).mean()
    avg_recall = np.concatenate([x['recall'] for x in outputs]).mean()
    tensorboard_logs = {'test/avg_ndcg': avg_ndcg, 'test/avg_recall': avg_recall}
    # Log metrics
    self.logger.log_hyperparams(vars(self.params), tensorboard_logs)
    return tensorboard_logs

Expected behavior

TensorBoard should show me the HParams section, with each entry composed of the hyperparameters and metrics.

Environment

  • CUDA:
    - GPU:
    - available: False
    - version: 10.2
  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.5.0
    - pytorch-lightning: 0.7.5
    - tensorboard: 2.2.1
    - tqdm: 4.46.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.8.2
    - version: #34~18.04.1-Ubuntu SMP Fri Feb 28 13:42:26 UTC 2020


@karapostK karapostK added bug Something isn't working help wanted Open to be worked on labels May 11, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@williamFalcon williamFalcon added the priority: 0 High priority task label May 11, 2020
@williamFalcon williamFalcon self-assigned this May 11, 2020
@williamFalcon
Contributor

Yes, hparams seems not to be working for some reason. Looking into it.

@karapostK
Author

I played around with the issue and I think I may know the problem. When calling log_hyperparams, the function appends events to the existing log file (the one created in the pre-train routine) instead of overwriting it. It follows that TensorBoard sees one file with two different sets of metrics: { } from the first log, and the set added later with log_hyperparams. HParams then shows nothing, because TensorBoard expects the set of metrics to be the same across experiments.

To see it working, you can just put "del self.hparams" at the beginning of your pl.LightningModule. This effectively skips the pre-train logging and HParams shows up correctly in TensorBoard. Downside? You cannot load from checkpoint anymore ;)
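The "same set of metrics across experiments" requirement can be modeled in plain Python. The sketch below is a simplified, illustrative model of how the HParams dashboard ends up blank (it only shows metric columns common to every hparams session in a run); it is not TensorBoard's actual implementation, and `visible_metric_columns` is a name invented here.

```python
def visible_metric_columns(sessions):
    """Simplified model of the HParams dashboard: only metric columns
    present in *every* hparams session written to a run are displayed.
    Each session is the set of metric names it logged."""
    if not sessions:
        return set()
    common = set(sessions[0])
    for metric_names in sessions[1:]:
        common &= set(metric_names)
    return common

# The pre-train routine logs hparams with no metrics, then the user
# logs hparams again with real metrics in test_epoch_end:
sessions = [
    set(),                                 # premature log: {} metrics
    {"test/avg_ndcg", "test/avg_recall"},  # user's log_hyperparams call
]
print(visible_metric_columns(sessions))       # -> set(): nothing shown

# Skipping the premature log (e.g. the `del self.hparams` trick) leaves
# only the second session, so both metric columns appear:
print(visible_metric_columns(sessions[1:]))
```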

@williamFalcon
Contributor

@justusschock is this related to the changes with initializing tb?

@justusschock
Member

justusschock commented May 12, 2020

Probably it is (we used the writer's internal add_hparams for this, which seems to create a separate file). Hopefully this was fixed in #1630 and #1647

@williamFalcon
Contributor

Confirmed this is fixed on master.

https://colab.research.google.com/drive/1K6Gxo99O6dEbzzj_lW8jAL74OvjojV9T


@karapostK
Author

The issue is not fixed yet, unfortunately.
Calling log_hyperparams(vars(self.hparams), {"SOME_METRIC": 2012}) in test_epoch_end won't log the metric in HParams. The problem is still the same: the initial log of the hyperparameters interferes with the later call and prevents metrics from being logged in HParams!

@karapostK
Author

karapostK commented May 12, 2020

As you can see, no metrics are shown on tensorboard

[Screenshot: TensorBoard HParams tab with no metric columns shown]

def test_epoch_end(self, outputs):
    avg_ndcg = np.concatenate([x['ndcg'] for x in outputs]).mean()
    avg_recall = np.concatenate([x['recall'] for x in outputs]).mean()

    tensorboard_logs = {'test/avg_ndcg': avg_ndcg, 'test/avg_recall': avg_recall}
    # Log metrics
    # self.logger.log_metrics(tensorboard_logs)

    self.logger.log_hyperparams(vars(self.hparams), tensorboard_logs)
    return tensorboard_logs

@karapostK karapostK changed the title Tensorboard log_hyperparams(params, metrics=None) seems not to have effect Tensorboard log_hyperparams(params, metrics) seems not to have effect May 12, 2020
@williamFalcon williamFalcon reopened this May 12, 2020
@williamFalcon
Contributor

ummmm. this might be a design problem in TB.

  1. there’s never a guarantee your model ends training (cluster interrupt, crash, etc..). In those instances you still want to know the hparams
  2. the TB design assumes your training always completes for metrics...

so, looks like we have to get hacky to get this to work correctly?

@karapostK
Author

I think they may fix this in the future, but I don't think it will be any time soon :/
tensorflow/tensorboard#3597

@singhay

singhay commented May 21, 2020

@karapostK any luck with finding a fix/workaround ?

@karapostK
Author

@singhay I found one, but it's not very nice. At line 840 of trainer.py (in the run_pretrain_routine function) there is this code:

    if self.logger is not None:
        # save exp to get started
        if hasattr(ref_model, "hparams"):
            self.logger.log_hyperparams(ref_model.hparams)

which logs the hparams too soon. Simply changing this to:

    if self.logger is not None:
        # save exp to get started
        if hasattr(ref_model, "hparams"):
            pass

does the job. However, remember to call log_hyperparams(hparams, metrics) somewhere in your code in order to properly log the metrics, either from the on_train_end() callback or in test_epoch_end.

@versatran01

versatran01 commented Jun 6, 2020

Have the same problem, and I think PL should not log hparams blindly in the trainer. What's the point of logging hparams without a metric? They are saved to disk anyway, so one could look there. If someone can only look at TensorBoard (without access to the machine), then it is better to log hparams as text.

But now the question becomes, where to actually do the logging? In model checkpoint?
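Logging hparams as text, as suggested above, is easy to do by hand: TensorBoard's Text tab renders markdown, so the dict can be turned into a small table and passed to SummaryWriter.add_text. A minimal sketch (the helper name and table layout are mine, not part of Lightning or TensorBoard):

```python
def hparams_to_markdown(hparams):
    """Render a hyperparameter dict as a markdown table for the Text tab."""
    lines = ["| key | value |", "| --- | --- |"]
    for key in sorted(hparams):
        lines.append(f"| {key} | {hparams[key]} |")
    return "\n".join(lines)

table = hparams_to_markdown({"lr": 1e-3, "batch_size": 32})
print(table)

# The resulting string can then be written alongside the run, e.g.:
#   writer.add_text("hparams", table)   # torch.utils.tensorboard.SummaryWriter
```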

@williamFalcon
Contributor

williamFalcon commented Jun 6, 2020

most of the time we interrupt training before it completes... a lot of code won’t “complete” but you still need checkpoints and to know what you ran.

the problem is not lightning, but tensorboard for assuming that training always ends.

@MilesCranmer

MilesCranmer commented Jul 4, 2020

+1 Thanks everyone for looking into this. I've also been searching for a way to use the HParams tab in TensorBoard with Lightning. I think @karapostK's solution needs an update: now one should just comment out self.logger.log_hyperparams(ref_model.hparams) in run_pretrain_routine in trainer.py.

Maybe one solution is to add a Trainer option like pre_record_hyperparams, defaulting to True; when False, it would skip this line: https://github.com/PyTorchLightning/pytorch-lightning/blob/325852c6df93f749bb843bff1a3cdba41698722c/pytorch_lightning/trainer/trainer.py#L1077

we can change it to be:

        if self.logger is not None and self.pre_record_hyperparams:

and then the user will manually call log_hyperparams whenever they see fit, and also include the desired metrics.


For anybody trying to solve the same issue for their code, here is how I solved it with a hack:

  1. Change if self.logger is not None: to if False: in run_pretrain_routine in trainer.py.
  2. Before training, I have

     checkpointer = ModelCheckpoint(filepath='best')

     (and add it as a callback to the trainer: Trainer(..., checkpoint_callback=checkpointer)).
  3. After training, run:

     logger.log_hyperparams(params=model.hparams, metrics={'val_loss': checkpointer.best_model_score.item()})
     logger.save()

Then the model will appear in your hparams tab with val_loss as a metric:


@edenlightning
Contributor

Closing as this is a TB issue. @versatran01 feel free to reopen if you have any other issues!

@vedal

vedal commented Apr 16, 2021

Found an answer to this question here:
https://pytorch-lightning.readthedocs.io/en/latest/extensions/logging.html#logging-hyperparameters
(#6904)

rolanddenis added a commit to PhaseFieldICJ/nnpf that referenced this issue Sep 17, 2022
@Isuxiz

Isuxiz commented Mar 2, 2023

Found an answer to this question here: https://pytorch-lightning.readthedocs.io/en/latest/extensions/logging.html#logging-hyperparameters (#6904)

And you, my friend, you are a real hero!


9 participants