Improve Comet Logger pickled behavior #2553
Conversation
* Delay the creation of the actual experiment object for as long as we can.
* Save the experiment ID when an Experiment object is created, so we can continue the same experiment in the sub-processes.
* Run pre-commit on the comet file.
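The lazy-creation idea described above can be sketched roughly like this. The class and attribute names here are illustrative stand-ins, not the actual code from the PR:

```python
from typing import Optional


class FakeExperiment:
    """Stand-in for comet_ml.Experiment, used only for illustration."""

    def __init__(self, previous_experiment: Optional[str] = None):
        # Resume the previous experiment if we have its ID, else start fresh.
        self.id = previous_experiment or "new-id"


class LazyCometLogger:
    def __init__(self):
        self._experiment = None      # not created until first needed
        self._experiment_key = None  # saved so sub-processes can resume

    @property
    def experiment(self):
        # Create the underlying Experiment only on first access,
        # resuming the previous one if we already know its key.
        if self._experiment is None:
            self._experiment = FakeExperiment(previous_experiment=self._experiment_key)
            self._experiment_key = self._experiment.id
        return self._experiment
```

Because the key survives pickling while the live experiment object does not, a sub-process can recreate an `ExistingExperiment`-style object pointing at the same run.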
I also added some tests for loggers including comet over here #2502
Made most Comet Logger attributes protected, as they might not reflect the final Experiment attributes. Also fixed the typo in the test name.
I have pushed a fix for the review comments, thank you! I've taken a look at #2502; this should fix the test. I've considered removing the
pytorch_lightning/loggers/comet.py
Outdated
def name(self) -> Optional[str]:
    # don't create an experiment if we don't have one
    return self._experiment.project_name if self._experiment else self._project_name
The tests fail because of this. If the experiment does not exist and the project name is also None, this returns None, and then the trainer tries to do os.path.join(..., None), which does not work. I would keep it as before. The same applies to the version.
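For reference, passing None into os.path.join is exactly the failure mode described here; path components must be strings (or path-like):

```python
import os

try:
    # A None logger name makes path construction blow up,
    # because os.path.join only accepts str or path-like components.
    os.path.join("lightning_logs", None, "version_0")
except TypeError as err:
    print(f"TypeError: {err}")
```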
The previous code was problematic because it would create an experiment object if none existed. In addition, if project_name is None and no project_name is configured by the user, self._experiment.project_name can also return None.
The Wandb logger also seems to return None in that case (https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/loggers/wandb.py#L140). Is there a difference between the Wandb logger and the Comet logger that would make returning None for the Logger name problematic?
I guess the wandb logger in the tests is called to create the experiment before accessing the name. You are right, there is no difference in the definition there. None is problematic when the Trainer tries to build the path to the logger directory: it does os.path.join(root, name, version, ...), and if any of these are None it throws.
Why do you think it is problematic that a call to e.g. name or version creates an experiment if it does not exist? The fact that the constructor does not immediately create the experiment means creation needs to be triggered as soon as the user tries to interact with the logger.
I can think of at least two scenarios where implicit experiment creation would be problematic:
- In DDP mode, if a process other than rank zero accesses name or version, it will create an Experiment that never receives metrics or parameters. This slows down the process and adds noise, as the (almost empty) Experiment summary is displayed at the end of training.
- In DDP mode, if an experiment is created before the actual start of training, we have to recreate another experiment in the DDP processes anyway, which also slows down training and generates extra output. This particular scenario would still happen if the user accesses logger.experiment before calling Trainer.fit, but I think we should avoid it where we can.

I've moved the handling of experiment_name to delay the creation of the actual Experiment until we really need it.
As I said before, for the logger name, even if the experiment exists there is still a chance that project_name is None. If it is absolutely required for the Logger to return a non-null string, I will have to think about a solution. Maybe returning a default string.
on rank > 0 the experiments don't get called, see the decorator "rank_zero_experiment".
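A simplified sketch of how such a rank-zero guard can work; the real rank_zero_experiment decorator in pytorch-lightning differs in detail, and reading the rank from LOCAL_RANK is an assumption here:

```python
import functools
import os


def rank_zero_experiment(fn):
    """Return the real experiment only on rank 0; elsewhere return a no-op dummy."""

    class _DummyExperiment:
        # Swallow any method call on non-zero ranks so logging calls are no-ops.
        def __getattr__(self, name):
            return lambda *args, **kwargs: None

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        rank = int(os.environ.get("LOCAL_RANK", 0))
        if rank == 0:
            return fn(*args, **kwargs)
        return _DummyExperiment()

    return wrapper
```

With a guard like this, non-zero ranks never trigger the expensive experiment creation that the scenarios above describe.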
> As I said before, for the logger name, even if the experiment exists there is still a chance that project_name is None. If it is absolutely required for the Logger to return a non-null string, I will have to think about a solution. Maybe returning a default string.
Do you think it would be too strict to make the project/experiment name mandatory?
I think returning a default name would be reasonable too. Some other loggers do that.
I've updated the code to return a default name in case no project_name is set. I've also added a couple of tests. I'm not sure whether the current build failures are linked to my changes; I have the impression they are not, could you confirm?
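The fallback could look like this. The default string and class name shown here are assumptions for illustration, not necessarily what the PR uses:

```python
from typing import Optional


class CometLoggerSketch:
    def __init__(self, project_name: Optional[str] = None):
        self._project_name = project_name
        self._experiment = None

    @property
    def name(self) -> str:
        # Fall back to a default so the Trainer can always build
        # os.path.join(save_dir, name, version) without a None component.
        if self._experiment is not None and self._experiment.project_name is not None:
            return self._experiment.project_name
        if self._project_name is not None:
            return self._project_name
        return "comet-default"
```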
I don't think so, try to merge master, then they will be triggered again.
I've updated my PR with your review comments and merged master. I don't think the current failures are linked to my changes.
Codecov Report
@@           Coverage Diff           @@
##           master   #2553    +/-  ##
======================================
- Coverage      91%     86%      -5%
======================================
  Files         109     109
  Lines        8031    8081     +50
======================================
- Hits         7291    6939    -352
- Misses        740    1142    +402
@Lothiraldan how is it going, still wip or ready to review? 🐰
@Borda I've merged master and fixed the conflict. There was still one open discussion (#2553 (comment)) but I guess it is ready for review.
tests are solid! thanks for that!
minor comments, but overall LGTM!!
tests/loggers/test_comet.py
Outdated
with patch('pytorch_lightning.loggers.comet.CometExperiment') as comet:
    logger = CometLogger(api_key=api_key, experiment_name=experiment_name,)

    # The experiment object should not exist
could you remove this type of comment, also in the other places below? the assertion below makes it very clear
Awesome, LGTM 😃
@Lothiraldan can you pls allow editing your PR? it seems I cannot accept suggestions...
Co-authored-by: Jirka Borovec <[email protected]>
@Borda I've applied the suggestions and I will remove the extraneous comments. Sorry for the slow answer, I was on vacation
This pull request is now in conflict... :(
Hello @Lothiraldan! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-09-18 08:52:27 UTC
The test errors seem unrelated to my changes
Co-authored-by: Adrian Wälchli <[email protected]>
Well, these changes break everything if I am already using
can you just make a PR if you want to work on the fix directly, and reference the problem there... :]
I've created a PR already
@Vozf Indeed this use-case was broken by my PR, sorry about that. |
What does this PR do?
Hello, I'm working for Comet.ml and we got reports that the Comet Logger wasn't working correctly in distributed mode. I was happy to see that the main issue was solved a few days ago: https://github.com/PyTorchLightning/pytorch-lightning/pull/2518/files#diff-a79ed8980a01d44db3ea399541407142. I iterated on that fix to improve the behavior with DDP by saving the experiment ID when pickling, so we can use the same experiment ID when recreating it, and by not creating an Experiment object when setting the experiment name or accessing the Logger version.
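One way to carry an experiment ID across pickling, as the description outlines, is to drop the live experiment object from the pickled state and keep only its ID. A rough sketch under those assumptions, not the PR's exact code:

```python
import pickle


class PicklableLoggerSketch:
    def __init__(self):
        self._experiment = None
        self._experiment_key = None

    def _create_experiment(self, key=None):
        # Stand-in for comet_ml Experiment / ExistingExperiment creation.
        self._experiment = object()
        self._experiment_key = key or "exp-123"
        return self._experiment

    def __getstate__(self):
        state = self.__dict__.copy()
        # The live experiment holds network connections and threads and
        # cannot be pickled; keep only its key so DDP sub-processes can
        # resume the same experiment instead of creating a new one.
        state["_experiment"] = None
        return state
```

After unpickling in a sub-process, the logger can lazily recreate an experiment from `_experiment_key`, continuing the same run on the server.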