Enable any ML experiment tracking framework #47
I also wonder if SlurmManager should just be brought into lightning. Thoughts? |
So far as I see, all the methods used come from `from test_tube import HyperOptArgumentParser, Experiment, SlurmCluster`. I would remove the |
There are also these lines, which need to be generic for each logger (but that seems hard, seeing they all have a different interface). So, maybe have a function that calls the appropriate thing: here, here, here, here (basically search for self.experiment). I kind of like the idea of limiting support to a few things here so that people aren't lost in what they can use. We can support many different ones, but it would be nice if the docs told people which ones they can choose from. A big help that lightning provides is helping people who don't know what they don't know. For instance, some people not in CS/ML have never heard of tensorboard. But if they see it listed here, they'll know they can get the support for free if they use it. Also remember that the experiment object in test-tube is a subclass of the PyTorch SummaryWriter. |
HyperOptArgumentParser is a subclass of argparse's ArgumentParser. It's the same thing, except it adds more features for hyperparameter optimization. |
Most people also use ArgumentParser to pass in args to their models. Thus, this is the standard way of doing it in lightning. Remove the choices for things that should be standard. A core idea that I like about Lightning is that it removes the unnecessary choices you might need to make. So it's highly opinionated about standard practices. (ie: to log we all do this, to viz we all do this, to grid search we all do this, etc...). Except instead of doing something a million ways, there's one way. |
Sure, I was thinking about keeping it really lightweight, so we could even drop the arg parser class or move it to this framework... About the logging, what about having one more parameter like |
Good suggestion; however, I think that's just adding another thing for the user to remember. I'd rather they just plug in the logger and have it work. So, in the meantime, why don't we just internally call the correct function depending on the logger? Something like:

```python
def log(logger, metrics):
    # dispatch on the logger's top-level package
    package_name = logger.__module__.split('.')[0]
    if package_name == 'test_tube':
        logger.test_tube_way_of_logging(metrics)
    elif package_name == 'mlflow':
        logger.ml_flow_way_of_logging(metrics)
```

And replace the lines of

```python
self.experiment.log(x)
# replace with
self.log(logger, x)
```

Just tried this:

```
(ddt) comp:~ user$ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from test_tube import Experiment
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
>>> Experiment.__module__
'test_tube.log'
>>>
```
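The dispatch idea can be made concrete in a self-contained sketch. The classes and method names below are hypothetical stand-ins (they are not the real test-tube or MLflow APIs); the fake `__module__` assignments mimic what `Experiment.__module__` returns above:

```python
class DummyTestTubeLogger:
    """Hypothetical stand-in for test_tube.Experiment."""
    def log(self, metrics):
        return ('test_tube', metrics)

# fake the defining module so dispatch behaves like it would for the real class
DummyTestTubeLogger.__module__ = 'test_tube.log'


class DummyMLFlowLogger:
    """Hypothetical stand-in for an MLflow-backed logger."""
    def log_metrics(self, metrics):
        return ('mlflow', metrics)

DummyMLFlowLogger.__module__ = 'mlflow.tracking'


def log(logger, metrics):
    # dispatch on the top-level package of the logger's class
    package_name = type(logger).__module__.split('.')[0]
    if package_name == 'test_tube':
        return logger.log(metrics)
    elif package_name == 'mlflow':
        return logger.log_metrics(metrics)
    raise ValueError('Unsupported logger: %s' % package_name)


print(log(DummyTestTubeLogger(), {'loss': 0.5}))  # -> ('test_tube', {'loss': 0.5})
print(log(DummyMLFlowLogger(), {'loss': 0.5}))    # -> ('mlflow', {'loss': 0.5})
```

The downside of this approach is that the dispatcher must know every framework's logging call up front, which is what motivates the base-class proposal later in the thread.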
|
I'd like to pick one of the frameworks up, MLflow if it isn't already taken. I'll look at it tomorrow morning and try to figure out how it can be integrated. I'm a bit confused, though: can't tensorboard be used to track experiments and different model runs? Also, could lightning add a feature for quantisation or pruning? Maybe expose an API so that people can try out different quantisation and pruning strategies to compress their network. I recently was able to quantise an fp32 model to 8 bits (only the fully connected and conv layers) with next to no loss in accuracy; this was an MNIST network. But I'm following and using the same algos the Tensorflow Lite people are using. I can share a Google Colab notebook of a sample implementation if you think it's a viable thing for this project. |
@karanchahal I agree with you... most people I know use tensorboard, but happy to allow this to be flexible. There's no real need to force an opinion on this particular thing from the framework. I think @Borda is probably going to look at it (but we can deconflict here if you want to look at it). However, a way of doing network pruning and quantisation would be good. I don't know those in detail, so maybe make a proposal describing the use cases (in a different issue with the enhancement tag)? |
I would keep #44 simple just adding CI, so I would leave the resolving test-tube to another PR/user 😄 |
@karanchahal feel free to give this a shot! |
@karanchahal still interested in making these improvements? |
Since I'll start working on quantization, I think I'll leave this for someone else to pick up :)
|
Alright, started taking a look at this. I think it's going to be more straightforward than I initially thought. I'd like some feedback on a proposal, though, before I start working on it. I'd like to create a base class for experiment loggers. Loggers for individual tracking frameworks would inherit from this base class. The particular logger that a user wants can be passed to the Trainer. Proposed base class:

```python
class LightningLoggerBase:
    def log_metrics(self, metrics_dict, epoch_num):
        pass

    def log_hyperparams(self, param_dict):
        pass

    def save(self):
        pass
```

The implementation of a logger for a particular tracking framework would look like this:

```python
class MLFlowLogger(LightningLoggerBase):
    ...
```

And using it would be straightforward:

```python
from pytorch_lightning.loggers import MLFlowLogger

logger = MLFlowLogger(...)
trainer = pl.Trainer(logger=logger, ...)
trainer.fit(model)
```

Any feedback on this? In particular, I'm wondering if I'm missing any methods that loggers will need. I'm also wondering what we do with the existing test-tube integration. We could rip the existing stuff out and replace it with a test-tube logger that inherits from `LightningLoggerBase`. |
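To illustrate how a user-written logger would plug into this proposal, here is a minimal runnable sketch. `InMemoryLogger` is hypothetical, invented only to show the interface; real loggers would forward to a tracking framework instead of collecting in memory:

```python
class LightningLoggerBase:
    """Proposed base class (repeated here so the sketch is self-contained)."""
    def log_metrics(self, metrics_dict, epoch_num):
        pass

    def log_hyperparams(self, param_dict):
        pass

    def save(self):
        pass


class InMemoryLogger(LightningLoggerBase):
    """Hypothetical logger that just records everything in memory."""
    def __init__(self):
        self.metrics = []       # list of (epoch_num, metrics_dict)
        self.hyperparams = {}
        self.saved = False

    def log_metrics(self, metrics_dict, epoch_num):
        self.metrics.append((epoch_num, dict(metrics_dict)))

    def log_hyperparams(self, param_dict):
        self.hyperparams.update(param_dict)

    def save(self):
        self.saved = True


logger = InMemoryLogger()
logger.log_hyperparams({'lr': 0.02})
logger.log_metrics({'loss': 0.7}, epoch_num=0)
logger.save()
```

Because the trainer would only ever call the three base-class methods, any framework that can implement them becomes usable without changes to the trainer itself.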
beautiful solution. then we don't have to wrap them; each user can wrap their own? I think we also move these options to this logger class from trainer:
Because if no logger is passed in, these flags won't be used. So it makes sense to define them in the logger. The root experiment also needs:
Experiment name and version
A nice thing about the test-tube logger is that you get a nice organization of experiments, each with a version and a set of hyperparams saved in meta_tags.csv. This should just be provided by the core class which people use to wrap up their experiments. We can add a flag to turn this off in the case where an ML tracker already does this. I think those are all the main points we need to account for. If you have a cleaner way of handling the above needs, we can go with that too. These are just points to account for. |
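The meta_tags.csv idea described here can be sketched in a few lines. The key/value layout below is an assumption based on the description, not test-tube's exact file format:

```python
import csv
import io


def write_meta_tags(hparams):
    """Serialize hyperparams as key/value CSV rows, meta_tags.csv-style."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['key', 'value'])
    for key, value in sorted(hparams.items()):
        writer.writerow([key, value])
    return buf.getvalue()


print(write_meta_tags({'lr': 0.02, 'batch_size': 32}))
```

Having the core logger class own this file (with a flag to disable it) keeps the per-version hyperparameter record even when the chosen tracking framework stores nothing locally.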
@neggert For the test tube logger, I think that it would be best to make it just another logger and remove the special-casing it has right now. Since this is meant to become more generic in the PyTorch ecosystem, I think making it more pluggable is a better idea. |
I was envisioning that Lightning could provide loggers for some of the more common frameworks (maybe Test Tube, MLFlow, and polyaxon?). Of course, users would be free to write their own as well. I'm not sure I agree about that, though. What were you running into that made test-tube Experiments non-pickleable? I would have guessed that it would just work. |
process = 0 would have to be inside the abstraction. the user will for sure call self.logger(...) at any point in their code. they won’t know what process it is and it will also get annoying adding an if statement every time. more room for user error. we can keep the logging interval stuff in trainer for now. |
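Keeping the process check inside the abstraction can be sketched with a decorator. This is a hedged sketch of the idea, not the project's actual API; `rank_zero_only` and the `rank` attribute are assumed names:

```python
import functools


def rank_zero_only(fn):
    """Run the wrapped logger method only on process rank 0 (assumed attribute)."""
    @functools.wraps(fn)
    def wrapped(self, *args, **kwargs):
        if getattr(self, 'rank', 0) == 0:
            return fn(self, *args, **kwargs)
        return None  # silent no-op on all other processes
    return wrapped


class DemoLogger:
    def __init__(self, rank):
        self.rank = rank
        self.records = []

    @rank_zero_only
    def log_metrics(self, metrics):
        self.records.append(metrics)


main = DemoLogger(rank=0)
worker = DemoLogger(rank=1)
main.log_metrics({'loss': 0.3})    # recorded
worker.log_metrics({'loss': 0.3})  # skipped, no if-statement needed at the call site
```

With the guard inside the logger, user code can call the logging methods unconditionally from any process, which removes the per-call-site `if` statements mentioned above.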
Ah, okay. Makes sense. I was thinking that the user wouldn't ever call the logger themselves, but I guess there's no way to stop them :P. Any insight into what was causing the problems with pickling that you ran into? It's not clear to me if this is a problem we'll run into frequently, or if there's something specific to Test Tube that causes it. |
i don’t remember if it was test tube itself; i think it’s the PyTorch SummaryWriter. i didn’t spend too much time debugging because the PT code is a bit messy there. however, it might be helpful to try a quick pickle of an experiment instance. |
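The quick pickle check suggested here takes only a few lines. The helper below round-trips an arbitrary object; to reproduce the original question, you would pass in a test-tube Experiment instance instead of the toy examples shown:

```python
import io
import pickle


def is_pickleable(obj):
    """Round-trip obj through pickle; return True only if it survives."""
    try:
        buf = io.BytesIO()
        pickle.dump(obj, buf)
        buf.seek(0)
        pickle.load(buf)
        return True
    except Exception:
        return False


print(is_pickleable({'lr': 0.02}))   # True: plain dicts pickle fine
print(is_pickleable(lambda x: x))    # False: lambdas are not pickleable
```

A single non-pickleable attribute (an open file handle, a lambda callback, a thread) is enough to break pickling of the whole object, which is the usual failure mode for writer-style classes like SummaryWriter.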
Does it make sense to do #183 along the way? This would let users load models from whatever logger they want. I'm imagining adding a couple of methods to `LightningLoggerBase`:

```python
def get_hyperparams(self):
    pass

def save_weights(self, trainer, filename):
    pass

def get_weights_path(self, filename):
    pass
```

Then a model can be loaded like so:

```python
new_hparams = logger.get_hyperparams()
new_weights_path = logger.get_weights_path("save_test.ckpt")
model_2 = LightningTemplateModel.load(new_weights_path, new_hparams, on_gpu=False)
```

much of the logic in the existing |
i think the trainer should be the one to save hyperparameters. that will allow loading when people don’t use a logger. same for weights. let’s do the .load separately because we need to think through it. On one hand it is more flexible to allow users to load using arbitrary args, but it will break reproducibility. So, most likely i think we should force users to use the hyperparams to restore the model. this means the trainer should save hyperparams automatically instead of the logger. This will all be part of the style guide I’m creating. |
People seem to have strong preferences for either using MLFlow, test-tube, polyaxon, etc...
Let's just add generic support for whatever people want to use. I don't know if generic support is possible, but each can easily be supported individually.
To make this work we'd need to:
I think that's all that's needed to add this support.
Any suggestions and takers for working on this integration?
@Borda @alok