Enable any ML experiment tracking framework #47
I also wonder if SlurmManager should just be brought into lightning. Thoughts? |
So far as I see, all the methods used come from `from test_tube import HyperOptArgumentParser, Experiment, SlurmCluster`. I would remove the |
There are also these lines, which need to be generic for each logger (but that seems hard, seeing they all have a different interface). So, maybe have a function that calls the appropriate thing: here, here, here, here (basically search for self.experiment). I kind of like the idea of limiting support to a few things here so that people aren't lost in what they can use. We can support many different ones, but it would be nice if the docs told people which ones they can choose from. A big help that lightning provides is helping people who don't know what they don't know. For instance, some people not in CS/ML have never heard of tensorboard. But if they see it listed here, they'll know they can get the support for free if they use it. Also remember that the experiment object in test-tube is a subclass of the PyTorch SummaryWriter. |
HyperOptArgumentParser is a subclass of argparse's ArgumentParser. It's the same thing, except it adds more features for hyperparameter optimization. |
Most people also use ArgumentParser to pass in args to their models. Thus, this is the standard way of doing it in lightning. Remove the choices for things that should be standard. A core idea that I like about Lightning is that it removes the unnecessary choices you might need to make. So it's highly opinionated about standard practices. (ie: to log we all do this, to viz we all do this, to grid search we all do this, etc...). Except instead of doing something a million ways, there's one way. |
Sure, I was thinking about keeping it really lightweight, so we could even drop the arg parser class or move it to this framework... About the logging, what about having one more parameter like |
Good suggestion; however, I think that's just adding another thing for the user to remember. I'd rather they just plug in the logger and have it work. So, in the meantime, why don't we just internally call the correct function depending on the logger? Something like:

```python
def log(logger, metrics):
    # dispatch on the logger's top-level package
    package_name = logger.__module__.split('.')[0]
    if package_name == 'test_tube':
        logger.test_tube_way_of_logging(metrics)
    elif package_name == 'mlflow':
        logger.ml_flow_way_of_logging(metrics)
```

And replace the lines of

```python
self.experiment.log(x)
# replace with
self.log(logger, x)
```

Just tried this:

```
(ddt) comp:~ user$ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from test_tube import Experiment
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
>>> Experiment.__module__
'test_tube.log'
>>>
```
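The dispatch idea can be made concrete in a self-contained sketch. The classes and method names below are hypothetical stand-ins (they are not the real test-tube or MLflow APIs); the fake `__module__` assignments mimic what `Experiment.__module__` returns above:

```python
class DummyTestTubeLogger:
    """Hypothetical stand-in for test_tube.Experiment."""
    def log(self, metrics):
        return ('test_tube', metrics)

# fake the defining module so dispatch behaves like it would for the real class
DummyTestTubeLogger.__module__ = 'test_tube.log'


class DummyMLFlowLogger:
    """Hypothetical stand-in for an MLflow-backed logger."""
    def log_metrics(self, metrics):
        return ('mlflow', metrics)

DummyMLFlowLogger.__module__ = 'mlflow.tracking'


def log(logger, metrics):
    # dispatch on the top-level package of the logger's class
    package_name = type(logger).__module__.split('.')[0]
    if package_name == 'test_tube':
        return logger.log(metrics)
    elif package_name == 'mlflow':
        return logger.log_metrics(metrics)
    raise ValueError('Unsupported logger: %s' % package_name)


print(log(DummyTestTubeLogger(), {'loss': 0.5}))  # -> ('test_tube', {'loss': 0.5})
print(log(DummyMLFlowLogger(), {'loss': 0.5}))    # -> ('mlflow', {'loss': 0.5})
```

The downside of this approach is that the dispatcher must know every framework's logging call up front, which is what motivates the base-class proposal later in the thread.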
|
I'd like to pick one of the frameworks up, MLflow if it isn't already taken. I'll look at it tomorrow morning and try to figure out how it can be integrated. I'm a bit confused, though: can't tensorboard be used to track experiments and different model runs? Also, could lightning add a feature for quantisation or pruning? Maybe expose an API so that people can try out different quantisation and pruning strategies to compress their network. I recently was able to quantise an fp32 model to 8 bits (only the fully connected and conv layers) with next to no loss in accuracy; this was an MNIST network. But I'm following and using the same algos the Tensorflow Lite people are using. I can share a Google Colab notebook of a sample implementation if you think it's a viable thing for this project. |
@karanchahal I agree with you... most people I know use tensorboard, but happy to allow this to be flexible. There's no real need to force an opinion on this particular thing from the framework. I think @Borda is probably going to look at it (but we can deconflict here if you want to look at it). However, a way of doing network pruning and quantisation would be good. I don't know those in detail, so maybe make a proposal describing the use cases (in a different issue with the enhancement tag)? |
I would keep #44 simple just adding CI, so I would leave the resolving test-tube to another PR/user 😄 |
@karanchahal feel free to give this a shot! |
@karanchahal still interested in making these improvements? |
Since I'll start working on quantization, I think I'll leave this for someone else to pick up :)
|
Alright, started taking a look at this. I think it's going to be more straightforward than I initially thought. I'd like some feedback on a proposal, though, before I start working on it. I'd like to create a base class for experiment loggers. Loggers for individual tracking frameworks would inherit from this base class. The particular logger that a user wants can be passed to the Trainer. Proposed base class:

```python
class LightningLoggerBase:
    def log_metrics(self, metrics_dict, epoch_num):
        pass

    def log_hyperparams(self, param_dict):
        pass

    def save(self):
        pass
```

The implementation of a logger for a particular tracking framework would look like this:

```python
class MLFlowLogger(LightningLoggerBase):
    ...
```

And using it would be straightforward:

```python
from pytorch_lightning.loggers import MLFlowLogger

logger = MLFlowLogger(...)
trainer = pl.Trainer(logger=logger, ...)
trainer.fit(model)
```

Any feedback on this? In particular, I'm wondering if I'm missing any methods that loggers will need. I'm also wondering what we do with the existing test-tube integration. We could rip the existing stuff out and replace it with a test-tube logger that inherits from `LightningLoggerBase`. |
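To illustrate how a user-written logger would plug into this proposal, here is a minimal runnable sketch. `InMemoryLogger` is hypothetical, invented only to show the interface; real loggers would forward to a tracking framework instead of collecting in memory:

```python
class LightningLoggerBase:
    """Proposed base class (repeated here so the sketch is self-contained)."""
    def log_metrics(self, metrics_dict, epoch_num):
        pass

    def log_hyperparams(self, param_dict):
        pass

    def save(self):
        pass


class InMemoryLogger(LightningLoggerBase):
    """Hypothetical logger that just records everything in memory."""
    def __init__(self):
        self.metrics = []       # list of (epoch_num, metrics_dict)
        self.hyperparams = {}
        self.saved = False

    def log_metrics(self, metrics_dict, epoch_num):
        self.metrics.append((epoch_num, dict(metrics_dict)))

    def log_hyperparams(self, param_dict):
        self.hyperparams.update(param_dict)

    def save(self):
        self.saved = True


logger = InMemoryLogger()
logger.log_hyperparams({'lr': 0.02})
logger.log_metrics({'loss': 0.7}, epoch_num=0)
logger.save()
```

Because the trainer would only ever call the three base-class methods, any framework that can implement them becomes usable without changes to the trainer itself.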
beautiful solution. then we don't have to wrap them; each user can wrap their own? I think we also move these options to this logger class from trainer:
Because if no logger is passed in, these flags won't be used. So it makes sense to define them in the logger. The root experiment also needs:
Experiment name and version
A nice thing about the test-tube logger is that you get a nice organization of experiments, each with a version and a set of hyperparams saved in meta_tags.csv. This should just be provided by the core class which people use to wrap up their experiments. We can add a flag to turn this off in the case where an ML tracker already does this. I think those are all the main points we need to account for. If you have a cleaner way of handling the above needs, we can go with that too. These are just points to account for. |
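The meta_tags.csv idea described here can be sketched in a few lines. The key/value layout below is an assumption based on the description, not test-tube's exact file format:

```python
import csv
import io


def write_meta_tags(hparams):
    """Serialize hyperparams as key/value CSV rows, meta_tags.csv-style."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['key', 'value'])
    for key, value in sorted(hparams.items()):
        writer.writerow([key, value])
    return buf.getvalue()


print(write_meta_tags({'lr': 0.02, 'batch_size': 32}))
```

Having the core logger class own this file (with a flag to disable it) keeps the per-version hyperparameter record even when the chosen tracking framework stores nothing locally.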
@neggert For the test tube logger, I think that it would be best to make it just another logger and remove the special-casing it has right now. Since this is meant to become more generic in the PyTorch ecosystem, I think making it more pluggable is a better idea. |
I was envisioning that Lightning could provide loggers for some of the more common frameworks (maybe Test Tube, MLFlow, and polyaxon?). Of course, users would be free to write their own as well. I'm not sure I agree about that, though. What were you running into that made test-tube Experiments non-pickleable? I would have guessed that it would just work. |
process = 0 would have to be inside the abstraction. the user will for sure call self.logger(...) at any point in their code. they won’t know what process it is and it will also get annoying adding an if statement every time. more room for user error. we can keep the logging interval stuff in trainer for now. |
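Keeping the process check inside the abstraction can be sketched with a decorator. This is a hedged sketch of the idea, not the project's actual API; `rank_zero_only` and the `rank` attribute are assumed names:

```python
import functools


def rank_zero_only(fn):
    """Run the wrapped logger method only on process rank 0 (assumed attribute)."""
    @functools.wraps(fn)
    def wrapped(self, *args, **kwargs):
        if getattr(self, 'rank', 0) == 0:
            return fn(self, *args, **kwargs)
        return None  # silent no-op on all other processes
    return wrapped


class DemoLogger:
    def __init__(self, rank):
        self.rank = rank
        self.records = []

    @rank_zero_only
    def log_metrics(self, metrics):
        self.records.append(metrics)


main = DemoLogger(rank=0)
worker = DemoLogger(rank=1)
main.log_metrics({'loss': 0.3})    # recorded
worker.log_metrics({'loss': 0.3})  # skipped, no if-statement needed at the call site
```

With the guard inside the logger, user code can call the logging methods unconditionally from any process, which removes the per-call-site `if` statements mentioned above.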
Ah, okay. Makes sense. I was thinking that the user wouldn't ever call the logger themselves, but I guess there's no way to stop them :P. Any insight into what was causing the problems with pickling that you ran into? It's not clear to me if this is a problem we'll run into frequently, or if there's something specific to Test Tube that causes it. |
i don’t remember if it was test tube itself; i think it’s the PyTorch SummaryWriter. i didn’t spend too much time debugging because the PT code is a bit messy there. however, it might be helpful to try a quick pickle of an experiment instance. |
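The quick pickle check suggested here takes only a few lines. The helper below round-trips an arbitrary object; to reproduce the original question, you would pass in a test-tube Experiment instance instead of the toy examples shown:

```python
import io
import pickle


def is_pickleable(obj):
    """Round-trip obj through pickle; return True only if it survives."""
    try:
        buf = io.BytesIO()
        pickle.dump(obj, buf)
        buf.seek(0)
        pickle.load(buf)
        return True
    except Exception:
        return False


print(is_pickleable({'lr': 0.02}))   # True: plain dicts pickle fine
print(is_pickleable(lambda x: x))    # False: lambdas are not pickleable
```

A single non-pickleable attribute (an open file handle, a lambda callback, a thread) is enough to break pickling of the whole object, which is the usual failure mode for writer-style classes like SummaryWriter.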
Does it make sense to do #183 along the way? This would let users load models from whatever logger they want. I'm imagining adding a couple of methods to `LightningLoggerBase`:

```python
def get_hyperparams(self):
    pass

def save_weights(self, trainer, filename):
    pass

def get_weights_path(self, filename):
    pass
```

Then a model can be loaded like so:

```python
new_hparams = logger.get_hyperparams()
new_weights_path = logger.get_weights_path("save_test.ckpt")
model_2 = LightningTemplateModel.load(new_weights_path, new_hparams, on_gpu=False)
```

much of the logic in the existing |
i think the trainer should be the one to save hyperparameters. that will allow loading when people don’t use a logger. same for weights. let’s do the .load separately because we need to think through it. On one hand it is more flexible to allow users to load using arbitrary args, but it will break reproducibility. So, most likely i think we should force users to use the hyperparams to restore the model. this means the trainer should save hyperparams automatically instead of the logger. This will all be part of the style guide I’m creating. |
People seem to have strong preferences for either using MLFlow, test-tube, polyaxon, etc...
Let's just add generic support for whatever people want to use. I don't know if generic support is possible, but each can easily be supported individually.
To make this work we'd need to:
I think that's all that's needed to add this support.
Any suggestions and takers for working on this integration?
@Borda @alok