
How to persist a pytorch lightning module that depends on external data? #1755

Closed · Apsod opened this issue May 7, 2020 · 7 comments
Labels: question (Further information is requested), won't fix (This will not be worked on)

Comments

Apsod commented May 7, 2020

❓ Questions and Help

What is your question?

Hi! We're using pytorch lightning to train language models and transformers from scratch. This includes training tokenizers and applying them to text data, resulting in binarized data.
The way we've structured the process is to train a tokenizer, apply it to the text data (coupling the binarized data and the tokenizer), and then train a language model on the binarized data.

Since the language model depends on the tokenizer (number of tokens, special tokens, etc.), the pytorch lightning model needs a tokenizer/vocabulary as part of its hparams. This does not play very nicely with the way hparams and loading work: if we transfer the model from one computer to another, we would need to move the tokenizer to the exact same path on the other computer.

Generally, I guess the problem boils down to this: if the inner pytorch modules of the pytorch lightning module depend on some kind of external data (e.g. a vocabulary, or a random sparsity graph), and you then wish to share the pytorch lightning module, we can't find an easy way of doing this.

on_load_checkpoint/on_save_checkpoint do not work, since they take effect after the model has been initialized, whereas we would like to persist data that the initialization logic itself depends on.
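
For example, something along these lines cannot work, because __init__ has already run (and already needed the tokenizer) by the time the hook sees the checkpoint (load_tokenizer is a hypothetical helper of ours, sketched here for illustration):

from pytorch_lightning import LightningModule

class Transformer(LightningModule):
  def __init__(self, hparams):
    super().__init__()
    # __init__ already needs the tokenizer here (vocab size, padding index, ...)
    self.tokenizer = load_tokenizer(hparams.tokenizer_path)  # hypothetical helper

  def on_load_checkpoint(self, checkpoint):
    # Too late: the module above has already been constructed.
    self.tokenizer = checkpoint.get('tokenizer')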

Is there an elegant way to do this in pytorch lightning?

Apsod added the question label on May 7, 2020
github-actions bot commented May 7, 2020

Hi! Thanks for your contribution, great first issue!

williamFalcon (Contributor) commented

Good points. Been thinking about this as well.

can you share pseudocode so we can come up with the changes to the API?

Apsod (Author) commented May 7, 2020

Say that we have a Transformer model which takes a tokenizer path as part of its hparams:

class Transformer(LightningModule):
  def __init__(self, hparams):
    ...
    # load the tokenizer/data
    self.tokenizer = load_tokenizer(hparams.tokenizer_path)

    # Initialize the pytorch model (dependent on tokenizer)
    self.transformer = torch.nn.Transformer(
      dimension = hparams.dimension,
      num_embeddings = self.tokenizer.vocab_size,
      padding_index = self.tokenizer.padding_index,
      ...)

model = Transformer(hparams)
trainer = Trainer(...)
trainer.fit(model)

Later, we wish to load the transformer:

def do_transformer_stuff(checkpoint_path):
  transformer = Transformer.load_from_checkpoint(checkpoint_path)
  ...

This works perfectly fine, provided that tokenizer_path still points to the same tokenizer. However, if the original tokenizer_path was relative, or if the checkpoint was transferred somewhere else, it will fail.

One workaround is to make the tokenizer a kwarg:

class Transformer(LightningModule):
  def __init__(self, hparams, tokenizer=None):
    ...
    # set the tokenizer (tokenizer loading logic outside of Transformer)
    self.tokenizer = tokenizer

    # Initialize the pytorch model (dependent on tokenizer)
    self.transformer = torch.nn.Transformer(
      dimension = hparams.dimension,
      num_embeddings = self.tokenizer.vocab_size,
      padding_index = self.tokenizer.padding_index,
      ...)


tokenizer = load_tokenizer(hparams.tokenizer_path)
model = Transformer(hparams, tokenizer=tokenizer)
trainer = Trainer(...)
trainer.fit(model)

The consequence is that the initialization logic needs to take place outside of the Transformer, and that it is no longer self-contained:

def do_transformer_stuff(checkpoint_path, tokenizer_path):
  tokenizer = load_tokenizer(tokenizer_path)
  transformer = Transformer.load_from_checkpoint(checkpoint_path, tokenizer=tokenizer)

yukw777 (Contributor) commented May 7, 2020

I personally went with something similar to the workaround. I didn't think it was particularly bad that it wasn't "self-contained".

Apsod (Author) commented May 9, 2020

FWIW, we did find a workaround that makes the module self-contained. It builds on the kwarg workaround, adding a separate hparams-only classmethod make that is responsible for tokenizer initialization, and a separate classmethod load that is responsible for loading from a checkpoint.

class Transformer(LightningModule):
  def __init__(self, hparams, tokenizer=None):
    ...
    # set the tokenizer (tokenizer loading logic outside of Transformer)
    self.tokenizer = tokenizer

    # Initialize the pytorch model (dependent on tokenizer)
    self.transformer = torch.nn.Transformer(
      dimension = hparams.dimension,
      num_embeddings = self.tokenizer.vocab_size,
      padding_index = self.tokenizer.padding_index,
      ...)

  def on_save_checkpoint(self, checkpoint):
    # Persist the tokenizer inside the checkpoint itself
    checkpoint['tokenizer'] = self.tokenizer

  @classmethod
  def make(cls, hparams):
    # Essentially a wrapper around __init__ responsible for tokenizer loading
    tokenizer = get_tokenizer(hparams)
    return cls(hparams, tokenizer=tokenizer)

  @classmethod
  def load(cls, checkpoint_path, map_location=None):
    # Copied from load_from_checkpoint, but we extract the tokenizer
    # (saved during on_save_checkpoint) and pass it back as a kwarg.
    if map_location is not None:
      checkpoint = torch.load(checkpoint_path, map_location=map_location)
    else:
      checkpoint = torch.load(checkpoint_path, map_location=lambda storage, loc: storage)

    tokenizer = checkpoint['tokenizer']    # extract the tokenizer
    kwargs = {'tokenizer': tokenizer}      # make it a kwarg to _load_model_state

    model = cls._load_model_state(checkpoint, **kwargs)
    return model

This is obviously particular to our use case, but I think it is possible to polish it a bit by, for example, making it possible to use a saved kwargs dictionary in load_from_checkpoint.
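
For comparison, a rough sketch of how far the existing public API already gets us, since load_from_checkpoint forwards extra kwargs to __init__ (load_self_contained is a hypothetical helper; the checkpoint ends up being read twice):

import torch

def load_self_contained(checkpoint_path, map_location=None):
  # Read the checkpoint once to recover the tokenizer saved by on_save_checkpoint ...
  checkpoint = torch.load(checkpoint_path, map_location=map_location or (lambda storage, loc: storage))
  tokenizer = checkpoint['tokenizer']
  # ... then let load_from_checkpoint forward it to __init__ as a kwarg.
  return Transformer.load_from_checkpoint(checkpoint_path, map_location=map_location, tokenizer=tokenizer)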

elkotito (Contributor) commented

> If we transfer the model from one computer to another, we would need to move a tokenizer to the exact same path on the other computer.

It's a completely valid requirement. That's why people wrap up their training experiments in Dockerfiles. Another example: Polyaxon supports this with YAML scripts that help you define the running environment.

stale bot commented Jul 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the won't fix label on Jul 13, 2020
stale bot closed this as completed on Jul 22, 2020