Step-wise processing, better support for IterableDataset, and others #640

Closed
ibeltagy opened this issue Dec 19, 2019 · 5 comments · Fixed by #728
Labels
feature, help wanted

Comments

@ibeltagy
Contributor

ibeltagy commented Dec 19, 2019

I have been using PTL for a month. It is nice and saves a lot of time, and I intend to use it in future projects. That said, I have a list of feature requests and improvements that would be very helpful for supporting a wider set of use cases. I am not sure what the best format for this list is, so I will just write everything here.

  1. Better support for IterableDataset
  • In addition to val_check_interval, we also need num_val_steps and num_train_steps. num_val_steps is needed because the validation set also uses an IterableDataset. num_train_steps is needed because you usually have to pick the number of gradient updates carefully, and it interacts with the learning rate scheduler (num_train_steps=inf is not sufficient).
  • For validation, keep the same DataLoader object instead of instantiating a new one on each validation cycle, because constructing new workers every time is costly.
  • Some of the debugging features that run on a small percentage of the training/validation data don't work, because they assume a map-style Dataset rather than an IterableDataset.
  2. Step-wise processing
    Think of the "gradient update" as the unit of training instead of (or in addition to) the epoch. A typical use case is pretraining a language model, where you want to control the number of gradient updates, not epochs (e.g., see the RoBERTa/BERT papers).
  • Add an option to call scheduler.step() after every gradient update (see the sketch after this list).
  • Have self.trainer.num_train_steps available to the LR scheduler; the scheduler is usually a function of the number of steps.
  • Checkpoint the current step and resume from that step. Again, this is important to get the right scheduler state, and also for the TensorBoard logging. Resuming from the same training example would be nice, but is less important.
  3. Misc. These are smaller points, but nice to have.
  • Have the default TensorBoard logging include the LR, time per step, and the gradient norm (check fairseq).
  • Trainer(gpus=2) ignores CUDA_VISIBLE_DEVICES and always picks the first two GPUs.
  • With DDP, sync validation stats across processes. Not doing so is a common mistake, and it would be nice to guard users against it. It amounts to something like the following lines at the end of validation_end (note that all_reduce operates in place):

    torch.distributed.all_reduce(val_loss, op=torch.distributed.ReduceOp.SUM)
    val_loss = val_loss / self.trainer.world_size
  • Various logs refer to "batches", and it is not clear whether that means a "batch" or a "step". They are usually the same except with gradient accumulation. Personally, I prefer the word "step" because it eliminates that confusion.
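
To illustrate the step-wise scheduling in point 2, here is a rough sketch in plain PyTorch (the names num_train_steps and warmup_steps are my own, not an existing PTL API): the LR is a function of the global step, and the scheduler is stepped once per gradient update rather than once per epoch.

    import torch

    def build_scheduler(optimizer, num_train_steps, warmup_steps):
        # Linear warmup followed by linear decay, defined over the
        # global step rather than the epoch.
        def lr_lambda(step):
            if step < warmup_steps:
                return step / max(1, warmup_steps)
            return max(0.0, (num_train_steps - step) / max(1, num_train_steps - warmup_steps))
        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # In the training loop (simplified), step the scheduler after every optimizer update:
    #   loss.backward(); optimizer.step(); scheduler.step()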

Thanks for the helpful library and sorry for the long list.

ibeltagy added the feature and help wanted labels Dec 19, 2019
@matthew-z
Contributor

> Trainer(gpus=2) ignores CUDA_VISIBLE_DEVICES and always picks the first two GPUs.

Trainer cannot ignore CUDA_VISIBLE_DEVICES, as it is handled by the underlying libraries.

The device ids in PyTorch/TensorFlow applications are not consistent with the ids shown in nvidia-smi. E.g., gpu:0 in PyTorch may be gpu:2 in nvidia-smi. I guess that is why you could not select the card you wanted.

You may set the environment variable export CUDA_DEVICE_ORDER=PCI_BUS_ID; then the order of the device ids will be consistent.
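
For example (just a sketch of the usual pattern; the variables must be set before CUDA is initialized, and the "2,3" is an arbitrary choice):

    import os

    # Make the device ids match the nvidia-smi ordering, then expose only the cards you want.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"  # arbitrary example

    import torch  # import after setting the variables so they take effect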

@ibeltagy
Contributor Author

ibeltagy commented Jan 21, 2020

@matthew-z, PTL does override my CUDA_VISIBLE_DEVICES to 0,1,.., making it difficult to run multiple jobs on the same machine. Check the code here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/06242c200a318a37d1f882c786e60354ec04533f/pytorch_lightning/trainer/distrib_data_parallel.py#L251
and similarly here: https://github.com/PyTorchLightning/pytorch-lightning/blob/06242c200a318a37d1f882c786e60354ec04533f/pytorch_lightning/trainer/distrib_parts.py#L516

@Borda
Member

Borda commented Jan 24, 2020

probably linked to #323 and #698

@sshleifer
Contributor

Has num_val_steps been added, or is there some alias?

@ibeltagy
Contributor Author

As far as I can tell, no. My workaround is to add

    def __len__(self):
        return 1000000  # a large positive constant 

in the IterableDataset, then use val_percent_check to run for a smaller number of steps.
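
Roughly, the workaround looks like this (a sketch only: the class name is made up, the length is an arbitrary large constant, and the val_percent_check value simply scales that reported length down):

    from torch.utils.data import IterableDataset

    class StreamingValDataset(IterableDataset):  # hypothetical dataset
        def __iter__(self):
            ...  # yield validation examples from the stream here

        def __len__(self):
            return 1000000  # a large positive constant

    # trainer = Trainer(val_percent_check=0.001)  # caps validation at roughly 0.1% of the reported length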
