I am getting a huge time difference between training a model on a specific TPU core (tpu_cores=[1]) and training a model on just one TPU core (tpu_cores=1). What is the reason for that? Aren't both conditions the same, with the only difference being that I assign a specific TPU core in the first case and the number of TPU cores I want to use in the second? Also, in the second case I am getting an error. When training with tpu_cores=[1] the epoch time is 17 seconds; with tpu_cores=1 the epoch time is just 5 seconds.
Running on Colab gives me an error, but there is no error on Kaggle kernels. The time difference issue is the same on both platforms.
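For reference, here is a minimal sketch of the two Trainer configurations being compared. It assumes a TPU runtime with torch_xla installed and the PyTorch Lightning API from around the time of this issue; max_epochs and the commented fit() calls are only illustrative:

```python
import pytorch_lightning as pl

# Case 1: a list selects which TPU core (by index) to train on
trainer_specific_core = pl.Trainer(tpu_cores=[1], max_epochs=1)

# Case 2: an int selects how many TPU cores to train on
trainer_one_core = pl.Trainer(tpu_cores=1, max_epochs=1)

# trainer_specific_core.fit(model)  # ~17 s / epoch reported above
# trainer_one_core.fit(model)       # ~5 s / epoch reported above
```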
@dlibenzi I recall that when training on a single core and using ParallelLoader, I used to receive an error; hence the self.tpu_id is None condition. However, I rechecked and it seems to work fine with ParallelLoader now. I have made a PR for this. :)
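For context, here is a minimal sketch of a single-core training loop driven through torch_xla's ParallelLoader, which is the code path that the self.tpu_id is None check guarded. The dataset, model, and optimizer are placeholders, not the Lightning internals:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as xla_pl

device = xm.xla_device()  # the single TPU core this process trains on

# Placeholder data/model/optimizer; Lightning wires up its own versions internally.
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
    batch_size=8,
)
model = torch.nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# ParallelLoader pre-loads batches onto the XLA device, even when only one core is used.
para_loader = xla_pl.ParallelLoader(loader, [device])
for x, y in para_loader.per_device_loader(device):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    xm.optimizer_step(optimizer, barrier=True)  # barrier needed when not under xmp.spawn
```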
Borda changed the title from "Training time on tpu is less when specifying the tpu_core" to "specifying the tpu_core speed-up TPU training" on Jun 2, 2020.
🐛 Bug
To Reproduce
Code sample
Colab Notebook
Expected behavior
As far as I know, the training time should be the same in both cases, whether training on a single core or on a specific core.
Environment
How you installed PyTorch (conda, pip, source): pip
Additional context