TPU Error - RuntimeError: Unknown device #5064

Closed
NasirKhalid24 opened this issue Dec 10, 2020 · 6 comments
Labels
3rd party (Related to a 3rd-party), accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)

Comments

@NasirKhalid24

🐛 Bug

I am running into a bug when training on TPU; the same code works fine on GPU. I am trying to train a preloaded EfficientNet on some new data. The stack trace at the end of this issue shows that the error is raised inside the EfficientNet forward pass. The Colab notebook is linked below.

I would appreciate any pointers, as I am unable to debug the issue.

Colab Link

To Reproduce

Notebook linked

Environment

Google Colab

Additional context

Stack Trace Below

FOLD: 1

GPU available: False, used: False
TPU available: True, using: 8 TPU cores
Using native 16bit precision.
training on 8 TPU cores
INIT TPU local core: 0, global rank: 0 with XLA_USE_BF16=1
INIT TPU local core: 6, global rank: 6 with XLA_USE_BF16=1
INIT TPU local core: 7, global rank: 7 with XLA_USE_BF16=1
INIT TPU local core: 5, global rank: 5 with XLA_USE_BF16=1
INIT TPU local core: 2, global rank: 2 with XLA_USE_BF16=1
INIT TPU local core: 3, global rank: 3 with XLA_USE_BF16=1
INIT TPU local core: 1, global rank: 1 with XLA_USE_BF16=1
INIT TPU local core: 4, global rank: 4 with XLA_USE_BF16=1

  | Name   | Type         | Params
----------------------------------------
0 | model  | EfficientNet | 10.7 M
1 | metric | Accuracy     | 0     
----------------------------------------
10.7 M    Trainable params
0         Non-trainable params
10.7 M    Total params

Validation sanity check:
0/? [00:00<?, ?it/s]

Exception in device=TPU:3: Unknown device
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 140, in tpu_train_in_process
    results = self.train_or_test()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 65, in train_or_test
    results = self.trainer.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 492, in train
    self.run_sanity_check(self.get_model())
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 690, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 606, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 178, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 156, in validation_step
    return self._step(self.trainer.model.validation_step, args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 150, in _step
    return model_step(*args)
  File "<ipython-input-15-a9365a9b51e7>", line 62, in validation_step
    logits = self(image)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<ipython-input-15-a9365a9b51e7>", line 19, in forward
    return self.model(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 390, in forward
    x = self.forward_features(x)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 383, in forward_features
    x = self.blocks(x)
RuntimeError: Unknown device
Exception in device=TPU:7: Unknown device
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 140, in tpu_train_in_process
    results = self.train_or_test()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 65, in train_or_test
    results = self.trainer.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 492, in train
    self.run_sanity_check(self.get_model())
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 690, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 606, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 178, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 156, in validation_step
    return self._step(self.trainer.model.validation_step, args)
Exception in device=TPU:1: Unknown device
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 150, in _step
    return model_step(*args)
  File "<ipython-input-15-a9365a9b51e7>", line 62, in validation_step
    logits = self(image)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<ipython-input-15-a9365a9b51e7>", line 19, in forward
    return self.model(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 390, in forward
    x = self.forward_features(x)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 383, in forward_features
    x = self.blocks(x)
RuntimeError: Unknown device
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 140, in tpu_train_in_process
    results = self.train_or_test()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 65, in train_or_test
    results = self.trainer.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 492, in train
    self.run_sanity_check(self.get_model())
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 690, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 606, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 178, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 156, in validation_step
    return self._step(self.trainer.model.validation_step, args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 150, in _step
    return model_step(*args)
  File "<ipython-input-15-a9365a9b51e7>", line 62, in validation_step
    logits = self(image)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<ipython-input-15-a9365a9b51e7>", line 19, in forward
    return self.model(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 390, in forward
    x = self.forward_features(x)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 383, in forward_features
    x = self.blocks(x)
RuntimeError: Unknown device
Exception in device=TPU:5: Unknown device
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 140, in tpu_train_in_process
    results = self.train_or_test()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 65, in train_or_test
    results = self.trainer.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 492, in train
    self.run_sanity_check(self.get_model())
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 690, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 606, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 178, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 156, in validation_step
    return self._step(self.trainer.model.validation_step, args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 150, in _step
    return model_step(*args)
  File "<ipython-input-15-a9365a9b51e7>", line 62, in validation_step
    logits = self(image)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<ipython-input-15-a9365a9b51e7>", line 19, in forward
    return self.model(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 390, in forward
    x = self.forward_features(x)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 383, in forward_features
    x = self.blocks(x)
RuntimeError: Unknown device
Exception in device=TPU:6: Unknown device
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 140, in tpu_train_in_process
    results = self.train_or_test()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 65, in train_or_test
    results = self.trainer.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 492, in train
    self.run_sanity_check(self.get_model())
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 690, in run_sanity_check
    _, eval_results = self.run_evaluation(test_mode=False, max_batches=self.num_sanity_val_batches)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 606, in run_evaluation
    output = self.evaluation_loop.evaluation_step(test_mode, batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 178, in evaluation_step
    output = self.trainer.accelerator_backend.validation_step(args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 156, in validation_step
    return self._step(self.trainer.model.validation_step, args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 150, in _step
    return model_step(*args)
  File "<ipython-input-15-a9365a9b51e7>", line 62, in validation_step
    logits = self(image)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<ipython-input-15-a9365a9b51e7>", line 19, in forward
    return self.model(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 390, in forward
    x = self.forward_features(x)
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 383, in forward_features
    x = self.blocks(x)
RuntimeError: Unknown device

---------------------------------------------------------------------------

ProcessExitedException                    Traceback (most recent call last)

<ipython-input-16-5a9b778d6f4f> in <module>()
     23     )
     24 
---> 25     trainer.fit(model)
     26 
     27     image_paths = []

4 frames

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    158                     error_index=error_index,
    159                     error_pid=failed_process.pid,
--> 160                     exit_code=exitcode
    161                 )
    162 

ProcessExitedException: process 3 terminated with exit code 17
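
Because all TPU processes write to the same output, the tracebacks above are interleaved (for example, the header for TPU:1 appears in the middle of the TPU:7 traceback). The following is a minimal sketch, assuming the output has been saved as plain text, that lists which devices reported an exception and counts how often each traceback frame appears. The names `summarize_trace` and `sample` are hypothetical helpers for illustration, not part of PyTorch Lightning or torch_xla, and the regular expressions are based only on the line formats visible in this trace.

```python
import re
from collections import Counter

# Header line as it appears in the trace above,
# e.g. "Exception in device=TPU:3: Unknown device".
DEVICE_RE = re.compile(r"^Exception in device=(\w+):(\d+): (.*)$")
# Standard Python traceback frame line.
FRAME_RE = re.compile(r'^\s*File "(.+)", line (\d+), in (\S+)$')


def summarize_trace(log_text):
    """Return (sorted failed devices, frame counts) for an interleaved log."""
    devices = []
    frames = Counter()
    for line in log_text.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            devices.append((m.group(1), int(m.group(2)), m.group(3)))
            continue
        m = FRAME_RE.match(line)
        if m:
            frames[(m.group(1), int(m.group(2)), m.group(3))] += 1
    return sorted(devices), frames


# A short excerpt in the same format as the trace in this issue.
sample = """Exception in device=TPU:3: Unknown device
Traceback (most recent call last):
  File "/content/pytorch-image-models/pytorch-image-models-master/timm/models/efficientnet.py", line 383, in forward_features
    x = self.blocks(x)
RuntimeError: Unknown device
Exception in device=TPU:7: Unknown device
"""

devices, frames = summarize_trace(sample)
for kind, index, message in devices:
    print(f"{kind}:{index} -> {message}")
for (path, lineno, func), count in frames.most_common(3):
    print(f"{count}x {func} ({path}:{lineno})")
```

Run against the full output, a summary like this makes it easier to see that every failing process stops at the same frame, even though the raw tracebacks are mixed together.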

@NasirKhalid24 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Dec 10, 2020
@github-actions
Contributor

Hi! Thanks for your contribution! Great first issue!

@rohitgr7
Contributor

@NasirKhalid24 would you mind making the notebook publicly accessible?

@NasirKhalid24
Author

> @NasirKhalid24 would you mind making the notebook publicly accessible?

Hey Rohit, apologies. It should be public now at this link:

https://colab.research.google.com/drive/1OgwnD5C4Oiw-I8QzVt2NRzbDG_t2nQiE?usp=sharing

@rohitgr7
Contributor

rohitgr7 commented Dec 10, 2020

The notebook is too big to pinpoint the issue. Can you minimize it somehow, or reproduce the error with https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing ?

@NasirKhalid24
Author

> The notebook is too big to pinpoint the issue. Can you minimize it somehow, or reproduce the error with https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing ?

Hey Rohit, I tried to reproduce it here, but now I keep getting a "process killed" error:

https://colab.research.google.com/drive/1-SW2pb4vVDUuV4WKLEtnaC8OK-sZsroV?usp=sharingg

@Borda added the 3rd party (Related to a 3rd-party), priority: 1 (Medium priority task), and accelerator: tpu (Tensor Processing Unit) labels on Dec 11, 2020
@rohitgr7
Contributor

rohitgr7 commented Dec 11, 2020

Related issue: #1590
