I am training a GPT2 model on a TPU, but training gets stuck with the following as the last log line:
tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:54541
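For anyone hitting the same message, one quick check is whether anything is listening on the address from that log line at all. The snippet below is only a sketch using the Python standard library; the helper name `port_is_open` is hypothetical, and the port 54541 is copied from my log, so it will likely differ in another run.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: host and port are taken from the log line above.
    # Substitute the values from your own log.
    print(port_is_open("localhost", 54541))
```

If this prints `False` while training is hanging, nothing is accepting connections on that port, which would at least narrow down where to look. It does not by itself explain why.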
To Reproduce
I followed all the steps outlined in https://github.com/mgrankin/ru_transformers/tree/master/tpu to train a GPT2 model on a TPU on Google Cloud. As mentioned there, I was able to run the MNIST example without any issue:
python /pytorch/xla/test/test_train_mp_mnist.py
But when I ran the full training on a small dataset (10MB), just to make sure it runs successfully, training got stuck at the line above and did not proceed. When I press Ctrl-C, I can see it is waiting in socket polling. I have tried restarting the TPU, but the same problem occurs.
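Pressing Ctrl-C kills the run, so a less disruptive way to see where a hung process is waiting may be Python's standard `faulthandler` module, which can dump the stack of every thread on a timer. This is a sketch under assumptions: the helper name `enable_hang_dumps` and the 120-second interval are made up for illustration, and in a multi-process run it would presumably need to be called inside each worker process.

```python
import faulthandler
import sys


def enable_hang_dumps(interval_s: float = 120.0, stream=sys.stderr) -> None:
    """Periodically write the traceback of all threads to `stream`.

    `stream` must be a real file object (it needs a working fileno()).
    """
    # Also dump tracebacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=stream)
    # Every `interval_s` seconds, dump all thread stacks without stopping the process.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=stream)


# Hypothetical usage near the top of the training script:
#   enable_hang_dumps(120.0)
# To stop the periodic dumps again:
#   faulthandler.cancel_dump_traceback_later()
```

If the process is stuck in socket polling, the repeated dumps should show the same frames each time, which makes the hang location easy to attach to a report.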
```
Collecting environment information...
PyTorch version: 1.5.0a0+65bad41
Is debug build: No
CUDA used to build PyTorch: None
OS: Debian GNU/Linux 9 (stretch)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
CMake version: version 3.14.0
Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] numpydoc==0.9.1
[pip] torch==1.5.0a0+65bad41
[pip] torch-xla==0.8+98a2790
[pip] torchvision==0.6.0a0+b6f28ec
[conda] blas 1.0 mkl
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py36he904b0f_0
[conda] mkl_fft 1.0.14 py36ha843d7b_0
[conda] mkl_random 1.1.0 py36hd6b4f25_0
[conda] torch 1.5.0a0+65bad41 <pip>
[conda] torch-xla 0.8+98a2790 <pip>
[conda] torchvision 0.6.0a0+b6f28ec <pip>
```
### Additional context
This is my first time using TPU for training.
Logs
TPU Hang.log
Expected behavior
Training should complete successfully.