
Training on TPU stuck at "Waiting to connect to client mesh master (300 seconds) localhost:54541" #1090

Closed
nikhilno1 opened this issue Mar 8, 2020 · 3 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

@nikhilno1

🐛 Bug

I am training a GPT-2 model on a TPU, but training gets stuck with the following as the last line:
tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:54541

To Reproduce

I have followed all the steps outlined in https://github.com/mgrankin/ru_transformers/tree/master/tpu to train a GPT-2 model on a TPU on Google Cloud. As mentioned there, I was able to run the MNIST example successfully without any issue:
python /pytorch/xla/test/test_train_mp_mnist.py
But when I run the full training on a small dataset (10 MB), just to make sure it runs successfully, training gets stuck at the line above and doesn't proceed further. When I press Ctrl-C, I can see it is waiting in socket polling. I have tried restarting the TPU, but the same problem occurs.

Steps to reproduce the behavior:

  1. Run the fit.sh script from https://github.com/mgrankin/ru_transformers after completing all the necessary configuration.
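
For reference, the message in question is logged while each per-core process connects to a local mesh master (the localhost:54541 above) during multi-process TPU training. A minimal sketch of that launch pattern with torch_xla, purely as my own illustration and not the ru_transformers code:

```python
# Minimal torch_xla multi-process launch sketch (illustration only, not fit.sh).
# Each spawned process joins a local mesh master; the "Waiting to connect to
# client mesh master" line is printed around that handshake.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()                    # one TPU core per process
    model = torch.nn.Linear(10, 10).to(device)
    # ... the real training loop would go here ...
    xm.rendezvous('training_done')              # sync all processes before exit


if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```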

Logs

TPU Hang.log

Expected behavior

Training should complete successfully.

Environment

Collecting environment information...
PyTorch version: 1.5.0a0+65bad41
Is debug build: No
CUDA used to build PyTorch: None

OS: Debian GNU/Linux 9 (stretch)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] numpydoc==0.9.1
[pip] torch==1.5.0a0+65bad41
[pip] torch-xla==0.8+98a2790
[pip] torchvision==0.6.0a0+b6f28ec
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.4                      243  
[conda] mkl-service               2.3.0            py36he904b0f_0  
[conda] mkl_fft                   1.0.14           py36ha843d7b_0  
[conda] mkl_random                1.1.0            py36hd6b4f25_0  
[conda] torch                     1.5.0a0+65bad41           <pip>
[conda] torch-xla                 0.8+98a2790               <pip>
[conda] torchvision               0.6.0a0+b6f28ec           <pip>


Additional context

This is my first time using TPU for training.
nikhilno1 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Mar 8, 2020

github-actions bot commented Mar 8, 2020

Hi! Thanks for your contribution, great first issue!

nikhilno1 (Author) commented:

Is it the case that training actually completes, but the command doesn't return the way I am used to seeing?
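
Just guessing at the mechanism: if the worker processes don't all reach the final synchronization point (for example, if only one of them saves the model while the others have already exited), the remaining ones would sit in exactly that mesh-master socket poll. A sketch of the kind of save guard I mean, assuming the torch_xla API of this version; this is not taken from ru_transformers:

```python
# Hypothetical end-of-training save; an assumption, not the ru_transformers code.
import torch_xla.core.xla_model as xm


def save_checkpoint(model, path='pytorch_model.bin'):
    # xm.save() writes the state dict from the master ordinal only and
    # synchronizes the processes, so no worker is left waiting on the others.
    xm.save(model.state_dict(), path)
```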

nikhilno1 (Author) commented:

I can see that pytorch_model.bin is getting created, which means training was successful. Closing the issue.
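
For anyone who hits the same thing: a quick sanity check that the run produced a usable checkpoint (assuming it is a plain state-dict file, as the Hugging Face GPT-2 scripts write) is to load it on the CPU:

```python
import torch

# Load the checkpoint on CPU and make sure the state dict is non-empty.
state = torch.load('pytorch_model.bin', map_location='cpu')
print(f'{len(state)} tensors in checkpoint')
```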
