
GPT2-large on Colab TPU seems to time out #996

Closed
bilal2vec opened this issue Mar 2, 2020 · 7 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

@bilal2vec (Contributor)

🐛 Bug

When training gpt2-large on a Colab TPU, training fails: the spawned TPU processes sit waiting on the mesh master and one of them is eventually killed with SIGKILL.

To Reproduce

See the colab notebook: https://colab.research.google.com/drive/1An6D3wh_H4dbmlEUHYOXZYxkH6S7VKu9
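
For reference, a minimal sketch of the kind of Lightning fine-tuning script the notebook runs (the dataset, hyperparameters, and module layout here are illustrative assumptions, not the notebook's actual code; the TPU flag was num_tpu_cores in the Lightning version used at the time):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class GPT2FineTuner(pl.LightningModule):
    def __init__(self, model_name="gpt2-large"):
        super().__init__()
        # Each TPU process spawned by the Trainer builds its own copy of the model.
        self.model = GPT2LMHeadModel.from_pretrained(model_name)

    def training_step(self, batch, batch_idx):
        input_ids = batch[0]
        # transformers returns the LM loss first when labels are supplied
        loss = self.model(input_ids, labels=input_ids)[0]
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=5e-5)

    def train_dataloader(self):
        # Tiny placeholder dataset so the sketch is self-contained
        tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
        ids = tokenizer.encode("hello world " * 64, return_tensors="pt")
        return DataLoader(TensorDataset(ids), batch_size=1)

if __name__ == "__main__":
    trainer = pl.Trainer(num_tpu_cores=8, max_epochs=1)
    trainer.fit(GPT2FineTuner())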

This is the relevant part of the stack trace:

INFO:root:training on 8 TPU cores
2020-03-02 00:43:14.794597: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.857680: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.918609: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.974498: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.031540: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.087601: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.142553: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
E0302 00:43:22.445484458    1536 server_chttp2.cc:40]        {"created":"@1583109802.445465277","description":"Only 1 addresses added out of total 2 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":404,"referenced_errors":[{"created":"@1583109802.445463004","description":"Address family not supported by protocol","errno":97,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::1]:57271"}]}
2020-03-02 00:43:24.109498: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.429623: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.712988: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.731491: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.867584: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.883436: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:25.112841: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
INFO:root:INIT TPU local core: 0, global rank: 0
2020-03-02 00:44:11.382078: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
INFO:root:INIT TPU local core: 2, global rank: 2
INFO:root:INIT TPU local core: 5, global rank: 5
2020-03-02 00:44:15.925331: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Failed to meet rendezvous 'pl.Trainer.run_pretrain_routine': Socket closed (14)
Traceback (most recent call last):
  File "finetune.py", line 129, in <module>
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 976, in fit
    xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 1 terminated with signal SIGKILL

Expected behavior

The same code trains gpt2 (124M parameters) successfully but fails with gpt2-large (774M parameters).

Environment

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.17.5
[pip3] torch==1.4.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.5.0
[conda] Could not collect

Additional context

bilal2vec added the bug and help wanted labels on Mar 2, 2020
github-actions bot commented Mar 2, 2020

Hi! Thanks for your contribution, great first issue!

@williamFalcon (Contributor)

@bkkaggle try again using the latest version.

@bilal2vec (Contributor, Author)

I updated the Colab notebook. The error remains, but it looks like the cause is that pytorch/xla loads the data into every spawned process, causing an OOM (see pytorch/xla#1280 (comment)).

Closing
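
For context, a rough sketch of the pattern that causes the duplication (not the Lightning internals; the function name is illustrative): xmp.spawn starts one Python process per TPU core, and anything constructed inside the spawned function is held in host RAM once per process.

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import GPT2LMHeadModel

def _mp_fn(index):
    # All 8 processes execute this body independently, so the gpt2-large
    # weights (~3 GB in fp32) plus any eagerly loaded dataset end up in
    # host RAM 8 times before anything reaches the TPU.
    device = xm.xla_device()
    model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device)
    # ... build dataloaders and run the training loop ...

if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)

(Newer torch_xla releases provide xmp.MpModelWrapper to keep a single host copy of the weights when the fork start method is used, but whether that helps here depends on the installed torch_xla version.)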

@williamFalcon (Contributor)

@dlibenzi fyi.

@bkkaggle maybe file a bug in the xla repo?

@dlibenzi commented Mar 7, 2020

It's likely the kernel OOM killer triggering this.
Colab VMs have limited memory and cores, so they cannot run very large workloads.
We will be changing the Cloud TPU architecture in the coming months, and after that the Colab VM should have much more memory and cores.

@williamFalcon (Contributor)

@srush fyi

@srush (Contributor) commented Mar 7, 2020

Yup, this is what I saw as well. You need enough RAM to have the model loaded 8 times, once per TPU core.
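
To put that in numbers, a back-of-the-envelope estimate (the parameter count and the ~12 GB of host RAM on a standard Colab VM are approximate figures):

params = 774_000_000        # gpt2-large parameter count (approximate)
bytes_per_param = 4         # fp32 weights
copies = 8                  # one process per TPU core via xmp.spawn
total_gb = params * bytes_per_param * copies / 1024**3
print(f"~{total_gb:.0f} GB of host RAM just for the weights")  # ~23 GB, well above Colab's ~12 GB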
