
GPT2-large on Colab TPU seems to time out #996

Closed
bilal2vec opened this issue Mar 2, 2020 · 7 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

@bilal2vec (Contributor)

🐛 Bug

When training gpt2-large on a Colab TPU, training fails: the spawned TPU processes sit waiting on the mesh master and one of them is eventually killed with SIGKILL.

To Reproduce

See the colab notebook: https://colab.research.google.com/drive/1An6D3wh_H4dbmlEUHYOXZYxkH6S7VKu9
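
For reference, a minimal sketch of the kind of Lightning fine-tuning script the notebook runs (the dataset, hyperparameters, and module layout here are illustrative assumptions, not the notebook's actual code; the TPU flag was num_tpu_cores in the Lightning version used at the time):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class GPT2FineTuner(pl.LightningModule):
    def __init__(self, model_name="gpt2-large"):
        super().__init__()
        # Each TPU process spawned by the Trainer builds its own copy of the model.
        self.model = GPT2LMHeadModel.from_pretrained(model_name)

    def training_step(self, batch, batch_idx):
        input_ids = batch[0]
        # transformers returns the LM loss first when labels are supplied
        loss = self.model(input_ids, labels=input_ids)[0]
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=5e-5)

    def train_dataloader(self):
        # Tiny placeholder dataset so the sketch is self-contained
        tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
        ids = tokenizer.encode("hello world " * 64, return_tensors="pt")
        return DataLoader(TensorDataset(ids), batch_size=1)

if __name__ == "__main__":
    trainer = pl.Trainer(num_tpu_cores=8, max_epochs=1)
    trainer.fit(GPT2FineTuner())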

This is the relevant part of the stack trace:

INFO:root:training on 8 TPU cores
2020-03-02 00:43:14.794597: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.857680: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.918609: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:14.974498: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.031540: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.087601: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
2020-03-02 00:43:15.142553: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
E0302 00:43:22.445484458    1536 server_chttp2.cc:40]        {"created":"@1583109802.445465277","description":"Only 1 addresses added out of total 2 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":404,"referenced_errors":[{"created":"@1583109802.445463004","description":"Address family not supported by protocol","errno":97,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::1]:57271"}]}
2020-03-02 00:43:24.109498: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.429623: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.712988: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.731491: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.867584: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:24.883436: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
2020-03-02 00:43:25.112841: I tensorflow/compiler/xla/xla_client/computation_client.cc:197] Fetching mesh configuration for worker tpu_worker:0 from mesh service at localhost:57271
INFO:root:INIT TPU local core: 0, global rank: 0
2020-03-02 00:44:11.382078: I tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57271
INFO:root:INIT TPU local core: 2, global rank: 2
INFO:root:INIT TPU local core: 5, global rank: 5
2020-03-02 00:44:15.925331: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Failed to meet rendezvous 'pl.Trainer.run_pretrain_routine': Socket closed (14)
Traceback (most recent call last):
  File "finetune.py", line 129, in <module>
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 976, in fit
    xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 1 terminated with signal SIGKILL

Expected behavior

The same code trains gpt2 (124M parameters) successfully but fails with gpt2-large (774M parameters).

Environment

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.17.5
[pip3] torch==1.4.0
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.5.0
[conda] Could not collect

Additional context

bilal2vec added the bug and help wanted labels on Mar 2, 2020
github-actions bot commented Mar 2, 2020

Hi! Thanks for your contribution, great first issue!

@williamFalcon (Contributor)

@bkkaggle try again using the latest version.

@bilal2vec (Contributor, Author)

I updated the Colab notebook. The error remains, but it looks like the cause is that pytorch/xla loads the data into every spawned process, causing an OOM (see pytorch/xla#1280 (comment)).

Closing
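
For context, a rough sketch of the pattern that causes the duplication (not the Lightning internals; the function name is illustrative): xmp.spawn starts one Python process per TPU core, and anything constructed inside the spawned function is held in host RAM once per process.

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import GPT2LMHeadModel

def _mp_fn(index):
    # All 8 processes execute this body independently, so the gpt2-large
    # weights (~3 GB in fp32) plus any eagerly loaded dataset end up in
    # host RAM 8 times before anything reaches the TPU.
    device = xm.xla_device()
    model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device)
    # ... build dataloaders and run the training loop ...

if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)

(Newer torch_xla releases provide xmp.MpModelWrapper to keep a single host copy of the weights when the fork start method is used, but whether that helps here depends on the installed torch_xla version.)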

@williamFalcon (Contributor)

@dlibenzi fyi.

@bkkaggle maybe file a bug in the xla repo?

@dlibenzi commented Mar 7, 2020

It's likely the kernel OOM killer triggering this.
Colab VMs have limited memory and cores, so they cannot run very large workloads.
We will be changing the Cloud TPU architecture in the coming months, and after that the Colab VM should have much more memory and cores.

@williamFalcon (Contributor)

@srush fyi

@srush (Contributor) commented Mar 7, 2020

Yup, this is what I saw as well. You need enough RAM to have the model loaded 8 times, once per TPU core.
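
To put that in numbers, a back-of-the-envelope estimate (the parameter count and the ~12 GB of host RAM on a standard Colab VM are approximate figures):

params = 774_000_000        # gpt2-large parameter count (approximate)
bytes_per_param = 4         # fp32 weights
copies = 8                  # one process per TPU core via xmp.spawn
total_gb = params * bytes_per_param * copies / 1024**3
print(f"~{total_gb:.0f} GB of host RAM just for the weights")  # ~23 GB, well above Colab's ~12 GB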
