
machine translation validation fails with multi-process #1280

Closed
sIncerass opened this issue Oct 31, 2019 · 12 comments

sIncerass commented Oct 31, 2019

❓ Questions and Help

To Reproduce

Steps to reproduce the behavior:

  1. Create an instance using the latest torch-xla image:
export PROJECT_NAME=xxx
gcloud config set project ${PROJECT_NAME}
gcloud compute --project=${PROJECT_NAME} instances create instance-1 \
--zone=europe-west4-a  \
--machine-type=n1-standard-8  \
--image=debian-9-torch-xla-v20191026 \
--image-project=ml-images  \
--boot-disk-size=200GB
  2. conda activate torch-xla-nightly
  3. Run the machine translation script following https://cloud.google.com/tpu/docs/tutorials/transformer-pytorch, on the tpu branch of pytorch-tpu/fairseq (https://github.com/pytorch-tpu/fairseq/tree/tpu):
gcloud compute tpus create transformer-pytorch-tutorial \
--zone=europe-west4-a \
--network=default \
--range=10.2.3.0 \
--version=pytorch-nightly \
--accelerator-type=v3-8

export TPU_IP_ADDRESS=ip-address; \
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470";

python train.py \
  $HOME/pytorch-tutorial-data/wmt18_en_de_bpej32k \
  --save-interval=1 \
  --arch=transformer_vaswani_wmt_en_de_big \
  --max-target-positions=64 \
  --attention-dropout=0.1 \
  --no-progress-bar \
  --criterion=label_smoothed_cross_entropy \
  --source-lang=en \
  --lr-scheduler=inverse_sqrt \
  --min-lr 1e-09 \
  --skip-invalid-size-inputs-valid-test \
  --target-lang=de \
  --label-smoothing=0.1 \
  --update-freq=1 \
  --optimizer adam \
  --adam-betas '(0.9, 0.98)' \
  --warmup-init-lr 1e-07 \
  --lr 0.0005 \
  --warmup-updates 4000 \
  --share-all-embeddings \
  --dropout 0.3 \
  --weight-decay 0.0 \
  --valid-subset=valid \
  --max-epoch=25 \
  --input_shapes 128x64 \
  --num_cores=8 \
  --metrics_debug \
  --log_steps=100

After the first epoch, during validation, it reports

/anaconda3/envs/torch-xla-nightly/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown len(cache))

and then crashes. No checkpoint is saved either.

Expected behavior

It crashes with a SIGKILL from multiprocessing:

Traceback (most recent call last):
  File "train.py", line 632, in <module>
    cli_main()
  File "train.py", line 623, in cli_main
    xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 154, in spawn
    _start_fn, args=(fn, args), nprocs=nprocs, join=join, daemon=daemon)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

Environment

  • reproducible on XLA backend [CPU/TPU]: TPU
  • torch_xla version: torch-xla-nightly (v1026)
  • Any other relevant information:
sIncerass changed the title from "machine translation validation fails with multi-preprocess" to "machine translation validation fails with multi-process" on Oct 31, 2019
taylanbil (Collaborator) commented:

Hello,

This is not a fatal error, right? The process should keep running after you see this message in stderr; can you confirm?

This was discussed here. As far as I can tell, this issue is not really related to TPUs and it is benign.

sIncerass (Author) commented Oct 31, 2019

Thanks for the information. It then continues and crashes:

| epoch 001 | valid on xla:0/1 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
| epoch 001 | valid on xla:0/7 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
| epoch 001 | valid on xla:0/2 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
| epoch 001 | valid on xla:0/4 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
| epoch 001 | valid on xla:0/3 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
Traceback (most recent call last):
  File "train.py", line 632, in <module>
    cli_main()
  File "train.py", line 623, in cli_main
    xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 154, in spawn
    _start_fn, args=(fn, args), nprocs=nprocs, join=join, daemon=daemon)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

taylanbil (Collaborator) commented:

I see. Is there any other error message? Something in the code is genuinely erroring, and it is independent of the semaphore_tracker message above. I'm going to go through the steps now to see if I can reproduce.

taylanbil (Collaborator) commented:

Oh, I just noticed that the commands above create a VM in Europe, while the TPUs are in the US. Can you retry with the VM and TPU in the same region?

sIncerass (Author) commented Oct 31, 2019

Sorry, that's a typo; the TPU and the VM instance are both in Europe.
Thanks for helping. That's the only error message I have seen.

taylanbil self-assigned this issue on Oct 31, 2019
taylanbil (Collaborator) commented:

I am currently trying to repro. I'll report back on whether epoch 1 validation errors out or completes.

sIncerass (Author) commented Oct 31, 2019

Many thanks for helping! I am also restarting a new run to see if it reports the same issue.
Confirmed: the same issue appears after the first epoch.

Eric-Wallace commented Oct 31, 2019

You can also reproduce this error by just adding an

if i == 10:
    return tracker

inside train_loop_fn so you don't have to wait for epoch 1 training to finish.

taylanbil (Collaborator) commented:

So I created a new VM + TPU and ran through the tutorial. The process indeed died as described in the issue, around validation step ~300; it received a SIGKILL. Looking at sudo dmesg -T, it became obvious that this is an OOM error.
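For anyone checking their own run, the kernel log on the VM makes the OOM kill explicit. This is a generic check rather than the exact output from this run:

# Confirm the SIGKILL came from the kernel OOM killer rather than from the code
sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process'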

The reason for this is, I believe, the following:

  • The tutorial was written assuming the torch-xla-0.5 environment, whereas you are using torch-xla-nightly. There have been big changes since the 0.5 release, including the switch from multithreading to multiprocessing.
  • Multiprocessing loads the input data into every process, whereas multithreading loads it once, so memory usage is significantly higher with MP (a rough way to observe this on the VM is sketched below).
  • Since the tutorial uses n1-standard-8, the processes OOM.
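A rough way to watch this while train.py is running, not something from the thread, assuming the spawned workers show up under the command name python:

# Per-process resident memory of the spawned workers; with multiprocessing each
# worker holds its own copy of the loaded data, so RSS adds up quickly on an
# n1-standard-8 (30 GB of RAM).
ps -C python -o pid,ppid,rss,etime,cmd --sort=-rss | head -n 12

# Overall memory pressure on the VM
free -h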

I have verified that the combo n1-standard-64 and torch-xla-nightly works. I will now verify that it works on torch-xla-0.5 and n1-standard-8.

Does that make sense?

sIncerass (Author) commented Nov 1, 2019

Yes, that makes sense.
@Eric-Wallace and I found that it might be better to merge facebookresearch/fairseq@a1c997b into the pytorch-tpu/fairseq repo; it offers a more efficient data loader and may resolve this problem easily ("mmap" means the script does not copy the data into every process's memory).
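One possible way to pull that commit into a local pytorch-tpu/fairseq checkout, sketched here and untested; conflicts are likely and would need resolving by hand:

# From inside a clone of pytorch-tpu/fairseq, on the tpu branch
git remote add upstream https://github.com/facebookresearch/fairseq.git
git fetch upstream
git cherry-pick a1c997b   # the mmap data loader commit mentioned above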

taylanbil (Collaborator) commented:

Thanks for the suggestion, that does seem like a useful commit. We plan to rebase our tpu branch on top of fairseq master, which will include this change too. Feel free to submit a PR if you have already cherry-picked that commit and resolved the conflicts.

I verified that both combinations below work.

  • n1-standard-64 and torch-xla-nightly
  • torch-xla-0.5 and n1-standard-8

So, to use multiprocessing in the meantime, you can switch to a bigger machine.
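For example, reusing the instance name and zone from the commands at the top of this issue, resizing the existing VM would look roughly like this (a sketch; the VM must be stopped before its machine type can be changed):

gcloud compute instances stop instance-1 --zone=europe-west4-a
gcloud compute instances set-machine-type instance-1 \
  --zone=europe-west4-a \
  --machine-type=n1-standard-64
gcloud compute instances start instance-1 --zone=europe-west4-a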

yingyukexiansheng commented:

Can you tell me which fairseq version you used? I cannot find the --num_cores option in my version, which is 0.10.2. Thank you very much.
