machine translation validation fails with multi-process #1280
Hello, this is not a fatal error, right? The process should keep going after you see this message in stderr, can you confirm? This was discussed here. As far as I can tell, this issue is not really related to TPUs and is benign.
Thanks for the information. It is then followed by a crash:
| epoch 001 | valid on xla:0/1 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
| epoch 001 | valid on xla:0/7 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
| epoch 001 | valid on xla:0/2 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
| epoch 001 | valid on xla:0/4 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
| epoch 001 | valid on xla:0/3 'valid' subset | loss 5.485 | nll_loss 3.768 | ppl 13.62 | num_updates 4167
Traceback (most recent call last):
File "train.py", line 632, in <module>
cli_main()
File "train.py", line 623, in cli_main
xmp.spawn(_mp_fn, args=(args,), nprocs=args.num_cores)
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 154, in spawn
_start_fn, args=(fn, args), nprocs=nprocs, join=join, daemon=daemon)
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 0 terminated with signal SIGKILL
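For context, the exception above comes from torch_xla's multiprocessing launcher. A minimal sketch of the xmp.spawn pattern it wraps is shown below; the per-process function here is a hypothetical stand-in for fairseq's actual _mp_fn, not the real training code.

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index, args):
    # xmp.spawn calls this once per process; each process acquires its
    # own XLA device. If any child is killed (e.g. by the OOM killer),
    # the parent raises the "terminated with signal SIGKILL" exception
    # seen in the traceback above.
    device = xm.xla_device()
    print(f"process {index} on {device}, args={args}")


if __name__ == "__main__":
    # nprocs maps to the --num_cores flag used by the TPU branch of fairseq.
    xmp.spawn(_mp_fn, args=({"max_epoch": 1},), nprocs=8)
```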
I see. Is there another error message you see? Something legitimately errors in the code, but this is independent of the semaphore warning.
Oh, I just noticed that the commands above create a VM in Europe, while the TPUs are in the US. Can you retry with the same region?
Sorry, that's a typo; the TPU and the VM instance are both in Europe.
I am trying to repro currently. I'll report back whether epoch 1 validation errors out or completes.
Many thanks for helping! I am also starting a new run to see if it reports the same issue.
You can also reproduce this error by just adding an early exit inside train_loop_fn, so you don't have to wait for epoch 1 training to finish.
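The exact statement suggested above is not preserved in the thread; a sketch of one way to force validation early, assuming a plain step-count break inside a hypothetical train_loop_fn, might look like this:

```python
def train_loop_fn(loader, max_steps=10):
    # Hypothetical early exit: the real fairseq train_loop_fn runs the
    # forward/backward/optimizer step here; this sketch only counts steps
    # so that the epoch-1 validation path is reached almost immediately.
    for step, batch in enumerate(loader):
        if step + 1 >= max_steps:
            break
```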
So I created a new VM + TPU and ran through the tutorial. The process indeed died as described in the issue, around validation step ~300; it received a SIGKILL. The reason for this is, I believe, that the host runs out of memory during multi-process validation.
I have verified that the combination works. Does that make sense?
Yes, it makes sense.
Thanks for the suggestion; that seems like a useful commit indeed. We plan to rebase our TPU branch on top of fairseq master, which will include this change too. Feel free to submit a PR if you have already cherry-picked that commit and resolved the conflicts. I verified that both combinations below work.
So, to use multiprocessing in the meantime, you can switch to a bigger machine.
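Since the suggested workaround is a bigger machine, the SIGKILL above is presumably the host running out of memory across the spawned processes. One way to check that hypothesis, as an illustration that is not part of the original thread, is to log per-process memory alongside host availability with psutil:

```python
import os

import psutil


def log_memory(tag):
    # Resident memory of this process plus what the host still has free.
    # With eight spawned workers each holding its own copy of the data,
    # available memory dropping toward zero around validation would be
    # consistent with an OOM-killer SIGKILL.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    avail_gb = psutil.virtual_memory().available / 1e9
    print(f"[{tag}] pid={os.getpid()} rss={rss_gb:.2f} GB, host free={avail_gb:.2f} GB")


if __name__ == "__main__":
    log_memory("startup")
```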
Can you tell me which fairseq version you used? I cannot find the --num_cores option in my version (0.10.2). Thank you very much.
❓ Questions and Help
To Reproduce
Steps to reproduce the behavior:
Follow the machine translation tutorial using the torch-xla-nightly conda environment.
After the first epoch, during validation, it reports:
/anaconda3/envs/torch-xla-nightly/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown len(cache))
and then crashes. No checkpoint is saved either.
Expected behavior
It crashes with the SIGKILL from multiprocessing:
Environment