xla/test/test_mp_rendezvous.py failing for GHA on pytorch/pytorch
#3107
Comments
@seemethere I suspect this is just a flaky test; maybe the server goes down for some reason and the other process cannot finish the rendezvous. Can you rerun that CI and check whether the test fails again? |
Yup! I'll re-run this workflow to see if that resolves the issue |
Looks like the failure persists even after a re-run. Are there any other debugging steps I can take for this? |
Is there an easy way for me to repro this on my end? The easiest way would be to comment out this line and manually add
This should tell us what is going on. |
I think this issue is fixed. |
Re-opening this. I finally have time to work on this again, and it is still popping up. I will go through the debugging steps that @JackCaoG recommended. Latest failure log: https://github.com/pytorch/pytorch/runs/4304316948?check_suite_focus=true |
Appears as though the problem manifests itself when setting
Full logs
|
Figured this out: the squid proxy we were using to proxy our requests was interfering with the mesh server. I'm going to go ahead and just disable the squid proxy for xla-specific tests. |
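For anyone hitting the same thing, here is a minimal sketch of one way to run a test without a proxy. It assumes the proxy is injected through the conventional `http_proxy`/`https_proxy` style environment variables; the thread does not say how squid was wired into the CI job, so treat that as an assumption. The helper names `env_without_proxy` and `run_without_proxy` are hypothetical and are not part of torch_xla or the pytorch CI scripts.

```python
import os
import subprocess
import sys

# Assumption: the proxy reaches the test through these conventional
# environment variables (checked in any letter case).
PROXY_VARS = ("http_proxy", "https_proxy", "all_proxy", "ftp_proxy")


def env_without_proxy(env=None):
    """Return a copy of env with the proxy variables removed."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k.lower() not in PROXY_VARS}


def run_without_proxy(cmd, env=None):
    """Run cmd in a child process whose environment has no proxy variables."""
    return subprocess.run(cmd, env=env_without_proxy(env), check=False)
```

Something like `run_without_proxy([sys.executable, "xla/test/test_mp_rendezvous.py"])` would then start the test with the proxy variables stripped, while the rest of the job keeps using the proxy.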
Thanks @seemethere ! |
🐛 Bug
During the migration of xla jobs to GitHub Actions on pytorch/pytorch, I encountered the following error (PR to migrate workflow: pytorch/pytorch#64320). Is there a way to debug this issue further?
Link to logs: https://github.com/pytorch/pytorch/runs/3476325065?check_suite_focus=true
Logs
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment
Additional context