xla/test/test_mp_rendezvous.py failing for GHA on pytorch/pytorch #3107

Closed
seemethere opened this issue Aug 31, 2021 · 10 comments

Labels: bug (Something isn't working), high priority (Issues the team would like to fix quickly)

@seemethere (Member) commented Aug 31, 2021

🐛 Bug

During the migration of XLA jobs to GitHub Actions on pytorch/pytorch, I encountered the following error (PR to migrate the workflow: pytorch/pytorch#64320).

Is there a way to debug this issue further?

Link to logs: https://github.com/pytorch/pytorch/runs/3476325065?check_suite_focus=true

Logs
+ python3 /var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py
Core 0 waiting for rendezvous ...
Core 3 waiting for rendezvous ...
Exception in device=CPU:0: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Exception in device=CPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Core 1 waiting for rendezvous ...
Exception in device=CPU:1: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Core 2 waiting for rendezvous ...
Exception in device=CPU:2: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 35, in <module>
    xmp.spawn(_mp_fn, args=())
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 17

To Reproduce

Steps to reproduce the behavior:

Expected behavior

Environment

  • Reproducible on XLA backend [CPU/TPU]:
  • torch_xla version:

Additional context

seemethere added the bug and high priority labels on Aug 31, 2021
@JackCaoG (Collaborator) commented:

@seemethere I suspect this is just a flaky test; maybe the server went down for some reason and the other processes could not finish the rendezvous. Can you rerun that CI job and check whether the test fails again?

@seemethere (Member, Author) commented:

Yup! I'll re-run this workflow to see if that resolves the issue.

@seemethere (Member, Author) commented:

Looks like the failure persists even after a re-run. Are there any other debugging steps I can take for this?

@JackCaoG (Collaborator) commented Sep 1, 2021

Is there an easy way for me to repro this on my end? The easiest way would be to comment out this line and manually add python3 "test/test_mp_rendezvous.py" to get a fast failure. You can also turn on debug logging via env vars:

export TF_CPP_MIN_LOG_LEVEL=0 
export TF_CPP_VMODULE=xrt_computation_client=5,computation_client=5

This should tell us what is going on.
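
Putting those steps together, a quick local repro might look something like this (just a sketch; the working directory and exact test path are assumptions, adjust for your checkout):

export TF_CPP_MIN_LOG_LEVEL=0
export TF_CPP_VMODULE=xrt_computation_client=5,computation_client=5
python3 test/test_mp_rendezvous.py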

@JackCaoG (Collaborator) commented:

I think this issue is fixed.

@seemethere (Member, Author) commented Nov 23, 2021

Re-opening this: I finally have time to work on this again and it is still popping up. I will go through the debugging steps that @JackCaoG recommended.

Latest failure log on this: https://github.com/pytorch/pytorch/runs/4304316948?check_suite_focus=true

seemethere reopened this on Nov 23, 2021
@seemethere (Member, Author) commented Nov 23, 2021

It appears the problem manifests when CPU_NUM_DEVICES is set to any value other than 1. I can reproduce it on our GHA machines using the following:

TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=xrt_computation_client=5,computation_client=5 CPU_NUM_DEVICES=4 python3 /var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py
Full logs
2021-11-23 22:52:55.264971: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.126920: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.126934: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.157224: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.169468: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.230182: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:277] XRT device (LOCAL) CPU:0 -> /job:localservice/replica:0/task:0/device:XLA_CPU:0
2021-11-23 22:52:56.230238: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:281] Worker grpc://localhost:36723 for /job:localservice/replica:0/task:0
2021-11-23 22:52:56.230227: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:277] XRT device (LOCAL) CPU:3 -> /job:localservice/replica:0/task:0/device:XLA_CPU:0
2021-11-23 22:52:56.230254: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:285] XRT default device: CPU:0
2021-11-23 22:52:56.230257: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:281] Worker grpc://localhost:34375 for /job:localservice/replica:0/task:0
2021-11-23 22:52:56.230247: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:277] XRT device (LOCAL) CPU:2 -> /job:localservice/replica:0/task:0/device:XLA_CPU:0
2021-11-23 22:52:56.230269: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:285] XRT default device: CPU:3
2021-11-23 22:52:56.230273: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1967] Local Service Cluster Spec: localservice|localhost:36723
2021-11-23 22:52:56.230271: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:277] XRT device (LOCAL) CPU:1 -> /job:localservice/replica:0/task:0/device:XLA_CPU:0
2021-11-23 22:52:56.230279: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:281] Worker grpc://localhost:39099 for /job:localservice/replica:0/task:0
2021-11-23 22:52:56.230286: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1967] Local Service Cluster Spec: localservice|localhost:34375
2021-11-23 22:52:56.230292: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:281] Worker grpc://localhost:43999 for /job:localservice/replica:0/task:0
2021-11-23 22:52:56.230295: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:285] XRT default device: CPU:2
2021-11-23 22:52:56.230295: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:40] Peer localservice 1 {localhost:36723}
2021-11-23 22:52:56.230306: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:40] Peer localservice 1 {localhost:34375}
2021-11-23 22:52:56.230313: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:285] XRT default device: CPU:1
2021-11-23 22:52:56.230319: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1967] Local Service Cluster Spec: localservice|localhost:39099
2021-11-23 22:52:56.230330: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1967] Local Service Cluster Spec: localservice|localhost:43999
2021-11-23 22:52:56.230336: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:40] Peer localservice 1 {localhost:39099}
2021-11-23 22:52:56.230346: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:40] Peer localservice 1 {localhost:43999}
2021-11-23 22:52:56.230388: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 22:52:56.230392: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 22:52:56.230405: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 22:52:56.230410: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 22:52:56.248724: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x55a009ca3a10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-23 22:52:56.248774: I tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Host, Default Version
2021-11-23 22:52:56.248793: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x56553ee685d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-23 22:52:56.248821: I tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Host, Default Version
2021-11-23 22:52:56.253565: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:36723}
2021-11-23 22:52:56.253693: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:39099}
2021-11-23 22:52:56.254015: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:36723
2021-11-23 22:52:56.254059: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1427] Creating mesh service bound to 089fb5252c17:45063
2021-11-23 22:52:56.254137: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:39099
Core 2 waiting for rendezvous ...
2021-11-23 22:52:56.255331: I tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) 089fb5252c17:45063
Core 0 waiting for rendezvous ...
2021-11-23 22:52:56.257533: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x55883b524a10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-23 22:52:56.257564: I tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Host, Default Version
2021-11-23 22:52:56.257599: I tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) 089fb5252c17:45063
2021-11-23 22:52:56.258705: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x55a79ca35a10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-23 22:52:56.258734: I tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Host, Default Version
2021-11-23 22:52:56.262655: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:34375}
2021-11-23 22:52:56.263131: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:34375
2021-11-23 22:52:56.264326: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:43999}
Core 3 waiting for rendezvous ...
2021-11-23 22:52:56.264777: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:43999
2021-11-23 22:52:56.265053: I tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) 089fb5252c17:45063
Core 1 waiting for rendezvous ...
2021-11-23 22:52:56.267571: I tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) 089fb5252c17:45063
Exception in device=CPU:2: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 910, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
2021-11-23 22:57:56.261735: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1623] Waiting XRT handle releaser thread ...
2021-11-23 22:57:56.261815: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1626] Waiting XRT handle releaser thread ... done!
Exception in device=CPU:0: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 910, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
2021-11-23 22:57:56.263794: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1618] Shutting down mesh service ...
2021-11-23 22:57:56.263976: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1620] Shutting down mesh service ... done!
2021-11-23 22:57:56.263997: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1623] Waiting XRT handle releaser thread ...
2021-11-23 22:57:56.264038: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1626] Waiting XRT handle releaser thread ... done!
Exception in device=CPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 910, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
2021-11-23 22:57:56.270483: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1623] Waiting XRT handle releaser thread ...
2021-11-23 22:57:56.270552: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1626] Waiting XRT handle releaser thread ... done!
Exception in device=CPU:1: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 910, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
2021-11-23 22:57:56.272592: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1623] Waiting XRT handle releaser thread ...
2021-11-23 22:57:56.272650: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1626] Waiting XRT handle releaser thread ... done!
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 35, in <module>
    xmp.spawn(_mp_fn, args=())
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 17
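
As a sanity check (not something done in the thread, just a suggestion), one could verify that the mesh master endpoint printed in the logs is directly reachable from inside the job, and whether any proxy settings are active:

# Hypothetical checks; the hostname and port come from the
# "Creating mesh service bound to 089fb5252c17:45063" log line above.
nc -zv 089fb5252c17 45063   # can workers reach the mesh master port directly?
env | grep -i proxy         # are http_proxy/https_proxy/no_proxy set for the job?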

@seemethere (Member, Author) commented:

Figured this out: it turns out the Squid proxy we were using to proxy our requests was interfering with the mesh server. I'm going to go ahead and just disable the Squid proxy for XLA-specific tests.
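
For reference, a minimal sketch of that kind of workaround, assuming the proxy is injected through the standard http_proxy/https_proxy environment variables (the actual CI change may look different):

# Hypothetical: bypass the proxy for the XLA test step so the worker processes
# can reach the mesh master on the local hostname directly.
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
export no_proxy="localhost,127.0.0.1,$(hostname)"
python3 test/test_mp_rendezvous.py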

@JackCaoG (Collaborator) commented:

Thanks @seemethere !
