xla/test/test_mp_rendezvous.py failing for GHA on pytorch/pytorch #3107

Closed
seemethere opened this issue Aug 31, 2021 · 10 comments

Labels: bug (Something isn't working), high priority (Issues the team would like to fix quickly)

@seemethere (Member) commented Aug 31, 2021

🐛 Bug

During the migration of XLA jobs to GitHub Actions on pytorch/pytorch, I encountered the following error (PR to migrate the workflow: pytorch/pytorch#64320).

Is there a way to debug this issue further?

Link to logs: https://github.com/pytorch/pytorch/runs/3476325065?check_suite_focus=true

Logs
+ python3 /var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py
Core 0 waiting for rendezvous ...
Core 3 waiting for rendezvous ...
Exception in device=CPU:0: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Exception in device=CPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Core 1 waiting for rendezvous ...
Exception in device=CPU:1: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Core 2 waiting for rendezvous ...
Exception in device=CPU:2: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds)) 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::service::MeshClient::Get()
	
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	PyObject_Call
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	_PyEval_EvalFrameDefault
	
	
	
	_PyEval_EvalFrameDefault
	
	PyEval_EvalCode
	
	PyRun_StringFlags
	PyRun_SimpleStringFlags
	Py_Main
	main
	__libc_start_main
	
*** End stack trace ***
Failed to connect to client mesh master: 2c7b82f21fc7:49515
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 35, in <module>
    xmp.spawn(_mp_fn, args=())
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 17

To Reproduce

Steps to reproduce the behavior:

Expected behavior

Environment

  • Reproducible on XLA backend [CPU/TPU]:
  • torch_xla version:

Additional context

seemethere added the bug and high priority labels on Aug 31, 2021
@JackCaoG (Collaborator) commented:

@seemethere I suspect this is just a flaky test; maybe the server went down for some reason and the other processes could not finish the rendezvous. Can you rerun that CI job and check whether the test fails again?

@seemethere (Member, Author) commented:

Yup! I'll re-run this workflow to see if that resolves the issue.

@seemethere (Member, Author) commented:

Looks like the failure persists even after a re-run. Are there any other debugging steps I can take for this?

@JackCaoG (Collaborator) commented Sep 1, 2021

Is there an easy way for me to repro this on my end? The easiest way would be to comment out this line and manually add python3 "test/test_mp_rendezvous.py" to get a fast failure. You can also turn on debug logging via env vars:

export TF_CPP_MIN_LOG_LEVEL=0 
export TF_CPP_VMODULE=xrt_computation_client=5,computation_client=5

This should tell us what is going on.
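
Putting those steps together, a quick local repro might look something like this (just a sketch; the working directory and exact test path are assumptions, adjust for your checkout):

export TF_CPP_MIN_LOG_LEVEL=0
export TF_CPP_VMODULE=xrt_computation_client=5,computation_client=5
python3 test/test_mp_rendezvous.py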

@JackCaoG (Collaborator) commented:

I think this issue is fixed.

@seemethere (Member, Author) commented Nov 23, 2021

Re-opening this: I finally have time to work on this again and it is still popping up. I will go through the debugging steps that @JackCaoG recommended.

Latest failure log on this: https://github.com/pytorch/pytorch/runs/4304316948?check_suite_focus=true

seemethere reopened this on Nov 23, 2021
@seemethere (Member, Author) commented Nov 23, 2021

It appears the problem manifests when CPU_NUM_DEVICES is set to any value other than 1. I can reproduce it on our GHA machines using the following:

TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=xrt_computation_client=5,computation_client=5 CPU_NUM_DEVICES=4 python3 /var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py
Full logs
2021-11-23 22:52:55.264971: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.126920: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.126934: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.157224: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.169468: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
2021-11-23 22:52:56.230182: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:277] XRT device (LOCAL) CPU:0 -> /job:localservice/replica:0/task:0/device:XLA_CPU:0
2021-11-23 22:52:56.230238: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:281] Worker grpc://localhost:36723 for /job:localservice/replica:0/task:0
2021-11-23 22:52:56.230227: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:277] XRT device (LOCAL) CPU:3 -> /job:localservice/replica:0/task:0/device:XLA_CPU:0
2021-11-23 22:52:56.230254: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:285] XRT default device: CPU:0
2021-11-23 22:52:56.230257: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:281] Worker grpc://localhost:34375 for /job:localservice/replica:0/task:0
2021-11-23 22:52:56.230247: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:277] XRT device (LOCAL) CPU:2 -> /job:localservice/replica:0/task:0/device:XLA_CPU:0
2021-11-23 22:52:56.230269: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:285] XRT default device: CPU:3
2021-11-23 22:52:56.230273: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1967] Local Service Cluster Spec: localservice|localhost:36723
2021-11-23 22:52:56.230271: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:277] XRT device (LOCAL) CPU:1 -> /job:localservice/replica:0/task:0/device:XLA_CPU:0
2021-11-23 22:52:56.230279: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:281] Worker grpc://localhost:39099 for /job:localservice/replica:0/task:0
2021-11-23 22:52:56.230286: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1967] Local Service Cluster Spec: localservice|localhost:34375
2021-11-23 22:52:56.230292: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:281] Worker grpc://localhost:43999 for /job:localservice/replica:0/task:0
2021-11-23 22:52:56.230295: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:285] XRT default device: CPU:2
2021-11-23 22:52:56.230295: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:40] Peer localservice 1 {localhost:36723}
2021-11-23 22:52:56.230306: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:40] Peer localservice 1 {localhost:34375}
2021-11-23 22:52:56.230313: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:285] XRT default device: CPU:1
2021-11-23 22:52:56.230319: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1967] Local Service Cluster Spec: localservice|localhost:39099
2021-11-23 22:52:56.230330: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1967] Local Service Cluster Spec: localservice|localhost:43999
2021-11-23 22:52:56.230336: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:40] Peer localservice 1 {localhost:39099}
2021-11-23 22:52:56.230346: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:40] Peer localservice 1 {localhost:43999}
2021-11-23 22:52:56.230388: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 22:52:56.230392: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 22:52:56.230405: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 22:52:56.230410: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 22:52:56.248724: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x55a009ca3a10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-23 22:52:56.248774: I tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Host, Default Version
2021-11-23 22:52:56.248793: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x56553ee685d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-23 22:52:56.248821: I tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Host, Default Version
2021-11-23 22:52:56.253565: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:36723}
2021-11-23 22:52:56.253693: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:39099}
2021-11-23 22:52:56.254015: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:36723
2021-11-23 22:52:56.254059: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1427] Creating mesh service bound to 089fb5252c17:45063
2021-11-23 22:52:56.254137: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:39099
Core 2 waiting for rendezvous ...
2021-11-23 22:52:56.255331: I tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) 089fb5252c17:45063
Core 0 waiting for rendezvous ...
2021-11-23 22:52:56.257533: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x55883b524a10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-23 22:52:56.257564: I tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Host, Default Version
2021-11-23 22:52:56.257599: I tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) 089fb5252c17:45063
2021-11-23 22:52:56.258705: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x55a79ca35a10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-23 22:52:56.258734: I tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Host, Default Version
2021-11-23 22:52:56.262655: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:34375}
2021-11-23 22:52:56.263131: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:34375
2021-11-23 22:52:56.264326: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:43999}
Core 3 waiting for rendezvous ...
2021-11-23 22:52:56.264777: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://localhost:43999
2021-11-23 22:52:56.265053: I tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) 089fb5252c17:45063
Core 1 waiting for rendezvous ...
2021-11-23 22:52:56.267571: I tensorflow/compiler/xla/xla_client/mesh_service.cc:312] Waiting to connect to client mesh master (300 seconds) 089fb5252c17:45063
Exception in device=CPU:2: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 910, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
2021-11-23 22:57:56.261735: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1623] Waiting XRT handle releaser thread ...
2021-11-23 22:57:56.261815: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1626] Waiting XRT handle releaser thread ... done!
Exception in device=CPU:0: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 910, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
2021-11-23 22:57:56.263794: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1618] Shutting down mesh service ...
2021-11-23 22:57:56.263976: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1620] Shutting down mesh service ... done!
2021-11-23 22:57:56.263997: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1623] Waiting XRT handle releaser thread ...
2021-11-23 22:57:56.264038: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1626] Waiting XRT handle releaser thread ... done!
Exception in device=CPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 910, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
2021-11-23 22:57:56.270483: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1623] Waiting XRT handle releaser thread ...
2021-11-23 22:57:56.270552: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1626] Waiting XRT handle releaser thread ... done!
Exception in device=CPU:1: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
    replicas=replicas)
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 910, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:316 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
        xla::service::MeshClient::Get()



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        PyObject_Call
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyRun_StringFlags
        PyRun_SimpleStringFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***
Failed to connect to client mesh master: 089fb5252c17:45063
2021-11-23 22:57:56.272592: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1623] Waiting XRT handle releaser thread ...
2021-11-23 22:57:56.272650: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1626] Waiting XRT handle releaser thread ... done!
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 35, in <module>
    xmp.spawn(_mp_fn, args=())
  File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.11-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
    start_method=start_method)
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 144, in join
    exit_code=exitcode
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 17
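
As a sanity check (not something done in the thread, just a suggestion), one could verify that the mesh master endpoint printed in the logs is directly reachable from inside the job, and whether any proxy settings are active:

# Hypothetical checks; the hostname and port come from the
# "Creating mesh service bound to 089fb5252c17:45063" log line above.
nc -zv 089fb5252c17 45063   # can workers reach the mesh master port directly?
env | grep -i proxy         # are http_proxy/https_proxy/no_proxy set for the job?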

@seemethere (Member, Author) commented:

Figured this out: it turns out the Squid proxy we were using to proxy our requests was interfering with the mesh server. I'm going to go ahead and just disable the Squid proxy for XLA-specific tests.
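
For reference, a minimal sketch of that kind of workaround, assuming the proxy is injected through the standard http_proxy/https_proxy environment variables (the actual CI change may look different):

# Hypothetical: bypass the proxy for the XLA test step so the worker processes
# can reach the mesh master on the local hostname directly.
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
export no_proxy="localhost,127.0.0.1,$(hostname)"
python3 test/test_mp_rendezvous.py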

@JackCaoG (Collaborator) commented:

Thanks @seemethere !
