Occasional ipc_test failures (macOS?) #154
Copied the output so we aren't relying on GH actions to retain it.
test/ipc_tests.cpp:11: Entering test suite "ipc_tests"
test/ipc_tests.cpp:12: Entering test case "ipc_tests"
libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
2025-02-05T16:12:45.463402Z [unknown] [test/util/random.cpp:48] [SeedRandomStateForTest] Setting random seed for current tests to RANDOM_CTX_SEED=40e3ae01ebf0dcb2dc43849f963c573142c9f0ec38096c9feb4f14563c55cb78
2025-02-05T16:12:45.472556Z [test] [init/common.cpp:153] [LogPackageVersion] Bitcoin Core version v28.99.0-cd63f7b213c7-dirty (release build)
2025-02-05T16:12:45.472716Z [test] [kernel/context.cpp:20] [operator()] Using the 'arm_shani(1way,2way)' SHA256 implementation
2025-02-05T16:12:45.483530Z [test] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client first request from current thread, constructing waiter
2025-02-05T16:12:45.483734Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.add$Params (a = 1, b = 2)
2025-02-05T16:12:45.483815Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request #1 FooInterface.add$Params (a = 1, b = 2)
2025-02-05T16:12:45.483854Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #1 FooInterface.add$Results (result = 3)
2025-02-05T16:12:45.483883Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.add$Results (result = 3)
2025-02-05T16:12:45.484420Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passOutPoint$Params (arg = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-05T16:12:45.484443Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request #2 FooInterface.passOutPoint$Params (arg = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-05T16:12:45.484458Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #2 FooInterface.passOutPoint$Results (result = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-05T16:12:45.484477Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passOutPoint$Results (result = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-05T16:12:45.484688Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passUniValue$Params (arg = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-05T16:12:45.484710Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request #3 FooInterface.passUniValue$Params (arg = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-05T16:12:45.484724Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #3 FooInterface.passUniValue$Results (result = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-05T16:12:45.484741Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passUniValue$Results (result = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-05T16:12:45.484770Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passTransaction$Params (arg = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-05T16:12:45.484790Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request #4 FooInterface.passTransaction$Params (arg = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-05T16:12:45.484966Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #4 FooInterface.passTransaction$Results (result = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-05T16:12:45.485063Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passTransaction$Results (result = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-05T16:12:45.485124Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passVectorChar$Params (arg = "Hello")
2025-02-05T16:12:45.485171Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request #5 FooInterface.passVectorChar$Params (arg = "Hello")
2025-02-05T16:12:45.485198Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #5 FooInterface.passVectorChar$Results (result = "Hello")
2025-02-05T16:12:45.485239Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passVectorChar$Results (result = "Hello")
2025-02-05T16:12:45.485430Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passBlockState$Params (arg = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-05T16:12:45.485488Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request #6 FooInterface.passBlockState$Params (arg = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-05T16:12:45.485525Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #6 FooInterface.passBlockState$Results (result = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-05T16:12:45.485571Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passBlockState$Results (result = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-05T16:12:45.485619Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passBlockState$Params (arg = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-05T16:12:45.485663Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request #7 FooInterface.passBlockState$Params (arg = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-05T16:12:45.485693Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #7 FooInterface.passBlockState$Results (result = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-05T16:12:45.485737Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passBlockState$Results (result = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-05T16:12:45.485793Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passScript$Params (arg = "[")
2025-02-05T16:12:45.485833Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request #8 FooInterface.passScript$Params (arg = "[")
2025-02-05T16:12:45.485858Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #8 FooInterface.passScript$Results (result = "[")
2025-02-05T16:12:45.485898Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passScript$Results (result = "[")
2025-02-05T16:12:45.486162Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server destroy N2mp11ProxyServerIN3gen12FooInterfaceEEE
2025-02-05T16:12:45.486271Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} EventLoop::loop done, cancelling event listeners.
2025-02-05T16:12:45.486283Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} EventLoop::loop bye.
132/140 Test #135: wallet_crypto_tests .................. Passed 3.05 sec
133/140 Test #134: spend_tests .......................... Passed 3.68 sec
134/140 Test #8: secp256k1_exhaustive_tests ........... Passed 27.97 sec
135/140 Test #139: walletload_tests ..................... Passed 4.45 sec
136/140 Test #10: bench_sanity_check_high_priority ..... Passed 28.82 sec
137/140 Test #136: wallet_tests ......................... Passed 9.35 sec
138/140 Test #127: coinselector_tests ................... Passed 20.56 sec
139/140 Test #6: secp256k1_noverify_tests ............. Passed 42.48 sec
Errors while running CTest
140/140 Test #7: secp256k1_tests ...................... Passed 59.80 sec
99% tests passed, 1 tests failed out of 140
Total Test time (real) = 59.82 sec
The following tests FAILED:
140 - ipc_tests (Subprocess aborted)
This is probably caused by the same bug reported in #128 based on the "mutex lock failed: Invalid argument" error. But the version reported here might be easier to debug because the crash is happening in a unit test instead of in a python test. If you can reproduce this problem locally sometimes, maybe you could run it in a loop until it crashes and try to get a stack trace? This script might work:
(script from chatgpt of course...)
I couldn't reproduce this, but I think I can see how this problem might be happening. The "mutex lock failed" exception is thrown from the libc++ mutex::lock method when there is any failure to lock the mutex. In this case I think the exception might be getting thrown by the `Unlock()` call inside `EventLoop::removeClient`, when it tries to reacquire `EventLoop::m_mutex` after waking the event loop thread. The problem is that while the event loop thread is destroying the event loop object, the thread that called `removeClient` may still be trying to relock the event loop mutex, which may already have been destroyed. I think a simple fix for this would look like:

--- a/src/ipc/libmultiprocess/src/mp/proxy.cpp
+++ b/src/ipc/libmultiprocess/src/mp/proxy.cpp
@@ -240,10 +240,12 @@ void EventLoop::removeClient(std::unique_lock<std::mutex>& lock)
if (done(lock)) {
m_cv.notify_all();
int post_fd{m_post_fd};
- Unlock(lock, [&] {
- char buffer = 0;
- KJ_SYSCALL(write(post_fd, &buffer, 1)); // NOLINT(bugprone-suspicious-semicolon)
- });
+ // Intentionally do not unlock/relock the mutex around this final write
+ // call, so clients are free to destroy the EventLoop object immediately
+ // after the loop() method exits, and code here will not be trying to
+ // relock the event loop mutex after it was destroyed.
+ char buffer = 0;
+ KJ_SYSCALL(write(post_fd, &buffer, 1)); // NOLINT(bugprone-suspicious-semicolon)
}
}
This change is intended to fix occasional ipc_tests unit test and rpc_misc.py functional test errors that happen on macOS, which sometimes fail with the error "terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument" as reported in bitcoin-core/libmultiprocess#154.

This error could happen when `m_loop.reset();` in `CapnpProtocol::startLoop` executes before the `m_loop->removeClient(lock);` call in the `~CapnpProtocol` destructor returns, causing a failure to reacquire the `EventLoop::m_mutex` inside the `removeClient` method.

Fix this error by adding a new mutex to guard access to the `CapnpProtocol::m_loop` variable and making sure `m_loop.reset();` cannot be called until after `m_loop->removeClient(lock);` returns.
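For illustration, here is a rough sketch of the ordering that commit message describes: a new mutex and condition variable ensure `m_loop.reset()` on the event loop thread cannot run until `removeClient()` has returned on the thread destroying the protocol object. The member names (`m_loop_mutex`, `m_loop_cv`, `m_destroy_loop`) and the `EventLoopStub` stand-in for `mp::EventLoop` are assumptions for the sake of a self-contained example; this is not the actual Bitcoin Core patch.

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

struct EventLoopStub {                      // stand-in for mp::EventLoop
    std::mutex m_mutex;
    void loop() {}                          // placeholder for the real event loop
    void removeClient(std::unique_lock<std::mutex>&) {}
};

class CapnpProtocolSketch {
public:
    CapnpProtocolSketch()
        : m_loop(std::make_unique<EventLoopStub>()), m_thread([this] { startLoop(); }) {}

    ~CapnpProtocolSketch()
    {
        {
            std::unique_lock<std::mutex> lock(m_loop->m_mutex);
            m_loop->removeClient(lock);     // may let loop() return on the loop thread
        }
        {
            // Only after removeClient() has fully returned do we allow the
            // loop thread to destroy the EventLoop object.
            std::lock_guard<std::mutex> lock(m_loop_mutex);
            m_destroy_loop = true;
        }
        m_loop_cv.notify_all();
        m_thread.join();
    }

private:
    void startLoop()
    {
        m_loop->loop();
        std::unique_lock<std::mutex> lock(m_loop_mutex);
        m_loop_cv.wait(lock, [this] { return m_destroy_loop; });
        m_loop.reset();                     // now safe: removeClient() is done
    }

    std::unique_ptr<EventLoopStub> m_loop;
    std::mutex m_loop_mutex;                // new mutex guarding m_loop teardown
    std::condition_variable m_loop_cv;
    bool m_destroy_loop{false};
    std::thread m_thread;
};
```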
The change in bitcoin/bitcoin#31815 should be a better fix for this issue than the change suggested in #154 (comment), which is a less direct fix and could also theoretically cause deadlocks if the `write()` call were ever to block while `EventLoop::m_mutex` is held. Assuming the issue is the `m_loop.reset()` call racing with `removeClient()` as described above, bitcoin/bitcoin#31815 should prevent it.
Currently, there are two cases where code may attempt to lock `EventLoop::mutex` after the `EventLoop::loop()` method has finished running. This is not a problem by itself, but it is a problem if there is external code which deletes the `EventLoop` object immediately after the method returns, which is the case in Bitcoin Core. Both cases of locking after the loop method returns happen due to uses of the `Unlock()` utility function, which unlocks a mutex temporarily, runs a callback function and relocks it.

The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reached and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done, there is not really any point to using `Unlock()` and relocking, so the code is changed to use a simple `lock.unlock()` call and not try to relock.

The second case happens in `EventLoop::post`, where `Unlock()` is used in a similar way, and depending on thread scheduling (if the thread is delayed for a long time before relocking) can result in failing to relock `EventLoop::m_mutex` after calling `write()`. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are added to this code, so the `EventLoop::loop()` function will never exit while the `post()` function is waiting.

These two changes are being labeled as a bugfix even though there is not technically a bug here in the libmultiprocess code or API. The `EventLoop` object itself was safe before these changes as long as outside code waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object. Originally the multiprocess code added in bitcoin/bitcoin#10102 did this, but later, as more features were added for binding and connecting to unix sockets, and unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned, this code could start failing to relock after unlocking.

A previous attempt was made to fix this bug in bitcoin/bitcoin#31815 outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient()`, not the second case in `EventLoop::post()`. The first case described above was not observed but is theoretically possible. The second case happens intermittently on macOS and was reported in bitcoin-core#154.
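To make the failure mode concrete, here is a minimal sketch of the `Unlock()` pattern and of the changed `removeClient()` path described above. The helper's exact signature, the `EventLoopSketch` type, and the `done()` placeholder are assumptions for illustration, not the actual libmultiprocess code.

```cpp
#include <mutex>
#include <unistd.h>

// Temporarily unlock `lock`, run `callback`, then relock. If the EventLoop
// (and its mutex) is destroyed while the lock is released, the relock at the
// end is what fails with "mutex lock failed: Invalid argument" on macOS.
template <typename Lock, typename Callback>
void Unlock(Lock& lock, Callback&& callback)
{
    lock.unlock();
    callback();
    lock.lock();
}

struct EventLoopSketch {
    std::mutex m_mutex;
    int m_post_fd{-1};
    bool done(std::unique_lock<std::mutex>&) const { return true; } // placeholder

    void removeClient(std::unique_lock<std::mutex>& lock)
    {
        if (done(lock)) {
            int post_fd{m_post_fd};
            // Before: Unlock(lock, [&] { char buffer = 0; write(post_fd, &buffer, 1); });
            // which relocks m_mutex afterwards and can race with the loop thread
            // destroying the EventLoop as soon as loop() returns.
            // After: release the lock for good before waking the loop thread, so
            // nothing touches m_mutex once loop() may have exited.
            lock.unlock();
            char buffer = 0;
            (void)write(post_fd, &buffer, 1);
        }
    }
};
```

The second case in `EventLoop::post()` keeps the relock but, per the description above, registers itself as a client so that `loop()` cannot exit while `post()` is still waiting.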
Since the change in bitcoin/bitcoin#31815 didn't work (see comments there), I set up a macOS VM using quickemu and was able to reproduce this bug locally, get a stack trace, and fix it. Will keep this issue open to see if the fix works in CI, but #159 is the change, and the stack trace looked like:

(lldb) bt
* thread #1, name = 'b-test', queue = 'com.apple.main-thread', stop reason = signal SIGABRT
* frame #0: 0x00007ff80922a196 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007ff809261ee6 libsystem_pthread.dylib`pthread_kill + 263
frame #2: 0x00007ff809188b45 libsystem_c.dylib`abort + 123
frame #3: 0x000000010290aa23 libc++abi.1.dylib`abort_message + 195
frame #4: 0x00000001028e8179 libc++abi.1.dylib`demangling_terminate_handler() + 233
frame #5: 0x0000000102908e76 libc++abi.1.dylib`std::__terminate(void (*)()) + 6
frame #6: 0x0000000102908e18 libc++abi.1.dylib`std::terminate() + 56
frame #7: 0x000000010000e6c2 test_bitcoin`__clang_call_terminate + 18
frame #8: 0x00000001011f3ab2 test_bitcoin`mp::EventLoop::post(std::__1::function<void ()> const&) [inlined] mp::UnlockGuard<std::__1::unique_lock<std::__1::mutex> >::~UnlockGuard(this=0x00007ff7bfefe1c8) at util.h:138:29 [opt]
frame #9: 0x00000001011f3aa6 test_bitcoin`mp::EventLoop::post(std::__1::function<void ()> const&) [inlined] mp::UnlockGuard<std::__1::unique_lock<std::__1::mutex> >::~UnlockGuard(this=0x00007ff7bfefe1c8) at util.h:138:20 [opt]
frame #10: 0x00000001011f3aa6 test_bitcoin`mp::EventLoop::post(std::__1::function<void ()> const&) [inlined] void mp::Unlock<std::__1::unique_lock<std::__1::mutex>, mp::EventLoop::post(std::__1::function<void ()> const&)::$_1>(lock=0x00007ff7bfefe1f0, callback=<unavailable>)> const&)::$_1&&) at util.h:147:1 [opt]
frame #11: 0x00000001011f3aa6 test_bitcoin`mp::EventLoop::post(this=0x0000000102e12240, fn=0x00007ff7bfefe250)> const&) at proxy.cpp:231:5 [opt]
frame #12: 0x0000000100fc5143 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] void mp::EventLoop::sync<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()::operator()() const::'lambda'()>(this=<unavailable>, callable=0x00007ff7bfefe240)::'lambda0'()::operator()() const::'lambda'()&&) at proxy-io.h:155:9 [opt]
frame #13: 0x0000000100fc511c test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(this=<unavailable>)::'lambda0'()::operator()() const at proxy-io.h:426:38 [opt]
frame #14: 0x0000000100fc50f1 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] decltype(__f=<unavailable>)::'lambda0'()&>()()) std::__1::__invoke[abi:ne190107]<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()&>(mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()&) at invoke.h:149:25 [opt]
frame #15: 0x0000000100fc50f1 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ne190107]<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(__args=<unavailable>)::'lambda0'()&>(mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()&) at invoke.h:224:5 [opt]
frame #16: 0x0000000100fc50f1 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] std::__1::__function::__alloc_func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator(this=<unavailable>)[abi:ne190107]() at function.h:171:12 [opt]
frame #17: 0x0000000100fc50f1 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator(this=<unavailable>)() at function.h:313:10 [opt]
frame #18: 0x00000001011f6a07 test_bitcoin`mp::CleanupRun(std::__1::list<std::__1::function<void ()>, std::__1::allocator<std::__1::function<void ()> > >&) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x00007ff7bfefe2b0)[abi:ne190107]() const at function.h:430:12 [opt]
frame #19: 0x00000001011f69f8 test_bitcoin`mp::CleanupRun(std::__1::list<std::__1::function<void ()>, std::__1::allocator<std::__1::function<void ()> > >&) [inlined] std::__1::function<void ()>::operator(this=0x00007ff7bfefe2b0)() const at function.h:989:10 [opt]
frame #20: 0x00000001011f69f8 test_bitcoin`mp::CleanupRun(fns=size=0)>, std::__1::allocator<std::__1::function<void ()> > >&) at proxy.h:46:9 [opt]
frame #21: 0x0000000100fc558e test_bitcoin`mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::~ProxyClientBase(this=0x000060000171e880) at proxy-io.h:449:5 [opt]
frame #22: 0x0000000100ff1a38 test_bitcoin`mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient() [inlined] mp::ProxyClientCustom<ipc::capnp::messages::Init, interfaces::Init>::~ProxyClientCustom(this=0x000060000171e880) at proxy.h:112:7 [opt]
frame #23: 0x0000000100ff1a30 test_bitcoin`mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient() [inlined] mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient(this=0x000060000171e880) at init.capnp.proxy-types.c++:7:82 [opt]
frame #24: 0x0000000100ff1a1d test_bitcoin`mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient() [inlined] mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient(this=0x000060000171e880) at init.capnp.proxy-types.c++:7:57 [opt]
frame #25: 0x0000000100ff1a1d test_bitcoin`mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient(this=0x000060000171e880) at init.capnp.proxy-types.c++:7:57 [opt]
frame #26: 0x0000000100f9a69b test_bitcoin`IpcSocketPairTest() [inlined] std::__1::default_delete<interfaces::Init>::operator(this=0x00007ff7bfefe3e8, __ptr=<unavailable>)[abi:ne190107](interfaces::Init*) const at unique_ptr.h:80:5 [opt]
frame #27: 0x0000000100f9a695 test_bitcoin`IpcSocketPairTest() [inlined] std::__1::unique_ptr<interfaces::Init, std::__1::default_delete<interfaces::Init> >::reset[abi:ne190107](this=0x00007ff7bfefe3e8, __p=0x0000000000000000) at unique_ptr.h:292:7 [opt]
frame #28: 0x0000000100f9a67e test_bitcoin`IpcSocketPairTest() at ipc_test.cpp:149:17 [opt]
frame #29: 0x0000000100a0587a test_bitcoin`ipc_tests::ipc_tests::test_method(this=0x00007ff7bfefe710) at ipc_tests.cpp:16:5 [opt]
frame #30: 0x0000000100a05742 test_bitcoin`ipc_tests::ipc_tests_invoker() at ipc_tests.cpp:12:1 [opt]
frame #31: 0x0000000100054581 test_bitcoin`boost::detail::function::function_obj_invoker<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) [inlined] boost::function_n<void>::operator(this=<unavailable>)() const at function_template.hpp:789:14 [opt]
frame #32: 0x000000010005456e test_bitcoin`boost::detail::function::function_obj_invoker<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) [inlined] boost::detail::forward::operator(this=<unavailable>)() at execution_monitor.ipp:1405:32 [opt]
frame #33: 0x000000010005456b test_bitcoin`boost::detail::function::function_obj_invoker<boost::detail::forward, int>::invoke(function_obj_ptr=<unavailable>) at function_template.hpp:79:18 [opt]
frame #34: 0x0000000100018786 test_bitcoin`boost::execution_monitor::catch_signals(boost::function<int ()> const&) [inlined] boost::function_n<int>::operator(this=<unavailable>)() const at function_template.hpp:789:14 [opt]
frame #35: 0x0000000100018770 test_bitcoin`boost::execution_monitor::catch_signals(boost::function<int ()> const&) [inlined] int boost::detail::do_invoke<boost::shared_ptr<boost::detail::translator_holder_base>, boost::function<int ()> >(tr=0x0000000101660ca0, F=<unavailable>)> const&) at execution_monitor.ipp:318:30 [opt]
frame #36: 0x000000010001875c test_bitcoin`boost::execution_monitor::catch_signals(this=0x0000000101660c88, F=0x00007ff7bfeff030)> const&) at execution_monitor.ipp:920:16 [opt]
frame #37: 0x0000000100018a2b test_bitcoin`boost::execution_monitor::execute(this=0x0000000101660c88, F=0x00007ff7bfeff030)> const&) at execution_monitor.ipp:1318:16 [opt]
frame #38: 0x0000000100013d4e test_bitcoin`boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned long) [inlined] boost::execution_monitor::vexecute(this=0x0000000101660c88, F=0x0000000102c21018)> const&) at execution_monitor.ipp:1414:5 [opt]
frame #39: 0x0000000100013d2f test_bitcoin`boost::unit_test::unit_test_monitor_t::execute_and_translate(this=0x0000000101660c88, func=0x0000000102c21018, timeout_microseconds=0)> const&, unsigned long) at unit_test_monitor.ipp:49:9 [opt]
frame #40: 0x0000000100015543 test_bitcoin`boost::unit_test::framework::state::execute_test_tree(this=0x0000000101660d60, tu_id=66179, timeout_microseconds=0, p_random_generator=0x00007ff7bfeff300) at framework.ipp:815:44 [opt]
frame #41: 0x0000000100015b68 test_bitcoin`boost::unit_test::framework::state::execute_test_tree(this=0x0000000101660d60, tu_id=131, timeout_microseconds=0, p_random_generator=0x00007ff7bfeff4a0) at framework.ipp:740:54 [opt]
frame #42: 0x0000000100015b68 test_bitcoin`boost::unit_test::framework::state::execute_test_tree(this=0x0000000101660d60, tu_id=1, timeout_microseconds=0, p_random_generator=0x0000000000000000) at framework.ipp:740:54 [opt]
frame #43: 0x0000000100012ca8 test_bitcoin`boost::unit_test::framework::run(id=1, continue_test=<unavailable>) at framework.ipp:1722:29 [opt]
frame #44: 0x00000001000248c3 test_bitcoin`boost::unit_test::unit_test_main(init_func=<unavailable>, argc=<unavailable>, argv=<unavailable>)(int, char**), int, char**) at unit_test_main.ipp:250:9 [opt]
frame #45: 0x00007ff808f07418 dyld`start + 1896
Included in the latest push: bitcoin/bitcoin#30975 (comment)
Will close this issue since I am pretty confident about the fix in #159 and we haven't seen this problem recur since it's been incorporated.
I'm seeing this test fail occasionally, both locally (on macOS) and on CI:
https://github.com/bitcoin/bitcoin/actions/runs/13161509993/job/36731182334?pr=31802#step:7:2568
Not sure if it's a macOS-specific thing; that's just based on the above CI run, which also happens to be macOS. I'll keep an eye on it.