Occasional ipc_test failures (macOS?) #154

Closed
Sjors opened this issue Feb 6, 2025 · 7 comments

@Sjors
Member

Sjors commented Feb 6, 2025

I'm seeing this test fail occasionally, both locally (on macOS) and on CI:
https://github.com/bitcoin/bitcoin/actions/runs/13161509993/job/36731182334?pr=31802#step:7:2568

I'm not sure if it's a macOS-specific thing; that guess is just based on the above CI run, which also happens to be on macOS. I'll keep an eye on it.

@fanquake
Member

fanquake commented Feb 6, 2025

Copied the output so we aren't relying on GH actions to retain it.

test/ipc_tests.cpp:11: Entering test suite "ipc_tests"
test/ipc_tests.cpp:12: Entering test case "ipc_tests"
libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
2025-02-05T16:12:45.463402Z [unknown] [test/util/random.cpp:48] [SeedRandomStateForTest] Setting random seed for current tests to RANDOM_CTX_SEED=40e3ae01ebf0dcb2dc43849f963c573142c9f0ec38096c9feb4f14563c55cb78
2025-02-05T16:12:45.472556Z [test] [init/common.cpp:153] [LogPackageVersion] Bitcoin Core version v28.99.0-cd63f7b213c7-dirty (release build)
2025-02-05T16:12:45.472716Z [test] [kernel/context.cpp:20] [operator()] Using the 'arm_shani(1way,2way)' SHA256 implementation
2025-02-05T16:12:45.483530Z [test] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client first request from current thread, constructing waiter
2025-02-05T16:12:45.483734Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.add$Params (a = 1, b = 2)
2025-02-05T16:12:45.483815Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request  #1 FooInterface.add$Params (a = 1, b = 2)
2025-02-05T16:12:45.483854Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #1 FooInterface.add$Results (result = 3)
2025-02-05T16:12:45.483883Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.add$Results (result = 3)
2025-02-05T16:12:45.484420Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passOutPoint$Params (arg = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-05T16:12:45.484443Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request  #2 FooInterface.passOutPoint$Params (arg = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-05T16:12:45.484458Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #2 FooInterface.passOutPoint$Results (result = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-05T16:12:45.484477Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passOutPoint$Results (result = "d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000")
2025-02-05T16:12:45.484688Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passUniValue$Params (arg = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-05T16:12:45.484710Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request  #3 FooInterface.passUniValue$Params (arg = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-05T16:12:45.484724Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #3 FooInterface.passUniValue$Results (result = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-05T16:12:45.484741Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passUniValue$Results (result = "{\\"i\\":1,\\"s\\":\\"two\\"}")
2025-02-05T16:12:45.484770Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passTransaction$Params (arg = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-05T16:12:45.484790Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request  #4 FooInterface.passTransaction$Params (arg = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-05T16:12:45.484966Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #4 FooInterface.passTransaction$Results (result = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-05T16:12:45.485063Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passTransaction$Results (result = "\\002\\000\\000\\000\\001d\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\310\\000\\000\\000\\000\\377\\377\\377\\377\\001\\000\\341\\365\\005\\000\\000\\000\\000\\000\\003\\000\\000\\000")
2025-02-05T16:12:45.485124Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passVectorChar$Params (arg = "Hello")
2025-02-05T16:12:45.485171Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request  #5 FooInterface.passVectorChar$Params (arg = "Hello")
2025-02-05T16:12:45.485198Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #5 FooInterface.passVectorChar$Results (result = "Hello")
2025-02-05T16:12:45.485239Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passVectorChar$Results (result = "Hello")
2025-02-05T16:12:45.485430Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passBlockState$Params (arg = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-05T16:12:45.485488Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request  #6 FooInterface.passBlockState$Params (arg = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-05T16:12:45.485525Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #6 FooInterface.passBlockState$Results (result = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-05T16:12:45.485571Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passBlockState$Results (result = (mode = 1, result = 8, rejectReason = "reject reason", debugMessage = "debug message"))
2025-02-05T16:12:45.485619Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passBlockState$Params (arg = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-05T16:12:45.485663Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request  #7 FooInterface.passBlockState$Params (arg = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-05T16:12:45.485693Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #7 FooInterface.passBlockState$Results (result = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-05T16:12:45.485737Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passBlockState$Results (result = (mode = 0, result = 0, rejectReason = "", debugMessage = ""))
2025-02-05T16:12:45.485793Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client send FooInterface.passScript$Params (arg = "[")
2025-02-05T16:12:45.485833Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server recv request  #8 FooInterface.passScript$Params (arg = "[")
2025-02-05T16:12:45.485858Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server send response #8 FooInterface.passScript$Results (result = "[")
2025-02-05T16:12:45.485898Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/b-test-54829} IPC client recv FooInterface.passScript$Results (result = "[")
2025-02-05T16:12:45.486162Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} IPC server destroy N2mp11ProxyServerIN3gen12FooInterfaceEEE
2025-02-05T16:12:45.486271Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} EventLoop::loop done, cancelling event listeners.
2025-02-05T16:12:45.486283Z [unknown] [test/ipc_test.cpp:60] [operator()] LOG0: {IpcPipeTest-22316/54830} EventLoop::loop bye.

132/140 Test #135: wallet_crypto_tests ..................   Passed    3.05 sec
133/140 Test #134: spend_tests ..........................   Passed    3.68 sec
134/140 Test   #8: secp256k1_exhaustive_tests ...........   Passed   27.97 sec
135/140 Test #139: walletload_tests .....................   Passed    4.45 sec
136/140 Test  #10: bench_sanity_check_high_priority .....   Passed   28.82 sec
137/140 Test #136: wallet_tests .........................   Passed    9.35 sec
138/140 Test #127: coinselector_tests ...................   Passed   20.56 sec
139/140 Test   #6: secp256k1_noverify_tests .............   Passed   42.48 sec
Errors while running CTest
140/140 Test   #7: secp256k1_tests ......................   Passed   59.80 sec

99% tests passed, 1 tests failed out of 140

Total Test time (real) =  59.82 sec

The following tests FAILED:
	140 - ipc_tests (Subprocess aborted)

@ryanofsky
Collaborator

ryanofsky commented Feb 6, 2025

This is probably caused by the same bug reported in #128, based on the "mutex lock failed: Invalid argument" error. But the version reported here might be easier to debug because the crash is happening in a unit test instead of in a Python test.

If you can reproduce this problem locally sometimes, maybe you could run it in a loop until it crashes and try to get a stack trace? This script might work:

#!/bin/bash
while true; do
    lldb --batch -o "run" -o "bt" -- build/src/test/test_bitcoin -t ipc_tests
    status=$?
    if [[ $status -ne 0 ]]; then
        echo "Program crashed with exit code $status. Stopping."
        break
    fi
done

(script from chatgpt of course...)

@ryanofsky
Collaborator

I couldn't reproduce this, but I think I can see how this problem might be happening.

The "mutex lock failed" exception is thrown from the libc++ mutex::lock method when there is any failure to lock the mutex.

In this case I think the exception might be thrown in the ~CapnpProtocol destructor when it calls EventLoop::removeClient to decrement the last reference count on the event loop. Decrementing the count will cause the event loop to exit and the event loop thread to destroy it here.

The problem is that while the event loop thread is destroying the event loop object, the ~CapnpProtocol destructor thread could be trying to lock the EventLoop::m_mutex member here after it calls write(post_fd).
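
To make the ordering concrete, here is a minimal standalone sketch of that sequence (simplified, with assumed names rather than the actual libmultiprocess classes). The relock that would crash is left as a comment instead of being performed:

#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

struct Loop {
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_num_clients{1};
    bool done() const { return m_num_clients == 0; }
};

int main()
{
    auto loop = std::make_unique<Loop>();

    // Event loop thread: waits until the last client is gone, then destroys
    // the Loop object (and the mutex inside it) immediately.
    std::thread loop_thread{[&] {
        std::unique_lock lock{loop->m_mutex};
        loop->m_cv.wait(lock, [&] { return loop->done(); });
        lock.unlock();
        loop.reset();
    }};

    // "~CapnpProtocol" side: drop the last client reference and wake the loop.
    {
        std::unique_lock lock{loop->m_mutex};
        --loop->m_num_clients;
        loop->m_cv.notify_all();
        lock.unlock(); // the real code unlocks here so write(post_fd) can't block...
        // ...and then relocks. If loop_thread has already run loop.reset() by that
        // point, the relock operates on a destroyed mutex and libc++ throws
        // "mutex lock failed: Invalid argument". (The relock is intentionally not
        // performed in this sketch.)
    }

    loop_thread.join();
}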

I think a simple fix for this would look like:

--- a/src/ipc/libmultiprocess/src/mp/proxy.cpp
+++ b/src/ipc/libmultiprocess/src/mp/proxy.cpp
@@ -240,10 +240,12 @@ void EventLoop::removeClient(std::unique_lock<std::mutex>& lock)
     if (done(lock)) {
         m_cv.notify_all();
         int post_fd{m_post_fd};
-        Unlock(lock, [&] {
-            char buffer = 0;
-            KJ_SYSCALL(write(post_fd, &buffer, 1)); // NOLINT(bugprone-suspicious-semicolon)
-        });
+        // Intentionally do not unlock/relock the mutex around this final write
+        // call, so clients are free to destroy the EventLoop object immediately
+        // after the loop() method exits, and code here will not be trying to
+        // relock the event loop mutex after it was destroyed.
+        char buffer = 0;
+        KJ_SYSCALL(write(post_fd, &buffer, 1)); // NOLINT(bugprone-suspicious-semicolon)
     }
 }
 

ryanofsky added a commit to ryanofsky/bitcoin that referenced this issue Feb 7, 2025
This change is intended to fix occasional ipc_tests unit test and rpc_misc.py
functional test errors that happen on macOS, which sometimes fail with the error
"terminating due to uncaught exception of type std::__1::system_error: mutex
lock failed: Invalid argument" as reported in
bitcoin-core/libmultiprocess#154.

This error could happen when `m_loop.reset();` in `CapnpProtocol::startLoop`
executes before the `m_loop->removeClient(lock);` call in the `~CapnpProtocol`
destructor returns, causing a failure to reacquire the `EventLoop::m_mutex`
inside the `removeClient` method.

Fix this error by adding a new mutex to guard access to the
`CapnpProtocol::m_loop` variable and making sure `m_loop.reset();` cannot be
called until after `m_loop->removeClient(lock);` returns.
@ryanofsky
Collaborator

The change in bitcoin/bitcoin#31815 should be a better fix for this issue than the change suggested in #154 (comment), which is a less direct fix and could also theoretically cause deadlocks if the write(post_fd) call blocks.

Assuming the issue is CapnpProtocol::startLoop calling m_loop.reset() too early, adding a mutex around that call is a more direct fix than holding onto the lock in EventLoop::removeClient so that the EventLoop::loop call can't exit and m_loop.reset() is indirectly delayed.
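
For illustration, here is a minimal standalone sketch of that more direct approach (simplified, with assumed names; not the actual code in bitcoin/bitcoin#31815): a dedicated mutex guards the m_loop pointer itself, so the loop thread cannot reset it until the destructor's removeClient call has returned.

#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

struct GuardedLoop {
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_num_clients{0};

    void addClient()
    {
        std::lock_guard lock{m_mutex};
        ++m_num_clients;
    }
    void removeClient()
    {
        std::lock_guard lock{m_mutex};
        if (--m_num_clients == 0) m_cv.notify_all();
    }
    void loop()
    {
        std::unique_lock lock{m_mutex};
        m_cv.wait(lock, [this] { return m_num_clients == 0; });
    }
};

struct Protocol {
    std::unique_ptr<GuardedLoop> m_loop{std::make_unique<GuardedLoop>()};
    std::mutex m_loop_mutex; // guards the m_loop pointer itself, not the loop's state
    std::thread m_thread;

    Protocol()
    {
        m_loop->addClient();
        m_thread = std::thread{[this] {
            m_loop->loop();
            // Blocks while ~Protocol still holds m_loop_mutex, so the reset
            // cannot run until removeClient() below has fully returned.
            std::lock_guard lock{m_loop_mutex};
            m_loop.reset();
        }};
    }

    ~Protocol()
    {
        {
            std::lock_guard lock{m_loop_mutex};
            if (m_loop) m_loop->removeClient();
        }
        m_thread.join();
    }
};

int main()
{
    Protocol protocol; // construct and immediately tear down, as the unit test does
}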

Sjors pushed a commit to Sjors/bitcoin that referenced this issue Feb 7, 2025
ryanofsky pushed a commit to ryanofsky/libmultiprocess that referenced this issue Feb 10, 2025
Currently, there are two cases where code may attempt to lock
`EventLoop::mutex` after the `EventLoop::loop()` method has finished running.
This is not a problem by itself, but is a problem if there is external code
which deletes the `EventLoop` object immediately after the method returns,
which is the case in Bitcoin Core. Both cases of locking after the loop method
returns happen due to uses of the `Unlock()` utility function which unlocks a
mutex temporarily, runs a callback function and relocks it.

The first case happens in `EventLoop::removeClient` when the
`EventLoop::done()` condition is reached and it calls `Unlock()` in order to be
able to call `write(post_fd, ...)` without blocking and wake the EventLoop
thread. In this case, since `EventLoop` execution is done there is not really
any point to using `Unlock()` and relocking, so the code is changed to use a
simple `lock.unlock()` call and not try to relock.

The second case happens in `EventLoop::post` where `Unlock()` is used in a
similar way, and depending on thread scheduling (if the thread is delayed for a
long time before relocking) can result in failing to relock
`EventLoop::m_mutex` after calling `write()`. In this case, since relocking the
mutex is actually necessary for the code that follows, the fix is different:
new `addClient`/`removeClient` calls are added to this code, so the
`EventLoop::loop()` function will never exit while the `post()` function is
waiting.

These two changes are being labeled as a bugfix even though there is not
technically a bug here in the libmultiprocess code or API. The `EventLoop`
object itself was safe before these changes as long as outside code waited for
`EventLoop` methods to finish executing before deleting the `EventLoop` object.
Originally the multiprocess code added in
bitcoin/bitcoin#10102 did this, but later as more
features were added for binding and connecting to unix sockets, and unit tests
were added which would immediately delete the `EventLoop` object after
`EventLoop::loop()` returned, it meant this code could now start failing to
relock after unlocking.

A previous attempt was made to fix this bug in
bitcoin/bitcoin#31815 outside of libmultiprocess code.
But it only addressed the first case of a failing `Unlock()` in
`EventLoop::removeClient()`, not the second case in `EventLoop::post()`.

The first case described above was not observed but is theoretically possible.
The second case happens intermittently on macOS and was reported in
bitcoin-core#154.
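
For reference, a minimal standalone sketch of the removeClient() half of this change (simplified, with assumed names; not the actual libmultiprocess code): once done() is reached, the lock is released for good before the wake-up write, and nothing ever tries to relock a mutex that may belong to an already-destroyed loop.

#include <cstdlib>
#include <memory>
#include <mutex>
#include <thread>
#include <unistd.h>

struct WakeLoop {
    std::mutex m_mutex;
    int m_num_clients{1};
    int m_post_fd[2]; // [0]: read end polled by loop(), [1]: write end used to wake it

    WakeLoop() { if (pipe(m_post_fd) != 0) std::abort(); }
    ~WakeLoop() { close(m_post_fd[0]); close(m_post_fd[1]); }

    bool done() const { return m_num_clients == 0; }

    void loop()
    {
        char buffer;
        while (read(m_post_fd[0], &buffer, 1) == 1) {
            std::unique_lock lock{m_mutex};
            if (done()) return; // the owner may destroy *this right after this returns
        }
    }

    void removeClient(std::unique_lock<std::mutex>& lock)
    {
        if (--m_num_clients == 0) {
            int post_fd{m_post_fd[1]};
            // The old code used the temporary-unlock helper (Unlock()) here and
            // relocked afterwards; releasing the lock for good avoids touching
            // the mutex after loop() is allowed to exit.
            lock.unlock();
            char buffer = 0;
            if (write(post_fd, &buffer, 1) != 1) std::abort(); // wake loop()
        }
    }
};

int main()
{
    auto loop = std::make_unique<WakeLoop>();
    std::thread loop_thread{[&] {
        loop->loop();
        loop.reset(); // destroyed immediately after loop() returns, as the caller does
    }};
    std::unique_lock lock{loop->m_mutex};
    loop->removeClient(lock); // returns with the lock released and never relocks
    loop_thread.join();
}
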
@ryanofsky ryanofsky mentioned this issue Feb 10, 2025
ryanofsky added a commit that referenced this issue Feb 10, 2025
48d01bc bugfix: Do not lock EventLoop::mutex after EventLoop is done (Ryan Ofsky)

Pull request description:

  Currently, there are two cases where code may attempt to lock `EventLoop::mutex` after the `EventLoop::loop()` method has finished running. This is not a problem by itself, but is a problem if there is external code which deletes the `EventLoop` object immediately after the method returns, which is the case in Bitcoin Core. Both cases of locking after the loop method returns happen due to uses of the `Unlock()` utility function which unlocks a mutex temporarily, runs a callback function and relocks it.

  The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reached and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done there is not really any point to using `Unlock()` and relocking, so the code is changed to use a simple `lock.unlock()` call and not try to relock.

  The second case happens in `EventLoop::post` where `Unlock()` is used in a similar way, and depending on thread scheduling (if the thread is delayed for a long time before relocking) can result in failing to relock `EventLoop::m_mutex` after calling `write()`. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are added to this code, so the `EventLoop::loop()` function will never exit while the `post()` function is waiting.

  These two changes are being labeled as a bugfix even though there is not technically a bug here in the libmultiprocess code or API. The `EventLoop` object itself was safe before these changes as long as outside code waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object. Originally the multiprocess code added in
  bitcoin/bitcoin#10102 did this, but later as more features were added for binding and connecting to unix sockets, and unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned, it meant this code could now start failing to relock after unlocking.

  A previous attempt was made to fix this bug in
  bitcoin/bitcoin#31815 outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient()`, not the second case in `EventLoop::post()`.

  The first case described above was not observed but is theoretically possible. The second case happens intermittently on macOS and was reported in #154.

Top commit has no ACKs.

Tree-SHA512: 378d360de84b40f0ddcf2ebc5e11fb74adf6d133e94f148ebb7e0c9926836f276ac79ed1d8cb1a559a7aa756939071e8d252d4f03fe6816760807857fba646cb
@ryanofsky
Collaborator

Since the change in bitcoin/bitcoin#31815 didn't work (see comments there), I set up a macOS VM using quickemu and was able to reproduce this bug locally, get a stack trace, and fix it.

Will keep this issue open to see if the fix works in CI, but #159 is the change, and the stack trace looked like:

(lldb) bt
* thread #1, name = 'b-test', queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007ff80922a196 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007ff809261ee6 libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x00007ff809188b45 libsystem_c.dylib`abort + 123
    frame #3: 0x000000010290aa23 libc++abi.1.dylib`abort_message + 195
    frame #4: 0x00000001028e8179 libc++abi.1.dylib`demangling_terminate_handler() + 233
    frame #5: 0x0000000102908e76 libc++abi.1.dylib`std::__terminate(void (*)()) + 6
    frame #6: 0x0000000102908e18 libc++abi.1.dylib`std::terminate() + 56
    frame #7: 0x000000010000e6c2 test_bitcoin`__clang_call_terminate + 18
    frame #8: 0x00000001011f3ab2 test_bitcoin`mp::EventLoop::post(std::__1::function<void ()> const&) [inlined] mp::UnlockGuard<std::__1::unique_lock<std::__1::mutex> >::~UnlockGuard(this=0x00007ff7bfefe1c8) at util.h:138:29 [opt]
    frame #9: 0x00000001011f3aa6 test_bitcoin`mp::EventLoop::post(std::__1::function<void ()> const&) [inlined] mp::UnlockGuard<std::__1::unique_lock<std::__1::mutex> >::~UnlockGuard(this=0x00007ff7bfefe1c8) at util.h:138:20 [opt]
    frame #10: 0x00000001011f3aa6 test_bitcoin`mp::EventLoop::post(std::__1::function<void ()> const&) [inlined] void mp::Unlock<std::__1::unique_lock<std::__1::mutex>, mp::EventLoop::post(std::__1::function<void ()> const&)::$_1>(lock=0x00007ff7bfefe1f0, callback=<unavailable>)> const&)::$_1&&) at util.h:147:1 [opt]
    frame #11: 0x00000001011f3aa6 test_bitcoin`mp::EventLoop::post(this=0x0000000102e12240, fn=0x00007ff7bfefe250)> const&) at proxy.cpp:231:5 [opt]
    frame #12: 0x0000000100fc5143 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] void mp::EventLoop::sync<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()::operator()() const::'lambda'()>(this=<unavailable>, callable=0x00007ff7bfefe240)::'lambda0'()::operator()() const::'lambda'()&&) at proxy-io.h:155:9 [opt]
    frame #13: 0x0000000100fc511c test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(this=<unavailable>)::'lambda0'()::operator()() const at proxy-io.h:426:38 [opt]
    frame #14: 0x0000000100fc50f1 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] decltype(__f=<unavailable>)::'lambda0'()&>()()) std::__1::__invoke[abi:ne190107]<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()&>(mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()&) at invoke.h:149:25 [opt]
    frame #15: 0x0000000100fc50f1 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ne190107]<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(__args=<unavailable>)::'lambda0'()&>(mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()&) at invoke.h:224:5 [opt]
    frame #16: 0x0000000100fc50f1 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator()() [inlined] std::__1::__function::__alloc_func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator(this=<unavailable>)[abi:ne190107]() at function.h:171:12 [opt]
    frame #17: 0x0000000100fc50f1 test_bitcoin`std::__1::__function::__func<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'(), std::__1::allocator<mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::ProxyClientBase(ipc::capnp::messages::Init::Client, mp::Connection*, bool)::'lambda0'()>, void ()>::operator(this=<unavailable>)() at function.h:313:10 [opt]
    frame #18: 0x00000001011f6a07 test_bitcoin`mp::CleanupRun(std::__1::list<std::__1::function<void ()>, std::__1::allocator<std::__1::function<void ()> > >&) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x00007ff7bfefe2b0)[abi:ne190107]() const at function.h:430:12 [opt]
    frame #19: 0x00000001011f69f8 test_bitcoin`mp::CleanupRun(std::__1::list<std::__1::function<void ()>, std::__1::allocator<std::__1::function<void ()> > >&) [inlined] std::__1::function<void ()>::operator(this=0x00007ff7bfefe2b0)() const at function.h:989:10 [opt]
    frame #20: 0x00000001011f69f8 test_bitcoin`mp::CleanupRun(fns=size=0)>, std::__1::allocator<std::__1::function<void ()> > >&) at proxy.h:46:9 [opt]
    frame #21: 0x0000000100fc558e test_bitcoin`mp::ProxyClientBase<ipc::capnp::messages::Init, interfaces::Init>::~ProxyClientBase(this=0x000060000171e880) at proxy-io.h:449:5 [opt]
    frame #22: 0x0000000100ff1a38 test_bitcoin`mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient() [inlined] mp::ProxyClientCustom<ipc::capnp::messages::Init, interfaces::Init>::~ProxyClientCustom(this=0x000060000171e880) at proxy.h:112:7 [opt]
    frame #23: 0x0000000100ff1a30 test_bitcoin`mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient() [inlined] mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient(this=0x000060000171e880) at init.capnp.proxy-types.c++:7:82 [opt]
    frame #24: 0x0000000100ff1a1d test_bitcoin`mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient() [inlined] mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient(this=0x000060000171e880) at init.capnp.proxy-types.c++:7:57 [opt]
    frame #25: 0x0000000100ff1a1d test_bitcoin`mp::ProxyClient<ipc::capnp::messages::Init>::~ProxyClient(this=0x000060000171e880) at init.capnp.proxy-types.c++:7:57 [opt]
    frame #26: 0x0000000100f9a69b test_bitcoin`IpcSocketPairTest() [inlined] std::__1::default_delete<interfaces::Init>::operator(this=0x00007ff7bfefe3e8, __ptr=<unavailable>)[abi:ne190107](interfaces::Init*) const at unique_ptr.h:80:5 [opt]
    frame #27: 0x0000000100f9a695 test_bitcoin`IpcSocketPairTest() [inlined] std::__1::unique_ptr<interfaces::Init, std::__1::default_delete<interfaces::Init> >::reset[abi:ne190107](this=0x00007ff7bfefe3e8, __p=0x0000000000000000) at unique_ptr.h:292:7 [opt]
    frame #28: 0x0000000100f9a67e test_bitcoin`IpcSocketPairTest() at ipc_test.cpp:149:17 [opt]
    frame #29: 0x0000000100a0587a test_bitcoin`ipc_tests::ipc_tests::test_method(this=0x00007ff7bfefe710) at ipc_tests.cpp:16:5 [opt]
    frame #30: 0x0000000100a05742 test_bitcoin`ipc_tests::ipc_tests_invoker() at ipc_tests.cpp:12:1 [opt]
    frame #31: 0x0000000100054581 test_bitcoin`boost::detail::function::function_obj_invoker<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) [inlined] boost::function_n<void>::operator(this=<unavailable>)() const at function_template.hpp:789:14 [opt]
    frame #32: 0x000000010005456e test_bitcoin`boost::detail::function::function_obj_invoker<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) [inlined] boost::detail::forward::operator(this=<unavailable>)() at execution_monitor.ipp:1405:32 [opt]
    frame #33: 0x000000010005456b test_bitcoin`boost::detail::function::function_obj_invoker<boost::detail::forward, int>::invoke(function_obj_ptr=<unavailable>) at function_template.hpp:79:18 [opt]
    frame #34: 0x0000000100018786 test_bitcoin`boost::execution_monitor::catch_signals(boost::function<int ()> const&) [inlined] boost::function_n<int>::operator(this=<unavailable>)() const at function_template.hpp:789:14 [opt]
    frame #35: 0x0000000100018770 test_bitcoin`boost::execution_monitor::catch_signals(boost::function<int ()> const&) [inlined] int boost::detail::do_invoke<boost::shared_ptr<boost::detail::translator_holder_base>, boost::function<int ()> >(tr=0x0000000101660ca0, F=<unavailable>)> const&) at execution_monitor.ipp:318:30 [opt]
    frame #36: 0x000000010001875c test_bitcoin`boost::execution_monitor::catch_signals(this=0x0000000101660c88, F=0x00007ff7bfeff030)> const&) at execution_monitor.ipp:920:16 [opt]
    frame #37: 0x0000000100018a2b test_bitcoin`boost::execution_monitor::execute(this=0x0000000101660c88, F=0x00007ff7bfeff030)> const&) at execution_monitor.ipp:1318:16 [opt]
    frame #38: 0x0000000100013d4e test_bitcoin`boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned long) [inlined] boost::execution_monitor::vexecute(this=0x0000000101660c88, F=0x0000000102c21018)> const&) at execution_monitor.ipp:1414:5 [opt]
    frame #39: 0x0000000100013d2f test_bitcoin`boost::unit_test::unit_test_monitor_t::execute_and_translate(this=0x0000000101660c88, func=0x0000000102c21018, timeout_microseconds=0)> const&, unsigned long) at unit_test_monitor.ipp:49:9 [opt]
    frame #40: 0x0000000100015543 test_bitcoin`boost::unit_test::framework::state::execute_test_tree(this=0x0000000101660d60, tu_id=66179, timeout_microseconds=0, p_random_generator=0x00007ff7bfeff300) at framework.ipp:815:44 [opt]
    frame #41: 0x0000000100015b68 test_bitcoin`boost::unit_test::framework::state::execute_test_tree(this=0x0000000101660d60, tu_id=131, timeout_microseconds=0, p_random_generator=0x00007ff7bfeff4a0) at framework.ipp:740:54 [opt]
    frame #42: 0x0000000100015b68 test_bitcoin`boost::unit_test::framework::state::execute_test_tree(this=0x0000000101660d60, tu_id=1, timeout_microseconds=0, p_random_generator=0x0000000000000000) at framework.ipp:740:54 [opt]
    frame #43: 0x0000000100012ca8 test_bitcoin`boost::unit_test::framework::run(id=1, continue_test=<unavailable>) at framework.ipp:1722:29 [opt]
    frame #44: 0x00000001000248c3 test_bitcoin`boost::unit_test::unit_test_main(init_func=<unavailable>, argc=<unavailable>, argv=<unavailable>)(int, char**), int, char**) at unit_test_main.ipp:250:9 [opt]
    frame #45: 0x00007ff808f07418 dyld`start + 1896
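
The trace points at the UnlockGuard destructor relocking inside EventLoop::post (frames #8-#11). A minimal standalone sketch of the idea behind the #159 fix for that path (simplified, with assumed names, and a condition-variable wakeup instead of the real pipe-based one): post() brackets its wait with addClient()/removeClient(), so the loop cannot reach its done() state, and therefore cannot exit and let the EventLoop be destroyed, while post() still needs the mutex.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

struct PostLoop {
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_num_clients{0};
    std::function<void()> m_post_fn;

    bool done() const { return m_num_clients == 0; }

    void addClient(std::unique_lock<std::mutex>&) { ++m_num_clients; }
    void removeClient(std::unique_lock<std::mutex>&)
    {
        --m_num_clients;
        m_cv.notify_all();
    }

    void loop()
    {
        std::unique_lock lock{m_mutex};
        while (!done()) {
            if (m_post_fn) {
                auto fn = std::move(m_post_fn);
                m_post_fn = nullptr;
                fn();
                m_cv.notify_all();
            } else {
                m_cv.wait(lock);
            }
        }
        // The owner may destroy *this as soon as loop() returns.
    }

    void post(std::function<void()> fn)
    {
        std::unique_lock lock{m_mutex};
        addClient(lock); // pin the loop: done() stays false while post() waits
        m_post_fn = std::move(fn);
        m_cv.notify_all();
        m_cv.wait(lock, [this] { return !m_post_fn; });
        removeClient(lock); // only now may loop() observe done() and exit
    }
};

int main()
{
    PostLoop loop;
    int result{0};
    {
        std::unique_lock lock{loop.m_mutex};
        loop.addClient(lock); // initial reference, held until the client thread finishes
    }
    std::thread client{[&] {
        loop.post([&] { result = 1 + 2; });
        std::unique_lock lock{loop.m_mutex};
        loop.removeClient(lock); // drop the last reference so loop() can exit
    }};
    loop.loop(); // runs the posted work; exits only after the client thread is done posting
    client.join();
    return result == 3 ? 0 : 1;
}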

@Sjors
Member Author

Sjors commented Feb 10, 2025

Included in the latest push: bitcoin/bitcoin#30975 (comment)

@ryanofsky
Collaborator

Will close this issue since I am pretty confident about the fix in #159 and we haven't seen this problem recur since it's been incorporated.
