Threadpool deadlocks on common multithreading examples #11922
Same for me.
Oddly, with the …
I did some more experiments today; the threadpool also deadlocks on common task-parallelism benchmarks:
Source: mratsim/weave@be776c4
Is this a deadlock? It looks more like recursion is exhausting the threads in the pool.
When I checked with …
I'm still trying to wrap my head around this fib implementation, but I think that by recursively spawning threads this way, they effectively block one another until the threadpool runs out of threads. Then the last call to spawn blocks waiting for the threadpool to give it a new thread, which is impossible at that point.
You may be right. I rewrote the example:

```nim
import
  # STD lib
  os, strutils, threadpool, strformat

# Compile with "nim c --threads:on -d:danger --outdir:build"

proc parfib(n: uint64): uint64 =
  if n < 2:
    return n
  if preferSpawn():
    let x = spawn parfib(n-1)
    let y = parfib(n-2)
    return ^x + y
  else:
    let x = parfib(n-1)
    let y = parfib(n-2)
    return x + y

proc main() =
  if paramCount() != 1:
    echo "Usage: fib <n-th fibonacci number requested>"
    quit 0
  let n = paramStr(1).parseUInt.uint64
  let f = parfib(n)
  echo "Result: ", f

main()
```

Note: the spawnX template doesn't work when a spawned function returns something. One of the issues I can see is that the preferSpawn -> spawn sequence is not atomic, so it may just trigger pool exhaustion as well if two threads compete for the last thread in the pool.
I'm still pretty new to Nim, so don't take my word on it. But since it doesn't freeze on every execution, I would say that this needs a mutex somewhere, and that the pool is being exhausted between the calls to preferSpawn and spawn. I still can't think of a situation where it would make sense to spawn threads this way, because the end result is going to be sequential anyway.
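A minimal sketch of the fix that comment suggests, making the preferSpawn -> spawn pair atomic with a lock (the lock name and structure are illustrative only, not from the thread or the standard library):

```nim
import locks, threadpool

var spawnGuard: Lock   # hypothetical global guard, not part of std/threadpool
initLock(spawnGuard)

proc parfib(n: uint64): uint64 =
  if n < 2:
    return n
  var x: FlowVar[uint64]
  var spawned = false
  # Check-then-spawn under one lock, so two threads can no longer both
  # observe a free worker and then race each other for it.
  withLock(spawnGuard):
    if preferSpawn():
      x = spawn parfib(n-1)
      spawned = true
  if spawned:
    return ^x + parfib(n-2)
  else:
    return parfib(n-1) + parfib(n-2)
```

Of course, this serializes every spawn decision behind a single lock, which is exactly the kind of bottleneck a work-stealing runtime is designed to avoid.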
Fibonacci is a toy example (there are much faster ways to compute it), but it's useful for understanding the overhead of a multithreading runtime, since the computation done per task is so small. The main benefit of a low-overhead multithreading runtime is that the developer no longer has to worry about the task size before using spawn (see the sketch after this comment), and those are huge worries in:

This is mainly due to 2 things:

A runtime that doesn't address those makes it difficult to write a generic higher-order function like …

At least it's not alone. Fibonacci is part of the following suites:

And it was used to evaluate multithreading runtimes by US-led scientific programs:
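To make the grain-size worry from the first paragraph concrete, here is a minimal parallel-sum sketch (illustrative only; `sumRange` and its `grainSize` parameter are made up for this example and are not from the thread). Every user of spawn ends up hand-tuning a threshold like this:

```nim
import threadpool

proc sumRange(data: seq[int], lo, hi, grainSize: int): int =
  ## Sums data[lo ..< hi], splitting in half until chunks reach grainSize.
  ## Note: the seq argument is copied on every spawn in this naive sketch.
  if hi - lo <= grainSize:
    # Sequential leaf: the chunk must be large enough to amortize the
    # cost of spawn and FlowVar synchronization.
    for i in lo ..< hi:
      result += data[i]
  else:
    let mid = lo + (hi - lo) div 2
    let left = spawn sumRange(data, lo, mid, grainSize)
    let right = sumRange(data, mid, hi, grainSize)
    result = ^left + right
```

Pick grainSize too small and scheduling overhead dominates; too large and cores sit idle. And on the standard threadpool, this recursive spawn pattern is exposed to the same pool exhaustion discussed in this issue.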
Thanks for all the info. I still don't think this is a common case at all. I tried and failed to find an example in your links where they spawn threads recursively. IMHO what you're seeing here is the parallel equivalent of a stack overflow, and that would generally be considered a bug in your code. Of course, some languages have TCO, but that's a design choice, and not having it isn't fundamentally wrong.
In universities, people are introduced to parallel computing with recursive algorithms: http://ipcc.cs.uoregon.edu/lectures/lecture-9-fork-join.pdf

In scientific computing, recursive divide-and-conquer is the standard, and it is often also the fastest approach for many matrix operations; see this course for example: https://www.cs.fsu.edu/~engelen/courses/HPC/Algorithms2.pdf

So I think it's important to support those workloads. As a nice benefit, I do think Nim is a much nicer language to approach than C++ or Fortran, especially on professors' slides ;).

Besides being ubiquitous, the main reason the fork-join / divide-and-conquer model is widespread is that it's efficient. The Cilk paper (http://supertech.csail.mit.edu/papers/PPoPP95.pdf) has been foundational and formally proves bounds on the speedup of fork-join parallelism backed by work stealing on N processors (an idle thread steals tasks from a busy thread's task queue). And the Cilk approach is mathematically optimal (there is at least one other optimal alternative that I'm aware of, which Julia uses, but it is still optimized for recursive spawns).
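For reference, the bound proven in that line of work, stated in work/span notation (my paraphrase: T_1 is the total work, T_inf the span or critical path, P the number of processors):

```latex
% Expected running time under randomized work stealing
% (Blumofe & Leiserson; the basis of the Cilk bound):
T_P \;\le\; \frac{T_1}{P} \;+\; O(T_\infty)
```

Since any schedule needs at least max(T_1/P, T_inf) time, this is within a constant factor of optimal, which is where the optimality claim comes from.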
Regarding concrete examples of divide-and-conquer: be sure to read the HPC course from the previous paragraph (https://www.cs.fsu.edu/~engelen/courses/HPC/Algorithms2.pdf).

N-queens spawns tasks recursively, and billions of tasks at that: https://github.com/mratsim/weave/blob/c26cfb2c7efbdb28ad7459a61bd567419d9e22c5/benchmarks/nqueens/weave_nqueens.nim#L111-L149

```nim
proc nqueens_par(n, j: int32, a: CharArray): int32 =
  if n == j:
    # Good solution, count it
    return 1

  var localCounts = alloca(Flowvar[int32], n)
  zeroMem(localCounts, n * sizeof(Flowvar[int32]))

  # Try each position for queen `j`
  for i in 0 ..< n:
    var b = alloca(char, j+1)
    copyMem(b, a, j * sizeof(char))
    b[j] = cast[char](i)
    if isValid(j+1, b):
      localCounts[i] = spawn nqueens_par(n, j+1, b)

  for i in 0 ..< n:
    if localCounts[i].isSpawned():
      result += sync(localCounts[i])

const solutions = [
  1,
  0,
  0,
  2,
  10, # 5x5
  4,
  10,
  92, # 8x8
  352,
  724, # 10x10
  2680,
  14200,
  73712,
  365596,
  2279184, # 15x15
  14772512
]
```

Cache-oblivious matrix multiplication is also recursive:

```nim
proc matmul[T](A, B, C: Matrix[T], m, n, p: int, add: bool): bool =
  # The original bench passes around a ``ld`` parameter (leading dimension?),
  # we store it in the matrices.
  # We return a dummy bool to allow waiting on the matmul.

  # Threshold: below it, multiply sequentially
  if (m + n + p) <= 64:
    if add:
      for i in 0 ..< m:
        for k in 0 ..< p:
          var c = 0.T
          for j in 0 ..< n:
            c += A[i, j] * B[j, k]
          C[i, k] += c
    else:
      for i in 0 ..< m:
        for k in 0 ..< p:
          var c = 0.T
          for j in 0 ..< n:
            c += A[i, j] * B[j, k]
          C[i, k] = c
    return

  var h0, h1: FlowVar[bool]
  ## Each half of the computation

  # matrix is larger than threshold
  if m >= n and n >= p:
    let m1 = m shr 1 # divide by 2
    h0 = spawn matmul(A, B, C, m1, n, p, add)
    h1 = spawn matmul(A.stride(m1, 0), B, C.stride(m1, 0), m - m1, n, p, add)
  elif n >= m and n >= p:
    let n1 = n shr 1 # divide by 2
    h0 = spawn matmul(A, B, C, m, n1, p, add)
    h1 = spawn matmul(A.stride(0, n1), B.stride(n1, 0), C, m, n - n1, p, add = true)
  else:
    let p1 = p shr 1
    h0 = spawn matmul(A, B, C, m, n, p1, add)
    h1 = spawn matmul(A, B.stride(0, p1), C.stride(0, p1), m, n, p - p1, add)

  discard sync(h0)
  discard sync(h1)
```

Any tree algorithm is easier to write when you can spawn recursively; the unbalanced tree search used in the Sandia US-lab link is the gold standard for comparing runtimes, and it is recursive. Implementation in C: https://github.com/bsc-pm/bots/blob/2607a695128727a3f6c98754367705443663136d/omp-tasks/uts/uts.c#L169-L220

Similarly, in data science and machine learning there are many tree algorithms that would benefit from recursive spawning and would be simplified if we didn't have to worry about grain size:
Cool stuff, thanks for all the links. I'm familiar with fork-join and divide-and-conquer. I think we just have different expectations. The point I'm failing to articulate is that in the fib example above, the branch that spawns will always exhaust the pool once the recursion depth reaches the thread count. And for me that's perfectly normal and acceptable when working with a simple thread pool. You can totally implement mechanisms that deal with it under the hood and allow you to write pretty functional code (which I love). But there's a cost to that as well that shouldn't be ignored, and again that's not what I would expect of a thing called a thread pool :)
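To spell out that exhaustion on the plain spawn variant (no preferSpawn), with a hypothetical pool capped at 4 workers:

1. Worker 1 runs fib(n), spawns fib(n-1), and blocks in `^x` waiting for it.
2. Worker 2 picks up fib(n-1), spawns fib(n-2), and blocks in `^x`.
3. Workers 3 and 4 do the same, each one recursion level deeper.
4. The next spawn finds no free worker, and every worker is blocked on a child that can never be scheduled: deadlock at depth 4, long before the recursion reaches the n < 2 base case.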
I see your point. It would be nice to have recursive thread-spawning, but it's also OK for the standard threadpool implementation to expect spawns from the main thread only. However, spawnX would be a nice solution if it worked with functions that return values.

Not long ago, I had much more severe problems with threadpool. It would deadlock even under strictly main-thread usage, after I simply reduced the threadpool size. I need to be able to control the number of threads at runtime. (I reduce the size before using the threadpool, but Nim's standard library threadpool initializes at program start-up.) That bug seems to be fixed, but I would rather have an instance of a threadpool anyway. It's just simpler. So I have switched to an alternative threadpool implementation:
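A minimal sketch of the start-up ordering described in the parenthetical above, assuming the size is reduced via the standard threadpool's setMaxPoolSize (the exact call the commenter used is not stated):

```nim
import threadpool   # importing the module already creates the default pool

setMaxPoolSize(4)   # runs before any spawn, but only after the workers
                    # were created during module initialization

proc hello() =
  echo "hello from a pooled worker"

spawn hello()
sync()              # wait for all spawned tasks to finish
```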
@mratsim Excuse me, but aren't you more qualified to write such an RFC?
Well, you kept saying on IRC that you would write one, but yes, I can write one.
Where are we on this? I was looking at weave, which is awesome, but it uses globals (right?), and I really want an instance of a threadpool. That's the only reason I'm still using yglukhov's.
Yes, it uses globals; I do plan to migrate away from them (mratsim/weave#9). There are 4 "globals" used (https://github.com/mratsim/weave/blob/e5a3701d/weave/contexts.nim#L28-L34), which are:
All the operations can be changed to accept a global and thread-local context, and then Weave can either use a global like today or work with pool instances with … (a sketch of that refactor follows this comment).

Also, for the standard library, @Araq and I agree that @yglukhov's threadpool is much more suited to it. Weave is too complex to dump its maintenance on standard library maintainers. I do plan to write a much simpler (maybe 2000 lines at most) work-stealing threadpool (or threadpools) anyway, because I need one for parallelizing some cryptographic workloads, and we certainly don't want a potential bug/attack surface as large as Weave for crypto (memory management, globals, state machines, heap allocation, ...).

My main issues with @yglukhov's threadpools are that they also deadlock on common (dumb) examples (Fibonacci), and that they use Nim channels. Nim channels have RTTI magic that seems super complex to reproduce, and they are likely to have low throughput due to using a single lock for the whole channel instead of two locks for readers and writers (two-lock channels are actually not that much slower than the lock-free channels/queues I use in Weave for implementing FlowVars).
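A rough sketch of the context-object refactor mentioned at the top of the previous comment (every name here is hypothetical; Weave's real globals are the four linked above):

```nim
type
  TaskQueue = object     # placeholder for a per-worker deque
  WeavePool* = object    # hypothetical: bundles what are module globals today
    queues: seq[TaskQueue]
    numThreads: int      # also: memory pool, RNG state, logging handle, ...

proc init*(pool: var WeavePool, numThreads: int) =
  ## Create the workers and queues owned by this instance.
  pool.numThreads = numThreads
  pool.queues.setLen(numThreads)

proc submit*(pool: var WeavePool, work: proc () {.gcsafe.}) =
  ## Enqueue on this instance's queues instead of on module globals.
  discard

var defaultPool*: WeavePool   # keeps today's "one global pool" mode working
```

Callers that want today's behaviour keep using `defaultPool`; callers that need isolation create their own `WeavePool` instance.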
That sounds like a great idea for weave: make instance versions available (except for memory pools). For the stdlib, yeah, I like yglukhov/threadpools. It too could have a global version and an instance version of the main functions. It also needs a bit of updating for the latest Nim version. I've never had a deadlock with yglukhov/threadpools, and I had deadlocks all the time with the stdlib threadpool. Could you point to a test case?
Unfortunately, my 18-core machine is out of order at the moment, but it was just a plain Fibonacci. That said, GCC OpenMP (but not Clang OpenMP, Intel OpenMP, or Intel TBB) also chokes on this :/.
I have …
Very nice. This belongs in the stdlib. @brentp, if you can upgrade to Nim 1.6, this should solve all your problems.
Even though it's incredibly inefficient, the "Hello World" of task parallelism is computing Fibonacci recursively, spawning about 2^n tasks.
Unfortunately, Nim's threadpool reliably deadlocks on relatively small inputs:
Reproduction
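The original snippet did not survive in this copy of the issue; a minimal program consistent with the discussion above (naive recursive Fibonacci on the standard threadpool, compiled with `nim c --threads:on fib.nim`) would be:

```nim
import threadpool

proc fib(n: int): int =
  if n < 2:
    return n
  let x = spawn fib(n-1)   # each recursion level parks a worker below
  let y = fib(n-2)
  result = ^x + y          # blocks this worker until the child finishes

echo fib(30)   # hangs once every worker is blocked waiting on a child
```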