Don't clobber the scheduler #310

Merged: 14 commits merged into master on Dec 31, 2021

Conversation

@jpsamaroo (Member) commented Dec 2, 2021:

This PR is a set of bugfixes and optimizations that aim to improve the scheduler's ability to pump out tasks to workers and maximize throughput. It addresses the following issues:

  • The scheduler thread should generally be de-prioritized for running tasks, or else scheduling will hang while a blocking task executes. All other potentially eligible processors should be considered before the scheduler thread.
  • Blocking tasks on workers may strangle the task trying to launch work (in do_tasks); since that task holds the scheduler lock, this hangs the scheduler in fire_tasks!, which prevents new tasks from being added to the scheduler and launched. We shouldn't hold the lock during this time, and in the future we'll want a dedicated thread per worker to receive and launch work.
  • Checking for new workers happens far too frequently; we should make this event-driven instead. I intend to funnel Context modifications through a custom setproperty! call which will notify any listening tasks that the set of available workers has changed (@DrChainsaw). A sketch of this approach follows this list.
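Here is that sketch: a minimal illustration, not Dagger's actual implementation (MyContext, wait_for_worker_change, and the Vector{Int} worker set are stand-ins):

    # Funnel mutations through setproperty! and notify waiting tasks
    # whenever the set of available workers changes.
    mutable struct MyContext
        procs::Vector{Int}              # available worker IDs (simplified)
        proc_notify::Threads.Condition  # listeners block on this
        MyContext(procs) = new(procs, Threads.Condition())
    end

    function Base.setproperty!(ctx::MyContext, field::Symbol, value)
        setfield!(ctx, field, value)
        if field === :procs
            lock(ctx.proc_notify) do
                notify(ctx.proc_notify)  # wake tasks waiting on worker changes
            end
        end
        return value
    end

    # Instead of a while true; sleep(...) polling loop, a scheduler task
    # blocks here until the worker set actually changes:
    function wait_for_worker_change(ctx::MyContext)
        lock(ctx.proc_notify) do
            wait(ctx.proc_notify)
        end
    end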

Todo:

  • Refactor and test scheduler network optimization logic
  • Test that MemPool.approxsize returns non-nothing for important types
  • Find and fix hangs introduced by 22c235f

@DrChainsaw (Contributor):

> we should make this event-driven instead

Every time a while true; sleep ... loop dies the fiery death it deserves, an angel gets its wings.

IIRC, one thing that made me a bit nervous when I wrote the current code was race conditions where the same Thunk would be scheduled twice, which is why I tried to confine any asynchronous scheduling on new workers to happen only while waiting for work to finish.

@jpsamaroo (Member, Author):

Can you elaborate on this potential race condition? We shouldn't be able to double-schedule a thunk, and even if we do, it would probably not cause significant harm to the scheduler's internal consistency.

@DrChainsaw (Contributor):

> Can you elaborate on this potential race condition?

I can't think of anything specific, beyond a general worry about non-deterministic bugs caused by things running in an unexpected order when multiple tasks yield. I suppose double-scheduling a thunk shouldn't happen unless something firing off a thunk is threaded. It was probably more an attempt to keep the moving parts to a minimum.

@jpsamaroo (Member, Author):

That's a fair concern, although I think it's probably not something we need to worry about. At this point, the scheduler should be multitasking-safe, since a global lock is taken during modifications. If I implement an event-driven notification system, it will be built on a thread-safe primitive, and scheduler updates will be serialized with the scheduler lock. Slow, maybe, but also race-free.
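A quick sketch of that serialization pattern (names are illustrative; this is not Dagger's actual code):

    # Every scheduler-state mutation runs while holding one global lock,
    # so updates arriving from different tasks are serialized and race-free.
    const SCHED_LOCK = ReentrantLock()

    function workers_changed!(known_procs::Set{Int}, new_procs)
        lock(SCHED_LOCK) do
            union!(known_procs, new_procs)  # safe: no concurrent mutation here
        end
    end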

@jpsamaroo jpsamaroo force-pushed the jps/no-clobber-sch branch 2 times, most recently from d50360b to f2a2689 Compare December 8, 2021 18:54
@jpsamaroo jpsamaroo marked this pull request as ready for review December 8, 2021 21:46
@codecov-commenter commented Dec 8, 2021:

Codecov Report

Merging #310 (847044f) into master (566c35b) will not change coverage.
The diff coverage is 0.00%.


@@          Coverage Diff           @@
##           master    #310   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files          42      42           
  Lines        3596    3630   +34     
======================================
- Misses       3596    3630   +34     
Impacted Files Coverage Δ
src/processor.jl 0.00% <0.00%> (ø)
src/sch/Sch.jl 0.00% <0.00%> (ø)
src/sch/dynamic.jl 0.00% <0.00%> (ø)
src/sch/util.jl 0.00% <0.00%> (ø)
src/scopes.jl 0.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@krynju (Member) commented Dec 11, 2021:

I get a reproducible hang on this branch when running Dagger.groupby (it works fine on master): a full hang, with no CPU activity on the main thread or any other thread.

I'll get an MWE later, but here are the benchmark steps:

1. Clone https://github.com/krynju/dtable_benchmarks
2. Activate the environment and dev Dagger
3. Run: julia -t16 .\scripts_benchmark\dtable3.jl 10000000 1000000 1000 4

@jpsamaroo (Member, Author):

Yeah, I also get a hang somewhere on a large distributed benchmark. I plan to investigate before I merge.

@jpsamaroo jpsamaroo marked this pull request as draft December 12, 2021 15:13
@krynju (Member) commented Dec 12, 2021:

It looks a lot like the one we had in #284: it suddenly stops doing anything and just idles with no activity.

@jpsamaroo jpsamaroo force-pushed the jps/no-clobber-sch branch 2 times, most recently from 847044f to 5bd8fec Compare December 13, 2021 20:51
@jpsamaroo jpsamaroo marked this pull request as ready for review December 13, 2021 20:51
@jpsamaroo jpsamaroo requested a review from krynju December 13, 2021 23:44
@jpsamaroo (Member, Author):

@DrChainsaw I'd appreciate a review of the Context changes, if you get the chance.

@krynju (Member) commented Dec 14, 2021:

Hmm, sometimes I get crashes, but I managed to get my biggest (16 GB) groupby to work once, and the performance is about the same. It's definitely less stable than master, though; master didn't really crash for me in these benchmarks.

I just got a smaller data size to crash and throw this (in my experience, usually related to some race condition):


Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ff96c4a0ee8 -- RtlVirtualUnwind at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
in expression starting at C:\Users\krynjupc\WS\dtable_benchmarks\scripts_benchmark\dtable3.jl:15

@jpsamaroo (Member, Author):

That's very concerning; is there any chance you can run the benchmarks with a debug build of Julia? It might be that we've got a portion of ComputeState-modifying code accidentally running outside of the global lock.

@jpsamaroo (Member, Author):

> I managed to get my biggest (16 GB) groupby to work once, and the performance is about the same

Yeah, performance may be somewhat similar on non-distributed workloads, since work is over-subscribed first and then executed (so you get cycles of scheduling followed by execution, and clobbering never becomes a problem).

For me, a distributed benchmark of heavy BLAS operations was helped a lot by this, as well as #165 .

@krynju (Member) commented Dec 14, 2021:

Ah, never mind: it's usually faster, sometimes even 2x faster in some longer runs.

https://pastebin.com/e76V1DZQ

@krynju (Member) commented Dec 14, 2021:

Logs with debug enabled:

PS C:\Users\krynjupc\WS\dtable_benchmarks> c:\Users\krynjupc\WS\dtable_benchmarks\run.ps1

  Activating project at `C:\Users\krynjupc\WS\dtable_benchmarks`
tablesize 1600.0 MB
saving results to dtable_bench1639504217.csv
┌ Debug: (1) eager_thunk (1) Using available Dagger.ThreadProc(1, 5): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (2) Using available Dagger.ThreadProc(1, 11): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (3) Using available Dagger.ThreadProc(1, 12): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (4) Using available Dagger.ThreadProc(1, 7): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (5) Using available Dagger.ThreadProc(1, 3): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (7) Using available Dagger.ThreadProc(1, 9): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (6) Using available Dagger.ThreadProc(1, 8): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (8) Using available Dagger.ThreadProc(1, 2): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (9) Using available Dagger.ThreadProc(1, 16): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (10) Using available Dagger.ThreadProc(1, 6): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (2) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (4) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (8) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (7) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (6) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (5) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (3) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (9) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (10) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (12) Using available Dagger.ThreadProc(1, 13): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (11) Using available Dagger.ThreadProc(1, 15): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (14) Using available Dagger.ThreadProc(1, 11): 364767300 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (12) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (13) Using available Dagger.ThreadProc(1, 4): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (11) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (14) Releasing Dagger.ThreadProc: 364767300 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (13) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (16) Using available Dagger.ThreadProc(1, 9): 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (17) Using available Dagger.ThreadProc(1, 6): 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (15) Using available Dagger.ThreadProc(1, 3): 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (16) Releasing Dagger.ThreadProc: 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (17) Releasing Dagger.ThreadProc: 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (18) Using available Dagger.ThreadProc(1, 12): 92664743 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (15) Releasing Dagger.ThreadProc: 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (19) Using available Dagger.ThreadProc(1, 15): 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (18) Releasing Dagger.ThreadProc: 92664743 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (19) Releasing Dagger.ThreadProc: 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (20) Using available Dagger.ThreadProc(1, 2): 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (21) Using available Dagger.ThreadProc(1, 11): 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (20) Releasing Dagger.ThreadProc: 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (21) Releasing Dagger.ThreadProc: 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (23) Using available Dagger.ThreadProc(1, 10): 6526314 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (22) Using available Dagger.ThreadProc(1, 8): 12274329 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (24) Using available Dagger.ThreadProc(1, 13): 82865303 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (23) Releasing Dagger.ThreadProc: 6526314 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (25) Using available Dagger.ThreadProc(1, 7): 82865303 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (22) Releasing Dagger.ThreadProc: 12274329 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (24) Releasing Dagger.ThreadProc: 82865303 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (25) Releasing Dagger.ThreadProc: 82865303 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (26) Using available Dagger.ThreadProc(1, 6): 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (27) Using available Dagger.ThreadProc(1, 16): 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (26) Releasing Dagger.ThreadProc: 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (27) Releasing Dagger.ThreadProc: 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (29) Using available Dagger.ThreadProc(1, 9): 10667987 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (30) Using available Dagger.ThreadProc(1, 8): 1773385 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (28) Using available Dagger.ThreadProc(1, 3): 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (29) Releasing Dagger.ThreadProc: 10667987 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (30) Releasing Dagger.ThreadProc: 1773385 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (31) Using available Dagger.ThreadProc(1, 13): 1773385 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (28) Releasing Dagger.ThreadProc: 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (32) Using available Dagger.ThreadProc(1, 11): 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (31) Releasing Dagger.ThreadProc: 1773385 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (32) Releasing Dagger.ThreadProc: 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (33) Using available Dagger.ThreadProc(1, 14): 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (34) Using available Dagger.ThreadProc(1, 4): 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (33) Releasing Dagger.ThreadProc: 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (34) Releasing Dagger.ThreadProc: 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (36) Using available Dagger.ThreadProc(1, 12): 611821 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (35) Using available Dagger.ThreadProc(1, 2): 478442 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (36) Releasing Dagger.ThreadProc: 611821 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (37) Using available Dagger.ThreadProc(1, 11): 512880 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (35) Releasing Dagger.ThreadProc: 478442 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (37) Releasing Dagger.ThreadProc: 512880 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (38) Using available Dagger.ThreadProc(1, 13): 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (39) Using available Dagger.ThreadProc(1, 6): 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (38) Releasing Dagger.ThreadProc: 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (39) Releasing Dagger.ThreadProc: 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (41) Using available Dagger.ThreadProc(1, 15): 342435 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (40) Using available Dagger.ThreadProc(1, 9): 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (41) Releasing Dagger.ThreadProc: 342435 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (40) Releasing Dagger.ThreadProc: 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) #541 (42) Using available Dagger.ThreadProc(1, 8): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (44) Using available Dagger.ThreadProc(1, 10): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (45) Using available Dagger.ThreadProc(1, 3): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (46) Using available Dagger.ThreadProc(1, 13): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (47) Using available Dagger.ThreadProc(1, 12): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) #541 (42) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (48) Using available Dagger.ThreadProc(1, 16): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (49) Using available Dagger.ThreadProc(1, 11): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (53) Using available Dagger.ThreadProc(1, 9): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (52) Using available Dagger.ThreadProc(1, 7): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (50) Using available Dagger.ThreadProc(1, 6): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (51) Using available Dagger.ThreadProc(1, 15): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) build_groupby_index (43) Using available Dagger.ThreadProc(1, 4): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (47) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (45) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (46) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (44) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (48) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ff96c4a0ee8 -- RtlVirtualUnwind at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
in expression starting at C:\Users\krynjupc\WS\dtable_benchmarks\scripts_benchmark\dtable2.jl:15

[The fault report above repeats identically 14 more times.]

@DrChainsaw (Contributor):

Can one add new Thunks to an ongoing computation with add_thunk!? If so, I could take a stab at fixing the add/rm procs test with it.

@jpsamaroo (Member, Author):

> Can one add new Thunks to an ongoing computation with add_thunk!?

Absolutely! That's how the eager scheduler is implemented: as a regular thunk running on a regular scheduler, listening on a channel and constructing new Thunks with add_thunk!.
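A hedged sketch of that pattern (sch_handle, add_thunk!, and fetch(h, id) are the dynamic-scheduler calls used elsewhere in this thread; work_chan and the submitted tuple format are illustrative):

    using Dagger

    # A long-running thunk that listens on a Channel and extends the
    # live DAG with new Thunks as work arrives.
    function eager_loop(work_chan::Channel)
        h = Dagger.Sch.sch_handle()  # handle into the running scheduler
        while true
            f, args, result_chan = take!(work_chan)
            id = Dagger.Sch.add_thunk!(f, h, args...)  # add to the ongoing computation
            put!(result_chan, fetch(h, id))            # wait for and forward the result
        end
    end

(A real implementation would fetch results asynchronously, so one slow thunk doesn't block new submissions.)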

@DrChainsaw (Contributor):

Here is a patched version of the add-workers test, which should no longer be unreliable. I could not get add_thunk! to use the existing context, so I had to disable the part where we add workers that we don't want to participate in the computation.

Let me know if it's acceptable and I'll try to fix the remove-workers test too.

Add procs test
    setup = quote
        using Dagger, Distributed
        # blocked is to guarantee that processing is not completed before we add new workers
        # Note: blocked is used in expressions below
        blocked = true
        function testfun(i)
            i < 4 && return myid()
            # Wait for test to do its thing before we proceed
            if blocked
                sleep(0.1) # just so we don't end up overflowing or something while waiting for workers to be added
                # Here we would like to just wait to be rescheduled on another worker (which is not blocked)
                # but this functionality does not exist, so instead we do this weird thing where we reschedule
                # until we end up on a non-blocked worker
                h = Dagger.Sch.sch_handle()
                id = Dagger.Sch.add_thunk!(testfun, h, i)
                return fetch(h, id)
            end
            return myid()
        end
    end

    @testset "Add new workers" begin
        ps = []
        try
            ps1 = addprocs(2, exeflags="--project")
            append!(ps, ps1)

            @everywhere vcat(ps1, myid()) $setup

            ts = delayed(vcat)((delayed(testfun)(i) for i in 1:10)...)

            ctx = Context(ps1)
            job = @async collect(ctx, ts)

            while !istaskstarted(job)
                sleep(0.001)
            end

            # Will not be added, so they should never appear in output
            # TODO: Does not work: add_thunk! seems to create a new context using all available workers :(
            #ps2 = addprocs(2, exeflags="--project")
            #append!(ps, ps2)

            ps3 = addprocs(2, exeflags="--project")
            append!(ps, ps3)
            @everywhere ps3 $setup
            addprocs!(ctx, ps3)
            @test length(procs(ctx)) == 4

            @everywhere ps3 blocked=false

            ps_used = fetch(job)
            @test ps_used isa Vector

            @test any(p -> p in ps_used, ps1)
            @test any(p -> p in ps_used, ps3)
            #@test !any(p -> p in ps_used, ps2)
        finally
            wait(rmprocs(ps))
        end
    end

@jpsamaroo (Member, Author):

> id = Dagger.Sch.add_thunk!(testfun, h, i)

It'd probably be better to call into the scheduler (with Sch.exec!), check whether the new workers are available in procs(ctx) (which is passed to the called function as its first argument), and if so, do Dagger.Sch.add_thunk!(testfun, h, i; single=wid); if not, reschedule and wait, as you're doing now.

> Let me know if it's acceptable and I'll try to fix the remove-workers test too.

Yes please!

@jpsamaroo (Member, Author):

I'll update your new test approach to use my suggestion.

@jpsamaroo (Member, Author):

Wonderfully enough, this PR seems to actually make fault handling more robust.

@DrChainsaw (Contributor):

I tried a little to fix the remove-procs test, but I got stuck. It seems like add_thunk! behaves differently with the single keyword: the added thunks don't seem to ever be executed.

I also get frequent IOError: connect: connection refused (ECONNREFUSED) errors when running the testset in one go (although the test case still passes?!). It looks like the CI jobs hit the same error. I could not reproduce this with the old solution without wkrs.

Example setup with logging and a longer wait:
setup = quote
    using Dagger, Distributed
    function _list_workers(ctx, state, task, tid, _)
        return procs(ctx)
    end
    # blocked is to guarantee that processing is not completed before we add new workers
    # Note: blocked is used in expressions below
    blocked = true
    function testfun(i)
        i < 4 && return myid()
        # Wait for test to do its thing before we proceed
        if blocked
            sleep(0.5) # just so we don't end up overflowing or something while waiting for workers to be added
            # Here we would like to just wait to be rescheduled on another worker (which is not blocked)
            # but this functionality does not exist, so instead we do this weird thing where we reschedule
            # until we end up on a non-blocked worker
            h = Dagger.Sch.sch_handle()
            wkrs = Dagger.Sch.exec!(_list_workers, h)
            id = if length(wkrs) > 2
                id = Dagger.Sch.add_thunk!(testfun, h, i; single=last(wkrs).pid)
                @info "After adding from wkrs: $id"
                id
            else
                id = Dagger.Sch.add_thunk!(testfun, h, i)
                @info "After adding to all $id"
                id
            end
            return fetch(h, id)
        end
        return myid()
    end
end

First I ran this:

ps = []
ps1 = addprocs(2, exeflags="--project")
append!(ps, ps1)

@everywhere vcat(ps1, myid()) $setup

ts = delayed(vcat)((delayed(testfun)(i) for i in 1:10)...)

ctx = Context(ps1)
job = @async collect(ctx, ts)

while !istaskstarted(job)
    sleep(0.001)
end

Which prints a steady stream of:

      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(194, MemPool.DRef(1, 182, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(195, MemPool.DRef(1, 183, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(196, MemPool.DRef(1, 184, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(197, MemPool.DRef(1, 185, 0x0000000000000250))
      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(198, MemPool.DRef(1, 186, 0x0000000000000250))
      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(199, MemPool.DRef(1, 187, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(200, MemPool.DRef(1, 188, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(201, MemPool.DRef(1, 189, 0x0000000000000250))
      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(202, MemPool.DRef(1, 190, 0x0000000000000250))
      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(203, MemPool.DRef(1, 191, 0x0000000000000250))

But then after running:

ps3 = addprocs(2, exeflags="--project")
append!(ps, ps3)
@everywhere ps3 $setup
addprocs!(ctx, ps3)
@test length(procs(ctx)) == 4

It just prints the following and then goes silent:

      From worker 3:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(204, MemPool.DRef(1, 192, 0x0000000000000250))
      From worker 2:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(205, MemPool.DRef(1, 193, 0x0000000000000250))
      From worker 3:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(206, MemPool.DRef(1, 194, 0x0000000000000250))
      From worker 2:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(207, MemPool.DRef(1, 195, 0x0000000000000250))
      From worker 2:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(208, MemPool.DRef(1, 196, 0x0000000000000250))
      From worker 3:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(209, MemPool.DRef(1, 197, 0x0000000000000250))
      From worker 3:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(210, MemPool.DRef(1, 198, 0x0000000000000250))

Which seems to indicate that thunks get stuck somehow. Trying to unblock does nothing:

julia> job
Task (runnable) @0x000000000e9a5b30

julia> @everywhere blocked = false

julia> job
Task (runnable) @0x000000000e9a5b30

julia> @everywhere blocked = false

julia> job
Task (runnable) @0x000000000e9a5b30

Also, if I try to log anything about wkrs (even something like p = last(wkrs).pid; @info "p = $p"), the process segfaults.

@DrChainsaw (Contributor):

> It seems like add_thunk! behaves differently with the single keyword: the added thunks don't seem to ever be executed.

OK, I think I found the root cause: dynamic_listener! is never called for new procs. Changing the signature to dynamic_listener!(ctx, state; tids = keys(state.worker_chans)) (and looping over tids in the main loop), and then calling it with only the new procs, makes everything tick along as expected.
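A hedged sketch of that signature change (illustrative, not the actual Dagger source; listen_one! and the exact shape of worker_chans are assumptions):

    # The keyword default preserves the old behavior of listening on every
    # worker; init_proc could instead pass only the newly added proc.
    function dynamic_listener!(ctx, state; tids = keys(state.worker_chans))
        for tid in tids
            chans = state.worker_chans[tid]             # assumed per-worker channels
            @async listen_one!(ctx, state, tid, chans)  # hypothetical per-worker listen loop
        end
    end

    # e.g. when a proc is added:
    # dynamic_listener!(ctx, state; tids = (new_pid,))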

I suppose the drawback is that there will be one more task listening for halts each time this is done, so perhaps refactoring dynamic_listener! to accept a single proc, and moving listener_tasks into state, would be preferable. One would then call dynamic_listener! from init_proc, since it's a required step for a proc to become operational.

I suppose one would also need/want a mechanism to clean up the listeners when procs are removed (although I dread dealing with the edge cases that this might create).

Let me know if you want a PR or a code suggestion for this. I do feel a bit bad for having added more moving parts with the add/remove procs tests, and I hope they're useful to someone other than me 😟

@krynju (Member) commented Dec 21, 2021:

I tried more of my benchmarks on this branch: there's no more crashing, and the work/data distribution is definitely improved.

Example: [screenshot of the improved work/data distribution]

@krynju (Member) commented Dec 22, 2021:

I got this interesting error log, but I'm not sure if it's related. I don't have that branch from Valentin with the Distributed fix to message passing; it might be that.

Log:
 Activating project at `C:\Users\krynjupc\WS\mgr_benchmark_setup\dtable`
    From worker 4:      Activating project at `C:\Users\krynjupc\WS\mgr_benchmark_setup\dtable`
    From worker 2:      Activating project at `C:\Users\krynjupc\WS\mgr_benchmark_setup\dtable`
    From worker 3:      Activating project at `C:\Users\krynjupc\WS\mgr_benchmark_setup\dtable`
@@@ TABLESIZE:       1600.0 MB
@@@ SAVING TO:       results\dtable_bench1640153009.csv
    From worker 4:    ┌ Error: Error on 4 while connecting to peer 3, exitingError in sending dynamic request:
no process with id 4 exists
Stacktrace:
[1] error(s::String)
 @ Base .\error.jl:33
[2] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
 @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1094
[3] worker_from_id
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1086 [inlined]
[4] #remotecall_fetch#158
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:494 [inlined]
[5] remotecall_fetch
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:494 [inlined]
[6] call_on_owner
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:567 [inlined]
[7] take!
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:763 [inlined]
[8] macro expansion
 @ C:\Users\krynjupc\.julia\dev\Dagger\src\sch\dynamic.jl:52 [inlined]
[9] (::Dagger.Sch.var"#38#42"{Context, Dagger.Sch.ComputeState, Task, RemoteChannel{Channel{Any}}, RemoteChannel{Channel{Any}}})()
 @ Dagger.Sch .\task.jl:466

┌ Error: Fatal error on process 1
│   exception =
│    attempt to send to unknown socket
│    Stacktrace:
│     [1] error(s::String)
│       @ Base .\error.jl:33
│     [2] send_msg_unknown(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:99
│     [3] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:115
│     [4] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::Nothing)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:95
│     [5] macro expansion
│       @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:286 [inlined]
│     [6] (::Distributed.var"#105#107"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
│       @ Distributed .\task.jl:466
└ @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:99
    From worker 4:    │   exception =
┌ Error: Fatal error on process 1
│   exception =
│    attempt to send to unknown socket
│    Stacktrace:
│     [1] error(s::String)
│       @ Base .\error.jl:33
│     [2] send_msg_unknown(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:99
│     [3] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:115
│     [4] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::Nothing)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:95
│     [5] macro expansion
│       @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:295 [inlined]
│     [6] (::Distributed.var"#109#111"{Distributed.CallWaitMsg, Distributed.MsgHeader, Sockets.TCPSocket})()
│       @ Distributed .\task.jl:466
└ @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:99
    From worker 4:ConcurrencyViolationError("lock must be held")
Worker 4 terminated.      From worker 4:        │    Stacktrace:

Error in eager scheduler:
TaskFailedException

  nested task error: no process with id 4 exists
  Stacktrace:
   [1] error(s::String)
     @ Base .\error.jl:33
   [2] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
     @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1094
   [3] worker_from_id
     @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1086 [inlined]
   [4] remote_do(::Function, ::Int64, ::Dagger.NoOpLog, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
     @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:559
   [5] remote_do(::Function, ::Int64, ::Dagger.NoOpLog, ::Vararg{Any})
     @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:559
   [6] (::Dagger.Sch.var"#117#119"{Context, Set{Dagger.Chunk}, Int64})()
     @ Dagger.Sch .\task.jl:466
Stacktrace:
[1] sync_end(c::Channel{Any})
 @ Base .\task.jl:424
[2] macro expansion
 @ .\task.jl:443 [inlined]
[3] evict_all_chunks!(ctx::Context, to_evict::Set{Dagger.Chunk})
 @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:783
[4] finish_task!(ctx::Context, state::Dagger.Sch.ComputeState, node::Thunk, thunk_failed::Bool)
 @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:778
[5] (::Dagger.Sch.var"#90#96"{Context, Dagger.Sch.ComputeState, OSProc, NamedTuple{(:pressure, :loadavg, :threadtime, :transfer_rate), Tuple{UInt64, Tuple{Float64, Float64, Float64}, UInt64, UInt64}}, RemoteException, Int64, Dagger.ThreadProc, Int64})()
 @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:451
[6] lock(f::Dagger.Sch.var"#90#96"{Context, Dagger.Sch.ComputeState, OSProc, NamedTuple{(:pressure, :loadavg, :threadtime, :transfer_rate), Tuple{UInt64, Tuple{Float64, Float64, Float64}, UInt64, UInt64}}, RemoteException, Int64, Dagger.ThreadProc, Int64}, l::ReentrantLock)
 @ Base .\lock.jl:183
[7] compute_dag(ctx::Context, d::Thunk; options::Dagger.Sch.SchedulerOptions)
 @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:407
[8] compute(ctx::Context, d::Thunk; options::Dagger.Sch.SchedulerOptions)
 @ Dagger C:\Users\krynjupc\.julia\dev\Dagger\src\compute.jl:31
[9] (::Dagger.Sch.var"#61#62"{Context})()
 @ Dagger.Sch .\task.jl:466
    From worker 4:    │      [1] concurrency_violation()
    From worker 4:    │        @ Base .\condition.jl:8
    From worker 4:    │      [2] assert_havelock
    From worker 4:    │        @ .\condition.jl:25 [inlined]
    From worker 4:    │      [3] assert_havelock
    From worker 4:    │        @ .\condition.jl:48 [inlined]
    From worker 4:    │      [4] assert_havelock
    From worker 4:    │        @ .\condition.jl:72 [inlined]
    From worker 4:    │      [5] notify(c::Condition, arg::Any, all::Bool, error::Bool)
    From worker 4:    │        @ Base .\condition.jl:144
    From worker 4:    │      [6] #notify#570
    From worker 4:    │        @ .\condition.jl:142 [inlined]
    From worker 4:    │      [7] set_worker_state
    From worker 4:    │        @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:148 [inlined]
    From worker 4:    │      [8] Distributed.Worker(id::Int64, r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, manager::Distributed.DefaultClusterManager; version::Nothing, config::WorkerConfig)
    From worker 4:    │        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:126
    From worker 4:    │      [9] connect_to_peer(manager::Distributed.DefaultClusterManager, rpid::Int64, wconfig::WorkerConfig)
    From worker 4:    │        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:356
    From worker 4:    │     [10] (::Distributed.var"#117#119"{Int64, WorkerConfig})()
    From worker 4:    │        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:342
    From worker 4:    │     [11] exec_conn_func(w::Distributed.Worker)
    From worker 4:    │        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:181
    From worker 4:    │     [12] (::Distributed.var"#17#20"{Distributed.Worker})()
    From worker 4:    │        @ Distributed .\task.jl:466
    From worker 4:    └ @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:362
    From worker 3:    ErrorException("Cookie read failed. Connection closed by peer.")CapturedException(ErrorException("Cookie read failed. Connection closed by peer."), Any[(error(s::String) at error.jl:33, 1), (process_hdr(s::Sockets.TCPSocket, validate_cookie::Bool) at process_messages.jl:251, 1), (message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool) at process_messages.jl:151, 1), (process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool) at process_messages.jl:126, 1), ((::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})() at task.jl:466, 1)])
    From worker 3:    Process(3) - Unknown remote, closing connection.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
[1] (::Base.var"#wait_locked#660")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
 @ Base .\stream.jl:941
[2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
 @ Base .\stream.jl:950
[3] unsafe_read
 @ .\io.jl:751 [inlined]
[4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
 @ Base .\io.jl:750
[5] read!
 @ .\io.jl:752 [inlined]
[6] deserialize_hdr_raw
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
[7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
 @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:165
[8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
 @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:126
[9] (::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
 @ Distributed .\task.jl:466
┌ Error: Error initializing worker OSProc(4)
│   exception =
│    KeyError: key 4 not found
│    Stacktrace:
│     [1] getindex
│       @ .\dict.jl:498 [inlined]
│     [2] (::Dagger.Sch.var"#74#79"{Dagger.Sch.ComputeState, OSProc})()
│       @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:280
│     [3] lock(f::Dagger.Sch.var"#74#79"{Dagger.Sch.ComputeState, OSProc}, l::ReentrantLock)
│       @ Base .\lock.jl:183
│     [4] init_proc(state::Dagger.Sch.ComputeState, p::OSProc, log_sink::Dagger.NoOpLog)
│       @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:257
│     [5] macro expansion
│       @ C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:364 [inlined]
│     [6] (::Dagger.Sch.var"#88#94"{Context, Dagger.Sch.ComputeState, OSProc})()
│       @ Dagger.Sch .\task.jl:466
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:366
    From worker 3:    ProcessExitedException(4)
    From worker 3:    Stacktrace:
    From worker 3:      [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
    From worker 3:        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1089
    From worker 3:      [2] worker_from_id
    From worker 3:        @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1086 [inlined]
    From worker 3:      [3] #remotecall_fetch#158
    From worker 3:        @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:494 [inlined]
    From worker 3:      [4] remotecall_fetch
    From worker 3:        @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:494 [inlined]
    From worker 3:      [5] #68
    From worker 3:        @ C:\Users\krynjupc\.julia\dev\Dagger\src\processor.jl:98 [inlined]
    From worker 3:      [6] get!(default::Dagger.var"#68#69"{Int64}, h::Dict{Int64, Vector{Dagger.Processor}}, key::Int64)
    From worker 3:        @ Base .\dict.jl:481
    From worker 3:      [7] OSProc
    From worker 3:        @ C:\Users\krynjupc\.julia\dev\Dagger\src\processor.jl:97 [inlined]
    From worker 3:      [8] evict_chunks!(log_sink::Dagger.NoOpLog, chunks::Set{Dagger.Chunk})
    From worker 3:        @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:789
    From worker 3:      [9] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    From worker 3:        @ Base .\essentials.jl:731
    From worker 3:     [10] invokelatest(::Any, ::Any, ::Vararg{Any})
    From worker 3:        @ Base .\essentials.jl:729
    From worker 3:     [11] (::Distributed.var"#114#116"{Distributed.RemoteDoMsg})()
    From worker 3:        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:301
    From worker 3:     [12] run_work_thunk(thunk::Distributed.var"#114#116"{Distributed.RemoteDoMsg}, print_error::Bool)
    From worker 3:        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:63
    From worker 3:     [13] (::Distributed.var"#113#115"{Distributed.RemoteDoMsg})()
    From worker 3:        @ Distributed .\task.jl:466fatal: error thrown and no exception handler available.
    From worker 3:    InterruptException()
    From worker 2:    fatal: error thrown and no exception handler available.
    From worker 2:    InterruptException()

@jpsamaroo jpsamaroo merged commit 217415c into master Dec 31, 2021
@jpsamaroo jpsamaroo deleted the jps/no-clobber-sch branch December 31, 2021 17:00