Don't clobber the scheduler #310

Merged: 14 commits merged into master on Dec 31, 2021

Conversation

@jpsamaroo (Member) commented Dec 2, 2021:

This PR is a set of bugfixes and optimizations that aim to improve the scheduler's ability to pump out tasks to workers and maximize throughput. It addresses the following issues:

  • The scheduler thread should generally be de-prioritized for running tasks, or else scheduling will hang while a blocking task executes. All other potentially eligible processors should be considered before the scheduler thread.
  • Blocking tasks on workers may strangle the task trying to launch work (in do_tasks); since that task holds the scheduler lock, this hangs the scheduler in fire_tasks!, which prevents new tasks from being added to the scheduler and launched. We shouldn't hold the lock during this time, and in the future we'll want a dedicated thread per worker to receive and launch work.
  • Checking for new workers happens far too frequently; we should make this event-driven instead. I intend to funnel Context modifications through a custom setproperty! call which will notify any listening tasks that the set of available workers has changed (@DrChainsaw). A sketch of this approach follows this list.
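Here is that sketch: a minimal illustration, not Dagger's actual implementation (MyContext, wait_for_worker_change, and the Vector{Int} worker set are stand-ins):

    # Funnel mutations through setproperty! and notify waiting tasks
    # whenever the set of available workers changes.
    mutable struct MyContext
        procs::Vector{Int}              # available worker IDs (simplified)
        proc_notify::Threads.Condition  # listeners block on this
        MyContext(procs) = new(procs, Threads.Condition())
    end

    function Base.setproperty!(ctx::MyContext, field::Symbol, value)
        setfield!(ctx, field, value)
        if field === :procs
            lock(ctx.proc_notify) do
                notify(ctx.proc_notify)  # wake tasks waiting on worker changes
            end
        end
        return value
    end

    # Instead of a while true; sleep(...) polling loop, a scheduler task
    # blocks here until the worker set actually changes:
    function wait_for_worker_change(ctx::MyContext)
        lock(ctx.proc_notify) do
            wait(ctx.proc_notify)
        end
    end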

Todo:

  • Refactor and test scheduler network optimization logic
  • Test that MemPool.approxsize returns non-nothing for important types
  • Find and fix hangs introduced by 22c235f

@DrChainsaw (Contributor):

> we should make this event-driven instead

Every time a while true; sleep ... loop dies the fiery death it deserves, an angel gets its wings.

IIRC, one thing that made me a bit nervous when I wrote the current code was race conditions where the same Thunk would be scheduled twice, which is why I tried to confine any asynchronous scheduling on new workers to happen only while waiting for work to finish.

@jpsamaroo (Member, Author):

Can you elaborate on this potential race condition? We shouldn't be able to double-schedule a thunk, and even if we do, it would probably not cause significant harm to the scheduler's internal consistency.

@DrChainsaw (Contributor):

> Can you elaborate on this potential race condition?

I can't think of anything specific, beyond a general worry about non-deterministic bugs caused by things running in an unexpected order when multiple tasks yield. I suppose double-scheduling a thunk shouldn't happen unless something firing off a thunk is threaded. It was probably more an attempt to keep the moving parts to a minimum.

@jpsamaroo (Member, Author):

That's a fair concern, although I think it's probably not something we need to worry about. At this point, the scheduler should be multitasking-safe, since a global lock is taken during modifications. If I implement an event-driven notification system, it will be built on a thread-safe primitive, and scheduler updates will be serialized with the scheduler lock. Slow, maybe, but also race-free.
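A quick sketch of that serialization pattern (names are illustrative; this is not Dagger's actual code):

    # Every scheduler-state mutation runs while holding one global lock,
    # so updates arriving from different tasks are serialized and race-free.
    const SCHED_LOCK = ReentrantLock()

    function workers_changed!(known_procs::Set{Int}, new_procs)
        lock(SCHED_LOCK) do
            union!(known_procs, new_procs)  # safe: no concurrent mutation here
        end
    end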

@jpsamaroo jpsamaroo force-pushed the jps/no-clobber-sch branch 2 times, most recently from d50360b to f2a2689 Compare December 8, 2021 18:54
@jpsamaroo jpsamaroo marked this pull request as ready for review December 8, 2021 21:46
@codecov-commenter commented Dec 8, 2021:

Codecov Report

Merging #310 (847044f) into master (566c35b) will not change coverage.
The diff coverage is 0.00%.


@@          Coverage Diff           @@
##           master    #310   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files          42      42           
  Lines        3596    3630   +34     
======================================
- Misses       3596    3630   +34     
Impacted Files Coverage Δ
src/processor.jl 0.00% <0.00%> (ø)
src/sch/Sch.jl 0.00% <0.00%> (ø)
src/sch/dynamic.jl 0.00% <0.00%> (ø)
src/sch/util.jl 0.00% <0.00%> (ø)
src/scopes.jl 0.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@krynju (Member) commented Dec 11, 2021:

I get a reproducible hang on this branch when running Dagger.groupby (it works fine on master): a full hang, with no CPU activity on the main thread or any other thread.

I'll get an MWE later, but here are the benchmark steps:

1. Clone https://github.com/krynju/dtable_benchmarks
2. Activate the environment and dev Dagger
3. Run: julia -t16 .\scripts_benchmark\dtable3.jl 10000000 1000000 1000 4

@jpsamaroo (Member, Author):

Yeah, I also get a hang somewhere on a large distributed benchmark. I plan to investigate before I merge.

@jpsamaroo jpsamaroo marked this pull request as draft December 12, 2021 15:13
@krynju (Member) commented Dec 12, 2021:

It looks a lot like the one we had in #284: it suddenly stops doing anything and just idles with no activity.

@jpsamaroo jpsamaroo force-pushed the jps/no-clobber-sch branch 2 times, most recently from 847044f to 5bd8fec Compare December 13, 2021 20:51
@jpsamaroo jpsamaroo marked this pull request as ready for review December 13, 2021 20:51
@jpsamaroo jpsamaroo requested a review from krynju December 13, 2021 23:44
@jpsamaroo (Member, Author):

@DrChainsaw I'd appreciate a review of the Context changes, if you get the chance.

@krynju (Member) commented Dec 14, 2021:

Hmm, sometimes I get crashes, but I managed to get my biggest (16 GB) groupby to work once, and the performance is about the same. It's definitely less stable than master, though; master didn't really crash for me in these benchmarks.

I just got a smaller data size to crash and throw this (in my experience, usually related to some race condition):


Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ff96c4a0ee8 -- RtlVirtualUnwind at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
in expression starting at C:\Users\krynjupc\WS\dtable_benchmarks\scripts_benchmark\dtable3.jl:15

@jpsamaroo (Member, Author):

That's very concerning; is there any chance you can run the benchmarks with a debug build of Julia? It might be that we've got a portion of ComputeState-modifying code accidentally running outside of the global lock.

@jpsamaroo (Member, Author):

> I managed to get my biggest (16 GB) groupby to work once, and the performance is about the same

Yeah, performance may be somewhat similar on non-distributed workloads, since work is over-subscribed first and then executed (so you get cycles of scheduling followed by execution, and clobbering never becomes a problem).

For me, a distributed benchmark of heavy BLAS operations was helped a lot by this, as well as #165 .

@krynju (Member) commented Dec 14, 2021:

Ah, never mind: it's usually faster, sometimes even 2x faster in some longer runs.

https://pastebin.com/e76V1DZQ

@krynju (Member) commented Dec 14, 2021:

Logs with debug enabled:

PS C:\Users\krynjupc\WS\dtable_benchmarks> c:\Users\krynjupc\WS\dtable_benchmarks\run.ps1

  Activating project at `C:\Users\krynjupc\WS\dtable_benchmarks`
tablesize 1600.0 MB
saving results to dtable_bench1639504217.csv
┌ Debug: (1) eager_thunk (1) Using available Dagger.ThreadProc(1, 5): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (2) Using available Dagger.ThreadProc(1, 11): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (3) Using available Dagger.ThreadProc(1, 12): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (4) Using available Dagger.ThreadProc(1, 7): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (5) Using available Dagger.ThreadProc(1, 3): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (7) Using available Dagger.ThreadProc(1, 9): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (6) Using available Dagger.ThreadProc(1, 8): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (8) Using available Dagger.ThreadProc(1, 2): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (9) Using available Dagger.ThreadProc(1, 16): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (10) Using available Dagger.ThreadProc(1, 6): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (2) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (4) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (8) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (7) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (6) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (5) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (3) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (9) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (10) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (12) Using available Dagger.ThreadProc(1, 13): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (11) Using available Dagger.ThreadProc(1, 15): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (14) Using available Dagger.ThreadProc(1, 11): 364767300 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (12) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (13) Using available Dagger.ThreadProc(1, 4): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (11) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (14) Releasing Dagger.ThreadProc: 364767300 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (13) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (16) Using available Dagger.ThreadProc(1, 9): 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (17) Using available Dagger.ThreadProc(1, 6): 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (15) Using available Dagger.ThreadProc(1, 3): 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (16) Releasing Dagger.ThreadProc: 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (17) Releasing Dagger.ThreadProc: 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (18) Using available Dagger.ThreadProc(1, 12): 92664743 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (15) Releasing Dagger.ThreadProc: 184891886 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (19) Using available Dagger.ThreadProc(1, 15): 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (18) Releasing Dagger.ThreadProc: 92664743 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (19) Releasing Dagger.ThreadProc: 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (20) Using available Dagger.ThreadProc(1, 2): 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (21) Using available Dagger.ThreadProc(1, 11): 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (20) Releasing Dagger.ThreadProc: 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (21) Releasing Dagger.ThreadProc: 48001617 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (23) Using available Dagger.ThreadProc(1, 10): 6526314 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (22) Using available Dagger.ThreadProc(1, 8): 12274329 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (24) Using available Dagger.ThreadProc(1, 13): 82865303 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (23) Releasing Dagger.ThreadProc: 6526314 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (25) Using available Dagger.ThreadProc(1, 7): 82865303 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (22) Releasing Dagger.ThreadProc: 12274329 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (24) Releasing Dagger.ThreadProc: 82865303 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (25) Releasing Dagger.ThreadProc: 82865303 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (26) Using available Dagger.ThreadProc(1, 6): 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (27) Using available Dagger.ThreadProc(1, 16): 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (26) Releasing Dagger.ThreadProc: 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (27) Releasing Dagger.ThreadProc: 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (29) Using available Dagger.ThreadProc(1, 9): 10667987 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (30) Using available Dagger.ThreadProc(1, 8): 1773385 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (28) Using available Dagger.ThreadProc(1, 3): 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (29) Releasing Dagger.ThreadProc: 10667987 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (30) Releasing Dagger.ThreadProc: 1773385 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (31) Using available Dagger.ThreadProc(1, 13): 1773385 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (28) Releasing Dagger.ThreadProc: 21008475 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (32) Using available Dagger.ThreadProc(1, 11): 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (31) Releasing Dagger.ThreadProc: 1773385 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (32) Releasing Dagger.ThreadProc: 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (33) Using available Dagger.ThreadProc(1, 14): 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (34) Using available Dagger.ThreadProc(1, 4): 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (33) Releasing Dagger.ThreadProc: 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (34) Releasing Dagger.ThreadProc: 729671 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (36) Using available Dagger.ThreadProc(1, 12): 611821 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (35) Using available Dagger.ThreadProc(1, 2): 478442 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (36) Releasing Dagger.ThreadProc: 611821 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (37) Using available Dagger.ThreadProc(1, 11): 512880 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (35) Releasing Dagger.ThreadProc: 478442 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (37) Releasing Dagger.ThreadProc: 512880 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (38) Using available Dagger.ThreadProc(1, 13): 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (39) Using available Dagger.ThreadProc(1, 6): 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (38) Releasing Dagger.ThreadProc: 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (39) Releasing Dagger.ThreadProc: 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (41) Using available Dagger.ThreadProc(1, 15): 342435 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (40) Using available Dagger.ThreadProc(1, 9): 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) + (41) Releasing Dagger.ThreadProc: 342435 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) + (40) Releasing Dagger.ThreadProc: 380470 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) #541 (42) Using available Dagger.ThreadProc(1, 8): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (44) Using available Dagger.ThreadProc(1, 10): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (45) Using available Dagger.ThreadProc(1, 3): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (46) Using available Dagger.ThreadProc(1, 13): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (47) Using available Dagger.ThreadProc(1, 12): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) #541 (42) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (48) Using available Dagger.ThreadProc(1, 16): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (49) Using available Dagger.ThreadProc(1, 11): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (53) Using available Dagger.ThreadProc(1, 9): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (52) Using available Dagger.ThreadProc(1, 7): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (50) Using available Dagger.ThreadProc(1, 6): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (51) Using available Dagger.ThreadProc(1, 15): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) build_groupby_index (43) Using available Dagger.ThreadProc(1, 4): 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:973
┌ Debug: (1) distinct_partitions (47) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (45) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (46) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (44) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013
┌ Debug: (1) distinct_partitions (48) Releasing Dagger.ThreadProc: 1000000000 | 0/18446744073709551615
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:1013

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ff96c4a0ee8 -- RtlVirtualUnwind at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
in expression starting at C:\Users\krynjupc\WS\dtable_benchmarks\scripts_benchmark\dtable2.jl:15

[The fault report above repeats identically 14 more times.]

@DrChainsaw (Contributor):

Can one add new Thunks to an ongoing computation with add_thunk!? If so, I could take a stab at fixing the add/rm procs test with it.

@jpsamaroo (Member, Author):

> Can one add new Thunks to an ongoing computation with add_thunk!?

Absolutely! That's how the eager scheduler is implemented: as a regular thunk running on a regular scheduler, listening on a channel and constructing new Thunks with add_thunk!.
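A hedged sketch of that pattern (sch_handle, add_thunk!, and fetch(h, id) are the dynamic-scheduler calls used elsewhere in this thread; work_chan and the submitted tuple format are illustrative):

    using Dagger

    # A long-running thunk that listens on a Channel and extends the
    # live DAG with new Thunks as work arrives.
    function eager_loop(work_chan::Channel)
        h = Dagger.Sch.sch_handle()  # handle into the running scheduler
        while true
            f, args, result_chan = take!(work_chan)
            id = Dagger.Sch.add_thunk!(f, h, args...)  # add to the ongoing computation
            put!(result_chan, fetch(h, id))            # wait for and forward the result
        end
    end

(A real implementation would fetch results asynchronously, so one slow thunk doesn't block new submissions.)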

@DrChainsaw (Contributor):

Here is a patched version of the add-workers test, which should no longer be unreliable. I could not get add_thunk! to use the existing context, so I had to disable the part where we add workers that we don't want to participate in the computation.

Let me know if it's acceptable and I'll try to fix the remove-workers test too.

Add procs test
    setup = quote
        using Dagger, Distributed
        # blocked is to guarantee that processing is not completed before we add new workers
        # Note: blocked is used in expressions below
        blocked = true
        function testfun(i)
            i < 4 && return myid()
            # Wait for test to do its thing before we proceed
            if blocked
                sleep(0.1) # just so we don't end up overflowing or something while waiting for workers to be added
                # Here we would like to just wait to be rescheduled on another worker (which is not blocked)
                # but this functionality does not exist, so instead we do this weird thing where we reschedule
                # until we end up on a non-blocked worker
                h = Dagger.Sch.sch_handle()
                id = Dagger.Sch.add_thunk!(testfun, h, i)
                return fetch(h, id)
            end
            return myid()
        end
    end

    @testset "Add new workers" begin
        ps = []
        try
            ps1 = addprocs(2, exeflags="--project")
            append!(ps, ps1)

            @everywhere vcat(ps1, myid()) $setup

            ts = delayed(vcat)((delayed(testfun)(i) for i in 1:10)...)

            ctx = Context(ps1)
            job = @async collect(ctx, ts)

            while !istaskstarted(job)
                sleep(0.001)
            end

            # Will not be added, so they should never appear in output
            # TODO: Does not work: add_thunk! seems to create a new context using all available workers :(
            #ps2 = addprocs(2, exeflags="--project")
            #append!(ps, ps2)

            ps3 = addprocs(2, exeflags="--project")
            append!(ps, ps3)
            @everywhere ps3 $setup
            addprocs!(ctx, ps3)
            @test length(procs(ctx)) == 4

            @everywhere ps3 blocked=false

            ps_used = fetch(job)
            @test ps_used isa Vector

            @test any(p -> p in ps_used, ps1)
            @test any(p -> p in ps_used, ps3)
            #@test !any(p -> p in ps_used, ps2)
        finally
            wait(rmprocs(ps))
        end
    end

@jpsamaroo (Member, Author):

> id = Dagger.Sch.add_thunk!(testfun, h, i)

It'd probably be better to call into the scheduler (with Sch.exec!), check whether the new workers are available in procs(ctx) (which is passed to the called function as its first argument), and if so, do Dagger.Sch.add_thunk!(testfun, h, i; single=wid); if not, reschedule and wait, as you're doing now.

> Let me know if it's acceptable and I'll try to fix the remove-workers test too.

Yes please!

@jpsamaroo (Member, Author):

I'll update your new test approach to use my suggestion.

@jpsamaroo (Member, Author):

Wonderfully enough, this PR seems to actually make fault handling more robust.

@DrChainsaw (Contributor):

I tried a little to fix the remove-procs test, but I got stuck. It seems like add_thunk! behaves differently with the single keyword: the added thunks don't seem to ever be executed.

I also get frequent IOError: connect: connection refused (ECONNREFUSED) errors when running the testset in one go (although the test case still passes?!). It looks like the CI jobs hit the same error. I could not reproduce this with the old solution without wkrs.

Example setup with logging and a longer wait:
setup = quote
    using Dagger, Distributed
    function _list_workers(ctx, state, task, tid, _)
        return procs(ctx)
    end
    # blocked is to guarantee that processing is not completed before we add new workers
    # Note: blocked is used in expressions below
    blocked = true
    function testfun(i)
        i < 4 && return myid()
        # Wait for test to do its thing before we proceed
        if blocked
            sleep(0.5) # just so we don't end up overflowing or something while waiting for workers to be added
            # Here we would like to just wait to be rescheduled on another worker (which is not blocked)
            # but this functionality does not exist, so instead we do this weird thing where we reschedule
            # until we end up on a non-blocked worker
            h = Dagger.Sch.sch_handle()
            wkrs = Dagger.Sch.exec!(_list_workers, h)
            id = if length(wkrs) > 2
                id = Dagger.Sch.add_thunk!(testfun, h, i; single=last(wkrs).pid)
                @info "After adding from wkrs: $id"
                id
            else
                id = Dagger.Sch.add_thunk!(testfun, h, i)
                @info "After adding to all $id"
                id
            end
            return fetch(h, id)
        end
        return myid()
    end
end

First I ran this:

ps = []
ps1 = addprocs(2, exeflags="--project")
append!(ps, ps1)

@everywhere vcat(ps1, myid()) $setup

ts = delayed(vcat)((delayed(testfun)(i) for i in 1:10)...)

ctx = Context(ps1)
job = @async collect(ctx, ts)

while !istaskstarted(job)
    sleep(0.001)
end

Which prints a steady stream of:

      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(194, MemPool.DRef(1, 182, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(195, MemPool.DRef(1, 183, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(196, MemPool.DRef(1, 184, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(197, MemPool.DRef(1, 185, 0x0000000000000250))
      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(198, MemPool.DRef(1, 186, 0x0000000000000250))
      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(199, MemPool.DRef(1, 187, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(200, MemPool.DRef(1, 188, 0x0000000000000250))
      From worker 2:    [ Info: After adding to all Dagger.Sch.ThunkID(201, MemPool.DRef(1, 189, 0x0000000000000250))
      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(202, MemPool.DRef(1, 190, 0x0000000000000250))
      From worker 3:    [ Info: After adding to all Dagger.Sch.ThunkID(203, MemPool.DRef(1, 191, 0x0000000000000250))

But then after running:

ps3 = addprocs(2, exeflags="--project")
append!(ps, ps3)
@everywhere ps3 $setup
addprocs!(ctx, ps3)
@test length(procs(ctx)) == 4

It just prints the following and then goes silent:

      From worker 3:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(204, MemPool.DRef(1, 192, 0x0000000000000250))
      From worker 2:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(205, MemPool.DRef(1, 193, 0x0000000000000250))
      From worker 3:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(206, MemPool.DRef(1, 194, 0x0000000000000250))
      From worker 2:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(207, MemPool.DRef(1, 195, 0x0000000000000250))
      From worker 2:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(208, MemPool.DRef(1, 196, 0x0000000000000250))
      From worker 3:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(209, MemPool.DRef(1, 197, 0x0000000000000250))
      From worker 3:    [ Info: After adding from wkrs: Dagger.Sch.ThunkID(210, MemPool.DRef(1, 198, 0x0000000000000250))

Which seems to indicate that thunks get stuck somehow. Trying to unblock does nothing:

julia> job
Task (runnable) @0x000000000e9a5b30

julia> @everywhere blocked = false

julia> job
Task (runnable) @0x000000000e9a5b30

julia> @everywhere blocked = false

julia> job
Task (runnable) @0x000000000e9a5b30

Also, if I try to log anything about wkrs (even something like p = last(wkrs).pid; @info "p = $p"), the process segfaults.

@DrChainsaw (Contributor):

> It seems like add_thunk! behaves differently with the single keyword: the added thunks don't seem to ever be executed.

OK, I think I found the root cause: dynamic_listener! is never called for new procs. Changing the signature to dynamic_listener!(ctx, state; tids = keys(state.worker_chans)) (and looping over tids in the main loop), and then calling it with only the new procs, makes everything tick along as expected.
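A hedged sketch of that signature change (illustrative, not the actual Dagger source; listen_one! and the exact shape of worker_chans are assumptions):

    # The keyword default preserves the old behavior of listening on every
    # worker; init_proc could instead pass only the newly added proc.
    function dynamic_listener!(ctx, state; tids = keys(state.worker_chans))
        for tid in tids
            chans = state.worker_chans[tid]             # assumed per-worker channels
            @async listen_one!(ctx, state, tid, chans)  # hypothetical per-worker listen loop
        end
    end

    # e.g. when a proc is added:
    # dynamic_listener!(ctx, state; tids = (new_pid,))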

I suppose the drawback is that there will be one more task listening for halts each time this is done, so perhaps refactoring dynamic_listener! to accept a single proc, and moving listener_tasks into state, would be preferable. One would then call dynamic_listener! from init_proc, since it's a required step for a proc to become operational.

I suppose one would also need/want a mechanism to clean up the listeners when procs are removed (although I dread dealing with the edge cases that this might create).

Let me know if you want a PR or a code suggestion for this. I do feel a bit bad for having added more moving parts with the add/remove procs tests, and I hope they're useful to someone other than me 😟

@krynju (Member) commented Dec 21, 2021:

I tried more of my benchmarks on this branch: there's no more crashing, and the work/data distribution is definitely improved.

Example: [screenshot of the improved work/data distribution]

@krynju (Member) commented Dec 22, 2021:

I got this interesting error log, but I'm not sure if it's related. I don't have that branch from Valentin with the Distributed fix to message passing; it might be that.

Log:
 Activating project at `C:\Users\krynjupc\WS\mgr_benchmark_setup\dtable`
    From worker 4:      Activating project at `C:\Users\krynjupc\WS\mgr_benchmark_setup\dtable`
    From worker 2:      Activating project at `C:\Users\krynjupc\WS\mgr_benchmark_setup\dtable`
    From worker 3:      Activating project at `C:\Users\krynjupc\WS\mgr_benchmark_setup\dtable`
@@@ TABLESIZE:       1600.0 MB
@@@ SAVING TO:       results\dtable_bench1640153009.csv
    From worker 4:    ┌ Error: Error on 4 while connecting to peer 3, exitingError in sending dynamic request:
no process with id 4 exists
Stacktrace:
[1] error(s::String)
 @ Base .\error.jl:33
[2] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
 @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1094
[3] worker_from_id
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1086 [inlined]
[4] #remotecall_fetch#158
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:494 [inlined]
[5] remotecall_fetch
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:494 [inlined]
[6] call_on_owner
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:567 [inlined]
[7] take!
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:763 [inlined]
[8] macro expansion
 @ C:\Users\krynjupc\.julia\dev\Dagger\src\sch\dynamic.jl:52 [inlined]
[9] (::Dagger.Sch.var"#38#42"{Context, Dagger.Sch.ComputeState, Task, RemoteChannel{Channel{Any}}, RemoteChannel{Channel{Any}}})()
 @ Dagger.Sch .\task.jl:466

┌ Error: Fatal error on process 1
│   exception =
│    attempt to send to unknown socket
│    Stacktrace:
│     [1] error(s::String)
│       @ Base .\error.jl:33
│     [2] send_msg_unknown(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:99
│     [3] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:115
│     [4] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::Nothing)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:95
│     [5] macro expansion
│       @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:286 [inlined]
│     [6] (::Distributed.var"#105#107"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
│       @ Distributed .\task.jl:466
└ @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:99
    From worker 4:    │   exception =
┌ Error: Fatal error on process 1
│   exception =
│    attempt to send to unknown socket
│    Stacktrace:
│     [1] error(s::String)
│       @ Base .\error.jl:33
│     [2] send_msg_unknown(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:99
│     [3] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:115
│     [4] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::Nothing)
│       @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:95
│     [5] macro expansion
│       @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:295 [inlined]
│     [6] (::Distributed.var"#109#111"{Distributed.CallWaitMsg, Distributed.MsgHeader, Sockets.TCPSocket})()
│       @ Distributed .\task.jl:466
└ @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:99
    From worker 4:ConcurrencyViolationError("lock must be held")
Worker 4 terminated.      From worker 4:        │    Stacktrace:

Error in eager scheduler:
TaskFailedException

  nested task error: no process with id 4 exists
  Stacktrace:
   [1] error(s::String)
     @ Base .\error.jl:33
   [2] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
     @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1094
   [3] worker_from_id
     @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1086 [inlined]
   [4] remote_do(::Function, ::Int64, ::Dagger.NoOpLog, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
     @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:559
   [5] remote_do(::Function, ::Int64, ::Dagger.NoOpLog, ::Vararg{Any})
     @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:559
   [6] (::Dagger.Sch.var"#117#119"{Context, Set{Dagger.Chunk}, Int64})()
     @ Dagger.Sch .\task.jl:466
Stacktrace:
[1] sync_end(c::Channel{Any})
 @ Base .\task.jl:424
[2] macro expansion
 @ .\task.jl:443 [inlined]
[3] evict_all_chunks!(ctx::Context, to_evict::Set{Dagger.Chunk})
 @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:783
[4] finish_task!(ctx::Context, state::Dagger.Sch.ComputeState, node::Thunk, thunk_failed::Bool)
 @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:778
[5] (::Dagger.Sch.var"#90#96"{Context, Dagger.Sch.ComputeState, OSProc, NamedTuple{(:pressure, :loadavg, :threadtime, :transfer_rate), Tuple{UInt64, Tuple{Float64, Float64, Float64}, UInt64, UInt64}}, RemoteException, Int64, Dagger.ThreadProc, Int64})()
 @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:451
[6] lock(f::Dagger.Sch.var"#90#96"{Context, Dagger.Sch.ComputeState, OSProc, NamedTuple{(:pressure, :loadavg, :threadtime, :transfer_rate), Tuple{UInt64, Tuple{Float64, Float64, Float64}, UInt64, UInt64}}, RemoteException, Int64, Dagger.ThreadProc, Int64}, l::ReentrantLock)
 @ Base .\lock.jl:183
[7] compute_dag(ctx::Context, d::Thunk; options::Dagger.Sch.SchedulerOptions)
 @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:407
[8] compute(ctx::Context, d::Thunk; options::Dagger.Sch.SchedulerOptions)
 @ Dagger C:\Users\krynjupc\.julia\dev\Dagger\src\compute.jl:31
[9] (::Dagger.Sch.var"#61#62"{Context})()
 @ Dagger.Sch .\task.jl:466
    From worker 4:    │      [1] concurrency_violation()
    From worker 4:    │        @ Base .\condition.jl:8
    From worker 4:    │      [2] assert_havelock
    From worker 4:    │        @ .\condition.jl:25 [inlined]
    From worker 4:    │      [3] assert_havelock
    From worker 4:    │        @ .\condition.jl:48 [inlined]
    From worker 4:    │      [4] assert_havelock
    From worker 4:    │        @ .\condition.jl:72 [inlined]
    From worker 4:    │      [5] notify(c::Condition, arg::Any, all::Bool, error::Bool)
    From worker 4:    │        @ Base .\condition.jl:144
    From worker 4:    │      [6] #notify#570
    From worker 4:    │        @ .\condition.jl:142 [inlined]
    From worker 4:    │      [7] set_worker_state
    From worker 4:    │        @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:148 [inlined]
    From worker 4:    │      [8] Distributed.Worker(id::Int64, r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, manager::Distributed.DefaultClusterManager; version::Nothing, config::WorkerConfig)
    From worker 4:    │        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:126
    From worker 4:    │      [9] connect_to_peer(manager::Distributed.DefaultClusterManager, rpid::Int64, wconfig::WorkerConfig)
    From worker 4:    │        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:356
    From worker 4:    │     [10] (::Distributed.var"#117#119"{Int64, WorkerConfig})()
    From worker 4:    │        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:342
    From worker 4:    │     [11] exec_conn_func(w::Distributed.Worker)
    From worker 4:    │        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:181
    From worker 4:    │     [12] (::Distributed.var"#17#20"{Distributed.Worker})()
    From worker 4:    │        @ Distributed .\task.jl:466
    From worker 4:    └ @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:362
    From worker 3:    ErrorException("Cookie read failed. Connection closed by peer.")CapturedException(ErrorException("Cookie read failed. Connection closed by peer."), Any[(error(s::String) at error.jl:33, 1), (process_hdr(s::Sockets.TCPSocket, validate_cookie::Bool) at process_messages.jl:251, 1), (message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool) at process_messages.jl:151, 1), (process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool) at process_messages.jl:126, 1), ((::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})() at task.jl:466, 1)])
    From worker 3:    Process(3) - Unknown remote, closing connection.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
[1] (::Base.var"#wait_locked#660")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
 @ Base .\stream.jl:941
[2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
 @ Base .\stream.jl:950
[3] unsafe_read
 @ .\io.jl:751 [inlined]
[4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
 @ Base .\io.jl:750
[5] read!
 @ .\io.jl:752 [inlined]
[6] deserialize_hdr_raw
 @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\messages.jl:167 [inlined]
[7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
 @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:165
[8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
 @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:126
[9] (::Distributed.var"#99#100"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
 @ Distributed .\task.jl:466
┌ Error: Error initializing worker OSProc(4)
│   exception =
│    KeyError: key 4 not found
│    Stacktrace:
│     [1] getindex
│       @ .\dict.jl:498 [inlined]
│     [2] (::Dagger.Sch.var"#74#79"{Dagger.Sch.ComputeState, OSProc})()
│       @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:280
│     [3] lock(f::Dagger.Sch.var"#74#79"{Dagger.Sch.ComputeState, OSProc}, l::ReentrantLock)
│       @ Base .\lock.jl:183
│     [4] init_proc(state::Dagger.Sch.ComputeState, p::OSProc, log_sink::Dagger.NoOpLog)
│       @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:257
│     [5] macro expansion
│       @ C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:364 [inlined]
│     [6] (::Dagger.Sch.var"#88#94"{Context, Dagger.Sch.ComputeState, OSProc})()
│       @ Dagger.Sch .\task.jl:466
└ @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:366
    From worker 3:    ProcessExitedException(4)
    From worker 3:    Stacktrace:
    From worker 3:      [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
    From worker 3:        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1089
    From worker 3:      [2] worker_from_id
    From worker 3:        @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\cluster.jl:1086 [inlined]
    From worker 3:      [3] #remotecall_fetch#158
    From worker 3:        @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:494 [inlined]
    From worker 3:      [4] remotecall_fetch
    From worker 3:        @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:494 [inlined]
    From worker 3:      [5] #68
    From worker 3:        @ C:\Users\krynjupc\.julia\dev\Dagger\src\processor.jl:98 [inlined]
    From worker 3:      [6] get!(default::Dagger.var"#68#69"{Int64}, h::Dict{Int64, Vector{Dagger.Processor}}, key::Int64)
    From worker 3:        @ Base .\dict.jl:481
    From worker 3:      [7] OSProc
    From worker 3:        @ C:\Users\krynjupc\.julia\dev\Dagger\src\processor.jl:97 [inlined]
    From worker 3:      [8] evict_chunks!(log_sink::Dagger.NoOpLog, chunks::Set{Dagger.Chunk})
    From worker 3:        @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\Sch.jl:789
    From worker 3:      [9] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    From worker 3:        @ Base .\essentials.jl:731
    From worker 3:     [10] invokelatest(::Any, ::Any, ::Vararg{Any})
    From worker 3:        @ Base .\essentials.jl:729
    From worker 3:     [11] (::Distributed.var"#114#116"{Distributed.RemoteDoMsg})()
    From worker 3:        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:301
    From worker 3:     [12] run_work_thunk(thunk::Distributed.var"#114#116"{Distributed.RemoteDoMsg}, print_error::Bool)
    From worker 3:        @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\process_messages.jl:63
    From worker 3:     [13] (::Distributed.var"#113#115"{Distributed.RemoteDoMsg})()
    From worker 3:        @ Distributed .\task.jl:466fatal: error thrown and no exception handler available.
    From worker 3:    InterruptException()
    From worker 2:    fatal: error thrown and no exception handler available.
    From worker 2:    InterruptException()

@jpsamaroo jpsamaroo merged commit 217415c into master Dec 31, 2021
@jpsamaroo jpsamaroo deleted the jps/no-clobber-sch branch December 31, 2021 17:00