
Adding processes and using eager API produces warnings about workers dying #536


Closed
m-fila opened this issue Jun 22, 2024 · 9 comments · Fixed by #537
Comments

@m-fila
Contributor

m-fila commented Jun 22, 2024

Adding extra processes and scheduling with the eager API seems to produce an error and warnings about rescheduling due to workers dying. For example, this snippet taken from the README:

using Distributed; addprocs() # Add one Julia worker per CPU core
using Dagger

# This runs first:
a = Dagger.@spawn rand(100, 100)

# These run in parallel:
b = Dagger.@spawn sum(a)
c = Dagger.@spawn prod(a)

# Finally, this runs:
wait(Dagger.@spawn println("b: ", b, ", c: ", c))

Running it gives the following output:

      From worker 2:    b: 5061.860461804876, c: 0.0
┌ Warning: Worker 2 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Error: Error assigning workers
│   exception = ProcessExitedException(2)
│    Stacktrace:
│     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
│       @ Distributed ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1093
│     [2] worker_from_id
│       @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/cluster.jl:1090 [inlined]
│     [3] remote_do
│       @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/remotecall.jl:557 [inlined]
│     [4] cleanup_proc(state::Dagger.Sch.ComputeState, p::OSProc, log_sink::TimespanLogging.NoOpLog)
│       @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:408
│     [5] monitor_procs_changed!(ctx::Context, state::Dagger.Sch.ComputeState)
│       @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:890
│     [6] (::Dagger.Sch.var"#100#102"{Context, Dagger.Sch.ComputeState})()
│       @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:508
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:510
┌ Warning: Worker 3 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 12 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 15 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 13 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 17 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 14 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 8 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 11 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 4 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 5 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 10 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 6 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 7 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 16 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545
┌ Warning: Worker 9 died, rescheduling work
└ @ Dagger.Sch ~/.julia/packages/Dagger/kBlIi/src/sch/Sch.jl:545

Sometimes the error is omitted, but the warnings about workers dying are still present.
If the lazy API is used instead, there are no warnings or errors.
The warnings seem to be harmless, since they only appear while the job is finishing.
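
For reference, a lazy-API variant of the snippet above would look roughly like this (a sketch; it assumes Dagger.delayed and collect are the intended lazy workflow, so exact calls may differ between Dagger versions):

using Distributed; addprocs() # Add one Julia worker per CPU core
using Dagger

# Build the same graph lazily, then materialize it at the end:
a = Dagger.delayed(rand)(100, 100)
b = Dagger.delayed(sum)(a)
c = Dagger.delayed(prod)(a)
d = Dagger.delayed((x, y) -> println("b: ", x, ", c: ", y))(b, c)
collect(d) # runs the whole graph; no worker-death warnings observed here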

versioninfo:

Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 5700G with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Dagger: 0.18.11
I couldn't find any duplicates of this issue.

@JamesWrigley
Collaborator

Could you try on master? I believe this was fixed in #532.
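
For example, one way to test against the unreleased master branch (an assumed workflow, not something specified in this thread):

import Pkg
Pkg.add(name="Dagger", rev="master") # track the master branch instead of the registered release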

@m-fila
Contributor Author

m-fila commented Jun 22, 2024

Thank you. I tried master; the error is gone, but the warnings are still there.

@JamesWrigley
Collaborator

Yeah, I think the warnings will have to stay, unless we bring back Dagger.cleanup() for users to explicitly clean things up. They can be safely ignored though, so I'll close this.

@jpsamaroo
Member

If those warnings are happening during a clean Julia shutdown, then we need to improve our fault tolerance logic to properly detect a clean shutdown and thus not emit these warnings, since they're quite scary to see. @m-fila can you confirm that these occur during a Julia exit?

@m-fila
Contributor Author

m-fila commented Jun 23, 2024

Yes, I confirm

@jpsamaroo
Member

Ok, then re-opening this issue since we need to properly silence these warnings.

@jpsamaroo
Member

@m-fila can you please validate that #537 makes the warnings go away for you? It works for me locally.

@m-fila
Contributor Author

m-fila commented Jun 23, 2024

Yes, they are gone with #537. Thanks!

The warnings still appear, though, if the workers are removed with workers() |> rmprocs.

@jpsamaroo
Member

Yeah, that's a separate issue, because in this case Dagger has no idea that it was intentional for the workers to exit (Distributed.jl doesn't communicate this distinction to Dagger). You would need to call Dagger.rmprocs!(Dagger.Sch.eager_context(), workers()) before calling rmprocs to allow Dagger time to properly clean up the workers.
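
In code, that ordering would look roughly like this (a sketch based on the call named in this comment; it assumes the eager scheduler has already been started):

using Distributed, Dagger

ws = workers()
Dagger.rmprocs!(Dagger.Sch.eager_context(), ws) # let Dagger retire the workers first
rmprocs(ws) # then remove them from Distributed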

SmalRat added a commit to SmalRat/key4hep-julia-fwk that referenced this issue Jun 25, 2024
m-fila added a commit to m-fila/key4hep-julia-fwk that referenced this issue Jun 25, 2024
SmalRat added a commit to SmalRat/key4hep-julia-fwk that referenced this issue Jun 25, 2024
m-fila added a commit to key4hep/key4hep-julia-fwk that referenced this issue Jun 25, 2024