mapping Dagger.@spawn with remote workers can cause one hung thunk #254

Closed
kleinschmidt opened this issue Aug 10, 2021 · 6 comments · Fixed by #264

Comments

@kleinschmidt

MWE:

using Pkg
pkg"activate --temp"
pkg"add Dagger Tables"

using Dagger, Tables

using Distributed
addprocs(1)

@everywhere begin
    using Pkg
    Pkg.activate($(Pkg.project().path))
    Pkg.instantiate()
    using Dagger
end

t = (a=rand(10), b=rand(10))
f(row) = row.a + row.b

thunks = map(Tables.rows(t)) do row
    Dagger.@spawn f(row)
end

This consistently results in:

10-element Vector{Dagger.EagerThunk}:
 EagerThunk (finished)
 EagerThunk (running)
 EagerThunk (finished)
 EagerThunk (finished)
 EagerThunk (finished)
 EagerThunk (finished)
 EagerThunk (finished)
 EagerThunk (finished)
 EagerThunk (finished)
 EagerThunk (finished)

That is, the second thunk is always left permanently in the running state until the worker is killed and the job is rescheduled. This also occurs with two worker procs instead of one, but not with zero.

@kolia

kolia commented Aug 10, 2021

Can repro the MWE.

However I'm a little mystified as to why any of the thunks other than thunks[2] works at all, since f is only defined on pid 1 and hasn't been defined @everywhere.

When I define @everywhere f(row) = ..., it works.
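For reference, a minimal sketch of the working variant described above. It assumes Dagger and Tables are already installed in the environment the worker processes use (the MWE's `Pkg.activate`/`Pkg.instantiate` step is left out here), and loading Tables on the workers is an extra precaution of this sketch, not something reported as required in this thread.

```julia
using Distributed
addprocs(1)

# Assumption: both packages are available in the workers' environment.
@everywhere using Dagger, Tables

# Define f on every process, not only on pid 1.
@everywhere f(row) = row.a + row.b

t = (a=rand(10), b=rand(10))

# Same mapping as in the MWE: one eagerly spawned thunk per row.
thunks = map(Tables.rows(t)) do row
    Dagger.@spawn f(row)
end

# With f defined everywhere, fetching every thunk should return
# instead of blocking on a thunk that never finishes.
results = fetch.(thunks)
```

The only change relative to the MWE that matters is the `@everywhere` in front of the definition of `f`.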

@jpsamaroo
Member

@kolia thanks for figuring that out! That means this is likely a bug in error propagation.

@kleinschmidt
Author

Another clue: when you use map(delayed(f), Tables.rows(t)) and then compute.(thunks), they ALL hang indefinitely. We're starting to suspect that this is due to an error not being handled properly, so the job neither fails nor returns.
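While debugging this kind of hang, it can help to tell "never returns" apart from "failed". The sketch below uses a hypothetical helper, `fetch_status` (not part of Dagger), built only on `Base.timedwait`; it should work for anything that supports `fetch`, such as a `Task`, a `Future`, or presumably the thunks above.

```julia
# Hypothetical helper (not a Dagger API): classify what happens when
# fetch(x) is given at most `timeout` seconds to complete.
function fetch_status(x; timeout::Real=10.0)
    # Run the potentially blocking fetch in its own task.
    task = @async fetch(x)
    # Poll until the task is done or the timeout elapses.
    waited = timedwait(() -> istaskdone(task), float(timeout))
    waited === :ok || return :timed_out
    # The task finished: either fetch returned or it threw.
    return istaskfailed(task) ? :failed : :ok
end
```

Something like `fetch_status.(thunks; timeout=10)` would then mark a hung thunk as `:timed_out`. Note that a timed-out `fetch` is not cancelled; its task simply stays blocked in the background.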

@kolia

kolia commented Aug 10, 2021

see #255

@jpsamaroo
Member

Thanks for the reports @kleinschmidt and @kolia! This one has probably bitten many Dagger users (including me) in the butt without anyone realizing what was happening!

@kleinschmidt
Author

Thanks for the quick fix!

3 participants