undefined function => stalled thunks #255

Closed
kolia opened this issue Aug 10, 2021 · 1 comment · Fixed by #264

kolia commented Aug 10, 2021

[mostly the same as #254, better diagnosis]

using Pkg
pkg"activate --temp"
pkg"add Dagger Tables"

using Dagger, Tables

using Distributed
addprocs(1)

@everywhere begin
    using Pkg
    Pkg.activate($(Pkg.project().path))
    Pkg.instantiate()
    using Dagger
end

f(x) = x

eager_thunks = map(1:4) do i
    Dagger.@spawn f(i)
end
# 4-element Vector{Dagger.EagerThunk}:
#  EagerThunk (finished)
#  EagerThunk (running)
#  EagerThunk (finished)
#  EagerThunk (finished)

d = delayed(f)(42)

compute(d)  # stalls

It seems likely that eager_thunks[1] runs on process 1 (the master), then eager_thunks[2] runs on the worker, and the error raised there because f is not defined somehow makes eager_thunks[2] stall.

It also seems likely that delayed(f)(42) gets scheduled on the worker and exhibits the same behavior.

Interestingly, if I define f = identity instead of f(x) = x, then delayed(f)(42) works, but eager_thunks[2] still stalls in the same way.

jpsamaroo (Member) commented

Ok, found the issue: deserialize on worker 2 throws the error on f being undefined (before we get into Dagger's "managed" code), but we use remote_do in fire_tasks, so we silently swallow the error. This only occurs for the second eager task, and for the delayed call, because those executed on worker 2; the rest of the tasks execute on worker 1, and thus complete successfully. Will have a fix up shortly.
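
For readers who want to see the underlying Distributed behavior without Dagger, here is a minimal sketch. It assumes a single local worker, and the function name `g` is made up for illustration (it plays the role of `f` above). It is not Dagger's code, only an illustration of why a `remote_do` call gives the caller no sign that the function failed to deserialize on the worker, while `remotecall_fetch` reports it.

```julia
using Distributed

# Assumption: one local worker is enough to reproduce the effect.
nprocs() == 1 && addprocs(1)
w = first(workers())

# Hypothetical function, defined only on the calling process
# (no @everywhere), so the worker does not know about it.
g(x) = x

# remote_do is fire-and-forget: it returns `nothing` immediately and
# no result or exception is ever delivered back to the caller.
result = remote_do(g, w, 1)

# remotecall_fetch waits for the reply, so the worker-side failure
# comes back to the caller as an exception.
err = try
    remotecall_fetch(g, w, 1)
    nothing
catch e
    e
end

println("remote_do returned: ", repr(result))
println("remotecall_fetch raised: ", typeof(err))
```

If the scheduler only learns about task completion through a message sent by the task itself, a failure before that point leaves the thunk waiting indefinitely, which matches the stall reported above.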
