Skip to content

Force reschedule in adjust_pressure! #284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 15, 2021

Conversation

jpsamaroo
Copy link
Member

(Hopefully) fixes #282

Thanks to @krynju for narrowing this one down!

@codecov-commenter
Copy link

codecov-commenter commented Sep 19, 2021

Codecov Report

Merging #284 (118f608) into master (ff601e8) will not change coverage.
The diff coverage is 0.00%.

Impacted file tree graph

@@          Coverage Diff           @@
##           master    #284   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files          35      35           
  Lines        2912    2913    +1     
======================================
- Misses       2912    2913    +1     
Impacted Files Coverage Δ
src/sch/eager.jl 0.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff601e8...118f608. Read the comment docs.

@jpsamaroo jpsamaroo force-pushed the jps/adjust-pressure-reschedule branch from 3f66902 to 19d9bd5 Compare September 22, 2021 15:01
@krynju krynju mentioned this pull request Oct 2, 2021
@krynju
Copy link
Member

krynju commented Oct 2, 2021

So here the latest status is that against all logic this piece below in _adjust_pressure! fixes the issue for the single threaded example:

if pressure > 0
    put!(state.chan, RescheduleSignal())
end

@krynju
Copy link
Member

krynju commented Oct 5, 2021

@jpsamaroo new findings based on this example

using Dagger
f = (x) -> 10 + x
g = (x) -> fetch(Dagger.spawn(f, x))
fetch(Dagger.spawn(g, 10))

so the state at which the thing hangs is that:

now i don't have any idea why does it land in failed_scheduling? maybe by that time the worker is still fully occupied with the pressure from g? or maybe it should get scheduled again after this fail?

btw not using the put! reschedule signal in adjust pressure for this investigation

@krynju
Copy link
Member

krynju commented Oct 5, 2021

Ok looked into it more, I see nothing weird about it, it's just stuck here:

chan_value = take!(chan) # get result of completed thunk

The below works fine for me, the condition on notify(TASK_SYNC) breaks things.
I assume it's due to the fact that adjusting pressure both ways somewhat means a task has finished, not necessairly the same task though

"Adjusts the scheduler's cached pressure indicator for the specified worker by
the specified amount."
function adjust_pressure!(h::SchedulerHandle, proctype::Type, pressure)
    uid = Dagger.get_tls().sch_uid
    lock(TASK_SYNC) do
        PROC_UTILIZATION[uid][proctype][] += pressure
        notify(TASK_SYNC)
    end
    exec!(_adjust_pressure!, h, myid(), proctype, pressure)
end
function _adjust_pressure!(ctx, state, task, tid, (pid, proctype, pressure))
    state.worker_pressure[pid][proctype] += pressure
    pressure < 0 && put!(state.chan, RescheduleSignal())
    nothing
end

@jpsamaroo
Copy link
Member Author

Being stuck on take! probably means that we need to put a RescheduleSignal.

now i don't have any idea why does it land in failed_scheduling?

Yeah failed_scheduling is fully related to pressure; you only end up there if at capacity on all workers.

@krynju
Copy link
Member

krynju commented Oct 5, 2021

Being stuck on take! probably means that we need to put a RescheduleSignal.

So I guess the question is whether adjust_pressure is the optimal place to put that RescheduleSignal.
On the way to figure out what's the issue I haven't really found a better place. This seems to be the moment to do that

@krynju
Copy link
Member

krynju commented Oct 14, 2021

Should we merge this already?
I see nothing else to investigate
The below addition is enough to fix the issue

function _adjust_pressure!(ctx, state, task, tid, (pid, proctype, pressure))
    state.worker_pressure[pid][proctype] += pressure
    pressure < 0 && put!(state.chan, RescheduleSignal())
    nothing
end

@jpsamaroo jpsamaroo force-pushed the jps/adjust-pressure-reschedule branch from 19d9bd5 to 91ba229 Compare October 15, 2021 13:46
@jpsamaroo jpsamaroo merged commit 8c42c59 into master Oct 15, 2021
@jpsamaroo
Copy link
Member Author

Thanks for the help with this @krynju !

@jpsamaroo jpsamaroo deleted the jps/adjust-pressure-reschedule branch October 15, 2021 14:47
@krynju krynju mentioned this pull request Dec 12, 2021
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hang #1 in Eager API usage
3 participants