[manager/dispatcher] Synchronization fixes #2495
Conversation
manager/dispatcher/dispatcher.go (outdated)

@@ -125,8 +125,18 @@ type clusterUpdate struct {

// Dispatcher is responsible for dispatching tasks and tracking agent health.
type Dispatcher struct {
	mu sync.Mutex
	wg sync.WaitGroup
	// Lock to provide mutually exclusive access to dispatcher fields
Start the comment with the name of the fields: // mu blah blah...
manager/dispatcher/dispatcher.go (outdated)

	mu sync.Mutex
	// Flag to indicate shutdown and prevent new operations on the dispatcher.
	// Set by calling Stop().
	shutdown bool
Any reason to avoid using a tombstone channel?
Caught up with @stevvooe offline. I agree that using a tombstone is cleaner, but some of the operations don't use a select {} so using a bool saves some code.
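To illustrate the trade-off being discussed, here is a minimal sketch of the two signalling styles; this is not the actual dispatcher code, and the type and field names are made up:

package example

import (
	"errors"
	"sync"
)

// Tombstone channel: close(stopped) broadcasts shutdown, which is convenient
// for call sites that already sit in a select {}.
type withChannel struct {
	stopped chan struct{}
}

func (w *withChannel) waitForWork(work <-chan int) error {
	select {
	case <-w.stopped:
		return errors.New("dispatcher is stopped")
	case item := <-work:
		_ = item // ... process the item ...
		return nil
	}
}

// Boolean flag: avoids the select {} for call sites that are purely
// sequential, at the cost of taking the mutex to read the flag.
type withBool struct {
	mu       sync.Mutex
	shutdown bool
}

func (w *withBool) running() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return !w.shutdown
}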
Codecov Report
@@ Coverage Diff @@
## master #2495 +/- ##
==========================================
+ Coverage 61.17% 62.38% +1.2%
==========================================
Files 49 129 +80
Lines 6898 21323 +14425
==========================================
+ Hits 4220 13302 +9082
- Misses 2243 6586 +4343
- Partials 435 1435 +1000
manager/dispatcher/dispatcher.go (outdated)

	// Set shutdown to true.
	// This will prevent RPCs that start after stop() is called
	// from making progress and essentially puts the dispatcher in drain.
	d.shutdown = true
I don't see any place where shutdown is set back to false. Remember that the Dispatcher is not recreated but reused.
good point! Will address this.
I'm a little worried by the fact that the CI was green...
Yeah, unfortunately the unit test always creates a new dispatcher, I know.
Looking at adding a unit test.
3773677 to 071a9b9
Use an RWMutex instead of a regular one. There's only one case that writes (the actual shutdown) and everything else is reads. It's an easy win for optimization. In addition, this will allow you to eliminate the waitgroup, because you can just RLock the RWMutex for the duration of the method call, and the Stop method won't be able to acquire a write lock until all read locks are released. Make sense?
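For reference, a minimal sketch of the suggested RWMutex approach; this is illustrative only, not the actual Dispatcher code:

package example

import (
	"errors"
	"sync"
)

type dispatcher struct {
	mu       sync.RWMutex
	shutdown bool
}

// Every RPC holds a read lock for its whole duration; many RPCs can run
// concurrently, and no separate WaitGroup is needed.
func (d *dispatcher) heartbeat() error {
	d.mu.RLock()
	defer d.mu.RUnlock()
	if d.shutdown {
		return errors.New("dispatcher is stopped")
	}
	// ... process the heartbeat ...
	return nil
}

// Stop cannot acquire the write lock until all read locks are released,
// so it implicitly waits for in-flight RPCs to finish.
func (d *dispatcher) stop() {
	d.mu.Lock()
	d.shutdown = true
	d.mu.Unlock()
}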
Using an RWMutex was the initial approach. Here's why I decided to use a wait group:
I realized that the shutdown flag is not really needed since we can inspect the dispatcher context to signal shutdown. This code is much simpler and also in line with the two points above.
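Roughly, the resulting shape is the following; a simplified sketch only, the real method and field names may differ:

package example

import "context"

type dispatcher struct {
	ctx    context.Context
	cancel context.CancelFunc
}

// Run installs a cancellable context that doubles as the shutdown signal.
func (d *dispatcher) run(parent context.Context) {
	d.ctx, d.cancel = context.WithCancel(parent)
	// ... start the dispatcher loops ...
}

// isRunning reports whether Stop has been called, by inspecting the context
// instead of a separate shutdown bool.
func (d *dispatcher) isRunning() bool {
	return d.ctx != nil && d.ctx.Err() == nil
}

// Stop cancels the context, so every subsequent isRunning call reports false.
func (d *dispatcher) stop() {
	d.cancel()
}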
manager/dispatcher/dispatcher.go (outdated)

@@ -1137,6 +1157,9 @@ func (d *Dispatcher) getRootCACert() []byte {
// a special boolean field Disconnect which if true indicates that node should
// reconnect to another Manager immediately.
func (d *Dispatcher) Session(r *api.SessionRequest, stream api.Dispatcher_SessionServer) error {
	d.shutdownWait.Add(1)
@anshulpundir here and in a couple of places above, why isn't the check on d.shutdown required?
Actually, never mind, after your most recent comment this is not relevant.
We're using the Dispatcher context to signal shutdown.
Note the order: we always increment the waitgroup first, followed by isRunningLocked(). If an RPC finds the context not cancelled, it will already have incremented the WaitGroup, which makes sure that Stop() waits for this RPC to finish.
I would say that this is slightly clever, so make sure to document somewhere that the ordering is what does the trick. It'll be hard to read later otherwise.
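For the record, the ordering trick looks roughly like this; a simplified sketch, not the actual code:

package example

import (
	"context"
	"errors"
	"sync"
)

type dispatcher struct {
	ctx          context.Context
	cancel       context.CancelFunc
	shutdownWait sync.WaitGroup
}

// Every RPC increments the WaitGroup *before* checking the context. If the
// check passes, Stop cannot finish until this RPC returns; if the context
// was already cancelled, the RPC bails out and releases its count.
func (d *dispatcher) session() error {
	d.shutdownWait.Add(1)
	defer d.shutdownWait.Done()

	if d.ctx.Err() != nil {
		return errors.New("dispatcher is stopped")
	}
	// ... serve the session stream ...
	return nil
}

// Stop cancels the context first, then waits for in-flight RPCs, and only
// afterwards tears down shared state such as the node store. That is what
// keeps Register from reinserting into the store after cleanup.
func (d *dispatcher) stop() {
	d.cancel()
	d.shutdownWait.Wait()
	// ... clean up the node store and other shared state ...
}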
manager/dispatcher/dispatcher.go (outdated)

	d.shutdownWait.Add(1)
	defer d.shutdownWait.Done()

	if !d.isRunning() {
This is the version with no lock, so it can potentially race with the shutdown. Is that a problem?
If not, can you add a comment explaining why it is fine like that?
It's not, because Stop() will wait for Heartbeat() to complete since it has already incremented the waitgroup. Added a comment.
The wait on the WaitGroup is not the first operation in Stop(), so the running check here races with the beginning of the stop.
As I was saying, maybe it is not a problem, but with this change you could have started the stop function and still process an extra heartbeat, because the isRunning check reads the bool with no lock.
with this change you could have started the stop function and still process an extra heartbeat, because the isRunning check reads the bool with no lock

True. However, there is no correctness issue, and the win is that we don't need to take a lock in Heartbeat(), which is the most frequent call on the dispatcher.
Signed-off-by: Anshul Pundir <[email protected]>
LGTM
Alright, LGTM, we're shipping it.
There are some potential races introduced by this PR that cause tests to fail on moby/moby. We may have to fix that: moby/moby#36274
Fixes for synchronizing the dispatcher shutdown with in-progress RPCs. This addresses the case where the Dispatcher.Register() RPC races with Dispatcher.Stop() and reinserts into the dispatcher node store after it has been cleaned up by Stop().
We'll hold off on changing the grace period for agents until we have tested this fix at scale.