Conversation
This makes sense to me
oi. maybe? i originally wrote this signal handling when there was a very different mechanism (no supervisor) in place for managing termination, and didn't seriously revisit it when i refactored to use the supervisor. my general recollection is that all the current logic was there to make sure we dealt correctly with both ongoing and future method calls in the event of a signal. in any case, i may need a couple days to find the mental bandwidth to check this over, but i'll give it a serious look. no reason to keep all that complex logic if we don't need it.
// don't have to do anything, as we'd just be redoing
// that work. Instead, deregister and return.
return
}
I think there is a small race regarding what count will be logged. This block was setting the atomic prior to the count() call, which ensured that no more ops launch, fixing the count. Currently, more ops could launch between the count() call and the atomic swap at the start of Release(). What about moving this count/log line inside of Release(), just after the atomic swap?
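Something like this rough sketch is what I have in mind (field and helper names are placeholders, not the actual gps identifiers): the swap happens before the count is read, so the logged number can't be undercut by ops that launch afterwards.

```go
package sketch

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Placeholder stand-ins for the real types; only the ordering matters here.
type supervisor struct{ running int32 }

func (s *supervisor) count() int32 { return atomic.LoadInt32(&s.running) }

type SourceMgr struct {
	releasing int32
	relonce   sync.Once
	suprvsr   *supervisor
}

func (sm *SourceMgr) doRelease() {
	// wait for in-flight ops, unlock, remove the lock file, etc.
}

// Release flips the flag before reading the count, so no new ops can start
// between the count() call and the swap.
func (sm *SourceMgr) Release() {
	if atomic.CompareAndSwapInt32(&sm.releasing, 0, 1) {
		fmt.Printf("waiting on %d ops to complete\n", sm.suprvsr.count())
	}
	sm.relonce.Do(sm.doRelease)
}
```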
I suppose the 'signal received' portion of the message won't apply from inside the method, so maybe that should still be logged separately.
> This block was setting the atomic prior to the count() call, which ensured that no more ops launch, fixing the count. Currently, more ops could launch between the count() call and the atomic swap at the start of Release(). What about moving this count/log line inside of Release(), just after the atomic swap?
(just noting that this sounds like the class of concern that motivated the original design)
@tamird thoughts on this?
We could put that print statement inside the Release method - would that suit you?
actually, let's keep the print statement outside, where it is.
in general, we try to treat gps like a pure library within dep. for the most part, gps isn't aware dep exists - it only shares some low-level filesystem manipulation methods, and some test helpers. also to that end, we try really, really hard to avoid direct print statements in gps itself. this one print statement is the sole place, in all of gps, where we actually print directly, and i only let it in there because it would have required too much refactoring to arrange the signal handling otherwise.
so yeah, let's call this PR a victory on its own terms, and we can think about ways of refactoring for a more controlled, informative shutdown experience later.
ah ok, i remembered the issue.
internal/gps/source_manager.go (outdated)
// _more_ method calls will stack up if/while waiting.
atomic.CompareAndSwapInt32(&sm.releasing, 0, 1)

// Whether 'releasing' is set or not, we don't want this function to return
ahh - this is the key, and why we need the sync.Once. the proposed change would cause Release() to return immediately, even if the real work of doRelease() hasn't completed yet. so: say a signal is sent, which causes a call made by the solver to fail, causing the solver to error out. at that point, it's a very direct path to the entire dep process exiting. the only thing that prevents it from doing so is the call it makes to Release(). with this change, it would exit right away, as the signal handler would have already flipped the sm.releasing flag.
this could, albeit quite unlikely, result in some inconsistent disk state with the persistent cache, which @jmank88 is currently working on implementing.
i suspect the SourceMgr is probably strictly crash-only right now, but that's not currently an explicit design goal. items in the persistent cache have dependencies on one another, and we don't necessarily impose any kind of topological write order in a way that guarantees dependencies enter the cache prior to their dependers. we are, in general, self-healing in the face of such absent dependent data, as there are circumstances where it can occur during normal operation.
now, it would certainly be wise to achieve such an ordered write property, as it would more properly insulate us against true machine failure, e.g. power loss. but, at least until we make that an explicit goal, allowing any possibility of any call to Release() to return before all resources have actually been released seems like tempting fate.
as a more practical matter, we also want to be very sure that there is no way any subprocesses can still be running after any call to Release() returns. we had scads of problems early on in our tests where unfinished subprocesses (e.g. git), especially on windows, were causing intermittent test failures due to file locks they held on directories that should have been fine to remove.
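for concreteness, a minimal sketch of the property i'm defending (assumed names, not the real gps code): every caller of Release blocks on the Once until doRelease has fully finished, so nothing returns while ops, subprocesses, or disk cleanup are still in flight.

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

type SourceMgr struct {
	releasing int32
	relonce   sync.Once
	wg        sync.WaitGroup // one Add per in-flight op or subprocess
}

// Release may be called from the signal handler and from normal exit paths.
// Every caller blocks until doRelease completes, because Once.Do does not
// return to any caller until the first invocation has finished.
func (sm *SourceMgr) Release() {
	atomic.StoreInt32(&sm.releasing, 1) // reject new method calls immediately
	sm.relonce.Do(sm.doRelease)
}

func (sm *SourceMgr) doRelease() {
	sm.wg.Wait() // wait for running ops (e.g. git subprocesses) to terminate
	// only then unlock and remove the on-disk lock file
}
```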
probably worth updating the documentation to note this property about after-return-time, in addition to what it already says about after-call-time.
OK, so you want calls to Release to wait, is that it? Done.
@@ -349,22 +332,13 @@ func (e CouldNotCreateLockError) Error() string {
// longer safe to call methods against it; all method calls will immediately
// result in errors.
func (sm *SourceMgr) Release() {
// Set sm.releasing before entering the Once func to guarantee that no
// _more_ method calls will stack up if/while waiting.
atomic.CompareAndSwapInt32(&sm.releasing, 0, 1)
notwithstanding my substantive comment, this is kind of awkward - this essentially just amounts to 1) make any subsequent caught signal immediately deregister the signal handler and 2) don't print anything about waiting for ops to complete. could def be improved.
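for anyone following along, what remains reduces to roughly this shape (a generic sketch, not dep's actual handler): the first signal kicks off a graceful release, and a subsequent signal just deregisters the handler so the default termination behavior takes over.

```go
package sketch

import (
	"os"
	"os/signal"
)

// handleSignals is a generic sketch: the first signal starts a graceful
// release; a second signal deregisters the handler, so any further signal
// kills the process via the default action.
func handleSignals(release func()) {
	sigch := make(chan os.Signal, 1)
	signal.Notify(sigch, os.Interrupt)
	go func() {
		<-sigch      // first signal: begin graceful shutdown
		go release() // don't block the handler on the release itself
		<-sigch      // second signal: give up on being graceful
		signal.Stop(sigch)
	}()
}
```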
internal/gps/source.go (outdated)
@@ -65,8 +65,6 @@ func newSourceCoordinator(superv *supervisor, deducer deducer, cachedir string,
}
}

func (sc *sourceCoordinator) close() {}
let's leave this in, please. we can add a comment about why we're preserving the no-op func, but i would rather have the call to it remain down in doRelease() so that, if we do have cleanup that we need to add later, we've already got the right hook in place.
internal/gps/source_manager.go (outdated)
// Close the file handle for the lock file and remove it from disk
sm.lf.Unlock()
os.Remove(filepath.Join(sm.cachedir, "sm.lock"))
if atomic.CompareAndSwapInt32(&sm.releasing, 0, 1) {
this gets the ordering wrong - we're now marking ourselves as being in the process of releasing, and therefore rejecting new incoming calls, only after we've already waited for running commands to terminate. we need to start rejecting new calls immediately, signal for the termination of all existing calls, then wait for them all to terminate.
the old implementation had both sm.releasing and sm.released int32s for these purposes. i replaced sm.released with the sync.Once as it was a cleaner way of achieving the wait.
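in code, the ordering i'm asking for looks roughly like this (illustrative names only, not the real gps types): reject new calls first, tell everything in flight to stop, wait for it all to actually finish, and only then touch the on-disk state.

```go
package sketch

import (
	"context"
	"sync"
	"sync/atomic"
)

type mgr struct {
	releasing int32
	ctx       context.Context    // every op is started under this context
	cancelAll context.CancelFunc // cancels ctx, telling all running ops to stop
	wg        sync.WaitGroup     // one Add per in-flight op
}

func newMgr() *mgr {
	m := &mgr{}
	m.ctx, m.cancelAll = context.WithCancel(context.Background())
	return m
}

func (m *mgr) doRelease() {
	atomic.StoreInt32(&m.releasing, 1) // 1. reject new incoming calls immediately
	m.cancelAll()                      // 2. signal every existing call to terminate
	m.wg.Wait()                        // 3. wait for them all to actually finish
	// 4. only now unlock and remove the lock file, once nothing can still be writing
}
```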
Understood - I think this is fixed now, though.
@@ -403,7 +369,7 @@ func (sm *SourceMgr) GetManifestAndLock(id ProjectIdentifier, v Version, an Proj
// ListPackages parses the tree of the Go packages at and below the ProjectRoot
// of the given ProjectIdentifier, at the given version.
func (sm *SourceMgr) ListPackages(id ProjectIdentifier, v Version) (pkgtree.PackageTree, error) {
if atomic.CompareAndSwapInt32(&sm.releasing, 1, 1) {
if atomic.LoadInt32(&sm.releasing) == 1 {
ahh good call, don't need the more complex CAS, as reaching 1 is the address' terminal state 👍
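side by side, for anyone skimming later (an illustrative wrapper, not the real method): both guards observe the same terminal value, the load just says what it means.

```go
package sketch

import (
	"errors"
	"sync/atomic"
)

var errReleased = errors.New("source manager is releasing; cannot call methods")

type mgr struct{ releasing int32 }

// Once releasing reaches 1 it never changes again, so a CAS of 1-for-1 and a
// plain load read exactly the same thing.
func (m *mgr) guard() error {
	if atomic.CompareAndSwapInt32(&m.releasing, 1, 1) { // old form: a read in disguise
		return errReleased
	}
	if atomic.LoadInt32(&m.releasing) == 1 { // new form: equivalent, clearer
		return errReleased
	}
	return nil
}
```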
cool, that'll do it!
Ah, hang on - I think I see what you're saying about not wanting this function to return until after all the cleanup has been completed. On the other hand, I don't think we need both a sync.Once and an atomic here. I'll make another change.
(just gonna wait for tests to be sure)
yep, just storing the result of the CAS in a local var and using that to decide whether or not to do one-time-only cleanup after waiting achieves the desired effect 😄, and without the need for another synchronization primitive.
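roughly the pattern being described (placeholder names, not the real gps code):

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

type mgr struct {
	releasing int32
	wg        sync.WaitGroup // one Add per in-flight op
}

// Release records whether this caller won the CAS; every caller waits for
// in-flight ops, but only the winner performs the one-time cleanup.
// Note: callers that lose the CAS do not wait for the winner's cleanup below.
func (m *mgr) Release() {
	first := atomic.CompareAndSwapInt32(&m.releasing, 0, 1)
	m.wg.Wait() // all callers block until running ops have terminated
	if first {
		// one-time cleanup: unlock and remove the lock file, etc.
	}
}
```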
@sdboyer not quite, though, because we want to clean up the stuff on disk as well. your original comment (which i've now restored) was accurate in pointing that out. have another look, please!
oh, derp! yes, because of course, the fs cleanup itself also takes time, and we can't release until after that, either. i'm not necessarily opposed to merging this (at minimum, i appreciate having someone else go over this with a fine-tooth comb), but it seems like we've now mostly recreated a
Yes, that's one benefit, and the other is simply avoiding duplicating the atomic integer, which the sync.Once already has embedded within it.
The previous code here was doing some odd things with CAS, none of which was necessary. There's no functional change here, it's just simpler.
Alright, we're mostly back to the old implementation. This amounts to a simplification, but I think it's still meaningfully easier to read. Sorry for all the back and forth!
it's all good! i agree, this is a more readable version of the logic, and i feel twice as confident in what we have, now that someone else has really torn it apart and then put it back together 😄
@sdboyer ready?
ja, just finishing up parenting for the evening. in we go! 🎉
Kinda wish you hadn't squashed this - the commits were separately factored for ease of reading. Now, future readers will have to contend with aa8e076, which is just a mess :(
i squashed? eek, i didn't do so intentionally. i really never do so, as i also prefer having that incremental history :( i just clicked the button, and the default merge settings are for normal merges...double checks
Ah but the merge commit has a body? TIL, phew!
yeah, TIL, too - i think that's probably github trying to be helpful by doing the equivalent of showing
The work being done here is already guarded by an atomic. This code appears very confused, but perhaps it is I who is confused.
What does this do / why do we need it?
Simplifies existing code.
What should your reviewer look out for in this PR?
I dunno; this might be subtle.
Do you need help or clarification on anything?
No.
Which issue(s) does this PR fix?
None.