Conversation
This makes sense to me
oi. maybe? i originally wrote this signal handling when there was a very different mechanism (no supervisor) in place for managing termination, and didn't seriously revisit it when i refactored to use the supervisor. my general recollection is that all the current logic was there to make sure we dealt correctly with both ongoing and future method calls in the event of a signal. in any case, i may need a couple days to find the mental bandwidth to check this over, but i'll give it a serious look. no reason to keep all that complex logic if we don't need it.
// don't have to do anything, as we'd just be redoing
// that work. Instead, deregister and return.
return
}
I think there is a small race regarding what count will be logged. This block was setting the atomic prior to the count() call, which ensured that no more ops launch, fixing the count. Currently, more ops could launch between the count() call and the atomic swap at the start of Release(). What about moving this count/log line inside of Release(), just after the atomic swap?
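Something like this rough sketch is what I have in mind (field and helper names are placeholders, not the actual gps identifiers): the swap happens before the count is read, so the logged number can't be undercut by ops that launch afterwards.

```go
package sketch

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Placeholder stand-ins for the real types; only the ordering matters here.
type supervisor struct{ running int32 }

func (s *supervisor) count() int32 { return atomic.LoadInt32(&s.running) }

type SourceMgr struct {
	releasing int32
	relonce   sync.Once
	suprvsr   *supervisor
}

func (sm *SourceMgr) doRelease() {
	// wait for in-flight ops, unlock, remove the lock file, etc.
}

// Release flips the flag before reading the count, so no new ops can start
// between the count() call and the swap.
func (sm *SourceMgr) Release() {
	if atomic.CompareAndSwapInt32(&sm.releasing, 0, 1) {
		fmt.Printf("waiting on %d ops to complete\n", sm.suprvsr.count())
	}
	sm.relonce.Do(sm.doRelease)
}
```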
I suppose the 'signal received' portion of the message won't apply from inside the method, so maybe that should still be logged separately.
> This block was setting the atomic prior to the count() call, which ensured that no more ops launch, fixing the count. Currently, more ops could launch between the count() call and the atomic swap at the start of Release(). What about moving this count/log line inside of Release(), just after the atomic swap?
(just noting that this sounds like the class of concern that motivated the original design)
@tamird thoughts on this?
We could put that print statement inside the Release method - would that suit you?
actually, let's keep the print statement outside, where it is.
in general, we try to treat gps like a pure library within dep. for the most part, gps isn't aware dep exists - it only shares some low-level filesystem manipulation methods, and some test helpers. also to that end, we try really, really hard to avoid direct print statements in gps itself. this one print statement is the sole place, in all of gps, where we actually print directly, and i only let it in there because it would have required too much refactoring to arrange the signal handling otherwise.
so yeah, let's call this PR a victory on its own terms, and we can think about ways of refactoring for a more controlled, informative shutdown experience later.
ah ok, i remembered the issue.
internal/gps/source_manager.go (outdated)
// _more_ method calls will stack up if/while waiting.
atomic.CompareAndSwapInt32(&sm.releasing, 0, 1)

// Whether 'releasing' is set or not, we don't want this function to return
ahh - this is the key, and why we need the sync.Once. the proposed change would cause Release() to return immediately, even if the real work of doRelease() hasn't completed yet. so: say a signal is sent, which causes a call made by the solver to fail, causing the solver to error out. at that point, it's a very direct path to the entire dep process exiting. the only thing that prevents it from doing so is the call it makes to Release(). with this change, it would exit right away, as the signal handler would have already flipped the sm.releasing flag.
this could, albeit quite unlikely, result in some inconsistent disk state with the persistent cache, which @jmank88 is currently working on implementing.
i suspect the SourceMgr is probably strictly crash-only right now, but that's not currently an explicit design goal. items in the persistent cache have dependencies on one another, and we don't necessarily impose any kind of topological write order in a way that guarantees dependencies enter the cache prior to their dependers. we are, in general, self-healing in the face of such absent dependent data, as there are circumstances where it can occur during normal operation.
now, it would certainly be wise to achieve such an ordered write property, as it would more properly insulate us against true machine failure, e.g. power loss. but, at least until we make that an explicit goal, allowing any possibility of any call to Release() to return before all resources have actually been released seems like tempting fate.
as a more practical matter, we also want to be very sure that there is no way any subprocesses can still be running after any call to Release() returns. we had scads of problems early on in our tests where unfinished subprocesses (e.g. git), especially on windows, were causing intermittent test failures due to file locks they held on directories that should have been fine to remove.
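for concreteness, a minimal sketch of the property i'm defending (assumed names, not the real gps code): every caller of Release blocks on the Once until doRelease has fully finished, so nothing returns while ops, subprocesses, or disk cleanup are still in flight.

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

type SourceMgr struct {
	releasing int32
	relonce   sync.Once
	wg        sync.WaitGroup // one Add per in-flight op or subprocess
}

// Release may be called from the signal handler and from normal exit paths.
// Every caller blocks until doRelease completes, because Once.Do does not
// return to any caller until the first invocation has finished.
func (sm *SourceMgr) Release() {
	atomic.StoreInt32(&sm.releasing, 1) // reject new method calls immediately
	sm.relonce.Do(sm.doRelease)
}

func (sm *SourceMgr) doRelease() {
	sm.wg.Wait() // wait for running ops (e.g. git subprocesses) to terminate
	// only then unlock and remove the on-disk lock file
}
```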
probably worth updating the documentation to note this property about after-return-time, in addition to what it already says about after-call-time.
OK, so you want calls to Release to wait, is that it? Done.
@@ -349,22 +332,13 @@ func (e CouldNotCreateLockError) Error() string {
// longer safe to call methods against it; all method calls will immediately
// result in errors.
func (sm *SourceMgr) Release() {
// Set sm.releasing before entering the Once func to guarantee that no
// _more_ method calls will stack up if/while waiting.
atomic.CompareAndSwapInt32(&sm.releasing, 0, 1)
notwithstanding my substantive comment, this is kind of awkward - this essentially just amounts to 1) make any subsequent caught signal immediately deregister the signal handler and 2) don't print anything about waiting for ops to complete. could def be improved.
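for anyone following along, what remains reduces to roughly this shape (a generic sketch, not dep's actual handler): the first signal kicks off a graceful release, and a subsequent signal just deregisters the handler so the default termination behavior takes over.

```go
package sketch

import (
	"os"
	"os/signal"
)

// handleSignals is a generic sketch: the first signal starts a graceful
// release; a second signal deregisters the handler, so any further signal
// kills the process via the default action.
func handleSignals(release func()) {
	sigch := make(chan os.Signal, 1)
	signal.Notify(sigch, os.Interrupt)
	go func() {
		<-sigch      // first signal: begin graceful shutdown
		go release() // don't block the handler on the release itself
		<-sigch      // second signal: give up on being graceful
		signal.Stop(sigch)
	}()
}
```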
internal/gps/source.go (outdated)
@@ -65,8 +65,6 @@ func newSourceCoordinator(superv *supervisor, deducer deducer, cachedir string,
}
}

func (sc *sourceCoordinator) close() {}
let's leave this in, please. we can add a comment about why we're preserving the no-op func, but i would rather have the call to it remain down in doRelease() so that, if we do have cleanup that we need to add later, we've already got the right hook in place.
internal/gps/source_manager.go (outdated)
// Close the file handle for the lock file and remove it from disk
sm.lf.Unlock()
os.Remove(filepath.Join(sm.cachedir, "sm.lock"))
if atomic.CompareAndSwapInt32(&sm.releasing, 0, 1) {
this gets the ordering wrong - we're now marking ourselves as being in the process of releasing, and therefore rejecting new incoming calls, only after we've already waited for running commands to terminate. we need to start rejecting new calls immediately, signal for the termination of all existing calls, then wait for them all to terminate.
the old implementation had both sm.releasing and sm.released int32s for these purposes. i replaced sm.released with the sync.Once as it was a cleaner way of achieving the wait.
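in code, the ordering i'm asking for looks roughly like this (illustrative names only, not the real gps types): reject new calls first, tell everything in flight to stop, wait for it all to actually finish, and only then touch the on-disk state.

```go
package sketch

import (
	"context"
	"sync"
	"sync/atomic"
)

type mgr struct {
	releasing int32
	ctx       context.Context    // every op is started under this context
	cancelAll context.CancelFunc // cancels ctx, telling all running ops to stop
	wg        sync.WaitGroup     // one Add per in-flight op
}

func newMgr() *mgr {
	m := &mgr{}
	m.ctx, m.cancelAll = context.WithCancel(context.Background())
	return m
}

func (m *mgr) doRelease() {
	atomic.StoreInt32(&m.releasing, 1) // 1. reject new incoming calls immediately
	m.cancelAll()                      // 2. signal every existing call to terminate
	m.wg.Wait()                        // 3. wait for them all to actually finish
	// 4. only now unlock and remove the lock file, once nothing can still be writing
}
```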
Understood - I think this is fixed now, though.
@@ -403,7 +369,7 @@ func (sm *SourceMgr) GetManifestAndLock(id ProjectIdentifier, v Version, an Proj
// ListPackages parses the tree of the Go packages at and below the ProjectRoot
// of the given ProjectIdentifier, at the given version.
func (sm *SourceMgr) ListPackages(id ProjectIdentifier, v Version) (pkgtree.PackageTree, error) {
if atomic.CompareAndSwapInt32(&sm.releasing, 1, 1) {
if atomic.LoadInt32(&sm.releasing) == 1 {
ahh good call, don't need the more complex CAS, as reaching 1 is the address' terminal state 👍
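side by side, for anyone skimming later (an illustrative wrapper, not the real method): both guards observe the same terminal value, the load just says what it means.

```go
package sketch

import (
	"errors"
	"sync/atomic"
)

var errReleased = errors.New("source manager is releasing; cannot call methods")

type mgr struct{ releasing int32 }

// Once releasing reaches 1 it never changes again, so a CAS of 1-for-1 and a
// plain load read exactly the same thing.
func (m *mgr) guard() error {
	if atomic.CompareAndSwapInt32(&m.releasing, 1, 1) { // old form: a read in disguise
		return errReleased
	}
	if atomic.LoadInt32(&m.releasing) == 1 { // new form: equivalent, clearer
		return errReleased
	}
	return nil
}
```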
cool, that'll do it!
Ah, hang on - I think I see what you're saying about not wanting this function to return until after all the cleanup has been completed. On the other hand, I don't think we need both a sync.Once and an atomic here. I'll make another change.
(just gonna wait for tests to be sure)
yep, just storing the result of the CAS in a local var and using that to decide whether or not to do one-time-only cleanup after waiting achieves the desired effect 😄, and without the need for another synchronization primitive.
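roughly the pattern being described (placeholder names, not the real gps code):

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

type mgr struct {
	releasing int32
	wg        sync.WaitGroup // one Add per in-flight op
}

// Release records whether this caller won the CAS; every caller waits for
// in-flight ops, but only the winner performs the one-time cleanup.
// Note: callers that lose the CAS do not wait for the winner's cleanup below.
func (m *mgr) Release() {
	first := atomic.CompareAndSwapInt32(&m.releasing, 0, 1)
	m.wg.Wait() // all callers block until running ops have terminated
	if first {
		// one-time cleanup: unlock and remove the lock file, etc.
	}
}
```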
@sdboyer not quite, though, because we want to clean up the stuff on disk as well. your original comment (which i've now restored) was accurate in pointing that out. have another look, please!
oh, derp! yes, because of course, the fs cleanup itself also takes time, and we can't release until after that, either. i'm not necessarily opposed to merging this (at minimum, i appreciate having someone else go over this with a fine-tooth comb), but it seems like we've now mostly recreated a
Yes, that's one benefit, and the other is simply avoiding duplicating the atomic integer, which the sync.Once already has embedded within it.
The previous code here was doing some odd things with CAS, none of which was necessary. There's no functional change here, it's just simpler.
Alright, we're mostly back to the old implementation. This amounts to a simplification, but I think it's still meaningfully easier to read. Sorry for all the back and forth!
it's all good! i agree, this is a more readable version of the logic, and i feel twice as confident in what we have, now that someone else has really torn it apart and then put it back together 😄
@sdboyer ready?
ja, just finishing up parenting for the evening. in we go! 🎉
Kinda wish you hadn't squashed this - the commits were separately factored for ease of reading. Now, future readers will have to contend with aa8e076, which is just a mess :(
i squashed? eek, i didn't do so intentionally. i really never do so, as i also prefer having that incremental history :( i just clicked the button, and the default merge settings are for normal merges...double checks
Ah but the merge commit has a body? TIL, phew!
yeah, TIL, too - i think that's probably github trying to be helpful by doing the equivalent of showing
The work being done here is already guarded by an atomic. This code appears very confused, but perhaps it is I who is confused.
What does this do / why do we need it?
Simplifies existing code.
What should your reviewer look out for in this PR?
I dunno; this might be subtle.
Do you need help or clarification on anything?
No.
Which issue(s) does this PR fix?
None.