Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reliable flush/close on streams, and avoid zombies #17522

Merged
merged 2 commits into from
Aug 4, 2016

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Jul 21, 2016

this makes it so that our close() function waits for the shutdown to finish before returning (#14434)

@Keno for review of libuv changes

LIBUV_BRANCH=julia-uv1.9.0
LIBUV_SHA1=ecbd6eddfac4940ab8db57c73166a7378563ebd3
LIBUV_BRANCH=jn/pid-zero
LIBUV_SHA1=13c00a4268a16d5a12f3cd15b3a40c683e575dfc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before merging this should go on the branch, and checksums should be updated

@tkelman tkelman added the io Involving the I/O subsystem: libuv, read, write, etc. label Jul 21, 2016
@vtjnash vtjnash force-pushed the jn/reliable-flush-close branch from 7e9a7dc to 1be9e76 Compare July 21, 2016 17:11
LIBUV_BRANCH=julia-uv1.9.0
LIBUV_SHA1=ecbd6eddfac4940ab8db57c73166a7378563ebd3
LIBUV_BRANCH=jn/pid-zero
LIBUV_SHA1=cb6d0f875a5b8ca30cba45c0c1ef7442c87c1e68
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before merging this should go on the branch, and checksums should be updated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could also mark this as wip

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not WIP, it's simply waiting for review across two repos

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

work remains to be done, sounds "in progress" to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not what the WIP tag is intended to convey

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's exactly what the WIP tag is intended to convey

The convention is to prefix the pull request title with "WIP:" for Work In Progress, or "RFC:" for Request for Comments when work is completed and ready for merging. This will prevent accidental merging of work that is in progress.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the work is done and ready for merging. this just requires slightly more manual work than the usual Github button. however, we don't consider the merging "work" in the WIP tag.

(aside: I find it strange when there is both a reply and reaction on a comment. aren't they conveying redundant information?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this PR is not ready for merging. in what sense is "requires slightly more manual work" not precisely a form of "work" that should not be accidentally merged in its current state?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same way that #17579 was not WIP, even though it was not merged in its PR state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was a squash and the diff as pushed to master was identical to the pr state

@vtjnash
Copy link
Member Author

vtjnash commented Jul 24, 2016

bump @Keno (or maybe @StefanKarpinski?). CI failure were unrelated

ccall(:jl_close_uv, Void, (Ptr{Void},), stream.handle)
stream.status = StatusClosing
end
if uv_handle_data(stream) != C_NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What situation is this checking for and why was it not needed previously?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is because of the changed order in uvfinalize

@Keno
Copy link
Member

Keno commented Jul 25, 2016

So the problem is that libuv does not reap zombies if we close the handle before the process has exited?

@vtjnash
Copy link
Member Author

vtjnash commented Jul 26, 2016

So the problem is that libuv does not reap zombies if we close the handle before the process has exited?

right. if we close the libuv handle, then libuv has no way to track when the process dies and reap the zombie. so we need to instead ensure the libuv handle is valid for as long as its pid is non-zero.

@tkelman
Copy link
Contributor

tkelman commented Jul 26, 2016

Not that our relationship with libuv upstream is all that great right now, but does this fix a bug that they might care about?

@vtjnash
Copy link
Member Author

vtjnash commented Jul 26, 2016

Probably not, nodejs don't expose a close api

@vtjnash vtjnash force-pushed the jn/reliable-flush-close branch from 1be9e76 to 4ba3790 Compare July 29, 2016 03:03
@@ -1,2 +1,2 @@
LIBUV_BRANCH=julia-uv1.9.0
LIBUV_SHA1=ecbd6eddfac4940ab8db57c73166a7378563ebd3
LIBUV_SHA1=cb6d0f875a5b8ca30cba45c0c1ef7442c87c1e68
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs checksums updated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and delete the old

@vtjnash vtjnash force-pushed the jn/reliable-flush-close branch 3 times, most recently from 389e2eb to 8822ccb Compare August 1, 2016 00:42
vtjnash added 2 commits July 31, 2016 23:16
while investigating #14434, I noticed that the test code there is creating zombies
to fix that, we need to avoid calling uv_close in the finalizer
until after the process has exited (or during atexit)
@vtjnash vtjnash force-pushed the jn/reliable-flush-close branch from 8822ccb to 67efdcf Compare August 1, 2016 03:17
@vtjnash vtjnash merged commit 08ae28a into master Aug 4, 2016
@vtjnash vtjnash deleted the jn/reliable-flush-close branch August 4, 2016 03:38
@tkelman
Copy link
Contributor

tkelman commented Aug 4, 2016

This is causing spawn test failures https://ci.appveyor.com/project/JuliaLang/julia/build/1.0.5013/job/ey1xj2tloe8x47lt. Can reproduce locally.

@tkelman
Copy link
Contributor

tkelman commented Aug 4, 2016

reverting on release-0.5 for rc1

tkelman added a commit that referenced this pull request Aug 4, 2016
…ose"

This reverts commit 08ae28a, reversing
changes made to 1eeb773.

This is causing spawn test failures on Windows.
@vtjnash
Copy link
Member Author

vtjnash commented Aug 4, 2016

That test is stronger than essential, so it may need another Base.process_events call on Windows to fully work (hard to say, since this is testing the kernel as much as it is testing Julia)

@tkelman
Copy link
Contributor

tkelman commented Aug 4, 2016

I'll try that.

@tkelman
Copy link
Contributor

tkelman commented Aug 4, 2016

If you meant

diff --git a/test/spawn.jl b/test/spawn.jl
index 0040630..0cc4184 100644
--- a/test/spawn.jl
+++ b/test/spawn.jl
@@ -287,7 +287,11 @@ let out = Pipe(), echo = `$exename --startup-file=no -e 'print(STDOUT, " 1\t", r
         @test_throws ArgumentError write(out, "now closed error")
         @test isreadable(out)
         @test !iswritable(out)
-        is_windows() && Base.process_events(false) # should be enough steps to fully propagate EOF now
+        if is_windows()
+            # should be enough steps to fully propagate EOF now
+            Base.process_events(false)
+            Base.process_events(false)
+        end
         @test !isopen(out)
     end
     wait(ready) # wait for writer task to be ready before using `out`

then no that didn't help

@vtjnash
Copy link
Member Author

vtjnash commented Aug 4, 2016

Maybe sleep(0.04)? I think that's the documented maximum time for this to propagate. But how about we just go with is_windows() && Base.wait_close(out) # WINNT kernel does not provide a fast mechanism for async propagation of EOF for a blocking stream, so just wait for it to catch up. This shouldn't take much more than 32ms.

tkelman added a commit that referenced this pull request Aug 4, 2016
@tkelman
Copy link
Contributor

tkelman commented Aug 9, 2016

This caused ZMQ.jl's tests to freeze on Windows. I'm even more glad I reverted this, and won't be backporting it until that gets addressed.

@vtjnash
Copy link
Member Author

vtjnash commented Aug 9, 2016

Is this supposed be some sort of guessing game? Usually bug reports should contain a little more context and a little less snark

But if I have to guess, I would point to the bug in the following test:
https://github.com/JuliaLang/ZMQ.jl/blob/master/test/runtests.jl#L90
which is supposed to freeze (it didn't previously, but that's why #14434 existed). Although on some kernels it is often allowed to pass due to the small amount of data involved.

@tkelman
Copy link
Contributor

tkelman commented Aug 9, 2016

Have no context yet, have identified the commit that introduced the problem but not yet identified which line was freezing. Sorry I'm extremely frustrated that we're trying to put out a release and last-minute bug fixes have been causing as many issues as they fix.

tkelman added a commit that referenced this pull request Aug 11, 2016
tkelman added a commit that referenced this pull request Aug 11, 2016
introduced by #17522

(cherry picked from commit a3f288c)
ref #17815
mfasi pushed a commit to mfasi/julia that referenced this pull request Sep 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants