-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Relax some libcontainerd client locking #36848
Conversation
ddaae60
to
7770ec9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one nit, LGTM.
It should allow other threads to proceed. Although I don't understand how/why the one you highlighted in the issue never gets an answer from containerd (which would unlock it).
Hum, may be because of the fifo being already closed when containerd tries to open then, in which case the context timeout fixes that too 🤔
libcontainerd/client_daemon.go
Outdated
@@ -131,9 +131,28 @@ func (c *client) Version(ctx context.Context) (containerd.Version, error) { | |||
return c.getRemote().Version(ctx) | |||
} | |||
|
|||
// Restore restores loads the containerd container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/restores//
Yes the timeout definitely fixes it, though it blocks everything else until the timeout is reached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🦁
What are these arbitrary 60 sec timeouts for? We shouldn't hide hangs like this and leave daemon in an undefined state. If there is some fifo hanging it should be fixed instead. We have tools for force closing fifos one-sided. |
@tonistiigi I think a hung daemon is an undefined state. Where as an error saying the deadline exceeded can provide much more valuable information because we get the call stack in the error rather than requiring users to fetch the stack. |
If this is a temporary fix in hope to gather stacktraces from users because we don't know where the issue is then it needs to be clearly marked this way. |
I think the issue on containerd side should have been fixed by containerd/containerd#2229 |
Hum, but that went into 1.0.x so I think it was probably there when @stevvooe got his issue. |
I don't see it as temporary. It protects the daemon from a range of potential bugs. |
The problem here isn't so much the timeout but ignoring this range of potential bugs. Regarding containerd bugs this isn't really different than most other places where we call to containerd and expect it to answer. I have nothing against detecting unexpected behavior from containerd and handling it better in moby (for example marking containers as "errored/unresponsive/corrupted", refusing to handle specific API calls etc. |
@tonistiigi I will remove the timeout for now, however I'm not sure I agree with the assessment that it's hiding bugs. It's exposing bugs in a much easier to debug way that doesn't block production systems. |
7770ec9
to
59e19fd
Compare
Codecov Report
@@ Coverage Diff @@
## master #36848 +/- ##
==========================================
+ Coverage 35.2% 35.25% +0.04%
==========================================
Files 614 614
Lines 45698 45710 +12
==========================================
+ Hits 16090 16114 +24
+ Misses 27473 27464 -9
+ Partials 2135 2132 -3 |
59e19fd
to
e6832ec
Compare
This unblocks the client to take other restore requests and makes sure that a long/stuck request can't block the client forever. Signed-off-by: Brian Goff <[email protected]>
e6832ec
to
806700e
Compare
Failure on PowerPC; could be a flake; https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/9465/console
|
https://jenkins.dockerproject.org/job/Docker-PRs/49012/console
s390x (https://jenkins.dockerproject.org/job/Docker-PRs-s390x/9396/console) is a known flake: #36877
|
ping @tonistiigi LGTY? |
LGTM |
Looks like we have enough LGTM's - merging |
Fixes #36798
Release lock on
Restore
while interacting with containerd.Also adds some timeouts on the contexts on startup.