
Relax some libcontainerd client locking #36848

Merged: 1 commit into moby:master from cpuguy83:libcontainerd_client_locking, Apr 18, 2018

Conversation

@cpuguy83 (Member) commented Apr 13, 2018

Fixes #36798

Release lock on Restore while interacting with containerd.
Also adds some timeouts on the contexts on startup.
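For context, the shape of the change is roughly the following. This is a minimal Go sketch under assumed names (the `client` struct, its `containers` map, and a `remote` handle are illustrative stand-ins), not the PR's actual diff:

```go
package libcontainerd

import (
	"context"
	"errors"
	"sync"
	"time"
)

// remote is a hypothetical stand-in for the containerd client handle.
type remote interface {
	LoadContainer(ctx context.Context, id string) (interface{}, error)
}

type client struct {
	sync.Mutex
	remote     remote
	containers map[string]interface{}
}

// Restore checks local state under the lock, but releases it before the
// containerd round-trip so concurrent restores are not serialized behind
// a slow (or stuck) call.
func (c *client) Restore(ctx context.Context, id string) error {
	c.Lock()
	_, exists := c.containers[id]
	c.Unlock() // released before talking to containerd

	if exists {
		return errors.New("container already restored: " + id)
	}

	// Bound the remote call so a hang fails with DeadlineExceeded instead
	// of blocking daemon startup (this is the timeout debated below).
	tctx, cancel := context.WithTimeout(ctx, 60*time.Second)
	defer cancel()

	ctr, err := c.remote.LoadContainer(tctx, id)
	if err != nil {
		return err
	}

	c.Lock()
	c.containers[id] = ctr
	c.Unlock()
	return nil
}
```

One consequence the sketch makes visible: once the lock is dropped, two concurrent Restore calls for the same ID can both pass the existence check; handling that race is deliberately elided here.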

@mlaventure (Contributor) left a comment

One nit; LGTM.

It should allow other threads to proceed, although I don't understand how/why the one you highlighted in the issue never gets an answer from containerd (which would unlock it).

Hmm, maybe it's because the fifo is already closed when containerd tries to open it, in which case the context timeout fixes that too 🤔
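(For readers unfamiliar with this failure mode: opening a FIFO blocks until the peer opens the other end, so if the dockerd side never opens or has already gone away, containerd's open can block indefinitely. A small standalone demonstration, not code from either project:)

```go
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	path := "/tmp/demo.fifo"
	_ = unix.Mkfifo(path, 0o600)
	defer os.Remove(path)

	done := make(chan error, 1)
	go func() {
		// Blocks until some writer opens the fifo; with no writer this
		// goroutine hangs forever -- the failure mode discussed above.
		f, err := os.OpenFile(path, os.O_RDONLY, 0)
		if err == nil {
			f.Close()
		}
		done <- err
	}()

	select {
	case err := <-done:
		fmt.Println("open returned:", err)
	case <-time.After(2 * time.Second):
		fmt.Println("open is still blocked: no writer ever opened the fifo")
	}
}
```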

@@ -131,9 +131,28 @@ func (c *client) Version(ctx context.Context) (containerd.Version, error) {
return c.getRemote().Version(ctx)
}

// Restore restores loads the containerd container.
@mlaventure (Contributor), inline comment on the diff:

s/restores//

@cpuguy83 (Member, Author)

Yes, the timeout definitely fixes it, though it blocks everything else until the timeout is reached.

@vdemeester (Member) left a comment

LGTM 🦁

@tonistiigi (Member)

What are these arbitrary 60-second timeouts for? We shouldn't hide hangs like this and leave the daemon in an undefined state.

If there is some fifo hanging, it should be fixed instead. We have tools for force-closing fifos one-sided; the 0.2 branch used a discardFifo helper that I don't see anymore.

@cpuguy83 (Member, Author)

@tonistiigi I think a hung daemon is an undefined state, whereas an error saying the deadline was exceeded can provide much more valuable information, because we get the call stack in the error rather than requiring users to fetch the stack.
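For illustration of the behavior being argued for here: a bounded context converts a hang into a context.DeadlineExceeded error whose message points at the stuck call. A sketch with a hypothetical callWithTimeout helper, not the PR's code:

```go
package libcontainerd

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithTimeout is a hypothetical helper: it bounds a containerd call
// so that a wedged fifo/RPC surfaces as an error instead of a hung daemon.
func callWithTimeout(parent context.Context, d time.Duration, call func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()

	if err := call(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// The error (and its call site) tells us where things hung,
			// without users having to dump goroutine stacks.
			return fmt.Errorf("containerd did not answer within %s: %w", d, err)
		}
		return err
	}
	return nil
}
```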

@tonistiigi (Member)

If this is a temporary fix in the hope of gathering stack traces from users because we don't know where the issue is, then it needs to be clearly marked as such.

@mlaventure (Contributor)

I think the issue on the containerd side should have been fixed by containerd/containerd#2229.

@mlaventure (Contributor)

Hmm, but that went into 1.0.x, so I think it was probably there when @stevvooe hit his issue.

@cpuguy83 (Member, Author)

I don't see it as temporary; it protects the daemon from a range of potential bugs.
A stuck daemon, especially one stuck in startup, is really bad.

@tonistiigi (Member)

The problem here isn't so much the timeout as ignoring this range of potential bugs. Regarding containerd bugs, this isn't really different from most other places where we call containerd and expect it to answer. I have nothing against detecting unexpected behavior from containerd and handling it better in moby (for example, marking containers as "errored/unresponsive/corrupted", refusing to handle specific API calls, etc.).

@cpuguy83 (Member, Author)

@tonistiigi I will remove the timeout for now; however, I'm not sure I agree with the assessment that it's hiding bugs. It exposes bugs in a much easier-to-debug way that doesn't block production systems.
But it would indeed be nice if we still had some sort of status on the affected container rather than just hiding it.

@cpuguy83 force-pushed the libcontainerd_client_locking branch from 7770ec9 to 59e19fd on April 16, 2018 14:52
@codecov bot commented Apr 16, 2018

Codecov Report

Merging #36848 into master will increase coverage by 0.04%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #36848      +/-   ##
==========================================
+ Coverage    35.2%   35.25%   +0.04%     
==========================================
  Files         614      614              
  Lines       45698    45710      +12     
==========================================
+ Hits        16090    16114      +24     
+ Misses      27473    27464       -9     
+ Partials     2135     2132       -3

@cpuguy83 force-pushed the libcontainerd_client_locking branch from 59e19fd to e6832ec on April 17, 2018 16:06
Commit message:

This unblocks the client to take other restore requests and makes sure
that a long/stuck request can't block the client forever.

Signed-off-by: Brian Goff <[email protected]>
@cpuguy83 force-pushed the libcontainerd_client_locking branch from e6832ec to 806700e on April 17, 2018 16:07
@thaJeztah (Member)

Failure on PowerPC; could be a flake: https://jenkins.dockerproject.org/job/Docker-PRs-powerpc/9465/console

16:48:06 === RUN   TestNetworkLoopbackNat
16:48:07 --- FAIL: TestNetworkLoopbackNat (0.94s)
16:48:07 	nat_test.go:86: assertion failed: it works (msg string) !=  (string)

@thaJeztah (Member)

https://jenkins.dockerproject.org/job/Docker-PRs/49012/console

17:46:26 FAIL: docker_cli_swarm_test.go:1372: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
17:46:26 
17:46:26 [d1e6debbe7d15] waiting for daemon to start
17:46:26 [d1e6debbe7d15] daemon started
17:46:26 
17:46:26 [de4084354f573] waiting for daemon to start
17:46:26 [de4084354f573] daemon started
17:46:26 
17:46:26 [d1b0b5ac5c5a3] waiting for daemon to start
17:46:26 [d1b0b5ac5c5a3] daemon started
17:46:26 
17:46:26 [de4084354f573] exiting daemon
17:46:26 [de4084354f573] waiting for daemon to start
17:46:26 [de4084354f573] daemon started
17:46:26 
17:46:26 [d1b0b5ac5c5a3] exiting daemon
17:46:26 [d1b0b5ac5c5a3] waiting for daemon to start
17:46:26 [d1b0b5ac5c5a3] daemon started
17:46:26 
17:46:26 [de4084354f573] exiting daemon
17:46:26 [de4084354f573] waiting for daemon to start
17:46:26 [de4084354f573] daemon started
17:46:26 
17:46:26 assertion failed: 
17:46:26 Command:  /usr/local/cli/docker --host unix:///tmp/docker-integration/de4084354f573.sock swarm unlock
17:46:26 ExitCode: 0
17:46:26 Error:    <nil>
17:46:26 Stdout:   
17:46:26 Stderr:   
17:46:26 
17:46:26 Failures:
17:46:26 ExitCode was 0 expected 1
17:46:26 Expected stderr to contain "invalid key"
17:46:26 [d1e6debbe7d15] exiting daemon
17:46:26 [de4084354f573] exiting daemon
17:46:26 [d1b0b5ac5c5a3] exiting daemon

s390x (https://jenkins.dockerproject.org/job/Docker-PRs-s390x/9396/console) is a known flake: #36877

16:54:09 FAIL: docker_cli_exec_unix_test.go:18: DockerSuite.TestExecInteractiveStdinClose
16:54:09 
16:54:09 docker_cli_exec_unix_test.go:37:
16:54:09     c.Assert(strings.TrimSpace(output), checker.Equals, "hello")
16:54:09 ... obtained string = "hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
16:54:09 ... expected string = "hello"
16:54:09 
16:54:10 

@thaJeztah (Member)

ping @tonistiigi LGTY?

@tonistiigi (Member)

LGTM

@thaJeztah (Member)

Looks like we have enough LGTMs; merging.

@thaJeztah merged commit 69a5611 into moby:master on Apr 18, 2018
@cpuguy83 deleted the libcontainerd_client_locking branch on April 23, 2018 14:49