DNS resolving stops working #1132
Comments
Are you on VPN? Is it similar to #997? |
I do regularly run on an IPsec VPN, so I've just tested again, and the same issue occurs with the VPN disabled |
Hi, colleague of carn1x here, observing the same error. After a cooldown period for the affected load test container, starting any other container shows:
It appears to be originating from this line: |
Thanks for the report! I have escalated this issue to our networking team. |
I have the same issue as @laffuste and the only way around it is to restart docker. I am using Docker for Mac (stable) Version 1.13.0 (15072). Thank you! |
+1 for this issue. Most notably on VPN (Cisco AnyConnect), containers are making outbound requests that don't close. The issue is not as pronounced outside of VPN. Connections are remaining in CLOSED and FIN_WAIT_2 states:

$ sudo lsof -i -P -n|grep com.dock
com.docke 45582 kdowney 153u IPv4 0x63aa6841fb399345 0t0 TCP 172.19.142.212:63444->23.209.176.27:443 (FIN_WAIT_2)
com.docke 45582 kdowney 154u IPv4 0x63aa684222a9dc3d 0t0 TCP 172.19.142.212:63448->23.213.69.112:443 (FIN_WAIT_2)
com.docke 45582 kdowney 155u IPv4 0x63aa684211710345 0t0 TCP 172.19.142.212:63701->23.213.69.112:443 (FIN_WAIT_2)
com.docke 45582 kdowney 156u IPv4 0x63aa6841f51c9a4d 0t0 TCP 172.19.142.212:63454->54.231.40.83:443 (CLOSED)
com.docke 45582 kdowney 157u IPv4 0x63aa684210e15e2d 0t0 TCP 172.19.142.212:63498->23.209.176.27:443 (FIN_WAIT_2)
com.docke 45582 kdowney 158u IPv4 0x63aa684211fa5e75 0t0 UDP *:49557
com.docke 45582 kdowney 159u IPv4 0x63aa684210e14c3d 0t0 TCP 172.19.142.212:63672->23.209.176.27:443 (FIN_WAIT_2)
com.docke 45582 kdowney 160u IPv4 0x63aa68421eeb801d 0t0 TCP 172.19.142.212:63708->23.213.69.112:443 (FIN_WAIT_2)
com.docke 45582 kdowney 161u IPv4 0x63aa68421eaa3345 0t0 TCP 172.19.142.212:63626->54.231.49.8:443 (CLOSED)

BTW, the lookups are to public addresses. These unclosed connections build up over time:

$ sudo lsof -i -P -n|grep com.dock|wc -l
173

When the count reaches close to 900 we get the error. The only solution is to restart the Docker app. This is blocking lots of developers from effectively using Docker for local development. Is there a fix coming?

Diagnostic ID:

$ docker info
Containers: 8
Running: 2
Paused: 0
Stopped: 6
Images: 20
Server Version: 1.13.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 139
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.8-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.787 GiB
Name: moby
ID: 2OAV:6UBN:DQWC:BMUF:NHBN:V52V:QMTG:6LXS:SGWG:KBBC:NXJP:PJGX
Docker Root Dir: /var/lib/docker
Debug Mode (client): true
Debug Mode (server): true
File Descriptors: 33
Goroutines: 40
System Time: 2017-02-17T19:56:46.170042147Z
EventsListeners: 1
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
ec2-54-146-21-195.compute-1.amazonaws.com:10006
127.0.0.0/8
Live Restore Enabled: false

$ docker version
Client:
Version: 1.13.1
API version: 1.26
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 08:47:51 2017
OS/Arch: darwin/amd64
Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 08:47:51 2017
OS/Arch: linux/amd64
Experimental: true

$ docker-compose version
docker-compose version 1.11.1, build 7c5d5e4
docker-py version: 2.0.2
CPython version: 2.7.12
OpenSSL version: OpenSSL 1.0.2j 26 Sep 2016 |
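A rough way to watch that connection count grow over time is a simple polling loop on the Mac (a sketch only; the grep pattern and the 60-second interval are arbitrary choices, not anything prescribed by Docker):

$ while true; do date; sudo lsof -i -P -n | grep -c com.dock; sleep 60; done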
+1 It appears to manifest on the Docker stable channel now:
|
The Mac has 2 connection / file descriptor limits: a per-process limit and a global limit. On my system the limits are:
Docker for Mac is quite conservative by default and imposes a lower limit of 900, stored here:
Perhaps this limit is too conservative. It's possible to increase the Docker for Mac limit by (for example):
The VM may need to restart at this point. If you experiment with this, let me know what happens. It's important not to hit the global limit, otherwise other software could start to fail. However I believe it is possible to increase |
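For reference, the Mac's global and per-process file descriptor limits can be inspected with standard macOS commands like these (the numbers shown are typical defaults, not necessarily the values on your machine):

$ sysctl kern.maxfiles kern.maxfilesperproc
kern.maxfiles: 12288
kern.maxfilesperproc: 10240
$ ulimit -n
256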
Boosting the limit only postpones the problem if your containers make many connections. The com.docker.slirp process continues to accumulate connections over time, even if all containers have been stopped. Most of the connections are in CLOSE_WAIT state, from what I can see. |
For me this quick fix was fine: remove all of the unused/stopped containers:
The error occurred after not cleaning up after executing a bunch of docker builds and then iteratively running those builds, and ctrl-c'ing out of the running container. I probably had 900+ containers just lying around without having been removed. |
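For anyone wanting to do the same cleanup, a minimal sketch using standard Docker CLI commands (docker container prune needs a 1.13+ client; the docker rm form also works on older clients):

$ docker container prune
$ docker rm $(docker ps -a -q -f status=exited)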
@jmunsch The issue continued to occur for me, even after removing all running and stopped containers. Only a restart of the VM actually clears out the open connections on the Mac. |
I think I'm also observing this on my machine (macOS 10.12.3).
I'm running an npm local server that has to fetch metadata from npm, and it can't actually fetch the entire skimdb without halting (making it through about 30%). Here's my Dockerfile if it helps to repro:
Docker compose:
run |
Same problem here: macOS 10.12.6 and Docker for Mac Version 17.06.0-ce-mac19 (18663). I'm running 4-5 containers perfectly, and networking just stops after a few days of very low (server) activity. Nothing but a restart of the client helps. |
Same issue here on macOS 10.12.6 and Docker Version 17.06.1-ce-mac24 (18950). It's very annoying since I can't test my stuff for more than 5 minutes at a time, and then I have to shut down all containers, wait for Docker to restart, and then restart all containers, just to test for 5 more minutes. Can we add an "osx/10.12.x" label to this issue and maybe get a timeframe for a fix please? |
Same issue here on macOS 10.12.5 and Docker 17.06.1-ce-mac24 (18950). I can't run functional tests that last ~30 minutes because after 5-10 minutes under load the containers become unreachable through the network. |
@bdharrington7 thanks for including repro steps. I managed to reproduce it once on 17.06 but not in subsequent runs. I've not managed to reproduce it on edge 17.07 yet. The one time I reproduced it I saw around ~50 file handles in use on the Mac. It might be worth trying the edge 17.07 version since that |
I followed the advice of @djs55: increasing to 2000 bought me some time, but I was then forced to increase to 4000. I imagine before too long I'll need to increase again or restart Docker for Mac. For info, no VM/Docker restart was necessary for me to see the benefit. If it is indeed a connection limit, I don't need to do much to get to the first (500) limit: about 5 containers running some standard webapp / db tasks. Happy to provide further debug if someone wants any specific info/output. -i |
I wonder if this is related to docker/docker-py#1293. |
Running with 17.12.0-ce-mac49 (stable f50f37646d) I just noticed that I cannot start more containers:
Checking with
where 192.168.1.123 would be my host machine's IP, and 10.10.10.10 is an external server. Looks like some bug in vpnkit not cleaning up properly? |
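A quick way to watch the host-side connection count in this situation is something like the following (a sketch; the process name to match may appear as com.docker.slirp or vpnkit depending on the Docker for Mac version):

$ sudo lsof -i -P -n | grep -c -E 'com.docke|vpnkit'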
I can reproduce it very easily on edge (18.05.0-ce-mac66) by hosting a HAProxy container that health checks a few services outside the Docker VM every second. I run out of ports within an hour. |
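If it helps to approximate that workload without HAProxy, a rough shell-only sketch that makes one outbound connection per second from a container is below (the target URL is a placeholder, not one of the services actually being health checked):

$ docker run --rm alpine sh -c 'while true; do wget -q -O /dev/null http://example.com/; sleep 1; done'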
@STRML could you point me at a Dockerfile or docker-compose.yml that demonstrates the problem? Are these health checks TCP connects or HTTP transactions? |
@djs55 I think you may be right. DNS resolution seems fine, but connecting sockets seems to fail. I don't have a simple reproduction example. I've been using https://github.com/opennms-forge/docker-horizon-core-web set to monitor a few hosts. After a few days, the hosts are no longer reachable and I have to restart Docker itself.
Connecting to the container with
As a sanity test, I just set
Here's my test loop.

i=0
while exec 3>/dev/tcp/192.168.1.147/1234; do
  echo "$i"
  echo "$i" >&3
  exec 3>&-
  i=$((i+1))
done

So the issue is unlikely to be just creating too many sockets. Next time the network starts failing, I'll try to debug more. |
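For anyone wanting to run the same loop, the other end can be as simple as a plain listener on the Mac side, e.g. with netcat (the IP and port used in the loop above are specific to that network and are only an example):

$ nc -lk 1234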
@stevecheckoway thanks for the update. I've also tried to reproduce a failure along those lines but with no success so far :( There's clearly something a bit different about the workloads that trigger the problem... perhaps there's an issue with the transparent HTTP proxy used if the connections are to port 80 or 443? |
@djs55 Working reproduction at https://github.com/STRML/docker-for-mac-socket-issue. I see the open sockets increase at 5 per second, and they are never reaped. Only fix is to eventually restart the VM. |
@STRML thanks for the repro. I'm seeing some strange behaviour -- at first it was leaking at about 5 per second as you report but it seems to have flattened out at about 80. I'll leave it running to see what happens. Looking at |
Ah yes, I see the same top-out at about 80; it seems to be the Python server. Replacing it with a NodeJS listener causes it to continually increase as expected. I've updated the repo. |
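For reference, the NodeJS listener can be as small as a one-liner that accepts connections and never closes them (a sketch, not the exact code from the repo; the port is a placeholder):

$ node -e "require('net').createServer(() => {}).listen(8080)"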
This issue affects our integration tests as well: connections in |
@STRML thanks for the repo update. It seems to reproduce for me -- I let it climb to over 300 and then I killed the |
If a client sends SYN, we connect the external socket and reply with SYN ACK. If the client responds with RST ACK then previously we would leak the connection. This patch extends the existing mechanism which closes connections when switch ports are timed-out, adding a connection close when such an "early reset" is encountered. Once the connection has been established we assume we can use the existing closing mechanism: a client sending a RST should cause the TCP/IP stack to close our flow. Related to [docker/for-mac#1132] Signed-off-by: David Scott <[email protected]>
If a client sends SYN, we connect the external socket and reply with SYN ACK. If the client responds with RST ACK then previously we would leak the connection. This patch refactors the connection closing mechanism, creating an idempotent `close_flow` function which is called
- on normal close, when the proxy receives `FIN` etc
- on a reset, including during the handshake
- when a switch port is being timed-out.
This replaces the previous `on_destroy` promise which was used in `Lwt.pick` since closing the connection should cause the proxy to receive EOF. Related to [docker/for-mac#1132] Signed-off-by: David Scott <[email protected]>
I believe I have a fix for the problem. The TCP keepalive used by
If you'd like to try the proposed fix for yourself, you can try the latest development build from: This is not yet released and may be buggy; therefore please don't use it for production :-) If you get a chance to try this, let me know how you get on! Thanks again for the reports and the repro case. |
Thanks @djs55! Works like a charm! At first it started accumulating some CLOSED connections, but now I check and can't see more than 20. Have a great day! |
@ceo thanks for letting me know -- glad it worked for you! |
FYI this is released on edge @ 18.06.0-ce-rc3-mac68, although it is erroneously labeled as having something to do with HAProxy, when it really is:
HAProxy just happens to be one popular program that does this. |
Hello @djs55! I'm having the same problem but with CentOS 7.5 and Nginx. I have updated to the latest Docker version 18.06.0-ce, build 0ffa825, but the problem is still there... :( Is your fix only applicable to Mac? Update: Sorry, it seems I did not test it properly. The issue also seems to be resolved for me with version 18.06.0-ce on CentOS with Nginx. |
@ernestojpg This issue is about Docker for Mac (and probably for Windows). It doesn't look like fixes in vpnkit are going to affect the Linux platform, since there's no vpnkit deployed on Linux. The issue you are having must be something else. |
The fix that was in edge 18.06.0-ce-rc3 should also be in the current stable release 18.06.0. I'll close this ticket since I believe the issue is fixed everywhere. If something else goes wrong, please open a fresh ticket. Thanks for your report! |
Expected behavior
A load test client container can maintain a high rate of requests indefinitely
Actual behavior
A load test client container can maintain a high rate of requests only for a short amount of time (5-10 minutes) before it is eventually no longer able to resolve DNS. The same behaviour then spreads to all other containers, whether already running or created after the issue occurs.
Information
Docker for Mac: version: 1.13.0-rc5-beta35 (25211e84d)
OS X: version 10.11.6 (build: 15G1108)
The Diagnose tool, whilst in this state, outputs:
Steps to reproduce the behavior
Run an endless load test with many parallel clients (achieved with Locustio with 400 concurrent users, which runs multiple threads each firing requests using the Python requests library) for 5-10 minutes against an external (non-Docker) webserver. Eventually requests will begin failing with no response.
Docker exec into the client container or any other running container (or run a new container), try to perform a DNS lookup, and observe the failure:
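For example (my_container is a placeholder name; use nslookup, dig, or getent hosts depending on what the image provides):

$ docker exec -it my_container nslookup google.com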
Steps to workaround
Restart Docker for Mac
Diagnose tool now outputs: