@@ -91,6 +91,68 @@ For example:
6626799 FINISHED SUCCESS .ci/scripts/test.sh "8" "apollo-server-express" "false"
```

+ ## How to troubleshoot `Container "$containerId" is unhealthy.` errors
+
+ Each "Test" step of a Jenkins CI build uses `docker-compose` to start services
+ for testing, and then runs tests in a `node_tests` container. Starting those
+ services can fail with the following unhelpful message in the logs:
+
+ ```
+ [2022-09-19T05:55:43.897Z] .ci/scripts/test.sh:250: main(): docker-compose --no-ansi --log-level ERROR -f .ci/docker/docker-compose-all.yml up --exit-code-from node_tests --remove-orphans --abort-on-container-exit node_tests
+ ...
+ [2022-09-19T05:56:23.776Z] ERROR: for node_tests Container "2d979b0c797d" is unhealthy.
+ ```
+
+ That container ID does not identify *which* of the many service containers
+ failed. There are two ways to troubleshoot this.
+
+ First, if the "Test" step failed, the Jenkins build artifacts include log files
+ of both the Docker container logs and Docker events. These are collected by
+ filebeat and metricbeat (as configured by the `dockerContext()` block in
+ `.ci/Jenkinsfile`). Here is an example that queries the metricbeat log of
+ Docker events for containers that are failing their healthcheck. It uses
+ [ecslog](https://github.com/trentm/go-ecslog) to filter and format the log file.
+
+ ```
+ $ ecslog -k 'docker.healthcheck.failingstreak > 0' -i container.image.name,docker.healthcheck docker-16-release-metricbeat.log-20220927.ndjson
+ ...
+ [2022-09-27T14:55:53.857Z] (on apm-ci-immutable-ubuntu-1804-1664290081867003348):
+     container: {
+       "image": {
+         "name": "mongo:6"
+       }
+     }
+     docker: {
+       "healthcheck": {
+         "status": "unhealthy",
+         "failingstreak": 49,
+         "event": {
+           "start_date": "2022-09-27T14:55:53.012Z",
+           "end_date": "2022-09-27T14:55:53.153Z",
+           "exit_code": -1,
+           "output": "OCI runtime exec failed: exec failed: unable to start container process: exec: \"mongo\": executable file not found in $PATH: unknown"
+         }
+       }
+     }
+ ```
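
If `ecslog` is not installed, a rough first pass over the same log may be possible with plain `grep`. This is only a sketch: the helper name `failing_healthchecks` is hypothetical, and it assumes each ndjson line serializes the streak as a `"failingstreak": N` key with an integer value, as the output above suggests.

```shell
# Hypothetical helper: print the raw metricbeat events whose healthcheck
# failing streak is non-zero. Assumes one JSON event per line and a
# "failingstreak" key holding a plain integer.
failing_healthchecks() {
  grep -E '"failingstreak":[[:space:]]*[1-9]' "$1"
}
```

Called as `failing_healthchecks docker-16-release-metricbeat.log-20220927.ndjson`, it prints the matching lines unformatted, so the image name and healthcheck output still have to be picked out by eye.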
+
+ Second, most of the time you should be able to reproduce a "Test" step failure
+ locally. Sometimes this requires forcing an update to the latest Docker image
+ for some services.
+
+ ```
+ $ docker system prune --all --force --volumes # heavy-handed purge of all local Docker data
+ ...
+
+ $ .ci/scripts/test.sh -b "release" -t "" "16" # or a different value for "16" depending on which stage failed
+ ...
+ ```
+
+ Once the failure is reproduced, you should be able to use `docker ps -a`,
+ `docker inspect $containerId` and other regular Docker commands and tooling to
+ dig into the issue.
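
As a starting point, the two commands mentioned above can be narrowed to the healthcheck itself. This is a sketch rather than part of the project's tooling: the function names are hypothetical, and it relies only on the `health` filter of `docker ps` and the `.State.Health` field reported by `docker inspect`.

```shell
# Hypothetical helper: list containers whose healthcheck reports "unhealthy".
list_unhealthy() {
  docker ps -a --filter health=unhealthy --format '{{.ID}} {{.Image}} {{.Names}}'
}

# Hypothetical helper: print the recorded healthcheck attempts (exit codes and
# output) for one container as JSON.
health_log() {
  docker inspect --format '{{json .State.Health}}' "$1"
}
```

Running `list_unhealthy` and then `health_log` with one of the printed IDs should surface the same kind of healthcheck output that the metricbeat log records.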
+
+
# Maintenance tips

## How to check for outdated instrumentation modules