Commit 831702b

ci: store Docker logs and events as Jenkins artifacts if a "Test" stage fails (elastic#2950)
1 parent 73c1ba5 commit 831702b

2 files changed: +72 -9 lines changed

.ci/Jenkinsfile (+10 -9)
@@ -405,16 +405,17 @@ def generateStep(Map params = [:]){
 withEnv(["VERSION=${version}", "ELASTIC_APM_CONTEXT_MANAGER=${contextManager}"]) {
   deleteDir()
   unstash 'source'
-  dir("${BASE_DIR}"){
-    try {
-      retryWithSleep(retries: 2, seconds: 5, backoff: true) {
-        sh(label: "Run Tests", script: """.ci/scripts/test.sh -b "${buildType}" -t "${tav}" "${version}" """)
+  // Grab the current docker context for helping to troubleshoot the docker containers using filebeat and metricbeat
+  dockerContext(filebeatOutput: "docker-${version}-${buildType}.log", metricbeatOutput: "docker-${version}-${buildType}-metricbeat.log", archiveOnlyOnFail: true){
+    dir("${BASE_DIR}"){
+      try {
+        retryWithSleep(retries: 2, seconds: 5, backoff: true) {
+          sh(label: "Run Tests", script: """.ci/scripts/test.sh -b "${buildType}" -t "${tav}" "${version}" """)
+        }
+      } finally {
+        junit(testResults: "test_output/*.junit.xml", allowEmptyResults: true, keepLongStdio: true)
+        archiveArtifacts(artifacts: "test_output/*.tap", allowEmptyArchive: true)
       }
-    } catch(e){
-      error(e.toString())
-    } finally {
-      junit(testResults: "test_output/*.junit.xml", allowEmptyResults: true, keepLongStdio: true)
-      archiveArtifacts(artifacts: "test_output/*.tap", allowEmptyArchive: true)
     }
   }
 }

DEVELOPMENT.md (+62)
@@ -91,6 +91,68 @@ For example:
 6626799 FINISHED SUCCESS .ci/scripts/test.sh "8" "apollo-server-express" "false"
 ```
 
+## How to troubleshoot `Container "$containerId" is unhealthy.` errors
+
+Each "Test" step of a Jenkins CI build uses `docker-compose` to start services
+for testing, and then runs tests in a `node_tests` container. Starting those
+services can fail with the following unhelpful message in the logs:
+
+```
+[2022-09-19T05:55:43.897Z] .ci/scripts/test.sh:250: main(): docker-compose --no-ansi --log-level ERROR -f .ci/docker/docker-compose-all.yml up --exit-code-from node_tests --remove-orphans --abort-on-container-exit node_tests
+...
+[2022-09-19T05:56:23.776Z] ERROR: for node_tests Container "2d979b0c797d" is unhealthy.
+```
+
+That container ID does not identify *which* of the many service containers is
+the one to fail. Two ways to troubleshoot this are as follows.
+
+First, the Jenkins build will include log files of both Docker container logs
+and Docker events as Jenkins build artifacts, if the "Test" step failed. These
+are collected by filebeat and metricbeat (as configured by the `dockerContext()`
+block in ".ci/Jenkinsfile"). Here is an example querying the metricbeat log of
+docker events for containers that are failing their healthcheck. This uses
+[ecslog](https://github.com/trentm/go-ecslog) to filter and format the log file.
+
+```
+$ ecslog -k 'docker.healthcheck.failingstreak > 0' -i container.image.name,docker.healthcheck docker-16-release-metricbeat.log-20220927.ndjson
+...
+[2022-09-27T14:55:53.857Z] (on apm-ci-immutable-ubuntu-1804-1664290081867003348):
+    container: {
+      "image": {
+        "name": "mongo:6"
+      }
+    }
+    docker: {
+      "healthcheck": {
+        "status": "unhealthy",
+        "failingstreak": 49,
+        "event": {
+          "start_date": "2022-09-27T14:55:53.012Z",
+          "end_date": "2022-09-27T14:55:53.153Z",
+          "exit_code": -1,
+          "output": "OCI runtime exec failed: exec failed: unable to start container process: exec: \"mongo\": executable file not found in $PATH: unknown"
+        }
+      }
+    }
+```
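If ecslog is not installed, a similar filter can be sketched with jq. This is only an illustration and not part of the commit: it assumes the metricbeat artifact is newline-delimited JSON with nested `docker.healthcheck` and `container.image.name` objects (as the ecslog output above suggests), and the function name `unhealthy_healthchecks` is hypothetical.

```shell
# Hypothetical helper (not part of the repository): print one compact JSON
# line per metricbeat event whose healthcheck failing streak is above zero.
# Assumes nested fields, e.g. {"docker":{"healthcheck":{"failingstreak":49}}}.
unhealthy_healthchecks() {
  jq -c '
    # Events without a healthcheck are treated as a failing streak of 0.
    select((.docker.healthcheck.failingstreak // 0) > 0)
    | {time: .["@timestamp"],
       image: .container.image.name,
       healthcheck: .docker.healthcheck}
  ' "$1"
}

# Usage (file name taken from the ecslog example above):
#   unhealthy_healthchecks docker-16-release-metricbeat.log-20220927.ndjson
```

Each output line then names the image whose healthcheck is failing, which is the piece of information the bare container ID in the Jenkins log does not give.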
+
+Second, most of the time you should be able to reproduce a "Test" step failure
+locally. Sometimes this requires forcing an update to the latest Docker image
+for some services.
+
+```
+$ docker system prune --all --force --volumes # heavy-handed purge of all local Docker data
+...
+
+$ .ci/scripts/test.sh -b "release" -t "" "16" # or a different value for "16" depending which stage failed
+...
+```
+
+Once the failure is reproduced, you should be able to use `docker ps -a`,
+`docker inspect $containerId` and other regular Docker commands and tooling to
+dig into the issue.
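As one possible starting point, the Docker commands mentioned above can be combined into a small helper. This is a sketch and not part of the commit: the function name `report_unhealthy` is hypothetical, and it relies on the `health` filter of `docker ps` and the `.State.Health` field of `docker inspect`. For a container that has already stopped, the reported status may not reflect its state before it stopped, so the healthcheck log entries are likely the more useful evidence.

```shell
# Hypothetical helper (not part of the repository): list containers that
# Docker reports as unhealthy and print the recorded healthcheck state
# (status, failing streak, and recent probe output) for each one.
report_unhealthy() {
  docker ps -a --filter health=unhealthy \
      --format '{{.ID}} {{.Names}} {{.Image}}' |
  while read -r id name image; do
    echo "== $name ($image, $id)"
    # stdin is redirected so the command cannot consume the loop's input.
    docker inspect --format '{{json .State.Health}}' "$id" < /dev/null
  done
}

# Usage, after reproducing the failure locally:
#   report_unhealthy
```

The image and name printed for each entry make it possible to map the opaque ID from the `is unhealthy` error message back to a specific service.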
+
+
 # Maintenance tips
 
 ## How to check for outdated instrumentation modules
