test-cluster-master-{error,kill} fail only in weird CI environment #6193
Given that I can only reproduce this in our CI build, I've started checking in experimental changes just to see how they fare. Here's an experiment which just adds logging:

diff --git a/test/parallel/test-cluster-master-error.js b/test/parallel/test-cluster-master-error.js
index ae0f655..16d63c1 100644
--- a/test/parallel/test-cluster-master-error.js
+++ b/test/parallel/test-cluster-master-error.js
@@ -2,21 +2,39 @@
var common = require('../common');
var assert = require('assert');
var cluster = require('cluster');
+var fs = require('fs');
+
+// really wish we had `...values` syntax
+function log(strings) {
+ var line = ['TEST-CLUSTER-MASTER-ERROR ', process.pid, ' '];
+ for (let i = 0; i < strings.length; i++) {
+ line.push(strings[i]);
+ if (i+1 !== strings.length) {
+ line.push(arguments[i+1]);
+ }
+ }
+ fs.appendFileSync(__dirname + '/../../artifacts/folta.log', line.join('') + '\n');
+}
// Cluster setup
if (cluster.isWorker) {
+ log`WORKER started`;
var http = require('http');
http.Server(function() {
- }).listen(common.PORT, '127.0.0.1');
+ }).listen(common.PORT, '127.0.0.1', function() {
+ log`WORKER listening`;
+ });
} else if (process.argv[2] === 'cluster') {
+ log`MASTER started`;
var totalWorkers = 2;
// Send PID to testcase process
var forkNum = 0;
cluster.on('fork', function forkEvent(worker) {
+ log`MASTER on:fork ${worker.process.pid}`;
// Send PID
process.send({
@@ -26,6 +44,7 @@ if (cluster.isWorker) {
// Stop listening when done
if (++forkNum === totalWorkers) {
+ log`MASTER on:fork ${worker.process.pid} -- got totalWorkers`;
cluster.removeListener('fork', forkEvent);
}
});
@@ -33,14 +52,17 @@ if (cluster.isWorker) {
// Throw accidently error when all workers are listening
var listeningNum = 0;
cluster.on('listening', function listeningEvent() {
+ log`MASTER on:listening`;
// When all workers are listening
if (++listeningNum === totalWorkers) {
+ log`MASTER on:listening -- got totalWorkers`;
// Stop listening
cluster.removeListener('listening', listeningEvent);
// throw accidently error
process.nextTick(function() {
+ log`MASTER on:listening -- got totalWorkers -- about to throw`;
console.error('about to throw');
throw new Error('accidently error');
});
@@ -52,7 +74,10 @@ if (cluster.isWorker) {
cluster.fork();
cluster.fork();
+ log`MASTER done forking`;
} else {
+ log`=================================================================`;
+ log`TEST started`;
// This is the testcase
var fork = require('child_process').fork;
@@ -62,8 +87,10 @@ if (cluster.isWorker) {
//this will throw an error if the process is dead
process.kill(pid, 0);
+ log`TEST isAlive(${pid}) true`;
return true;
} catch (e) {
+ log`TEST isAlive(${pid}) false -- ${e.message}`;
return false;
}
};
@@ -82,12 +109,14 @@ if (cluster.isWorker) {
// Add worker pid to list and progress tracker
if (data.cmd === 'worker') {
+ log`TEST on:message worker ${data.workerPID}`;
workers.push(data.workerPID);
}
});
// When cluster is dead
master.on('exit', function(code) {
+ log`TEST on:exit code=${code}`;
// Check that the cluster died accidently
existMaster = !!code;
@@ -101,6 +130,7 @@ if (cluster.isWorker) {
setTimeout(checkWorkers, timeout);
function checkWorkers() {
+ log`TEST checkWorkers`;
// When master is dead all workers should be dead to
var alive = false;
workers.forEach(function(pid) {
@@ -110,15 +140,16 @@ if (cluster.isWorker) {
});
// If a worker was alive this did not act as expected
- existWorker = !alive;
+ existWorker = alive;
}
});
process.once('exit', function() {
+ log`TEST once:exit`;
var m = 'The master did not die after an error was throwed';
assert.ok(existMaster, m);
m = 'The workers did not die after an error in the master';
- assert.ok(existWorker, m);
+ assert.ok(!existWorker, m);
});
}

With that, this is what I see in the build (where the test fails):
This is what I see when I run locally (where the test passes):
@nodejs/testing
Odd... I just ran the same test on my macbook using off-the-shelf v4.4.3 and got a log of events very similar to the CI log above, except isAlive() returned false and the test passed. I suspect I'll have to do some deeper logging, perhaps adding additional listeners in the test or adding logging in lib/cluster.js itself.
@nodejs/docker Not sure that Docker is a factor here, but not sure it isn't. Maybe it's something those folks have seen?
Curious. I updated the logging to (a) specify which event listener is emitting the events, and (b) add listeners for a bunch more stuff. With that, this is what I see in the build (where the test fails):
This is what I see when I run locally (where the test passes):
In both environments, the worker ... I think the next thing to check is the actual timing. Is checkWorkers happening so quickly after the worker's process.on('exit') that the process hasn't actually died yet? Is there something else causing the worker processes to hang around after that event?
@drewfish I've been able to reproduce the error in
@santigimeno I tried that patch (plus the logging stuff). I'm still able to reproduce the test failure. Here's the log:
Uggg.. no new insights, just more weirdness. I added timestamps and listeners for many events. The most interesting change is calling this in each process:

function installHandlers(who) {
process.on('error', (e) => { log`${who} process.on:error ${e}`; });
process.on('uncaughtException', (e) => { log`${who} process.on:uncaughtException ${e}`; process.exit(20); });
process.on('unhandledRejection', (reason, p) => { log`${who} process.on:unhandledRejection reason=${reason}`; process.exit(21); });
process.on('exit', (code) => { log`${who} process.on:exit code=${code}`; });
process.on('SIGPIPE', () => { log`${who} process.on:SIGPIPE`; });
process.on('SIGCHLD', () => { log`${who} process.on:SIGCHLD`; });
process.on('SIGTTIN', () => { log`${who} process.on:SIGTTIN`; });
process.on('SIGTTOU', () => { log`${who} process.on:SIGTTOU`; });
process.on('SIGTERM', () => { log`${who} process.on:SIGTERM`; process.exit(22); });
process.on('SIGINT', () => { log`${who} process.on:SIGINT`; process.exit(23); });
process.on('SIGHUP', () => { log`${who} process.on:SIGHUP`; process.exit(24); });
}

With that, here's what I get in the failing environment:
Here's what I get in the succeeding environment:
(The multiple ...
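For anyone trying to reproduce this outside the test, here's a minimal standalone sketch of the same idea (the file layout, delay, and exit codes are made up; it just forks a copy of itself that throws, with handlers like the ones above, so you can see which events fire in which process):

```js
'use strict';
// Standalone sketch: fork a copy of this file that throws, and install
// handlers like the ones above so each process tags what it sees.
const child_process = require('child_process');

function installHandlers(who) {
  const log = (msg) => console.error(`${Date.now()} ${who} ${process.pid} ${msg}`);
  process.on('exit', (code) => log(`process.on:exit code=${code}`));
  process.on('uncaughtException', (e) => { log(`process.on:uncaughtException ${e.message}`); process.exit(20); });
  process.on('SIGCHLD', () => log('process.on:SIGCHLD'));
  process.on('SIGTERM', () => { log('process.on:SIGTERM'); process.exit(22); });
}

if (process.argv[2] === 'child') {
  installHandlers('CHILD');
  setTimeout(() => { throw new Error('accidently error'); }, 100);
} else {
  installHandlers('PARENT');
  const child = child_process.fork(__filename, ['child']);
  child.on('exit', (code, signal) => {
    console.error(`PARENT saw child exit: code=${code} signal=${signal}`);
  });
}
```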
@bengl suggested I try the same test on 0.12. Here's the same test code (with new language features downgraded) run on 0.12.13:
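For reference, "new language features downgraded" mostly means the tagged-template log calls; a 0.12-friendly version of the helper could look roughly like this (a sketch, not necessarily the exact code that was run; call sites become plain calls such as log('MASTER on:fork ' + worker.process.pid)):

```js
// Sketch of a 0.12-compatible logger: no tagged templates, no let/arrow.
var fs = require('fs');

function log() {
  var line = ['TEST-CLUSTER-MASTER-ERROR ', process.pid, ' '];
  for (var i = 0; i < arguments.length; i++) {
    line.push(arguments[i]);
  }
  fs.appendFileSync(__dirname + '/../../artifacts/folta.log', line.join('') + '\n');
}
```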
@drewfish I've been able to reproduce the error more or less consistently by increasing the number of workers. I think the problem might be that sometimes the workers are not dying immediately (at least not within the 200ms after the master dies; code here: https://github.com/nodejs/node/blob/v4.x/lib/cluster.js#L534-L540). Have you tried increasing the timeout?
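(For anyone following along, the worker-count knob lives in the master branch of the test; a sketch of that kind of change, with 8 as a purely example count:)

```js
// In the `process.argv[2] === 'cluster'` branch of the test: fork more
// workers so the shutdown race has more chances to show up.
var totalWorkers = 8; // the original test uses 2
for (var i = 0; i < totalWorkers; i++) {
  cluster.fork();
}
```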
I had bumped up the timeout to 10 seconds but I was still seeing the failure.
Yeah, I just added back a setInterval() and had it call checkWorkers() every second, and after 10 seconds all children were still alive. I think I'm going to have to wire some logging into lib/cluster and start looking at what's going on there.
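For completeness, the setInterval() variant described here is roughly the following, replacing the single setTimeout(checkWorkers, timeout) in the testcase branch (checkWorkers, workers and isAlive are the names from the patch above; the 1-second interval and 10-check cut-off are the values mentioned):

```js
// Instead of one checkWorkers() call after `timeout`, poll every second and
// stop after 10 checks, or earlier once no worker is alive.
var checks = 0;
var poll = setInterval(function() {
  checks++;
  checkWorkers();
  if (checks >= 10 || !workers.some(isAlive)) {
    clearInterval(poll);
  }
}, 1000);
```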
Rats. I just ran a build on 6.0.0 (w/out any custom modifications) and saw the same failures:
You mentioned it happens in docker? If the master dies before the workers, the workers become orphans and it becomes the job of init (pid 1) to reap the exited worker processes (that is what takes 200ms). If you run it in docker, then pid 1 will not be an init that reaps them. The point is: something needs to reap the workers. Try running tini as the first process in your container and see if that helps.
EDIT: fix spelling of tini
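Side note for others hitting this: that would also explain why the test's `isAlive()` check keeps returning true here; `kill(pid, 0)` still succeeds for a zombie (an exited child that nobody has `wait()`ed on), so orphaned-but-unreaped workers look alive to the test. The usual fix is exactly what's suggested above: make PID 1 in the container an init-like process, e.g. tini as the container entrypoint, or `docker run --init` on Docker versions that support it.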
@ncopa Thanks for pointing out the init issue! I wasn't aware of this limitation in Docker, but it seems there has already been some discussion on this topic in nodejs/build-containers#19. @drewfish, maybe the workaround mentioned there (the "my_init" script) can resolve your issue.
Yeah, this smells about right, thanks @ncopa! I'll give it a try (either tini or some other reaper).
@drewfish Is this issue resolved? Or not exactly?
ping @drewfish. Can this be closed?
Awesome. Thanks for following up @bengl
We build 4.x for use internally in the company, using a Jenkins job which creates a docker image in which the actual build and tests are run. Very oddly, test-cluster-master-error and test-cluster-master-kill fail every build. I've not been able to reproduce this in any other environment, even running the same docker image on my macbook. Here is the output of the two failing tests:
The tests were run via: