Commit 1449058

doc: add basic troubleshooting guide for machines (#932)
Would be great to get some of these Tips & Tricks written down. Refs: #775 (comment)
1 parent fda8381 commit 1449058

1 file changed: +114 -0 lines changed

doc/fixing-machines.md

@@ -0,0 +1,114 @@
# Fixing machines

A basic guide to troubleshooting machine issues.

## Jenkins job failures

Usually the first sign of trouble is a failed Jenkins job. The first thing to
check is whether this is actually a Jenkins issue rather than a machine issue.

1. Rebuild the job.
2. Clean the workspace and rebuild the job.
3. Check vital signs on the machine page
   (`https://ci.nodejs.org/computer/<machine>/`).
4. Take the machine offline and rebuild. Do other machines have the same
   problem?
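
The machine page in step 3 can also be queried from a terminal. The sketch below assumes the standard Jenkins remote API is enabled (the `api/json` suffix on the machine page URL) and that `curl` is installed. The function names and the example machine name are hypothetical, and the server may require credentials.

```bash
# Build the JSON API URL for a machine page on ci.nodejs.org.
# Assumption: the standard Jenkins remote API suffix `api/json` is enabled.
machine_api_url() {
  printf 'https://ci.nodejs.org/computer/%s/api/json\n' "$1"
}

# Fetch the machine status as JSON. The response is expected to include
# fields such as "offline" and "offlineCauseReason"; credentials may be needed.
machine_status() {
  curl -fsS "$(machine_api_url "$1")"
}

# Example (hypothetical machine name):
#   machine_status test-example-ubuntu1604-x64-1
```

Keeping the URL builder separate from the network call makes it easy to check the URL before anything is fetched.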
15+
16+
## On the machine itself
17+
18+
Try to `su` to the Jenkins user if possible, this avoids accidentally doing
19+
things as root (which can destroy the machine, or break jobs).
20+
21+
```bash
22+
su iojs # Check user with `ps -ef | grep java`.
23+
```
24+
25+
It's also good practice to raise/comment in a GitHub issue (or on IRC) to let
26+
others know what you did.
### Out of space

The most common problem is that a partition is full. You can quickly check with:

```bash
df -h # Or `df -m` on machines that don't support `-h`.
```

Look at the `%Used` column (`Use%` on Linux), not `%Iused`, which is for
inodes. If a partition is full, it probably needs a cleanup.
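
On machines with many mounts, it can be quicker to filter for the nearly full ones. This is a minimal sketch that assumes the POSIX `df -P` output format (capacity in column 5, mount point in column 6). The function name `full_partitions` and the 90% default are illustrative, and mount points containing spaces are not handled.

```bash
# Print mount points whose usage is at or above a threshold (default 90%).
# Reads `df -P` output on stdin so it can be tried against canned input.
full_partitions() {
  threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # "95%" -> "95"
    if (use + 0 >= t) print $6 " " use "%"
  }'
}

# Usage on a real machine:
#   df -P | full_partitions 90
```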
### Processes left behind

```bash
ps -ef | grep iojs | grep -v java | grep -v grep
```

If a lot of processes are running on the machine but Jenkins shows nothing
running there, that's a warning sign. Paste the list of processes in an issue,
then clean them up with:

```bash
# Kill all iojs processes except the Jenkins process and grep.
ps -ef | grep iojs | grep -v java | grep -v grep | awk '{print $2}' | xargs kill
```
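
The `grep iojs` filter above matches any line that mentions `iojs`, which can include your own `su iojs` shell. As a hedged alternative, the sketch below matches on the owner column only and skips the calling shell and its direct children. It assumes the usual `ps -ef` column order (`UID PID PPID ...`) and treats any command line containing `java` as the Jenkins agent. The function name `leftover_pids` is hypothetical.

```bash
# Print PIDs of processes owned by a user, reading `ps -ef` output on stdin.
# Skips anything whose command line mentions "java" (assumed to be the
# Jenkins agent), plus the given shell PID and its direct children (the
# `ps` and `awk` in the pipeline), so the caller does not kill itself.
leftover_pids() {
  user="${1:-iojs}"
  self="${2:-$$}"
  awk -v u="$user" -v self="$self" '
    NR > 1 && $1 == u && $0 !~ /java/ && $2 != self && $3 != self { print $2 }
  '
}

# Review the list first, then kill:
#   ps -ef | leftover_pids iojs
#   ps -ef | leftover_pids iojs | xargs kill
```

Running the listing step on its own first gives you the output to paste into the issue before anything is killed.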
### Git or file permission issues

Sometimes we get problems in job workspaces, either because someone left a file
there as `root` that `iojs` can't remove, or because some git cleanup (like
`git gc`) needs to happen.

Jenkins job workspaces on machines can be wiped (as long as a job isn't
running), so feel free to nuke a job's workspace, for example:

```bash
rm -rf /home/iojs/build/workspace/node-test-commit/
```

The next job will take longer as it re-clones the workspace.
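
To confirm whether stray ownership is the cause, you can list files in a workspace that the Jenkins user does not own. This is a small sketch using only POSIX `find`; the function name `foreign_files` is hypothetical. Anything it lists would have to be removed or re-owned by a user with sufficient privileges.

```bash
# List everything under a directory that is NOT owned by the given user.
# `find ... ! -user` is POSIX, so this should also work on non-Linux hosts.
foreign_files() {
  find "$1" ! -user "$2" -print
}

# Example, using the workspace path from above:
#   foreign_files /home/iojs/build/workspace iojs
```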
### Turn it off and on again

Sometimes something weird happens, and it's easier to just reboot the machine.
On Unix, run one of:

```bash
shutdown -r now
# or:
reboot
```

When the machine comes back, the Jenkins agent should reconnect automatically
(check on ci.nodejs.org). If it doesn't, see the next step.
### Restart the Jenkins agent

Most machines have a service to restart the Jenkins agent. Try one of:

```bash
# Systemd init (some newer Linux distros):
systemctl start jenkins
# System V init (older Linux distros):
/etc/init.d/jenkins start
# AIX:
/etc/rc.d/rc2.d/S20jenkins start
# Things we don't have a service for yet:
~iojs/start.sh
```

Service files should be stored [here][Jenkins Worker Template]. If none of these
work, look for the relevant file there.
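
If you are unsure which of the commands above applies, a helper can suggest one. This is a rough sketch: the function name `agent_start_cmd` and its optional root-prefix argument are hypothetical, and finding `systemctl` on the path does not guarantee that a `jenkins` unit is installed. It only prints a command and never runs it.

```bash
# Print (do not run) the first start command from the list above that looks
# usable on this machine. The optional argument is a filesystem root prefix,
# a hypothetical hook that exists only to make the function easy to try out.
agent_start_cmd() {
  root="${1:-}"
  if command -v systemctl >/dev/null 2>&1; then
    echo "systemctl start jenkins"
  elif [ -x "$root/etc/init.d/jenkins" ]; then
    echo "$root/etc/init.d/jenkins start"
  elif [ -x "$root/etc/rc.d/rc2.d/S20jenkins" ]; then
    echo "$root/etc/rc.d/rc2.d/S20jenkins start"
  else
    echo "~iojs/start.sh"
  fi
}

# Usage:
#   agent_start_cmd
```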
## Problems with non-test machines

Ask someone in [infra][Infra Admins] or [release][Release Admins] to take a look.

## Machines needing reimaging or reconfiguring

You need to run the Ansible scripts on the machines again. See the
[Ansible Readme][].

[Infra Admins]: https://github.com/nodejs/build#infra-admins
[Jenkins Worker Template]: https://github.com/nodejs/build/tree/master/ansible/roles/jenkins-worker/templates
[Release Admins]: https://github.com/nodejs/build#release-admins
[Ansible Readme]: https://github.com/nodejs/build/blob/master/ansible/README.md
