| 1 | +# Fixing machines |
| 2 | + |
| 3 | +A basic guide to troubleshooting machine issues. |
| 4 | + |
| 5 | +## Jenkins job failures |
| 6 | + |
| 7 | +Usually the first sign of trouble is a failed Jenkins job. The first thing to |
| 8 | +check is whether this is actually a Jenkins issue rather than a machine issue. |
| 9 | + |
| 10 | +1. Rebuild the job. |
| 11 | +2. Clean the workspace and rebuild the job. |
| 12 | +3. Check vital signs on the machine page |
| 13 | + (https://ci.nodejs.org/computer/<machine>/) |
| 14 | +4. Take the machine offline and rebuild. Do other machines have the same problem? |
| 15 | + |
| 16 | +## On the machine itself |
| 17 | + |
| 18 | +Try to `su` to the Jenkins user if possible. This avoids accidentally doing |
| 19 | +things as root (which can destroy the machine or break jobs). |
| 20 | + |
| 21 | +```bash |
| 22 | +su iojs # Check user with `ps -ef | grep java`. |
| 23 | +``` |
| 24 | + |
| 25 | +It's also good practice to open or comment on a GitHub issue (or post on IRC) |
| 26 | +to let others know what you did. |
| 27 | + |
| 28 | +### Out of space |
| 29 | + |
| 30 | +The most common problem is that a partition is full. You can quickly check with: |
| 31 | + |
| 32 | +```bash |
| 33 | +df -h # Or `df -m` on machines that don't support `-h`. |
| 34 | +``` |
| 35 | + |
| 36 | +Look at the `%Used` column (`Use%` on Linux), not `%Iused`, which is for |
| 37 | +inodes. If it's full, it probably needs a cleanup. |
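
To check every partition at once, a minimal sketch along these lines may help. It assumes `df -P` (the POSIX output format, where the fifth column is the capacity percentage and the sixth is the mount point) is available on the machine. The name `full_partitions` and the 90% default are illustrative, not part of any existing tooling.

```bash
# Print "<mount point> <capacity>" for filesystems at or above a threshold.
# Reads `df -P` style output on stdin so it can be tried on saved output too.
# Hypothetical helper; the default threshold of 90 is an assumption.
full_partitions() {
  threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                     # skip the header line
      use = $5
      sub(/%/, "", use)          # "95%" -> "95"
      if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

For example, `df -P | full_partitions 90` would list each mount point at 90% or more. Mount points containing spaces are not handled by this sketch.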
| 38 | + |
| 39 | +### Processes left behind |
| 40 | + |
| 41 | +```bash |
| 42 | +ps -ef | grep iojs | grep -v java | grep -v grep |
| 43 | +``` |
| 44 | + |
| 45 | +If many processes are running on the machine but Jenkins shows nothing |
| 46 | +running there, that's a warning sign. Paste the list of processes in an issue, |
| 47 | +and clean them up with: |
| 48 | + |
| 49 | +```bash |
| 50 | +# Show all iojs processes except the Jenkins process and grep, and kill them. |
| 51 | +ps -ef | grep iojs | grep -v java | grep -v grep | awk '{print $2}' | xargs kill |
| 52 | +``` |
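
Since the list should be pasted in an issue before anything is killed, it can help to split listing from killing. The following is a rough sketch using the same filters as above; `leftover_processes` and `leftover_pids` are hypothetical helper names, and both read `ps -ef` style output on stdin.

```bash
# Keep lines mentioning iojs, dropping the Jenkins agent (java) and grep itself.
leftover_processes() {
  grep iojs | grep -v java | grep -v grep
}

# Extract the PID column (second field of `ps -ef` output) from those lines.
leftover_pids() {
  leftover_processes | awk '{print $2}'
}
```

One possible flow: run `ps -ef | leftover_processes` and copy the output into the issue, then run `ps -ef | leftover_pids | xargs kill` once the list looks right.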
| 53 | + |
| 54 | +### Git or file permission issues |
| 55 | + |
| 56 | +Sometimes we get problems in job workspaces, either because someone left a file |
| 57 | +there as `root` that `iojs` can't remove, or because some git cleanup (like `git |
| 58 | +gc`) needs to happen. |
| 59 | + |
| 60 | +Jenkins job workspaces on machines can be wiped (as long as a job isn't running), so |
| 61 | +feel free to nuke a job's workspace, for example: |
| 62 | + |
| 63 | +```bash |
| 64 | +rm -rf /home/iojs/build/workspace/node-test-commit/ |
| 65 | +``` |
| 66 | + |
| 67 | +The next job will take longer as it re-clones the workspace. |
| 68 | + |
| 69 | + |
| 70 | +### Turn it off and on again |
| 71 | + |
| 72 | +Sometimes something weird happens, and it's easier to just reboot the machine. |
| 73 | +On Unix, run one of: |
| 74 | + |
| 75 | +```bash |
| 76 | +shutdown -r now |
| 77 | +# or: |
| 78 | +reboot |
| 79 | +``` |
| 80 | + |
| 81 | +When the machine comes back, the Jenkins agent should reconnect automatically |
| 82 | +(check on ci.nodejs.org). If it doesn't, see the next step. |
| 83 | + |
| 84 | +### Restart the Jenkins agent |
| 85 | + |
| 86 | +Most machines have a service to restart the Jenkins agent. Try one of: |
| 87 | + |
| 88 | +```bash |
| 89 | +# Systemd init (some newer Linux distros): |
| 90 | +systemctl start jenkins |
| 91 | +# System V init (older Linux distros): |
| 92 | +/etc/init.d/jenkins start |
| 93 | +# AIX: |
| 94 | +/etc/rc.d/rc2.d/S20jenkins start |
| 95 | +# Things we don't have a service for yet: |
| 96 | +~iojs/start.sh |
| 97 | +``` |
| 98 | + |
| 99 | +Service files should be stored [here][Jenkins Worker Template]. If none of |
| 100 | +these work, look for the relevant file there. |
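
If it's unclear which of these applies to a machine, a sketch like the following could suggest one. It only prints a command and does not run it. `jenkins_start_command` is a hypothetical helper name, the paths are the ones listed above, and finding `systemctl` on the machine does not guarantee that a `jenkins` unit is actually installed.

```bash
# Print (do not run) the start command that appears to match this machine.
# Hypothetical helper; checks are tried in the same order as the list above.
jenkins_start_command() {
  if command -v systemctl >/dev/null 2>&1; then
    echo "systemctl start jenkins"            # systemd
  elif [ -x /etc/init.d/jenkins ]; then
    echo "/etc/init.d/jenkins start"          # System V init
  elif [ -x /etc/rc.d/rc2.d/S20jenkins ]; then
    echo "/etc/rc.d/rc2.d/S20jenkins start"   # AIX
  else
    echo "~iojs/start.sh"                     # no service available
  fi
}
```

Calling `jenkins_start_command` prints the suggestion so it can be reviewed before running it by hand.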
| 101 | + |
| 102 | +## Problems with non-test machines |
| 103 | + |
| 104 | +Ask someone in [infra][Infra Admins] or [release][Release Admins] to take a look. |
| 105 | + |
| 106 | +## Machines needing reimaging or reconfiguring |
| 107 | + |
| 108 | +You need to run the Ansible scripts on the machines again. See the [Ansible |
| 109 | +Readme][]. |
| 110 | + |
| 111 | +[Infra Admins]: https://github.com/nodejs/build#infra-admins |
| 112 | +[Jenkins Worker Template]: https://github.com/nodejs/build/tree/master/ansible/roles/jenkins-worker/templates |
| 113 | +[Release Admins]: https://github.com/nodejs/build#release-admins |
| 114 | +[Ansible Readme]: https://github.com/nodejs/build/blob/master/ansible/README.md |