|
| 1 | + |
| 2 | +## Node.js application crash diagnostics: Best Practices series #1 |
| 3 | + |
| 4 | +This is the first in a series of best practices and useful tips for |
| 5 | +running Node.js in large-scale production systems. |
| 6 | + |
| 7 | +## Introduction |
| 8 | + |
| 9 | +Typical production systems do not enjoy the benefits of development |
| 10 | +and staging systems in many respects: |
| 11 | + |
| 12 | + - they are isolated from the public internet |
| 13 | + - they are not loaded with development and debug tools |
| 14 | + - they are configured with the most robust and secure |
| 15 | + configurations possible at the OS level |
| 16 | + - in certain deployment scenarios (such as cloud) they |
| 17 | + operate in a headless mode [no ssh] |
| 18 | + - in certain deployment scenarios (such as cloud) they |
| 19 | + operate in a stateless mode [no persistent disk] |
| 20 | + |
| 21 | +The net effect of these constraints is that your production system |
| 22 | +needs to be manually prepared in advance so that crash diagnostic |
| 23 | +data is generated on the very first failure, without losing vital data. |
| 24 | +The rest of this document illustrates these preparation steps. |
| 25 | + |
| 26 | +## Available disk space |
| 27 | +Ensure that there is enough disk space available for the core file |
| 28 | +to be written: |
| 29 | + |
| 30 | + - Maximum of 4GB for a 32-bit process. |
| 31 | + - Much larger for a 64-bit process (the common case). To know the precise |
| 32 | + requirement, measure the peak-load memory usage of your application |
| 33 | + and add 10% to accommodate core metadata. If you are using |
| 34 | + common monitoring tools, one of the graphs should reveal the peak |
| 35 | + memory. If not, you can measure it directly on the system. |
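The sizing rule above can be sketched as a small shell helper. This is only an illustration: the function name `required_core_kb` is hypothetical, and the 10% margin is the rule of thumb from this section, not a measured value.

```shell
# Estimate the disk space (in KB) needed for a core file:
# peak memory usage plus 10% for core metadata.
# required_core_kb is an illustrative name, not a standard tool.
required_core_kb() {
  peak_kb=$1
  echo $(( peak_kb + peak_kb / 10 ))
}

# Example: a process whose peak memory usage is 54500 KB.
required_core_kb 54500   # prints 59950
```

The arithmetic is integer-only, so the margin is rounded down; for real capacity planning you would likely round up and leave additional headroom.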
| 36 | + |
| 37 | +In Linux variants, you can use `top -p <pid>` to see the instantaneous |
| 38 | +memory usage of the process: |
| 39 | + |
| 40 | +``` |
| 41 | +$ top -p 106916 |
| 42 | + |
| 43 | + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND |
| 44 | +106916 user 20 0 600404 54500 15572 R 109.7 0.0 81098:54 node |
| 45 | +``` |
| 46 | + |
| 47 | +In Darwin, the flag is `-pid`. |
| 48 | +In AIX, the command is `topas`. |
| 49 | +In FreeBSD, the command is `top`. In both AIX and FreeBSD, there is no |
| 50 | +flag to show per-process details. In Windows, you can use the Task |
| 51 | +Manager window to view the process attributes visually. |
| 52 | + |
| 53 | +Insufficient file system space will result in truncated core files, |
| 54 | +and can severely hamper the ability to diagnose the problem. |
| 55 | + |
| 56 | +Figure out how much free space is available in the file system: |
| 57 | +`df -k` works consistently across UNIX platforms. |
| 58 | +In Windows, Windows Explorer, when pointed at a disk partition, |
| 59 | +shows the available space in that partition. |
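As a rough sketch, the free-space check can be scripted so that it runs before the application starts. The helper names `free_kb` and `has_room_for_core` are hypothetical, and the 59950 KB figure is just a placeholder for your own estimate.

```shell
# Free space (in KB) on the file system that holds the given directory.
# -P requests the POSIX output format, which keeps each file system on
# one line so that field 4 is the "Available" column.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when the directory has at least the requested number of KB free.
# has_room_for_core is an illustrative name.
has_room_for_core() {
  [ "$(free_kb "$1")" -ge "$2" ]
}

# Check the current directory against a placeholder requirement.
if has_room_for_core . 59950; then
  echo "enough space for the core file"
else
  echo "not enough space: the core file may be truncated"
fi
```

The directory to check should be the one where core files are actually written, which depends on the settings described in the next section.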
| 60 | + |
| 61 | +## Core file location and name |
| 62 | + |
| 63 | +By default, a core file is generated on a crash event and, on most |
| 64 | +UNIX variants, is written to the current working directory of the |
| 65 | +node process (typically the location from where it was started). |
| 66 | +In Darwin, it appears in the `/cores` directory. |
| 67 | + |
| 68 | +By default, core files from node processes on Linux are named |
| 69 | +`core` or `core.<pid>`, where `<pid>` is the node process ID. |
| 70 | +By default, core files from node processes on AIX and Darwin are |
| 71 | +named `core`. |
| 72 | +By default, core files from node processes on FreeBSD are named |
| 73 | +`%N.core`, where `%N` is the name of the crashed process. |
| 74 | + |
| 75 | +However, the superuser (root) can change these defaults. |
| 76 | + |
| 77 | +In Linux, `sysctl kernel.core_pattern` shows the current core file pattern. |
| 78 | + |
| 79 | +Modify pattern using `sysctl -w kernel.core_pattern=pattern` as root. |
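Before changing the pattern it can help to know what kind of value is currently set. On Linux, a pattern that starts with `|` is piped to a helper program instead of being written as a file, and a pattern without a leading `/` is resolved relative to the crashing process's working directory. The sketch below classifies a pattern accordingly; `classify_core_pattern` is a hypothetical helper name.

```shell
# Classify a Linux core_pattern value.
# classify_core_pattern is an illustrative name, not a standard tool.
classify_core_pattern() {
  case "$1" in
    '|'*) echo "piped" ;;      # handed to a helper program
    /*)   echo "absolute" ;;   # written to a fixed location
    *)    echo "relative" ;;   # written relative to the working directory
  esac
}

# Read the live value where it is available (Linux only, no root needed).
pattern_file=/proc/sys/kernel/core_pattern
if [ -r "$pattern_file" ]; then
  classify_core_pattern "$(cat "$pattern_file")"
fi
```

If the result is `piped`, core files will not appear in the working directory, so the disk-space and location checks in this document apply to wherever the helper program stores them.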
| 80 | + |
| 81 | +In AIX, `lscore` shows the current core file pattern. |
| 82 | + |
| 83 | +Enable full core dump generation using `chdev -l sys0 -a fullcore=true`. |
| 84 | +Modify the current pattern using `chcore -p on -n on -l /path/to/coredumps`. |
| 85 | + |
| 86 | +In Darwin and FreeBSD, `sysctl kern.corefile` shows the current core file pattern. |
| 87 | + |
| 88 | +Modify the current pattern using `sysctl -w kern.corefile=newpattern` as root. |
| 89 | + |
| 90 | +To obtain full core files, set the following ulimit options, across UNIX variants: |
| 91 | + |
| 92 | + - `ulimit -c unlimited`: turn on core file generation with unlimited size |
| 93 | + - `ulimit -d unlimited`: set the user data limit to unlimited |
| 94 | + - `ulimit -f unlimited`: set the file size limit to unlimited |
| 95 | + |
| 96 | +The current ulimit settings can be displayed using: |
| 97 | + |
| 98 | +`ulimit -a` |
| 99 | + |
| 100 | +However, these are the `soft` limits and are enforced per user, per |
| 101 | +shell environment. Note that these values are themselves |
| 102 | +constrained by the system-wide `hard` limits set by the |
| 103 | +system administrator. System administrators (with superuser privileges) |
| 104 | +may display, set or change the hard limits by adding the `-H` flag to |
| 105 | +the standard set of ulimit commands. |
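Because a soft limit can be raised up to the hard limit without superuser privileges, one option is to do this at the top of the script that launches the application. The following is a minimal sketch under that assumption; `raise_core_limit` is a hypothetical helper name.

```shell
# Raise the soft core-size limit as far as the hard limit allows, then
# print the resulting soft limit. Meant to run in the launcher script
# before the node process is started, so the process inherits the limit.
# raise_core_limit is an illustrative name.
raise_core_limit() {
  hard=$(ulimit -H -c)
  ulimit -S -c "$hard"
  ulimit -S -c
}

raise_core_limit
```

If this prints `0`, the hard limit itself forbids core files, and an administrator has to raise it before any core dump can be written.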
| 106 | + |
| 107 | +## Manual dump generation |
| 108 | + |
| 109 | +When you want to collect a core dump manually, use the |
| 110 | +command for your platform: |
| 111 | + |
| 112 | + - In Linux, use `gcore [-a] [-o prefix] pid`, where `-a` dumps |
| 113 | + everything and the dump is written to `prefix.pid`. |
| 114 | + - In AIX, use `gencore [pid] [filename]`. |
| 115 | + - In FreeBSD and Darwin, use `gcore [-s] [executable] pid`. |
| 116 | + - In Windows, open the Task Manager window, right-click the |
| 117 | + node process and select the `create dump` option. |
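The per-platform commands above can be wrapped in one helper that prints the command to run for a given operating system name (as reported by `uname -s`). This is a sketch only: `dump_command` is a hypothetical name, the process ID and output path are placeholders, and for FreeBSD and Darwin it prints the simplest form because the output options differ between those platforms.

```shell
# Print the manual core dump command for a platform.
# Arguments: OS name (as printed by `uname -s`), process ID, output name.
# dump_command is an illustrative name; it only prints, it does not run.
dump_command() {
  os=$1
  pid=$2
  out=$3
  case "$os" in
    Linux)          echo "gcore -o $out $pid" ;;   # writes $out.$pid
    AIX)            echo "gencore $pid $out" ;;
    FreeBSD|Darwin) echo "gcore $pid" ;;           # default output location
    *)              echo "unsupported platform: $os" >&2; return 1 ;;
  esac
}

# Placeholder process ID and output name.
dump_command Linux 1234 /tmp/node-core
```

Printing the command instead of executing it keeps the helper safe to try out; in a real runbook you would pass `"$(uname -s)"` as the first argument and review the printed command before running it.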
| 118 | + |
| 119 | +Special note on Ubuntu systems with a Yama-hardened kernel: |
| 120 | + |
| 121 | +The Yama security policy prevents a second process from collecting a dump, practically rendering |
| 122 | +`gcore` unusable. As root, grant `gdb` (which `gcore` relies on) the ptrace capability to work around this: |
| 123 | + |
| 124 | +`setcap cap_sys_ptrace=+ep $(which gdb)` |
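To see whether Yama is actually restricting `gcore` on a given machine, you can inspect the `kernel.yama.ptrace_scope` setting. The sketch below translates its documented values (0 to 3) into a short description; `describe_ptrace_scope` is a hypothetical helper name.

```shell
# Describe a kernel.yama.ptrace_scope value.
# describe_ptrace_scope is an illustrative name, not a standard tool.
describe_ptrace_scope() {
  case "$1" in
    0) echo "classic: processes of the same user may attach" ;;
    1) echo "restricted: only a parent or a declared tracer may attach" ;;
    2) echo "admin-only: CAP_SYS_PTRACE is required to attach" ;;
    3) echo "no-attach: attaching is disabled" ;;
    *) echo "unknown ptrace_scope value: $1" ;;
  esac
}

# Read the live value where Yama is present (Linux only).
scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
  describe_ptrace_scope "$(cat "$scope_file")"
fi
```

With values 1 and 2, a `gdb` binary that holds the ptrace capability can still attach, which is why the `setcap` command above helps; with value 3 it does not.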
| 125 | + |
| 126 | + |
| 127 | +These steps make sure that when your Node.js application crashes in |
| 128 | +production, a valid, full core dump is generated at a known location and |
| 129 | +can be loaded into debuggers that understand Node.js internals to |
| 130 | +diagnose the issue. The next article in this series will focus on that part. |