# Node.js application crash diagnostics: Best Practices series #1

This is the first in a series of best practices and useful tips if you
are using Node.js in large scale production systems.

## Introduction

Typical production systems do not enjoy the benefits of development
and staging systems in many respects:

 - they are isolated from the public internet
 - they are not loaded with development and debug tools
 - they are configured with the most robust and secure
   configurations possible at the OS level
 - they operate with tight performance goals and
   constraints, which means they cannot use many tools
   that slow down the process significantly
 - in certain deployment scenarios (such as Cloud) they
   operate in a head-less mode [ no ssh ]
 - in certain deployment scenarios (such as Cloud) they
   operate in a state-less mode [ no persistent disk ]

The net effect of these constraints is that your production systems
need to be manually `prepared` in advance to enable crash diagnostic
data generation on the first failure itself, without losing vital data.
The rest of this document illustrates these preparation steps.

The key artifacts for exploring Node.js application crashes in production are:
 - core dump (a.k.a. system dump, core file)
 - diagnostic report (originally known as node report)

Reference: [Diagnostic Report](https://nodejs.org/dist/latest-v12.x/docs/api/report.html)

## Common issues

While these key artifacts are expected to be generated on abnormal
program conditions such as a crash (the diagnostic report is still
experimental, so it requires explicit command line flags to switch it
on; see the example at the end of this section), there are a number of
issues that affect the automatic and complete generation of these
artifacts. The most common such issues are:
 - Insufficient disk space for writing core dump data
 - Insufficient privilege for the core dump generator function
 - Insufficient resource limits set for the user
 - In the case of the diagnostic report, absence of the report and symptom flags
 
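For example, the report can be switched on explicitly at startup. A minimal
sketch, assuming the Node.js 12 line where the feature still needs
`--experimental-report` (later release lines drop that flag), and with
`app.js` standing in for your application entry point:

```
# Generate a diagnostic report on native crashes and uncaught exceptions,
# writing it to a known directory (the path is illustrative)
node --experimental-report \
     --report-on-fatalerror \
     --report-uncaught-exception \
     --report-directory=/var/log/node-reports \
     app.js
```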
## Recommended Best Practice

This section provides specific recommendations for
how to configure your systems in advance in order to be
ready to investigate crashes.

### Available disk space
Ensure that there is enough disk space available for the core file
to be written:

 - Maximum of 4GB for a 32 bit process.
 - Much larger for a 64 bit process (the common case). To know the precise
   requirement, measure the peak-load memory usage of your application.
   Add 10% to that to accommodate core metadata. For example, if your
   application peaks at 6 GB of memory, plan for roughly 6.6 GB per core
   file. If you are using common monitoring tools, one of the graphs
   should reveal the peak memory. If not, you can measure it directly
   in the system.

In Linux variants, you can use `top -p <pid>` to see the instantaneous
memory usage of the process:

```
$ top -p 106916

   PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
106916 user     20   0  600404  54500  15572 R 109.7  0.0  81098:54 node
```

In Darwin, the flag is `-pid`.

In AIX, the command is `topas`.

In FreeBSD, the command is `top`. In both AIX and FreeBSD, there is no
flag to show per-process details.

In Windows, you can use the Task Manager window and view the process
attributes visually.
 
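Since these tools show instantaneous usage, on Linux you can also read the
peak values the kernel has recorded for a running process from `/proc`; a
minimal sketch, reusing the pid from the example above:

```
# VmPeak = peak virtual memory, VmHWM = peak resident set size (both in kB)
grep -E 'VmPeak|VmHWM' /proc/106916/status
```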
Insufficient file system space will result in truncated core files,
and can severely hamper the ability to diagnose the problem.

Figure out how much free space is available in the file system:

`df -k` can be used uniformly across UNIX platforms.
 
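For example, to check the space available in the file system that will hold
the core files (the path is illustrative):

```
# Report free space, in 1K blocks, for the file system containing /var/crash
df -k /var/crash
```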
In Windows, Windows Explorer, when pointed to a disk partition,
provides a view of the available space in that partition.

By default, the core file is generated on a crash event and is
written to the current working directory - the location from
where the node process was started - in most of the UNIX variants.

In Darwin, it appears in the `/cores` location.

By default, core files from node processes on Linux are named
`core` or `core.<pid>`, where `<pid>` is the node process id.

By default, core files from node processes on AIX and Darwin are
named `core`.

By default, core files from node processes on FreeBSD are named
`%N.core`, where `%N` is the name of the crashed process.

However, the superuser (root) can control and change these defaults.

In Linux, `sysctl kernel.core_pattern` shows the current core file pattern.

Modify the pattern using `sysctl -w kernel.core_pattern=pattern` as root.

In AIX, `lscore` shows the current core file pattern.

A best practice is to remove old core files at regular intervals.

This makes sure that the space in the system is used efficiently,
and no application specific data is persisted inadvertently.
 
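A minimal housekeeping sketch for Linux, assuming core files are collected
under `/var/crash` (adjust the path and retention period to your
environment); it can be run from cron or a systemd timer:

```
# Delete core files older than 7 days
find /var/crash -name 'core*' -type f -mtime +7 -delete
```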
A best practice is to name the core file with the name, process ID and
the creation timestamp of the failed process.

This makes it easy to relate the binary dump to its crash-specific context.
 
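On Linux, the `core(5)` format specifiers can encode all three directly; a
sketch, with an illustrative target directory that must already exist and be
writable:

```
# %e = executable name, %p = process id, %t = time of dump (epoch seconds)
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t
```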
### Configuring to ensure core generation

In AIX, enable full core dump generation using `chdev -l sys0 -a fullcore=true`.

Modify the current pattern using `chcore -p on -n on -l /path/to/coredumps`.

In Darwin and FreeBSD, `sysctl kern.corefile` shows the current core file pattern.

Modify the current pattern using `sysctl -w kern.corefile=newpattern` as root.

To obtain full core files, set the following ulimit options across UNIX variants:

`ulimit -c unlimited` - turn on core file generation with unlimited size

`ulimit -d unlimited` - set the user data limit to unlimited

`ulimit -f unlimited` - set the file size limit to unlimited
 
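Note that `ulimit` settings made in a shell apply only to processes started
from that shell. If the node process is managed by a service manager such as
systemd, set the equivalent limits in the unit file instead; a minimal
sketch, with an illustrative unit name:

```
# /etc/systemd/system/myapp.service
[Service]
LimitCORE=infinity
LimitFSIZE=infinity
```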
The current ulimit settings can be displayed using:

`ulimit -a`

However, these are the `soft` limits and are enforced per user, per
shell environment. Please note that these values are themselves
constrained by the system-wide `hard` limits set by the
system administrator. System administrators (with superuser privileges)
may display, set or change the hard limits by adding the `-H` flag to
the standard set of ulimit commands.

For example, with:

`ulimit -c -H`

`104857600`

we cannot increase the core file size to 200 MB, so

`ulimit -c 209715200`

will fail with:

`ulimit: core size: cannot modify limit: Invalid argument`

So if your hard limit settings are constraining your application's
requirements, relax those specific settings through the administrator
account.
 
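On Linux, one common way for the administrator to raise the hard limits
persistently is `/etc/security/limits.conf`, assuming the `pam_limits`
module is in effect for the login path that starts node; a sketch:

```
# /etc/security/limits.conf  -  <domain> <type> <item> <value>
*    soft    core    unlimited
*    hard    core    unlimited
```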
## Additional information

### Manual dump generation

Under certain circumstances, where you want to collect a core
manually, follow these steps:

In Linux, use `gcore [-a] [-o filename] pid`, where `-a`
specifies to dump everything.
 
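For example, to write a full core of a running node process to `/tmp` on
Linux (the paths are illustrative, `gcore` ships with gdb, and a single
node process is assumed):

```
# Resolve the node process id and dump it; the file is written as /tmp/node-core.<pid>
gcore -a -o /tmp/node-core "$(pgrep -x node)"
```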
In AIX, use `gencore [pid] [filename]`.

In FreeBSD and Darwin, use `gcore [-s] [executable] pid`.

In Windows, you can use the Task Manager window: right-click on the
node process and select the `create dump` option.

A special note on Ubuntu systems with the `Yama` hardened kernel:

The Yama security policy inhibits a second process from collecting a dump,
practically rendering `gcore` unusable.

Work around this by granting the `ptrace` capability to gdb.
Execute the following as root:

`setcap cap_sys_ptrace=+ep $(which gdb)`
 
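Alternatively, the Yama restriction itself can be relaxed temporarily; note
that this is a system-wide setting, so the `setcap` approach is preferable
on shared systems:

```
# 0 = classic ptrace permissions; restore the previous value after collecting the dump
sysctl -w kernel.yama.ptrace_scope=0
```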
 
These steps make sure that when your Node.js application crashes in
production, a valid, full core dump is generated at a known location. That
dump can then be loaded into debuggers that understand Node.js internals to
diagnose the issue. The next article in this series will focus on that part.