
Commit 38a577a

doc: add crash pre-requisites [ best practices ]
This covers the preparation of crash diagnostics.

Refs: #254
PR-URL: #285
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Peter Marton <[email protected]>
1 parent 191ad6e commit 38a577a

1 file changed: documentation/abnormal_termination (+204 −1 lines)

# Node.js application crash diagnostics: Best Practices series #1

This is the first in a series of best practices and useful tips for
those using Node.js in large scale production systems.

## Introduction

Typical production systems do not enjoy the benefits of development
and staging systems in many aspects:

- they are isolated from the public internet
- they are not loaded with development and debug tools
- they are configured with the most robust and secure
  configurations possible at the OS level
- they operate with tight performance goals and constraints,
  which means they cannot use many tools that slow down the
  process significantly
- in certain deployment scenarios (such as Cloud) they
  operate in a head-less mode [ no ssh ]
- in certain deployment scenarios (such as Cloud) they
  operate in a state-less mode [ no persistent disk ]

The net effect of these constraints is that your production systems
need to be manually `prepared` in advance to enable crash diagnostic
data generation on the first failure itself, without losing vital data.
The rest of this document illustrates these preparation steps.

The key artifacts for exploring Node.js application crashes in production are:
- core dump (a.k.a. system dump, core file)
- diagnostic report (originally known as node report)

Reference: [Diagnostic Report](https://nodejs.org/dist/latest-v12.x/docs/api/report.html)

## Common issues

While these key artifacts are expected to be generated on abnormal
program conditions such as a crash (the diagnostic report is still
experimental, so it requires explicit command line flags to switch it
on), a number of issues can affect the automatic and complete
generation of these artifacts. The most common such issues are:
- insufficient disk space for writing core dump data
- insufficient privilege for the core dump generator function
- insufficient resource limits set on the user
- in the case of the diagnostic report, absence of the report and
  symptom flags (an example follows this list)
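
On Node.js versions where the diagnostic report is still experimental
(the v11.x/v12.x lines referenced above), it must be armed with
explicit flags; a minimal sketch (the report directory is
illustrative, and flag names should be verified against `node --help`
for your version, since the `--experimental-report` gate was dropped
once the feature became stable in later release lines):

```
node --experimental-report --report-on-fatalerror \
     --report-directory=/var/diag app.js
```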

## Recommended best practices

This section provides specific recommendations on how to configure
your systems in advance in order to be ready to investigate crashes.

### Available disk space

Ensure that there is enough disk space available for the core file
to be written:

- A maximum of 4 GB for a 32-bit process.
- Much larger for a 64-bit process (the common case). To know the
  precise requirement, measure the peak-load memory usage of your
  application, and add 10% to that to accommodate core metadata. If
  you are using common monitoring tools, one of the graphs should
  reveal the peak memory. If not, you can measure it directly on the
  system.

In Linux variants, you can use `top -p <pid>` to see the instantaneous
memory usage of the process:

```
$ top -p 106916

   PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
106916 user  20   0  600404  54500  15572 R 109.7  0.0  81098:54 node
```

In Darwin, the flag is `-pid`.

In AIX, the command is `topas`.

In FreeBSD, the command is `top`. In both AIX and FreeBSD, there is no
flag to show per-process details.

In Windows, you can use the Task Manager window and view the process
attributes visually.
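
On Linux, the kernel also records high-water marks that give the peak
directly rather than an instantaneous reading; a minimal sketch,
reusing the process id from the `top` example above (the values shown
are illustrative):

```
$ grep -E 'VmPeak|VmHWM' /proc/106916/status
VmPeak:   600404 kB
VmHWM:     54500 kB
```

`VmPeak` is the peak virtual size and `VmHWM` is the peak resident set
size; the two figures bracket the likely core file size, so applying
the 10% margin noted above to the larger one gives a conservative
estimate.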

Insufficient file system space will result in truncated core files,
and can severely hamper the ability to diagnose the problem.

Figure out how much free space is available in the file system:
`df -k` can be used uniformly across UNIX platforms.
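
For example, to check the file system that will receive the dumps (the
mount point is illustrative, and the exact column headers vary
slightly between platforms):

```
$ df -k /var/coredumps
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda2      103081248  55431920  42372672  57% /var
```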

In Windows, Windows Explorer, when pointed to a disk partition,
provides a view of the available space in that partition.

By default, in most UNIX variants, a core file is generated on a crash
event and is written to the current working directory - the location
from where the node process was started.

In Darwin, it appears in the /cores location.

By default, core files from node processes on Linux are named
`core` or `core.<pid>`, where `<pid>` is the node process id.

By default, core files from node processes on AIX and Darwin are
named `core`.

By default, core files from node processes on FreeBSD are named
`%N.core`, where `%N` is the name of the crashed process.

However, the superuser (root) can control and change these defaults.

In Linux, `sysctl kernel.core_pattern` shows the current core file pattern.

Modify the pattern using `sysctl -w kernel.core_pattern=pattern` as root.

In AIX, `lscore` shows the current core file pattern.

A best practice is to remove old core files at regular intervals.
This makes sure that the space in the system is used efficiently,
and no application-specific data is persisted inadvertently.

A best practice is to name core files with the name, process ID and
the creation timestamp of the failed process. This makes it easy to
relate the binary dump with crash-specific context.
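
On Linux, such a naming scheme can be applied through
`kernel.core_pattern`; a minimal sketch, run as root (the target
directory is illustrative and must already exist; `%e`, `%p` and `%t`
are the executable-name, process-id and epoch-timestamp specifiers
documented in `man 5 core`):

```
# sysctl -w kernel.core_pattern=/var/coredumps/core.%e.%p.%t
```

A crash of a node process with pid 106916 would then produce a file
such as `/var/coredumps/core.node.106916.1589286400`.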

### Configuring to ensure core generation

In AIX, enable full core dump generation using `chdev -l sys0 -a fullcore=true`.

Modify the current pattern using `chcore -p on -n on -l /path/to/coredumps`.

In Darwin and FreeBSD, `sysctl kern.corefile` shows the current core file pattern.

Modify the current pattern using `sysctl -w kern.corefile=newpattern` as root.

To obtain full core files, set the following ulimit options, across
UNIX variants:

- `ulimit -c unlimited` - turn on core file generation capability, with unlimited size
- `ulimit -d unlimited` - set the user data limit to unlimited
- `ulimit -f unlimited` - set the file limit to unlimited

The current ulimit settings can be displayed using `ulimit -a`.
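
A quick pre-flight check in the shell that will launch the node
process can confirm all three settings took effect (a sketch; the
`grep` pattern matches the labels printed by bash on Linux and may
differ in other shells):

```
$ ulimit -c unlimited; ulimit -d unlimited; ulimit -f unlimited
$ ulimit -a | grep -E 'core file|data seg|file size'
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
```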

However, these are `soft` limits, enforced per user, per shell
environment. Please note that these values are themselves constrained
by the system-wide `hard` limits set by the system administrator.
System administrators (with superuser privileges) may display, set or
change the hard limits by adding the `-H` flag to the standard set of
ulimit commands.

For example, if `ulimit -c -H` reports:

`104857600`

then the soft limit cannot be raised beyond that value, and

`ulimit -c 209715200`

will fail with:

`ulimit: core size: cannot modify limit: Invalid argument`

So if the hard limit settings are constraining your application's
requirements, relax those specific settings through an administrator
account.

## Additional information

### Manual dump generation

Under certain circumstances where you want to collect a core
manually, follow these steps:

In Linux, use `gcore [-a] [-o filename] pid`, where `-a` specifies to
dump everything.

In AIX, use `gencore [pid] [filename]`.

In FreeBSD and Darwin, use `gcore [-s] [executable] pid`.

In Windows, you can use the `Task Manager` window: right-click on the
node process and select the `Create dump file` option.
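
For example, to capture a complete dump of the node process from the
earlier `top` output into a predictable location (the output prefix is
illustrative; gdb's `gcore` appends the pid to it):

```
$ gcore -a -o /var/coredumps/node-manual 106916
Saved corefile /var/coredumps/node-manual.106916
```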

Special note on Ubuntu systems with a `Yama hardened kernel`:

The Yama security policy inhibits a second process from collecting a
dump, practically rendering `gcore` unusable.

Work around this by granting the `ptrace` capability to gdb.
Execute the below as root:

``setcap cap_sys_ptrace=+ep `which gdb` ``

These steps make sure that when your Node.js application crashes in
production, a valid, full core dump is generated at a known location,
which can be loaded into debuggers that understand Node.js internals
to diagnose the issue. The next article in this series will focus on
that part.
