|
| 1 | +# Domain Module Postmortem |
| 2 | + |
| 3 | +## Usability Issues |
| 4 | + |
| 5 | +### Implicit Behavior |
| 6 | + |
| 7 | +It's possible for a developer to create a new domain and then simply run |
| 8 | +`domain.enter()`. Which then acts as a catch-all for any exception in the |
| 9 | +future that couldn't be observed by the thrower. Allowing a module author to |
| 10 | +intercept the exceptions of unrelated code in a different module. Preventing |
| 11 | +the originator of the code from knowing about its own exceptions. |
| 12 | + |
| 13 | +Here's an example of how one indirectly linked modules can affect another: |
| 14 | + |
| 15 | +```js |
| 16 | +// module a.js |
| 17 | +const b = require('./b'); |
| 18 | +const c = require('./c'); |
| 19 | + |
| 20 | + |
| 21 | +// module b.js |
| 22 | +const d = require('domain').create(); |
| 23 | +d.on('error', () => { /* silence everything */ }); |
| 24 | +d.enter(); |
| 25 | + |
| 26 | + |
| 27 | +// module c.js |
| 28 | +const dep = require('some-dep'); |
| 29 | +dep.method(); // Uh-oh! This method doesn't actually exist. |
| 30 | +``` |
| 31 | + |
| 32 | +Since module `b` enters the domain but never exits any uncaught exception will |
| 33 | +be swallowed. Leaving module `c` in the dark as to why it didn't run the entire |
| 34 | +script. Leaving a potentially partially populated `module.exports`. Doing this |
| 35 | +is not the same as listening for `'uncaughtException'`. As the latter is |
| 36 | +explicitly meant to globally catch errors. The other issue is that domains are |
| 37 | +processed prior to any `'uncaughtException'` handlers, and prevent them from |
| 38 | +running. |
| 39 | + |
| 40 | +Another issue is that domains route errors automatically if no `'error'` |
| 41 | +handler was set on the event emitter. There is no opt-in mechanism for this, |
| 42 | +and automatically propagates across the entire asynchronous chain. This may |
| 43 | +seem useful at first, but once asynchronous calls are two or more modules deep |
| 44 | +and one of them doesn't include an error handler the creator of the domain will |
| 45 | +suddenly be catching unexpected exceptions, and the thrower's exception will go |
| 46 | +unnoticed by the author. |
| 47 | + |
| 48 | +The following is a simple example of how a missing `'error'` handler allows |
| 49 | +the active domain to hijack the error: |
| 50 | + |
| 51 | +```js |
| 52 | +const domain = require('domain'); |
| 53 | +const net = require('net'); |
| 54 | +const d = domain.create(); |
| 55 | +d.on('error', (err) => console.error(err.message)); |
| 56 | + |
| 57 | +d.run(() => net.createServer((c) => { |
| 58 | + c.end(); |
| 59 | + c.write('bye'); |
| 60 | +}).listen(8000)); |
| 61 | +``` |
| 62 | + |
| 63 | +Even manually removing the connection via `d.remove(c)` does not prevent the |
| 64 | +connection's error from being automatically intercepted. |
| 65 | + |
| 66 | +Failures that plagues both error routing and exception handling are the |
| 67 | +inconsistencies in how errors are bubbled. The following is an example of how |
| 68 | +nested domains will and won't bubble the exception based on when they happen: |
| 69 | + |
| 70 | +```js |
| 71 | +const domain = require('domain'); |
| 72 | +const net = require('net'); |
| 73 | +const d = domain.create(); |
| 74 | +d.on('error', () => console.error('d intercepted an error')); |
| 75 | + |
| 76 | +d.run(() => { |
| 77 | + const server = net.createServer((c) => { |
| 78 | + const e = domain.create(); // No 'error' handler being set. |
| 79 | + e.run(() => { |
| 80 | + // This will not be caught by d's error handler. |
| 81 | + setImmediate(() => { |
| 82 | + throw new Error('thrown from setImmediate'); |
| 83 | + }); |
| 84 | + // Though this one will bubble to d's error handler. |
| 85 | + throw new Error('immediately thrown'); |
| 86 | + }); |
| 87 | + }).listen(8080); |
| 88 | +}); |
| 89 | +``` |
| 90 | + |
| 91 | +It may be expected that nested domains always remain nested, and will always |
| 92 | +propagate the exception up the domain stack. Or that exceptions will never |
| 93 | +automatically bubble. Unfortunately both these situations occur, leading to |
| 94 | +potentially confusing behavior that may even be prone to difficult to debug |
| 95 | +timing conflicts. |
| 96 | + |
| 97 | + |
| 98 | +### API Gaps |
| 99 | + |
| 100 | +While APIs based on using `EventEmitter` can use `bind()` and errback style |
| 101 | +callbacks can use `intercept()`, alternative APIs that implicitly bind to the |
| 102 | +active domain must be executed inside of `run()`. Meaning if module authors |
| 103 | +wanted to support domains using a mechanism alternative to those mentioned they |
| 104 | +must manually implement domain support themselves. Instead of being able to |
| 105 | +leverage the implicit mechanisms already in place. |
| 106 | + |
| 107 | + |
| 108 | +### Error Propagation |
| 109 | + |
| 110 | +Propagating errors across nested domains is not straight forward, if even |
| 111 | +possible. Existing documentation shows a simple example of how to `close()` an |
| 112 | +`http` server if there is an error in the request handler. What it does not |
| 113 | +explain is how to close the server if the request handler creates another |
| 114 | +domain instance for another async request. Using the following as a simple |
| 115 | +example of the failing of error propagation: |
| 116 | + |
| 117 | +```js |
| 118 | +const d1 = domain.create(); |
| 119 | +d1.foo = true; // custom member to make more visible in console |
| 120 | +d1.on('error', (er) => { /* handle error */ }); |
| 121 | + |
| 122 | +d1.run(() => setTimeout(() => { |
| 123 | + const d2 = domain.create(); |
| 124 | + d2.bar = 43; |
| 125 | + d2.on('error', (er) => console.error(er.message, domain._stack)); |
| 126 | + d2.run(() => { |
| 127 | + setTimeout(() => { |
| 128 | + setTimeout(() => { |
| 129 | + throw new Error('outer'); |
| 130 | + }); |
| 131 | + throw new Error('inner') |
| 132 | + }); |
| 133 | + }); |
| 134 | +})); |
| 135 | +``` |
| 136 | + |
| 137 | +Even in the case that the domain instances are being used for local storage so |
| 138 | +access to resources are made available there is still no way to allow the error |
| 139 | +to continue propagating from `d2` back to `d1`. Quick inspection may tell us |
| 140 | +that simply throwing from `d2`'s domain `'error'` handler would allow `d1` to |
| 141 | +then catch the exception and execute its own error handler. Though that is not |
| 142 | +the case. Upon inspection of `domain._stack` you'll see that the stack only |
| 143 | +contains `d2`. |
| 144 | + |
| 145 | +This may be considered a failing of the API, but even if it did operate in this |
| 146 | +way there is still the issue of transmitting the fact that a branch in the |
| 147 | +asynchronous execution has failed, and that all further operations in that |
| 148 | +branch must cease. In the example of the http request handler, if we fire off |
| 149 | +several asynchronous requests and each one then `write()`'s data back to the |
| 150 | +client many more errors will arise from attempting to `write()` to a closed |
| 151 | +handle. More on this in _Resource Cleanup on Exception_. |
| 152 | + |
| 153 | + |
| 154 | +### Resource Cleanup on Exception |
| 155 | + |
| 156 | +The script [`domain-resource-cleanup.js`](domain-resource-cleanup.js) |
| 157 | +contains a more complex example of properly cleaning up in a small resource |
| 158 | +dependency tree in the case that an exception occurs in a given connection or |
| 159 | +any of its dependencies. Breaking down the script into its basic operations: |
| 160 | + |
| 161 | +- When a new connection happens, concurrently: |
| 162 | + - Open a file on the file system |
| 163 | + - Open Pipe to unique socket |
| 164 | +- Read a chunk of the file asynchronously |
| 165 | +- Write chunk to both the TCP connection and any listening sockets |
| 166 | +- If any of these resources error, notify all other attached resources that |
| 167 | + they need to clean up and shutdown |
| 168 | + |
| 169 | +As we can see from this example a lot more must be done to properly clean up |
| 170 | +resources when something fails than what can be done strictly through the |
| 171 | +domain API. All that domains offer is an exception aggregation mechanism. Even |
| 172 | +the potentially useful ability to propagate data with the domain is easily |
| 173 | +countered, in this example, by passing the needed resources as a function |
| 174 | +argument. |
| 175 | + |
| 176 | +One problem domains perpetuated was the supposed simplicity of being able to |
| 177 | +continue execution, contrary to what the documentation stated, of the |
| 178 | +application despite an unexpected exception. This example demonstrates the |
| 179 | +fallacy behind that idea. |
| 180 | + |
| 181 | +Attempting proper resource cleanup on unexpected exception becomes more complex |
| 182 | +as the application itself grows in complexity. This example only has 3 basic |
| 183 | +resources in play, and all of them with a clear dependency path. If an |
| 184 | +application uses something like shared resources or resource reuse the ability |
| 185 | +to cleanup, and properly test that cleanup has been done, grows greatly. |
| 186 | + |
| 187 | +In the end, in terms of handling errors, domains aren't much more than a |
| 188 | +glorified `'uncaughtException'` handler. Except with more implicit and |
| 189 | +unobservable behavior by third-parties. |
| 190 | + |
| 191 | + |
| 192 | +### Resource Propagation |
| 193 | + |
| 194 | +Another use case for domains was to use it to propagate data along asynchronous |
| 195 | +data paths. One problematic point is the ambiguity of when to expect the |
| 196 | +correct domain when there are multiple in the stack (which must be assumed if |
| 197 | +the async stack works with other modules). Also the conflict between being |
| 198 | +able to depend on a domain for error handling while also having it available to |
| 199 | +retrieve the necessary data. |
| 200 | + |
| 201 | +The following is a involved example demonstrating the failing using domains to |
| 202 | +propagate data along asynchronous stacks: |
| 203 | + |
| 204 | +```js |
| 205 | +const domain = require('domain'); |
| 206 | +const net = require('net'); |
| 207 | + |
| 208 | +const server = net.createServer((c) => { |
| 209 | + // Use a domain to propagate data across events within the |
| 210 | + // connection so that we don't have to pass arguments |
| 211 | + // everywhere. |
| 212 | + const d = domain.create(); |
| 213 | + d.data = { connection: c }; |
| 214 | + d.add(c); |
| 215 | + // Mock class that does some useless async data transformation |
| 216 | + // for demonstration purposes. |
| 217 | + const ds = new DataStream(dataTransformed); |
| 218 | + c.on('data', (chunk) => ds.data(chunk)); |
| 219 | +}).listen(8080, () => console.log(`listening on 8080`)); |
| 220 | + |
| 221 | +function dataTransformed(chunk) { |
| 222 | + // FAIL! Because the DataStream instance also created a |
| 223 | + // domain we have now lost the active domain we had |
| 224 | + // hoped to use. |
| 225 | + domain.active.data.connection.write(chunk); |
| 226 | +} |
| 227 | + |
| 228 | +function DataStream(cb) { |
| 229 | + this.cb = cb; |
| 230 | + // DataStream wants to use domains for data propagation too! |
| 231 | + // Unfortunately this will conflict with any domain that |
| 232 | + // already exists. |
| 233 | + this.domain = domain.create(); |
| 234 | + this.domain.data = { inst: this }; |
| 235 | +} |
| 236 | + |
| 237 | +DataStream.prototype.data = function data(chunk) { |
| 238 | + // This code is self contained, but pretend it's a complex |
| 239 | + // operation that crosses at least one other module. So |
| 240 | + // passing along "this", etc., is not easy. |
| 241 | + this.domain.run(function() { |
| 242 | + // Simulate an async operation that does the data transform. |
| 243 | + setImmediate(() => { |
| 244 | + for (var i = 0; i < chunk.length; i++) |
| 245 | + chunk[i] = ((chunk[i] + Math.random() * 100) % 96) + 33; |
| 246 | + // Grab the instance from the active domain and use that |
| 247 | + // to call the user's callback. |
| 248 | + const self = domain.active.data.inst; |
| 249 | + self.cb.call(self, chunk); |
| 250 | + }); |
| 251 | + }); |
| 252 | +}; |
| 253 | +``` |
| 254 | + |
| 255 | +The above shows that it is difficult to have more than one asynchronous API |
| 256 | +attempt to use domains to propagate data. This example could possibly be fixed |
| 257 | +by assigning `parent: domain.active` in the `DataStream` constructor. Then |
| 258 | +restoring it via `domain.active = domain.active.data.parent` just before the |
| 259 | +user's callback is called. Also the instantiation of `DataStream` in the |
| 260 | +`'connection'` callback must be run inside `d.run()`, instead of simply using |
| 261 | +`d.add(c)`, otherwise there will be no active domain. |
| 262 | + |
| 263 | +In short, for this to have a prayer of a chance usage would need to strictly |
| 264 | +adhere to a set of guidelines that would be difficult to enforce or test. |
| 265 | + |
| 266 | + |
| 267 | +## Performance Issues |
| 268 | + |
| 269 | +A significant deterrent from using domains is the overhead. Using node's |
| 270 | +built-in http benchmark, `http_simple.js`, without domains it can handle over |
| 271 | +22,000 requests/second. Whereas if it's run with `NODE_USE_DOMAINS=1` that |
| 272 | +number drops down to under 17,000 requests/second. In this case there is only |
| 273 | +a single global domain. If we edit the benchmark so the http request callback |
| 274 | +creates a new domain instance performance drops further to 15,000 |
| 275 | +requests/second. |
| 276 | + |
| 277 | +While this probably wouldn't affect a server only serving a few hundred or even |
| 278 | +a thousand requests per second, the amount of overhead is directly proportional |
| 279 | +to the number of asynchronous requests made. So if a single connection needs to |
| 280 | +connect to several other services all of those will contribute to the overall |
| 281 | +latency of delivering the final product to the client. |
| 282 | + |
| 283 | +Using `AsyncWrap` and tracking the number of times |
| 284 | +`init`/`pre`/`post`/`destroy` are called in the mentioned benchmark we find |
| 285 | +that the sum of all events called is over 170,000 times per second. This means |
| 286 | +even adding 1 microsecond overhead per call for any type of setup or tear down |
| 287 | +will result in a 17% performance loss. Granted, this is for the optimized |
| 288 | +scenario of the benchmark, but I believe this demonstrates the necessity for a |
| 289 | +mechanism such as domain to be as cheap to run as possible. |
| 290 | + |
| 291 | + |
| 292 | +## Looking Ahead |
| 293 | + |
| 294 | +The domain module has been soft deprecated since Dec 2014, but has not yet been |
| 295 | +removed because node offers no alternative functionality at the moment. As of |
| 296 | +this writing there is ongoing work building out the `AsyncWrap` API and a |
| 297 | +proposal for Zones being prepared for the TC39. At such time there is suitable |
| 298 | +functionality to replace domains it will undergo the full deprecation cycle and |
| 299 | +eventually be removed from core. |
0 commit comments