ref(node): refactor contextlines to use readline #12221
Conversation
Nice! Is it possible to do a basic benchmark for comparison here? Maybe something like an express server whose stack traces run through huge files?
It seems that `createReadStream` error handling was only backported to Node 16 and not 14, which would explain the unit test failure on 14. I am tempted to add an `fs.stat` here to check if the file exists, but it feels like a very poor fix...
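A minimal sketch of handling the failure on the stream itself rather than with an `fs.stat` pre-check; the path and fallback behavior below are illustrative, not the PR's actual code:

```ts
import { createReadStream } from 'fs';

// Hypothetical example: attach the error handler directly to the stream,
// since an fs.stat pre-check would race against the file disappearing
// between the check and the open.
const stream = createReadStream('/path/to/source-file.js');
stream.on('error', () => {
  // Treat an unreadable file as "no context lines available" and move on,
  // instead of letting the error crash event processing.
});
```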
If anyone wants to give this another look, feel free to. I still need to run benchmarks and confirm the tradeoff before merging (sometime this week).
It just occurred to me that we should store the result of the file read. I'll run benchmarks after that change.
dev-packages/node-integration-tests/suites/public-api/captureException/catched-error/test.ts
damn this is so clean 🔥
tyty @JonasBa for helping us with this!
Seems like dev is broken - the failures are on span tests, which are unrelated. @AbhiPrasad on a side note, I think we should add some simple heuristic where we short circuit adding context frames past a certain line or column number. For example, a colno past a few hundred is very likely minified code, and anything past a few thousand lines is likely bundled.
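A sketch of what that short circuit could look like; the cutoff values below are placeholders, not numbers decided in this thread:

```ts
// Illustrative cutoffs - the thread suggests deriving real values from data.
const MAX_CONTEXT_COLNO = 1_000; // past this, the frame is likely minified code
const MAX_CONTEXT_LINENO = 10_000; // past this, the file is likely a large bundle

interface StackFrame {
  lineno?: number;
  colno?: number;
}

// Skip reading context lines for frames that point deep into what is
// almost certainly generated code.
function shouldSkipContextLines(frame: StackFrame): boolean {
  return (
    (frame.colno !== undefined && frame.colno > MAX_CONTEXT_COLNO) ||
    (frame.lineno !== undefined && frame.lineno > MAX_CONTEXT_LINENO)
  );
}
```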
I wonder if we can get avg/p75/p95 of line/col length by analyzing sentry errors, and then decide a number based on that.
@AbhiPrasad I just ran some CPU benchmarks between 1201eb2d7 and 2bf1b24ce. The large file was roughly 64KB, ~4k lines, while the medium one was only 500 lines (the others were just simple init-SDK-and-throw cases). In the medium and large file benchmarks, the error was thrown from the last few lines to force the worst-case scenario where we need to scan an entire file. As is, there is no limit on the upper bound, but I think it'd be reasonable to set one (even if it's something like 10k lines). The benchmarks were run using hyperfine with a warmup of 3 cycles and a minimum of 40 iterations.
Something to note is that in a lot of the benchmarks (both previous and current) hyperfine was warning about statistical outliers - you can see that the ranges for each benchmark vary quite a bit and seem to have no correlation with the size of the file we are reading. My takeaway is that either my benchmarks are wrong, or the readline approach is not significantly worse in terms of CPU usage than mapping the entire file to memory, while giving us reduced memory usage. I did try to measure the memory usage, but it seemed to vary only within a range of a few megabytes, which makes it similarly unreliable. Maybe one last thing to test is to throw errors from a few large files and see how it impacts memory usage, wdyt? I also think that we should set some reasonably high max line (maybe at 10k) to avoid cases where we might be reading a very large bundled file.
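For reference, one crude way to observe peak memory in a benchmark run; the sampling interval and reporting are arbitrary, and as noted above this kind of measurement is only reliable to within a few megabytes:

```ts
// Sample RSS periodically and report the peak on exit. This is a rough
// sketch for benchmarking, not something to ship in the SDK.
let peakRss = process.memoryUsage().rss;
const sampler = setInterval(() => {
  peakRss = Math.max(peakRss, process.memoryUsage().rss);
}, 10);

process.on('exit', () => {
  clearInterval(sampler);
  console.log(`peak rss: ${(peakRss / 1024 / 1024).toFixed(1)} MiB`);
});
```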
These benchmark results make sense to me because I wouldn't have expected the readline approach to be considerably faster, especially when the errors are towards the end of the file (i.e. the worst case). Do the v8 debugger "Record Allocation profile" results look considerably different between this branch and develop? I've never looked into the node readline code before, but now that I'm looking at the API, in the worst-case scenario I'm doubtful it ends up allocating less than reading the entire file to a string and splitting. Since there's no way to pass string slice references in JavaScript, every line readline emits is a newly allocated string anyway. To make it faster and allocate less you'd need to skip to the first required line through the raw stream and only then slice the buffer and allocate strings for the required lines.
Yeah exactly, though the goal here is to reduce peak memory usage without impacting CPU time. I would expect to see fs spike to allocate the entire file while readline performs smaller allocations, somewhat similar to what is described in https://betterprogramming.pub/a-memory-friendly-way-of-reading-files-in-node-js-a45ad0cc7bb6, which also indicates that fs.read could be the next option if we want to optimize the individual string allocations. The worst case here is that we end up reading a large minified file - then none of this applies and we just map the entire string (we could avoid that with some sensible max colno, past which we just give up on reading such files).
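For contrast, roughly what the eager approach being replaced looks like; the function name and slicing here are illustrative:

```ts
import { readFileSync } from 'fs';

// Illustrative version of the eager approach: the whole file is allocated
// as one string, and split('\n') then allocates a second copy as per-line
// strings - the memory doubling mentioned in the PR description.
function getLinesEagerly(path: string, start: number, end: number): string[] {
  const lines = readFileSync(path, 'utf8').split('\n');
  return lines.slice(Math.max(start - 1, 0), end);
}
```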
This PR has already achieved that by using `readline`, and I think that's enough to merge this PR! To reduce the allocations in the worst cases, we'd want a lower-level `fs.read`-based approach.
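A sketch of that lower-level idea: count newline bytes in raw chunks, copy out only the byte slices covering the requested lines, and decode them once at the end. All names are illustrative; this is not code from the PR:

```ts
import { createReadStream } from 'fs';

// Hypothetical fs-level reader: lines outside [start, end] are never turned
// into strings; decoding happens once at the end, so multi-byte UTF-8
// characters split across chunk boundaries are still handled correctly.
function readLineRange(path: string, start: number, end: number): Promise<string[]> {
  return new Promise((resolve, reject) => {
    const stream = createReadStream(path);
    const kept: Buffer[] = [];
    let lineNo = 1; // line number the next byte belongs to

    const finish = (): string[] => {
      const text = Buffer.concat(kept).toString('utf8');
      return text === '' ? [] : text.replace(/\n$/, '').split('\n');
    };

    stream.on('error', reject);
    stream.on('end', () => resolve(finish()));
    stream.on('data', (chunk: Buffer) => {
      let offset = 0;
      while (offset < chunk.length) {
        const nl = chunk.indexOf(0x0a, offset); // next '\n' byte in this chunk
        const sliceEnd = nl === -1 ? chunk.length : nl + 1;
        if (lineNo >= start && lineNo <= end) {
          // Copy the slice so we don't retain the whole chunk buffer.
          kept.push(Buffer.from(chunk.subarray(offset, sliceEnd)));
        }
        if (nl === -1) break; // rest of the chunk continues the current line
        lineNo++;
        if (lineNo > end) {
          stream.destroy(); // we have everything we need; stop reading
          resolve(finish());
          return;
        }
        offset = sliceEnd;
      }
    });
  });
}
```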
You nailed it :) It would be super interesting to see how well that performs (though I won't be tackling this anytime soon, I think). I'm going to try and see if we have any internal services that we can monitor once this PR goes out.
@JonasBa there is https://github.com/getsentry/chartcuterie/ and https://github.com/getsentry/sentry-docs but we don't have the best infra metrics monitoring for them re: CPU/memory
…file handles (#14995) The ContextLines integration uses readable streams to read files more memory-efficiently when attaching source context to outgoing events. See more details here: #12221. Unfortunately, if we don't explicitly destroy the stream after creating and using it, it won't get closed, even when we remove the readline interface that uses the stream (and which actually does the reading of files). To fix this, we adjust the resolve logic when getting file context to destroy the stream, since at that point we are done with the readline interface anyway.
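A minimal sketch of the fix described in that commit message, assuming a promise-based read loop like the one in this PR; the names are illustrative:

```ts
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

// Illustrative version of the fix: closing the readline interface alone does
// not close the underlying file stream, so the resolve path destroys it
// explicitly to release the file handle.
function readContext(path: string, start: number, end: number): Promise<string[]> {
  return new Promise(resolve => {
    const stream = createReadStream(path);
    const rl = createInterface({ input: stream });
    const lines: string[] = [];
    let lineNo = 0;
    let settled = false;

    function done(result: string[]): void {
      if (settled) return;
      settled = true;
      rl.close();
      stream.destroy(); // without this, the fd stays open after rl.close()
      resolve(result);
    }

    stream.on('error', () => done([]));
    rl.on('line', line => {
      lineNo++;
      if (lineNo >= start && lineNo <= end) lines.push(line);
      if (lineNo >= end) done(lines);
    });
    rl.on('close', () => done(lines)); // EOF before reaching `end`
  });
}
```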
Refactors the contextlines integration to use readline instead of reading the entire file into memory (we were also doubling the memory because we split the file via `.split('\n')`).

The current implementation is not leveraging the cache to optimize file reads, so I need to reintroduce that. Something to note is that the caching behavior will become per frame location vs per file, which I think makes more sense, as we can only assume that this particular stack frame might throw again in the future (as opposed to anything from the file it is in).
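A sketch of what per-frame-location caching could look like; the Map-based LRU and the size limit below are illustrative, not the SDK's actual cache:

```ts
// Cache context lines keyed by file path plus line number, rather than
// caching whole file contents. Eviction below is a simple insertion-order
// LRU approximation for illustration.
const MAX_CACHE_ENTRIES = 100;
const contextCache = new Map<string, string[]>();

function cacheKey(file: string, lineno: number): string {
  return `${file}:${lineno}`;
}

function setCached(file: string, lineno: number, lines: string[]): void {
  if (contextCache.size >= MAX_CACHE_ENTRIES) {
    // Maps iterate in insertion order, so the first key is the oldest entry.
    const oldest = contextCache.keys().next().value;
    if (oldest !== undefined) contextCache.delete(oldest);
  }
  contextCache.set(cacheKey(file, lineno), lines);
}

function getCached(file: string, lineno: number): string[] | undefined {
  return contextCache.get(cacheKey(file, lineno));
}
```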