copy_misaligned_words: avoid out-of-bounds accesses #799
Conversation
Force-pushed from ea4768d to 469b91f
I've been meaning to ask about this, it sounds like a great idea to me. You can just add a new main.yml CI job, probably with these bits: compiler-builtins/.github/workflows/main.yml, lines 107 to 115 in 571ce5f
The float tests can probably be skipped since that module has no unsafe (we might even be able to
Force-pushed from e4ef842 to 842fa94
src/mem/impls.rs (Outdated)

        dest_usize = dest_usize.wrapping_add(1);
    }

    // There's one more element left to go, and we can't use the loop for that as on the `src` side,
    // it is partially out-of-bounds.
The code previously seemed unaware that there can also be OOB accesses at the end of the range -- but of course that's fundamentally the same problem as at the beginning.
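One way to structure that part of the fix (a hypothetical simplified helper, not the crate's actual code, and using a bytewise tail rather than the PR's sub-word loads): the word loop covers only fully in-bounds words, and the partial final word is copied with smaller accesses so nothing reads past `src + n`.

```rust
use core::mem::size_of;

/// Copy `n` bytes, peeling off the partial final word. The word loop only
/// covers fully in-bounds words; the remaining tail (fewer than a word's
/// worth of bytes) is copied bytewise, so no load reaches past `src + n`.
unsafe fn copy_with_peeled_tail(mut src: *const u8, mut dest: *mut u8, n: usize) {
    const WS: usize = size_of::<usize>();
    for _ in 0..n / WS {
        (dest as *mut usize).write_unaligned((src as *const usize).read_unaligned());
        src = src.add(WS);
        dest = dest.add(WS);
    }
    // Tail: a whole-word load here could read out of bounds, so go bytewise.
    for _ in 0..n % WS {
        *dest = *src;
        src = src.add(1);
        dest = dest.add(1);
    }
}
```

This only illustrates the control-flow shape (peeled last iteration); the real routine additionally deals with misaligned `src` by combining aligned word loads.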
Force-pushed from 605b6ca to d8cf8a2
Miri is looking good on CI :)
Force-pushed from cd0efd6 to 01c90a6
Some surface-level notes, I'll take a closer look at perf soonish
Force-pushed from b3464f6 to 9330e7c
Unfortunately it looks like this comes close to doubling the total line and label counts of this routine: https://godbolt.org/z/7WYa6e83n. I agree that the UB is worth fixing even at a performance hit, but I have to imagine this could be improved with massaging. @nbdd0121 I know it has been a long time since you worked on #405, but do you have any ideas on how to improve the codegen here without OOB access? (I haven't actually tested, so it is possible visual asm heuristics don't accurately reflect runtime, but the end blocks are definitely larger)
Yeah, there's prefix and postfix handling now, which of course adds some extra code and labels. The original code neglected to treat the last loop iteration differently, which makes this not a fully fair comparison (at the very least, we should compare with a version that uses an atomic/volatile load for the last round, as that one can also be OOB). The code size could be reduced by using

This compares the "original but with the final loop iteration unrolled" with the
Technically, the last iteration of the loop doesn't need special handling if we use an unordered atomic load for each loop iteration. The codegen shouldn't be massively different. Using byte copies doesn't necessarily mean worse performance, as it performs at most 3 additional byte copies (on each end). But it also removes data-dependent branches, which are hard to predict. This could also be merged with the outermost byte-copy computation. I guess benchmarking would be necessary.
In my view, since this is almost certainly still faster than the code before #405, and that PR achieved its performance by having UB, this is still a win. But I'd also be curious what the numbers actually look like. If someone has access to an ARM-32 system and could benchmark this, that would be great. :)
Just for an update, I'm trying pretty hard to get some form of benchmarks just so we have a reference, not that I think perf is really worth blocking on as long as it's not somehow awful. There are now instruction count benchmarks; you need

Mind rebasing at some point to pick that up? (Sorry, I moved some things around, hence the conflict)
Force-pushed from 9330e7c to 277aecd
Sure, the rebase went through without any manual work.
Gave up on qemu, but I had a 32-bit Raspberry Pi lying around :)

I don't expect that to make or break anyone's day, so let's get this UB fixed 🎉. Full benchmark log
Force-pushed from 277aecd to b603726
That is strange; for larger tests the slowdown should trail off to 0, no? The core "hot" loop is exactly the same as before. Should I submit a PR that copies the prefix/postfix bytewise instead of via a 1-byte load and a 2-byte load, just so that we can get those numbers for comparison?
Maybe they were tapering at different rates, so the delta appears larger at larger copy lengths? Though 4k and 1M are kind of large to be seeing that behavior. I don't think we really need to do anything here, but if you have something in mind, I'm happy to test it.
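For intuition on the tapering expectation (with made-up illustrative numbers, not measurements from this PR): if the new prefix/suffix handling adds a fixed cost on top of a per-byte cost, its share of the total copy time shrinks roughly as 1/n, which is why the hot loop being unchanged should make large-copy overhead vanish.

```rust
/// Hypothetical cost model: a fixed prefix/suffix cost plus a steady-state
/// per-byte cost. Returns the fixed cost's fraction of total copy time.
/// The constants are assumptions for illustration only.
fn fixed_overhead_share(n_bytes: f64) -> f64 {
    let fixed_ns = 2.0; // assumed constant prefix/suffix handling cost
    let per_byte_ns = 0.05; // assumed per-byte cost of the hot loop
    fixed_ns / (fixed_ns + per_byte_ns * n_bytes)
}
```

Under this model the overhead share at 1 MiB is already well below 0.01%, so a persistent delta at 4k and 1M would indeed be surprising.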
Fixes #559 for memmove/memcpy: load the underaligned prefix and suffix in copy_*_misaligned_words in up to 3 separate aligned loads (a 1-byte load, a 2-byte load, and for 64-bit targets a 4-byte load), while only doing those loads that are actually in bounds. The hope is that the performance loss compared to a single aligned ptr-sized load is negligible.

I confirmed that this now passes Miri (the second of these already worked before this PR):
I added a new test, since the existing test had some slack space around the memory being copied, making all accesses accidentally in bounds (but Miri was still helpful to confirm everything is aligned). This test found a bug in my code, fixed in the second commit. :D
This also adds the above commands to CI, so hopefully this crate will stay green for Miri. :)
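The up-to-3-aligned-loads idea can be sketched as follows (a simplified illustration, not the crate's actual helper; `load_low_bytes` is a hypothetical name, and a little-endian target is assumed):

```rust
use core::mem::size_of;

/// Read `n` bytes (1 <= n < size_of::<usize>()) starting at the word-aligned
/// pointer `src` into the low bytes of a `usize`, using at most three aligned
/// loads (1-byte, 2-byte, and on 64-bit targets 4-byte), so that no access
/// touches memory past `src + n`. Assumes a little-endian target.
unsafe fn load_low_bytes(mut src: *const u8, n: usize) -> usize {
    debug_assert!(n >= 1 && n < size_of::<usize>());
    debug_assert!(src as usize % size_of::<usize>() == 0);
    let mut word: usize = 0;
    let mut shift = 0;
    // Visit the largest sub-word size first so every load stays aligned:
    // e.g. n = 7 on a 64-bit target does a 4-byte, then 2-byte, then 1-byte load.
    let mut size = size_of::<usize>() / 2;
    while size >= 1 {
        if n & size != 0 {
            let v = match size {
                4 => (src as *const u32).read() as usize,
                2 => (src as *const u16).read() as usize,
                1 => src.read() as usize,
                _ => unreachable!(),
            };
            word |= v << shift;
            src = src.add(size);
            shift += 8 * size;
        }
        size /= 2;
    }
    word
}
```

Because each load size divides the running offset, every access stays naturally aligned, and only bytes inside `[src, src + n)` are ever touched.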