Performance regression in the past 9 days #118501

Closed
RivenSkaye opened this issue Dec 1, 2023 · 9 comments
Labels
C-bug (Category: This is a bug.)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
P-medium (Medium priority)
regression-untriaged (Untriaged performance or correctness regression.)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@RivenSkaye commented Dec 1, 2023

I'm doing AoC and figured I might as well upgrade after finishing day one. I re-ran the benchmarks, but I'm consistently getting slowdowns.
rustup update updated my nightly toolchains as follows:

   nightly-x86_64-pc-windows-gnu updated - rustc 1.76.0-nightly (87e1447aa 2023-11-30) (from rustc 1.76.0-nightly (2f8d81f9d 2023-11-21))
   nightly-x86_64-pc-windows-msvc updated - rustc 1.76.0-nightly (87e1447aa 2023-11-30) (from rustc 1.76.0-nightly (2f8d81f9d 2023-11-21))

Code

I tried this code:
AoC repo, run from the 2023 directory.
I'll happily try to make an MRE over the weekend if anyone prefers that.

I expected to see this happen: the same speed, since I'm not using any nightly-only APIs in the code being run in the benchmarks. The common utility module only uses #![feature(byte_slice_trim_ascii)], and that is only called once.
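For context, that feature gate only covers the trim_ascii family of methods on byte slices. A minimal sketch of that one call, with a made-up input:

    #![feature(byte_slice_trim_ascii)]

    fn main() {
        // Trims leading and trailing ASCII whitespace from a byte slice.
        let raw: &[u8] = b"  two1nine \n";
        assert_eq!(raw.trim_ascii(), b"two1nine");
    }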

Instead, this happened: The same safe code is consistently running slower with no changes in system load.

Old situation on my hardware:

part1bench 17,791 ns/iter (+/- 2,956)
part2bench 27,777 ns/iter (+/- 476)

New:

part1bench 19,953 ns/iter (+/- 3,506)
part2bench 31,927 ns/iter (+/- 675)

CPU: AMD Ryzen 7 5800H
mem: 15.4 / 16GB usable
OS: Windows 10 Home
Version: 22H2
Build: 19045.3693

Version it worked on

It most recently worked on: rustc 1.76.0-nightly (2f8d81f 2023-11-21)
Possibly later; I usually don't grab daily updates. I'll go through the other eight nightlies to find where the performance characteristics changed.

Version with regression

rustc --version --verbose:

rustc 1.76.0-nightly (87e1447aa 2023-11-30)
binary: rustc
commit-hash: 87e1447aadaa2899ff6ccabe1fa669eb50fb60a1
commit-date: 2023-11-30
host: x86_64-pc-windows-gnu
release: 1.76.0-nightly
LLVM version: 17.0.5

I have tried both the windows-gnu and windows-msvc toolchains; they both hit around the same speeds, and the same slowdown.

@RivenSkaye added the C-bug (Category: This is a bug.) and regression-untriaged (Untriaged performance or correctness regression.) labels on Dec 1, 2023
@rustbot added the I-prioritize (Issue: Indicates that prioritization has been requested for this issue.) and needs-triage (This issue may need triage. Remove it if it has been sufficiently triaged.) labels on Dec 1, 2023
@RivenSkaye (Author)

Did some more testing: removing mimalloc as the global allocator makes no difference now, even though I originally added it because it made a similar difference in my favor on the nine-day-old nightly.
Makes me wonder whether rustc is actually linking the allocator in at all. Is there any way to verify that it is?
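For reference, wiring mimalloc in as the global allocator usually looks roughly like the sketch below; this is an assumption about the setup, not the actual code from the repo:

    // Sketch of a typical mimalloc setup; the real repo may differ.
    use mimalloc::MiMalloc;

    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;

If the attribute is present and the crate is listed in Cargo.toml, every heap allocation in the produced binary should go through mimalloc.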

@RivenSkaye changed the title from "Speed regression in the past 9 days" to "Performance regression in the past 9 days" on Dec 1, 2023
@lukas-code (Member)

Bisects to #110303, which does have an apparent perf impact.

Interestingly, that PR seems to improve performance on my machine (Intel i7-6700K / Arch Linux), so maybe we can blame this all on inlining changes?

before:

test aoc_benching::part1bench ... bench:      30,336 ns/iter (+/- 341)
test aoc_benching::part2bench ... bench:      50,313 ns/iter (+/- 790)

after:

test aoc_benching::part1bench ... bench:      31,631 ns/iter (+/- 295)
test aoc_benching::part2bench ... bench:      46,727 ns/iter (+/- 868)

@saethlin (Member) commented Dec 1, 2023

Can you share what you used to bisect this? Running the benchmarks myself on a quiet system with fixed CPU frequency, I can get either the slow number or the fast number just by running the benchmark executable repeatedly. The amount of spurious variation I'm seeing in the results is pretty typical for the built-in benchmarking. criterion is better: it still has spurious reports, but at the rate its confidence intervals indicate, whereas the built-in benchmarking does not provide a confidence interval at all.
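For anyone who wants confidence intervals on these numbers, a minimal criterion harness would look roughly like the sketch below; part_one, its body, and the input are placeholders standing in for the real solution, and the bench file needs harness = false in Cargo.toml:

    // Minimal criterion sketch; part_one and the input are placeholders.
    use criterion::{black_box, criterion_group, criterion_main, Criterion};

    // Placeholder standing in for the real part-one solution.
    fn part_one(input: &str) -> usize {
        input.lines().count()
    }

    fn bench_part_one(c: &mut Criterion) {
        let input = "1abc2\npqr3stu8vwx\na1b2c3d4e5f\n"; // placeholder input
        c.bench_function("part1bench", |b| b.iter(|| part_one(black_box(input))));
    }

    criterion_group!(benches, bench_part_one);
    criterion_main!(benches);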

Based on my own investigation, I do not think there has been a codegen change. The two benchmarks are completely inlined into a single function, so I highly doubt there was a relevant inlining change.

@RivenSkaye Running nm target/release/day01 | grep mi_ suggests that mimalloc has indeed been linked in, though I'll admit I'm not familiar with the allocator.

@RivenSkaye (Author)

First, mimalloc indeed still gets linked in, as I do see a similar loss of performance when removing it. That's good to know; it means nothing's actively broken. Forget I mentioned it, it just briefly crossed my mind while going through the benchmark history.

Okay, so I ran the benchmarks in a loop before opening this report, and they pretty consistently reported averages within 800 ns of one another across runs. My system is a tad noisier than I'd like, but I'm still getting close results across runs, with usually a ~2 µs difference between toolchains. There are outliers on both, but that's to be expected. I do understand I can't come in saying coarse benchmarks report slightly different numbers and expect a magical fix, so I come bearing gifts, and also a question: what, if anything, could I do myself to further investigate possible differences in the output files?

The gifts: the output of cargo asm for both Rust versions, and an accompanying unified diff for easier viewing.
cargo-asm_newbuild.log
cargo-asm_oldbuild.log
cargo-asm_diff.patch

@saethlin (Member) commented Dec 1, 2023

My best guess then is that somehow your system ends up being very sensitive to code alignment in a way that mine isn't.

I've disassembled both part_one and part_two (well, actually the fold that all of the part_two implementation is in) and they have exactly the same instructions in both toolchains, nightly-2023-11-21 and nightly-2023-11-30. I did that with objdump -dC target/release/deps/day01-fdf9de81948434a9 | less (that path is printed by cargo bench); searching for part_one>: or part_two>: will jump you to the implementation. For part_two you should see a call 186c0 or some other address; search for that 186c0 (or whatever yours is) and you should find the function that contains the actual implementation.

If you want to diff the output of objdump, you'll need to strip out the instruction addresses. If you can find a diff in the resulting instructions between the toolchains, that might indicate this perf difference is due to something more than lucky/unlucky alignment.

@lukas-code (Member) commented Dec 1, 2023

Can you share what you used to bisect this?

I used cargo bisect-rustc --start 2023-11-21 --prompt -- bench. On my machine I got a very consistent ~50-51us before and ~46-47us after on part2bench, and I just selected "baseline" for the 50us runs and "regressed" for the 46us runs.

I also tried testing it with criterion now and got around the same results initially (criterion test here).

However, I found out something interesting: sticking #[inline(never)] on the part_two function will swap around the results for some reason, so I now get:

                   inline allowed   #[inline(never)]
    5c97719 (old)  slow (50us)      fast (46us)
    9529a5d (new)  fast (46us)      slow (50us)

The only way I can explain these results is some hardware prefetching gone wrong.
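For clarity, the experiment above amounts to nothing more than adding the attribute as in the sketch below (hypothetical signature and placeholder body; the real part_two lives in the AoC repo):

    // Hypothetical signature; only the #[inline(never)] attribute matters here.
    // With it present, which nightly produces the faster binary flips around.
    #[inline(never)]
    pub fn part_two(input: &str) -> u32 {
        input.lines().count() as u32 // placeholder body
    }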

@the8472 (Member) commented Dec 2, 2023

Maybe try running this with -Ctarget-cpu=native or -Ztune-cpu=native? Sometimes the baseline heuristics are counterproductive on modern CPUs.

@apiraino (Contributor) commented Dec 4, 2023

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium +I-slow

@rustbot added the I-slow (Issue: Problems and improvements with respect to performance of generated code.) and P-medium (Medium priority) labels and removed the I-prioritize label on Dec 4, 2023
@apiraino added the T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) label on Dec 4, 2023
@saethlin removed the needs-triage (This issue may need triage. Remove it if it has been sufficiently triaged.) label on Dec 8, 2023
@RivenSkaye (Author)

Okay, sorry for stalling so long. There are loads of differences other than addresses, mostly in ordering within the output, with the remainder in code outside of my control (and also not in places that would impact the benchmarks).

I decided to run against yesterday's nightly too for completeness' sake and got results similar to the old situation; taking a peek at the objdump from that reveals slightly improved codegen with ordering very similar to the slower output.

In any case, sorry for the wasted time; this does indeed seem to have been an issue with either alignment or binary layout. Whatever the cause, this isn't on the rustc end of things.
