Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[experimental] Use #[thread_local] instead of thread_local! in rustc_middle #106270

Closed
wants to merge 1 commit into from

Conversation

jyn514
Copy link
Member

@jyn514 jyn514 commented Dec 29, 2022

This avoid monomorphizing LocalKey::try_with 5 times for each query. It has the downside that destructors for the thread-local are never run; but we don't depend on the destructors for ImplicitContext itself anywhere.

Helps with #65031 (comment)

r? @jyn514

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 29, 2022
@jyn514
Copy link
Member Author

jyn514 commented Dec 29, 2022

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 29, 2022
@bors
Copy link
Contributor

bors commented Dec 29, 2022

⌛ Trying commit afd4524ac844d1f2dc3ed87a61d25021897f2993 with merge 389dafa688a3ad30c1a062340c53cdf031a3f1e1...

static TLV: Cell<usize> = const { Cell::new(0) };
}
#[thread_local]
static TLV: Cell<usize> = const { Cell::new(0) };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have an inline_const here? Couldn't that just call Cell::new directly?

@bjorn3
Copy link
Member

bjorn3 commented Dec 29, 2022

This breaks on systems without #[thread_local] support like I believe MinGW.

Destructors not running doesn't matter for TLV as Cell<usize> doesn't have one anyway.

@jyn514
Copy link
Member Author

jyn514 commented Dec 29, 2022

This breaks on systems without #[thread_local] support like I believe MinGW.

How can I tell which systems aren't supported by thread_local?

@thomcc
Copy link
Member

thomcc commented Dec 29, 2022

This breaks on systems without #[thread_local] support like I believe MinGW.

How can I tell which systems aren't supported by thread_local?

#[cfg(target_thread_local)], for the most part.

@bors
Copy link
Contributor

bors commented Dec 29, 2022

☀️ Try build successful - checks-actions
Build commit: 389dafa688a3ad30c1a062340c53cdf031a3f1e1 (389dafa688a3ad30c1a062340c53cdf031a3f1e1)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (389dafa688a3ad30c1a062340c53cdf031a3f1e1): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
1.4% [1.4%, 1.4%] 1
Improvements ✅
(primary)
-0.4% [-0.5%, -0.3%] 4
Improvements ✅
(secondary)
-0.9% [-1.4%, -0.3%] 2
All ❌✅ (primary) -0.4% [-0.5%, -0.3%] 4

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.8% [2.1%, 3.4%] 5
Regressions ❌
(secondary)
2.2% [0.7%, 3.7%] 49
Improvements ✅
(primary)
-0.9% [-0.9%, -0.9%] 1
Improvements ✅
(secondary)
-1.7% [-2.3%, -1.4%] 3
All ❌✅ (primary) 2.1% [-0.9%, 3.4%] 6

Cycles

This benchmark run did not return any relevant results for this metric.

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Dec 30, 2022
@jyn514
Copy link
Member Author

jyn514 commented Dec 30, 2022

Comparison of -Z dump-mono-stats before and after:

$ jq -c 'sort_by(.total_estimate, .name) | .[]' rustc_query_impl.mono_items-thread_local_attr.json | diff - <(jq -c 'sort_by(.total_estimate, .name) | .[]' rustc_query_impl.mono_items-thread_local_macro.json)
1893a1894
> {"name":"rustc_middle::ty::tls::get_tlv::{closure#0}","instantiation_count":1,"size_estimate":5,"total_estimate":5}
2997a2999
> {"name":"std::thread::__FastLocalKeyInner::<T>::register_dtor","instantiation_count":1,"size_estimate":10,"total_estimate":10}
9137a9140
> {"name":"rustc_middle::ty::tls::TLV::__getit","instantiation_count":3,"size_estimate":75,"total_estimate":225}
9175d9177
< {"name":"std::mem::needs_drop","instantiation_count":125,"size_estimate":2,"total_estimate":250}
9182a9185
> {"name":"std::mem::needs_drop","instantiation_count":128,"size_estimate":2,"total_estimate":256}
9267d9269
< {"name":"std::thread::LocalKey::<T>::with","instantiation_count":19,"size_estimate":17,"total_estimate":323}
9384c9386
< {"name":"std::mem::drop","instantiation_count":257,"size_estimate":2,"total_estimate":514}
---
> {"name":"std::mem::drop","instantiation_count":256,"size_estimate":2,"total_estimate":512}
9386d9387
< {"name":"<std::result::Result<T, F> as std::ops::FromResidual<std::result::Result<std::convert::Infallible, E>>>::from_residual","instantiation_count":40,"size_estimate":13,"total_estimate":520}
9395a9397
> {"name":"<std::result::Result<T, F> as std::ops::FromResidual<std::result::Result<std::convert::Infallible, E>>>::from_residual","instantiation_count":41,"size_estimate":13,"total_estimate":533}
9828d9829
< {"name":"std::thread::LocalKey::<T>::try_with","instantiation_count":19,"size_estimate":58,"total_estimate":1102}
9880a9882
> {"name":"std::cell::Cell::<T>::get","instantiation_count":192,"size_estimate":8,"total_estimate":1536}
9883d9884
< {"name":"std::cell::Cell::<T>::get","instantiation_count":194,"size_estimate":8,"total_estimate":1552}
9989,9990c9990,9991
< {"name":"std::mem::MaybeUninit::<T>::uninit","instantiation_count":1454,"size_estimate":2,"total_estimate":2908}
< {"name":"std::mem::ManuallyDrop::<T>::into_inner","instantiation_count":1456,"size_estimate":2,"total_estimate":2912}
---
> {"name":"std::mem::MaybeUninit::<T>::uninit","instantiation_count":1453,"size_estimate":2,"total_estimate":2906}
> {"name":"std::mem::ManuallyDrop::<T>::into_inner","instantiation_count":1455,"size_estimate":2,"total_estimate":2910}
9998a10000
> {"name":"std::cell::Cell::<T>::set","instantiation_count":192,"size_estimate":16,"total_estimate":3072}
10001d10002
< {"name":"std::cell::Cell::<T>::set","instantiation_count":193,"size_estimate":16,"total_estimate":3088}
10059a10061
> {"name":"std::cell::Cell::<T>::replace","instantiation_count":192,"size_estimate":25,"total_estimate":4800}
10062d10063
< {"name":"std::cell::Cell::<T>::replace","instantiation_count":193,"size_estimate":25,"total_estimate":4825}
10137c10138
< {"name":"std::cell::UnsafeCell::<T>::get","instantiation_count":1257,"size_estimate":8,"total_estimate":10056}
---
> {"name":"std::cell::UnsafeCell::<T>::get","instantiation_count":1255,"size_estimate":8,"total_estimate":10040}
10151c10152
< {"name":"std::mem::MaybeUninit::<T>::assume_init","instantiation_count":1456,"size_estimate":8,"total_estimate":11648}
---
> {"name":"std::mem::MaybeUninit::<T>::assume_init","instantiation_count":1455,"size_estimate":8,"total_estimate":11640}
10157c10158
< {"name":"std::mem::MaybeUninit::<T>::as_mut_ptr","instantiation_count":1544,"size_estimate":8,"total_estimate":12352}
---
> {"name":"std::mem::MaybeUninit::<T>::as_mut_ptr","instantiation_count":1543,"size_estimate":8,"total_estimate":12344}
10172a10174,10175
> {"name":"rustc_middle::ty::tls::set_tlv::{closure#0}::{closure#0}","instantiation_count":1597,"size_estimate":9,"total_estimate":14373}
> {"name":"rustc_middle::ty::tls::set_tlv::{closure#1}","instantiation_count":1597,"size_estimate":9,"total_estimate":14373}
10181,10182c10184
< {"name":"rustc_middle::ty::tls::set_tlv::{closure#0}","instantiation_count":1597,"size_estimate":11,"total_estimate":17567}
< {"name":"std::ptr::read::runtime","instantiation_count":1426,"size_estimate":13,"total_estimate":18538}
---
> {"name":"std::ptr::read::runtime","instantiation_count":1425,"size_estimate":13,"total_estimate":18525}
10184c10186
< {"name":"std::result::Result::<T, E>::expect","instantiation_count":968,"size_estimate":20,"total_estimate":19360}
---
> {"name":"std::result::Result::<T, E>::expect","instantiation_count":969,"size_estimate":20,"total_estimate":19380}
10185a10188
> {"name":"rustc_middle::ty::tls::set_tlv::{closure#0}","instantiation_count":1597,"size_estimate":13,"total_estimate":20761}
10189c10192
< {"name":"std::mem::replace","instantiation_count":1140,"size_estimate":20,"total_estimate":22800}
---
> {"name":"std::mem::replace","instantiation_count":1139,"size_estimate":20,"total_estimate":22780}
10199c10202
< {"name":"std::ptr::write::runtime","instantiation_count":1819,"size_estimate":16,"total_estimate":29104}
---
> {"name":"std::ptr::write::runtime","instantiation_count":1818,"size_estimate":16,"total_estimate":29088}
10212,10213c10215,10216
< {"name":"std::ptr::write","instantiation_count":1790,"size_estimate":21,"total_estimate":37590}
< {"name":"std::ptr::read","instantiation_count":1426,"size_estimate":28,"total_estimate":39928}
---
> {"name":"std::ptr::write","instantiation_count":1789,"size_estimate":21,"total_estimate":37569}
> {"name":"std::ptr::read","instantiation_count":1425,"size_estimate":28,"total_estimate":39900}
10227a10231
> {"name":"std::thread::LocalKey::<T>::with","instantiation_count":3214,"size_estimate":17,"total_estimate":54638}
10229c10233
< {"name":"rustc_middle::ty::tls::set_tlv","instantiation_count":1597,"size_estimate":37,"total_estimate":59089}
---
> {"name":"rustc_middle::ty::tls::set_tlv","instantiation_count":1597,"size_estimate":39,"total_estimate":62283}
10240a10245
> {"name":"std::thread::LocalKey::<T>::try_with","instantiation_count":3214,"size_estimate":58,"total_estimate":186412}

In particular, notice that the LocalKey::try_with total_estimate has gone from 186412 -> 1102, and LocalKey::with has gone from 54638 -> 323. tls::set_tlv has also gone from 62283 -> 59089, and while there are some other changes none of them look too significant.

@Mark-Simulacrum
Copy link
Member

Do you have a sense of what percentage that is of the total sum? It looks like wall time is estimated to have gone down by about 3% -- I wonder how good our estimate is. (Note that the wall time is a little noisy, so it might actually be less or more than 3%. It's also true that we have a -j 2 build for this so there's some parallelism involved).

@jyn514
Copy link
Member Author

jyn514 commented Dec 30, 2022

$ jq 'map(.total_estimate) | add' rustc_query_impl.mono_items-thread_local_macro.json | head
5153413

thread_local_macro is before this PR (29d76cc). dump-mono-stats reports a total reduction of (186412-1102) + (54638 - 323) + (62283 - 59089) = 242819. So 242819 / 5153413 is a 4.7% reduction; and I think the overcounting is only because I've cherry-picked the LocalKey examples, let me try doing it for everything in the file.

@jyn514
Copy link
Member Author

jyn514 commented Dec 30, 2022

ah no, apparently I was undercounting.

$ jq 'map(.total_estimate) | add' rustc_query_impl.mono_items-thread_local_macro.json
5153413
$ jq 'map(.total_estimate) | add' rustc_query_impl.mono_items-thread_local_attr.json
4878568
$ python
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
>>> 1 - 4878568/5153413
0.05333261665618494

So according to dump-mono-stats this is a 5.3% reduction in code that needs to be sent in LLVM :) it says "estimate" so that might not be 100% accurate but it seems roughly comparable to what perf.rlo says.

@jyn514
Copy link
Member Author

jyn514 commented Dec 30, 2022

@bjorn3 hmm, x check --host x86_64-pc-windows-gnu compiler works ok for me locally - how could I verify whether this breaks mingw? Does it only break at runtime?

@lqd
Copy link
Member

lqd commented Dec 30, 2022

The size estimate is a rough approximation here, it's still basically the number of MIR statements (here cargo-llvm-lines stats should be more realistic wrt LLVM compile times).

It's still useful of course, but improving the estimates to better reflect the backends' actual compilation-to-native-code work (rather than, say, the lowering to their own IR) is tracked in issues like #69382 and others.

@thomcc
Copy link
Member

thomcc commented Dec 30, 2022

Does it only break at runtime?

Yes, but not deterministically.

@mati865
Copy link
Contributor

mati865 commented Dec 30, 2022

@bjorn3 hmm, x check --host x86_64-pc-windows-gnu compiler works ok for me locally - how could I verify whether this breaks mingw? Does it only break at runtime?

You need to build the compiler that will use TLS so at the very least use build instead of check.

@jyn514 jyn514 force-pushed the faster-thread-local branch from afd4524 to 2a845e8 Compare December 30, 2022 22:40
@rust-log-analyzer

This comment has been minimized.

This avoid monomorphizing `LocalKey::try_with` 5 times for each query.
It has the downside that destructors for the thread-local are never run;
but we don't depend on the destructors for ImplicitContext itself
anywhere.
@jyn514 jyn514 force-pushed the faster-thread-local branch from 2a845e8 to 0aa4956 Compare December 30, 2022 23:02
@jyn514
Copy link
Member Author

jyn514 commented Jan 2, 2023

#106311 has a much better perf improvement, and also avoids needing multiple versions of the code for different platforms.

@jyn514 jyn514 closed this Jan 2, 2023
@jyn514 jyn514 deleted the faster-thread-local branch January 5, 2023 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.