Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

park_timeout not waking up after timeout on macs #59020

Closed
lespea opened this issue Mar 8, 2019 · 6 comments · Fixed by #59676
Closed

park_timeout not waking up after timeout on macs #59020

lespea opened this issue Mar 8, 2019 · 6 comments · Fixed by #59676
Labels
C-bug Category: This is a bug. O-macos Operating system: macOS T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@lespea
Copy link

lespea commented Mar 8, 2019

I was trying to troubleshoot this issue: jonhoo/bus#18 and eventually we figured out that on my machine park_timeout wasn't waking up. I wrote this simple test code which demonstrates the issue:

use std::thread;
use std::time::Duration;

fn main() {
    let t1 = thread::spawn(|| {
        let sleep = Duration::new(0,100_000);
        for _ in 0..100 {
            println!("Parking1");
            thread::park_timeout(sleep);
        }
    });

    let t2 = thread::spawn(|| {
        let sleep = Duration::new(0,100_000);
        for _ in 0..100 {
            println!("Parking2");
            thread::park_timeout(sleep);
        }
    });

    t1.join().expect("Couldn't join thread 1");
    t2.join().expect("Couldn't join thread 2");
}

Most of the time it will only print this:

Parking1
Parking2

Every once in a while it will finish normally. This behavior happens on stable and nightly and I tried a handfull of stable versions all the way back to 1.12 and they all got stuck. Windows (both msvc and gnu appear to never get stuck (at least I couldn't reproduce the issue)). I was able to duplicate this an another mac as well (both running 10.14)

If only a single thread is started (or the park happens on the main thread) then I can't reproduce.

I'm on 10.14; uname: Darwin 34363bc7dc9c 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20 20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64 i386 MacBookPro11,3 Darwin

rustc 1.33.0 (2aa4c46cf 2019-02-28)
binary: rustc
commit-hash: 2aa4c46cfdd726e97360c2734835aa3515e8c858
commit-date: 2019-02-28
host: x86_64-apple-darwin
release: 1.33.0
LLVM version: 8.0
@sfackler
Copy link
Member

sfackler commented Mar 8, 2019

This definitely seems broken. Here's dtruss output:

	PID/THRD  SYSCALL(args) 		 = return
Parking1
Parking2
48727/0x4294e5:  open("/dev/dtracehelper\0", 0x2, 0xFFFFFFFFE70CECA0)		 = 3 0
48727/0x4294e5:  ioctl(0x3, 0x80086804, 0x7FFEE70CEC00)		 = 0 0
48727/0x4294e5:  close(0x3)		 = 0 0
48727/0x4294e5:  access("/AppleInternal/XBS/.isChrooted\0", 0x0, 0x0)		 = -1 Err#2
48727/0x4294e5:  thread_selfid(0x0, 0x0, 0x0)		 = 4363493 0
48727/0x4294e5:  bsdthread_register(0x7FFF5D23ABEC, 0x7FFF5D23ABDC, 0x2000)		 = 1073742047 0
48727/0x4294e5:  mprotect(0x108B93000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  mprotect(0x108B9A000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  mprotect(0x108B9B000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  mprotect(0x108BA2000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  mprotect(0x108B91000, 0x88, 0x1)		 = 0 0
48727/0x4294e5:  mprotect(0x108BA3000, 0x1000, 0x1)		 = 0 0
48727/0x4294e5:  mprotect(0x108B91000, 0x88, 0x3)		 = 0 0
48727/0x4294e5:  mprotect(0x108B91000, 0x88, 0x1)		 = 0 0
48727/0x4294e5:  issetugid(0x0, 0x0, 0x0)		 = 0 0
48727/0x4294e5:  getpid(0x0, 0x0, 0x0)		 = 48727 0
48727/0x4294e5:  stat64("/AppleInternal/XBS/.isChrooted\0", 0x7FFEE70CE2A8, 0x0)		 = -1 Err#2
48727/0x4294e5:  stat64("/AppleInternal\0", 0x7FFEE70CE340, 0x0)		 = -1 Err#2
48727/0x4294e5:  csops(0xBE57, 0x7, 0x7FFEE70CDDE0)		 = -1 Err#22
dtrace: error on enabled probe ID 2198 (ID 557: syscall::sysctl:return): invalid kernel access in action #11 at DIF offset 28
48727/0x4294e5:  csops(0xBE57, 0x7, 0x7FFEE70CD6D0)		 = -1 Err#22
48727/0x4294e5:  sigaction(0xD, 0x7FFEE70CFA48, 0x7FFEE70CFA70)		 = 0 0
dtrace: error on enabled probe ID 2198 (ID 557: syscall::sysctl:return): invalid kernel access in action #11 at DIF offset 28
48727/0x4294e5:  getrlimit(0x1003, 0x7FFEE70CFA60, 0x0)		 = 0 0
48727/0x4294e5:  mmap(0x7FFEE68D0000, 0x1000, 0x3, 0x1012, 0xFFFFFFFFFFFFFFFF, 0x0)		 = 0x7FFEE68D0000 0
48727/0x4294e5:  mprotect(0x7FFEE68D0000, 0x1000, 0x0)		 = 0 0
48727/0x4294e5:  sigaction(0xB, 0x7FFEE70CFA78, 0x0)		 = 0 0
48727/0x4294e5:  sigaction(0xA, 0x7FFEE70CFA78, 0x0)		 = 0 0
48727/0x4294e5:  sigaltstack(0x0, 0x7FFEE70CFAD0, 0x0)		 = 0 0
48727/0x4294e5:  mmap(0x0, 0x20000, 0x3, 0x1002, 0xFFFFFFFFFFFFFFFF, 0x0)		 = 0x108BA4000 0
48727/0x4294e5:  sigaltstack(0x7FFEE70CFAD0, 0x0, 0x0)		 = 0 0
48727/0x4294e5:  bsdthread_create(0x108B431C0, 0x7FDDC9C02730, 0x200000)		 = 104935424 0
48727/0x4294ec:  thread_selfid(0x0, 0x0, 0x0)		 = 4363500 0
48727/0x4294ec:  sigaltstack(0x0, 0x700006412EC0, 0x0)		 = 0 0
48727/0x4294e5:  bsdthread_create(0x108B431C0, 0x7FDDC9C029D0, 0x200000)		 = 107044864 0
48727/0x4294ec:  mmap(0x0, 0x20000, 0x3, 0x1002, 0xFFFFFFFFFFFFFFFF, 0x0)		 = 0x108BC4000 0
48727/0x4294ec:  sigaltstack(0x700006412EC0, 0x0, 0x0)		 = 0 0
48727/0x4294ed:  thread_selfid(0x0, 0x0, 0x0)		 = 4363501 0
48727/0x4294ed:  sigaltstack(0x0, 0x700006615EC0, 0x0)		 = 0 0
48727/0x4294ed:  mmap(0x0, 0x20000, 0x3, 0x1002, 0xFFFFFFFFFFFFFFFF, 0x0)		 = 0x108BE4000 0
48727/0x4294ed:  sigaltstack(0x700006615EC0, 0x0, 0x0)		 = 0 0
dtrace: error on enabled probe ID 2172 (ID 161: syscall::write:return): invalid kernel access in action #13 at DIF offset 68
48727/0x4294ec:  psynch_mutexdrop(0x7FDDC9D00110, 0x203, 0x100)		 = 0 0
48727/0x4294ed:  psynch_mutexwait(0x7FDDC9D00110, 0x203, 0x0)		 = 515 0
dtrace: error on enabled probe ID 2172 (ID 161: syscall::write:return): invalid kernel access in action #13 at DIF offset 68
48727/0x4294ec:  gettimeofday(0x700006412988, 0x0, 0x0)		 = 0 0
48727/0x4294ed:  gettimeofday(0x700006615988, 0x0, 0x0)		 = 0 0
48727/0x4294ec:  psynch_cvwait(0x7FDDC9C02830, 0x100000100, 0x0)		 = -1 Err#316
48727/0x4294ed:  psynch_cvwait(0x7FDDC9C02920, 0x100000100, 0x0)		 = -1 Err#316
^C
48727/0x4294e5:  __semwait_signal(0xD03, 0x0, 0x0)		 = -1 Err#4

@sfackler sfackler added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. C-bug Category: This is a bug. O-macos Operating system: macOS I-nominated and removed regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Mar 8, 2019
@sfackler
Copy link
Member

sfackler commented Mar 8, 2019

It's a deadlock on the ONCE here: https://github.com/rust-lang/rust/blob/master/src/libstd/sys/unix/time.rs#L241

* thread #2
  * frame #0: 0x00007fff5d073a46 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff5d23bb9d libsystem_pthread.dylib`_pthread_mutex_lock_wait + 83
    frame #2: 0x00007fff5d2394c8 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 253
    frame #3: 0x0000000100008c05 foo`std::thread::park::h40cbaa717417ef95 [inlined] std::sys::unix::mutex::Mutex::lock::h1c54f0b7e37dd0a3 at mutex.rs:56 [opt]
    frame #4: 0x0000000100008c00 foo`std::thread::park::h40cbaa717417ef95 [inlined] std::sys_common::mutex::Mutex::raw_lock::hecb01a8a1467a3f9 at mutex.rs:36 [opt]
    frame #5: 0x0000000100008c00 foo`std::thread::park::h40cbaa717417ef95 [inlined] _$LT$std..sync..mutex..Mutex$LT$T$GT$$GT$::lock::h012f35fcc06840a7 at mutex.rs:220 [opt]
    frame #6: 0x0000000100008bfc foo`std::thread::park::h40cbaa717417ef95 at mod.rs:891 [opt]
    frame #7: 0x000000010000d4ed foo`std::sync::once::Once::call_inner::h0b29d64214dba756 at once.rs:421 [opt]
    frame #8: 0x000000010000d7ed foo`std::time::Instant::duration_since::h6b1dd60fca9d26e8 [inlined] std::sync::once::Once::call_once::h15f56e63f2b5f3bd at once.rs:220 [opt]
    frame #9: 0x000000010000d7c8 foo`std::time::Instant::duration_since::h6b1dd60fca9d26e8 [inlined] std::sys::unix::time::inner::info::he0ee845c3bb0a3ac at time.rs:242 [opt]
    frame #10: 0x000000010000d7c8 foo`std::time::Instant::duration_since::h6b1dd60fca9d26e8 [inlined] std::sys::unix::time::inner::Instant::sub_instant::h2f90455c1f5f50e3 at time.rs:154 [opt]
    frame #11: 0x000000010000d7c8 foo`std::time::Instant::duration_since::h6b1dd60fca9d26e8 at time.rs:218 [opt]
    frame #12: 0x0000000100012255 foo`std::sys::unix::condvar::Condvar::wait_timeout::hdd4a51d3f762f929 [inlined] _$LT$std..time..Instant$u20$as$u20$core..ops..arith..Sub$GT$::sub::h0703e52941cc9b4d at time.rs:307 [opt]
    frame #13: 0x000000010001224d foo`std::sys::unix::condvar::Condvar::wait_timeout::hdd4a51d3f762f929 [inlined] std::time::Instant::elapsed::h544a171a1c00e632 at time.rs:242 [opt]
    frame #14: 0x0000000100012240 foo`std::sys::unix::condvar::Condvar::wait_timeout::hdd4a51d3f762f929 at condvar.rs:162 [opt]
    frame #15: 0x0000000100008fe8 foo`std::thread::park_timeout::h730ea3969ae2f6b8 [inlined] std::sys_common::condvar::Condvar::wait_timeout::h8f2d65bc554e47d5 at condvar.rs:51 [opt]
    frame #16: 0x0000000100008fdd foo`std::thread::park_timeout::h730ea3969ae2f6b8 [inlined] std::sync::condvar::Condvar::wait_timeout::h552dfa042e4361fb at condvar.rs:405 [opt]
    frame #17: 0x0000000100008fc3 foo`std::thread::park_timeout::h730ea3969ae2f6b8 at mod.rs:1003 [opt]
    frame #18: 0x0000000100003203 foo`foo::main::_$u7b$$u7b$closure$u7d$$u7d$::h014edfb0ea6d8161((null)=closure {

}) at main.rs:9
    frame #19: 0x0000000100003bed foo`std::sys_common::backtrace::__rust_begin_short_backtrace::hfdb85b8c915ae023(f=closure {

}) at backtrace.rs:135
    frame #20: 0x00000001000068fd foo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hcb677bec99ca4490 at mod.rs:469
    frame #21: 0x00000001000057ed foo`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hc80622d5e14b1c63(self=AssertUnwindSafe<closure>(closure {

}), _args=<unavailable>) at panic.rs:309
    frame #22: 0x00000001000050ca foo`std::panicking::try::do_call::hcae109a146a05cf6(data=&0x70000ef8bd38) at panicking.rs:297
    frame #23: 0x000000010001384f foo`__rust_maybe_catch_panic at lib.rs:92 [opt]
    frame #24: 0x0000000100004f33 foo`std::panicking::try::h4483321badd4a86e(f=AssertUnwindSafe<closure>(closure {

})) at panicking.rs:276
    frame #25: 0x000000010000583d foo`std::panic::catch_unwind::h9d60a569083fc382(f=AssertUnwindSafe<closure>(closure {

})) at panic.rs:388
    frame #26: 0x0000000100006766 foo`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::hed8c8ea5ca4cd51c at mod.rs:468
    frame #27: 0x0000000100006abf foo`_$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h781bdd77c151bf87(self=&0x100300380, args=<unavailable>) at boxed.rs:734
    frame #28: 0x000000010001324c foo`std::sys::unix::thread::Thread::new::thread_start::h2ccd51efe09a6d88 [inlined] _$LT$alloc..boxed..Box$LT$$LP$dyn$u20$alloc..boxed..FnBox$LT$A$C$$u20$Output$u3d$R$GT$$u20$$u2b$$u20$$u27$a$RP$$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hf680f30793070da1 at boxed.rs:744 [opt]
    frame #29: 0x0000000100013249 foo`std::sys::unix::thread::Thread::new::thread_start::h2ccd51efe09a6d88 [inlined] std::sys_common::thread::start_thread::h30434b5681199bc5 at thread.rs:14 [opt]
    frame #30: 0x00000001000131ce foo`std::sys::unix::thread::Thread::new::thread_start::h2ccd51efe09a6d88 at thread.rs:81 [opt]
    frame #31: 0x00007fff5d23b661 libsystem_pthread.dylib`_pthread_body + 340
    frame #32: 0x00007fff5d23b50d libsystem_pthread.dylib`_pthread_start + 377
    frame #33: 0x00007fff5d23abf9 libsystem_pthread.dylib`thread_start + 13

@sfackler
Copy link
Member

sfackler commented Mar 8, 2019

As a workaround, everything should work fine as long as you construct an Instant at least once in your process before parking.

@lespea
Copy link
Author

lespea commented Mar 8, 2019

Doesn't seem to help :(

Code

use std::thread;
use std::time::Duration;
use std::time;

fn main() {
    println!("{:?}", time::Instant::now());

    let t1 = thread::spawn(|| {
        println!("{:?}", time::Instant::now());
        let sleep = Duration::new(0,100_000);
        for _ in 0..100 {
            println!("Parking1");
            thread::park_timeout(sleep);
        }
    });

    let t2 = thread::spawn(|| {
        println!("{:?}", time::Instant::now());
        let sleep = Duration::new(0,100_000);
        for _ in 0..100 {
            println!("Parking2");
            thread::park_timeout(sleep);
        }
    });

    t1.join().expect("Couldn't join thread 1");
    t2.join().expect("Couldn't join thread 2");
}

Output

Instant { t: 18865250239116 }
Instant { t: 18865250405775 }
Parking1
Instant { t: 18865250430637 }
Parking2

@sfackler
Copy link
Member

sfackler commented Mar 8, 2019

Oh right sorry, you actually have to subtract two Instants to force that Once to complete. If you instead do time::Instant::now().elapsed() at the start of main, it prevents the deadlock.

jonhoo added a commit to jonhoo/bus that referenced this issue Mar 8, 2019
@lespea
Copy link
Author

lespea commented Mar 8, 2019

Works perfectly now! Thanks!

alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 3, 2019
This commit removes usage of `Once` from the internal implementation of
time utilities on OSX and Windows. It turns out that we accidentally hit
a deadlock today (rust-lang#59020) via events that look like:

* A thread invokes `park_timeout`
* Internally, only on OSX, `park_timeout` calls `Instant::elapsed`
* Inside of `Instant::elapsed` on OSX we enter a `Once` to initialize
  global timer data
* Inside of `Once`, it attempts to `park`

This means on the same stack frame, when there's contention, we're
calling `park` from inside `park_timeout`, causing a deadlock!

The solution implemented in this commit was to remove usage of `Once`
and instead just do a small dance with atomics. There's no real need we
need to guarantee that the global information is only learned once, only
that it's only *stored* once. This implementation may have multiple
threads invoke `mach_timebase_info`, but only one will store the global
information which will amortize the cost for all other threads.

A similar fix has been applied to windows to be uniform across our
implementations, but looking at the code on Windows no deadlock was
possible. This is purely just a consistency update for Windows and in
theory a slightly leaner implementation.

Closes rust-lang#59020
Centril added a commit to Centril/rust that referenced this issue Apr 3, 2019
std: Avoid usage of `Once` in `Instant`

This commit removes usage of `Once` from the internal implementation of
time utilities on OSX and Windows. It turns out that we accidentally hit
a deadlock today (rust-lang#59020) via events that look like:

* A thread invokes `park_timeout`
* Internally, only on OSX, `park_timeout` calls `Instant::elapsed`
* Inside of `Instant::elapsed` on OSX we enter a `Once` to initialize
  global timer data
* Inside of `Once`, it attempts to `park`

This means on the same stack frame, when there's contention, we're
calling `park` from inside `park_timeout`, causing a deadlock!

The solution implemented in this commit was to remove usage of `Once`
and instead just do a small dance with atomics. There's no real need we
need to guarantee that the global information is only learned once, only
that it's only *stored* once. This implementation may have multiple
threads invoke `mach_timebase_info`, but only one will store the global
information which will amortize the cost for all other threads.

A similar fix has been applied to windows to be uniform across our
implementations, but looking at the code on Windows no deadlock was
possible. This is purely just a consistency update for Windows and in
theory a slightly leaner implementation.

Closes rust-lang#59020
alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 4, 2019
This commit removes usage of `Once` from the internal implementation of
time utilities on OSX and Windows. It turns out that we accidentally hit
a deadlock today (rust-lang#59020) via events that look like:

* A thread invokes `park_timeout`
* Internally, only on OSX, `park_timeout` calls `Instant::elapsed`
* Inside of `Instant::elapsed` on OSX we enter a `Once` to initialize
  global timer data
* Inside of `Once`, it attempts to `park`

This means on the same stack frame, when there's contention, we're
calling `park` from inside `park_timeout`, causing a deadlock!

The solution implemented in this commit was to remove usage of `Once`
and instead just do a small dance with atomics. There's no real need we
need to guarantee that the global information is only learned once, only
that it's only *stored* once. This implementation may have multiple
threads invoke `mach_timebase_info`, but only one will store the global
information which will amortize the cost for all other threads.

A similar fix has been applied to windows to be uniform across our
implementations, but looking at the code on Windows no deadlock was
possible. This is purely just a consistency update for Windows and in
theory a slightly leaner implementation.

Closes rust-lang#59020
bors added a commit that referenced this issue Apr 4, 2019
std: Avoid usage of `Once` in `Instant`

This commit removes usage of `Once` from the internal implementation of
time utilities on OSX and Windows. It turns out that we accidentally hit
a deadlock today (#59020) via events that look like:

* A thread invokes `park_timeout`
* Internally, only on OSX, `park_timeout` calls `Instant::elapsed`
* Inside of `Instant::elapsed` on OSX we enter a `Once` to initialize
  global timer data
* Inside of `Once`, it attempts to `park`

This means on the same stack frame, when there's contention, we're
calling `park` from inside `park_timeout`, causing a deadlock!

The solution implemented in this commit was to remove usage of `Once`
and instead just do a small dance with atomics. There's no real need we
need to guarantee that the global information is only learned once, only
that it's only *stored* once. This implementation may have multiple
threads invoke `mach_timebase_info`, but only one will store the global
information which will amortize the cost for all other threads.

A similar fix has been applied to windows to be uniform across our
implementations, but looking at the code on Windows no deadlock was
possible. This is purely just a consistency update for Windows and in
theory a slightly leaner implementation.

Closes #59020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. O-macos Operating system: macOS T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants