Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove object lifetime cast #564

Merged
merged 3 commits into from
Mar 31, 2025

Conversation

BoxyUwU
Copy link
Contributor

@BoxyUwU BoxyUwU commented Mar 13, 2025

Hi there o/

I'm a member of the Rust Language Types Team; as part of stabilizing the arbitrary_self_types and derive_coerce_pointee features we may need to change what raw pointer casts are legal (rust-lang/rust#136702). Specifically, casting *const dyn Trait + 'a to *const dyn Trait + 'b where it is not able to be proven that 'a outlives 'b may become an error.

Without going into too much detail, the justification for this change be that casting trait object's lifetime bound to a lifetime that lives for a longer time could invalidate the VTable of the trait object, allowing for dispatching to methods that should not be callable. See this example from the linked issue:

#![forbid(unsafe_code)]
#![feature(arbitrary_self_types, derive_coerce_pointee)]

use std::any::TypeId;
use std::marker::{CoercePointee, PhantomData};

#[derive(CoercePointee)]
#[repr(transparent)]
struct SelfPtr<T: ?Sized>(*const T);

impl<T: ?Sized> std::ops::Deref for SelfPtr<T> {
    type Target = T;
    fn deref(&self) -> &T {
        panic!("please don't call me, I just want the `Receiver` impl!");
    }
}

trait GetTypeId {
    fn get_type_id(self: SelfPtr<Self>) -> TypeId
    where
        Self: 'static;
}

impl<T: ?Sized> GetTypeId for PhantomData<T> {
    fn get_type_id(self: SelfPtr<Self>) -> TypeId
    where
        Self: 'static,
    {
        TypeId::of::<T>()
    }
}

// no `T: 'static` bound necessary
fn type_id_of<T: ?Sized>() -> TypeId {
    let ptr = SelfPtr(
        &PhantomData::<T> as *const (dyn GetTypeId + '_) as *const (dyn GetTypeId + 'static),
    );
    ptr.get_type_id()
}

Unfortunately, going through the usual "future compatibility warning" process may turn out to not be possible as checking lifetime constraints without emitting an error is prohibitively difficult to do in the implementation of the borrow checker. This means that you may wind up never receiving a FCW and its associated grace period to update your code to be compatible with the new rules.

I have attempted to update your codebase to the new rules for you so that it will still compile if we make this change. I hope this potential breakage won't cause you too much trouble and that you might even find the post-migration code to be better than how it was previously :-)

Finally, it has not been decided yet whether we are to go through with this breaking change. If you feel that your code should continue to work as-is then please let me know and we'll try to take that into account when evaluating whether to go through with this breakage. Additionally if you have any questions about why this change is required please let me know and I'll do my best to further explain the justification.

@tobz
Copy link
Member

tobz commented Mar 16, 2025

@BoxyUwU Hi there~

Reading through the issue you linked (thanks for that!), I believe I grok the underlying problem, which appears to be that that raw pointer casts trivially allow extending the lifetime of trait object types behind a pointer, and in particular that this allows violating where clauses in specific cases.

Looking at the change in this PR, my two questions are:

  • is the proposed change simply one of being future-proofed? In the case that we changed the trait (Recorder) to introduce a lifetime bound on Self on one (or more) of the methods? (It seems like the answer is "yes", but just making sure.)
  • is there something that explains what the rules are for implicit lifetimes of trait objects?

Specific to the second question, I was caught up on your initial mention of casting *const dyn Trait + 'a to *const dyn Trait + 'b. In this code, we take in &dyn Recorder, which if I write it out in its full form would be &'a (dyn Recorder + 'b), I believe? Is that a correct way to think about what's actually going on with the current code?

@BoxyUwU
Copy link
Contributor Author

BoxyUwU commented Mar 22, 2025

In your case this breakage would indeed be more "preventative" than due to an actual soundness issue in your crate. Technically speaking your trait methods do already have lifetime bounds involving Self, though only implicitly and not really in ways that matter1.

The risk with casting lifetimes of dyn types only really comes into play when raw pointers (directly or indirectly) can be used as the receiver of an object safe trait. For example if Recorder introduced a new method such as fn foo(self: *const Self) where Self: 'static you could run into problems with being able to call foo with your pointer to the dyn Recorder with the lifetime having been extended via a pointer cast (note that this is only possible once arbitrary_self_types has been stabilized).

is there something that explains what the rules are for implicit lifetimes of trait objects?

The Rust Reference has a page on lifetime elision for dyn types, it might have some of the information you're looking for.

I can work through all the elided/unspecified lifetimes involved in your current implementation of LocalRecorderGuard::new which might help demonstrate things:

  • the LOCAL_RECORDER thread local's type is written as Cell<Option<NonNull<dyn Recorder>>> which is desugared as Cell<Option<NonNull<dyn Recorder + 'static>>>
  • fn new(recorder: &'a dyn Recorder) desugars to fn new(recorder: &'a (dyn Recorder + 'a))
  • recorder as *const _ is casting &'a (dyn Recorder + 'a) to *const (dyn Recorder + 'a). So far this is completely fine, the lifetime of the dyn type hasn't been changed.
  • then, the as *mut _ is casting the pointer of type *const (dyn Recorder + 'a) to *mut (dyn Recorder + 'static). The _ is inferred to be (dyn Recorder + 'static) based off the fact that you assign this pointer to dyn Recorder to the LOCAL_RECORDER thread local which has the dyn Recorder + 'static type. This is where the borrow checker may start erroring in the future as 'a: 'static does not hold

I realise in my updated version of your code I left some of the lifetimes elided still, it could alternatively be written like so (I can update the PR to be written like this instead if you'd prefer):

        let recorder_ptr = unsafe {
            std::mem::transmute::<*const (dyn Recorder + 'a), *mut (dyn Recorder + 'static)>(
                recorder as &'a (dyn Recorder + 'a),
            )
        };

Hopefully all of this helps you understand exactly what's going on with all of the elided lifetimes in your current code? Please let me know if you have questions :-)

Footnotes

  1. fn describe_counter(&self, ...) desugars to something like fn describe_counter<'a>(self: &'a Self, ...) where Self: 'a. This doesn't really matter as &Self means you're already in "unsafe code has created references that violate lifetime rules" territory and need to be careful about not leaking the reference to safe code. My understanding from briefly skimming your code is that you take care of this by only over handing safe code an &'a (dyn Recorder + 'b) where 'a and 'b are arbitrary lifetimes chosen by your unsafe code in fn with_recorder.

@tobz tobz added C-core Component: core functionality such as traits, etc. E-complex Effort: complex. T-refactor Type: refactor. labels Mar 25, 2025
@tobz
Copy link
Member

tobz commented Mar 25, 2025

Thank you so much for the explanation!

That roughly fits with my existing understanding, but it was helpful to see it confirmed with your step-by-step example.

I'll give this another pass, since I think we'll want to tweak the placeholder SAFETY comment you left.

@tobz
Copy link
Member

tobz commented Mar 25, 2025

Meant to write in the review message that you can feel free to desugar the lifetimes on the reference we pass to transmute since we're already trying to be explicit... so why not be as explicit as possible? :)

I'm also curious if you agree with my suggested explanation of the safety comment above the transmute usage.

@BoxyUwU
Copy link
Contributor Author

BoxyUwU commented Mar 27, 2025

I've updated to specify a safety comment and explicitly write out the lifetimes 👍

When looking over the safety comment I did realise something which is that I believe metrics has a pre-existing soundness bug:

#[test]
fn unsoundness() {
  struct Foo<'a>(&'a str);
  
  impl<'a> Recorder for Foo<'a> {
    fn describe_counter(
      &self,
      key: crate::KeyName,
      unit: Option<crate::Unit>,
      description: crate::SharedString,
    ) {
      dbg!(self.0);
    }
  
    /* snip... */
  }

  let s = "foo".to_string();
  let recorder = Foo(&s);
  let guard = set_default_local_recorder(&recorder);
  std::mem::forget(guard);
  drop(s);
  with_recorder(|rec| rec.describe_counter(KeyName::from_const_str(""), None, "".into()));
}

Running this under MIRI results in the following output:

test recorder::tests::unsoundness ... error: Undefined Behavior: constructing invalid value: encountered a dangling reference (use-after-free)
   --> metrics/src/recorder/mod.rs:264:17
    |
264 |                 dbg!(self.0);
    |                 ^^^^^^^^^^^^ constructing invalid value: encountered a dangling reference (use-after-free)
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
    = note: BACKTRACE on thread `recorder::tests`:
    = note: inside `<recorder::tests::unsoundness::Foo<'_> as recorder::Recorder>::describe_counter` at /home/boxy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/macros.rs:364:13: 364:16
note: inside closure
   --> metrics/src/recorder/mod.rs:315:29
    |
315 |         with_recorder(|rec| rec.describe_counter(KeyName::from_const_str(""), None, "".into()));
    |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure
   --> metrics/src/recorder/mod.rs:234:22
    |
234 |             unsafe { f(recorder.as_ref()) }
    |                      ^^^^^^^^^^^^^^^^^^^^
    = note: inside `std::thread::LocalKey::<std::cell::Cell<std::option::Option<std::ptr::NonNull<dyn recorder::Recorder>>>>::try_with::<{closure@metrics/src/recorder/mod.rs:228:25: 228:41}, ()>` at /home/boxy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:311:12: 311:27
    = note: inside `std::thread::LocalKey::<std::cell::Cell<std::option::Option<std::ptr::NonNull<dyn recorder::Recorder>>>>::with::<{closure@metrics/src/recorder/mod.rs:228:25: 228:41}, ()>` at /home/boxy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:275:15: 275:31
note: inside `recorder::with_recorder::<(), {closure@metrics/src/recorder/mod.rs:315:23: 315:28}>`
   --> metrics/src/recorder/mod.rs:228:5
    |
228 | /     LOCAL_RECORDER.with(|local_recorder| {
229 | |         if let Some(recorder) = local_recorder.get() {
...   |
240 | |     })
    | |______^
note: inside `recorder::tests::unsoundness`
   --> metrics/src/recorder/mod.rs:315:9
    |
315 |         with_recorder(|rec| rec.describe_counter(KeyName::from_const_str(""), None, "".into()));
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure
   --> metrics/src/recorder/mod.rs:254:21
    |
253 |     #[test]
    |     ------- in this procedural macro expansion
254 |     fn unsoundness() {
    |                     ^
    = note: this error originates in the macro `dbg` which comes from the expansion of the attribute macro `test` (in Nightly builds, run with -Z macro-backtrace for more info)

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to 1 previous error; 15 warnings emitted

It seems that set_default_local_recorder is written assuming that LocalRecorderGuard is guaranteed to have its Drop implementation run, allowing with_recorder to assume that if LOCAL_RECORDER contains a recorder it is valid to use it. You can read a bit about unsafe code and relying on values having Drop run on the mem::forget docs, aria also has a good blog post on this topic from back when this was first discovered and std had to migrate off of this pattern.

Unfortunately I don't have enough context about how metrics works to evaluate what a fix for this soundness bug would be. Accepting a &(dyn Recorder + 'static) instead of &'a (dyn Recorder + 'a) on set_default_local_recorder would resolve this but, as I said, I don't know metrics well enough to determine if that would just default the point of the function entirely :)

This should probably be broken out into its own issue as it's orthogonal to this PR 🤔 I do believe your safety comment is correct and the fault lies in with_recorder's safety comment:

// SAFETY: If we have a local recorder, we know that it is valid because it can only be set during the
// duration of a closure that is passed to `with_local_recorder`, which is the only time this method can be
// called and have a local recorder set. This ensures that the lifetime of the recorder is valid for the
// duration of this method call.

Which makes no note of set_default_local_recorder and implicitly relies on the default local recorder always being valid.

@BoxyUwU
Copy link
Contributor Author

BoxyUwU commented Mar 27, 2025

I also wanted to check if metrics would be willing to release patch versions for the old releases, 0.22, 0.23. A significant amount of the regressions we detected involving metrics were crates that depended on these old versions which wouldn't be able to upgrade to a fixed version of metrics without potential breaking changes. If that's not possible or just too much work then that's totally fine and something we are able to work with.

There's been some discussion on the lang side about whether to try and keep this code working in older editions and only break it over an edition so that people depending on older versions of metrics aren't broken. The motivation for this would likely be a lot less strong if these crates can just run a trivial cargo update ^^

@tobz
Copy link
Member

tobz commented Mar 27, 2025

I've updated to specify a safety comment and explicitly write out the lifetimes 👍

Thank you. 🙏🏻

When looking over the safety comment I did realise something which is that I believe metrics has a pre-existing soundness bug:

Ah, yeah.. I know my personal feelings here don't change the objective nature of that qualifying as a soundness hole, although if someone is explicitly forgetting the guard, I'd say they've dug their own UB grave. 😅 Overall, though, I agree: a real, albeit orthogonal, issue that warrants its own issue.

I'll open an issue in the next day or two once I can think through it a little more.

I also wanted to check if metrics would be willing to release patch versions for the old releases, 0.22, 0.23.

We could definitely do this, yes. I opened #570 to track this.

@tobz tobz merged commit 1e39283 into metrics-rs:main Mar 31, 2025
13 checks passed
@tobz tobz added the S-awaiting-release Status: awaiting a release to be considered fixed/implemented. label Mar 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-core Component: core functionality such as traits, etc. E-complex Effort: complex. S-awaiting-release Status: awaiting a release to be considered fixed/implemented. T-refactor Type: refactor.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants