Optimize ToString implementation for integers #136264

Open · wants to merge 2 commits into base: master
48 changes: 48 additions & 0 deletions library/alloc/src/string.rs
@@ -2795,7 +2795,54 @@ impl SpecToString for bool {
}
}

macro_rules! impl_to_string {
    ($($signed:ident, $unsigned:ident,)*) => {
        $(
        #[cfg(not(no_global_oom_handling))]
        #[cfg(not(feature = "optimize_for_size"))]
        impl SpecToString for $signed {
            #[inline]
            fn spec_to_string(&self) -> String {
                const SIZE: usize = $signed::MAX.ilog(10) as usize + 1;
                let mut buf = [core::mem::MaybeUninit::<u8>::uninit(); SIZE];
                // Only difference between signed and unsigned are these 8 lines.
                let mut out;
                if *self < 0 {
                    out = String::with_capacity(SIZE + 1);
                    out.push('-');
                } else {
                    out = String::with_capacity(SIZE);
                }

                out.push_str(self.unsigned_abs()._fmt(&mut buf));
                out
            }
        }
        #[cfg(not(no_global_oom_handling))]
        #[cfg(not(feature = "optimize_for_size"))]
        impl SpecToString for $unsigned {
            #[inline]
            fn spec_to_string(&self) -> String {
                const SIZE: usize = $unsigned::MAX.ilog(10) as usize + 1;
                let mut buf = [core::mem::MaybeUninit::<u8>::uninit(); SIZE];

                self._fmt(&mut buf).to_string()
            }
        }
        )*
    }
}

impl_to_string! {
    i8, u8,
    i16, u16,
    i32, u32,
    i64, u64,
    isize, usize,
}

#[cfg(not(no_global_oom_handling))]
#[cfg(feature = "optimize_for_size")]
impl SpecToString for u8 {
    #[inline]
    fn spec_to_string(&self) -> String {
@@ -2815,6 +2862,7 @@ impl SpecToString for u8 {
}

#[cfg(not(no_global_oom_handling))]
#[cfg(feature = "optimize_for_size")]
impl SpecToString for i8 {
    #[inline]
    fn spec_to_string(&self) -> String {
Member

IIUC, it looks like there are two separate places we now have size and non-size optimized code for printing integers (cfgs in core/src/fmt/num.rs, and here). Could we perhaps unify on just one place where the full set of code lives?

Part of why I'm asking is that there seem to be some strange choices (IMO):

  • fast ToString for i{8 to 64} => String with capacity for maximum sized integer (e.g., 0u64.to_string() will give me a String with capacity for at least 20 bytes)
  • fast ToString for u{8 to 64} => dispatches through &str to String, so will perfectly size the heap buffer based on the actual length
  • small ToString for u8/i8 => maximum sized integer allocations
    • these override the support in core for _fmt on u8/i8

So for the byte types (u8/i8) there's actually 4 separate pieces of code that we are maintaining:

  • Size-optimized core::fmt::num Display impl (used IIUC for {} formatting)
  • Fast-optimized core::fmt::num Display impl (used IIUC for {} formatting)
  • Size-optimized alloc SpecToString impl (for .to_string())
  • Fast-optimized alloc SpecToString impl (for .to_string()) -- defers now to core::fmt::num

Plus, IIUC the signed core::fmt::num impl is now only reachable via Display, never via .to_string(), which also seems like an odd decision.

I also don't see much in the way of rationale for why we are making certain tradeoffs (e.g., why single-byte types are special cased here, but not for Display). Maybe we can file a tracking issue of some kind and lay out a plan for what we're envisioning the end state to be? The individual changes here are, I guess, fine, but it doesn't seem like we're moving towards a specific vision, rather just tweaking to optimize a particular metric.
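
As a concrete illustration of the capacity asymmetry noted in the list above (a sketch against the impls in this diff, not code from the PR; exact capacities are allocator-dependent):

fn main() {
    // Signed path: the String is pre-sized for the widest possible i64
    // (String::with_capacity(19)), regardless of the value's actual width.
    let s = 0i64.to_string();
    println!("i64: len = {}, capacity = {}", s.len(), s.capacity());

    // Unsigned path: &str -> String, so the heap buffer matches the
    // formatted length (1 byte here).
    let u = 0u64.to_string();
    println!("u64: len = {}, capacity = {}", u.len(), u.capacity());
}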

Member Author

Do you want it to be part of this PR or as a follow-up?

Member

I'd personally rather see a plan and cleanup work, followed by "random" changes.

Member Author

So the plan is: make integer-to-string conversion faster. There were a few problems when I started working on this:

  1. The buffer used to store the string output for the integer was always the size of the biggest integer (64 bits), which is suboptimal for smaller integers.
  2. We had an extra loop which was never entered for smaller integers (i8/u8), but since we were casting all integers into u64 before converting to string, this optimization was missed.
  3. The ToString implementation uses the same code, which relies on Formatter, meaning that all the Formatter code (in short, checking the internal flags) was still run even though it was never actually needed.

Points 1 and 2 were fixed in #128204. This PR fixes the last one.

Now about the optimize_for_size feature usage: these optimizations require specialized code, which means a lot more code. And because the code to convert integers to string differs depending on whether or not the optimize_for_size feature is enabled, I can't share a single implementation, because the internal API changes.

Does that answer your question? Don't hesitate to ask if something isn't clear.
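
For reference, the per-type buffer sizes implied by the `MAX.ilog(10) + 1` expression used in the diff work out as follows (an illustrative const check, not code from the PR):

// Decimal digit counts of each type's MAX value, i.e. the stack buffer sizes
// the specialized impls would use instead of always reserving room for u64.
const _: () = {
    assert!(u8::MAX.ilog(10) as usize + 1 == 3); // 255
    assert!(u16::MAX.ilog(10) as usize + 1 == 5); // 65_535
    assert!(u32::MAX.ilog(10) as usize + 1 == 10); // 4_294_967_295
    assert!(u64::MAX.ilog(10) as usize + 1 == 20); // 18_446_744_073_709_551_615
    assert!(i64::MAX.ilog(10) as usize + 1 == 19); // 9_223_372_036_854_775_807
};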

Member

What benchmarks are we using to evaluate "make it faster"? I don't see results in this PR (and e.g. the description explicitly calls out not being sure how to check).

Does the formatter flag checking not get optimized out after inlining? If not, maybe we can improve on that, for example by dispatching early on defaults or similar?

I'm not personally convinced that having N different implementations (can you confirm all of the cases are at least equally covered by tests?) is worth what I'd expect to be marginal improvements (this is where concrete numbers would be useful to justify this), especially when I'd expect that in most cases if you want the fastest possible serialization, ToString is not what you want -- it's pretty unlikely that an owned String containing just the integer is all you need, and the moment you want more than that you're going to be reaching for e.g. itoa to avoid heap allocations etc.

Member Author

> What benchmarks are we using to evaluate "make it faster"? I don't see results in this PR (and e.g. the description explicitly calls out not being sure how to check).

I posted a comparison with and without these changes in this comment. A lot less assembly code is generated; however, this difference alone is not enough to know the impact on performance. I wrote benchmarks, but this is not my specialty and all I could get was a 1-2% performance difference (which is already significant, but did I write the benches correctly? That's another question).

> Does the formatter flag checking not get optimized out after inlining? If not, maybe we can improve on that, for example by dispatching early on defaults or similar?

No, and unless you add a new field to Formatter indicating that no field was updated and that the checks can be skipped, I don't see how you could get this optimization.

> I'm not personally convinced that having N different implementations (can you confirm all of the cases are at least equally covered by tests?) is worth what I'd expect to be marginal improvements (this is where concrete numbers would be useful to justify this), especially when I'd expect that in most cases if you want the fastest possible serialization, ToString is not what you want -- it's pretty unlikely that an owned String containing just the integer is all you need, and the moment you want more than that you're going to be reaching for e.g. itoa to avoid heap allocations etc.

I'm not convinced either, but since the optimize_for_size feature flag exists, I need to deal with it. As for test coverage, I have no idea for optimize_for_size, but the conversions are tested in the "normal" case. In any case, this code doesn't change the behaviour of optimize_for_size, so on that side we're good.

Also, I'm not looking for the fastest implementation; I'm looking at improving the current situation, which is really suboptimal. We could add a new write_into<W: Write>(self, &mut W) method to get something as fast as itoa, but that's a whole other discussion and I don't plan to start it. My plan ends with this PR. Also to be noted: with this PR, the only remaining difference from itoa is that we don't allow writing an integer into an existing String; everything else is exactly the same.
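
For what it's worth, a rough sketch of what such a write_into-style helper could look like (the name, signature, and implementation here are hypothetical, not an existing or proposed std API):

use std::fmt::{self, Write};

// Format an unsigned integer directly into any fmt::Write sink (e.g. an
// existing String), avoiding a temporary heap allocation per call.
fn write_into<W: Write>(mut n: u64, out: &mut W) -> fmt::Result {
    // Stack buffer large enough for u64::MAX (20 decimal digits).
    let mut buf = [0u8; 20];
    let mut pos = buf.len();
    loop {
        pos -= 1;
        buf[pos] = b'0' + (n % 10) as u8;
        n /= 10;
        if n == 0 {
            break;
        }
    }
    // The bytes written are ASCII digits, so the slice is valid UTF-8.
    out.write_str(std::str::from_utf8(&buf[pos..]).expect("ASCII digits"))
}

fn main() {
    let mut s = String::from("value: ");
    write_into(12345, &mut s).unwrap();
    assert_eq!(s, "value: 12345");
}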

Anyway, the PR is ready. It has a visible impact on at least the generated assembly, which is notably smaller because all the Formatter code is skipped. It doesn't change the behaviour of optimize_for_size and adds a very small amount of code. Having this kind of small optimization in places like integer-to-string conversion is always very welcome.

Member

> A lot less assembly code is generated; however, this difference alone is not enough to know the impact on performance. I wrote benchmarks, but this is not my specialty and all I could get was a 1-2% performance difference (which is already significant, but did I write the benches correctly? That's another question).

Can you provide these benchmarks and the raw numbers you produced? Perhaps add them as benches to the code, so they can be run by others, or extend rustc-perf's runtime benchmark suite.

Smaller assembly is (as you say) no real indicator of performance (though it is nice), so I'm not sure it really means much by itself.

> I'm not convinced either, but since the optimize_for_size feature flag exists, I need to deal with it. As for test coverage, I have no idea for optimize_for_size, but the conversions are tested in the "normal" case. In any case, this code doesn't change the behaviour of optimize_for_size, so on that side we're good.

This PR is still adding implementations that could get called (regardless of optimize_for_size) that didn't exist before it (taking us from 2 to 4 impls IIUC). Can you point concretely at some test coverage for each of the 4 impls (source links)? If not, then we really ought to add it, especially when there's a bunch of specialization involved in dispatch.

> adds a very small amount of code. Having this kind of small optimization in places like integer-to-string conversion is always very welcome.

I disagree with this assertion. We have to balance the cost of maintenance, and while this code is important, it sounds like we don't actually hit the itoa perf anyway with these changes. It's nice to be a bit faster, but I'm not convinced that a few percent is worth an extra 2 code paths (presuming I counted correctly) in this code, especially with rust-lang/libs-team#546 / #138215 expected to come soon and possibly add 2 more paths (or at least become the "high performance" path).

Member Author

Can you provide these benchmarks and the raw numbers you produced? Perhaps add them as benches to the code, so they can be run by others, or extend rustc-perf's runtime benchmark suite.

Considering how specific this is, I'm not sure it's worth adding to rustc-perf. As for adding them into the codebase, I'll need to ensure that they're correctly written first.

Here is the code I used:

#![feature(test)]

extern crate test;

use test::{Bencher, black_box};

#[inline(always)]
fn convert_to_string<T: ToString>(n: T) -> String {
    n.to_string()
}

macro_rules! decl_benches {
    ($($name:ident: $ty:ident,)+) => {
        $(
            #[bench]
            fn $name(c: &mut Bencher) {
                c.iter(|| convert_to_string(black_box({ let nb: $ty = 20; nb })));
            }
        )+
    }
}

decl_benches! {
    bench_u8: u8,
    bench_i8: i8,
    bench_u16: u16,
    bench_i16: i16,
    bench_u32: u32,
    bench_i32: i32,
    bench_u64: u64,
    bench_i64: i64,
}

The results are:

| name | 1.87.0-nightly (3ea711f 2025-03-09) | With this PR | diff |
|---|---|---|---|
| bench_i16 | 32.06 ns/iter (+/- 0.12) | 17.62 ns/iter (+/- 0.03) | -45% |
| bench_i32 | 31.61 ns/iter (+/- 0.04) | 15.10 ns/iter (+/- 0.06) | -52% |
| bench_i64 | 31.71 ns/iter (+/- 0.07) | 15.02 ns/iter (+/- 0.20) | -52% |
| bench_i8 | 13.21 ns/iter (+/- 0.14) | 14.93 ns/iter (+/- 0.16) | +13% |
| bench_u16 | 31.20 ns/iter (+/- 0.06) | 16.14 ns/iter (+/- 0.11) | -48% |
| bench_u32 | 33.27 ns/iter (+/- 0.05) | 16.18 ns/iter (+/- 0.10) | -51% |
| bench_u64 | 31.44 ns/iter (+/- 0.06) | 16.62 ns/iter (+/- 0.21) | -47% |
| bench_u8 | 10.57 ns/iter (+/- 0.30) | 13.00 ns/iter (+/- 0.43) | +22% |

I have to admit I'm a bit surprised, as I didn't remember the difference being this big... But in any case, seeing how big the difference is, I wonder whether the benches are correctly written (hence why I asked for help with them).

> Smaller assembly is (as you say) no real indicator of performance (though it is nice), so I'm not sure it really means much by itself.

Yep, hence why I wrote benches. :)

> This PR is still adding implementations that could get called (regardless of optimize_for_size) that didn't exist before it (taking us from 2 to 4 impls IIUC). Can you point concretely at some test coverage for each of the 4 impls (source links)? If not, then we really ought to add it, especially when there's a bunch of specialization involved in dispatch.

There is no complete test for this as far as I can see, only some small checks like tests/ui/traits/to-str.rs and test_simple_types in library/alloc/tests/string.rs.

Might be worth adding one?
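
A minimal sketch of what such a test could look like (the test name and its placement are hypothetical), cross-checking .to_string() against Display for boundary values of every specialized type:

#[test]
fn to_string_matches_display_for_integer_boundaries() {
    macro_rules! check {
        ($($t:ident),+ $(,)?) => {
            $(
                assert_eq!(<$t>::MIN.to_string(), format!("{}", <$t>::MIN));
                assert_eq!(<$t>::MAX.to_string(), format!("{}", <$t>::MAX));
                assert_eq!((0 as $t).to_string(), "0");
            )+
        };
    }
    check!(i8, u8, i16, u16, i32, u32, i64, u64, isize, usize);
}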

> I disagree with this assertion. We have to balance the cost of maintenance, and while this code is important, it sounds like we don't actually hit the itoa perf anyway with these changes. It's nice to be a bit faster, but I'm not convinced that a few percent is worth an extra 2 code paths (presuming I counted correctly) in this code, especially with rust-lang/libs-team#546 / #138215 expected to come soon and possibly add 2 more paths (or at least become the "high performance" path).

I can help with maintaining this code. The plan is not to be as good as itoa (which isn't possible anyway with the current API) but to be as good as possible within the current limitations. The improvements seem noticeable and I think they are worth it. I also think that integer-to-string conversion is very common, and even if we provide new APIs to handle it (which would be nice), this code path is used enough to make a nice impact in existing codebases.

Member

Adding #[inline] to the fmt::Display impl for $unsigned seems to allow removing at least some of the indirection through the formatting infrastructure (down to a direct invocation of _fmt in the user-written code calling .to_string(), as I'd expect). It's unavoidable that we still retain the fmt::Formatter in the current version of things, but that would go away if we applied some of the changes in this PR which make _fmt take a &mut [...] rather than the fmt::Formatter.

IMO, the complexity this adds, especially without thorough testing being added for the new code (which it sounds like you agree may be missing!), is not something I'm prepared to r+ -- I agree that there is opportunity here, but I don't think this PR is the right shape for it. For the users that really do care about formatting integers at speed, this PR probably does very little, since it retains the inability to inline fmt calls on integers. Only when calling .to_string() is this maybe a win, and I continue to maintain that's just not something you need to be particularly fast at -- the per-integer allocation will quite possibly dominate the cost there.

If you want to find a libs reviewer willing to merge these changes, I'm not going to stop them; feel free to re-roll. Otherwise, revising this PR to add the test cases (or adding them separately in a different PR that's just tests) would help build confidence that this change is at least correct (and will continue to be so). But I continue to feel uncomfortable with the duplication this introduces, so I don't think it would be enough for me to approve this (but I consider it the minimum necessary for std to accept a change like this).

Member Author

Adding the tests in another PR as a first step would be a great change in any case, so let's pause this one until tests have been added.

28 changes: 19 additions & 9 deletions library/core/src/fmt/num.rs
@@ -208,7 +208,11 @@ macro_rules! impl_Display {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        #[cfg(not(feature = "optimize_for_size"))]
        {
            self._fmt(true, f)
            const MAX_DEC_N: usize = $unsigned::MAX.ilog(10) as usize + 1;
            // Buffer decimals for $unsigned with right alignment.
            let mut buf = [MaybeUninit::<u8>::uninit(); MAX_DEC_N];

            f.pad_integral(true, "", self._fmt(&mut buf))
        }
        #[cfg(feature = "optimize_for_size")]
        {
@@ -222,7 +226,11 @@
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        #[cfg(not(feature = "optimize_for_size"))]
        {
            return self.unsigned_abs()._fmt(*self >= 0, f);
            const MAX_DEC_N: usize = $unsigned::MAX.ilog(10) as usize + 1;
            // Buffer decimals for $unsigned with right alignment.
            let mut buf = [MaybeUninit::<u8>::uninit(); MAX_DEC_N];

            f.pad_integral(*self >= 0, "", self.unsigned_abs()._fmt(&mut buf))
        }
        #[cfg(feature = "optimize_for_size")]
        {
@@ -233,10 +241,13 @@

#[cfg(not(feature = "optimize_for_size"))]
impl $unsigned {
    fn _fmt(self, is_nonnegative: bool, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        const MAX_DEC_N: usize = $unsigned::MAX.ilog(10) as usize + 1;
        // Buffer decimals for $unsigned with right alignment.
        let mut buf = [MaybeUninit::<u8>::uninit(); MAX_DEC_N];
    #[doc(hidden)]
    #[unstable(
        feature = "fmt_internals",
        reason = "internal routines only exposed for testing",
        issue = "none"
    )]
    pub fn _fmt<'a>(self, buf: &'a mut [MaybeUninit::<u8>]) -> &'a str {
        // Count the number of bytes in buf that are not initialized.
        let mut offset = buf.len();
        // Consume the least-significant decimals from a working copy.
@@ -301,13 +312,12 @@
        // SAFETY: All buf content since offset is set.
        let written = unsafe { buf.get_unchecked(offset..) };
        // SAFETY: Writes use ASCII from the lookup table exclusively.
        let as_str = unsafe {
        unsafe {
            str::from_utf8_unchecked(slice::from_raw_parts(
                MaybeUninit::slice_as_ptr(written),
                written.len(),
            ))
        };
        f.pad_integral(is_nonnegative, "", as_str)
        }
    }
})*

2 changes: 1 addition & 1 deletion tests/ui/codegen/equal-pointers-unequal/as-cast/inline2.rs
@@ -23,7 +23,7 @@ fn main() {
        let v = 0;
        &v as *const _ as usize
    };
    assert_eq!(a.to_string(), b.to_string());
    assert_eq!(format!("{a}"), format!("{b}"));
    assert_eq!(format!("{}", a == b), "true");
    assert_eq!(format!("{}", cmp_in(a, b)), "true");
    assert_eq!(format!("{}", cmp(a, b)), "true");
2 changes: 1 addition & 1 deletion tests/ui/codegen/equal-pointers-unequal/as-cast/zero.rs
@@ -21,7 +21,7 @@ fn main() {
    // It's not zero, which means `a` and `b` are not equal.
    assert_ne!(i, 0);
    // But it looks like zero...
    assert_eq!(i.to_string(), "0");
    assert_eq!(format!("{i}"), "0");
    // ...and now it *is* zero?
    assert_eq!(i, 0);
    // So `a` and `b` are equal after all?
@@ -25,7 +25,7 @@ fn main() {
        let v = 0;
        ptr::from_ref(&v).expose_provenance()
    };
    assert_eq!(a.to_string(), b.to_string());
    assert_eq!(format!("{a}"), format!("{b}"));
    assert_eq!(format!("{}", a == b), "true");
    assert_eq!(format!("{}", cmp_in(a, b)), "true");
    assert_eq!(format!("{}", cmp(a, b)), "true");
@@ -23,7 +23,7 @@ fn main() {
    // It's not zero, which means `a` and `b` are not equal.
    assert_ne!(i, 0);
    // But it looks like zero...
    assert_eq!(i.to_string(), "0");
    assert_eq!(format!("{i}"), "0");
    // ...and now it *is* zero?
    assert_eq!(i, 0);
    // So `a` and `b` are equal after all?
@@ -25,7 +25,7 @@ fn main() {
        let v = 0;
        ptr::from_ref(&v).addr()
    };
    assert_eq!(a.to_string(), b.to_string());
    assert_eq!(format!("{a}"), format!("{b}"));
    assert_eq!(format!("{}", a == b), "true");
    assert_eq!(format!("{}", cmp_in(a, b)), "true");
    assert_eq!(format!("{}", cmp(a, b)), "true");
@@ -23,7 +23,7 @@ fn main() {
    // It's not zero, which means `a` and `b` are not equal.
    assert_ne!(i, 0);
    // But it looks like zero...
    assert_eq!(i.to_string(), "0");
    assert_eq!(format!("{i}"), "0");
    // ...and now it *is* zero?
    assert_eq!(i, 0);
    // So `a` and `b` are equal after all?