Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make std::fs::initial_buffer_size a method, rename it and make it public #85084

Closed
wooster0 opened this issue May 8, 2021 · 12 comments
Closed
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR.

Comments

@wooster0
Copy link
Contributor

wooster0 commented May 8, 2021

There is cases where using std::fs::read_to_string or std::fs::read isn't the best thing to do or really cumbersome to do when your code is already shaped to actually create its own String and read a file into that. For instance imagine you have a file that you opened a while ago, did some processing on it like reading the metadata and now you decide to read it all into a String. std::fs::read_to_string would obviously be a very bad choice because it would open the file again. Instead it would be better to use std::Read::read_to_string which lets me read the content of an already existing File to a String. But then I wouldn't have the benefit that std::fs::read_to_string provides: an optimally allocated String.
In that case I would really like to use the private std::fs::initial_buffer_size function that the std already provides and uses internally in std::fs::read_to_string and std::fs::read so that I can efficiently preallocate my very own String in order to be as efficient as std::fs::read_to_string would be.
The biggest reason that ultimately drove me to open this issue is the fact that std::fs::read_to_string doesn't let me keep my file that it opened. There might even be people that would open the file twice to solve this. It's obviously very bad and leads to inefficiencies.

So, in summary, there is definitely cases where you want to do this:

use std::io::Read;
use std::fs;

fn main() -> std::io::Result<()> {
    let mut file = fs::File::open("hello")?;
    let mut string = String::new();
    file.read_to_string(&mut string)?;
    
    Ok(())
}

The problem with this is that it's not efficient because String is not preallocating with the file's size. But as I mentioned the std already provides us with a function that lets us compute just that, the optimal size, but we can't use it because it's private.

All in all what I'm saying is that initial_buffer_size being private leads to inefficiency, inconvenience, restrictiveness and possible boilerplate code because you might just copy-paste initial_buffer_size into your code just to be able to preallocate efficiently. And with that, it can lead to error-proneness as well.

I propose to

  • Make initial_buffer_size a method on fs::File taking &self because currently it takes &fs::File as its argument and I think it's just nicer like that.
  • Rename it to possibly optimal_buffer_size because I feel like that's going to be a bit clearer. Or get_optimal_buffer_size or get_optimal_buffer_capacity or optimal_buffer_capacity? I'm not 100% sure on which would be the most correct or appropriate here so I appreciate suggestions.
    In those names buffer refers to whatever structure you want to optimally allocate, not just limited to String of course.
  • Make it public.

If this is accepted, I would be willing to work on it.

@wooster0
Copy link
Contributor Author

wooster0 commented May 8, 2021

@rustbot label C-feature-request

@rustbot rustbot added the C-feature-request Category: A feature request, i.e: not implemented / a PR. label May 8, 2021
@the8472
Copy link
Member

the8472 commented May 8, 2021

Instead it would be better to use std::Read::read_to_string which lets me read the content of an already existing File to a String. But then I wouldn't have the benefit that std::fs::read_to_string provides: an optimally allocated String.

Instead the Read trait implementation for File could be improved to do the precise allocation. That way we don't have to expand the API surface of std.

@wooster0
Copy link
Contributor Author

wooster0 commented May 9, 2021

@the8472

Instead it would be better to use std::Read::read_to_string which lets me read the content of an already existing File to a String. But then I wouldn't have the benefit that std::fs::read_to_string provides: an optimally allocated String.

Instead the Read trait implementation for File could be improved to do the precise allocation. That way we don't have to expand the API surface of std.

If I'm guessing correctly here, is https://github.com/rust-lang/rust/blob/master/library/std/src/io/mod.rs#L733 the correct method to be changed? At that point we already have a buffer being passed to us so we would have to modify the existing one.
How would this work? Are you thinking of String::reserve_exact?
The problem with String::reserve_exact is that it is not as efficient as immediately preallocating with with_capacity would be.
Compare String::with_capacity to String::reserve_exact.
We need to create our own buffer.
I still believe we should give the user freedom and make it public.

@the8472
Copy link
Member

the8472 commented May 9, 2021

The problem with String::reserve_exact is that it is not as efficient as immediately preallocating with with_capacity would be.

Well, it creates a bit more assembly, sure. But since you start with an empty string it would still boil down to a single allocation and no byte-movement. The code for reallocation would be there but it wouldn't actually run.

If the goal is to avoid the incremental growth of the string then that can be achieved with reserve_exact (or reserve for that matter).

@wooster0
Copy link
Contributor Author

wooster0 commented May 9, 2021

it would still boil down to a single allocation and no byte-movement.

Unfortunately it does seem to boil down to a bit more than with_capacity as seen in this benchmark.

use criterion::{criterion_group, criterion_main, Criterion};

pub fn new_string1() -> String {
    String::with_capacity(10)
}

pub fn new_string2() -> String {
    let mut string = String::new();
    string.reserve_exact(10);
    string
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("new_string1", |b| b.iter(|| new_string1()));
    c.bench_function("new_string2", |b| b.iter(|| new_string2()));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
cargo bench
new_string1             time:   [9.5486 ns 9.5614 ns 9.5740 ns]                         
                        change: [-1.3351% -1.1752% -1.0227%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

new_string2             time:   [14.134 ns 14.157 ns 14.185 ns]                         
                        change: [+0.1010% +0.3440% +0.5897%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high severe

I ran it three times and new_string2 was always slower.

@the8472
Copy link
Member

the8472 commented May 9, 2021

That's for allocating and deallocating 10 bytes in a loop which is likely served from a thread-local allocation pool. I don't think that's realistic for the scenario given in OP since it omits the file IO and most likely involves much larger strings which will be served either from global pools or via yet another syscall to mmap + page faults which will dwarf all other costs.

@wooster0
Copy link
Contributor Author

wooster0 commented May 9, 2021

@the8472 can you make a diff demonstrating how exactly you imagine the new method to look like?

@the8472
Copy link
Member

the8472 commented May 9, 2021

Looking through impl Read for File it implements neither read_to_end nor read_to_string. So this would take some refactoring, not a simple one-method diff.
You'd have to take the default implementations from Read but change the internal free-standing function read_to_end which only ever reserves 32 bytes. So for this to be done some of the methods have to become pub(crate) and take additional arguments or something.

Also, initial_buffer_size wouldn't be appropriate for File::read_to_string anyway because the file may have been seeked, which means the number of bytes read may be much lower than its total size. So a separate implementation is needed anyway.

@wooster0
Copy link
Contributor Author

wooster0 commented May 9, 2021

I'm not sure I have the required expertise to implement what you proposed.
And there might be cases where you have an entirely different structure (other than the standard String or Vec) that you want an optimal size for. I believe in that case exposing initial_buffer_size allows for more flexibility and possibly more performance.

@the8472
Copy link
Member

the8472 commented May 9, 2021

And there might be cases where you have an entirely different structure (other than the standard String or Vec) that you want an optimal size for. I believe in that case exposing initial_buffer_size allows for more flexibility and possibly more performance.

That is quite a different use-case from File::read_to_string not having optimal allocation behavior.

Many uses of File already involve calling metadata() for different reasons and thus have access to its length without additional cost. A standard library function that introduces an additional call to metadata() may be convenient and thus appropriate for convenience functions such as fs::read_to_string but they are leading down the wrong path for code where you end up customizing behavior more.

In those cases existing metadata (if present) plus the seek position of the file would be more appropriate to get a size hint.

initial_buffer_size is only correct for things that read the whole file. For cases where you're operating on an already opened file you have to take the seek position into account.

@wooster0
Copy link
Contributor Author

wooster0 commented May 9, 2021

I agree. It's better to just let the user allocate by themself. Thanks for your input.

@wooster0 wooster0 closed this as completed May 9, 2021
@wooster0
Copy link
Contributor Author

wooster0 commented Oct 6, 2021

Instead it would be better to use std::Read::read_to_string which lets me read the content of an already existing File to a String. But then I wouldn't have the benefit that std::fs::read_to_string provides: an optimally allocated String.

Instead the Read trait implementation for File could be improved to do the precise allocation. That way we don't have to expand the API surface of std.

Just FYI, it seems this is now happening! #89582

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR.
Projects
None yet
Development

No branches or pull requests

3 participants