Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of Read::read_to_end and Read::read_to_string #89516

Closed
jkugelman opened this issue Oct 4, 2021 · 12 comments
Closed

Improve performance of Read::read_to_end and Read::read_to_string #89516

jkugelman opened this issue Oct 4, 2021 · 12 comments

Comments

@jkugelman
Copy link
Contributor

jkugelman commented Oct 4, 2021

In #89165 I updated Read::read_to_end to try to detect EOF when the input buffer is full by reading into a small "probe" buffer. If that read returns Ok(0) then it avoids unnecessarily doubling the capacity of the input buffer.

I think there'd be some value in trying to eliminate even the "probe buffer" call, but in the meantime, this seems like an improvement.

Originally posted by @joshtriplett in #89165 (comment)

Josh, did you have something in mind?

I thought a way: use read_vectored. I could add the probe buffer to a vectorized readv, which would eliminate the extra read syscall.

  1. Is that idea worth pursuing?

  2. Is it okay to simply switch the read call(s) to read_vectored, or would it be better to check is_read_vectored and have separate vectorized and non-vectorized code paths? I'm inclined to keep it simple and do the former.

@joshtriplett
Copy link
Member

joshtriplett commented Oct 4, 2021

My original thought was, if we have an ordinary file with a file size and it doesn't look like a placeholder value, we should just read that many bytes and stop.

@the8472
Copy link
Member

the8472 commented Oct 4, 2021

You're probably aware of the edge-case around procfs hence "it doesn't look like a placeholder value", but that isn't the only one. For example FUSE filesystems can make up any metadata. So we can't treat the size as anything but a hint. Only a zero-length read is the true indicator of EOF.

@joshtriplett
Copy link
Member

joshtriplett commented Oct 4, 2021

I'm aware that filesystems can give incorrect information. However, there's a lot of code out there that just does stat/read or stat/mmap, all of which breaks with an incorrect size. I don't think it would be unreasonable to make the same assumption.

FUSE filesystems presenting incorrect metadata that doesn't look like a placeholder value seem like a pathological case.

In any case, I think using readv seems like a good optimization. I do think you would need to make sure vectored reads are possible, since without that, it would just look like a partial read from the stub implementation rather than from the OS.

@the8472
Copy link
Member

the8472 commented Oct 4, 2021

However, there's a lot of code out there that just does stat/read or stat/mmap, all of which breaks with an incorrect size.

At least in the mmap case they're already accepting that this won't work on unusual files. If you're writing a rust program that replaces a bash script that reads from strange sources then you're more likely to reach for standard methods.

FUSE filesystems presenting incorrect metadata that doesn't look like a placeholder value seem like a pathological case.

In my case it was a bug caused by an incorrect size calculation that only affected the metadata but not the reads.

In any case, I think using readv seems like a good optimization.

That's probably going to work in practice (at least for regular files on linux on non-network filesystems?) but afaik POSIX doesn't guarantee that short reads can't happen on regular files. So readv is no different from reading with a larger buffer which is not the same as a zero-length read.

That's a 99% solution. If we want to accept the 1% of edge-cases, ok, but I think we should document when we're not spec-compliant.

@tavianator
Copy link
Contributor

If a process receives a signal in the middle of read() then it's likely to get a short read. I feel like read_to_end() should not misbehave in this case.

@joshtriplett
Copy link
Member

I think if we see that the file has an advertised length of N, and then we do a readv with an N-byte buffer and a 32-byte buffer, and the readv comes back with exactly N bytes, it doesn't seem unreasonable to assume that we've read the whole file.

(One of these days, I think the Linux kernel needs to add an interface for returning an affirmative EOF indication within the same syscall. Or perhaps we'll just use io_uring and submit two read syscalls.)

@jkugelman
Copy link
Contributor Author

That's probably going to work in practice (at least for regular files on linux on non-network filesystems?) but afaik POSIX doesn't guarantee that short reads can't happen on regular files. So readv is no different from reading with a larger buffer which is not the same as a zero-length read.

That's a good point. readv won't treat the spillover buffer specially. A short read isn't necessarily caused by EOF. It's really unlikely but within the realm of possibility that a read could fill up the buffer, run short, and yet not signal EOF. I don't think I'll be pursuing the readv idea.

@jkugelman
Copy link
Contributor Author

jkugelman commented Oct 5, 2021

On a different note, it'd be a big win to have read_to_end query the file size and set the initial capacity to match. Looking through the rust codebase it is an incredibly common pattern to do:

let mut data = Vec::new();
file.read_to_end(&mut data)?;

It would take an extra stat syscall but it would greatly reduce the number of reads when the empty vector is "warming up". I checked strace: if you pass an empty vector read_to_end currently reads 32, 32, 64, 128, 256, 512, etc. Two 32-byte reads and then it slowly doubles the read sizes. That's really suboptimal.

And the capacity is just a hint so guessing the wrong size wouldn't affect correctness at all.

Additionally:

  • std::fs::read already has this exact optimization, yet there is a substantial amount of code that calls file.read_to_end instead.
  • Read::read_to_end can't perform this optimization. It could only be done on Read + Seek. Or on File.

@joshtriplett
Copy link
Member

joshtriplett commented Oct 5, 2021

@jkugelman I looked through the entire compiler and standard library for every instance of .read_to_end and .read_to_string, and found only one case (outside of tests) where it seemed to be reading from something reasonably predictable rather than a generic reader or a buffer or similar but without using with_capacity. I submitted #89546 for that case.

@jkugelman
Copy link
Contributor Author

jkugelman commented Oct 5, 2021

Specializing read_to_end for File would provide a lot of benefit, not just for the rust codebase but for the whole ecosystem. I haven't looked but I expect there are lots of examples of reading into an empty Vec out in the wild.

I spotted a number of other places in rust that use this pattern. Here's what I found in case I get hit by a bus before I can work on the PR:

File

compiler/rustc_data_structures/src/memmap.rs
    pub unsafe fn map(mut file: File) -> io::Result<Self> {
---
        let mut data = Vec::new();
        file.read_to_end(&mut data)?;

src/tools/rls/rls-vfs/src/lib.rs
        let mut file = match fs::File::open(file_name) {
--
        let mut buf = vec![];
        if file.read_to_end(&mut buf).is_err() {

src/tools/cargo/crates/cargo-util/src/paths.rs
        let mut f = OpenOptions::new()
--
            .open(&path)?;
        let mut orig = Vec::new();
        f.read_to_end(&mut orig)?;

BufReader

BufReader defers directly to the std::io module's private read_to_end free function. It ought to call inner.read_to_end in case inner has a specialized implementation.

src/tools/cargo/src/cargo/util/lockserver.rs
                    let mut dst = Vec::new();
                    drop(client.read_to_end(&mut dst));

&dyn Read

&dyn Read could benefit the reader is a File or other type with a specialized implementation.

compiler/rustc_serialize/src/json.rs
pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> {
    let mut contents = Vec::new();
    match rdr.read_to_end(&mut contents) {

Examples

Read's examples and read_to_end's examples show this very pattern!

library/std/src/io/mod.rs
///     let mut f = File::open("foo.txt")?;
--
///     let mut buffer = Vec::new();
///     // read the whole file
///     f.read_to_end(&mut buffer)?;
--
    ///     let mut f = File::open("foo.txt")?;
--
    ///     let mut buffer = Vec::new();
    ///
    ///     // read the whole file
    ///     f.read_to_end(&mut buffer)?;

library/stdarch/examples/hex.rs
    let mut input = Vec::new();
    io::stdin().read_to_end(&mut input).unwrap();

@jkugelman jkugelman changed the title Remove EOF "probe" read in Read::read_to_end Improve Read::read_to_end performance Oct 5, 2021
@jkugelman jkugelman changed the title Improve Read::read_to_end performance Improve performance of Read::read_to_end and Read::read_to_string Oct 5, 2021
@jkugelman
Copy link
Contributor Author

jkugelman commented Oct 6, 2021

PR #89582 submitted.

jkugelman added a commit to jkugelman/rust that referenced this issue Oct 7, 2021
Reading a file into an empty vector or string buffer can incur
unnecessary `read` syscalls and memory re-allocations as the buffer
"warms up" and grows to its final size. This is perhaps a necessary evil
with generic readers, but files can be read in smarter by checking the
file size and reserving that much capacity.

`std::fs::read` and `read_to_string` already perform this optimization:
they open the file, reads its metadata, and call `with_capacity` with
the file size. This ensures that the buffer does not need to be resized
and an initial string of small `read` syscalls.

However, if a user opens the `File` themselves and calls
`file.read_to_end` or `file.read_to_string` they do not get this
optimization.

```rust
let mut buf = Vec::new();
file.read_to_end(&mut buf)?;
```

I searched through this project's codebase and even here are a *lot* of
examples of this. They're found all over in unit tests, which isn't a
big deal, but there are also several real instances in the compiler and
in Cargo. I've documented the ones I found in a comment here:

rust-lang#89516 (comment)

Most telling, the `Read` trait and the `read_to_end` method both show
this exact pattern as examples of how to use readers. What this says to
me is that this shouldn't be solved by simply fixing the instances of it
in this codebase. If it's here it's certain to be prevalent in the wider
Rust ecosystem.

To that end, this commit adds specializations of `read_to_end` and
`read_to_string` directly on `File`. This way it's no longer a minor
footgun to start with an empty buffer when reading a file in.

A nice side effect of this change is that code that accesses a `File` as
a bare `Read` constraint or via a `dyn Read` trait object will benefit.
For example, this code from `compiler/rustc_serialize/src/json.rs`:

```rust
pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> {
    let mut contents = Vec::new();
    match rdr.read_to_end(&mut contents) {
```

Related changes:

- I also added specializations to `BufReader` to delegate to
  `self.inner`'s methods. That way it can call `File`'s optimized
  implementations if the inner reader is a file.

- The private `std::io::append_to_string` function is now marked
  `unsafe`.

- `File::read_to_string` being more efficient means that the performance
  note for `io::read_to_string` can be softened. I've added @camelid's
  suggested wording from:

  rust-lang#80218 (comment)
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 9, 2021
…r=joshtriplett

Optimize File::read_to_end and read_to_string

Reading a file into an empty vector or string buffer can incur unnecessary `read` syscalls and memory re-allocations as the buffer "warms up" and grows to its final size. This is perhaps a necessary evil with generic readers, but files can be read in smarter by checking the file size and reserving that much capacity.

`std::fs::read` and `std::fs::read_to_string` already perform this optimization: they open the file, reads its metadata, and call `with_capacity` with the file size. This ensures that the buffer does not need to be resized and an initial string of small `read` syscalls.

However, if a user opens the `File` themselves and calls `file.read_to_end` or `file.read_to_string` they do not get this optimization.

```rust
let mut buf = Vec::new();
file.read_to_end(&mut buf)?;
```

I searched through this project's codebase and even here are a *lot* of examples of this. They're found all over in unit tests, which isn't a big deal, but there are also several real instances in the compiler and in Cargo. I've documented the ones I found in a comment here:

rust-lang#89516 (comment)

Most telling, the documentation for both the `Read` trait and the `Read::read_to_end` method both show this exact pattern as examples of how to use readers. What this says to me is that this shouldn't be solved by simply fixing the instances of it in this codebase. If it's here it's certain to be prevalent in the wider Rust ecosystem.

To that end, this commit adds specializations of `read_to_end` and `read_to_string` directly on `File`. This way it's no longer a minor footgun to start with an empty buffer when reading a file in.

A nice side effect of this change is that code that accesses a `File` as `impl Read` or `dyn Read` will benefit. For example, this code from `compiler/rustc_serialize/src/json.rs`:

```rust
pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> {
    let mut contents = Vec::new();
    match rdr.read_to_end(&mut contents) {
```

Related changes:

- I also added specializations to `BufReader` to delegate to `self.inner`'s methods. That way it can call `File`'s optimized  implementations if the inner reader is a file.

- The private `std::io::append_to_string` function is now marked `unsafe`.

- `File::read_to_string` being more efficient means that the performance note for `io::read_to_string` can be softened. I've added `@camelid's` suggested wording from rust-lang#80218 (comment).

r? `@joshtriplett`
@noproto
Copy link
Contributor

noproto commented Oct 25, 2021

Regression: #90263

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants