Improve performance of Read::read_to_end and Read::read_to_string #89516

jkugelman · 2021-10-04T11:12:46Z

In #89165 I updated Read::read_to_end to try to detect EOF when the input buffer is full by reading into a small "probe" buffer. If that read returns Ok(0) then it avoids unnecessarily doubling the capacity of the input buffer.

I think there'd be some value in trying to eliminate even the "probe buffer" call, but in the meantime, this seems like an improvement.

Originally posted by @joshtriplett in #89165 (comment)

Josh, did you have something in mind?

I thought a way: use read_vectored. I could add the probe buffer to a vectorized readv, which would eliminate the extra read syscall.

Is that idea worth pursuing?
Is it okay to simply switch the read call(s) to read_vectored, or would it be better to check is_read_vectored and have separate vectorized and non-vectorized code paths? I'm inclined to keep it simple and do the former.

The text was updated successfully, but these errors were encountered:

joshtriplett · 2021-10-04T15:13:59Z

My original thought was, if we have an ordinary file with a file size and it doesn't look like a placeholder value, we should just read that many bytes and stop.

the8472 · 2021-10-04T17:24:11Z

You're probably aware of the edge-case around procfs hence "it doesn't look like a placeholder value", but that isn't the only one. For example FUSE filesystems can make up any metadata. So we can't treat the size as anything but a hint. Only a zero-length read is the true indicator of EOF.

joshtriplett · 2021-10-04T17:37:36Z

I'm aware that filesystems can give incorrect information. However, there's a lot of code out there that just does stat/read or stat/mmap, all of which breaks with an incorrect size. I don't think it would be unreasonable to make the same assumption.

FUSE filesystems presenting incorrect metadata that doesn't look like a placeholder value seem like a pathological case.

In any case, I think using readv seems like a good optimization. I do think you would need to make sure vectored reads are possible, since without that, it would just look like a partial read from the stub implementation rather than from the OS.

the8472 · 2021-10-04T18:04:51Z

However, there's a lot of code out there that just does stat/read or stat/mmap, all of which breaks with an incorrect size.

At least in the mmap case they're already accepting that this won't work on unusual files. If you're writing a rust program that replaces a bash script that reads from strange sources then you're more likely to reach for standard methods.

FUSE filesystems presenting incorrect metadata that doesn't look like a placeholder value seem like a pathological case.

In my case it was a bug caused by an incorrect size calculation that only affected the metadata but not the reads.

In any case, I think using readv seems like a good optimization.

That's probably going to work in practice (at least for regular files on linux on non-network filesystems?) but afaik POSIX doesn't guarantee that short reads can't happen on regular files. So readv is no different from reading with a larger buffer which is not the same as a zero-length read.

That's a 99% solution. If we want to accept the 1% of edge-cases, ok, but I think we should document when we're not spec-compliant.

tavianator · 2021-10-04T18:11:24Z

If a process receives a signal in the middle of read() then it's likely to get a short read. I feel like read_to_end() should not misbehave in this case.

joshtriplett · 2021-10-04T22:06:13Z

I think if we see that the file has an advertised length of N, and then we do a readv with an N-byte buffer and a 32-byte buffer, and the readv comes back with exactly N bytes, it doesn't seem unreasonable to assume that we've read the whole file.

(One of these days, I think the Linux kernel needs to add an interface for returning an affirmative EOF indication within the same syscall. Or perhaps we'll just use io_uring and submit two read syscalls.)

jkugelman · 2021-10-05T00:36:13Z

That's probably going to work in practice (at least for regular files on linux on non-network filesystems?) but afaik POSIX doesn't guarantee that short reads can't happen on regular files. So readv is no different from reading with a larger buffer which is not the same as a zero-length read.

That's a good point. readv won't treat the spillover buffer specially. A short read isn't necessarily caused by EOF. It's really unlikely but within the realm of possibility that a read could fill up the buffer, run short, and yet not signal EOF. I don't think I'll be pursuing the readv idea.

jkugelman · 2021-10-05T00:38:06Z

On a different note, it'd be a big win to have read_to_end query the file size and set the initial capacity to match. Looking through the rust codebase it is an incredibly common pattern to do:

let mut data = Vec::new();
file.read_to_end(&mut data)?;

It would take an extra stat syscall but it would greatly reduce the number of reads when the empty vector is "warming up". I checked strace: if you pass an empty vector read_to_end currently reads 32, 32, 64, 128, 256, 512, etc. Two 32-byte reads and then it slowly doubles the read sizes. That's really suboptimal.

And the capacity is just a hint so guessing the wrong size wouldn't affect correctness at all.

Additionally:

std::fs::read already has this exact optimization, yet there is a substantial amount of code that calls file.read_to_end instead.
Read::read_to_end can't perform this optimization. It could only be done on Read + Seek. Or on File.

joshtriplett · 2021-10-05T04:52:26Z

@jkugelman I looked through the entire compiler and standard library for every instance of .read_to_end and .read_to_string, and found only one case (outside of tests) where it seemed to be reading from something reasonably predictable rather than a generic reader or a buffer or similar but without using with_capacity. I submitted #89546 for that case.

jkugelman · 2021-10-05T13:38:06Z

Specializing read_to_end for File would provide a lot of benefit, not just for the rust codebase but for the whole ecosystem. I haven't looked but I expect there are lots of examples of reading into an empty Vec out in the wild.

I spotted a number of other places in rust that use this pattern. Here's what I found in case I get hit by a bus before I can work on the PR:

File

compiler/rustc_data_structures/src/memmap.rs
    pub unsafe fn map(mut file: File) -> io::Result<Self> {
---
        let mut data = Vec::new();
        file.read_to_end(&mut data)?;

src/tools/rls/rls-vfs/src/lib.rs
        let mut file = match fs::File::open(file_name) {
--
        let mut buf = vec![];
        if file.read_to_end(&mut buf).is_err() {

src/tools/cargo/crates/cargo-util/src/paths.rs
        let mut f = OpenOptions::new()
--
            .open(&path)?;
        let mut orig = Vec::new();
        f.read_to_end(&mut orig)?;

BufReader

BufReader defers directly to the std::io module's private read_to_end free function. It ought to call inner.read_to_end in case inner has a specialized implementation.

src/tools/cargo/src/cargo/util/lockserver.rs
                    let mut dst = Vec::new();
                    drop(client.read_to_end(&mut dst));

&dyn Read

&dyn Read could benefit the reader is a File or other type with a specialized implementation.

compiler/rustc_serialize/src/json.rs
pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> {
    let mut contents = Vec::new();
    match rdr.read_to_end(&mut contents) {

Examples

Read's examples and read_to_end's examples show this very pattern!

library/std/src/io/mod.rs
///     let mut f = File::open("foo.txt")?;
--
///     let mut buffer = Vec::new();
///     // read the whole file
///     f.read_to_end(&mut buffer)?;
--
    ///     let mut f = File::open("foo.txt")?;
--
    ///     let mut buffer = Vec::new();
    ///
    ///     // read the whole file
    ///     f.read_to_end(&mut buffer)?;

library/stdarch/examples/hex.rs
    let mut input = Vec::new();
    io::stdin().read_to_end(&mut input).unwrap();

jkugelman · 2021-10-06T00:18:39Z

PR #89582 submitted.

@camelid

Reading a file into an empty vector or string buffer can incur unnecessary `read` syscalls and memory re-allocations as the buffer "warms up" and grows to its final size. This is perhaps a necessary evil with generic readers, but files can be read in smarter by checking the file size and reserving that much capacity. `std::fs::read` and `read_to_string` already perform this optimization: they open the file, reads its metadata, and call `with_capacity` with the file size. This ensures that the buffer does not need to be resized and an initial string of small `read` syscalls. However, if a user opens the `File` themselves and calls `file.read_to_end` or `file.read_to_string` they do not get this optimization. ```rust let mut buf = Vec::new(); file.read_to_end(&mut buf)?; ``` I searched through this project's codebase and even here are a *lot* of examples of this. They're found all over in unit tests, which isn't a big deal, but there are also several real instances in the compiler and in Cargo. I've documented the ones I found in a comment here: rust-lang#89516 (comment) Most telling, the `Read` trait and the `read_to_end` method both show this exact pattern as examples of how to use readers. What this says to me is that this shouldn't be solved by simply fixing the instances of it in this codebase. If it's here it's certain to be prevalent in the wider Rust ecosystem. To that end, this commit adds specializations of `read_to_end` and `read_to_string` directly on `File`. This way it's no longer a minor footgun to start with an empty buffer when reading a file in. A nice side effect of this change is that code that accesses a `File` as a bare `Read` constraint or via a `dyn Read` trait object will benefit. For example, this code from `compiler/rustc_serialize/src/json.rs`: ```rust pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> { let mut contents = Vec::new(); match rdr.read_to_end(&mut contents) { ``` Related changes: - I also added specializations to `BufReader` to delegate to `self.inner`'s methods. That way it can call `File`'s optimized implementations if the inner reader is a file. - The private `std::io::append_to_string` function is now marked `unsafe`. - `File::read_to_string` being more efficient means that the performance note for `io::read_to_string` can be softened. I've added @camelid's suggested wording from: rust-lang#80218 (comment)

…r=joshtriplett Optimize File::read_to_end and read_to_string Reading a file into an empty vector or string buffer can incur unnecessary `read` syscalls and memory re-allocations as the buffer "warms up" and grows to its final size. This is perhaps a necessary evil with generic readers, but files can be read in smarter by checking the file size and reserving that much capacity. `std::fs::read` and `std::fs::read_to_string` already perform this optimization: they open the file, reads its metadata, and call `with_capacity` with the file size. This ensures that the buffer does not need to be resized and an initial string of small `read` syscalls. However, if a user opens the `File` themselves and calls `file.read_to_end` or `file.read_to_string` they do not get this optimization. ```rust let mut buf = Vec::new(); file.read_to_end(&mut buf)?; ``` I searched through this project's codebase and even here are a *lot* of examples of this. They're found all over in unit tests, which isn't a big deal, but there are also several real instances in the compiler and in Cargo. I've documented the ones I found in a comment here: rust-lang#89516 (comment) Most telling, the documentation for both the `Read` trait and the `Read::read_to_end` method both show this exact pattern as examples of how to use readers. What this says to me is that this shouldn't be solved by simply fixing the instances of it in this codebase. If it's here it's certain to be prevalent in the wider Rust ecosystem. To that end, this commit adds specializations of `read_to_end` and `read_to_string` directly on `File`. This way it's no longer a minor footgun to start with an empty buffer when reading a file in. A nice side effect of this change is that code that accesses a `File` as `impl Read` or `dyn Read` will benefit. For example, this code from `compiler/rustc_serialize/src/json.rs`: ```rust pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> { let mut contents = Vec::new(); match rdr.read_to_end(&mut contents) { ``` Related changes: - I also added specializations to `BufReader` to delegate to `self.inner`'s methods. That way it can call `File`'s optimized implementations if the inner reader is a file. - The private `std::io::append_to_string` function is now marked `unsafe`. - `File::read_to_string` being more efficient means that the performance note for `io::read_to_string` can be softened. I've added `@camelid's` suggested wording from rust-lang#80218 (comment). r? `@joshtriplett`

noproto · 2021-10-25T17:51:09Z

Regression: #90263

jkugelman mentioned this issue Oct 4, 2021

Fix read_to_end to not grow an exact size buffer #89165

Merged

jkugelman changed the title ~~Remove EOF "probe" read in Read::read_to_end~~ Improve Read::read_to_end performance Oct 5, 2021

jkugelman changed the title ~~Improve Read::read_to_end performance~~ Improve performance of Read::read_to_end and Read::read_to_string Oct 5, 2021

jkugelman mentioned this issue Oct 6, 2021

Optimize File::read_to_end and read_to_string #89582

Merged

jkugelman closed this as completed Oct 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of Read::read_to_end and Read::read_to_string #89516

Improve performance of Read::read_to_end and Read::read_to_string #89516

jkugelman commented Oct 4, 2021 •

edited

Loading

joshtriplett commented Oct 4, 2021 •

edited

Loading

the8472 commented Oct 4, 2021

joshtriplett commented Oct 4, 2021 •

edited

Loading

the8472 commented Oct 4, 2021

tavianator commented Oct 4, 2021

joshtriplett commented Oct 4, 2021

jkugelman commented Oct 5, 2021

jkugelman commented Oct 5, 2021 •

edited

Loading

joshtriplett commented Oct 5, 2021 •

edited

Loading

jkugelman commented Oct 5, 2021 •

edited

Loading

jkugelman commented Oct 6, 2021 •

edited

Loading

noproto commented Oct 25, 2021

Improve performance of Read::read_to_end and Read::read_to_string #89516

Improve performance of Read::read_to_end and Read::read_to_string #89516

Comments

jkugelman commented Oct 4, 2021 • edited Loading

joshtriplett commented Oct 4, 2021 • edited Loading

the8472 commented Oct 4, 2021

joshtriplett commented Oct 4, 2021 • edited Loading

the8472 commented Oct 4, 2021

tavianator commented Oct 4, 2021

joshtriplett commented Oct 4, 2021

jkugelman commented Oct 5, 2021

jkugelman commented Oct 5, 2021 • edited Loading

joshtriplett commented Oct 5, 2021 • edited Loading

jkugelman commented Oct 5, 2021 • edited Loading

File

BufReader

&dyn Read

Examples

jkugelman commented Oct 6, 2021 • edited Loading

noproto commented Oct 25, 2021

jkugelman commented Oct 4, 2021 •

edited

Loading

joshtriplett commented Oct 4, 2021 •

edited

Loading

joshtriplett commented Oct 4, 2021 •

edited

Loading

jkugelman commented Oct 5, 2021 •

edited

Loading

joshtriplett commented Oct 5, 2021 •

edited

Loading

jkugelman commented Oct 5, 2021 •

edited

Loading

jkugelman commented Oct 6, 2021 •

edited

Loading