Improve performance of Read::read_to_end and Read::read_to_string #89516
My original thought was, if we have an ordinary file with a file size and it doesn't look like a placeholder value, we should just read that many bytes and stop.

You're probably aware of the edge-case around procfs, hence "it doesn't look like a placeholder value", but that isn't the only one. For example, FUSE filesystems can make up any metadata. So we can't treat the size as anything but a hint. Only a zero-length read is the true indicator of EOF.

I'm aware that filesystems can give incorrect information. However, there's a lot of code out there that just does stat/read or stat/mmap, all of which breaks with an incorrect size. I don't think it would be unreasonable to make the same assumption. FUSE filesystems presenting incorrect metadata that doesn't look like a placeholder value seem like a pathological case. In any case, I think using readv seems like a good optimization. I do think you would need to make sure vectored reads are possible, since without that, it would just look like a partial read from the stub implementation rather than from the OS.
At least in the mmap case they're already accepting that this won't work on unusual files. If you're writing a Rust program that replaces a bash script that reads from strange sources, then you're more likely to reach for standard methods.
In my case it was a bug caused by an incorrect size calculation that only affected the metadata but not the reads.
That's probably going to work in practice (at least for regular files on Linux on non-network filesystems?), but afaik POSIX doesn't guarantee that short reads can't happen on regular files. So that's a 99% solution. If we want to accept the 1% of edge-cases, ok, but I think we should document when we're not spec-compliant.
If a process receives a signal in the middle of `read()` then it's likely to get a short read. I feel like …
I think if we see that the file has an advertised length of N, and then we do a readv with an N-byte buffer and a 32-byte buffer, and the readv comes back with exactly N bytes, it doesn't seem unreasonable to assume that we've read the whole file. (One of these days, I think the Linux kernel needs to add an interface for returning an affirmative EOF indication within the same syscall. Or perhaps we'll just use …)
That's a good point.
On a different note, it'd be a big win to have `read_to_end` reserve the file's size as capacity up front. It would take an extra metadata syscall, and the capacity is just a hint, so guessing the wrong size wouldn't affect correctness at all.
@jkugelman I looked through the entire compiler and standard library for every instance of `read_to_end`. Specializing it on `File` would help here. I spotted a number of other places in the codebase with this pattern:

`compiler/rustc_data_structures/src/memmap.rs`

```rust
pub unsafe fn map(mut file: File) -> io::Result<Self> {
    // ...
    let mut data = Vec::new();
    file.read_to_end(&mut data)?;
```

`src/tools/rls/rls-vfs/src/lib.rs`

```rust
let mut file = match fs::File::open(file_name) {
// ...
let mut buf = vec![];
if file.read_to_end(&mut buf).is_err() {
```

`src/tools/cargo/crates/cargo-util/src/paths.rs`

```rust
let mut f = OpenOptions::new()
    // ...
    .open(&path)?;
let mut orig = Vec::new();
f.read_to_end(&mut orig)?;
```

**BufReader**

`src/tools/cargo/src/cargo/util/lockserver.rs`

```rust
let mut dst = Vec::new();
drop(client.read_to_end(&mut dst));
```

**&dyn Read**

`compiler/rustc_serialize/src/json.rs`

```rust
pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> {
    let mut contents = Vec::new();
    match rdr.read_to_end(&mut contents) {
```

**Examples**

`library/std/src/io/mod.rs`

```rust
/// let mut f = File::open("foo.txt")?;
// ...
/// let mut buffer = Vec::new();
/// // read the whole file
/// f.read_to_end(&mut buffer)?;
// ...
/// let mut f = File::open("foo.txt")?;
// ...
/// let mut buffer = Vec::new();
///
/// // read the whole file
/// f.read_to_end(&mut buffer)?;
```

`library/stdarch/examples/hex.rs`

```rust
let mut input = Vec::new();
io::stdin().read_to_end(&mut input).unwrap();
```
PR #89582 submitted.
Reading a file into an empty vector or string buffer can incur unnecessary `read` syscalls and memory re-allocations as the buffer "warms up" and grows to its final size. This is perhaps a necessary evil with generic readers, but files can be read in more intelligently by checking the file size and reserving that much capacity.

`std::fs::read` and `read_to_string` already perform this optimization: they open the file, read its metadata, and call `with_capacity` with the file size. This avoids both resizing the buffer and an initial string of small `read` syscalls. However, if a user opens the `File` themselves and calls `file.read_to_end` or `file.read_to_string`, they do not get this optimization.

```rust
let mut buf = Vec::new();
file.read_to_end(&mut buf)?;
```

I searched through this project's codebase and even here there are a *lot* of examples of this. They're found all over in unit tests, which isn't a big deal, but there are also several real instances in the compiler and in Cargo. I've documented the ones I found in a comment here: rust-lang#89516 (comment)

Most telling, the `Read` trait and the `read_to_end` method both show this exact pattern as examples of how to use readers. What this says to me is that this shouldn't be solved by simply fixing the instances of it in this codebase. If it's here, it's certain to be prevalent in the wider Rust ecosystem.

To that end, this commit adds specializations of `read_to_end` and `read_to_string` directly on `File`. This way it's no longer a minor footgun to start with an empty buffer when reading a file in. A nice side effect of this change is that code that accesses a `File` as a bare `Read` constraint or via a `dyn Read` trait object will benefit.

For example, this code from `compiler/rustc_serialize/src/json.rs`:

```rust
pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> {
    let mut contents = Vec::new();
    match rdr.read_to_end(&mut contents) {
```

Related changes:

- I also added specializations to `BufReader` that delegate to `self.inner`'s methods. That way it can call `File`'s optimized implementations if the inner reader is a file.
- The private `std::io::append_to_string` function is now marked `unsafe`.
- `File::read_to_string` being more efficient means that the performance note for `io::read_to_string` can be softened. I've added @camelid's suggested wording from rust-lang#80218 (comment).
…r=joshtriplett Optimize File::read_to_end and read_to_string

r? `@joshtriplett`
Regression: #90263
> In #89165 I updated `Read::read_to_end` to try to detect EOF when the input buffer is full by reading into a small "probe" buffer. If that read returns `Ok(0)` then it avoids unnecessarily doubling the capacity of the input buffer.

Originally posted by @joshtriplett in #89165 (comment)

Josh, did you have something in mind?

I thought of a way: use `read_vectored`. I could add the probe buffer to a vectorized `readv`, which would eliminate the extra `read` syscall. Is that idea worth pursuing?

Is it okay to simply switch the `read` call(s) to `read_vectored`, or would it be better to check `is_read_vectored` and have separate vectorized and non-vectorized code paths? I'm inclined to keep it simple and do the former.