Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Per-line IO Extension #2101

Closed
wants to merge 7 commits into from
Closed

RFC: Per-line IO Extension #2101

wants to merge 7 commits into from

Conversation

PENGUINLIONG
Copy link

@PENGUINLIONG PENGUINLIONG commented Aug 9, 2017

This RFC extends std::io to suit the need for uniform per-line reading/writing interface.

Rendered

# Motivation
[motivation]: #motivation

Programs sometimes need to know about their working environment to do their job properly. For example, linebreak convention differs for Windows and *NIX. Such discrepancy can lead to problems easily, especially when a program needs to communicate with aged third-party libraries.
Copy link
Member

@kennytm kennytm Aug 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please escape the * here ("\*NIX"), the rest of the document is badly highlighted on GitHub's diff view because of this.

```rust
/// Platform architecture.
/// e.g. `arm`, `x86_64` and `i686`.
pub fn arch() -> String;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub fn linebreak() -> String;
/// Current operating system.
/// e.g. `windows`, `linux` and `darwin`.
pub fn os() -> String;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub fn os() -> String;
/// Word size in bits the program has been compiled into.
/// Commonly 32 or 64.
pub fn word_size() -> u32;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::mem::size_of::<usize>() * 8

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it, though? usize is defined as a 'pointer-sized unsigned integer type', which connects it to the bit width of the memory address space, not of a general-purpose register (which need not be the same). Given this definition, under the x32 ABI for x86-64 usize ought to be 32-bit, even though general-purpose registers are 64-bit. I also wonder how usize should be defined on 32-bit x86 with segmented pointers...

I think there should be a separate integer type with the bit width of a general-purpose/arithmetic register.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fstirlitz If we use #1339 (comment), the usize on x32 should be 32-bit. If it was 64-bit, the pointer-cast (x: usize) as *const T would not be a one-to-one mapping, which existing code do rely on.

A separate integer type would require a change of the provided cfg flags (e.g. #[cfg(target_feature = "64-bit register")] type ifast = i64;, which is out-of-scope for this RFC.

pub const LINEBREAK: &'static str;
/// Word size in bits the program has been compiled into.
/// Commonly 32 or 64.
pub const WORD_SIZE: &'static u32;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a u32 and not a &u32.

Also I think it's much more straightforward to use std::mem::size_of<usize>() (which gives you the bytes, but that's usually what you want anyway I think, also a *8 doesn't warrant a new constant in the stdlib imo.

# Unresolved questions
[unresolved]: #unresolved-questions

The datatype returned by `word_size()` is not yet determined.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a function anymore but a constant.

@nrc nrc added A-libs T-libs-api Relevant to the library API team, which will review and decide on the RFC. and removed A-libs labels Aug 10, 2017
@mark-i-m
Copy link
Member

This makes me nervous. I would rather this information come through an API. Doing it through the compiler means that building a library on two different machines could now be guaranteed to not be reproducible.

@kennytm
Copy link
Member

kennytm commented Aug 15, 2017

@mark-i-m I don't think that's what "reproducible" means. Besides, what OP proposed is no more than adding

#[cfg(windows)]
pub const LINEBREAK: &str = "\r\n";
#[cfg(not(windows))]
pub const LINEBREAK: &str = "\n";

to libstd. The standard library already has a lot of #[cfg] stuff that reproducible builds must take into account.

@mark-i-m
Copy link
Member

@kennytm Hmm... I see what you mean. That's not how I understood the RFC. My understanding was that these values would be more like language items filled in magically by the compiler (which would not be reproducible).

@Evrey
Copy link

Evrey commented Aug 16, 2017

Why not off-load such things to OsString? Have a file that might use \r\n on Win or \n on Tux? Load it into an OsString and make an explicit "with line-breaks"-conversion to String. This way, no Rust-library will have to put mental load into being careful about line-break-style in normal String/str code, meaning that cross platform code is easier to do. (Or just let \r die already, just as \v.)

@jan-hudec
Copy link

@mark-i-m, it does not matter how it is set, it only matters what it depends on. This shall clearly only depend on host (-target) platform and that is not a problem for reproducible builds, because binaries for different platforms are obviously not expected to be equal.

@Evrey, OsString is the type passed to and from native system calls and that is very different from what is stored in files by convention.

@PENGUINLIONG, the usual way to use the platform newline is via the text flag when opening files.

In Rust standard library design it does not fit in std::fs::File, but is handled in a wrapper. So either a configuration should be added to std::io::BufReader and std::io::BufWriter to set the newline to use, or extended wrappers (std::io::TextReader and std::io::TextWriter) should be added to handle newline conversion and in future also character set conversion.

@mark-i-m
Copy link
Member

@jan-hudec That's true. Thanks for the correction 👍

Still, I would prefer to minimize magic in the compiler.

@scottmcm
Copy link
Member

FWIW, C# soured me on platform-specific newlines. Using WriteLine on a System.IO.StringWriter put CRLF into my string, even though I was just making a string that had nothing to do with my current OS.

I feel like some kind of newline normalizing reader/writer wrapper might be a better option here, so nearly all code can continue to just use \n as normal, but places that need something specific can choose which kind of newline to output, be that LineEnd::Lf, LineEnd::CrLf, LineEnd::Cr, or LineEnd::Platform.

@PENGUINLIONG
Copy link
Author

@jan-hudec Sometimes I just met the problem that the remote app only receive a certain type of line end. Communicating with many of these is just annoying. I'm pretty sure it's absurd and is not common, but it is an really existing problem.

The wrapper provided an uniform interface for line end processing and it seems to work well, @scottmcm . Through @kennytm 's method we don't need to do any magic in the compiler, that's great!

One more question. Should we enumerate all the Unicode line terminators?

BTW, the discussion is on line ends now. Should I re-propose an RFC or just edit it?

@PENGUINLIONG PENGUINLIONG changed the title Proposed more-env-info RFC: Unify string linebreaks with str::break_lines_with() Aug 29, 2017
# Detailed design
[design]: #detailed-design

This RFC mainly introduces method `break_lines_with()` to primitive `str`. The method replaces all recognizable linebreaks with new linebreak specified, and it will not consume any linebreaks. Here is a simplified implementation of it:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What advantage does this have over plain old replace? I don't see any.

Also, most practical code never wants to have the complete text in a single string anyway. Usually one either pulls lines by one from a BufRead or push them by one to a BufWrite.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say performance, especially when the incoming string's libebreaks are unknown. e.g. I want all LF in my string to be replaced by CRLF but the method accepts not only LF-broken strings. In this case I can't simply replace LF with CRLF, since there is an LF in CRLF and the result will be like CRCRLF. Then I have to replace all CRLF to LF temporary beforehand.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about your suggestion also. I've tried to make a trait WriteLn that allows users to specify linebreaks and write a line without using writeln!, as it's linebreak is always '\n'.

But my opinion is, Rust didn't explicitly distinguish text and binary reading/writing, since OpenOption didn't have a method like text() and Read, Write have both binary and text reading/writing methods. It's not Read or Write's job to break lines. That's why I added the method to str.

I will update the RFC when I come up with something better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PENGUINLIONG There is BufRead::read_line.

I don't see any point introducing WriteLn since you could just write

write!(output, "blah{newline}", newline="\r\n");

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kennytm Sorry for missing that out.

But I think, it somehow reflected the need for WriteLn and ReadLn: We have a way to input per-line and we are seeking for a way to output per-line conveniently. This abstraction of per-line functionalities might be able to solve the problem.

@PENGUINLIONG PENGUINLIONG changed the title RFC: Unify string linebreaks with str::break_lines_with() RFC: Per-line IO Extension Aug 31, 2017
@mark-i-m
Copy link
Member

mark-i-m commented Sep 3, 2017

@PENGUINLIONG Perhaps a pre-rfc issue might be useful to explore the design space?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants