Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Make EOL handling more general #19877

Closed
wants to merge 10 commits into from
Closed

Conversation

mpastell
Copy link
Contributor

@mpastell mpastell commented Jan 5, 2017

This aims to solve #19785. I would like to get feedback if this is going to the right direction and can be merged when ready before working further on this.

  • Consider ['\n', '\r', '\r\n'] as line ends in readline, readlines and eachline
  • Add option to readline etc. to omit EOL (chomp::Bool=false)
  • Make chomp to remove any of the above line feeds
  • Update documentation
  • Add tests

@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re
the end of the input), from the given I/O stream or file (defaults to `STDIN`).
When reading from a file, the text is assumed to be encoded in UTF-8.
"""
readline(filename::AbstractString) = open(readline, filename)
function readline(filename::AbstractString, chomp = false; nl2lf = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keyword argument currently have a significant performance cost: probably better make nl2lf a positional argument.

I'm also not sure we need that argument. Does it exist in other languages? It would seem to me that either you just want to extract the text content of the line and don't care about newline characters (chomp=true), or you care about them and you wouldn't want them to be replaced with \n.

Finally, arguments must be described in the docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python has universal newlines (original suggestion in #19785), which does exactly the same as nl2lf. It is convenient way to do ensure consistent EOL from the input.

Copy link
Contributor Author

@mpastell mpastell Jan 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix the docstrings once we can agree on the implementation and arguments.

@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re
the end of the input), from the given I/O stream or file (defaults to `STDIN`).
When reading from a file, the text is assumed to be encoded in UTF-8.
"""
readline(filename::AbstractString) = open(readline, filename)
function readline(filename::AbstractString, chomp = false; nl2lf = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not the greatest keyword argument name. I wonder if it'd be better to do something like:

const LINEFEEDS = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']

function readline(file, chomp; newlines::Vector{Char}=linefeeds)
...

This has the dual-benefit of making the linefeeds vector a const, as well as allowing the user to specify a custom set of linefeed characters if desired.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that before this PR readline was based on readuntil. We could make the latter more general and still implement the former over it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nl2lf has the effect of converting all linefeeds to \n in output, it doesn't allow you to set which characters are considered as a newline. Adding newlines argument makes sense, but I'd like to keep to nl2lf or something with a better name to do normalization of line ends conveniently and efficiently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to accept a newlines argument to readline if it can't be made efficient. People can use readuntil for special needs.

@@ -448,7 +456,32 @@ function readuntil(s::IO, t::AbstractString)
end

readline() = readline(STDIN)
readline(s::IO) = readuntil(s, '\n')

function readline(s::IO, chomp = false; nl2lf=false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be better (and more performant) to actually make readuntil be able to accept a Vector{Char}. The problem here is that readuntil itself is creating an IOBuffer and take!ing the output at the end, and this approach layers yet another IOBuffer on top of that.

The difficulty will be how you deal with \r\n, since it it's doing a special case here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indeed \r\n is probably hard to handle efficiently from readuntil, so readline probably still deserves a custom implementation. That would also allow for an efficient version working on bytes for standard UTF-8 String.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is readuntil(s::IO, t::AbstractString), but I would expect this to be more efficient. It's modified from readuntil.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 5, 2017

This is about 4x slower than the current implementation due to considering also \u202x. Maybe it would be better to consider only \r, \r\n and \n and put the more general code in a package.

@kshyatt kshyatt added needs docs Documentation for this change is required needs tests Unit tests are required for this change strings "Strings!" labels Jan 5, 2017
@mpastell
Copy link
Contributor Author

mpastell commented Jan 5, 2017

I managed to make it faster by only considering \r, \r\n and \n as newlines. This still makes ~2x as many allocations as the old version. I would appreciate any hints how to make it faster.

I also dropped nl2lf as it doesn't seem others see at as a core feature.

readline(s::IO) = readuntil(s, '\n')

function readline(s::IO, chomp::Bool = false)
out = UInt8[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use an IOBuffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is significantly faster, readuntil uses the same approach.

@@ -448,7 +456,50 @@ function readuntil(s::IO, t::AbstractString)
end

readline() = readline(STDIN)
readline(s::IO) = readuntil(s, '\n')

function readline(s::IO, chomp::Bool = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the common style is not to have spaces around =. Same below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

while !eof(s)
c = read(s, Char)
if c in newlines
if c == '\r' && !eof(s) && Base.peek(s) == 0x0a
Copy link
Member

@nalimilan nalimilan Jan 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use == '\n', comparison with integers is deprecated. This will be as efficient anyway.

EDIT: Nevermind, I misread that code, and looks like peekchar doesn't support all IO subtypes.

@nalimilan
Copy link
Member

I managed to make it faster by only considering \r, \r\n and \n as newlines. This still makes ~2x as many allocations as the old version. I would appreciate any hints how to make it faster.

Not sure why allocations would happen. Use the --track-allocation command line argument to see where they come from.

Once you manage to get it fast for \n and\r, I think a strategy to support checking for non-ASCII character separators would be to have a branch for checking the first byte, and only then check for the second byte. Anyway that code assumes the input is UTF-8, so you can take advantage of this.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 6, 2017

I realized that the original version calls readuntil from "iostream.jl", which uses ccall

function readuntil(s::IOStream, delim::UInt8)
    ccall(:jl_readuntil, Array{UInt8,1}, (Ptr{Void}, UInt8), s.ios, delim)
end

So it looks the current state is as fast as I can make it without implementing it in C, which I don't think I have time do right now.

Should I implement the Julia version anyway? It would mean that readlines becomes 2x slower than before.

@nalimilan
Copy link
Member

Ah, right. Unfortunately it's probably unacceptable to make such an essential function twice slower. Though you could adapt the C code without too much work. Actually there's already a jl_readline C function which calls ios_copyuntil. You could copy that code and adapt it to look for multiple characters (using strpbrk instead of memchr, and checking second bytes manually when needed).

@mpastell
Copy link
Contributor Author

mpastell commented Jan 6, 2017

Yes, that seems like the right way to go. I will look into it, but my C is a bit rusty and I have limited amount of time so it can take a while.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 7, 2017

Modifying the C code was quite simple I managed to implement implement readline in C, the performance is now same as the old version.

if (ch == '\r'){
r = p;
if (chomp) nchomp = 1;
ntowrite = r - (from->buf+from->bpos) + 1 - nchomp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing whitespace here and on L854

EachLine(stream, ondone) = new(stream, ondone)
chomp::Bool
EachLine(stream, chomp) = EachLine(stream, ()->nothing, chomp)
EachLine(stream, ondone, chomp) = new(stream, ondone, chomp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe have ondone be the last field/argument?

@@ -448,7 +456,29 @@ function readuntil(s::IO, t::AbstractString)
end

readline() = readline(STDIN)
readline(s::IO) = readuntil(s, '\n')

function readline(s::IO, chomp::Bool=false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed now, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still needed. C version reads from IOStreams, but not for instance IOBuffers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so this means it has to be relatively efficient too. Is that the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say its relatively efficient and anyway it has the same performance as before.

@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re
the end of the input), from the given I/O stream or file (defaults to `STDIN`).
When reading from a file, the text is assumed to be encoded in UTF-8.
"""
readline(filename::AbstractString) = open(readline, filename)
function readline(filename::AbstractString, chomp = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to mention chomp in docstrings.

@@ -831,6 +831,66 @@ size_t ios_copyuntil(ios_t *to, ios_t *from, char delim)
return total;
}

//Copy until '\r', '\n' or '\r\n'
size_t ios_copyline(ios_t *to, ios_t *from, int chomp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use bool_t? Also, space after //.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing code uses int for Julia Bools

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it's just that I found a bool_t in the same file, but I have no idea what's the most common convention.

char *r = NULL;
char *n = NULL;

for (size_t i = 0; i < avail; i++){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be simpler and possibly more efficient to use strpbrk? Then you just need to check the next byte.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the first thing I tried and it's simpler and significantly slower.

You have to look for both \r and \n in available bytes, but you only want to break on the first one found. e.g. If you have a large file and you look for \r both in avail (using memch) and never find it the code gets very slow.

This has the same performance as readuntil which uses memchr.

Copy link
Member

@nalimilan nalimilan Jan 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to look for both \r and \n in available bytes, but you only want to break on the first one found. e.g. If you have a large file and you look for \r both in avail (using memch) and never find it the code gets very slow.

I don't follow. What difference does it make from the manual loop? I just suggest breaking on the first match. Then you only have to call strpbrk again if there's a \r character without a following \n, which shouldn't be too frequent and isn't worth optimizing for.

This has the same performance as readuntil which uses memchr.

Even, for relatively long strings? I'm asking because it looks like glibc contains hand-written assembly code here and explicit SSE instructions here for this, so I figure there must be a reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I tried using two calls to memchr as the input is not a string, but a buffer. I guess I could cast it to a string and use strpbrk.

Most of the time in readlines is anyway spent in Julia (I think in allocating the String array)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Anyway I've just realized strpbrk requires null-terminated strings, and there doesn't seem to exist any variant taking the string length.

@nalimilan
Copy link
Member

Great. Would checking for non-ASCII newline characters incur a significant slowdown? Though it could easily be added later without breaking too much existing code.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 7, 2017

I don't think checking for non-ASCII characters would make a noticeable difference.

I think I will try to finish this pull request with *lines and chomp working for \r, \n and \r\n which would give Julia the same behavior as most other languages have and other newlines can be added later if there is a need.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 7, 2017

Many thanks to all for all the help and reviews! I think this starts be ready to merge after final comments and tweaks.

@tkelman tkelman removed needs docs Documentation for this change is required needs tests Unit tests are required for this change labels Jan 7, 2017
@mpastell mpastell changed the title WIP: Make EOL handling more general RFC: Make EOL handling more general Jan 8, 2017
end

"""
eachline(stream::IO)
eachline(filename::AbstractString)
eachline(stream::IO, chomp::Bool=false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One space is enough.

readline(stream::IO=STDIN, chomp::Bool=false)
readline(filename::AbstractString, chomp::Bool=false)

Read a single line of text including from the given I/O stream or file (defaults to `STDIN`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove "including" (typo?).

char *r = NULL;
char *n = NULL;

for (size_t i = 0; i < avail; i++){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Anyway I've just realized strpbrk requires null-terminated strings, and there doesn't seem to exist any variant taking the string length.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This is almost ready AFAICT. Though you'll have to rebase this since #19449 has just been merged (which is a good thing since we had to wait for it anyway).

Then I'd like other people to comment too. @StefanKarpinski, @stevengj?

ntowrite = n - (from->buf+from->bpos) + 1 - nchomp;
break;
}
if (ch == '\r'){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing else.

char *p = (char*)from->buf+from->bpos+i;
char ch = from->buf[from->bpos+i];

if (ch == '\n'){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space before { here and elsewhere.

if (chomp) nchomp = 1;
ntowrite = r - (from->buf+from->bpos) + 1 - nchomp;
if (i < avail){
char ch2 = from->buf[from->bpos+i+1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can combine the two if and get rid of the intermediate ch2 variable.

write(io,"pancakes\r\nwaffles\n\rblueberries\r")
@test readlines(io) == String["pancakes\r\n","waffles\n","\r","blueberries\r"]
write(io,"pancakes\r\nwaffles\n\rblueberries\r")
@test readlines(io, true) == String["pancakes","waffles","","blueberries"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add tests for potentially problematic cases like \r\n\r\n, \n\r\n, \r\r\n, etc.? Same for chomp.

function chomp(s::String)
i = endof(s)
if i < 1 || s.data[i] != 0x0a
if i < 1 || (s.data[i] != 0x0a && s.data[i] != 0x0d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're making the function more complex, I would find it clearer to replace the integer codes with UInt8('\n') and UInt8('\r'). Same in readline.

"""
function chomp(s::AbstractString)
i = endof(s)
if (i < 1 || s[i] != '\n') return SubString(s, 1, i) end
if (i < 1 || (s[i] != '\n' && s[i] != '\r')) return SubString(s, 1, i) end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double space here.

j = prevind(s,i)
if (j < 1 || s[j] != '\r') return SubString(s, 1, i-1) end
if (j < 1 || (s[j] != '\r' && s[i] == '\n')) return SubString(s, 1, i-1) end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to check that the s[i] == '\n' here since it's implied by the previous checks. So just put this code under an else of s[i] == '\r' for clarity, with a comment like # s[i] == '\n'.

@nalimilan
Copy link
Member

Actually, I've just had a deeper look at what Rust and Swift do, and they seem to adopt a simpler approach than here: they only break on \n and \r\n, nor on \r. This can be implemented more efficiently since you only need to look for \n, and then check whether the previous character is \r.

There's a comment in the Swift source saying that they'd like to support more newline characters, but so far I guess they couldn't find an efficient approach, or didn't consider it was worth it:
https://github.com/apple/swift/blob/74d03b9dd94b862602b975682b30f53f416b219a/stdlib/public/core/InputStream.swift#L43
https://bugs.swift.org/browse/SR-1280

Rust's implementation is here:
https://github.com/rust-lang/rust/blob/cbf88730e755d099c854f84dd0f1990490bf0088/src/libstd/io/mod.rs#L1775-L1790

Both languages use an interesting implementation which we might be able to adopt: they use the general (equivalent of the) readuntil function looking for \n, and then truncate the resulting string to remove \n and \r. (Swift even uses getline on POSIX systems to replace readuntil, not sure what's the advantage over a manual implementation calling memchr.).

Overall the benefits of only supporting CR and CR+LF are quite clear. OTOH, supporting exotic newline sequences may not be worth it given that according to Wikipedia other software do not seem to care, and that OSes using other conventions are really old.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 8, 2017

Do you mean only support CRLF and LF? There is very little extra cost in handling CR too and Python for one supports it. I will do some benchmarks to show how this PR performs in comparison to readuntil.

It's quite easy to get trailing CR left in files when processing things with CRLF and this approach helps dealing with that. We still also have some instruments in the lab that use CR as terminator.

@stevengj
Copy link
Member

stevengj commented Jan 8, 2017

It sounds like LF and CRLF could be supported in much less code, because it could just use readuntil. I doubt that other line endings are worth the additional code complexity, even if there is no performance penalty.

@mpastell
Copy link
Contributor Author

mpastell commented Jan 8, 2017

I think breaking on CR in a utf-8 encoded string is the right thing to do (according to the standard), but I'm not sure how significant the issue is in practise.

If you think the old behavior using readuntil already works as it should there is probably no point in working further on this PR.

However, If you anyway wan't to add the chomp option as implemented here then the complexity of supporting also \r is only few extra if statements.

@stevengj
Copy link
Member

stevengj commented Jan 8, 2017

The old behavior using readuntil didn't handle CRLF, no? @nalimilan's suggestion was to readuntil \n, and then check whether the preceding character was \r, which we weren't doing before.

I agree that the chomp option was useful. (I almost think it should be the default, though that is a breaking change.)

@mpastell
Copy link
Contributor Author

mpastell commented Jan 9, 2017

The old behavior handles CRLF as well, because it always breaks after \n. Further this also currently works:

julia> chomp.(readlines(IOBuffer("hello\r\njulia\r\n")))
2-element Array{SubString{String},1}:
 "hello"
 "julia"

I was just wondering what you mean by additional complexity. I think if we just want to add chomp option that could be done efficiently without adding C-code to master. A good approach could be to do:

julia> readuntil(IOBuffer("hello\r\njulia\r\n"), 0x0a)
7-element Array{UInt8,1}:
 0x68
 0x65
 0x6c
 0x6c
 0x6f
 0x0d
 0x0a

And check for the two last bytes before calling String.

@nalimilan
Copy link
Member

What @stevengj and I refer to by "code complexity" is the need for special methods instead of simply using readuntil(s, '\n') and then stripping the trailing '\r' if present. So basically this suggests taking the approach you describe in your last comment (sorry for not noticing before you didn't have to work with the C code).

@mpastell
Copy link
Contributor Author

mpastell commented Jan 9, 2017

I created a new pull request (#19944) based on the discussion. It shows that performance of chomp=true could still be improved (because readlines for Streams was improved in #19449) using custom C code for chomp.

@mpastell mpastell closed this Jan 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants