RFC: Make EOL handling more general #19877

mpastell · 2017-01-05T16:53:41Z

This aims to solve #19785. I would like to get feedback if this is going to the right direction and can be merged when ready before working further on this.

Consider ['\n', '\r', '\r\n'] as line ends in readline, readlines and eachline
Add option to readline etc. to omit EOL (chomp::Bool=false)
Make chomp to remove any of the above line feeds
Update documentation
Add tests

nalimilan · 2017-01-05T17:10:58Z

base/io.jl

@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re
 the end of the input), from the given I/O stream or file (defaults to `STDIN`).
 When reading from a file, the text is assumed to be encoded in UTF-8.
 """
-readline(filename::AbstractString) = open(readline, filename)
+function readline(filename::AbstractString, chomp = false; nl2lf = false)


Keyword argument currently have a significant performance cost: probably better make nl2lf a positional argument.

I'm also not sure we need that argument. Does it exist in other languages? It would seem to me that either you just want to extract the text content of the line and don't care about newline characters (chomp=true), or you care about them and you wouldn't want them to be replaced with \n.

Finally, arguments must be described in the docstring.

Python has universal newlines (original suggestion in #19785), which does exactly the same as nl2lf. It is convenient way to do ensure consistent EOL from the input.

I will fix the docstrings once we can agree on the implementation and arguments.

quinnj · 2017-01-05T17:13:48Z

base/io.jl

@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re
 the end of the input), from the given I/O stream or file (defaults to `STDIN`).
 When reading from a file, the text is assumed to be encoded in UTF-8.
 """
-readline(filename::AbstractString) = open(readline, filename)
+function readline(filename::AbstractString, chomp = false; nl2lf = false)


Not the greatest keyword argument name. I wonder if it'd be better to do something like:

const LINEFEEDS = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029'] function readline(file, chomp; newlines::Vector{Char}=linefeeds) ...

This has the dual-benefit of making the linefeeds vector a const, as well as allowing the user to specify a custom set of linefeed characters if desired.

Note that before this PR readline was based on readuntil. We could make the latter more general and still implement the former over it.

nl2lf has the effect of converting all linefeeds to \n in output, it doesn't allow you to set which characters are considered as a newline. Adding newlines argument makes sense, but I'd like to keep to nl2lf or something with a better name to do normalization of line ends conveniently and efficiently.

I don't think we need to accept a newlines argument to readline if it can't be made efficient. People can use readuntil for special needs.

quinnj · 2017-01-05T17:17:35Z

base/io.jl

@@ -448,7 +456,32 @@ function readuntil(s::IO, t::AbstractString)
 end

 readline() = readline(STDIN)
-readline(s::IO) = readuntil(s, '\n')
+
+function readline(s::IO, chomp = false; nl2lf=false)


I think it'd be better (and more performant) to actually make readuntil be able to accept a Vector{Char}. The problem here is that readuntil itself is creating an IOBuffer and take!ing the output at the end, and this approach layers yet another IOBuffer on top of that.

The difficulty will be how you deal with \r\n, since it it's doing a special case here.

Ah, indeed \r\n is probably hard to handle efficiently from readuntil, so readline probably still deserves a custom implementation. That would also allow for an efficient version working on bytes for standard UTF-8 String.

There is readuntil(s::IO, t::AbstractString), but I would expect this to be more efficient. It's modified from readuntil.

mpastell · 2017-01-05T21:08:19Z

This is about 4x slower than the current implementation due to considering also \u202x. Maybe it would be better to consider only \r, \r\n and \n and put the more general code in a package.

mpastell · 2017-01-05T23:25:05Z

I managed to make it faster by only considering \r, \r\n and \n as newlines. This still makes ~2x as many allocations as the old version. I would appreciate any hints how to make it faster.

I also dropped nl2lf as it doesn't seem others see at as a core feature.

nalimilan · 2017-01-06T10:13:54Z

base/io.jl

-readline(s::IO) = readuntil(s, '\n')
+
+function readline(s::IO, chomp::Bool = false)
+    out = UInt8[]


Why not use an IOBuffer?

This is significantly faster, readuntil uses the same approach.

nalimilan · 2017-01-06T10:16:00Z

base/io.jl

@@ -448,7 +456,50 @@ function readuntil(s::IO, t::AbstractString)
 end

 readline() = readline(STDIN)
-readline(s::IO) = readuntil(s, '\n')
+
+function readline(s::IO, chomp::Bool = false)


I think the common style is not to have spaces around =. Same below.

nalimilan · 2017-01-06T10:25:14Z

base/io.jl

+    while !eof(s)
+        c = read(s, Char)
+        if c in newlines
+            if c == '\r' && !eof(s) && Base.peek(s) == 0x0a


Use == '\n', comparison with integers is deprecated. This will be as efficient anyway.

EDIT: Nevermind, I misread that code, and looks like peekchar doesn't support all IO subtypes.

nalimilan · 2017-01-06T10:44:29Z

I managed to make it faster by only considering \r, \r\n and \n as newlines. This still makes ~2x as many allocations as the old version. I would appreciate any hints how to make it faster.

Not sure why allocations would happen. Use the --track-allocation command line argument to see where they come from.

Once you manage to get it fast for \n and\r, I think a strategy to support checking for non-ASCII character separators would be to have a branch for checking the first byte, and only then check for the second byte. Anyway that code assumes the input is UTF-8, so you can take advantage of this.

mpastell · 2017-01-06T14:33:11Z

I realized that the original version calls readuntil from "iostream.jl", which uses ccall

function readuntil(s::IOStream, delim::UInt8)
    ccall(:jl_readuntil, Array{UInt8,1}, (Ptr{Void}, UInt8), s.ios, delim)
end

So it looks the current state is as fast as I can make it without implementing it in C, which I don't think I have time do right now.

Should I implement the Julia version anyway? It would mean that readlines becomes 2x slower than before.

nalimilan · 2017-01-06T14:55:36Z

Ah, right. Unfortunately it's probably unacceptable to make such an essential function twice slower. Though you could adapt the C code without too much work. Actually there's already a jl_readline C function which calls ios_copyuntil. You could copy that code and adapt it to look for multiple characters (using strpbrk instead of memchr, and checking second bytes manually when needed).

mpastell · 2017-01-06T15:27:20Z

Yes, that seems like the right way to go. I will look into it, but my C is a bit rusty and I have limited amount of time so it can take a while.

mpastell · 2017-01-07T15:40:26Z

Modifying the C code was quite simple I managed to implement implement readline in C, the performance is now same as the old version.

tkelman · 2017-01-07T15:46:22Z

src/support/ios.c

+            if (ch == '\r'){
+                r = p;
+                if (chomp) nchomp = 1;
+                ntowrite = r - (from->buf+from->bpos) + 1 - nchomp;    


trailing whitespace here and on L854

nalimilan · 2017-01-07T15:55:49Z

base/io.jl

-    EachLine(stream, ondone) = new(stream, ondone)
+    chomp::Bool
+    EachLine(stream, chomp) = EachLine(stream, ()->nothing, chomp)
+    EachLine(stream, ondone, chomp) = new(stream, ondone, chomp)


Maybe have ondone be the last field/argument?

nalimilan · 2017-01-07T15:56:16Z

base/io.jl

@@ -448,7 +456,29 @@ function readuntil(s::IO, t::AbstractString)
 end

 readline() = readline(STDIN)
-readline(s::IO) = readuntil(s, '\n')
+
+function readline(s::IO, chomp::Bool=false)


This is no longer needed now, right?

It's still needed. C version reads from IOStreams, but not for instance IOBuffers.

Ah, so this means it has to be relatively efficient too. Is that the case?

I'd say its relatively efficient and anyway it has the same performance as before.

nalimilan · 2017-01-07T15:56:40Z

base/io.jl

@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re
 the end of the input), from the given I/O stream or file (defaults to `STDIN`).
 When reading from a file, the text is assumed to be encoded in UTF-8.
 """
-readline(filename::AbstractString) = open(readline, filename)
+function readline(filename::AbstractString, chomp = false)


Still need to mention chomp in docstrings.

nalimilan · 2017-01-07T15:58:16Z

src/support/ios.c

@@ -831,6 +831,66 @@ size_t ios_copyuntil(ios_t *to, ios_t *from, char delim)
    return total;
 }

+//Copy until '\r', '\n' or '\r\n'
+size_t ios_copyline(ios_t *to, ios_t *from, int chomp)


Use bool_t? Also, space after //.

Existing code uses int for Julia Bools

OK, it's just that I found a bool_t in the same file, but I have no idea what's the most common convention.

nalimilan · 2017-01-07T16:01:05Z

src/support/ios.c

+        char *r = NULL;
+        char *n = NULL;
+
+        for (size_t i = 0; i < avail; i++){


Wouldn't it be simpler and possibly more efficient to use strpbrk? Then you just need to check the next byte.

That's the first thing I tried and it's simpler and significantly slower.

You have to look for both \r and \n in available bytes, but you only want to break on the first one found. e.g. If you have a large file and you look for \r both in avail (using memch) and never find it the code gets very slow.

This has the same performance as readuntil which uses memchr.

You have to look for both \r and \n in available bytes, but you only want to break on the first one found. e.g. If you have a large file and you look for \r both in avail (using memch) and never find it the code gets very slow.

I don't follow. What difference does it make from the manual loop? I just suggest breaking on the first match. Then you only have to call strpbrk again if there's a \r character without a following \n, which shouldn't be too frequent and isn't worth optimizing for.

This has the same performance as readuntil which uses memchr.

Even, for relatively long strings? I'm asking because it looks like glibc contains hand-written assembly code here and explicit SSE instructions here for this, so I figure there must be a reason.

Actually I tried using two calls to memchr as the input is not a string, but a buffer. I guess I could cast it to a string and use strpbrk.

Most of the time in readlines is anyway spent in Julia (I think in allocating the String array)

OK. Anyway I've just realized strpbrk requires null-terminated strings, and there doesn't seem to exist any variant taking the string length.

nalimilan · 2017-01-07T16:01:32Z

Great. Would checking for non-ASCII newline characters incur a significant slowdown? Though it could easily be added later without breaking too much existing code.

mpastell · 2017-01-07T17:57:36Z

I don't think checking for non-ASCII characters would make a noticeable difference.

I think I will try to finish this pull request with *lines and chomp working for \r, \n and \r\n which would give Julia the same behavior as most other languages have and other newlines can be added later if there is a need.

mpastell · 2017-01-07T23:05:18Z

Many thanks to all for all the help and reviews! I think this starts be ready to merge after final comments and tweaks.

nalimilan · 2017-01-08T10:58:20Z

base/io.jl

 end

 """
-    eachline(stream::IO)
-    eachline(filename::AbstractString)
+    eachline(stream::IO,  chomp::Bool=false)


One space is enough.

nalimilan · 2017-01-08T11:01:29Z

base/io.jl

+    readline(stream::IO=STDIN, chomp::Bool=false)
+    readline(filename::AbstractString, chomp::Bool=false)
+
+Read a single line of text including from the given I/O stream or file (defaults to `STDIN`).


Remove "including" (typo?).

nalimilan · 2017-01-08T11:30:54Z

src/support/ios.c

+        char *r = NULL;
+        char *n = NULL;
+
+        for (size_t i = 0; i < avail; i++){


OK. Anyway I've just realized strpbrk requires null-terminated strings, and there doesn't seem to exist any variant taking the string length.

nalimilan

Thanks. This is almost ready AFAICT. Though you'll have to rebase this since #19449 has just been merged (which is a good thing since we had to wait for it anyway).

Then I'd like other people to comment too. @StefanKarpinski, @stevengj?

nalimilan · 2017-01-08T11:31:42Z

src/support/ios.c

+                ntowrite = n - (from->buf+from->bpos) + 1 - nchomp;
+                break;
+            }
+            if (ch == '\r'){


Missing else.

nalimilan · 2017-01-08T11:32:06Z

src/support/ios.c

+            char *p = (char*)from->buf+from->bpos+i;
+            char ch = from->buf[from->bpos+i];
+
+            if (ch == '\n'){


Space before { here and elsewhere.

nalimilan · 2017-01-08T11:35:59Z

src/support/ios.c

+                if (chomp) nchomp = 1;
+                ntowrite = r - (from->buf+from->bpos) + 1 - nchomp;
+                if (i < avail){
+                    char ch2 = from->buf[from->bpos+i+1];


You can combine the two if and get rid of the intermediate ch2 variable.

nalimilan · 2017-01-08T11:39:42Z

test/iobuffer.jl

+write(io,"pancakes\r\nwaffles\n\rblueberries\r")
+@test readlines(io) == String["pancakes\r\n","waffles\n","\r","blueberries\r"]
+write(io,"pancakes\r\nwaffles\n\rblueberries\r")
+@test readlines(io, true) == String["pancakes","waffles","","blueberries"]


Could you also add tests for potentially problematic cases like \r\n\r\n, \n\r\n, \r\r\n, etc.? Same for chomp.

nalimilan · 2017-01-08T11:43:35Z

base/strings/util.jl

 function chomp(s::String)
    i = endof(s)
-    if i < 1 || s.data[i] != 0x0a
+    if i < 1 || (s.data[i] != 0x0a && s.data[i] != 0x0d)


Since you're making the function more complex, I would find it clearer to replace the integer codes with UInt8('\n') and UInt8('\r'). Same in readline.

nalimilan · 2017-01-08T11:44:27Z

base/strings/util.jl

 """
 function chomp(s::AbstractString)
    i = endof(s)
-    if (i < 1 || s[i] != '\n') return SubString(s, 1, i) end
+    if (i < 1 || (s[i] != '\n' &&  s[i] != '\r')) return SubString(s, 1, i) end


Double space here.

nalimilan · 2017-01-08T11:51:25Z

base/strings/util.jl

    j = prevind(s,i)
-    if (j < 1 || s[j] != '\r') return SubString(s, 1, i-1) end
+    if (j < 1 || (s[j] != '\r' && s[i] == '\n')) return SubString(s, 1, i-1) end


You don't need to check that the s[i] == '\n' here since it's implied by the previous checks. So just put this code under an else of s[i] == '\r' for clarity, with a comment like # s[i] == '\n'.

nalimilan · 2017-01-08T15:03:34Z

Actually, I've just had a deeper look at what Rust and Swift do, and they seem to adopt a simpler approach than here: they only break on \n and \r\n, nor on \r. This can be implemented more efficiently since you only need to look for \n, and then check whether the previous character is \r.

There's a comment in the Swift source saying that they'd like to support more newline characters, but so far I guess they couldn't find an efficient approach, or didn't consider it was worth it:
https://github.com/apple/swift/blob/74d03b9dd94b862602b975682b30f53f416b219a/stdlib/public/core/InputStream.swift#L43
https://bugs.swift.org/browse/SR-1280

Rust's implementation is here:
https://github.com/rust-lang/rust/blob/cbf88730e755d099c854f84dd0f1990490bf0088/src/libstd/io/mod.rs#L1775-L1790

Both languages use an interesting implementation which we might be able to adopt: they use the general (equivalent of the) readuntil function looking for \n, and then truncate the resulting string to remove \n and \r. (Swift even uses getline on POSIX systems to replace readuntil, not sure what's the advantage over a manual implementation calling memchr.).

Overall the benefits of only supporting CR and CR+LF are quite clear. OTOH, supporting exotic newline sequences may not be worth it given that according to Wikipedia other software do not seem to care, and that OSes using other conventions are really old.

mpastell · 2017-01-08T19:24:29Z

Do you mean only support CRLF and LF? There is very little extra cost in handling CR too and Python for one supports it. I will do some benchmarks to show how this PR performs in comparison to readuntil.

It's quite easy to get trailing CR left in files when processing things with CRLF and this approach helps dealing with that. We still also have some instruments in the lab that use CR as terminator.

stevengj · 2017-01-08T20:08:10Z

It sounds like LF and CRLF could be supported in much less code, because it could just use readuntil. I doubt that other line endings are worth the additional code complexity, even if there is no performance penalty.

mpastell · 2017-01-08T21:02:48Z

I think breaking on CR in a utf-8 encoded string is the right thing to do (according to the standard), but I'm not sure how significant the issue is in practise.

If you think the old behavior using readuntil already works as it should there is probably no point in working further on this PR.

However, If you anyway wan't to add the chomp option as implemented here then the complexity of supporting also \r is only few extra if statements.

stevengj · 2017-01-08T21:36:55Z

The old behavior using readuntil didn't handle CRLF, no? @nalimilan's suggestion was to readuntil \n, and then check whether the preceding character was \r, which we weren't doing before.

I agree that the chomp option was useful. (I almost think it should be the default, though that is a breaking change.)

mpastell · 2017-01-09T06:10:40Z

The old behavior handles CRLF as well, because it always breaks after \n. Further this also currently works:

julia> chomp.(readlines(IOBuffer("hello\r\njulia\r\n")))
2-element Array{SubString{String},1}:
 "hello"
 "julia"

I was just wondering what you mean by additional complexity. I think if we just want to add chomp option that could be done efficiently without adding C-code to master. A good approach could be to do:

julia> readuntil(IOBuffer("hello\r\njulia\r\n"), 0x0a)
7-element Array{UInt8,1}:
 0x68
 0x65
 0x6c
 0x6c
 0x6f
 0x0d
 0x0a

And check for the two last bytes before calling String.

nalimilan · 2017-01-09T08:38:58Z

What @stevengj and I refer to by "code complexity" is the need for special methods instead of simply using readuntil(s, '\n') and then stripping the trailing '\r' if present. So basically this suggests taking the approach you describe in your last comment (sorry for not noticing before you didn't have to work with the C code).

mpastell · 2017-01-09T16:04:31Z

I created a new pull request (#19944) based on the discussion. It shows that performance of chomp=true could still be improved (because readlines for Streams was improved in #19449) using custom C code for chomp.

Make readline etc. more general

81d7c38

nalimilan reviewed Jan 5, 2017

View reviewed changes

quinnj reviewed Jan 5, 2017

View reviewed changes

kshyatt added needs docs Documentation for this change is required needs tests Unit tests are required for this change strings "Strings!" labels Jan 5, 2017

Make readlines faster and less general

a04d81b

nalimilan reviewed Jan 6, 2017

View reviewed changes

Implement readline in C

fb30c92

tkelman reviewed Jan 7, 2017

View reviewed changes

nalimilan reviewed Jan 7, 2017

View reviewed changes

mpastell added 3 commits January 7, 2017 20:05

Minor fixes according to comments

80c94b8

Add docstrings for readline

e3cfa28

Make chomp remove \r, update docstrings

cffdf0d

mpastell added 4 commits January 8, 2017 00:25

Tweak docstring

603baae

Add test for readlines(s::IO)

7e9cedd

Add tests for readlines and chomp

bb7475e

Fix trailing whitespace

3538aff

tkelman removed needs docs Documentation for this change is required needs tests Unit tests are required for this change labels Jan 7, 2017

mpastell changed the title ~~WIP: Make EOL handling more general~~ RFC: Make EOL handling more general Jan 8, 2017

mpastell mentioned this pull request Jan 8, 2017

Julep: Support universal newlines and make it default for Text IO #19785

Open

nalimilan reviewed Jan 8, 2017

View reviewed changes

mpastell mentioned this pull request Jan 9, 2017

Add chomp option to readline(s) #19944

Closed

mpastell closed this Jan 10, 2017

RFC: Make EOL handling more general #19877

RFC: Make EOL handling more general #19877

Conversation

mpastell commented Jan 5, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mpastell Jan 5, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mpastell commented Jan 5, 2017

mpastell commented Jan 5, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nalimilan Jan 6, 2017 • edited Loading

Choose a reason for hiding this comment

nalimilan commented Jan 6, 2017

mpastell commented Jan 6, 2017

nalimilan commented Jan 6, 2017

mpastell commented Jan 6, 2017

mpastell commented Jan 7, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nalimilan Jan 7, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nalimilan commented Jan 7, 2017

mpastell commented Jan 7, 2017

mpastell commented Jan 7, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nalimilan commented Jan 8, 2017

mpastell commented Jan 8, 2017

stevengj commented Jan 8, 2017

mpastell commented Jan 8, 2017 • edited Loading

stevengj commented Jan 8, 2017 • edited Loading

mpastell commented Jan 9, 2017

nalimilan commented Jan 9, 2017

mpastell commented Jan 9, 2017

mpastell commented Jan 5, 2017 •

edited

Loading

mpastell Jan 5, 2017 •

edited

Loading

mpastell commented Jan 5, 2017 •

edited

Loading

nalimilan Jan 6, 2017 •

edited

Loading

nalimilan Jan 7, 2017 •

edited

Loading

mpastell commented Jan 8, 2017 •

edited

Loading

stevengj commented Jan 8, 2017 •

edited

Loading