-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Make EOL handling more general #19877
Conversation
@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re | |||
the end of the input), from the given I/O stream or file (defaults to `STDIN`). | |||
When reading from a file, the text is assumed to be encoded in UTF-8. | |||
""" | |||
readline(filename::AbstractString) = open(readline, filename) | |||
function readline(filename::AbstractString, chomp = false; nl2lf = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keyword argument currently have a significant performance cost: probably better make nl2lf
a positional argument.
I'm also not sure we need that argument. Does it exist in other languages? It would seem to me that either you just want to extract the text content of the line and don't care about newline characters (chomp=true
), or you care about them and you wouldn't want them to be replaced with \n
.
Finally, arguments must be described in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python has universal newlines (original suggestion in #19785), which does exactly the same as nl2lf
. It is convenient way to do ensure consistent EOL from the input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix the docstrings once we can agree on the implementation and arguments.
@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re | |||
the end of the input), from the given I/O stream or file (defaults to `STDIN`). | |||
When reading from a file, the text is assumed to be encoded in UTF-8. | |||
""" | |||
readline(filename::AbstractString) = open(readline, filename) | |||
function readline(filename::AbstractString, chomp = false; nl2lf = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not the greatest keyword argument name. I wonder if it'd be better to do something like:
const LINEFEEDS = ['\n', '\r', '\u85', '\u0B', '\u0c', '\u2028', '\u2029']
function readline(file, chomp; newlines::Vector{Char}=linefeeds)
...
This has the dual-benefit of making the linefeeds vector a const
, as well as allowing the user to specify a custom set of linefeed characters if desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that before this PR readline
was based on readuntil
. We could make the latter more general and still implement the former over it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nl2lf
has the effect of converting all linefeeds to \n
in output, it doesn't allow you to set which characters are considered as a newline. Adding newlines argument makes sense, but I'd like to keep to nl2lf
or something with a better name to do normalization of line ends conveniently and efficiently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to accept a newlines
argument to readline
if it can't be made efficient. People can use readuntil
for special needs.
@@ -448,7 +456,32 @@ function readuntil(s::IO, t::AbstractString) | |||
end | |||
|
|||
readline() = readline(STDIN) | |||
readline(s::IO) = readuntil(s, '\n') | |||
|
|||
function readline(s::IO, chomp = false; nl2lf=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be better (and more performant) to actually make readuntil
be able to accept a Vector{Char}
. The problem here is that readuntil
itself is creating an IOBuffer and take!
ing the output at the end, and this approach layers yet another IOBuffer on top of that.
The difficulty will be how you deal with \r\n
, since it it's doing a special case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, indeed \r\n
is probably hard to handle efficiently from readuntil
, so readline
probably still deserves a custom implementation. That would also allow for an efficient version working on bytes for standard UTF-8 String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is readuntil(s::IO, t::AbstractString)
, but I would expect this to be more efficient. It's modified from readuntil.
This is about 4x slower than the current implementation due to considering also |
I managed to make it faster by only considering I also dropped |
readline(s::IO) = readuntil(s, '\n') | ||
|
||
function readline(s::IO, chomp::Bool = false) | ||
out = UInt8[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use an IOBuffer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is significantly faster, readuntil uses the same approach.
@@ -448,7 +456,50 @@ function readuntil(s::IO, t::AbstractString) | |||
end | |||
|
|||
readline() = readline(STDIN) | |||
readline(s::IO) = readuntil(s, '\n') | |||
|
|||
function readline(s::IO, chomp::Bool = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the common style is not to have spaces around =
. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
while !eof(s) | ||
c = read(s, Char) | ||
if c in newlines | ||
if c == '\r' && !eof(s) && Base.peek(s) == 0x0a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use == '\n'
, comparison with integers is deprecated. This will be as efficient anyway.
EDIT: Nevermind, I misread that code, and looks like peekchar
doesn't support all IO
subtypes.
Not sure why allocations would happen. Use the Once you manage to get it fast for |
I realized that the original version calls
So it looks the current state is as fast as I can make it without implementing it in C, which I don't think I have time do right now. Should I implement the Julia version anyway? It would mean that |
Ah, right. Unfortunately it's probably unacceptable to make such an essential function twice slower. Though you could adapt the C code without too much work. Actually there's already a |
Yes, that seems like the right way to go. I will look into it, but my C is a bit rusty and I have limited amount of time so it can take a while. |
Modifying the C code was quite simple I managed to implement implement |
if (ch == '\r'){ | ||
r = p; | ||
if (chomp) nchomp = 1; | ||
ntowrite = r - (from->buf+from->bpos) + 1 - nchomp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trailing whitespace here and on L854
EachLine(stream, ondone) = new(stream, ondone) | ||
chomp::Bool | ||
EachLine(stream, chomp) = EachLine(stream, ()->nothing, chomp) | ||
EachLine(stream, ondone, chomp) = new(stream, ondone, chomp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe have ondone
be the last field/argument?
@@ -448,7 +456,29 @@ function readuntil(s::IO, t::AbstractString) | |||
end | |||
|
|||
readline() = readline(STDIN) | |||
readline(s::IO) = readuntil(s, '\n') | |||
|
|||
function readline(s::IO, chomp::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed now, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still needed. C version reads from IOStreams, but not for instance IOBuffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so this means it has to be relatively efficient too. Is that the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say its relatively efficient and anyway it has the same performance as before.
@@ -174,7 +174,12 @@ Read a single line of text, including a trailing newline character (if one is re | |||
the end of the input), from the given I/O stream or file (defaults to `STDIN`). | |||
When reading from a file, the text is assumed to be encoded in UTF-8. | |||
""" | |||
readline(filename::AbstractString) = open(readline, filename) | |||
function readline(filename::AbstractString, chomp = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to mention chomp
in docstrings.
@@ -831,6 +831,66 @@ size_t ios_copyuntil(ios_t *to, ios_t *from, char delim) | |||
return total; | |||
} | |||
|
|||
//Copy until '\r', '\n' or '\r\n' | |||
size_t ios_copyline(ios_t *to, ios_t *from, int chomp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use bool_t
? Also, space after //
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Existing code uses int
for Julia Bool
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, it's just that I found a bool_t
in the same file, but I have no idea what's the most common convention.
char *r = NULL; | ||
char *n = NULL; | ||
|
||
for (size_t i = 0; i < avail; i++){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be simpler and possibly more efficient to use strpbrk
? Then you just need to check the next byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the first thing I tried and it's simpler and significantly slower.
You have to look for both \r
and \n
in available bytes, but you only want to break on the first one found. e.g. If you have a large file and you look for \r
both in avail
(using memch) and never find it the code gets very slow.
This has the same performance as readuntil
which uses memchr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to look for both \r and \n in available bytes, but you only want to break on the first one found. e.g. If you have a large file and you look for \r both in avail (using memch) and never find it the code gets very slow.
I don't follow. What difference does it make from the manual loop? I just suggest breaking on the first match. Then you only have to call strpbrk
again if there's a \r
character without a following \n
, which shouldn't be too frequent and isn't worth optimizing for.
This has the same performance as readuntil which uses memchr.
Even, for relatively long strings? I'm asking because it looks like glibc contains hand-written assembly code here and explicit SSE instructions here for this, so I figure there must be a reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I tried using two calls to memchr
as the input is not a string, but a buffer. I guess I could cast it to a string and use strpbrk
.
Most of the time in readlines
is anyway spent in Julia (I think in allocating the String array)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Anyway I've just realized strpbrk
requires null-terminated strings, and there doesn't seem to exist any variant taking the string length.
Great. Would checking for non-ASCII newline characters incur a significant slowdown? Though it could easily be added later without breaking too much existing code. |
I don't think checking for non-ASCII characters would make a noticeable difference. I think I will try to finish this pull request with |
Many thanks to all for all the help and reviews! I think this starts be ready to merge after final comments and tweaks. |
end | ||
|
||
""" | ||
eachline(stream::IO) | ||
eachline(filename::AbstractString) | ||
eachline(stream::IO, chomp::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One space is enough.
readline(stream::IO=STDIN, chomp::Bool=false) | ||
readline(filename::AbstractString, chomp::Bool=false) | ||
|
||
Read a single line of text including from the given I/O stream or file (defaults to `STDIN`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove "including" (typo?).
char *r = NULL; | ||
char *n = NULL; | ||
|
||
for (size_t i = 0; i < avail; i++){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Anyway I've just realized strpbrk
requires null-terminated strings, and there doesn't seem to exist any variant taking the string length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This is almost ready AFAICT. Though you'll have to rebase this since #19449 has just been merged (which is a good thing since we had to wait for it anyway).
Then I'd like other people to comment too. @StefanKarpinski, @stevengj?
ntowrite = n - (from->buf+from->bpos) + 1 - nchomp; | ||
break; | ||
} | ||
if (ch == '\r'){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing else
.
char *p = (char*)from->buf+from->bpos+i; | ||
char ch = from->buf[from->bpos+i]; | ||
|
||
if (ch == '\n'){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space before {
here and elsewhere.
if (chomp) nchomp = 1; | ||
ntowrite = r - (from->buf+from->bpos) + 1 - nchomp; | ||
if (i < avail){ | ||
char ch2 = from->buf[from->bpos+i+1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can combine the two if
and get rid of the intermediate ch2
variable.
write(io,"pancakes\r\nwaffles\n\rblueberries\r") | ||
@test readlines(io) == String["pancakes\r\n","waffles\n","\r","blueberries\r"] | ||
write(io,"pancakes\r\nwaffles\n\rblueberries\r") | ||
@test readlines(io, true) == String["pancakes","waffles","","blueberries"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add tests for potentially problematic cases like \r\n\r\n
, \n\r\n
, \r\r\n
, etc.? Same for chomp
.
function chomp(s::String) | ||
i = endof(s) | ||
if i < 1 || s.data[i] != 0x0a | ||
if i < 1 || (s.data[i] != 0x0a && s.data[i] != 0x0d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're making the function more complex, I would find it clearer to replace the integer codes with UInt8('\n')
and UInt8('\r')
. Same in readline
.
""" | ||
function chomp(s::AbstractString) | ||
i = endof(s) | ||
if (i < 1 || s[i] != '\n') return SubString(s, 1, i) end | ||
if (i < 1 || (s[i] != '\n' && s[i] != '\r')) return SubString(s, 1, i) end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double space here.
j = prevind(s,i) | ||
if (j < 1 || s[j] != '\r') return SubString(s, 1, i-1) end | ||
if (j < 1 || (s[j] != '\r' && s[i] == '\n')) return SubString(s, 1, i-1) end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to check that the s[i] == '\n'
here since it's implied by the previous checks. So just put this code under an else
of s[i] == '\r'
for clarity, with a comment like # s[i] == '\n'
.
Actually, I've just had a deeper look at what Rust and Swift do, and they seem to adopt a simpler approach than here: they only break on There's a comment in the Swift source saying that they'd like to support more newline characters, but so far I guess they couldn't find an efficient approach, or didn't consider it was worth it: Rust's implementation is here: Both languages use an interesting implementation which we might be able to adopt: they use the general (equivalent of the) Overall the benefits of only supporting CR and CR+LF are quite clear. OTOH, supporting exotic newline sequences may not be worth it given that according to Wikipedia other software do not seem to care, and that OSes using other conventions are really old. |
Do you mean only support CRLF and LF? There is very little extra cost in handling CR too and Python for one supports it. I will do some benchmarks to show how this PR performs in comparison to readuntil. It's quite easy to get trailing CR left in files when processing things with CRLF and this approach helps dealing with that. We still also have some instruments in the lab that use CR as terminator. |
It sounds like LF and CRLF could be supported in much less code, because it could just use readuntil. I doubt that other line endings are worth the additional code complexity, even if there is no performance penalty. |
I think breaking on CR in a utf-8 encoded string is the right thing to do (according to the standard), but I'm not sure how significant the issue is in practise. If you think the old behavior using However, If you anyway wan't to add the |
The old behavior using I agree that the |
The old behavior handles CRLF as well, because it always breaks after julia> chomp.(readlines(IOBuffer("hello\r\njulia\r\n")))
2-element Array{SubString{String},1}:
"hello"
"julia" I was just wondering what you mean by additional complexity. I think if we just want to add chomp option that could be done efficiently without adding C-code to master. A good approach could be to do: julia> readuntil(IOBuffer("hello\r\njulia\r\n"), 0x0a)
7-element Array{UInt8,1}:
0x68
0x65
0x6c
0x6c
0x6f
0x0d
0x0a And check for the two last bytes before calling |
What @stevengj and I refer to by "code complexity" is the need for special methods instead of simply using |
This aims to solve #19785. I would like to get feedback if this is going to the right direction and can be merged when ready before working further on this.
['\n', '\r', '\r\n']
as line ends inreadline
,readlines
andeachline
readline
etc. to omit EOL (chomp::Bool=false
)