wip: string overhaul #24439

Closed
wants to merge 39 commits into from
39 commits
1917811
strings: some cosmetic tweaks
StefanKarpinski Nov 10, 2017
80cb480
`Char` representation: UTF-8 bytes with most significant padding
StefanKarpinski Nov 3, 2017
40ac089
`Char` representation: UTF-8 bytes with least significant padding
StefanKarpinski Nov 3, 2017
8ba3bba
iswellformed(c::Char) to test if `c` represents a code point
StefanKarpinski Nov 9, 2017
2aa2c8c
remove internal chomp! function
StefanKarpinski Nov 11, 2017
806441e
convert(String, ::Vector{Char}): don't normalize surrogate pairs
StefanKarpinski Nov 11, 2017
5ab5eef
delete unused `unescape_chars` function
StefanKarpinski Nov 12, 2017
08656a3
malformed chars are always grapheme breaks
StefanKarpinski Nov 13, 2017
b8f7306
iswellformed => !ismalformed; make test stricter
StefanKarpinski Nov 13, 2017
747ce23
slightly more efficient character checking and decoding
StefanKarpinski Nov 13, 2017
d1e83e8
wip
StefanKarpinski Nov 15, 2017
1972d46
wip [ci skip]
StefanKarpinski Nov 15, 2017
0024056
wip
StefanKarpinski Nov 27, 2017
a84e666
wip
StefanKarpinski Nov 27, 2017
358ce5d
Revert "wip"
StefanKarpinski Nov 28, 2017
912779e
wip
StefanKarpinski Nov 29, 2017
cbbee08
wip
StefanKarpinski Nov 29, 2017
8a22a96
wip
StefanKarpinski Nov 29, 2017
829aba2
wip
StefanKarpinski Nov 29, 2017
b2d231b
wip
StefanKarpinski Nov 29, 2017
f82c793
wip
StefanKarpinski Nov 29, 2017
c55cca0
wip
StefanKarpinski Nov 29, 2017
68467ad
my dirty laundry, you filthy voyeur [ci skip]
StefanKarpinski Nov 29, 2017
29dc1c3
wip [ci skip]
StefanKarpinski Dec 5, 2017
d68eb07
wip: added a doc string for AbstractString
StefanKarpinski Dec 6, 2017
2fadfb0
wip
StefanKarpinski Dec 7, 2017
5aad731
wip
StefanKarpinski Dec 7, 2017
cad41c5
wip
StefanKarpinski Dec 7, 2017
f849593
wip
StefanKarpinski Dec 7, 2017
61dbb90
wip
StefanKarpinski Dec 7, 2017
931b289
wip
StefanKarpinski Dec 8, 2017
0794487
wip
StefanKarpinski Dec 8, 2017
b674bc1
wip
StefanKarpinski Dec 8, 2017
b802606
wip
StefanKarpinski Dec 8, 2017
8d414fd
fix [ci skip]
StefanKarpinski Dec 8, 2017
120f9ca
docstring typo fix [ci skip]
StefanKarpinski Dec 8, 2017
1861238
test for more method errors [ci skip]
StefanKarpinski Dec 8, 2017
1c722f1
cosmetic tweaks
StefanKarpinski Dec 8, 2017
e54e4c0
wip
StefanKarpinski Dec 8, 2017
86 changes: 75 additions & 11 deletions base/char.jl
@@ -1,8 +1,58 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

convert(::Type{Char}, x::UInt32) = reinterpret(Char, x)
struct MalformedCharError <: Exception
char::Char
end
struct CodePointError <: Exception
code::Integer
end
@noinline malformed_char(c::Char) = throw(MalformedCharError(c))
@noinline code_point_err(u::UInt32) = throw(CodePointError(u))

function ismalformed(c::Char)
u = reinterpret(UInt32, c)
l1 = leading_ones(u) << 3
t0 = trailing_zeros(u) & 56
(l1 == 8) | (l1 + t0 > 32) |
(((u & 0x00c0c0c0) ⊻ 0x00808080) >> t0 != 0)
end

function convert(::Type{UInt32}, c::Char)
# TODO: use optimized inline LLVM
u = reinterpret(UInt32, c)
u < 0x80000000 && return reinterpret(UInt32, u >> 24)
l1 = leading_ones(u)
t0 = trailing_zeros(u) & 56
(l1 == 1) | (8l1 + t0 > 32) |
(((u & 0x00c0c0c0) ⊻ 0x00808080) >> t0 != 0) &&
malformed_char(c)::Union{}
u &= 0xffffffff >> l1
u >>= t0
(u & 0x0000007f >> 0) | (u & 0x00007f00 >> 2) |
(u & 0x007f0000 >> 4) | (u & 0x7f000000 >> 6)
end

function convert(::Type{Char}, u::UInt32)
u < 0x80 && return reinterpret(Char, u << 24)
u < 0x00200000 || code_point_err(u)::Union{}
c = ((u << 0) & 0x0000003f) | ((u << 2) & 0x00003f00) |
((u << 4) & 0x003f0000) | ((u << 6) & 0x3f000000)
c = u < 0x00000800 ? (c << 16) | 0xc0800000 :
u < 0x00010000 ? (c << 08) | 0xe0808000 :
(c << 00) | 0xf0808080
reinterpret(Char, c)
end

function convert(::Type{T}, c::Char) where T <: Union{Int8,UInt8}
i = reinterpret(Int32, c)
i ≥ 0 ? ((i >>> 24) % T) : T(UInt32(c))
end

function convert(::Type{Char}, b::Union{Int8,UInt8})
0 ≤ b ≤ 0x7f ? reinterpret(Char, (b % UInt32) << 24) : Char(UInt32(b))
end

convert(::Type{Char}, x::Number) = Char(UInt32(x))
convert(::Type{UInt32}, x::Char) = reinterpret(UInt32, x)
convert(::Type{T}, x::Char) where {T<:Number} = convert(T, UInt32(x))

rem(x::Char, ::Type{T}) where {T<:Number} = rem(UInt32(x), T)
@@ -29,11 +79,9 @@ done(c::Char, state) = state
isempty(c::Char) = false
in(x::Char, y::Char) = x == y

==(x::Char, y::Char) = UInt32(x) == UInt32(y)
isless(x::Char, y::Char) = UInt32(x) < UInt32(y)

const hashchar_seed = 0xd4d64234
hash(x::Char, h::UInt) = hash_uint64(((UInt64(x)+hashchar_seed)<<32) ⊻ UInt64(h))
==(x::Char, y::Char) = reinterpret(UInt32, x) == reinterpret(UInt32, y)
isless(x::Char, y::Char) = reinterpret(UInt32, x) < reinterpret(UInt32, y)
hash(x::Char, h::UInt) = hash(reinterpret(UInt32, x), hash(Char, h))
Member
According to BenchmarkTools this is 5 times slower than the old hash(x::Char). Should we use

hash(x::Char, h::UInt) = hash_uint64(((reinterpret(UInt32, x)+UInt64(hashchar_seed))<<32) ⊻ UInt64(h))

or similar instead?

Member Author
Perhaps. I worry about that xor being symmetric. I also worry that we're using this kind of pattern all over Base and it is quite inefficient. We really need a better way to express this.

Member Author
Went with your suggested definition for now. If there's some hash collision issue with the existing definition, that's a totally independent issue to this representation change so can be addressed separately.
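
As a rough illustration of what this thread is comparing, here is a minimal sketch. It assumes the new representation (a `Char` reinterprets to its UTF-8 bytes, left-aligned in a `UInt32`) and a recent Julia 1.x. The names `rawbits` and `char_hash` are hypothetical and exist only for this example; the body of `char_hash` mirrors the one-line definition in the diff above and is not necessarily what was merged.

```julia
# Hypothetical helper: the raw 32-bit payload of a Char, with no decoding.
rawbits(c::Char) = reinterpret(UInt32, c)

# Same shape as the one-line definition in the diff: hash the raw bits and
# fold the Char type into the seed.
char_hash(c::Char, h::UInt) = hash(rawbits(c), hash(Char, h))

# Well-formed characters keep their UTF-8 bytes in the high-order positions.
@assert rawbits('a') == 0x61000000   # 1 byte:  61
@assert rawbits('é') == 0xc3a90000   # 2 bytes: c3 a9
@assert rawbits('€') == 0xe282ac00   # 3 bytes: e2 82 ac

# Comparing raw bits orders well-formed characters by code point.
@assert rawbits('a') < rawbits('é') < rawbits('€')

# Equal characters hash equally for the same seed.
@assert char_hash('€', UInt(0)) == char_hash('€', UInt(0))
```

Because equality, ordering, and hashing all work on the raw bits, none of them has to decode the character first, which is the property the representation change relies on.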


-(x::Char, y::Char) = Int(x) - Int(y)
-(x::Char, y::Integer) = Char(Int32(x) - Int32(y))
@@ -66,21 +114,37 @@ function show(io::IO, c::Char)
end
if isprint(c)
write(io, 0x27, c, 0x27)
else
elseif !ismalformed(c)
u = UInt32(c)
write(io, 0x27, 0x5c, c <= '\x7f' ? 0x78 : c <= '\uffff' ? 0x75 : 0x55)
d = max(2, 8 - (leading_zeros(u) >> 2))
while 0 < d
write(io, hex_chars[((u >> ((d -= 1) << 2)) & 0xf) + 1])
end
write(io, 0x27)
else # malformed
write(io, 0x27)
u = reinterpret(UInt32, c)
while true
a = hex_chars[((u >> 28) & 0xf) + 1]
b = hex_chars[((u >> 24) & 0xf) + 1]
write(io, 0x5c, 'x', a, b)
(u <<= 8) == 0 && break
end
write(io, 0x27)
end
return
end

function show(io::IO, ::MIME"text/plain", c::Char)
show(io, c)
u = UInt32(c)
print(io, ": ", isascii(c) ? "ASCII/" : "", "Unicode U+", hex(u, u > 0xffff ? 6 : 4))
print(io, " (category ", UTF8proc.category_abbrev(c), ": ", UTF8proc.category_string(c), ")")
if !ismalformed(c)
u = UInt32(c)
print(io, ": ", isascii(c) ? "ASCII/" : "", "Unicode U+", hex(u, u > 0xffff ? 6 : 4))
else
print(io, ": Malformed UTF-8")
end
abr = UTF8proc.category_abbrev(c)
str = UTF8proc.category_string(c)
print(io, " (category ", abr, ": ", str, ")")
end
20 changes: 20 additions & 0 deletions base/filesystem.jl
@@ -149,6 +149,26 @@ function read(f::File, ::Type{UInt8})
return ret % UInt8
end

function read(f::File, ::Type{Char})
Member
Why is function read(s::IO, ::Type{Char}) not sufficient here?

Member Author
This is annoying: IO doesn't support position and seek; File doesn't support mark and reset. That could be fixed instead, but this was the easiest way to get this working.

b0 = read(f, UInt8)
l = 8(4-leading_ones(b0))
c = UInt32(b0) << 24
if l < 24
s = 16
while s ≥ l && !eof(f)
p = position(f)
b = read(f, UInt8)
if b & 0xc0 != 0x80
seek(f, p)
break
end
c |= UInt32(b) << s
s -= 8
end
end
return reinterpret(Char, c)
end
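
The `position`/`seek` loop above can be tried outside of `File`. The following is a sketch only, assuming a recent Julia 1.x and using an `IOBuffer` as a stand-in for a seekable stream. `read_char_bits` is a hypothetical name, it returns the raw `UInt32` rather than a `Char`, and the `max(l, 0)` bound is an addition of this example that is not in the diff.

```julia
# Hypothetical sketch: read one character's worth of UTF-8 bytes from a
# seekable stream and return them left-aligned in a UInt32.
function read_char_bits(io::IOBuffer)
    b0 = read(io, UInt8)
    # Lowest shift a continuation byte may occupy for this lead byte:
    # 1-byte lead -> 32 (or 24), 2-byte -> 16, 3-byte -> 8, 4-byte -> 0.
    l = 8 * (4 - leading_ones(b0))
    c = UInt32(b0) << 24
    if l < 24
        s = 16
        # max(l, 0) is this example's own guard against invalid lead bytes.
        while s ≥ max(l, 0) && !eof(io)
            p = position(io)
            b = read(io, UInt8)
            if b & 0xc0 != 0x80      # not a continuation byte: put it back
                seek(io, p)
                break
            end
            c |= UInt32(b) << s
            s -= 8
        end
    end
    return c
end

io = IOBuffer(UInt8[0xe2, 0x82, 0xac, 0x61])   # "€a"
@assert read_char_bits(io) == 0xe282ac00
@assert read_char_bits(io) == 0x61000000

# A truncated sequence keeps only the bytes that were actually present.
bad = IOBuffer(UInt8[0xe2, 0x61])
@assert read_char_bits(bad) == 0xe2000000
@assert position(bad) == 1                     # the 'a' was not consumed
```

The truncated case shows why the byte has to be put back: the non-continuation byte belongs to the next character, so the reader rewinds instead of swallowing it.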

function unsafe_read(f::File, p::Ptr{UInt8}, nel::UInt)
check_open(f)
ret = ccall(:jl_fs_read, Int32, (Int32, Ptr{Void}, Csize_t),
4 changes: 2 additions & 2 deletions base/intfuncs.jl
@@ -654,8 +654,8 @@ for sym in (:bin, :oct, :dec, :hex)
@eval begin
($sym)(x::Unsigned, p::Int) = ($sym)(x,p,false)
($sym)(x::Unsigned) = ($sym)(x,1,false)
($sym)(x::Char, p::Int) = ($sym)(unsigned(x),p,false)
($sym)(x::Char) = ($sym)(unsigned(x),1,false)
($sym)(x::Char, p::Int) = ($sym)(UInt32(x),p,false)
($sym)(x::Char) = ($sym)(UInt32(x),1,false)
($sym)(x::Integer, p::Int) = ($sym)(unsigned(abs(x)),p,x<0)
($sym)(x::Integer) = ($sym)(unsigned(abs(x)),1,x<0)
end
66 changes: 25 additions & 41 deletions base/io.jl
@@ -432,25 +432,13 @@ function write(s::IO, a::SubArray{T,N,<:Array}) where {T,N}
end
end


function write(s::IO, ch::Char)
c = reinterpret(UInt32, ch)
if c < 0x80
return write(s, c%UInt8)
elseif c < 0x800
return (write(s, (( c >> 6 ) | 0xC0)%UInt8)) +
(write(s, (( c & 0x3F ) | 0x80)%UInt8))
elseif c < 0x10000
return (write(s, (( c >> 12 ) | 0xE0)%UInt8)) +
(write(s, (((c >> 6) & 0x3F ) | 0x80)%UInt8)) +
(write(s, (( c & 0x3F ) | 0x80)%UInt8))
elseif c < 0x110000
return (write(s, (( c >> 18 ) | 0xF0)%UInt8)) +
(write(s, (((c >> 12) & 0x3F ) | 0x80)%UInt8)) +
(write(s, (((c >> 6) & 0x3F ) | 0x80)%UInt8)) +
(write(s, (( c & 0x3F ) | 0x80)%UInt8))
else
return write(s, '\ufffd')
function write(io::IO, c::Char)
u = bswap(reinterpret(UInt32, c))
n = 1
while true
write(io, u % UInt8)
(u >>= 8) == 0 && return n
n += 1
end
end

@@ -493,31 +481,28 @@ function read!(s::IO, a::Array{T}) where T
return a
end

function read(s::IO, ::Type{Char})
ch = read(s, UInt8)
if ch < 0x80
return Char(ch)
end

# mimic utf8.next function
trailing = Base.utf8_trailing[ch+1]
c::UInt32 = 0
for j = 1:trailing
c += ch
c <<= 6
ch = read(s, UInt8)
function read(io::IO, ::Type{Char})
b0 = read(io, UInt8)
l = 8(4-leading_ones(b0))
c = UInt32(b0) << 24
if l < 24
s = 16
while s ≥ l && !eof(io)
peek(io) & 0xc0 == 0x80 || break
b = read(io, UInt8)
c |= UInt32(b) << s
s -= 8
end
end
c += ch
c -= Base.utf8_offset[trailing+1]
return Char(c)
return reinterpret(Char, c)
end

# readuntil_string is useful below since it has
# an optimized method for s::IOStream
readuntil_string(s::IO, delim::UInt8) = String(readuntil(s, delim))

function readuntil(s::IO, delim::Char)
if delim < Char(0x80)
if delim ≤ '\x7f'
return readuntil_string(s, delim % UInt8)
end
out = IOBuffer()
@@ -598,7 +583,7 @@ function readuntil(io::IO, target::AbstractString)
i = start(target)
done(target, i) && return ""
c, i = next(target, start(target))
if done(target, i) && c < Char(0x80)
if done(target, i) && c <= '\x7f'
return readuntil_string(io, c % UInt8)
end
# decide how we can index target
@@ -625,14 +610,13 @@ function readuntil(io::IO, target::AbstractVector{T}) where T
return out
end


"""
readchomp(x)

Read the entirety of `x` as a string and remove a single trailing newline.
Equivalent to `chomp!(read(x, String))`.
Read the entirety of `x` as a string and remove a single trailing newline
if there is one. Equivalent to `chomp(read(x, String))`.
"""
readchomp(x) = chomp!(read(x, String))
readchomp(x) = chomp(read(x, String))

# read up to nb bytes into nb, returning # bytes read

32 changes: 17 additions & 15 deletions base/iostream.jl
@@ -315,12 +315,13 @@ end

## low-level calls ##

write(s::IOStream, b::UInt8) = Int(ccall(:ios_putc, Cint, (Cint, Ptr{Void}), b, s.ios))
function write(s::IOStream, b::UInt8)
iswritable(s) || throw(ArgumentError("write failed, IOStream is not writeable"))
Int(ccall(:ios_putc, Cint, (Cint, Ptr{Void}), b, s.ios))
end

function unsafe_write(s::IOStream, p::Ptr{UInt8}, nb::UInt)
if !iswritable(s)
throw(ArgumentError("write failed, IOStream is not writeable"))
end
iswritable(s) || throw(ArgumentError("write failed, IOStream is not writeable"))
return Int(ccall(:ios_write, Csize_t, (Ptr{Void}, Ptr{Void}, Csize_t), s.ios, p, nb))
end

@@ -353,14 +354,6 @@ end

## text I/O ##

function write(s::IOStream, c::Char)
if !iswritable(s)
throw(ArgumentError("write failed, IOStream is not writeable"))
end
Int(ccall(:ios_pututf8, Cint, (Ptr{Void}, UInt32), s.ios, c))
end
read(s::IOStream, ::Type{Char}) = Char(ccall(:jl_getutf8, UInt32, (Ptr{Void},), s.ios))

take!(s::IOStream) =
ccall(:jl_take_buffer, Vector{UInt8}, (Ptr{Void},), s.ios)

@@ -452,14 +445,23 @@ function read(s::IOStream, nb::Integer; all::Bool=true)
end

## Character streams ##
const _chtmp = Ref{Char}()

function peekchar(s::IOStream)
if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{Char}), s, _chtmp) < 0
chref = Ref{UInt32}()
if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{UInt32}), s, chref) < 0
Member
current implementation of ios_peekutf8 also does not check for invalid sequences.

Member
@stevengj stevengj Nov 6, 2017
Seems a bit wasteful for ios_peekutf8 to convert the UTF-8 encoding to UInt32, and then to convert it back when returning Char(chref[]).

Since this seems to be the only function in all of Julia that calls ios_peekutf8, we should just rewrite ios_peekutf8 to return Char. ios_peekutf8 is also used in src/flisp, so we will need to keep that as-is and maybe write an ios_peekrawutf8 function to get Char.

Member Author
Agree, this stuff should be optimized to avoid converting back and forth. This PR is WIP and the first step is just to get everything working and all tests passing.

Member Author
@StefanKarpinski StefanKarpinski Dec 9, 2017
This is still in my new PR, which is annoying, but I think all the ios_ support code needs an overhaul in the 1.x timeframe. Ideally we would move most of the buffering code etc. into Julia, where it can understand the new Char representation and avoid the inefficiency here. Until then I think it's ok to just leave this as it is.

return typemax(Char)
end
return _chtmp[]
return Char(chref[])
end
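
To make the round trip discussed in the comments above concrete, here is a sketch of the two conversions between a code point and the padded UTF-8 `Char` bits. It assumes a recent Julia 1.x and follows the shape of the `convert` methods in the base/char.jl hunk. `encode_bits` and `decode_bits` are hypothetical names, the malformed-character and range checks are left out, and the masks in `decode_bits` are parenthesized explicitly because `>>` binds tighter than `&` in Julia.

```julia
# Hypothetical sketch: code point -> UTF-8 bytes left-aligned in a UInt32.
# Assumes u < 0x00200000; the range check from the diff is omitted.
function encode_bits(u::UInt32)
    u < 0x80 && return u << 24
    # Spread the payload into 6-bit groups, one per byte.
    c = ((u << 0) & 0x0000003f) | ((u << 2) & 0x00003f00) |
        ((u << 4) & 0x003f0000) | ((u << 6) & 0x3f000000)
    # Add lead/continuation markers and left-align by sequence length.
    return u < 0x00000800 ? (c << 16) | 0xc0800000 :
           u < 0x00010000 ? (c << 8)  | 0xe0808000 :
                            c         | 0xf0808080
end

# Hypothetical sketch: the inverse, for well-formed input only.
function decode_bits(c::UInt32)
    c < 0x80000000 && return c >> 24
    l1 = leading_ones(c)            # sequence length taken from the lead byte
    t0 = trailing_zeros(c) & 56     # padding, rounded down to whole bytes
    u = (c & (0xffffffff >> l1)) >> t0
    return ((u & 0x0000007f) >> 0) | ((u & 0x00007f00) >> 2) |
           ((u & 0x007f0000) >> 4) | ((u & 0x7f000000) >> 6)
end

for u in UInt32[0x41, 0xe9, 0x20ac, 0x1f600]
    @assert encode_bits(u) == reinterpret(UInt32, Char(u))
    @assert decode_bits(encode_bits(u)) == u
end
```

A C helper that hands back a decoded code point forces Julia to run something like `encode_bits` again, which is the inefficiency the comments describe.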

function peek(s::IOStream)
ccall(:ios_peekc, Cint, (Ptr{Void},), s)
end

function peek(s::IO)
mark(s)
try read(s, UInt8)
finally
reset(s)
end
end
4 changes: 2 additions & 2 deletions base/parse.jl
@@ -224,12 +224,12 @@ end
## string to float functions ##

tryparse(::Type{Float64}, s::String) = ccall(:jl_try_substrtod, Nullable{Float64}, (Ptr{UInt8},Csize_t,Csize_t), s, 0, sizeof(s))
tryparse(::Type{Float64}, s::SubString{String}) = ccall(:jl_try_substrtod, Nullable{Float64}, (Ptr{UInt8},Csize_t,Csize_t), s.string, s.offset, s.endof)
tryparse(::Type{Float64}, s::SubString{String}) = ccall(:jl_try_substrtod, Nullable{Float64}, (Ptr{UInt8},Csize_t,Csize_t), s.string, s.offset, s.ncodeunits)
tryparse_internal(::Type{Float64}, s::String, startpos::Int, endpos::Int) = ccall(:jl_try_substrtod, Nullable{Float64}, (Ptr{UInt8},Csize_t,Csize_t), s, startpos-1, endpos-startpos+1)
tryparse_internal(::Type{Float64}, s::SubString{String}, startpos::Int, endpos::Int) = ccall(:jl_try_substrtod, Nullable{Float64}, (Ptr{UInt8},Csize_t,Csize_t), s.string, s.offset+startpos-1, endpos-startpos+1)

tryparse(::Type{Float32}, s::String) = ccall(:jl_try_substrtof, Nullable{Float32}, (Ptr{UInt8},Csize_t,Csize_t), s, 0, sizeof(s))
tryparse(::Type{Float32}, s::SubString{String}) = ccall(:jl_try_substrtof, Nullable{Float32}, (Ptr{UInt8},Csize_t,Csize_t), s.string, s.offset, s.endof)
tryparse(::Type{Float32}, s::SubString{String}) = ccall(:jl_try_substrtof, Nullable{Float32}, (Ptr{UInt8},Csize_t,Csize_t), s.string, s.offset, s.ncodeunits)
tryparse_internal(::Type{Float32}, s::String, startpos::Int, endpos::Int) = ccall(:jl_try_substrtof, Nullable{Float32}, (Ptr{UInt8},Csize_t,Csize_t), s, startpos-1, endpos-startpos+1)
tryparse_internal(::Type{Float32}, s::SubString{String}, startpos::Int, endpos::Int) = ccall(:jl_try_substrtof, Nullable{Float32}, (Ptr{UInt8},Csize_t,Csize_t), s.string, s.offset+startpos-1, endpos-startpos+1)

8 changes: 6 additions & 2 deletions base/regex.jl
@@ -303,8 +303,12 @@ struct SubstitutionString{T<:AbstractString} <: AbstractString
string::T
end

endof(s::SubstitutionString) = endof(s.string)
next(s::SubstitutionString, idx::Int) = next(s.string, idx)
ncodeunits(s::SubstitutionString) = ncodeunits(s.string)
codeunit(s::SubstitutionString) = codeunit(s.string)
codeunit(s::SubstitutionString, i::Integer) = codeunit(s.string, i)
isvalid(s::SubstitutionString, i::Integer) = isvalid(s.string, i)
next(s::SubstitutionString, i::Integer) = next(s.string, i)

function show(io::IO, s::SubstitutionString)
print(io, "s")
show(io, s.string)
2 changes: 1 addition & 1 deletion base/repl/REPLCompletions.jl
@@ -106,7 +106,7 @@ const sorted_keywords = [
"primitive type", "quote", "return", "struct",
"true", "try", "using", "while"]

function complete_keyword(s::String)
function complete_keyword(s::Union{String,SubString{String}})
r = searchsorted(sorted_keywords, s)
i = first(r)
n = length(sorted_keywords)
8 changes: 8 additions & 0 deletions base/stream.jl
@@ -1148,6 +1148,14 @@ unmark(x::LibuvStream) = unmark(x.buffer)
reset(x::LibuvStream) = reset(x.buffer)
ismarked(x::LibuvStream) = ismarked(x.buffer)

function peek(s::LibuvStream)
Member
Why is peek(s::IO) not sufficient here?

Member Author
LibuvStream isn't defined when that method is defined; I don't recall if I tried making the definition here. In general, this IO stuff is a bit of a mess because we have Julia objects, Libuv objects, flisp/support objects, and raw system objects. This all needs to be significantly simplified.

mark(s)
try read(s, UInt8)
finally
reset(s)
end
end
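
The mark/read/reset pattern used by this method works on any stream that supports `mark` and `reset`. A minimal sketch, assuming a recent Julia 1.x and an `IOBuffer` as the stream; `peek_byte` is a hypothetical name chosen to avoid clashing with `Base.peek`.

```julia
# Hypothetical sketch: look at the next byte without consuming it.
function peek_byte(io::IO)
    mark(io)                 # remember the current position
    try
        return read(io, UInt8)
    finally
        reset(io)            # rewind to the mark, even if read throws
    end
end

io = IOBuffer("hi")
@assert peek_byte(io) == UInt8('h')
@assert position(io) == 0            # nothing was consumed
@assert read(io, UInt8) == UInt8('h')
@assert peek_byte(io) == UInt8('i')
```

The `finally` clause is what makes this safe at end of stream: if `read` throws, the stream is still rewound and unmarked before the error propagates.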

# BufferStream's are non-OS streams, backed by a regular IOBuffer
mutable struct BufferStream <: LibuvStream
buffer::IOBuffer