Add isvalid(Type, value) methods, to replace is_valid_* #11241

Merged
merged 1 commit into from
May 22, 2015
12 changes: 12 additions & 0 deletions NEWS.md
@@ -367,6 +367,13 @@ Deprecated or removed

 * Instead of `linrange`, use `linspace` ([#9666]).

+* The functions `is_valid_char`, `is_valid_ascii`, `is_valid_utf8`, `is_valid_utf16`, and
+`is_valid_utf32` have been replaced by generic `isvalid` methods.
+The single argument form `isvalid(value)` can now be used for values of type `Char`, `ASCIIString`,
+`UTF8String`, `UTF16String` and `UTF32String`.
+The two argument form `isvalid(type, value)` can be used with the above types, with values
+of type `Vector{UInt8}`, `Vector{UInt16}`, `Vector{UInt32}`, and `Vector{Char}` ([#11241]).
+
 Julia v0.3.0 Release Notes
 ==========================

@@ -1379,6 +1386,7 @@ Too numerous to mention.
[#9779]: https://github.com/JuliaLang/julia/issues/9779
[#9862]: https://github.com/JuliaLang/julia/issues/9862
[#9957]: https://github.com/JuliaLang/julia/issues/9957
[#10008]: https://github.com/JuliaLang/julia/issues/10008
[#10024]: https://github.com/JuliaLang/julia/issues/10024
[#10031]: https://github.com/JuliaLang/julia/issues/10031
[#10075]: https://github.com/JuliaLang/julia/issues/10075
@@ -1406,5 +1414,9 @@ Too numerous to mention.
[#10888]: https://github.com/JuliaLang/julia/issues/10888
[#10893]: https://github.com/JuliaLang/julia/issues/10893
[#10914]: https://github.com/JuliaLang/julia/issues/10914
[#10955]: https://github.com/JuliaLang/julia/issues/10955
[#10994]: https://github.com/JuliaLang/julia/issues/10994
[#11105]: https://github.com/JuliaLang/julia/issues/11105
[#11145]: https://github.com/JuliaLang/julia/issues/11145
[#11171]: https://github.com/JuliaLang/julia/issues/11171
[#11241]: https://github.com/JuliaLang/julia/issues/11241
2 changes: 1 addition & 1 deletion base/ascii.jl
@@ -100,7 +100,7 @@ ascii(x) = convert(ASCIIString, x)
 convert(::Type{ASCIIString}, s::ASCIIString) = s
 convert(::Type{ASCIIString}, s::UTF8String) = ascii(s.data)
 convert(::Type{ASCIIString}, a::Vector{UInt8}) = begin
-    is_valid_ascii(a) || throw(ArgumentError("invalid ASCII sequence"))
+    isvalid(ASCIIString,a) || throw(ArgumentError("invalid ASCII sequence"))
     return ASCIIString(a)
 end

14 changes: 14 additions & 0 deletions base/deprecated.jl
@@ -443,3 +443,17 @@ export float32_isvalid, float64_isvalid
 @deprecate (&)(x::Char, y::Char) Char(UInt32(x) & UInt32(y))
 @deprecate (|)(x::Char, y::Char) Char(UInt32(x) | UInt32(y))
 @deprecate ($)(x::Char, y::Char) Char(UInt32(x) $ UInt32(y))
+
+# 11241
+
+@deprecate is_valid_char(ch::Char) isvalid(ch)
+@deprecate is_valid_ascii(str::ASCIIString) isvalid(str)
+@deprecate is_valid_utf8(str::UTF8String) isvalid(str)
+@deprecate is_valid_utf16(str::UTF16String) isvalid(str)
+@deprecate is_valid_utf32(str::UTF32String) isvalid(str)
+
+@deprecate is_valid_char(ch) isvalid(Char, ch)
+@deprecate is_valid_ascii(str) isvalid(ASCIIString, str)
+@deprecate is_valid_utf8(str) isvalid(UTF8String, str)
+@deprecate is_valid_utf16(str) isvalid(UTF16String, str)
+@deprecate is_valid_utf32(str) isvalid(UTF32String, str)
5 changes: 0 additions & 5 deletions base/exports.jl
@@ -820,11 +820,6 @@ export
     ind2chr,
     info,
     is_assigned_char,
-    is_valid_ascii,
-    is_valid_char,
-    is_valid_utf8,
-    is_valid_utf16,
-    is_valid_utf32,
     isalnum,
     isalpha,
     isascii,
2 changes: 1 addition & 1 deletion base/io.jl
@@ -246,7 +246,7 @@ end

 function readall(s::IO)
     b = readbytes(s)
-    return is_valid_ascii(b) ? ASCIIString(b) : UTF8String(b)
+    return isvalid(ASCIIString, b) ? ASCIIString(b) : UTF8String(b)
 end
 readall(filename::AbstractString) = open(readall, filename)

4 changes: 2 additions & 2 deletions base/string.jl
@@ -968,8 +968,8 @@ byte_string_classify(s::ByteString) = byte_string_classify(s.data)
 # 1: valid ASCII
 # 2: valid UTF-8

-is_valid_ascii(s::Union(Array{UInt8,1},ByteString)) = byte_string_classify(s) == 1
-is_valid_utf8(s::Union(Array{UInt8,1},ByteString)) = byte_string_classify(s) != 0
+isvalid(::Type{ASCIIString}, s::Union(Array{UInt8,1},ByteString)) = byte_string_classify(s) == 1

Member

It strikes me that we could check for valid ASCII much faster than calling byte_string_classify. Something like:

```julia
isvalid(::Type{ASCIIString}, s::ByteString) = isvalid(ASCIIString, s.data)
function isvalid(::Type{ASCIIString}, s::Array{UInt8,1})
    for c in s; c >= 128 && return false; end
    return true
end
```
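
The byte-scan idea in the suggestion above can be tried on its own. The following is a minimal sketch, written in current Julia (1.x) syntax rather than the 0.4-era syntax used in this PR. `is_ascii_bytes` is a hypothetical name, chosen so that nothing in Base is extended; it is not part of this change.

```julia
# Hypothetical helper (not part of the PR): true if every byte is below 0x80.
function is_ascii_bytes(bytes::AbstractVector{UInt8})
    for b in bytes
        # Any byte with the high bit set cannot be ASCII, so stop early.
        b >= 0x80 && return false
    end
    return true
end

println(is_ascii_bytes(Vector{UInt8}("hello")))    # true
println(is_ascii_bytes(UInt8[0x68, 0xc3, 0xa9]))   # false: 0xc3 is a UTF-8 lead byte
println(is_ascii_bytes(UInt8[]))                   # true: an empty vector has no invalid byte
```

The early return means invalid input is rejected at the first offending byte, without classifying the rest of the buffer.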

Contributor Author

Yes, I'm quite aware that there are many more performance enhancements that can be done, all over the string and character handling code... however, I was told numerous times to keep PRs to single issues... I'd already planned on improving that, as soon as this and #11004 are merged in...

Member

Yes, this doesn't have to be in this PR.

Member

In general, I don't think you want to mix two or more major changes in the same PR. However, if in the course of a PR cleaning up some issue, you notice a minor (few line) improvement somewhere in the same functions, it's not usually a problem to combine that into the same PR.

Contributor Author

OK, as you well know by now, I'm still learning my way around the preferred way of contributing here! I was planning on making another round of performance optimizations, once this and #11004 are merged (hopefully!) in... using some of the stuff I've learned in the last 3 weeks...

Contributor Author

Please calm down!
I wasn't threatening ceasing contributing - I said that I wouldn't have as much time, which is simply the facts.
I also did NOT say that people haven't been responsive in general to me... and in this very thread I thanked Tony and you for the thorough review. I have been nothing but appreciative, in public and private, for the time people have spent giving constructive criticism, and have tried to respond as quickly as possible (and in fact, lost a lot of time due to Julia bugs, after having followed someone's suggestion to extend convert instead of having separate utfx_to_utfy functions)
Except for that issue, the code for #11004 hasn't changed in over a week.

I have been donating a lot of my time as well, trying to fix bugs in Julia and solve some rather severe performance issues (whether or not you believe that string handling performance is important is another issue).

I was just describing my situation... simply the facts... the client already gave up a week and a half ago on using Julia for a part of the project, because of issues with ODBC.jl being broken due to tupocalypse, string handling performance, and lack of decimal floating point support (even with DecFP.jl, that's very new, and is not integrated into JSON.jl nor ODBC.jl).
I've got a conference call first thing tomorrow morning, and I'm just worried that they will decide not to go ahead with using Julia for other parts of the project.
If they do (which is not up to me at all), then my involvement for now trying to improve Julia will necessarily drop down to what I can do in my spare time. That's NOT an ultimatum, just the sad situation I'm in... and I've been doing everything in my power to prevent that happening.
I'd much prefer to do nothing but help make Julia the No. 1 language for string/database processing, than be programming in C++ and Python...

About the technical note: by adding a new method to convert, would that be able to override all the code that uses the code in utf8.jl, utf16.jl, and utf32.jl for conversions?
I didn't realize that was possible - or even if possible, recommended (I've heard a lot about "monkey-patching" around here... a term I'd never come across before)

Member

Regarding the technical note: yes and no. utf16.jl currently defines convert(::Type{UTF16String}, s::AbstractString). If you define a more specialized/optimized conversion routine Base.convert(::Type{UTF16String}, s::UTF8String), for example, and a function in Base subsequently calls e.g. utf16(....some UTF8String....), then it will dispatch to your method. The exception would be if that function in Base has already been compiled and inlined the old conversion call prior to your new definition. This might have happened, for example, on Windows for a Win32 filesystem call that was already made while booting Julia, but of course the argument string conversion is negligible anyway for a filesystem call.

Adding new methods for more specific argument signatures is not the same thing as "monkey patching" in the Python sense. It's sort of analogous to adding a new subtype of an existing type and passing it to an existing function that now dispatches to the new subtype's method. Except that in Julia you don't have to add a new type to add new methods.
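
A small sketch of the dispatch behaviour described above, in current Julia (1.x) syntax. All names here (`AbstractText`, `Utf8Text`, `AsciiText`, `to_utf16`, `call_conversion`) are hypothetical stand-ins for `AbstractString`, the concrete string types, `convert`, and a caller in Base. The sketch does not try to reproduce the already-compiled caveat mentioned above.

```julia
# Hypothetical stand-ins for AbstractString and two concrete string types.
abstract type AbstractText end
struct Utf8Text  <: AbstractText; data::Vector{UInt8}; end
struct AsciiText <: AbstractText; data::Vector{UInt8}; end

# Generic fallback, playing the role of convert(::Type{UTF16String}, s::AbstractString).
to_utf16(s::AbstractText) = :generic

# A caller defined before any specialized method exists, like utf16(...) in Base.
call_conversion(s) = to_utf16(s)

before = call_conversion(Utf8Text(UInt8[0x61]))

# Add a method for a more specific signature; no new type is introduced.
to_utf16(s::Utf8Text) = :specialized

after_utf8  = call_conversion(Utf8Text(UInt8[0x61]))
after_ascii = call_conversion(AsciiText(UInt8[0x61]))

println((before, after_utf8, after_ascii))   # (:generic, :specialized, :generic)
```

The existing caller picks up the new method for `Utf8Text` without being edited, while other subtypes still reach the generic fallback.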

I wish you would stop with the comments to the effect that you are the only one who cares about string-handling performance. As I said in #11004, everyone cares, but we also realize that there are tradeoffs to code bloat for the sake of performance. Sometimes there is no alternative, but some data and effort are required to make that case.

Contributor Author

I never said that I was the only one who cares about string-handling performance, a number of people have already told me that they were very happy to see me trying to do something about it...
My point has been all along that a lot of people do care about the string-handling performance, which is why some attention should be paid to it... (I've also said, a number of times, that current poor performance wasn't because the Julia team did a bad job, but rather, it just hasn't been the focus of their time... the places where they have spent their time, is pretty darned remarkable, IMO).
I'm well aware of said trade-offs, I've spent quite a long time developing and maintaining a quite large code base... if you (or anybody) can point me to a better way of writing those conversion functions, so that Julia takes care of spitting out the specialized cases, that doesn't kill the performance gains, then I'm all for it. #11004 was my very first Julia code that wasn't just wrappers, and I've already got some ideas about how I might be able to condense it (although the bug in Julia with inference might have to get fixed first, and I have no idea [yet] how to accomplish that). The code is very simple, and has more inline comments than what was there before... so I don't understand the worries about maintainability.

About adding more specialized methods, is there any way of telling just where the old code might still be getting used? There is so much in Base, that it seems like quite a lot might already be going directly to the old code, no matter what I do, not just some Win32 calls.

Member

> About adding more specialized methods, is there any way of telling just where the old code might still be getting used? There is so much in Base, that it seems like quite a lot might already be going directly to the old code, no matter what I do, not just some Win32 calls.

Maybe just deprecating them? That should print a backtrace for every place the function is called from. Not sure about the interaction with precompilation, though, so you may want to (re)move the sysimage before starting Julia.

Member

Just profile your code. You don't care if the old (slower) methods are being used unless it is performance critical, so you only care about the routines where it is spending a lot of time. I'm skeptical that you will notice an issue, because Base only uses UTF-16 for calling Windows routines.

@nalimilan, deprecating them a posteriori won't help for methods that have already been compiled.
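
As a rough illustration of the "just profile your code" advice, here is a sketch using the `Profile` standard library of current Julia (1.x). The workload (`count_ascii_chunks` and `chunks`) is hypothetical and only there to give the sampler something to measure. The printed report depends on the machine and run, so no output is shown.

```julia
using Profile

# Hypothetical workload: count how many byte chunks are pure ASCII.
function count_ascii_chunks(chunks)
    n = 0
    for chunk in chunks
        n += all(b -> b < 0x80, chunk)   # a Bool adds as 0 or 1
    end
    return n
end

chunks = [fill(0x61, 100_000) for _ in 1:200]

count_ascii_chunks(chunks)   # warm-up call so compilation is not sampled
Profile.clear()              # drop any samples collected earlier
@profile for _ in 1:50
    count_ascii_chunks(chunks)
end

# Flat listing of sampled frames, ordered by sample count.
Profile.print(format = :flat, sortedby = :count)
```

Only the frames that collect a large share of the samples are worth specializing. A conversion routine that barely appears in the report is not performance critical for that workload.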

+isvalid(::Type{UTF8String}, s::Union(Array{UInt8,1},ByteString)) = byte_string_classify(s) != 0

 ## multiline strings ##

8 changes: 3 additions & 5 deletions base/utf16.jl
@@ -95,7 +95,7 @@ sizeof(s::UTF16String) = sizeof(s.data) - sizeof(UInt16)
 unsafe_convert{T<:Union(Int16,UInt16)}(::Type{Ptr{T}}, s::UTF16String) =
     convert(Ptr{T}, pointer(s))

-function is_valid_utf16(data::AbstractArray{UInt16})
+function isvalid(::Type{UTF16String}, data::AbstractArray{UInt16})
     i = 1
     n = length(data) # this may include NULL termination; that's okay
     while i < n # check for unpaired surrogates
@@ -110,10 +110,8 @@ function is_valid_utf16(data::AbstractArray{UInt16})
     return i > n || !utf16_is_surrogate(data[i])
 end

-is_valid_utf16(s::UTF16String) = is_valid_utf16(s.data)
-
 function convert(::Type{UTF16String}, data::AbstractVector{UInt16})
-    !is_valid_utf16(data) && throw(ArgumentError("invalid UTF16 data"))
+    !isvalid(UTF16String, data) && throw(ArgumentError("invalid UTF16 data"))
     len = length(data)
     d = Array(UInt16, len + 1)
     d[end] = 0 # NULL terminate
@@ -144,7 +142,7 @@ function convert(T::Type{UTF16String}, bytes::AbstractArray{UInt8})
         copy!(d,1, data,1, length(data)) # assume native byte order
     end
     d[end] = 0 # NULL terminate
-    !is_valid_utf16(d) && throw(ArgumentError("invalid UTF16 data"))
+    !isvalid(UTF16String, d) && throw(ArgumentError("invalid UTF16 data"))
     UTF16String(d)
 end

9 changes: 5 additions & 4 deletions base/utf32.jl
@@ -92,13 +92,14 @@ function convert(T::Type{UTF32String}, bytes::AbstractArray{UInt8})
     UTF32String(d)
 end

-function is_valid_utf32(s::Union(Vector{Char}, Vector{UInt32}))
-    for i=1:length(s)
-        @inbounds if !is_valid_char(reinterpret(UInt32, s[i])) ; return false ; end
+function isvalid(::Type{UTF32String}, str::Union(Vector{Char}, Vector{UInt32}))
+    for i=1:length(str)
+        @inbounds if !isvalid(Char, reinterpret(UInt32, str[i])) ; return false ; end
     end
     return true
 end
-is_valid_utf32(s::UTF32String) = is_valid_utf32(s.data)
+isvalid(str::Vector{Char}) = isvalid(UTF32String, str)
+isvalid{T<:Union(ASCIIString,UTF8String,UTF16String,UTF32String)}(str::T) = isvalid(T, str.data)

 utf32(p::Ptr{Char}, len::Integer) = utf32(pointer_to_array(p, len))
 utf32(p::Union(Ptr{UInt32}, Ptr{Int32}), len::Integer) = utf32(convert(Ptr{Char}, p), len)
2 changes: 1 addition & 1 deletion base/utf8.jl
@@ -212,7 +212,7 @@ write(io::IO, s::UTF8String) = write(io, s.data)
 utf8(x) = convert(UTF8String, x)
 convert(::Type{UTF8String}, s::UTF8String) = s
 convert(::Type{UTF8String}, s::ASCIIString) = UTF8String(s.data)
-convert(::Type{UTF8String}, a::Array{UInt8,1}) = is_valid_utf8(a) ? UTF8String(a) : throw(ArgumentError("invalid UTF-8 sequence"))
+convert(::Type{UTF8String}, a::Array{UInt8,1}) = isvalid(UTF8String, a) ? UTF8String(a) : throw(ArgumentError("invalid UTF-8 sequence"))
 function convert(::Type{UTF8String}, a::Array{UInt8,1}, invalids_as::AbstractString)
     l = length(a)
     idx = 1
12 changes: 7 additions & 5 deletions base/utf8proc.jl
@@ -3,19 +3,21 @@
 # Various Unicode functionality from the utf8proc library
 module UTF8proc

-import Base: show, showcompact, ==, hash, string, symbol, isless, length, eltype, start, next, done, convert
+import Base: show, showcompact, ==, hash, string, symbol, isless, length, eltype, start, next, done, convert, isvalid

 export isgraphemebreak

 # also exported by Base:
-export normalize_string, graphemes, is_valid_char, is_assigned_char, charwidth,
+export normalize_string, graphemes, is_assigned_char, charwidth, isvalid,
        islower, isupper, isalpha, isdigit, isnumber, isalnum,
        iscntrl, ispunct, isspace, isprint, isgraph, isblank

 # whether codepoints are valid Unicode scalar values, i.e. 0-0xd7ff, 0xe000-0x10ffff
-is_valid_char(ch::Unsigned) = !Bool((ch-0xd800<0x800)|(ch>0x10ffff))
-is_valid_char(ch::Integer) = is_valid_char(Unsigned(ch))
-is_valid_char(ch::Char) = is_valid_char(UInt32(ch))
+isvalid(::Type{Char}, ch::Unsigned) = !((ch - 0xd800 < 0x800) | (ch > 0x10ffff))
+isvalid(::Type{Char}, ch::Integer) = isvalid(Char, Unsigned(ch))
+isvalid(::Type{Char}, ch::Char) = isvalid(Char, UInt32(ch))
+
+isvalid(ch::Char) = isvalid(Char, ch)

 # utf8 category constants
 const UTF8PROC_CATEGORY_CN = 0
4 changes: 2 additions & 2 deletions doc/manual/strings.rst
@@ -99,14 +99,14 @@ convert an integer value back to a :obj:`Char` just as easily:
 Not all integer values are valid Unicode code points, but for
 performance, the :func:`Char` conversion does not check that every character
 value is valid. If you want to check that each converted value is a
-valid code point, use the :func:`is_valid_char` function:
+valid code point, use the :func:`isvalid` function:

 .. doctest::

     julia> Char(0x110000)
     '\U110000'

-    julia> is_valid_char(0x110000)
+    julia> isvalid(Char, 0x110000)
     false

 As of this writing, the valid Unicode code points are ``U+00`` through
22 changes: 10 additions & 12 deletions doc/stdlib/strings.rst
@@ -109,17 +109,19 @@
    even though they may contain more than one codepoint; for example
    a letter combined with an accent mark is a single grapheme.)

-.. function:: is_valid_ascii(s) -> Bool
+.. function:: isvalid(value) -> Bool

-   Returns true if the argument (``ASCIIString``, ``UTF8String``, or byte vector) is valid ASCII, false otherwise.
+   Returns true if the given value is valid for its type,
+   which currently can be one of ``Char``, ``ASCIIString``, ``UTF8String``, ``UTF16String``, or ``UTF32String``

-.. function:: is_valid_utf8(s) -> Bool
+.. function:: isvalid(T, value) -> Bool

-   Returns true if the argument (``ASCIIString``, ``UTF8String``, or byte vector) is valid UTF-8, false otherwise.
-
-.. function:: is_valid_char(c) -> Bool
-
-   Returns true if the given char or integer is a valid Unicode code point.
+   Returns true if the given value is valid for that type.
+   Types currently can be ``Char``, ``ASCIIString``, ``UTF8String``, ``UTF16String``, or ``UTF32String``
+   Values for ``Char`` can be of type ``Char`` or ``UInt32``
+   Values for ``ASCIIString`` and ``UTF8String`` can be of that type, or ``Vector{UInt8}``
+   Values for ``UTF16String`` can be ``UTF16String`` or ``Vector{UInt16}``
+   Values for ``UTF32String`` can be ``UTF32String``, ``Vector{Char}`` or ``Vector{UInt32}``

 .. function:: is_assigned_char(c) -> Bool

@@ -379,10 +381,6 @@

    Create a string from the address of a NUL-terminated UTF-16 string. A copy is made; the pointer can be safely freed. If ``length`` is specified, the string does not have to be NUL-terminated.

-.. function:: is_valid_utf16(s) -> Bool
-
-   Returns true if the argument (``UTF16String`` or ``UInt16`` array) is valid UTF-16.
-
 .. function:: utf32(s)

    Create a UTF-32 string from a byte array, array of ``UInt32``, or