Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for invalid UTF8String in readall() #14545

Closed
wants to merge 1 commit into from

Conversation

samoconnor
Copy link
Contributor

Without this UnicodeError occurs some time later when the string is accessed.
Perhaps it should be an immediate error...

    isvalid(UTF8String, b) || throw(UnicodeError())
    return isvalid(ASCIIString, b) ? ASCIIString(b) : UTF8String(b)

But at least in my use case, it is more convenient to have readall() fall back to returning Array{UInt8} if the data isn't a string.

Without this UnicodeError occurs some time later when the string is accessed.
Perhaps it should be an immediate error...
 
    isvalid(UTF8String, b) || throw(UnicodeError())
    return isvalid(ASCIIString, b) ? ASCIIString(b) : UTF8String(b)

But at least in my use case, it is more convenient to have readall() fall back to returning Array{UInt8} if the data isn't a string.
@nalimilan
Copy link
Member

Makes sense, but I would use convert instead to avoid going over the string twice:

julia> convert(UTF8String, UInt8[0xd8])
ERROR: UnicodeError: invalid UTF-8 sequence starting at index 2 (0xd8) missing one or more continuation bytes)
 in __unsafe_checkstring#21__ at ./unicode/checkstring.jl:76
 in __unsafe_checkstring#19__ at ./unicode/checkstring.jl:66
 [inlined code] from ./unicode/checkstring.jl:66
 in convert at ./unicode/utf8.jl:243
 in eval at ./boot.jl:265

This method should probably be documented, and maybe even be the default, with a parameter to disable checks.

@stevengj What do you think about this?

@nalimilan
Copy link
Member

But at least in my use case, it is more convenient to have readall() fall back to returning Array{UInt8} if the data isn't a string.

Changing the return type depending on the contents of the stream is a no-go. But you can use readbytes if you know the data isn't a string.

@samoconnor
Copy link
Contributor Author

Changing the return type depending on the contents of the stream is a no-go. But you can use readbytes if you know the data isn't a string.

I understand that readbytes() is available, but I'm writing generic library code that reads files from an unknown source. The files could contain anything. The code I'm writing doesn't care if a particular file contains bytes or a string, but it passes the content along to other code that expects the least surprising type for each file.

Maybe my use case is unusual.

@tkelman tkelman added the needs tests Unit tests are required for this change label Jan 3, 2016
@tkelman
Copy link
Contributor

tkelman commented Jan 3, 2016

"least surprising type" is a bit underspecified

@stevengj
Copy link
Member

stevengj commented Jan 4, 2016

@samoconnor, it sounds like you should use readbytes and then layer whatever auto-format-detection you want on top of that.

@stevengj
Copy link
Member

stevengj commented Jan 4, 2016

@nalimilan, you don't need to call convert. UTF8String(d) is already copy-free.

@nalimilan
Copy link
Member

@stevengj The goal isn't to avoid copies: it's to check that the string is valid, which UTF8String doesn't. Do you agree that it should be default?

@samoconnor
Copy link
Contributor Author

@stevengj, agreed. But in that case readbytes should not return invalid strings.
What's better, the explicit throw() per original post, or convert() per @nalimilan's suggestion?

@StefanKarpinski
Copy link
Member

Please see #14383, after which the default string type would be able to hold any kind of data, making it effectively a binary string type (which defaults to interpreting data as UTF-8 characters), and would move actual encoding validation into specialized string types.

@nalimilan
Copy link
Member

@StefanKarpinski I hadn't understood that this was part of your proposal. Did you mention it anywhere?

@StefanKarpinski
Copy link
Member

It's in the explanation of that PR, sort of. To be honest, I was a little put off of posting anything about planned string changes for fear of some rather unwanted discussion that I'm no longer concerned about.

@nalimilan
Copy link
Member

@StefanKarpinski So do you think that it should be possible by default to build a String containing invalid Unicode? That seems to go against what other recent languages do.

@samoconnor samoconnor closed this Jan 21, 2016
@nalimilan
Copy link
Member

I think we should examine this again after the string rework.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs tests Unit tests are required for this change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants