Check for invalid UTF8String in readall() #14545
Without this, a UnicodeError occurs some time later when the string is accessed. Perhaps it should be an immediate error...

    isvalid(UTF8String, b) || throw(UnicodeError())
    return isvalid(ASCIIString, b) ? ASCIIString(b) : UTF8String(b)

But at least in my use case, it is more convenient to have readall() fall back to returning Array{UInt8} if the data isn't a string.
Makes sense, but I would use

    julia> convert(UTF8String, UInt8[0xd8])
    ERROR: UnicodeError: invalid UTF-8 sequence starting at index 2 (0xd8) missing one or more continuation bytes)
     in __unsafe_checkstring#21__ at ./unicode/checkstring.jl:76
     in __unsafe_checkstring#19__ at ./unicode/checkstring.jl:66
     [inlined code] from ./unicode/checkstring.jl:66
     in convert at ./unicode/utf8.jl:243
     in eval at ./boot.jl:265

This method should probably be documented, and maybe even be the default, with a parameter to disable checks. @stevengj What do you think about this?
Changing the return type depending on the contents of the stream is a no-go. But you can use
I understand that Maybe my use case is unusual.
"least surprising type" is a bit underspecified
@samoconnor, it sounds like you should use
@nalimilan, you don't need to call
@stevengj The goal isn't to avoid copies: it's to check that the string is valid, which
@stevengj, agreed. But in that case readbytes should not return invalid strings.
Please see #14383, after which the default string type would be able to hold any kind of data, making it effectively a binary string type (which defaults to interpreting data as UTF-8 characters), and would move actual encoding validation into specialized string types.
@StefanKarpinski I hadn't understood that this was part of your proposal. Did you mention it anywhere?
It's in the explanation of that PR, sort of. To be honest, I was a little put off of posting anything about planned string changes for fear of some rather unwanted discussion that I'm no longer concerned about.
@StefanKarpinski So do you think that it should be possible by default to build a
I think we should examine this again after the string rework.
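To make the type-stability point concrete, here is a minimal sketch of one way to keep validation eager without changing the return type. It is written against current Julia (1.x), where `String` and `read(io)` replace the 0.4-era `UTF8String` and `readall()` discussed in this thread. The helper name `read_validated` and the choice of `ArgumentError` are assumptions for illustration only; neither is part of Base nor of this PR.

```julia
# Hypothetical helper (not part of Base): read the whole stream and
# fail immediately if the bytes are not valid UTF-8.
function read_validated(io::IO)
    bytes = read(io)                      # Vector{UInt8}, no decoding yet
    isvalid(String, bytes) ||
        throw(ArgumentError("stream does not contain valid UTF-8"))
    return String(bytes)                  # always a String, never bytes
end

# The fallback to raw bytes lives in the caller, so the helper itself
# keeps a single return type.
io = IOBuffer(UInt8[0xd8])                # a lone lead byte, invalid UTF-8
data = try
    read_validated(io)
catch err
    err isa ArgumentError || rethrow()
    seekstart(io)                         # rewind; works for seekable streams
    read(io)                              # raw bytes instead of a string
end
```

With this split, code that expects text gets an immediate error, while a caller that may receive binary data opts into the byte fallback explicitly. The rewind step assumes a seekable stream; for a pipe or socket the caller would read the bytes once and validate them directly.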