Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String constructor is bad #16713

Closed
TotalVerb opened this issue Jun 2, 2016 · 4 comments
Closed

String constructor is bad #16713

TotalVerb opened this issue Jun 2, 2016 · 4 comments
Labels
strings "Strings!"

Comments

@TotalVerb
Copy link
Contributor

TotalVerb commented Jun 2, 2016

The String constructor here leaves a lot to be desired:

"""
    String(v::Vector{UInt8})

Wrap a vector of bytes encoding string data as UTF-8 in a `String` object.
The resulting `String` object takes ownership of the array.
"""
String(s::Vector{UInt8}) =
    ccall(:jl_pchar_to_string, Ref{String}, (Ptr{UInt8},Int), s, length(s))
  • It makes a copy, so it's very slow. At least in Julia 0.3, no copy was made.
  • The behaviour is different from the docstring. The String object doesn't take ownership of anything.
  • It's unsafe. String([0x80]); is allowed, even though is is an invalid string.
  • It overwrites the String inner constructor, making the inner constructor useless and emitting a warning at build time.

Obviously not all of these can be fixed (points 1 and 3 would move it in different directions), but I think it ought to pick a direction:

  • String should be safe, and check its input (along with making a copy). The method is updated to check validity of UTF8. The docstring is updated. An unsafe_array_to_string replaces the current behaviour, and additionally does not make a copy. The inner constructor is deleted.
  • String should not be safe, and does not check input, and does not make a copy. The method is deleted and the docstring moved to the inner constructor.
@ararslan
Copy link
Member

ararslan commented Jun 2, 2016

Ref #16107 (general direction for String) and #16590 (recent developments)

@simonbyrne simonbyrne added the strings "Strings!" label Jun 2, 2016
@stevengj
Copy link
Member

stevengj commented Jun 2, 2016

I think there was a conscious decision by @StefanKarpinski to allow invalid UTF-8 data, so that String containers could be used to pass through arbitrary binary data. This is common to many UTF-8 implementations, including Python's. (The decoding algorithm to extract chars/codepoints must still check for invalid data, of course. There have been a lot of proposals for how to decode such data.)

@stevengj
Copy link
Member

stevengj commented Jun 2, 2016

See also #16470.

@StefanKarpinski StefanKarpinski mentioned this issue Jun 2, 2016
32 tasks
@stevengj
Copy link
Member

stevengj commented Jun 3, 2016

I think @StefanKarpinski may have accidentally changed the String(::Vector{UInt8}) semantics in #16453. When he renamed the bytestring(::Vector{UInt8}) function to String(::Vector{UInt8}), he overrode the inner String constructor (which doesn't make a copy) from #16212.

Probably bytestring(data::Vector{UInt8}) should have been deprecated to String(copy(data)).

stevengj added a commit to stevengj/julia that referenced this issue Jun 3, 2016
…unsafe_string_wrapper, and unsafe_array_wrapper; restore non-copying behavior of String(::Vector{UInt8}) constructor (closes JuliaLang#16470, closes JuliaLang#16713)
stevengj added a commit to stevengj/julia that referenced this issue Jun 4, 2016
…unsafe_string_wrapper, and unsafe_array_wrapper; restore non-copying behavior of String(::Vector{UInt8}) constructor (closes JuliaLang#16470, closes JuliaLang#16713)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

No branches or pull requests

4 participants