RFC, WIP: highlander #14383
Conversation
Sorry, but I think the part of changing |
If substrings are very efficient, to the point that a one-character (read: one-grapheme) string is as efficient as a scalar variable, is there then still a need for a separate "Character" type? On a historical note -- Fortran makes no distinction between characters and strings except via their length. I'm sure there can be various ways to iterate over strings, yielding either grapheme clusters, graphemes, codepoints, or bytes, normalized or not. The default way might be bytes. Or not. I hear that iterating over codepoints is almost never what one wants. |
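As a concrete illustration of those granularities (my own sketch, written against current Julia where `graphemes` lives in the Unicode stdlib; the names differed in the 0.4-era code this thread discusses):

```julia
using Unicode  # provides graphemes() in current Julia

s = "noe\u0308l"             # "noël" written with a combining diaeresis

collect(codeunits(s))        # 6 UTF-8 bytes
collect(s)                   # 5 codepoints: 'n', 'o', 'e', '\u0308', 'l'
collect(graphemes(s))        # 4 grapheme clusters: "n", "o", "ë", "l"
```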
Another thing to look at is the string support in Swift 2.0 (which is now open source, right here on GitHub) |
While I'm broadly in favour, I think it would be worthwhile getting some good benchmarks together so we have a rough idea of what the cost is going to be in terms of performance. For example, the following function

```julia
function hasspace(s)
    for c in s
        isspace(c) && return true
    end
    false
end
```

is 6x faster for an |
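The kind of comparison being suggested might look something like the sketch below (the test strings and use of `@elapsed` are mine, not from the thread; on the 0.4-era code in question one would construct `ASCIIString` and `UTF8String` inputs explicitly):

```julia
# Uses hasspace from the previous snippet. Neither string contains a space, so the
# whole string must be scanned; the second forces multi-byte UTF-8 decoding.
ascii_str = "abcdefghij"^100_000
utf8_str  = "abcdéfghij"^100_000

hasspace(ascii_str); hasspace(utf8_str)   # run once to compile

@elapsed hasspace(ascii_str)
@elapsed hasspace(utf8_str)
```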
@simonbyrne I suspect we should have a function to check whether a character is an ASCII space, which is much faster to check than Unicode categories. I think it is quite common to have to look for a space in a non-ASCII string, and we want this to be fast too.

@StefanKarpinski A big +1 for the first part of the proposal. The second part about |
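For what it's worth, a sketch of the kind of ASCII-only check being proposed (the name `isasciispace` is hypothetical, not an existing Base function):

```julia
# Hypothetical helper: tests only the ASCII whitespace characters, avoiding the
# Unicode category lookup that the generic isspace has to do for non-ASCII input.
isasciispace(c::Char) = c == ' ' || c == '\t' || c == '\n' ||
                        c == '\v' || c == '\f' || c == '\r'
```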
@simonbyrne: benchmarks are definitely necessary before this change can go forward. I think the new character representation can reduce that performance gap considerably.

@nalimilan: the |
I see, makes sense. Sounds much easier than making char and string the same thing. We could imagine having |
Yes, that's a possibility. |
+1 |
force-pushed from a5f275c to aeebea9
@eschnett and @nalimilan have a good point about "... iterating over codepoints is almost never what one wants" etc. I like the idea of a 32-bit …. I think that …. There are so many ways one might want to iterate/index the content of a string...
I think there are three use case classes:
|
Perl 6's string implementation allows constant-time indexing into graphemes. Here are slides from a talk about Unicode in Perl 6 by one of the primary contributors to the implementation: http://jnthn.net/papers/2015-spw-nfg.pdf |
I couldn't find a formal description, but from what I can gather from here, Perl 6 uses an array of 32-bit signed integers, with negative numbers corresponding to "synthetic codepoints", via some sort of lookup table. Update: here is a more formal spec of their grapheme normalization form (NFG). |
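A toy model of how I read that description (my own Julia sketch of the idea, not Perl 6's actual data structures):

```julia
# One Int32 per grapheme cluster: non-negative entries are ordinary codepoints,
# negative entries index into a table of multi-codepoint "synthetic" graphemes.
struct NFGString
    graphemes::Vector{Int32}             # one entry per grapheme
    synthetics::Vector{Vector{UInt32}}   # lookup table for synthetic graphemes
end

# O(1) grapheme indexing: return the codepoint sequence for grapheme i.
grapheme_at(s::NFGString, i::Int) =
    s.graphemes[i] >= 0 ? [UInt32(s.graphemes[i])] : s.synthetics[-s.graphemes[i]]
```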
@simonbyrne pdd28 was always aspirational; the rakudo/moarvm implementation was only peripherally influenced by the parrot writeups |
Their approach is interesting, but it only makes sense when starting from UTF-32, and certainly not from UTF-8 (as in the case of Julia). That is, they already have O(1) codepoint indexing; what their trick adds is O(1) grapheme indexing. But this is terribly wasteful in terms of memory use, and it forces them to not only validate, but also normalize all strings on input. I guess this is a good option for a language which considers that manipulating Unicode strings should be fast, at the expense of significant overhead when your needs are more basic (like parsing a CSV file or computing stats from a database). Indeed, as they note themselves:
I'm not sure what "native" charset means, but it sounds weird that their Unicode support is so good that they advise moving away from Unicode altogether. Kind of counter-productive...

That said, it looks like there's a pattern in new languages for string iteration to go over graphemes rather than codepoints (which are mostly an implementation detail). I'm not sure how feasible it would be to get acceptable performance for that, so as to make graphemes the default.

@samoconnor See #9297. |
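To put a rough number on the memory point made above (my own back-of-the-envelope example, assuming ASCII-heavy data such as a CSV line):

```julia
# ASCII-heavy data costs one byte per character in UTF-8, but four bytes per entry
# in a 32-bit-per-grapheme (NFG-style) or UTF-32 representation.
csv_line = "2015-12-14,42.0,some ascii field"
sizeof(csv_line)        # 32 bytes as UTF-8
4 * length(csv_line)    # 128 bytes as 32-bit code units
```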
force-pushed from 46e49de to 47cebce
force-pushed from 3fbd15e to 9395632
force-pushed from c2d06bc to 4c99c93
force-pushed from da86d36 to 6e99e2b
I think this is a great idea. Are you also thinking of deprecating |
@dcarrera What would be the point of deprecating |
@nalimilan I see. My bad. |
force-pushed from da3061b to 8ab3278
force-pushed from 2e65447 to 8c5929b
Perhaps this is relevant...? I think the example below suggests that

```julia
julia> collect(Base.flatten(Vector[[1,2], [3,4]]))
4-element Array{Any,1}:
 1
 2
 3
 4

julia> vcat(Vector[[1,2], [3,4]]...)
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> collect(Base.flatten(["12", "34"]))
4-element Array{Char,1}:
 '1'
 '2'
 '3'
 '4'

julia> vcat(["12", "34"]...)
2-element Array{ASCIIString,1}:
 "12"
 "34"
```
|
That's certainly a direction that the string API could take. I think it's separate from this work, however. |
Fair enough. Your discussion of "allowing character indexing and iteration" for the new

```julia
for c in "hello" println(c) end
ERROR: MethodError: no method matching start(::String)

"hello"[1]
ERROR: MethodError: no method matching getindex(::String, ::Int64)

for c in chars("hello") println(c) end
first(chars("hello"))
```
|
@samoconnor See also #9261 and #9297. |
thx @nalimilan |
force-pushed from 8c5929b to 9f6dba6
This branch is a preview of where I'd like to go with built-in string types, except that I'd like to take it even further. So far this branch collapses `ASCIIString` and `UTF8String` into a single, concrete UTF-8 string type called `String`. In addition to this, I'd like to make the representation of `String` and `SubString{String}` the same, removing that distinction as well. I'd also like to move all non-String string types out of Base and into a `StringEncodings` package (or something like that). We can keep simple utility functions to transcode `String` to UTF-16 on Windows for system calls, but otherwise, no non-UTF-8 functionality would exist in Base. Then there would truly be only one (string type in base, that is).

Additionally, I'd like to replace the current `Char` type with a new `Char` type that leaves UTF-8-like data as-is, allowing character indexing and iteration to do far less work in the common case where you don't actually care about the values of code points (you can still check equality and ordering because of the cleverness of UTF-8). That would reduce the performance penalty of using UTF-8 everywhere.

Comments and thoughts welcomed.
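A small illustration of the "cleverness of UTF-8" mentioned above (my own example, not code from this branch): comparing UTF-8 data byte-by-byte gives the same ordering as comparing the code points it encodes, which is why a byte-backed `Char` can support equality and ordering without decoding.

```julia
# String comparison in Julia is a byte-wise comparison of the underlying UTF-8 data,
# so it agrees with comparison of the decoded Chars for every pair of codepoints.
for (a, b) in (('a', 'b'), ('é', 'z'), ('α', 'β'), ('a', '🎈'))
    @assert (a < b)  == (string(a) < string(b))    # byte order matches codepoint order
    @assert (a == b) == (string(a) == string(b))
end
```

This byte-backed representation is essentially what Julia's `Char` later adopted.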