RFC, WIP: highlander #14383
Conversation
Sorry, but I think the part of changing |
If substrings are very efficient, to the point that a one-character (read: one-grapheme) string is as efficient as a scalar variable, is there then still a need for a separate "Character" type? On a historical note -- Fortran makes no distinction between characters and strings except via their length. I'm sure there can be various ways to iterate over strings, yielding either grapheme clusters, graphemes, codepoints, or bytes, normalized or not. The default way might be bytes. Or not. I hear that iterating over codepoints is almost never what one wants. |
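As a concrete illustration of those granularities (my own sketch, written against current Julia where `graphemes` lives in the Unicode stdlib; the names differed in the 0.4-era code this thread discusses):

```julia
using Unicode  # provides graphemes() in current Julia

s = "noe\u0308l"             # "noël" written with a combining diaeresis

collect(codeunits(s))        # 6 UTF-8 bytes
collect(s)                   # 5 codepoints: 'n', 'o', 'e', '\u0308', 'l'
collect(graphemes(s))        # 4 grapheme clusters: "n", "o", "ë", "l"
```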
Another thing to look at is the string support in Swift 2.0 (which is now open source, right here on GitHub) |
While I'm broadly in favour, I think it would be worthwhile getting some good benchmarks together so we have a rough idea of what the cost is going to be in terms of performance. For example, the following function

```julia
function hasspace(s)
    for c in s
        isspace(c) && return true
    end
    false
end
```

is 6x faster for an |
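The kind of comparison being suggested might look something like the sketch below (the test strings and use of `@elapsed` are mine, not from the thread; on the 0.4-era code in question one would construct `ASCIIString` and `UTF8String` inputs explicitly):

```julia
# Uses hasspace from the previous snippet. Neither string contains a space, so the
# whole string must be scanned; the second forces multi-byte UTF-8 decoding.
ascii_str = "abcdefghij"^100_000
utf8_str  = "abcdéfghij"^100_000

hasspace(ascii_str); hasspace(utf8_str)   # run once to compile

@elapsed hasspace(ascii_str)
@elapsed hasspace(utf8_str)
```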
@simonbyrne I suspect we should have a function to check whether a character is an ASCII space, which is much faster to check than Unicode categories. I think it is quite common to have to look for a space in a non-ASCII string, and we want this to be fast too.

@StefanKarpinski A big +1 for the first part of the proposal. The second part about |
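For what it's worth, a sketch of the kind of ASCII-only check being proposed (the name `isasciispace` is hypothetical, not an existing Base function):

```julia
# Hypothetical helper: tests only the ASCII whitespace characters, avoiding the
# Unicode category lookup that the generic isspace has to do for non-ASCII input.
isasciispace(c::Char) = c == ' ' || c == '\t' || c == '\n' ||
                        c == '\v' || c == '\f' || c == '\r'
```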
@simonbyrne: benchmarks are definitely necessary before this change can go forward. I think the new character representation can reduce that performance gap considerably.

@nalimilan: the |
I see, makes sense. Sounds much easier than making char and string the same thing. We could imagine having |
Yes, that's a possibility. |
+1 |
force-pushed from a5f275c to aeebea9
@eschnett and @nalimilan have a good point about "... iterating over codepoints is almost never what one wants" etc. I like the idea of a 32-bit …. I think that …. There are so many ways one might want to iterate/index the content of a string...
I think there are three use case classes:
|
Perl 6's string implementation allows constant-time indexing into graphemes. Here are slides from a talk about Unicode in Perl 6 by one of the primary contributors to the implementation: http://jnthn.net/papers/2015-spw-nfg.pdf |
I couldn't find a formal description, but from what I can gather from here, Perl 6 uses an array of 32-bit signed integers, with negative numbers corresponding to "synthetic codepoints", via some sort of lookup table. Update: here is a more formal spec of their grapheme normalization form (NFG). |
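A toy model of how I read that description (my own Julia sketch of the idea, not Perl 6's actual data structures):

```julia
# One Int32 per grapheme cluster: non-negative entries are ordinary codepoints,
# negative entries index into a table of multi-codepoint "synthetic" graphemes.
struct NFGString
    graphemes::Vector{Int32}             # one entry per grapheme
    synthetics::Vector{Vector{UInt32}}   # lookup table for synthetic graphemes
end

# O(1) grapheme indexing: return the codepoint sequence for grapheme i.
grapheme_at(s::NFGString, i::Int) =
    s.graphemes[i] >= 0 ? [UInt32(s.graphemes[i])] : s.synthetics[-s.graphemes[i]]
```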
@simonbyrne pdd28 was always aspirational; the rakudo/moarvm implementation was only peripherally influenced by the parrot writeups |
Their approach is interesting, but it only makes sense when starting from UTF-32, and certainly not from UTF-8 (as in the case of Julia). That is, they already have O(1) codepoint indexing; what their trick adds is O(1) grapheme indexing. But this is terribly wasteful in terms of memory use, and it forces them to not only validate, but also normalize all strings on input. I guess this is a good option for a language which considers that manipulating Unicode strings should be fast, at the expense of significant overhead when your needs are more basic (like parsing a CSV file or computing stats from a database). Indeed, as they note themselves:
I'm not sure what "native" charset means, but it sounds weird that their Unicode support is so good that they advise moving away from Unicode altogether. Kind of counter-productive...

That said, it looks like there's a pattern in new languages for string iteration to go over graphemes rather than codepoints (which are mostly an implementation detail). I'm not sure how feasible it would be to get acceptable performance for that, so as to make graphemes the default.

@samoconnor See #9297. |
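To put a rough number on the memory point made above (my own back-of-the-envelope example, assuming ASCII-heavy data such as a CSV line):

```julia
# ASCII-heavy data costs one byte per character in UTF-8, but four bytes per entry
# in a 32-bit-per-grapheme (NFG-style) or UTF-32 representation.
csv_line = "2015-12-14,42.0,some ascii field"
sizeof(csv_line)        # 32 bytes as UTF-8
4 * length(csv_line)    # 128 bytes as 32-bit code units
```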
force-pushed from 46e49de to 47cebce
force-pushed from 3fbd15e to 9395632
force-pushed from c2d06bc to 4c99c93
force-pushed from da86d36 to 6e99e2b
I think this is a great idea. Are you also thinking of deprecating |
@dcarrera What would be the point of deprecating |
@nalimilan I see. My bad. |
force-pushed from da3061b to 8ab3278
force-pushed from 2e65447 to 8c5929b
Perhaps this is relevant...? I think the example below suggests that

```julia
julia> collect(Base.flatten(Vector[[1,2], [3,4]]))
4-element Array{Any,1}:
 1
 2
 3
 4

julia> vcat(Vector[[1,2], [3,4]]...)
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> collect(Base.flatten(["12", "34"]))
4-element Array{Char,1}:
 '1'
 '2'
 '3'
 '4'

julia> vcat(["12", "34"]...)
2-element Array{ASCIIString,1}:
 "12"
 "34"
```
|
That's certainly a direction that the string API could take. I think it's separate from this work, however. |
Fair enough. Your discussion of "allowing character indexing and iteration" for the new

```julia
for c in "hello" println(c) end
ERROR: MethodError: no method matching start(::String)

"hello"[1]
ERROR: MethodError: no method matching getindex(::String, ::Int64)

for c in chars("hello") println(c) end
first(chars("hello"))
```
|
@samoconnor See also #9261 and #9297. |
thx @nalimilan |
force-pushed from 8c5929b to 9f6dba6
This branch is a preview of where I'd like to go with built-in string types, except that I'd like to take it even further. So far this branch collapses `ASCIIString` and `UTF8String` into a single, concrete UTF-8 string type called `String`. In addition to this, I'd like to make the representation of `String` and `SubString{String}` the same, removing that distinction as well. I'd also like to move all non-String string types out of Base and into a `StringEncodings` package (or something like that). We can keep simple utility functions to transcode `String` to UTF-16 on Windows for system calls, but otherwise, no non-UTF-8 functionality would exist in Base. Then there would truly be only one (string type in base, that is).

Additionally, I'd like to replace the current `Char` type with a new `Char` type that leaves UTF-8-like data as-is, allowing character indexing and iteration to do far less work in the common case where you don't actually care about the values of code points (you can still check equality and ordering because of the cleverness of UTF-8). That would reduce the performance penalty of using UTF-8 everywhere.

Comments and thoughts welcomed.
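A small illustration of the "cleverness of UTF-8" mentioned above (my own example, not code from this branch): comparing UTF-8 data byte-by-byte gives the same ordering as comparing the code points it encodes, which is why a byte-backed `Char` can support equality and ordering without decoding.

```julia
# String comparison in Julia is a byte-wise comparison of the underlying UTF-8 data,
# so it agrees with comparison of the decoded Chars for every pair of codepoints.
for (a, b) in (('a', 'b'), ('é', 'z'), ('α', 'β'), ('a', '🎈'))
    @assert (a < b)  == (string(a) < string(b))    # byte order matches codepoint order
    @assert (a == b) == (string(a) == string(b))
end
```

This byte-backed representation is essentially what Julia's `Char` later adopted.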