support unicode whitespace in strip and split? #14099

stevengj · 2015-11-22T13:57:56Z

Three of the functions in util.jl — split, lstrip, and rstrip — use

const _default_delims = [' ','\t','\n','\v','\f','\r']

as their list of "space" characters, which is obviously only the ASCII space characters. As discussed on the mailing list, this leads to unexpected behavior (contrary to the documentation) when splitting a string that uses other Unicode space characters.

Three options:

Change the documentation of split and strip to say that they only handle ASCII spaces by default.
Change _default_delims to filter(isspace, Char(0):Char(0x10FFFF)) or similar. Howeve, this will slow down these functions somewhat, because there are currently 23 Unicode space characters vs. only 6 ASCII space characters. (We could also use a Set, but I don't think that is much faster for such a small list.)
Define a special immutable _Spaces; end collection type, and _default_delims = _Spaces(), where in(c::Char, ::_Spaces) = isspace(c) and we make sure isspace(c) is carefully optimized.

I'm inclined to favor the third option at the moment.

The text was updated successfully, but these errors were encountered:

stevengj · 2015-11-22T14:11:16Z

I just checked, and isspace(c) is already 72% faster than c in _ascii_default_delims, while c in _ascii_default_delims is 70% faster than the Set version of c in _unicode_default_delims and 140% faster than the Array version of c in _ascii_default_delims.

nalimilan · 2015-11-22T14:18:30Z

I was thinking along the same lines. I have also found two other possibilities:

Use the regex r"\s", which is very easy with the current code and should work fine. Likely not great for performance.
Add split(s, p) taking a predicate, which would be exported for general use, and make split(s) = split(s, isspace). It should be easy to refactor the existing code so that split(s, c::Char) would call split(s, p).

I'd favour either your last solution, or my last solution (which are quite similar). Mine makes sense only if exporting split(s, p) would be useful.

tkelman · 2015-11-22T14:30:41Z

Wouldn't a predicate version make more sense with the predicate first? That would allow do notation.

nalimilan · 2015-11-22T14:37:52Z

@tkelman Yeah, probably.

stevengj · 2015-11-22T14:58:49Z

@nalimilan, I agree that we should have split(predicate::Function, s::AbstractString). However, we should also optimize the common case of splitting on a collection of characters, and especially on spaces, and since Function arguments aren't currently inlined etc. I think we should still implement specialized types for those cases. Maybe specialized callable types, similar to how reduce is optimized for + etc.

nalimilan · 2015-11-22T15:01:22Z

Making Spaces callable and passing it to split(p, s) sounds like a good solution indeed.

stevengj · 2015-11-22T15:01:55Z

(The regex test is 50x slower than isspace.)

iagobaapellaniz · 2017-02-16T14:11:47Z

I found that

julia> "\U2297" == "\otimes[TAB]"
true

while

julia> r"\U2297" == r"\otimes[TAB]"
false

julia> Regex("\U2297") == r"\otimes[TAB]"
true

Is this a bug? Should it consider that \U is special?
Should I open a new issue? Feel free to do that for me if that's the case...

stevengj · 2017-02-16T14:56:16Z

@i-apellaniz, no, it's not a bug (you can't use \U escapes in a regex AFAIK) and please don't post questions on a completely unrelated issue.

iagobaapellaniz · 2017-02-16T15:56:45Z

Sorry, I'll open a new issue, since I think this must be further investigated. Sorry again.

laborg · 2022-02-08T14:16:20Z

This has been fixed in #27309

stevengj added the unicode Related to unicode characters and encodings label Nov 22, 2015

laborg closed this as completed Feb 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support unicode whitespace in strip and split? #14099

support unicode whitespace in strip and split? #14099

stevengj commented Nov 22, 2015

stevengj commented Nov 22, 2015

nalimilan commented Nov 22, 2015

tkelman commented Nov 22, 2015

nalimilan commented Nov 22, 2015

stevengj commented Nov 22, 2015

nalimilan commented Nov 22, 2015

stevengj commented Nov 22, 2015

iagobaapellaniz commented Feb 16, 2017 •

edited

Loading

stevengj commented Feb 16, 2017

iagobaapellaniz commented Feb 16, 2017

laborg commented Feb 8, 2022

support unicode whitespace in strip and split? #14099

support unicode whitespace in strip and split? #14099

Comments

stevengj commented Nov 22, 2015

stevengj commented Nov 22, 2015

nalimilan commented Nov 22, 2015

tkelman commented Nov 22, 2015

nalimilan commented Nov 22, 2015

stevengj commented Nov 22, 2015

nalimilan commented Nov 22, 2015

stevengj commented Nov 22, 2015

iagobaapellaniz commented Feb 16, 2017 • edited Loading

stevengj commented Feb 16, 2017

iagobaapellaniz commented Feb 16, 2017

laborg commented Feb 8, 2022

iagobaapellaniz commented Feb 16, 2017 •

edited

Loading