Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support unicode whitespace in strip and split? #14099

Closed
stevengj opened this issue Nov 22, 2015 · 11 comments
Closed

support unicode whitespace in strip and split? #14099

stevengj opened this issue Nov 22, 2015 · 11 comments
Labels
unicode Related to unicode characters and encodings

Comments

@stevengj
Copy link
Member

Three of the functions in util.jlsplit, lstrip, and rstrip — use

const _default_delims = [' ','\t','\n','\v','\f','\r']

as their list of "space" characters, which is obviously only the ASCII space characters. As discussed on the mailing list, this leads to unexpected behavior (contrary to the documentation) when splitting a string that uses other Unicode space characters.

Three options:

  • Change the documentation of split and strip to say that they only handle ASCII spaces by default.
  • Change _default_delims to filter(isspace, Char(0):Char(0x10FFFF)) or similar. Howeve, this will slow down these functions somewhat, because there are currently 23 Unicode space characters vs. only 6 ASCII space characters. (We could also use a Set, but I don't think that is much faster for such a small list.)
  • Define a special immutable _Spaces; end collection type, and _default_delims = _Spaces(), where in(c::Char, ::_Spaces) = isspace(c) and we make sure isspace(c) is carefully optimized.

I'm inclined to favor the third option at the moment.

@stevengj stevengj added the unicode Related to unicode characters and encodings label Nov 22, 2015
@stevengj
Copy link
Member Author

I just checked, and isspace(c) is already 72% faster than c in _ascii_default_delims, while c in _ascii_default_delims is 70% faster than the Set version of c in _unicode_default_delims and 140% faster than the Array version of c in _ascii_default_delims.

@nalimilan
Copy link
Member

I was thinking along the same lines. I have also found two other possibilities:

  • Use the regex r"\s", which is very easy with the current code and should work fine. Likely not great for performance.
  • Add split(s, p) taking a predicate, which would be exported for general use, and make split(s) = split(s, isspace). It should be easy to refactor the existing code so that split(s, c::Char) would call split(s, p).

I'd favour either your last solution, or my last solution (which are quite similar). Mine makes sense only if exporting split(s, p) would be useful.

@tkelman
Copy link
Contributor

tkelman commented Nov 22, 2015

Wouldn't a predicate version make more sense with the predicate first? That would allow do notation.

@nalimilan
Copy link
Member

@tkelman Yeah, probably.

@stevengj
Copy link
Member Author

@nalimilan, I agree that we should have split(predicate::Function, s::AbstractString). However, we should also optimize the common case of splitting on a collection of characters, and especially on spaces, and since Function arguments aren't currently inlined etc. I think we should still implement specialized types for those cases. Maybe specialized callable types, similar to how reduce is optimized for + etc.

@nalimilan
Copy link
Member

Making Spaces callable and passing it to split(p, s) sounds like a good solution indeed.

@stevengj
Copy link
Member Author

(The regex test is 50x slower than isspace.)

@iagobaapellaniz
Copy link
Contributor

iagobaapellaniz commented Feb 16, 2017

I found that

julia> "\U2297" == "\otimes[TAB]"
true

while

julia> r"\U2297" == r"\otimes[TAB]"
false

julia> Regex("\U2297") == r"\otimes[TAB]"
true

Is this a bug? Should it consider that \U is special?
Should I open a new issue? Feel free to do that for me if that's the case...

@stevengj
Copy link
Member Author

@i-apellaniz, no, it's not a bug (you can't use \U escapes in a regex AFAIK) and please don't post questions on a completely unrelated issue.

@iagobaapellaniz
Copy link
Contributor

Sorry, I'll open a new issue, since I think this must be further investigated. Sorry again.

@laborg
Copy link
Contributor

laborg commented Feb 8, 2022

This has been fixed in #27309

@laborg laborg closed this as completed Feb 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

5 participants