support unicode whitespace in strip and split? #14099
Comments
I just checked, and |
I was thinking along the same lines. I have also found two other possibilities:
I'd favour either your last solution, or my last solution (which are quite similar). Mine makes sense only if exporting |
Wouldn't a predicate version make more sense with the predicate first? That would allow do notation.
@tkelman Yeah, probably.
@nalimilan, I agree that we should have |
Making |
(The regex test is 50x slower than |
I found that

    julia> "\U2297" == "\otimes[TAB]"
    true

while

    julia> r"\U2297" == r"\otimes[TAB]"
    false

    julia> Regex("\U2297") == r"\otimes[TAB]"
    true

Is this a bug? Should it consider that |
@i-apellaniz, no, it's not a bug (you can't use |
Sorry, I'll open a new issue, since I think this must be further investigated. Sorry again.
This has been fixed in #27309
Three of the functions in `util.jl` (`split`, `lstrip`, and `rstrip`) use a hard-coded list as their set of "space" characters, and that list contains only the ASCII space characters. As discussed on the mailing list, this leads to unexpected behavior (contrary to the documentation) when splitting a string that uses other Unicode space characters.

Three options:

1. Change the documentation of `split` and `strip` to say that they only handle ASCII spaces by default.
2. Change `_default_delims` to `filter(isspace, Char(0):Char(0x10FFFF))` or similar. However, this will slow down these functions somewhat, because there are currently 23 Unicode space characters vs. only 6 ASCII space characters. (We could also use a `Set`, but I don't think that is much faster for such a small list.)
3. Define an `immutable _Spaces; end` collection type, and `_default_delims = _Spaces()`, where `in(c::Char, ::_Spaces) = isspace(c)` and we make sure `isspace(c)` is carefully optimized.

I'm inclined to favor the third option at the moment.
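The third option can be sketched roughly as follows. This is an illustration only, not the change that was eventually merged: the type name `Spaces` is hypothetical, and the sketch uses the current `struct` keyword in place of the `immutable` keyword quoted above, which newer Julia versions no longer accept. The idea is that the "collection" stores no characters and answers membership queries by calling `isspace`.

```julia
# Hypothetical singleton type standing in for "the set of all whitespace
# characters". It has no fields, so constructing it costs nothing.
struct Spaces end

# Membership delegates to isspace instead of scanning a stored list, so
# every Unicode space character is covered without enumerating them.
Base.in(c::AbstractChar, ::Spaces) = isspace(c)

# "a", U+2003 (EM SPACE), "b", an ASCII space, "c"
s = "a\u2003b c"

em_space_is_member = '\u2003' in Spaces()   # non-ASCII space
letter_is_member   = 'a' in Spaces()        # ordinary letter

# Use the membership test as the delimiter predicate when splitting.
fields = split(s, c -> c in Spaces())
```

Here `fields` contains `"a"`, `"b"`, and `"c"`, because both the EM SPACE and the ASCII space satisfy the membership test. Compared with the second option, nothing has to be precomputed or kept in sync with new Unicode versions; the cost of each lookup is whatever `isspace` costs.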