Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode whitespace not recognized by lstrip() #27211

Closed
sbromberger opened this issue May 22, 2018 · 10 comments
Closed

unicode whitespace not recognized by lstrip() #27211

sbromberger opened this issue May 22, 2018 · 10 comments
Labels
strings "Strings!" unicode Related to unicode characters and encodings

Comments

@sbromberger
Copy link
Contributor

Consider:

julia> a = " \U02009 foo"
"   foo"

julia> lstrip(a)
"  foo"

Note that \U02009 is a Unicode Character 'THIN SPACE' character.

I don't know much about string processing or unicode, but it seems to me that lstrip() is doing something that's perhaps too naive:

const _default_delims = [' ','\t','\n','\v','\f','\r']
...
function lstrip(s::AbstractString, chars::Chars=_default_delims)

I realize you can override chars but I would suggest that we expand the default character set for string trimming to include unicode separators (ref https://www.fileformat.info/info/unicode/category/Zs/list.htm).

@simonbyrne simonbyrne added the strings "Strings!" label May 23, 2018
@simonbyrne
Copy link
Contributor

What do other languages do here?

Note that the other chars ('\t','\n','\v','\f','\r') are control characters.

@ararslan
Copy link
Member

Ruby and Crystal leave it, Python and Rust strip it

@ararslan ararslan added the unicode Related to unicode characters and encodings label May 23, 2018
@StefanKarpinski
Copy link
Member

Stripping unicode whitespace by default seems reasonable to me.

@ararslan
Copy link
Member

Related idea: Define lstrip and friends to take a function argument, where the function is used to determine what should be stripped, i.e.

lstrip(f, s::AbstractString) = # ...
lstrip(s::AbstractString) = lstrip(c->isspace(c) || c in _default_delims, s)

It seems like this would allow us to more concisely express what should be skipped, rather than adding all Unicode space characters to _default_delims. Also, as an example in this hypothetical universe, lstrip("~~~1", '~') would instead be written lstrip(==('~'), "~~~1").

@sbromberger
Copy link
Contributor Author

This is essentially equivalent to overriding chars, though, isn’t it? If that’s the case then we might as well just leave it as-is (but I’m still in support of adding the Unicode chars to _default_delim).

@ararslan
Copy link
Member

Yeah, but you don't need to enumerate all Unicode whitespace chars in _default_delims.

@simonbyrne
Copy link
Contributor

The existing _default_delims would already be covered by isspace, so it could just be:

lstrip(s::AbstractString) = lstrip(isspace, s::AbstractString)

@simonbyrne
Copy link
Contributor

simonbyrne commented May 31, 2018

In summary, our choices are:

  1. do nothing

  2. add all unicode spaces to _default_delims

    • complicated to keep up-to-date, and will be inefficient (since it won't be able to use range checks or short-circuiting)
  3. have a non-exported way to allow predicates for just isspace

    • seems silly, but if we can't come to an agreement, is at least better than 1.
  4. allow predicates as a second argument (allow predicate functions to lstrip etc. and default to stripping unicode whitespace #27309)

    • non-breaking
    • straightforward to document
    • violates general style rule of having function arg as first argument
    • we do violate it in other places (e.g. split/rsplit).
  5. move chars to first argument

    • breaking
    • ordering to me seems somewhat odd
  6. allow only predicate function, move to first (Use predicate function in lstrip etc. #27232)

    • general consensus was against
  7. allow predicate as a first argument, keep other chars as second argument (take commits from Use predicate function in lstrip etc. #27232 up to d7ae074)

    • non-breaking
    • a little weird that the argument order changes.

Given that it is a fairly minor function, my vague preference is option 4.

@StefanKarpinski
Copy link
Member

  1. 4 + 7 — allow predicates as first or second argument; allow non-predicate chars as second only.

@ararslan
Copy link
Member

A big 👎 to 4, 👍 to 7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!" unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

4 participants