unicode whitespace not recognized by lstrip() #27211

sbromberger · 2018-05-22T22:56:15Z

Consider:

julia> a = " \U02009 foo"
"   foo"

julia> lstrip(a)
"  foo"

Note that \U02009 is the Unicode character U+2009 THIN SPACE.

I don't know much about string processing or unicode, but it seems to me that lstrip() is doing something that's perhaps too naive:

const _default_delims = [' ','\t','\n','\v','\f','\r']
...
function lstrip(s::AbstractString, chars::Chars=_default_delims)

I realize you can override chars but I would suggest that we expand the default character set for string trimming to include unicode separators (ref https://www.fileformat.info/info/unicode/category/Zs/list.htm).
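
To make the gap concrete, the check below compares the hard-coded list quoted above with Base's `isspace` for U+2009. The name `old_delims` is only illustrative; it copies the six characters from the `_default_delims` line in this report.

```julia
# Illustrative copy of the six ASCII delimiters quoted above.
old_delims = [' ', '\t', '\n', '\v', '\f', '\r']

thin_space = '\u2009'  # U+2009 THIN SPACE, general category Zs

println(thin_space in old_delims)  # membership in the hard-coded list
println(isspace(thin_space))       # Base's Unicode-aware whitespace test
println(all(isspace, old_delims))  # whether isspace also covers the old list
```

The thin space is absent from the list but accepted by `isspace`, which is why a trim driven by the list stops at it.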


simonbyrne · 2018-05-23T00:16:18Z

What do other languages do here?

Note that the other chars ('\t','\n','\v','\f','\r') are control characters.

ararslan · 2018-05-23T05:16:57Z

Ruby and Crystal leave it, Python and Rust strip it

StefanKarpinski · 2018-05-23T11:03:06Z

Stripping unicode whitespace by default seems reasonable to me.

ararslan · 2018-05-23T13:14:12Z

Related idea: Define lstrip and friends to take a function argument, where the function is used to determine what should be stripped, i.e.

lstrip(f, s::AbstractString) = # ...
lstrip(s::AbstractString) = lstrip(c->isspace(c) || c in _default_delims, s)

It seems like this would allow us to more concisely express what should be skipped, rather than adding all Unicode space characters to _default_delims. Also, as an example in this hypothetical universe, lstrip("~~~1", '~') would instead be written lstrip(==('~'), "~~~1").
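
A minimal sketch of the predicate idea, assuming nothing about how Base would implement it: `lstrip_pred` is a hypothetical name chosen so that it does not collide with `Base.lstrip`.

```julia
# Hypothetical helper, not the Base implementation.
# Drops leading characters for which the predicate f returns true.
function lstrip_pred(f, s::AbstractString)
    for (i, c) in pairs(s)
        # Return the tail starting at the first character that f rejects.
        f(c) || return SubString(s, i)
    end
    return SubString(s, 1, 0)  # every character was stripped: empty result
end

println(lstrip_pred(isspace, " \u2009 foo"))
println(lstrip_pred(==('~'), "~~~1"))
```

With a predicate, the caller decides what counts as strippable, so the Unicode case and the single-character case go through the same code path.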

sbromberger · 2018-05-23T13:36:52Z

This is essentially equivalent to overriding chars, though, isn’t it? If that’s the case then we might as well just leave it as-is (but I’m still in support of adding the Unicode chars to _default_delims).

ararslan · 2018-05-23T13:44:33Z

Yeah, but you don't need to enumerate all Unicode whitespace chars in _default_delims.

simonbyrne · 2018-05-23T19:28:11Z

The existing _default_delims would already be covered by isspace, so it could just be:

lstrip(s::AbstractString) = lstrip(isspace, s::AbstractString)


simonbyrne · 2018-05-31T18:54:51Z

In summary, our choices are:

1. do nothing
2. add all unicode spaces to _default_delims
   - complicated to keep up-to-date, and will be inefficient (since it won't be able to use range checks or short-circuiting)
3. have a non-exported way to allow predicates for just isspace
   - seems silly, but if we can't come to an agreement, is at least better than 1.
4. allow predicates as a second argument (allow predicate functions to lstrip etc. and default to stripping unicode whitespace #27309)
   - non-breaking
   - straightforward to document
   - violates general style rule of having function arg as first argument
     - we do violate it in other places (e.g. split/rsplit).
5. move chars to first argument
   - breaking
   - ordering to me seems somewhat odd
6. allow only predicate function, move to first (Use predicate function in lstrip etc. #27232)
   - general consensus was against
7. allow predicate as a first argument, keep other chars as second argument (take commits from Use predicate function in lstrip etc. #27232 up to d7ae074)
   - non-breaking
   - a little weird that the argument order changes.

Given that it is a fairly minor function, my vague preference is option 4.
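
The efficiency remark about adding characters to a list can be illustrated with a small comparison. This is only a sketch: `delims`, `in_list`, and `in_range` are hypothetical names, and no timing claim is made here.

```julia
# Illustrative copy of the six ASCII delimiters from the old default list.
delims = [' ', '\t', '\n', '\v', '\f', '\r']

# Membership test: scans the vector element by element.
in_list(c) = c in delims

# Range test: '\t' through '\r' is U+0009..U+000D, i.e. \t \n \v \f \r,
# so one comparison pair replaces five list entries and || short-circuits.
in_range(c) = c == ' ' || '\t' <= c <= '\r'

# Both tests agree on every ASCII character.
println(all(in_list(c) == in_range(c) for c in Char(0):Char(127)))
```

A growing list of Unicode spaces cannot be collapsed into range comparisons this way unless someone maintains the ranges by hand, which is the maintenance cost mentioned under option 2.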

StefanKarpinski · 2018-05-31T21:17:21Z

4 + 7 — allow predicates as first or second argument; allow non-predicate chars as second only.

ararslan · 2018-05-31T22:04:23Z

A big 👎 to 4, 👍 to 7.

simonbyrne added the strings "Strings!" label May 23, 2018

ararslan added the unicode Related to unicode characters and encodings label May 23, 2018

simonbyrne added a commit that referenced this issue May 23, 2018

Use predicate function in lstrip etc.

d6e9afb

Fixes #27211.

simonbyrne mentioned this issue May 23, 2018

Use predicate function in lstrip etc. #27232

Closed

simonbyrne added a commit that referenced this issue May 25, 2018

Use predicate function in lstrip etc.

3c942d4

Fixes #27211.

simonbyrne mentioned this issue May 29, 2018

allow predicate functions to lstrip etc. and default to stripping unicode whitespace #27309

Merged

simonbyrne closed this as completed in #27309 Jun 15, 2018

simonbyrne mentioned this issue Jun 15, 2018

Normalize the argument order in lstrip and rstrip #27605

Merged
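
For readers on a Julia release that includes #27309 and #27605 (this should be 1.0 and later, though that version bound is an assumption not stated in the thread), the resulting call forms can be checked as below. It is a usage sketch, not a restatement of the implementation.

```julia
a = " \u2009 foo"   # space, U+2009 THIN SPACE, space, then "foo"

# Default: strips everything isspace accepts, including the thin space.
@assert lstrip(a) == "foo"

# Predicate as the first argument.
@assert lstrip(isspace, a) == "foo"
@assert lstrip(==('~'), "~~~1") == "1"

# Explicit characters stay as the second argument.
@assert lstrip("~~~1", '~') == "1"
@assert lstrip(a, [' ']) == "\u2009 foo"
```

The last line shows that passing an explicit character set still gives the old, narrower behavior: only the leading ASCII space is removed and the thin space stays.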
