Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix consistency with is_ascii_whitespace #82152

Closed
wants to merge 1 commit into from

Conversation

klensy
Copy link
Contributor

@klensy klensy commented Feb 15, 2021

Previously this method checked only 4 of 5 possible ascii whitespace byte variants (as defined in https://doc.rust-lang.org/std/primitive.char.html#method.is_ascii_whitespace).

Checked bytes:
before:

'\t' | '\n' | '\r' | ' '

after:

'\t' | '\n' | '\x0C' | '\r' | ' '

@rust-highfive
Copy link
Contributor

r? @estebank

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 15, 2021
@klensy
Copy link
Contributor Author

klensy commented Feb 15, 2021

Btw, there also in

pub fn is_whitespace(c: char) -> bool {
// This is Pattern_White_Space.
//
// Note that this set is stable (ie, it doesn't change with different
// Unicode versions), so it's ok to just hard-code the values.
matches!(
c,
// Usual ASCII suspects
'\u{0009}' // \t
| '\u{000A}' // \n
| '\u{000B}' // vertical tab
| '\u{000C}' // form feed
| '\u{000D}' // \r
| '\u{0020}' // space
// NEXT LINE from latin1
| '\u{0085}'
// Bidi markers
| '\u{200E}' // LEFT-TO-RIGHT MARK
| '\u{200F}' // RIGHT-TO-LEFT MARK
// Dedicated whitespace characters from Unicode
| '\u{2028}' // LINE SEPARATOR
| '\u{2029}' // PARAGRAPH SEPARATOR
)
}

slightly different definition of ascii whitespaces (6 of it, added 0x0b; omitting non-ascii part of it).

@petrochenkov petrochenkov self-assigned this Feb 15, 2021
@klensy
Copy link
Contributor Author

klensy commented Feb 15, 2021

Is rustc_lexer::is_whitespace and char::is_whitespace gives the same results? Need to check.

@petrochenkov
Copy link
Contributor

cc @matklad on this

@petrochenkov
Copy link
Contributor

rustc_lexer::is_whitespace is the standard unicode whitespace category for breaking programming code into tokens (Pattern_White_Space, UAX #31).

skip_ascii_whitespace is used specifically for skipping whitespace in multi-line string literals like

    "foo\
      bar"

I don't think we have a good motivation to extend this set with form feed (\x0C) specifically.
Maybe it can be aligned with some standard too, but I don't see anything similar to this case in UAX #31.

@petrochenkov
Copy link
Contributor

I suggest to wait for matklad's opinion and then probably close this PR.

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 15, 2021
@matklad
Copy link
Member

matklad commented Feb 15, 2021

I don’t think consistency with str methods is a good motivation here: character classes used in the lexer are deliberately different from those used by str methods. Lexer guarantees stability across Unicode versions, while str does not.

What is inconsistent here is that we use different definition of white space in string tokens and in between tokens whitespace. One is a hard-coded list of “reasonable” whitespace characters, and another is standard’s defined Pattern_Whitespace.

IIRC, the reason for that inconsistency is that, originally, both used simple whitespace, but then, during the implementation of Unicode identifiers, one was changed while the other was not, and Unicode whitespace wasn’t gated properly and leaked to stable.

in practice, this probably doesn’t really matter, so I’d try to avoid changes here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants