-
Notifications
You must be signed in to change notification settings - Fork 13.3k
lexer: Treat more floats with empty exponent as valid tokens #131656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
lexer: Treat more floats with empty exponent as valid tokens #131656
Conversation
rustbot has assigned @petrochenkov. Use |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
8b41315
to
e8c244e
Compare
This comment has been minimized.
This comment has been minimized.
a642bff
to
82a8103
Compare
This comment has been minimized.
This comment has been minimized.
I've added some ui tests and I think this is now ready for review. It doesn't stop @petrochenkov changing the whole thing later if they so choose. |
This comment has been minimized.
This comment has been minimized.
Apologies for the delays. |
I'd still prefer to see #111628 (comment) implemented, but I think this looks like a compatible subset, so we can go forward. |
There's also one subtle change in behavior here.
then For correct behavior we'll need to use |
633b1b0
to
d13c474
Compare
This comment has been minimized.
This comment has been minimized.
d13c474
to
2c33df5
Compare
This comment has been minimized.
This comment has been minimized.
c3947a0
to
cfb8823
Compare
This comment has been minimized.
This comment has been minimized.
cfb8823
to
79d5952
Compare
Hey @petrochenkov,
Other than the points above, it should be ready for review again. |
Nominated.
Things like this are not relevant to the language team. |
I updated the 'summary' to point to the comment showing how different input is lexed. |
Are the old and new token regexes written down somewhere? If we're going to do this, that would be something I think we'd want to include in the release notes for people who own syntax highlighters to update for this, for example. |
cc @rust-lang/spec ...in terms of how we document the lexer behavior. |
Looks like they are visible in this diff: https://github.com/mattheww/lexeywan/compare/2025-03_e-suffix |
@richard-uk1 / @mattheww / anyone else: Can you think of any language changes we might want to make where adding this might it more difficult? As an example, say we wanted to remove the requirement that floats always contain a |
I can't. Note that
|
@rfcbot fcp merge |
Team member @traviscross has proposed to merge this. The next step is review by the rest of the tagged team members: No concerns currently listed. Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns. |
@rfcbot reviewed |
@rfcbot reviewed However, I do have a question for @mattheww -- in this table Still rejected
What is the reason we do not accept a suffix like |
🔔 This is now entering its final comment period, as per the review above. 🔔 |
The final comment period, with a disposition to merge, as per the review above, is now complete. As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed. This will be merged soon. |
Does the reference need to be updated for this? If so, this should probably be marked |
I've made a branch in my PEG-based model of the lexer which makes it match the result of this PR. The main change to the grammar is in mattheww/lexeywan@d7a4e91 The limitation discussed in #131656 (comment) is modelled separately in mattheww/lexeywan@8a2075a Rendered writeup for this PR's behaviour: Rendered writeup for Rust 1.86: |
Here are some more cases to think about: Newly accepted
Still rejected
|
Both look ok to me (the second case will need to be supported in the full implementation). |
Summary
This PR changes the lexer to allow more tokens containing a number followed by 'e' with no exponent value (see #131656 (comment)). I'll use the RFC template but keep it short.
This PR is a continuation of #79912. It is also solving exactly the same problem as #111628. Also adds tests for various edge cases and improves diagnostics marginally/subjectively.
Motivation
When Rust parses a number, it is allowed to have an arbitrary suffix. If this suffix is a number literal suffix (
u8
,i16
,f32
etc), the compiler will use the corresponding type when parsing the number. If the suffix is not a literal, then the compiler rejects the number. This rejection happens after any proc macros have been run on the source code, enabling proc macros to interpret number suffixes as they wish. This means, for example, it is possible to parse1px
into some structure like{ number: f64, unit: &str }
:{ number: 1.0, unit: Pixels }
. However, this is not possible for prefixes that begin withe
orE
, because in this case the lexer attempts to parse an exponent before proc macros are run, meaning the code is rejected before the proc macro authors have chance to interpret it.This PR removes the special case for
e
/E
, so numeric suffixes likeem
can be used in proc macros. An application example (given in one of the above issues) is CSS colors. Simplifying somewhat, a color looks like#xxxxxx
wherex
is a hexadecimal character[0-9a-fA-F]
. All colors are valid Rust tokens, for example#123abc
is interpreted as#
followed by the number123
with suffixabc
, while#abc123
is#
followed by the string"123abc"
. There is one exception to this: cases with 1 or more numbers followed by 'e'. With this PR, all colors are valid tokens, and so html colors can be parsed in proc macros without using strings or other syntatic clutter.Guide-level explanation
Before this PR, it was not possible to have numbers with suffixes starting with the 'e'/'E' character. This is because when the 'e' character was seen, the lexer tries to parse an exponent, and in the case of e.g. "1em" fails, resulting in a compiler error before a proc macro receives the input tokens. After this PR, the same check is carried out, but after proc macros have been run on the input.
Reference-level explanation
This PR ensures that invalid exponents are passed to the parser as suffixes for numbers rather than an error. The parser then checks the suffix to see if it is prefixed by a valid exponent, and in this case parses the float. Otherwise, if the suffix is invalid (not
u8
,i16
etc.) the compiler will reject the number with an 'invalid suffix' error.Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like
1e_______________23
) but nevertheless it is handled correctly.Drawbacks
I don't think there are any obvious drawbacks to doing this. Existing diagnostics are maintained, or in one case subjectively slightly improved (
1em
would give "invalid suffix" rather than "empty exponent"). The lexer still has bounded read-ahead (the new patch requires 2 character lookahead). The new code is perhaps slightly more complex, but I have endeavoured to document it clearly. Perf testing showed no regression.Rationale and alternatives
The alternatives are to do nothing or to implement #111628 (comment). Implementing this is a long term goal, but will take more work, and someone motivated to do that work. This PR fixes the specific issue with exponents without waiting for that work to be completed.
Prior art
There are a number of previous attempts to implement this functionality: #111628, #79912.
Unresolved questions
None
Future possibilities
A more wholesale refactor of the lexer is proposed here: #111628 (comment). This PR does not block any future work towards that goal.
r: @petrochenkov, since they reviewed #79912.