Skip to content

lexer: Treat more floats with empty exponent as valid tokens #131656

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

richard-uk1
Copy link
Contributor

@richard-uk1 richard-uk1 commented Oct 13, 2024

Summary

This PR changes the lexer to allow more tokens containing a number followed by 'e' with no exponent value (see #131656 (comment)). I'll use the RFC template but keep it short.

This PR is a continuation of #79912. It is also solving exactly the same problem as #111628. Also adds tests for various edge cases and improves diagnostics marginally/subjectively.

Motivation

When Rust parses a number, it is allowed to have an arbitrary suffix. If this suffix is a number literal suffix (u8, i16, f32 etc), the compiler will use the corresponding type when parsing the number. If the suffix is not a literal, then the compiler rejects the number. This rejection happens after any proc macros have been run on the source code, enabling proc macros to interpret number suffixes as they wish. This means, for example, it is possible to parse 1px into some structure like { number: f64, unit: &str }: { number: 1.0, unit: Pixels }. However, this is not possible for prefixes that begin with e or E, because in this case the lexer attempts to parse an exponent before proc macros are run, meaning the code is rejected before the proc macro authors have chance to interpret it.

This PR removes the special case for e/E, so numeric suffixes like em can be used in proc macros. An application example (given in one of the above issues) is CSS colors. Simplifying somewhat, a color looks like #xxxxxx where x is a hexadecimal character [0-9a-fA-F]. All colors are valid Rust tokens, for example #123abc is interpreted as # followed by the number 123 with suffix abc, while #abc123 is # followed by the string "123abc". There is one exception to this: cases with 1 or more numbers followed by 'e'. With this PR, all colors are valid tokens, and so html colors can be parsed in proc macros without using strings or other syntatic clutter.

Guide-level explanation

Before this PR, it was not possible to have numbers with suffixes starting with the 'e'/'E' character. This is because when the 'e' character was seen, the lexer tries to parse an exponent, and in the case of e.g. "1em" fails, resulting in a compiler error before a proc macro receives the input tokens. After this PR, the same check is carried out, but after proc macros have been run on the input.

Reference-level explanation

This PR ensures that invalid exponents are passed to the parser as suffixes for numbers rather than an error. The parser then checks the suffix to see if it is prefixed by a valid exponent, and in this case parses the float. Otherwise, if the suffix is invalid (not u8, i16 etc.) the compiler will reject the number with an 'invalid suffix' error.

Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like 1e_______________23) but nevertheless it is handled correctly.

Drawbacks

I don't think there are any obvious drawbacks to doing this. Existing diagnostics are maintained, or in one case subjectively slightly improved (1em would give "invalid suffix" rather than "empty exponent"). The lexer still has bounded read-ahead (the new patch requires 2 character lookahead). The new code is perhaps slightly more complex, but I have endeavoured to document it clearly. Perf testing showed no regression.

Rationale and alternatives

The alternatives are to do nothing or to implement #111628 (comment). Implementing this is a long term goal, but will take more work, and someone motivated to do that work. This PR fixes the specific issue with exponents without waiting for that work to be completed.

Prior art

There are a number of previous attempts to implement this functionality: #111628, #79912.

Unresolved questions

None

Future possibilities

A more wholesale refactor of the lexer is proposed here: #111628 (comment). This PR does not block any future work towards that goal.

r: @petrochenkov, since they reviewed #79912.

@rustbot
Copy link
Collaborator

rustbot commented Oct 13, 2024

r? @petrochenkov

rustbot has assigned @petrochenkov.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 13, 2024
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from 8b41315 to e8c244e Compare October 13, 2024 23:06
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from a642bff to 82a8103 Compare November 10, 2024 12:06
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 marked this pull request as ready for review November 10, 2024 12:56
@richard-uk1
Copy link
Contributor Author

I've added some ui tests and I think this is now ready for review. It doesn't stop @petrochenkov changing the whole thing later if they so choose.

@rust-log-analyzer

This comment has been minimized.

@petrochenkov
Copy link
Contributor

Apologies for the delays.
This is a language change rather than a fix, and it's not very urgent, so I'll prioritize some other things in my review queue over it.
In any case I should get to in in February, when I return to work.

@petrochenkov
Copy link
Contributor

I'd still prefer to see #111628 (comment) implemented, but I think this looks like a compatible subset, so we can go forward.

@petrochenkov
Copy link
Contributor

There's also one subtle change in behavior here.
If we have a suffix that starts with _s and then continues with some unicode character that is

  • is_id_continue
  • not is_id_start, and
  • not an ascii digit,

then eat_literal_suffix fill fail to eat the tail of the suffix correctly because it expects a is_id_start character at the start.

For correct behavior we'll need to use self.eat_while(is_id_continue); instead of eat_literal_suffix if the suffix start position is overridden.
This should be testable with U+005F _ LOW LINE.

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 21, 2025
@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from 633b1b0 to d13c474 Compare February 22, 2025 16:12
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from d13c474 to 2c33df5 Compare February 22, 2025 16:40
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from c3947a0 to cfb8823 Compare February 22, 2025 17:05
@rust-log-analyzer

This comment has been minimized.

@richard-uk1 richard-uk1 force-pushed the move_empty_exponent_to_rustc_session branch from cfb8823 to 79d5952 Compare February 23, 2025 13:30
@richard-uk1
Copy link
Contributor Author

richard-uk1 commented Feb 23, 2025

Hey @petrochenkov,

  • did you mean U+005F (underscore) as the test character? It wouldn't fit your criteria since it is a valid identifier start character. I've been using U+00B7 (middle dot) as it meets your criteria.
  • I checked and the nuber parser already fails to handle the case where there is a id_continue but not id_start character that isn't a digit in the suffix, e.g. (1_\u{00B7}). If I understand correctly, this means it's not a regression, so perhaps it should be fixed in a separate PR? Please LMK if I've misunderstood.

Other than the points above, it should be ready for review again.

@petrochenkov petrochenkov added the I-lang-nominated Nominated for discussion during a lang team meeting. label Mar 7, 2025
@petrochenkov
Copy link
Contributor

Nominated.

This PR moves part of the exponent checks from rustc_lexer/rustc_parser into rustc_session

Things like this are not relevant to the language team.
It's better to give some grammar or at least link to #131656 (comment).

@richard-uk1
Copy link
Contributor Author

richard-uk1 commented Mar 8, 2025

It's better to give some grammar or at least link to #131656 (comment).

I updated the 'summary' to point to the comment showing how different input is lexed.

@scottmcm
Copy link
Member

scottmcm commented Apr 9, 2025

Are the old and new token regexes written down somewhere? If we're going to do this, that would be something I think we'd want to include in the release notes for people who own syntax highlighters to update for this, for example.

@traviscross traviscross removed the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 9, 2025
@traviscross
Copy link
Contributor

cc @rust-lang/spec

...in terms of how we document the lexer behavior.

@tmandry
Copy link
Member

tmandry commented Apr 9, 2025

Are the old and new token regexes written down somewhere?

Looks like they are visible in this diff: https://github.com/mattheww/lexeywan/compare/2025-03_e-suffix

@tmandry
Copy link
Member

tmandry commented Apr 9, 2025

@richard-uk1 / @mattheww / anyone else: Can you think of any language changes we might want to make where adding this might it more difficult?

As an example, say we wanted to remove the requirement that floats always contain a ..

@mattheww
Copy link
Contributor

mattheww commented Apr 9, 2025

I can't.

Note that 123e4 and 123f64 are already accepted as floating-point literal expressions, despite not containing a ..

123f64 is documented as an integer literal token in the Reference (and I think it's implemented that way in rustc), because it's easier that way. But I don't think the distinction is observable to Rust programs.

@traviscross
Copy link
Contributor

@rfcbot fcp merge

@rfcbot
Copy link
Collaborator

rfcbot commented Apr 16, 2025

Team member @traviscross has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns.
See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. labels Apr 16, 2025
@tmandry
Copy link
Member

tmandry commented Apr 16, 2025

@rfcbot reviewed

@nikomatsakis
Copy link
Contributor

nikomatsakis commented Apr 16, 2025

@rfcbot reviewed

However, I do have a question for @mattheww -- in this table

Still rejected

source possible future tokenisation
1e+f32 1 with suffix e, +, f32
1e_· 1 with suffix e_·
1e2· 1 with suffix e2·

What is the reason we do not accept a suffix like e_? (Edit: ok, I see, it has a · which I was missing)

@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Apr 16, 2025
@rfcbot
Copy link
Collaborator

rfcbot commented Apr 16, 2025

🔔 This is now entering its final comment period, as per the review above. 🔔

@traviscross traviscross added I-lang-radar Items that are on lang's radar and will need eventual work or consideration. and removed I-lang-nominated Nominated for discussion during a lang team meeting. labels Apr 23, 2025
@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this PR / Issue. to-announce Announce this issue on triage meeting and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels Apr 26, 2025
@rfcbot
Copy link
Collaborator

rfcbot commented Apr 26, 2025

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

@ehuss
Copy link
Contributor

ehuss commented Apr 26, 2025

Does the reference need to be updated for this? If so, this should probably be marked S-waiting-on-documentation.

@mattheww
Copy link
Contributor

I've made a branch in my PEG-based model of the lexer which makes it match the result of this PR.

The main change to the grammar is in mattheww/lexeywan@d7a4e91

The limitation discussed in #131656 (comment) is modelled separately in mattheww/lexeywan@8a2075a

Rendered writeup for this PR's behaviour:
https://mjw.woodcraft.me.uk/2025-lexeywan-peg-e-suffix/

Rendered writeup for Rust 1.86:
https://mjw.woodcraft.me.uk/2025-lexeywan/

@mattheww
Copy link
Contributor

Here are some more cases to think about:

Newly accepted

source suffix
0b1e e
0o1e e

Still rejected

source possible future tokenisation
0b1e0 0b1 with suffix e0
0o1e2 0o1 with suffix e2

@petrochenkov
Copy link
Contributor

Here are some more cases to think about:

Both look ok to me (the second case will need to be supported in the full implementation).

@traviscross traviscross added the S-waiting-on-documentation Status: Waiting on approved PRs to documentation before merging label Apr 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. I-lang-radar Items that are on lang's radar and will need eventual work or consideration. S-waiting-on-documentation Status: Waiting on approved PRs to documentation before merging S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). T-lang Relevant to the language team, which will review and decide on the PR/issue. to-announce Announce this issue on triage meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.