lexer: Treat more floats with empty exponent as valid tokens #131656

richard-uk1 · 2024-10-13T14:02:43Z

Summary

This PR changes the lexer to allow more tokens containing a number followed by 'e' with no exponent value (see #131656 (comment)). I'll use the RFC template but keep it short.

This PR is a continuation of #79912. It is also solving exactly the same problem as #111628. Also adds tests for various edge cases and improves diagnostics marginally/subjectively.

Motivation

When Rust parses a number, it is allowed to have an arbitrary suffix. If this suffix is a number literal suffix (u8, i16, f32 etc), the compiler will use the corresponding type when parsing the number. If the suffix is not a literal, then the compiler rejects the number. This rejection happens after any proc macros have been run on the source code, enabling proc macros to interpret number suffixes as they wish. This means, for example, it is possible to parse 1px into some structure like { number: f64, unit: &str }: { number: 1.0, unit: Pixels }. However, this is not possible for prefixes that begin with e or E, because in this case the lexer attempts to parse an exponent before proc macros are run, meaning the code is rejected before the proc macro authors have chance to interpret it.

This PR removes the special case for e/E, so numeric suffixes like em can be used in proc macros. An application example (given in one of the above issues) is CSS colors. Simplifying somewhat, a color looks like #xxxxxx where x is a hexadecimal character [0-9a-fA-F]. All colors are valid Rust tokens, for example #123abc is interpreted as # followed by the number 123 with suffix abc, while #abc123 is # followed by the string "123abc". There is one exception to this: cases with 1 or more numbers followed by 'e'. With this PR, all colors are valid tokens, and so html colors can be parsed in proc macros without using strings or other syntatic clutter.

Guide-level explanation

Before this PR, it was not possible to have numbers with suffixes starting with the 'e'/'E' character. This is because when the 'e' character was seen, the lexer tries to parse an exponent, and in the case of e.g. "1em" fails, resulting in a compiler error before a proc macro receives the input tokens. After this PR, the same check is carried out, but after proc macros have been run on the input.

Reference-level explanation

This PR ensures that invalid exponents are passed to the parser as suffixes for numbers rather than an error. The parser then checks the suffix to see if it is prefixed by a valid exponent, and in this case parses the float. Otherwise, if the suffix is invalid (not u8, i16 etc.) the compiler will reject the number with an 'invalid suffix' error.

Exponents that contain arbitrarily long underscore suffixes are handled without read-ahead by tracking the exponent start in case of invalid exponent, so the suffix start is correct. This is very much an edge-case (the user would have to write something like 1e_______________23) but nevertheless it is handled correctly.

Drawbacks

I don't think there are any obvious drawbacks to doing this. Existing diagnostics are maintained, or in one case subjectively slightly improved (1em would give "invalid suffix" rather than "empty exponent"). The lexer still has bounded read-ahead (the new patch requires 2 character lookahead). The new code is perhaps slightly more complex, but I have endeavoured to document it clearly. Perf testing showed no regression.

Rationale and alternatives

The alternatives are to do nothing or to implement #111628 (comment). Implementing this is a long term goal, but will take more work, and someone motivated to do that work. This PR fixes the specific issue with exponents without waiting for that work to be completed.

Prior art

There are a number of previous attempts to implement this functionality: #111628, #79912.

Unresolved questions

None

Future possibilities

A more wholesale refactor of the lexer is proposed here: #111628 (comment). This PR does not block any future work towards that goal.

r: @petrochenkov, since they reviewed #79912.

rustbot · 2024-10-13T14:02:51Z

r? @petrochenkov

rustbot has assigned @petrochenkov.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

richard-uk1 · 2024-11-10T12:57:10Z

I've added some ui tests and I think this is now ready for review. It doesn't stop @petrochenkov changing the whole thing later if they so choose.

petrochenkov · 2024-12-10T18:51:22Z

Apologies for the delays.
This is a language change rather than a fix, and it's not very urgent, so I'll prioritize some other things in my review queue over it.
In any case I should get to in in February, when I return to work.

tests/ui/consts/custom-suffixes-exponent-like.rs

compiler/rustc_session/src/errors.rs

compiler/rustc_lexer/src/lib.rs

petrochenkov · 2025-02-21T18:13:54Z

I'd still prefer to see #111628 (comment) implemented, but I think this looks like a compatible subset, so we can go forward.

compiler/rustc_lexer/src/lib.rs

petrochenkov · 2025-02-21T18:42:57Z

There's also one subtle change in behavior here.
If we have a suffix that starts with _s and then continues with some unicode character that is

is_id_continue
not is_id_start, and
not an ascii digit,

then eat_literal_suffix fill fail to eat the tail of the suffix correctly because it expects a is_id_start character at the start.

For correct behavior we'll need to use self.eat_while(is_id_continue); instead of eat_literal_suffix if the suffix start position is overridden.
This should be testable with U+005F _ LOW LINE.

richard-uk1 · 2025-02-23T13:41:00Z

Hey @petrochenkov,

did you mean U+005F (underscore) as the test character? It wouldn't fit your criteria since it is a valid identifier start character. I've been using U+00B7 (middle dot) as it meets your criteria.
I checked and the nuber parser already fails to handle the case where there is a id_continue but not id_start character that isn't a digit in the suffix, e.g. (1_\u{00B7}). If I understand correctly, this means it's not a regression, so perhaps it should be fixed in a separate PR? Please LMK if I've misunderstood.

Other than the points above, it should be ready for review again.

petrochenkov · 2025-03-07T22:40:56Z

Nominated.

This PR moves part of the exponent checks from rustc_lexer/rustc_parser into rustc_session

Things like this are not relevant to the language team.
It's better to give some grammar or at least link to #131656 (comment).

richard-uk1 · 2025-03-08T00:26:28Z

It's better to give some grammar or at least link to #131656 (comment).

I updated the 'summary' to point to the comment showing how different input is lexed.

scottmcm · 2025-04-09T17:23:56Z

Are the old and new token regexes written down somewhere? If we're going to do this, that would be something I think we'd want to include in the release notes for people who own syntax highlighters to update for this, for example.

traviscross · 2025-04-09T17:28:09Z

cc @rust-lang/spec

...in terms of how we document the lexer behavior.

tmandry · 2025-04-09T17:38:38Z

Are the old and new token regexes written down somewhere?

Looks like they are visible in this diff: https://github.com/mattheww/lexeywan/compare/2025-03_e-suffix

tmandry · 2025-04-09T17:39:40Z

@richard-uk1 / @mattheww / anyone else: Can you think of any language changes we might want to make where adding this might it more difficult?

As an example, say we wanted to remove the requirement that floats always contain a ..

mattheww · 2025-04-09T18:21:12Z

I can't.

Note that 123e4 and 123f64 are already accepted as floating-point literal expressions, despite not containing a ..

123f64 is documented as an integer literal token in the Reference (and I think it's implemented that way in rustc), because it's easier that way. But I don't think the distinction is observable to Rust programs.

traviscross · 2025-04-16T17:56:03Z

@rfcbot fcp merge

rfcbot · 2025-04-16T17:56:06Z

Team member @traviscross has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns.
See this document for info about what commands tagged team members can give me.

tmandry · 2025-04-16T17:56:58Z

@rfcbot reviewed

nikomatsakis · 2025-04-16T17:57:29Z

@rfcbot reviewed

However, I do have a question for @mattheww -- in this table

Still rejected

source	possible future tokenisation
`1e+f32`	`1` with suffix `e`, `+`, `f32`
`1e_·`	`1` with suffix `e_·`
`1e2·`	`1` with suffix `e2·`

What is the reason we do not accept a suffix like e_? (Edit: ok, I see, it has a · which I was missing)

rfcbot · 2025-04-16T17:57:37Z

🔔 This is now entering its final comment period, as per the review above. 🔔

rfcbot · 2025-04-26T18:01:12Z

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

ehuss · 2025-04-26T18:32:39Z

Does the reference need to be updated for this? If so, this should probably be marked S-waiting-on-documentation.

mattheww · 2025-04-27T11:23:58Z

I've made a branch in my PEG-based model of the lexer which makes it match the result of this PR.

The main change to the grammar is in mattheww/lexeywan@d7a4e91

The limitation discussed in #131656 (comment) is modelled separately in mattheww/lexeywan@8a2075a

Rendered writeup for this PR's behaviour:
https://mjw.woodcraft.me.uk/2025-lexeywan-peg-e-suffix/

Rendered writeup for Rust 1.86:
https://mjw.woodcraft.me.uk/2025-lexeywan/

mattheww · 2025-04-27T11:26:14Z

Here are some more cases to think about:

Newly accepted

source	suffix
`0b1e`	`e`
`0o1e`	`e`

Still rejected

source	possible future tokenisation
`0b1e0`	`0b1` with suffix `e0`
`0o1e2`	`0o1` with suffix `e2`

petrochenkov · 2025-04-27T12:12:56Z

Here are some more cases to think about:

Both look ok to me (the second case will need to be supported in the full implementation).

rustbot assigned petrochenkov Oct 13, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 13, 2024