re: documentation claim that special characters lose their special meaning inside […] seems wrong #106482
Comments
@serhiy-storchaka You might be the only coredev that can answer this question about \ in re [...] set expressions. |
Agreed, the documentation on what is allowed in square-bracket character sets/classes could be made clearer. There is documentation suggesting to escape a literal closing square bracket. In the how-to https://docs.python.org/3/howto/regex.html#matching-characters, the six predefined backslash character classes \d \D \s \S \w \W are documented as allowed in square brackets. Also, \b is documented as representing the backspace control character in square brackets. A related limitation is that it is not clear whether there is any way to have a literal backslash in square brackets. |
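For reference, a minimal sketch of how these cases appear to behave in practice, assuming Python 3.7 or later. The sample strings and variable names are chosen for illustration only and are not taken from the documentation:

```python
import re

# A predefined class keeps its meaning inside a set.
digit_in_set = re.fullmatch(r'[\d]', '5')

# Inside a set, \b stands for the backspace character (U+0008).
backspace_in_set = re.fullmatch(r'[\b]', '\x08')

# A doubled backslash in the pattern yields a literal backslash in the set.
backslash_in_set = re.fullmatch(r'[\\]', '\\')

# An escaped closing bracket is a literal "]" rather than the end of the set.
bracket_in_set = re.fullmatch(r'[\]]', ']')

for name, match in [
    ('digit', digit_in_set),
    ('backspace', backspace_in_set),
    ('backslash', backslash_in_set),
    ('bracket', bracket_in_set),
]:
    print(name, bool(match))
```

So a literal backslash in square brackets does seem to be possible by escaping the backslash itself, even though the set documentation does not say so explicitly.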
In any case I'd hope that either It would IMO be extremely confusing, if the specialness of Other things: Lines 504 to 507 in d0c6ba9
I would interpret this as follows:
|
With Serhiy’s documentation changes, I think backslash escaping would be defined for the hyphen For reserved ASCII letters like |
Not really sure about that... in his commit he says:
I would interpret this, as the And maybe I miss something, but I think it's still unclear, whether
mean special sequences which do not match a single char (but zero, or - if ever - more than one). |
Would the following bullet point work?
|
Hmm. Strictly speaking it's IMO still insofar unclear, that:
doesn't definitely tell (the "may" could be read in different ways IMO), what happens, if the following character does not have a special meaning, either because it generally has none (like in the What about the following:
Did I forget anything? ^^ |
Maybe one could extend the:
even to:
|
I hoped writing “any special meaning” would imply if there was no special meaning nothing happens. Would it be clearer to expand to the following? “Backslash followed by any character other than an ASCII digit or ASCII letter escapes the special meaning that character would have on its own, such as with I’m confused in your suggestion when you say for example n in |
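To make the distinction between ASCII letters and other characters concrete, here is a small sketch of what the parser appears to do with a backslash inside a set, assuming Python 3.7 or later. The escape \q is just a hypothetical example of an ASCII letter with no defined meaning:

```python
import re

# Non-alphanumeric character: the backslash is dropped and the
# character is matched literally.
at_sign = re.fullmatch(r'[\@]', '@')

# Non-ASCII letter: also treated as a plain literal.
non_ascii = re.fullmatch(r'[\ü]', 'ü')

# ASCII letter without a defined meaning: rejected when compiling.
try:
    re.compile(r'[\q]')
    bad_escape_error = None
except re.error as exc:
    bad_escape_error = exc

print(bool(at_sign), bool(non_ascii), bad_escape_error)
```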
I know, which is why I wrote »(the "may" could be read in different ways IMO)«. IMO a reader could interpret this correctly as "the character has a special meaning or it does not have a special meaning.
Hmm. I think it goes in a better direction, but that alone would IMO be still ambiguous in cases like e.g.
That's indeed a flaw. One could perhaps write, that it becomes the literal character (only), if the character alone would have the special meaning (like the |
To add on to this imo "special characters lose their meaning inside sets" sounds like there are no special characters inside sets whereas actually there even are new special characters that don't have a special meaning outside sets ( Btw is it just me or is the spacing between bullet points fluctuating in that section? |
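A short sketch of both directions, assuming current CPython behaviour. The characters picked here are illustrative and may not be the ones the comment above had in mind:

```python
import re

# "." and "*" are ordinary characters inside a set.
literal_specials = re.findall(r'[.*]', 'a.b*c')

# "-" between two characters forms a range only inside a set.
range_in_set = re.findall(r'[a-c]', 'abcd-')
hyphen_outside = re.findall(r'a-c', 'abc a-c')

# "^" as the first character complements the set; elsewhere it is literal.
complemented = re.findall(r'[^a]', 'aba')
caret_literal = re.findall(r'[a^]', 'a^b')

print(literal_specials, range_in_set, hyphen_outside)
print(complemented, caret_literal)
```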
I think so. Additional exceptions about \b for backspace and octal escapes vs group references are already documented. The only other thing that comes to mind is a technicality in the wording for non-alphanumeric escapes. An escaped character in a complemented set seems to exclude that character from matches like any other ordinary character, but we currently say “the resulting RE will match the second character”.
>>> re.fullmatch(r'[\@]', '@')  # Matches escaped character
<re.Match object; span=(0, 1), match='@'>
>>> print(re.fullmatch(r'[^\@]', '@')) # Does not match due to set complement
None
Yes I think every time there is an index entry, it starts a new bullet list spaced from the previous list. Not sure if it is possible to have an index entry pointing inside a bullet list. |
Co-authored-by: Martin Panter <[email protected]> Co-authored-by: Hugo van Kemenade <[email protected]>
…GH-106517) (cherry picked from commit 1557da6) Co-authored-by: Serhiy Storchaka <[email protected]> Co-authored-by: Martin Panter <[email protected]> Co-authored-by: Hugo van Kemenade <[email protected]>
…6517) (#132365) Co-authored-by: Serhiy Storchaka <[email protected]> Co-authored-by: Martin Panter <[email protected]> Co-authored-by: Hugo van Kemenade <[email protected]>
…#106517) Co-authored-by: Martin Panter <[email protected]> Co-authored-by: Hugo van Kemenade <[email protected]>
Documentation
The claim at:
cpython/Doc/library/re.rst
Lines 253 to 255 in d0c6ba9
seems wrong at least for \.
Consider the following example:
My expectation would be that after backslash-unescaping the b"…"-string, pattern is assigned the sequence of: literal \, the line-feed "character", the carriage-return "character".
If it were true that "Special characters lose their special meaning inside sets.", then the resolved \ in the unescaped pattern should match the one in my test string b"a\\b"; however, it does not.
I guess what Python actually "sees" is:
backslash-escaped line-feed "character", the carriage-return "character"
which probably effectively yields:
the line-feed "character", the carriage-return "character"
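The original example is not reproduced here, so the following is only a sketch of the situation described. The pattern is an assumption, reconstructed from the description of a set holding a backslash, a line feed and a carriage return:

```python
import re

# Hypothetical pattern: after bytes-literal unescaping, the set holds
# a backslash, a line feed and a carriage return.
pattern = b"[\\\n\r]"
test_string = b"a\\b"

# re reads the backslash as escaping the line feed that follows it, so
# the set effectively contains only line feed and carriage return.
no_match = re.search(pattern, test_string)

# Doubling the backslash at the regex level puts a literal backslash
# into the set.
fixed_pattern = b"[\\\\\n\r]"
match = re.search(fixed_pattern, test_string)

print(no_match)
print(match.span())
```

Under this assumption, the first search finds nothing because the backslash is consumed as an escape, while the second matches the backslash in the test string.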
Now you could argue that the \ is not considered a special character in terms of the regular expression syntax... but it is, at least already because of:
cpython/Doc/library/re.rst
Lines 504 to 507 in d0c6ba9
and ff.
Also, even the section that explains […] mentions the escaping functionality of it:
cpython/Doc/library/re.rst
Lines 249 to 250 in d0c6ba9
I think:
cpython/Doc/library/re.rst
Lines 253 to 255 in d0c6ba9
should be improved to document that:
\ is exempt from this
[0\-9] is 0, - and 9, because the - was special in that position. But what about [\-9]? Here, the - would not have been special, so is the result \, - and 9 or just - and 9?
\ ... ones that are special outside an RE bracket expression, like \$, \D, \w or \number ... and/or ones that are never special, like \ü.
Thanks,
Chris.
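As a quick check of the [0\-9] and [\-9] question, this is a minimal sketch of what the re module appears to do in current CPython. The sample string is made up for illustration:

```python
import re

# Sample containing the characters 0, backslash, hyphen, 5 and 9.
sample = r'0\-59'

# Escaped hyphen between two characters: literal "-", no range.
escaped_middle = re.findall(r'[0\-9]', sample)

# Escaped hyphen at the start: the backslash is still consumed as an
# escape, so the set holds only "-" and "9".
escaped_first = re.findall(r'[\-9]', sample)

# Unescaped hyphen for comparison: a range from "0" to "9".
unescaped_range = re.findall(r'[0-9]', sample)

print(escaped_middle, escaped_first, unescaped_range)
```

In both escaped cases the backslash in the sample string is not matched, which suggests the backslash acts as an escape inside a set even when the following character would not have been special in that position.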