re: documentation claim that special characters lose their special meaning inside […] seems wrong #106482

Open
calestyo opened this issue Jul 6, 2023 · 12 comments
Labels
docs Documentation in the Doc dir topic-regex

Comments

@calestyo
Contributor

calestyo commented Jul 6, 2023

Documentation

The claim at:

cpython/Doc/library/re.rst

Lines 253 to 255 in d0c6ba9

* Special characters lose their special meaning inside sets. For example,
``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
``'*'``, or ``')'``.

seems wrong at least for \.

Consider the following example:

>>> import re
>>> bool(re.search(string=b"a\\b",pattern=b"[\\\n\r]"))
False

My expectation would be that after backslash-unescaping the b"…"-string, pattern is assigned the sequence of:

literal \, the line-feed "character", the carriage-return "character"

If it were true that "Special characters lose their special meaning inside sets.", then the resolved \ in the unescaped pattern should match the one in my test string b"a\\b"; however, it does not.

I guess what Python actually "sees" is:

backslash-escaped line-feed "character", the carriage-return "character"

which probably effectively yields:

the line-feed "character", the carriage-return "character"
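A minimal sketch that checks this guess against the behaviour of the `re` module, assuming a current CPython. The `results` dictionary is only an illustration name; the pattern is the one from the report.

```python
import re

# The pattern from the report: '[', backslash, LF, CR, ']' as raw bytes.
pattern = b"[\\\n\r]"

results = {
    "backslash": bool(re.search(pattern, b"a\\b")),
    "line feed": bool(re.search(pattern, b"a\nb")),
    "carriage return": bool(re.search(pattern, b"a\rb")),
}
print(results)
# {'backslash': False, 'line feed': True, 'carriage return': True}
```

So the backslash is consumed as an escape for the line feed that follows it, and the set ends up containing only LF and CR.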

Now you could argue that \ is not considered a special character in terms of the regular expression syntax... but it is, if only because of:

cpython/Doc/library/re.rst

Lines 504 to 507 in d0c6ba9

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

and ff..

Also, even the section that explains […] mentions the escaping functionality of it:

cpython/Doc/library/re.rst

Lines 249 to 250 in d0c6ba9

``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
``[a\-z]``) or if it's placed as the first or last character

I think:

cpython/Doc/library/re.rst

Lines 253 to 255 in d0c6ba9

* Special characters lose their special meaning inside sets. For example,
``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
``'*'``, or ``')'``.

should be improved to document that:

  • \ is exempt from this
  • whether or not this is only the case for characters that are actually special with respect to the RE bracket expression, i.e. [0\-9] is 0, - and 9, because the - was special in that position. But what about [\-9]? Here, the - would not have been special, so is the result \, - and 9 or just - and 9?
  • or whether this is simply the case for any character following the \ ... ones that are special outside an RE bracket expression, like \$, \D, \w or \number... and/or ones that are never special, like \ü.
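The [0\-9] and [\-9] questions above can be probed directly. This is a small sketch, not a statement of documented guarantees; `set_members` is a hypothetical helper introduced only for this illustration.

```python
import re

def set_members(pattern, candidates="\\-09z5"):
    """Return the candidate characters that the one-character set matches."""
    return [ch for ch in candidates if re.fullmatch(pattern, ch)]

print(set_members(r"[0\-9]"))  # ['-', '0', '9']
print(set_members(r"[\-9]"))   # ['-', '9']
print(set_members(r"[0-9]"))   # ['0', '9', '5']
```

In both escaped cases the backslash itself is not a member of the set, regardless of whether the hyphen would have been special at that position.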

Thanks,
Chris.

@calestyo calestyo added the docs Documentation in the Doc dir label Jul 6, 2023
@terryjreedy
Member

@serhiy-storchaka You might be the only coredev that can answer this question about \ in re [...] set expressions.

@vadmium
Member

vadmium commented Jul 7, 2023

Agreed the documentation on what is allowed in square-bracket character sets/classes could be made clearer.

There is documentation suggesting to escape a literal closing square bracket \], an initial opening bracket [, and doubled hyphens, ampersands, tildes, and vertical bars (--, &&, ~~, ||). So I conclude that in [\-9] the backslash and hyphen \- represent just a single literal hyphen character.

In the how-to https://docs.python.org/3/howto/regex.html#matching-characters, the six predefined backslash character classes \d \D \s \S \w \W are documented as allowed in square brackets. Also, \b is documented as representing the backspace control character in square brackets.
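A short sketch of the two documented behaviours mentioned here, assuming a current CPython; the variable names are illustrative only.

```python
import re

# \d and \s keep their character-class meaning inside a set.
hex_or_space = re.compile(r"[\da-fA-F\s]+")
print(bool(hex_or_space.fullmatch("dead beef 42")))  # True

# \b inside a set is the backspace character (0x08), not a word boundary.
backspace_set = re.compile(r"[\b]")
print(bool(backspace_set.fullmatch("\x08")))  # True
print(bool(backspace_set.fullmatch("b")))     # False
```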

A related limitation is it is not clear if there is any way to have a literal backslash in square brackets.

@calestyo
Contributor Author

calestyo commented Jul 7, 2023

A related limitation is it is not clear if there is any way to have a literal backslash in square brackets.
Shouldn't that just be via \\? Or do you mean that it's not yet properly documented?
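For what it is worth, a doubled backslash does appear to work in practice. The sketch below shows the observed behaviour only and says nothing about whether it is documented; the variable names are illustrative.

```python
import re

backslash_only = re.compile(r"[\\]")  # regex source: [\\]
print(bool(backslash_only.fullmatch("\\")))  # True
print(bool(backslash_only.fullmatch("a")))   # False

# Pitfall: r"[\\n]" is the set {backslash, 'n'}, not a newline.
backslash_or_n = re.compile(r"[\\n]")
matches = [ch for ch in ("\\", "n", "\n") if backslash_or_n.fullmatch(ch)]
print(matches)  # ['\\', 'n']
```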

In any case I'd hope that either \ always has a special meaning (which would include that it (needlessly) quotes a following character with no special meaning, like in \ü) inside bracket expressions - or never.

It would IMO be extremely confusing if the specialness of \ depended on what followed, like in the [\-9] example I gave above.

Other things:

cpython/Doc/library/re.rst

Lines 504 to 507 in d0c6ba9

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

I would interpret this as follows:

  • any character that is not an ASCII digit/letter is forever defined to be just that character, i.e. it's guaranteed that \ü or \- will never be special.

  • conversely, not only those ASCII digits/letters that are already listed may be special, that is, \q may one day become special

  • It's not clear what such non-defined \ + <ASCII-letter-or-digit> yield in terms of behaviour (or did I just miss that somewhere?). Do they resolve to the literal character? Give an exception?

    The initial text at:

    literal. Also, please note that any invalid escape sequences in Python's
    usage of the backslash in string literals now generate a :exc:`SyntaxWarning`
    and in the future this will become a :exc:`SyntaxError`. This behaviour
    will happen even if it is a valid escape sequence for a regular expression.

    obviously means the \-escapes from the strings, not from the REs.

  • In terms of RE escapes, \n, \t and friends do not seem to be defined... so r"[\n]", AFAIU, should fall under the previous question: is that a literal n, or does it give an exception? Or should it also be made special in terms of the RE, so that r"[\t]" would effectively be the same as "[\t]" and both match a horizontal tab?
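The last question can at least be answered empirically. A minimal sketch, assuming a current CPython (names are illustrative): the raw and non-raw spellings reach the regex engine as different strings, yet both sets match a tab and neither matches the letter.

```python
import re

raw_tab = re.compile(r"[\t]")    # the re module receives backslash + 't'
cooked_tab = re.compile("[\t]")  # the re module receives a real TAB character

for name, compiled in (("raw", raw_tab), ("cooked", cooked_tab)):
    print(name, bool(compiled.fullmatch("\t")), bool(compiled.fullmatch("t")))
# raw True False
# cooked True False

newline_hit = bool(re.fullmatch(r"[\n]", "\n"))
letter_hit = bool(re.fullmatch(r"[\n]", "n"))
print(newline_hit, letter_hit)  # True False
```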

@vadmium
Member

vadmium commented Jul 9, 2023

With Serhiy’s documentation changes, I think backslash escaping would be defined for the hyphen \-, \ü, and standard Python string escapes including \n, \t and the backslash itself \\.

For reserved ASCII letters like \q, the documentation would say they are errors. The code looks like it checks and raises an exception, but I’m not sure it is worth making that documented behaviour.
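A quick way to see what the code currently does with such escapes inside a set. This is a sketch of observed behaviour on recent CPython versions, not of documented guarantees, and exact error messages may differ between versions; `compile_status` is a hypothetical helper.

```python
import re

def compile_status(pattern):
    """Hypothetical helper: report whether `pattern` compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return "error: " + exc.msg
    return "ok"

# Unassigned letter, zero-width sequences, non-ASCII character, hyphen.
for pattern in (r"[\q]", r"[\A]", r"[\Z]", r"[\ü]", r"[\-]"):
    print(f"{pattern!a}: {compile_status(pattern)}")
```

On the versions tried, the three ASCII-letter escapes are rejected while the escaped ü and the escaped hyphen compile.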

@calestyo
Contributor Author

calestyo commented Jul 9, 2023

Not really sure about that... in his commit he says:

   * Backslash either escapes characters which have special meaning in a set
     such as ``'-'``, ``']'``, ``'^'`` and ``'\\'`` itself or signals
     a special sequence which represents a single character such as
     ``\xa0`` or ``\n`` or a character class such as ``\w`` or ``\S``
     (defined below).

I would interpret this as meaning that the \ in [\-9] is not an escaping one, as - would not have a special meaning at that place (nor is it a special sequence like \d).

It gets even weirder, because if it's not escaping, it would be a normal literal \. But then: is this now the set of \, - and 9 - or is it the range of characters from \ to 9 (in which case the - would be special again ;-) ), which would however be invalid, as \ is 0x5c and 9 is 0x39?
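The two readings can be told apart experimentally. The sketch below assumes a current CPython and uses illustrative names; it compares [\-9] with an explicit backslash-to-9 range written as [\\-9], and with a range that is valid, [\\-z].

```python
import re

print(hex(ord("\\")), hex(ord("9")))  # 0x5c 0x39

# [\-9]: the backslash escapes the hyphen; the set is '-' and '9'.
escaped_hyphen = [ch for ch in ("\\", "-", "9") if re.fullmatch(r"[\-9]", ch)]
print(escaped_hyphen)  # ['-', '9']

# [\\-9]: a literal backslash starts a range that ends before it begins.
try:
    re.compile(r"[\\-9]")
    range_result = "ok"
except re.error as exc:
    range_result = "error: " + exc.msg
print(range_result)

# [\\-z]: the same construction with a valid upper bound.
valid_range = [ch for ch in ("\\", "-", "a", "z") if re.fullmatch(r"[\\-z]", ch)]
print(valid_range)  # ['\\', 'a', 'z']
```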

And maybe I'm missing something, but I think it's still unclear whether \q or \ü are allowed and what they'd yield.

The former is an ASCII letter, but not yet defined, and the patch's:

     Special sequences which do not match a single character such as ``\A``
     and ``\Z`` are not allowed.

means special sequences which do not match a single char (but zero, or - if ever - more than one).

The latter (ü) is not ASCII, but his current wording would rather imply to me that it's either not allowed inside a bracket expression or undefined.

@vadmium
Member

vadmium commented Jul 10, 2023

Would the following bullet point work?

  • Backslash followed by any character other than an ASCII digit or ASCII letter escapes any special meaning that character may have on its own, such as with '-', ']', '^' and '\\' itself. Backslash followed by an ASCII digit or letter signals a special sequence which represents a single character such as \xa0 or \n or a character class such as \w or \S (defined below). Note that \b represents a single “backspace” character, not a word boundary as outside a set, and numeric escapes such as \1 are always octal escapes, not group references. Special sequences which do not match a single character such as \A and \Z are not allowed.
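The numeric-escape part of this proposed wording can be illustrated as follows. A sketch under the assumption of a current CPython; outside a set \1 is a group reference, inside a set it is the octal escape for U+0001.

```python
import re

octal_in_set = bool(re.fullmatch(r"[\1]", "\x01"))
backref = bool(re.fullmatch(r"(a)\1", "aa"))
set_not_backref = bool(re.fullmatch(r"(a)[\1]", "aa"))
set_as_octal = bool(re.fullmatch(r"(a)[\1]", "a\x01"))
hex_in_set = bool(re.fullmatch(r"[\xa0]", "\xa0"))

print(octal_in_set)     # True
print(backref)          # True
print(set_not_backref)  # False
print(set_as_octal)     # True
print(hex_in_set)       # True
```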

@calestyo
Contributor Author

Hmm. Strictly speaking, IMO it's still unclear, in that:

Backslash followed by any character other than an ASCII digit or ASCII letter escapes any special meaning that character may have on its own

doesn't definitively tell (the "may" could be read in different ways IMO) what happens if the following character does not have a special meaning, either because it generally has none (like in the \ü case) or because it does not have one at that position (like in the [\-e] case, where, without the \, the - would not be special).

What about the following:

  • Within a bracket expression, a backslash escapes the following character.
    • If that character has no special meaning at that position (like [\ü] or [\-9], which yield the literal ü respectively the literal - and 9) it results in the character only (the escaping \ is not kept as a literal character on its own).
    • If that character has special meaning (like [\\], [\]] or [0\-9], which yield the literal \ respectively the literal ] respectively the literal 0, - and 9) it results in the literal character (with no special meaning) only (the escaping \ is not kept as a literal character on its own).
    • Characters with special meaning inside a bracket expression include:
      • The characters specifically special for bracket expressions themselves \, -, ^ and ]
      • If preceded by an escaping \, any ASCII digit or ASCII letter, if it represents a single character (like \n or \xA0) or a character class (like \w or \S). Note that \b represents a single “backspace” character, not a word boundary as outside a set, and numeric escapes such as \1 are always octal escapes, not group references. If the character is an ASCII letter or ASCII digit, but the escape sequence does not yet have a defined meaning (like \q) or does not match a single character (like \A or \Z), its use is invalid and ???? raises an exception.

Did I forget anything? ^^

@calestyo
Contributor Author

Maybe one could extend the:

or [\-9], which yield the literal ü respectively the literal - and 9)

even to:

or [\-9], which yield the literal ü respectively the literal - and 9 but not the literal \)

@vadmium
Member

vadmium commented Jul 11, 2023

I hoped writing “any special meaning” would imply if there was no special meaning nothing happens. Would it be clearer to expand to the following?

“Backslash followed by any character other than an ASCII digit or ASCII letter escapes the special meaning that character would have on its own, such as with '-', ']', '^' and '\\' itself, or is ignored if there is no special meaning. Backslash followed by an ASCII digit or letter . . .”

I’m confused in your suggestion when you say for example n in \n has a special meaning, but also claim if the character following the backslash has a special meaning, it results in the literal character. Wouldn’t this mean \n represents a literal n rather than a newline? I think the deciding factor between a literal character vs something else is whether that second character is ASCII alphanumeric, not whether the character has a “special meaning”.

@calestyo
Contributor Author

calestyo commented Jul 11, 2023

I hoped writing “any special meaning” would imply if there was no special meaning nothing happens. Would it be clearer to expand to the following?

I know, which is why I wrote »(the "may" could be read in different ways IMO)«.

IMO a reader could interpret this correctly as "the character has a special meaning or it does not have a special meaning".

But there's the case of e.g. a, which alone by itself never has a special meaning, but only in combination with a leading \. Whereas others like - or ^ have or have no special meaning, just depending on their position (regardless of a leading \).

Also, one could interpret “any special meaning” not just as "it has one, or it has none", but also as "one out of a set of special meanings", like e.g. ^ has, which can be the start anchor (outside a bracket expression) or the set-negator (inside a bracket expression).

“Backslash followed by any character other than an ASCII digit or ASCII letter escapes the special meaning that character would have on its own, such as with '-', ']', '^' and '\' itself, or is ignored if there is no special meaning. Backslash followed by an ASCII digit or letter . . .”

Hmm. I think it goes in a better direction, but that alone would IMO be still ambiguous in cases like e.g. [\-9] because if there was no \, then the - would have no special meaning, thus in the above wording "escapes the special meaning that character would have on its own" (it would have none here), it would mean that the \ is not ignored, forming an invalid range here. Which might however also be easily a valid one assuming e.g. [\-z].

I’m confused in your suggestion when you say for example n in \n has a special meaning, but also claim if the character following the backslash has a special meaning, it results in the literal character. Wouldn’t this mean \n represents a literal n rather than a newline? I think the deciding factor between a literal character vs something else is whether that second character is ASCII alphanumeric, not whether the character has a “special meaning”.

That's indeed a flaw. One could perhaps write that it becomes the literal character (only) if the character alone would have the special meaning (like the - in [0-9]), whereas in contrast it becomes the "special character" (like newline) if the character gets its special meaning through the preceding \?

Just an idea though.

@Ymiros0

Ymiros0 commented Aug 2, 2023

To add on to this, imo "special characters lose their meaning inside sets" sounds like there are no special characters inside sets, whereas actually there are even characters that gain a new special meaning inside sets (^ and -) (which tbf should be quite obvious to the reader of that segment, but might be confusing nonetheless).
I don't quite follow this entire discussion about backslashes though, is there any major difference between backslashes outside sets and inside sets I am unaware of? Wouldn't it be simpler to say that they behave the same with the exception of escape sequences that do not define a single character?
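One way to probe that suggestion is to compare what a handful of single-character escapes match outside a set and inside one. This is only a sketch over a small, arbitrarily chosen sample (the names `SAMPLE`, `ESCAPES` and `matched` are made up for the illustration), so it cannot prove the general claim.

```python
import re

SAMPLE = "-@\n5A\\ "  # characters to test each pattern against
ESCAPES = [r"\-", r"\@", r"\n", r"\d", r"\x41", r"\\", r"\s"]

def matched(pattern):
    """Return the set of sample characters that `pattern` fully matches."""
    return {ch for ch in SAMPLE if re.fullmatch(pattern, ch)}

# For each escape, compare the bare form with the same escape inside [...].
same = {esc: matched(esc) == matched("[" + esc + "]") for esc in ESCAPES}
print(all(same.values()))  # True
```

For this sample the escapes behave the same in both positions; \b, numeric escapes such as \1, and zero-width sequences such as \A are deliberately left out because they are the exceptions under discussion.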

Btw is it just me or is the spacing between bullet points fluctuating in that section?

@vadmium
Member

vadmium commented Oct 15, 2024

Wouldn't it be simpler to say that they behave the same with the exception of escape sequences that do not define a single character?

I think so. Additional exceptions about \b for backspace and octal escapes vs group references are already documented.

The only other thing that comes to mind is a technicality in the wording for non-alphanumeric escapes. An escaped character in a complemented set seems to exclude that character from matches like any other ordinary character, but we currently say “the resulting RE will match the second character”.

>>> import re
>>> re.fullmatch(r'[\@]', '@')  # Matches escaped character
<re.Match object; span=(0, 1), match='@'>
>>> print(re.fullmatch(r'[^\@]', '@'))  # Does not match due to set complement
None

Btw is it just me or is the spacing between bullet points fluctuating in that section?

Yes I think every time there is an index entry, it starts a new bullet list spaced from the previous list. Not sure if it is possible to have an index entry pointing inside a bullet list.

hugovk added a commit that referenced this issue Apr 10, 2025
Co-authored-by: Martin Panter <[email protected]>
Co-authored-by: Hugo van Kemenade <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Apr 10, 2025
…GH-106517)

(cherry picked from commit 1557da6)

Co-authored-by: Serhiy Storchaka <[email protected]>
Co-authored-by: Martin Panter <[email protected]>
Co-authored-by: Hugo van Kemenade <[email protected]>
hugovk added a commit that referenced this issue Apr 10, 2025
…6517) (#132365)

Co-authored-by: Serhiy Storchaka <[email protected]>
Co-authored-by: Martin Panter <[email protected]>
Co-authored-by: Hugo van Kemenade <[email protected]>
seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025

5 participants