Description
Bug report
Regular expressions that combine a possessive quantifier with a negative lookahead erroneously match extra characters in re module 2.2.1 of Python 3.11. (The test was run on Windows 10 using the official distribution of Python 3.11.0.)
For example, the following regular expression aims to match consecutive characters that are not 'C' in the string 'ABC'. (There are simpler ways to do this, but this is just an example to illustrate the problem.)
import re
text = 'ABC'
print('Possessive quantifier, negative lookahead:',
      re.findall(r'(((?!C).)++)', text))
Output:
Possessive quantifier, negative lookahead: [('ABC', 'B')]
The first subgroup of the match is the entire match, while the second subgroup is the last character that was matched. They should be 'AB' and 'B', respectively. While the last matched character is correctly identified as 'B', the complete match is erroneously set to 'ABC'.
Replacing the negative lookahead with a positive lookahead eliminates the problem:
print('Possessive quantifier, positive lookahead:',
      re.findall(r'(((?=[^C]).)++)', text))
Output:
Possessive quantifier, positive lookahead: [('AB', 'B')]
Alternatively, keeping the negative lookahead but replacing the possessive quantifier with a greedy quantifier also eliminates the problem:
print('Greedy quantifier, negative lookahead:',
      re.findall(r'(((?!C).)+)', text))
Output:
Greedy quantifier, negative lookahead: [('AB', 'B')]
While this example uses the ++ quantifier, the *+ and ?+ quantifiers exhibit similar behaviour. Also, using a longer pattern in the negative lookahead leads to even more characters being erroneously matched.
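A minimal sketch for checking all three quantifiers side by side. It relies on the assumption that, with nothing following the quantified group, the possessive and greedy forms should match the same text, so any difference points to the problem described above. The helper name `compare_quantifiers` is hypothetical and not part of the re module.

```python
import re
import sys


def compare_quantifiers(text="ABC"):
    """Return {quantifier: (greedy_match, possessive_match)}.

    Hypothetical helper for this report: the repeated body is a single
    character guarded by the negative lookahead (?!C).
    """
    body = r"((?!C).)"
    results = {}
    for quant in ("+", "*", "?"):
        # The greedy form serves as the reference result.
        greedy = re.match(body + quant, text).group(0)
        if sys.version_info >= (3, 11):
            # Possessive form: the same quantifier followed by '+'.
            possessive = re.match(body + quant + "+", text).group(0)
        else:
            # Possessive quantifiers are not supported before Python 3.11.
            possessive = None
        results[quant] = (greedy, possessive)
    return results


for quant, (greedy, possessive) in compare_quantifiers().items():
    status = "same" if greedy == possessive else "DIFFERENT"
    print(f"{quant!r}: greedy={greedy!r} possessive={possessive!r} ({status})")
```

Under that assumption, every line should report "same" on an interpreter that is not affected; a "DIFFERENT" line identifies a quantifier that shows the behaviour reported here.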
Thank you for adding possessive quantifiers to the re module! It is a very useful feature!
Environment
- re module 2.2.1 in standard library
- CPython versions tested on: 3.11.0
- Operating system and architecture: Windows 10