Description
The `re` module contains a table of pairs of distinct lowercase characters that share the same uppercase form, that is, characters for which `c1.upper() == c2.upper() and c1 != c2 and c1.lower() == c1 and c2.lower() == c2`. For example, 'ς' and 'σ': `'ς'.upper() == 'σ'.upper() == 'Σ'`.
The table was generated for Python 3.5. But newer Python versions support newer Unicode standards, and more such characters were added. For example, 'в' and 'ᲀ': `'в'.upper() == 'ᲀ'.upper() == 'В'`.
Related issue: Python re lib fails case insensitive matches on Unicode data #56937
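
The condition above can be written as a small predicate. This is only an illustrative sketch: the helper name `is_extra_case_pair` is hypothetical and is not part of the `re` module.

```python
def is_extra_case_pair(c1: str, c2: str) -> bool:
    """Return True if c1 and c2 are distinct lowercase characters
    that share the same uppercase form (the condition quoted above)."""
    return (
        c1 != c2
        and c1.upper() == c2.upper()
        and c1.lower() == c1
        and c2.lower() == c2
    )


print(is_extra_case_pair('ς', 'σ'))  # True: both uppercase to 'Σ'
print(is_extra_case_pair('a', 'b'))  # False: different uppercase forms
```

Whether a pair such as 'в' and 'ᲀ' satisfies the predicate depends on the Unicode version of the running interpreter, which is why a table generated for an older version misses it.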
The code depends on an assumption about characters outside the BMP range. The comment says that there are only two ranges of cased non-BMP characters, and that `RANGE_UNI_IGNORE` works with them.
Now there are more ranges of cased non-BMP characters. The assumption still seems to hold and `RANGE_UNI_IGNORE` still works, but the comment is outdated.
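
To see how many ranges of cased non-BMP characters the running interpreter knows about, something like the following could be used. It is a rough sketch, not the check used by `re`: the function name is hypothetical, and "cased" is taken here to mean "changed by `str.lower()` or `str.upper()`", which is a working definition for this sketch only.

```python
import sys
import unicodedata


def cased_non_bmp_ranges():
    """Return (first, last) code point pairs for contiguous runs of
    non-BMP characters that str.lower() or str.upper() changes."""
    ranges = []
    for cp in range(0x10000, sys.maxunicode + 1):
        c = chr(cp)
        if c.lower() == c and c.upper() == c:
            continue  # no case mapping, not "cased" for this sketch
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [tuple(r) for r in ranges]


if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    for first, last in cased_non_bmp_ranges():
        print(f"{first:#x}-{last:#x}")
```

The output depends on the Unicode version bundled with the interpreter, so running it on each maintained Python version shows whether the "only two ranges" statement in the comment still matches reality.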
The plan is:
- Regenerate the table with the current Unicode version for each maintained Python version.
- Test the assumption and update the comment.
- Add a script and a `make` target for generating that table with the current Unicode version (the development version only).
- For the above assumption, either test it in the script, or make the code work in case it is not true.
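
A generation script along these lines might look as follows. This is a minimal sketch under stated assumptions: the function name `build_extra_cases` and the dictionary layout are hypothetical and not the format used by the `re` module, and the sketch only considers characters that `str.upper()` actually changes.

```python
import sys
import unicodedata
from collections import defaultdict


def build_extra_cases():
    """Map each lowercase code point to the other lowercase code points
    that share its uppercase form (hypothetical table layout)."""
    by_upper = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        # Keep lowercase characters that have an uppercase mapping.
        if c.lower() == c and c.upper() != c:
            by_upper[c.upper()].append(cp)
    table = {}
    for group in by_upper.values():
        if len(group) > 1:
            for cp in group:
                table[cp] = tuple(x for x in group if x != cp)
    return table


if __name__ == "__main__":
    table = build_extra_cases()
    print(f"# Generated for Unicode {unicodedata.unidata_version}")
    for cp in sorted(table):
        others = ", ".join(f"{x:#x}" for x in table[cp])
        print(f"{cp:#x}: ({others}),")
```

Because the result is derived from the interpreter's own `str.upper()` and `str.lower()`, rerunning the script after a Unicode update would pick up newly added characters without manual edits.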