Outdated Unicode data in the re module

1. The `re` module contains a table of distinct lowercase characters that share the same uppercase form, that is, pairs satisfying `c1.upper() == c2.upper() and c1 != c2 and c1.lower() == c1 and c2.lower() == c2`. For example, 'ς' and 'σ': `'ς'.upper() == 'σ'.upper() == 'Σ'`.

   It was generated for Python 3.5. Newer Python versions support newer Unicode standards, which add more such characters. For example, 'в' and 'ᲀ': `'в'.upper() == 'ᲀ'.upper() == 'В'`.

   #56937

2. The code depends on an assumption about characters outside the BMP range. The comment says that there are only two ranges of cased non-BMP characters, and that RANGE_UNI_IGNORE works with them.

   There are now more ranges of cased non-BMP characters. It seems that the assumption still holds and RANGE_UNI_IGNORE still works, but the comment is outdated.

   #61583
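
The condition in item 1 can be checked against whatever Unicode version the running interpreter ships. The following is only a minimal sketch, not the script used to generate the table in the `re` module; the function name `same_upper_groups` is hypothetical.

```python
import sys
from collections import defaultdict


def same_upper_groups():
    """Group lowercase-stable characters by their uppercase form.

    Hypothetical helper: returns {upper_form: [chars]} for every uppercase
    form shared by at least two distinct characters c with c.lower() == c.
    """
    by_upper = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        # Keep only characters that are their own lowercase form.
        if c.lower() == c:
            by_upper[c.upper()].append(c)
    # Only forms reached from two or more distinct characters are of interest.
    return {up: chars for up, chars in by_upper.items() if len(chars) > 1}


if __name__ == "__main__":
    groups = same_upper_groups()
    for up, chars in sorted(groups.items()):
        print(up, " ".join(f"U+{ord(c):04X}" for c in chars))
```

The result depends on the interpreter's Unicode data, so running it under different Python versions is one way to see which groups a table generated for an older version would miss.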

The plan is:

- [x] Regenerate the table with the current Unicode version for each maintained Python version.
- [x] Test the assumption and update the comment.
- [ ] Add a script and a `make` target for generating that table with the current Unicode version (the development version only).
- [ ] For the above assumption, either test it in the script, or make the code work in case it is not true.
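
As a starting point for testing the non-BMP assumption in a script, one could list the contiguous ranges of non-BMP characters that have a case mapping. This is only a sketch: it treats a character as cased when `lower()` or `upper()` changes it (an assumption made here, not necessarily the definition the `re` code uses), it does not verify that RANGE_UNI_IGNORE handles the ranges correctly, and the name `cased_non_bmp_ranges` is hypothetical.

```python
import sys


def cased_non_bmp_ranges():
    """Return inclusive (first, last) code point ranges above the BMP
    whose characters are changed by str.lower() or str.upper()."""
    ranges = []
    start = prev = None
    for cp in range(0x10000, sys.maxunicode + 1):
        c = chr(cp)
        if c.lower() == c and c.upper() == c:
            continue  # no case mapping for this code point
        if prev is not None and cp == prev + 1:
            prev = cp  # extend the current contiguous range
        else:
            if start is not None:
                ranges.append((start, prev))
            start = prev = cp  # open a new range
    if start is not None:
        ranges.append((start, prev))
    return ranges


if __name__ == "__main__":
    for first, last in cased_non_bmp_ranges():
        print(f"{first:05X}-{last:05X}")
```

A generating script could compare this list with the ranges named in the comment and fail loudly when they differ, so that a Unicode update cannot silently invalidate the comment.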


Outdated Unicode data in the re module #91575

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Outdated Unicode data in the re module #91575

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions