Outdated Unicode data in the re module #91575

@serhiy-storchaka

Description

  1. The re module contains a table of lowercase characters that share an uppercase form, that is, characters c1 and c2 such that c1.upper() == c2.upper() and c1 != c2 and c1.lower() == c1 and c2.lower() == c2. For example, 'ς' and 'σ': 'ς'.upper() == 'σ'.upper() == 'Σ'.

    The table was generated for Python 3.5. Newer Python versions support newer Unicode standards, which added more such characters. For example, 'в' and 'ᲀ': 'в'.upper() == 'ᲀ'.upper() == 'В'.

    Python re lib fails case insensitive matches on Unicode data #56937

  2. The code relies on an assumption about characters outside the BMP. The comment says that there are only two ranges of cased non-BMP characters, and that RANGE_UNI_IGNORE works with them.

    There are now more ranges of cased non-BMP characters. It seems the assumption still holds and RANGE_UNI_IGNORE still works, but the comment is outdated.

    IGNORECASE breaks unicode literal range matching #61583
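
The condition from item 1 can be written down as a small predicate. This is only an illustrative sketch: the function name `same_upper_lowercase_pair` is hypothetical, and it uses `str.upper()` and `str.lower()` exactly as the condition above is written, which is not necessarily how the re module compares characters internally.

```python
def same_upper_lowercase_pair(c1, c2):
    """Return True if c1 and c2 are distinct lowercase characters
    that share the same uppercase form (the condition from item 1)."""
    return (
        c1 != c2
        and c1.lower() == c1
        and c2.lower() == c2
        and c1.upper() == c2.upper()
    )


if __name__ == "__main__":
    # Final sigma and sigma both uppercase to capital sigma.
    print(same_upper_lowercase_pair('\u03c2', '\u03c3'))  # True
    # U+0432 and U+1C80: the result depends on the Unicode version
    # bundled with the running interpreter.
    print(same_upper_lowercase_pair('\u0432', '\u1c80'))
```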

The plan is:

  • Regenerate the table using the current Unicode version for each maintained Python version.
  • Test the assumption and update the comment.
  • Add a script and a make target for generating the table from the current Unicode version (development version only).
  • For the above assumption, either test it in the script, or make the code work even if the assumption does not hold.
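
A rough sketch of what such a script might compute is shown below. It is not the script proposed for CPython: the function names are hypothetical, "cased" is assumed here to mean that `str.lower()` or `str.upper()` changes the character, and the grouping uses `str.upper()` as in the condition from item 1. The results depend on the Unicode version of the interpreter that runs it.

```python
import sys
from collections import defaultdict


def equivalence_groups():
    """Group lowercase characters that share the same uppercase form."""
    by_upper = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        # Only characters that are their own lowercase are candidates.
        if c.lower() == c:
            by_upper[c.upper()].append(c)
    # A group is interesting only if it has at least two members.
    return sorted(tuple(g) for g in by_upper.values() if len(g) > 1)


def cased_non_bmp_ranges():
    """Return contiguous (first, last) code point ranges of non-BMP
    characters that change under lower() or upper()."""
    ranges = []
    for cp in range(0x10000, sys.maxunicode + 1):
        c = chr(cp)
        if c.lower() != c or c.upper() != c:
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp  # extend the current range
            else:
                ranges.append([cp, cp])  # start a new range
    return [tuple(r) for r in ranges]


if __name__ == "__main__":
    for group in equivalence_groups():
        print(', '.join('U+%04X' % ord(c) for c in group))
    for first, last in cased_non_bmp_ranges():
        print('%X-%X' % (first, last))
```

Listing the ranges does not by itself show that RANGE_UNI_IGNORE handles them correctly; that would need a separate check against the matching code.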

Labels: 3.9 (only security fixes), 3.10 (only security fixes), 3.11 (only security fixes), topic-regex, type-bug, type-feature
