Regex syntax parsing of unicode code points is incorrect #854

Closed
@dtzxporter

What version of regex are you using?

Latest

If it isn't the latest version, then please upgrade and check whether the bug
is still present.

Describe the bug at a high level.

Because regex_syntax relies on char::from_u32, not all valid Unicode code points are parsed, which prevents valid regexes from compiling.

Give a brief description of the actual problem you're observing.

Rust defines char as a "Unicode scalar value" and explicitly states that this is similar to, but not the same as, a Unicode code point.
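To make the distinction concrete, here is a minimal sketch of how char::from_u32 treats different inputs. The `CodePointKind` enum and `classify` function are hypothetical names introduced only for illustration; they are not part of regex_syntax or the standard library.

```rust
/// Hypothetical classification of a raw `u32`, for illustration only.
#[derive(Debug, PartialEq, Eq)]
enum CodePointKind {
    /// A Unicode scalar value, representable as a Rust `char`.
    Scalar(char),
    /// A surrogate code point (U+D800..=U+DFFF): a code point, but not a scalar value.
    Surrogate(u32),
    /// Greater than U+10FFFF: not a Unicode code point at all.
    OutOfRange(u32),
}

fn classify(cp: u32) -> CodePointKind {
    match char::from_u32(cp) {
        // `char::from_u32` succeeds only for scalar values.
        Some(c) => CodePointKind::Scalar(c),
        // It returns `None` for the surrogate range...
        None if (0xD800..=0xDFFF).contains(&cp) => CodePointKind::Surrogate(cp),
        // ...and for anything above U+10FFFF.
        None => CodePointKind::OutOfRange(cp),
    }
}
```

Under this sketch, `classify(0xD800)` yields `CodePointKind::Surrogate(0xD800)`, which is the case this issue is about: the value is a code point, yet char::from_u32 rejects it.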

The parser is supposed to extract all code points as documented above the function:
https://github.com/rust-lang/regex/blob/master/regex-syntax/src/ast/parse.rs#L1611

What is the expected behavior?

I expect this crate to include custom logic for validating code points, instead of relying on char::from_u32, which omits valid code points (the surrogate values) because they aren't considered scalar values.
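As a rough sketch of the validation being requested, the hex digits of an escape could be checked against the full code point range rather than passed through char::from_u32. The function name `parse_hex_code_point` is an assumption made for this example and does not exist in regex_syntax. Note that a surrogate accepted this way still cannot be stored in a `char` or encoded in a `&str`, so the sketch covers the parsing step only.

```rust
/// Hypothetical helper: parse the hex digits of an escape (for example the
/// `D800` in `\x{D800}`) and accept any Unicode code point, surrogates included.
fn parse_hex_code_point(digits: &str) -> Option<u32> {
    // Reject empty input and anything that is not a plain hex digit
    // (`u32::from_str_radix` alone would also accept a leading `+`).
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // Values that overflow `u32` fail here and are rejected.
    let value = u32::from_str_radix(digits, 16).ok()?;
    // The code point range is U+0000..=U+10FFFF; surrogates fall inside it.
    if value <= 0x10FFFF {
        Some(value)
    } else {
        None
    }
}
```

With this helper, `parse_hex_code_point("D800")` returns `Some(0xD800)`, whereas `char::from_u32(0xD800)` returns `None`.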

JavaScript and several other regex engines handle these code points without issue.
