Skip to content

move to byte based automata #146

Closed
Closed
@BurntSushi

Description

@BurntSushi

A prerequisite to a faster regex engine, and a DFA (see #66) is compiling UTF-8 decoding into the regex automaton. The reason why this is necessary, especially for a DFA, is so state transitions can proceed a byte-at-a-time and can therefore avoid large character maps in each state transition. We can preserve the invariant that all match boundaries returned are on UTF-8 code unit sequence boundaries by virtue of the fact that all regex automatons exclusively match valid UTF-8 encodings.

This is also a prerequisite to providing byte-based regexes that can match on a &[u8] instead of requiring a &str. See #85.

There are a few tricky components to this:

  1. Translating an arbitrary Unicode codepoint range to a sequence of non-overlapping byte-based character classes. This is done already in the utf8_ranges crate.
  2. Using said byte based character classes in the compiler. I've written a proof of concept, but the automata generated are squeamishly large for anything non-trivial.
  3. Being diligent about supporting zero-width assertions that are Unicode aware. Fortunately, this crate only has one: word boundaries. (I think this should be somewhat straight-forward to do in the NFA and backtracking engines, but is hard to do well in a DFA.)

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions