move to byte based automata

A prerequisite to a faster regex engine, and a DFA (see #66) is compiling UTF-8 decoding into the regex automaton. The reason why this is necessary, especially for a DFA, is so state transitions can proceed a byte-at-a-time and can therefore avoid large character maps in each state transition. We can preserve the invariant that all match boundaries returned are on UTF-8 code unit sequence boundaries by virtue of the fact that all regex automatons exclusively match valid UTF-8 encodings.

This is also a prerequisite to providing byte-based regexes that can match on a `&[u8]` instead of requiring a `&str`. See #85.

There are a few tricky components to this:
1. Translating an arbitrary Unicode codepoint range to a sequence of non-overlapping byte-based character classes. This is done already in the [`utf8_ranges`](http://burntsushi.net/rustdoc/utf8_ranges/) crate.
2. Using said byte based character classes in the compiler. I've [written a proof of concept](https://github.com/BurntSushi/fst/blob/master/src/regex/compile.rs), but the automata generated are squeamishly large for anything non-trivial.
3. Being diligent about supporting zero-width assertions that are Unicode aware. Fortunately, this crate only has one: word boundaries. (I think this should be somewhat straight-forward to do in the NFA and backtracking engines, but is hard to do well in a DFA.)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

move to byte based automata #146

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

move to byte based automata #146

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions