Ignore:
Timestamp:
Nov 28, 2010, 4:45:16 PM (15 years ago)
Author:
[email protected]
Message:

Bug 48101 - Yarr gives different results for /(?:a*?){2,}/

Reviewed by Sam Weinig.

JavaScriptCore:

The test cases in the linked mozilla bug demostrate a couple of
problems in subpattern matching. These bugs lie in the optimized
cases - for matching parentheses with a quantity count of 1, and
for matching greedy quantified parentheses at the end of a regex
(which do not backtrack).

In both of these cases we are failing to correctly handle empty
matches. In the case of parenthese-single matches (quantity count
one) we are failing to test for empty matches at all. In the case
of terminal subpattern matches we do currenty check, however there
is a subtler bug here too. In the case of an empty match we will
presently immediately fall through to the next alternative (or
complete the regex match), whereas upon a failed match we should
be backtracking into the failing alternative, to give it a chance
to match further (e.g. consider /a??b?|a/.exec("ab") - upon first
attempting to match the first alternative this will match the empty
string - since a?? is non-greedy, however rather than moving on to
the second alternative we should be re-matching the first one, at
which point the non-greedy a?? will match, and as such the result
should be "ab", not "a").

Terminal sunpattern matching contains a second bug, too. The frame
location values in the subpattern should be being allocated with
the outer disjunction's frame (as we do for the parentheses-single
optimization). Consider the following three regexes:

/a*(?:b*)*c*/
/a*(?:b*)c*/
/a*(?:b*)*/

Considering only the frame location required by the atoms a,b, and
c, (ignoring space associated with the nested subpattern) the first
regex (a normal subpattern match) requires a frame size of 2 for
the outer disjunction, (to backtrack terms a & c), with each
iteration of the subpattern requiring a frame of size 1 (in order
to backtrack b). In the case of the second regex (where the
parentheses-single optimization will kick in) the outer frame must
be set up with a frame size of 3, since the outer frame will also
be used when running the nested subpattern. We will currently only
allocate a farme of size 1 for the outer disjuntion (to contain a),
howver the frame size should be 2 (since the subpattern will be
evaluated in the outer frame). In addition to failing to allocate
frame space the frame offsets are also presently invalid - in the
case of the last regex b's frame location will be set assuming it
to be the first term in the frame, whereas in this case b lies
after the term a, and should be taking a separate frame location.

In order to correctly allocate the frame for terminal subpattern
matches we must move this optimization back up from the JIT into
the compiler (and thus interpreter too), since this is where the
frame allocation takes place.

  • yarr/RegexCompiler.cpp:

(JSC::Yarr::RegexPatternConstructor::setupAlternativeOffsets):
(JSC::Yarr::RegexPatternConstructor::checkForTerminalParentheses):
(JSC::Yarr::compileRegex):

  • yarr/RegexInterpreter.cpp:

(JSC::Yarr::Interpreter::matchParenthesesOnceBegin):
(JSC::Yarr::Interpreter::matchParenthesesOnceEnd):
(JSC::Yarr::Interpreter::backtrackParenthesesOnceBegin):
(JSC::Yarr::Interpreter::backtrackParenthesesOnceEnd):
(JSC::Yarr::Interpreter::matchParenthesesTerminalBegin):
(JSC::Yarr::Interpreter::matchParenthesesTerminalEnd):
(JSC::Yarr::Interpreter::backtrackParenthesesTerminalBegin):
(JSC::Yarr::Interpreter::backtrackParenthesesTerminalEnd):
(JSC::Yarr::Interpreter::matchDisjunction):
(JSC::Yarr::ByteCompiler::atomParenthesesOnceBegin):
(JSC::Yarr::ByteCompiler::atomParenthesesTerminalBegin):
(JSC::Yarr::ByteCompiler::atomParenthesesSubpatternBegin):
(JSC::Yarr::ByteCompiler::atomParentheticalAssertionEnd):
(JSC::Yarr::ByteCompiler::atomParenthesesSubpatternEnd):
(JSC::Yarr::ByteCompiler::atomParenthesesOnceEnd):
(JSC::Yarr::ByteCompiler::atomParenthesesTerminalEnd):
(JSC::Yarr::ByteCompiler::emitDisjunction):

  • yarr/RegexInterpreter.h:
  • yarr/RegexJIT.cpp:

(JSC::Yarr::RegexGenerator::generateParenthesesSingle):
(JSC::Yarr::RegexGenerator::generateParenthesesGreedyNoBacktrack):
(JSC::Yarr::RegexGenerator::generateTerm):

  • yarr/RegexPattern.h:

(JSC::Yarr::PatternTerm::PatternTerm):

LayoutTests:

Add layout tests for corner cases of repeat matches in regular expressions,
and of the examples documented in the ECMA-262 spec .

  • fast/regex/ecma-regex-examples-expected.txt: Added.
  • fast/regex/ecma-regex-examples.html: Added.
  • fast/regex/repeat-match-waldemar-expected.txt: Added.
  • fast/regex/repeat-match-waldemar.html: Added.
  • fast/regex/script-tests/ecma-regex-examples.js: Added.
  • fast/regex/script-tests/repeat-match-waldemar.js: Added.
File:
1 edited

Legend:

Unmodified
Added
Removed
  • trunk/JavaScriptCore/yarr/RegexCompiler.cpp

    r72489 r72781  
    668668                // Note: for fixed once parentheses we will ensure at least the minimum is available; others are on their own.
    669669                term.frameLocation = currentCallFrameSize;
    670                 if ((term.quantityCount == 1) && !term.parentheses.isCopy) {
    671                     if (term.quantityType == QuantifierFixedCount) {
    672                         currentCallFrameSize = setupDisjunctionOffsets(term.parentheses.disjunction, currentCallFrameSize, currentInputPosition);
     670                if (term.quantityCount == 1 && !term.parentheses.isCopy) {
     671                    if (term.quantityType != QuantifierFixedCount)
     672                        currentCallFrameSize += RegexStackSpaceForBackTrackInfoParenthesesOnce;
     673                    currentCallFrameSize = setupDisjunctionOffsets(term.parentheses.disjunction, currentCallFrameSize, currentInputPosition);
     674                    // If quantity is fixed, then pre-check its minimum size.
     675                    if (term.quantityType == QuantifierFixedCount)
    673676                        currentInputPosition += term.parentheses.disjunction->m_minimumSize;
    674                     } else {
    675                         currentCallFrameSize += RegexStackSpaceForBackTrackInfoParenthesesOnce;
    676                         currentCallFrameSize = setupDisjunctionOffsets(term.parentheses.disjunction, currentCallFrameSize, currentInputPosition);
    677                     }
     677                    term.inputPosition = currentInputPosition;
     678                } else if (term.parentheses.isTerminal) {
     679                    currentCallFrameSize += RegexStackSpaceForBackTrackInfoParenthesesTerminal;
     680                    currentCallFrameSize = setupDisjunctionOffsets(term.parentheses.disjunction, currentCallFrameSize, currentInputPosition);
    678681                    term.inputPosition = currentInputPosition;
    679682                } else {
     
    729732    }
    730733
     734    // This optimization identifies sets of parentheses that we will never need to backtrack.
     735    // In these cases we do not need to store state from prior iterations.
     736    // We can presently avoid backtracking for:
     737    //   * a set of parens at the end of the regular expression (last term in any of the alternatives of the main body disjunction).
     738    //   * where the parens are non-capturing, and quantified unbounded greedy (*).
     739    //   * where the parens do not contain any capturing subpatterns.
     740    void checkForTerminalParentheses()
     741    {
     742        // This check is much too crude; should be just checking whether the candidate
     743        // node contains nested capturing subpatterns, not the whole expression!
     744        if (m_pattern.m_numSubpatterns)
     745            return;
     746
     747        Vector<PatternAlternative*>& alternatives = m_pattern.m_body->m_alternatives;
     748        for (unsigned i =0; i < alternatives.size(); ++i) {
     749            Vector<PatternTerm>& terms = alternatives[i]->m_terms;
     750            if (terms.size()) {
     751                PatternTerm& term = terms.last();
     752                if (term.type == PatternTerm::TypeParenthesesSubpattern
     753                    && term.quantityType == QuantifierGreedy
     754                    && term.quantityCount == UINT_MAX
     755                    && !term.capture())
     756                    term.parentheses.isTerminal = true;
     757            }
     758        }
     759    }
     760
    731761    void optimizeBOL()
    732762    {
     
    931961    }
    932962
     963    constructor.checkForTerminalParentheses();
    933964    constructor.optimizeBOL();
    934965       
Note: See TracChangeset for help on using the changeset viewer.