YarrInterpreter.cpp

Timestamp:

Mar 1, 2016, 4:39:01 PM

Author:

[email protected]

Message:

[ES6] Add support for Unicode regular expressions
https://bugs.webkit.org/show_bug.cgi?id=154842

Reviewed by Filip Pizlo.

Source/JavaScriptCore:

Added processing of Unicode regular expressions to the Yarr interpreter.

Changed parsing of regular expression patterns and PatternTerms to process characters as
UChar32 in the Yarr code. The parser converts matched surrogate pairs into the appropriate
Unicode character when the expression is parsed. When matching a unicode expression and
reading source characters, we convert a proper surrogate pair into a Unicode character and
advance the source cursor, "pos", one more position. The exception is when we know, while
generating a fixed character atom, that we need to match a unicode character that doesn't
fit in 16 bits. The code calls this an extendedUnicodeCharacter and has a helper to
determine this.
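
As an illustration of the surrogate pair handling described above, here is a minimal
standalone sketch. It is not the code from this patch: the helper names (isLeadSurrogate,
isTrailSurrogate, combineSurrogates, readCodePoint, isExtendedUnicodeCharacter) are
hypothetical, and the patch itself relies on ICU's U16_IS_LEAD, U16_IS_TRAIL and
U16_GET_SUPPLEMENTARY macros.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for ICU's U16_IS_LEAD / U16_IS_TRAIL.
inline bool isLeadSurrogate(char16_t c) { return (c & 0xFC00) == 0xD800; }
inline bool isTrailSurrogate(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Hypothetical stand-in for ICU's U16_GET_SUPPLEMENTARY.
inline char32_t combineSurrogates(char16_t lead, char16_t trail)
{
    return 0x10000
        + ((static_cast<char32_t>(lead) - 0xD800) << 10)
        + (static_cast<char32_t>(trail) - 0xDC00);
}

// Reads the character at pos and advances pos past it. When decodeSurrogatePairs
// is set and a lead surrogate is followed by a trail surrogate, the pair is
// returned as one code point and pos advances one extra position.
inline char32_t readCodePoint(const char16_t* input, std::size_t length, std::size_t& pos, bool decodeSurrogatePairs)
{
    assert(pos < length);
    char16_t first = input[pos++];
    if (decodeSurrogatePairs && isLeadSurrogate(first) && pos < length && isTrailSurrogate(input[pos]))
        return combineSurrogates(first, input[pos++]);
    return first;
}

// A character that does not fit in 16 bits (assumed meaning of "extended").
inline bool isExtendedUnicodeCharacter(char32_t ch) { return ch > 0xFFFF; }
```

With decoding disabled, the same input is read one 16-bit unit at a time, which is the
behavior of non-unicode patterns.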

Added 'u' flag and 'unicode' identifier to regular expression classes. Added an "isUnicode"
parameter to YarrPattern pattern() and internal users of that function.

Updated the generation of the canonicalization tables to include a new set of tables that
follow ES 6.0, 21.2.2.8.2 Step 2. Renamed the YarrCanonicalizeUCS2.* files to
YarrCanonicalizeUnicode.*.
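
To show how the two canonicalization modes in ES 6.0, 21.2.2.8.2 differ, here is a toy
sketch. It does not use the generated tables: toyUpperCase and toySimpleCaseFold are
hypothetical stand-ins that know only ASCII letters, U+017F and U+212A, and the legacy
rule about multi-character uppercase results is omitted.

```cpp
#include <cassert>

// Toy uppercase mapping (assumption: covers only ASCII letters and U+017F).
inline char32_t toyUpperCase(char32_t ch)
{
    if (ch >= U'a' && ch <= U'z')
        return ch - 0x20;
    if (ch == 0x017F) // LATIN SMALL LETTER LONG S uppercases to 'S'
        return U'S';
    return ch;
}

// Toy simple case folding (assumption: covers only ASCII, U+017F and U+212A).
inline char32_t toySimpleCaseFold(char32_t ch)
{
    if (ch >= U'A' && ch <= U'Z')
        return ch + 0x20;
    if (ch == 0x017F) // folds to 's'
        return U's';
    if (ch == 0x212A) // KELVIN SIGN folds to 'k'
        return U'k';
    return ch;
}

// Step 2 (unicode): use simple case folding.
// Step 3 (legacy): use uppercase, but never map a non-ASCII character to an ASCII one.
inline char32_t canonicalize(char32_t ch, bool unicode)
{
    assert(ch <= 0x10FFFF);
    if (unicode)
        return toySimpleCaseFold(ch);
    char32_t upper = toyUpperCase(ch);
    if (ch >= 128 && upper < 128)
        return ch;
    return upper;
}

inline bool canonicallyEqual(char32_t a, char32_t b, bool unicode)
{
    return canonicalize(a, unicode) == canonicalize(b, unicode);
}
```

Under these rules U+017F is equivalent to 's' only in unicode mode, which is why the two
modes need separate tables.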

Added a new Layout/js test that tests the added functionality. Updated other tests that
have minor es6 unicode checks and look for valid flags.

Ran the ChakraCore Unicode regular expression tests as well.

CMakeLists.txt:
JavaScriptCore.vcxproj/JavaScriptCore.vcxproj:
JavaScriptCore.vcxproj/JavaScriptCore.vcxproj.filters:
JavaScriptCore.xcodeproj/project.pbxproj:

inspector/ContentSearchUtilities.cpp:

(Inspector::ContentSearchUtilities::findMagicComment):

yarr/RegularExpression.cpp:

(JSC::Yarr::RegularExpression::Private::compile):
Updated use of pattern().

runtime/CommonIdentifiers.h:
runtime/RegExp.cpp:

(JSC::regExpFlags):
(JSC::RegExpFunctionalTestCollector::outputOneTest):
(JSC::RegExp::finishCreation):
(JSC::RegExp::compile):
(JSC::RegExp::compileMatchOnly):

runtime/RegExp.h:
runtime/RegExpKey.h:
runtime/RegExpPrototype.cpp:

(JSC::regExpProtoFuncCompile):
(JSC::flagsString):
(JSC::regExpProtoGetterMultiline):
(JSC::regExpProtoGetterUnicode):
(JSC::regExpProtoGetterFlags):
Updated for the new 'u' (unicode) flag. Added a check to use the interpreter for unicode regular expressions.

tests/es6.yaml:
tests/stress/static-getter-in-names.js:

Updated tests for new flag and for passing the minimal es6 regular expression processing.

yarr/Yarr.h: Updated the size of information now kept for backtracking.

yarr/YarrCanonicalizeUCS2.cpp: Removed.
yarr/YarrCanonicalizeUCS2.h: Removed.
yarr/YarrCanonicalizeUCS2.js: Removed.
yarr/YarrCanonicalizeUnicode.cpp: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp.
yarr/YarrCanonicalizeUnicode.h: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h.

(JSC::Yarr::canonicalCharacterSetInfo):
(JSC::Yarr::canonicalRangeInfoFor):
(JSC::Yarr::getCanonicalPair):
(JSC::Yarr::isCanonicallyUnique):
(JSC::Yarr::areCanonicallyEquivalent):
(JSC::Yarr::rangeInfoFor): Deleted.

yarr/YarrCanonicalizeUnicode.js: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.js.

(printHeader):
(printFooter):
(hex):
(canonicalize):
(canonicalizeUnicode):
(createUCS2CanonicalGroups):
(createUnicodeCanonicalGroups):
(cu.in.groupedCanonically.characters.sort): Deleted.
(cu.in.groupedCanonically.else): Deleted.
Refactored to output two sets of tables, one for UCS2 and one for Unicode. The UCS2 tables follow
the legacy canonicalization rules now specified in ES 6.0, 21.2.2.8.2 Step 3. The new Unicode
tables follow the rules specified in ES 6.0, 21.2.2.8.2 Step 2. Eliminated the unused Latin1 tables.

yarr/YarrInterpreter.cpp:

(JSC::Yarr::Interpreter::InputStream::InputStream):
(JSC::Yarr::Interpreter::InputStream::readChecked):
(JSC::Yarr::Interpreter::InputStream::readSurrogatePairChecked):
(JSC::Yarr::Interpreter::InputStream::reread):
(JSC::Yarr::Interpreter::InputStream::prev):
(JSC::Yarr::Interpreter::testCharacterClass):
(JSC::Yarr::Interpreter::checkCharacter):
(JSC::Yarr::Interpreter::checkSurrogatePair):
(JSC::Yarr::Interpreter::checkCasedCharacter):
(JSC::Yarr::Interpreter::tryConsumeBackReference):
(JSC::Yarr::Interpreter::backtrackPatternCharacter):
(JSC::Yarr::Interpreter::matchCharacterClass):
(JSC::Yarr::Interpreter::backtrackCharacterClass):
(JSC::Yarr::Interpreter::matchParenthesesTerminalEnd):
(JSC::Yarr::Interpreter::matchDisjunction):
(JSC::Yarr::Interpreter::Interpreter):
(JSC::Yarr::ByteCompiler::assertionWordBoundary):
(JSC::Yarr::ByteCompiler::atomPatternCharacter):

yarr/YarrInterpreter.h:

(JSC::Yarr::ByteTerm::ByteTerm):
(JSC::Yarr::BytecodePattern::BytecodePattern):

yarr/YarrJIT.cpp:

(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClassRange):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::notAtEndOfInput):
(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
(JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):

yarr/YarrParser.h:

(JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::Parser):
(JSC::Yarr::Parser::parseEscape):
(JSC::Yarr::Parser::consumePossibleSurrogatePair):
(JSC::Yarr::Parser::parseCharacterClass):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::parse):
(JSC::Yarr::Parser::atEndOfPattern):
(JSC::Yarr::Parser::patternRemaining):
(JSC::Yarr::Parser::peek):
(JSC::Yarr::parse):

yarr/YarrPattern.cpp:

(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::append):
(JSC::Yarr::CharacterClassConstructor::putChar):
(JSC::Yarr::CharacterClassConstructor::putUnicodeIgnoreCase):
(JSC::Yarr::CharacterClassConstructor::putRange):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::YarrPatternConstructor::YarrPatternConstructor):
(JSC::Yarr::YarrPatternConstructor::assertionWordBoundary):
(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassBegin):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassAtom):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassRange):
(JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
(JSC::Yarr::YarrPattern::compile):
(JSC::Yarr::YarrPattern::YarrPattern):

yarr/YarrPattern.h:

(JSC::Yarr::CharacterRange::CharacterRange):
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::PatternTerm::PatternTerm):
(JSC::Yarr::YarrPattern::reset):

yarr/YarrSyntaxChecker.cpp:

(JSC::Yarr::SyntaxChecker::assertionBOL):
(JSC::Yarr::SyntaxChecker::assertionEOL):
(JSC::Yarr::SyntaxChecker::assertionWordBoundary):
(JSC::Yarr::SyntaxChecker::atomPatternCharacter):
(JSC::Yarr::SyntaxChecker::atomBuiltInCharacterClass):
(JSC::Yarr::SyntaxChecker::atomCharacterClassBegin):
(JSC::Yarr::SyntaxChecker::atomCharacterClassAtom):
(JSC::Yarr::checkSyntax):
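
The Yarr.h entry above notes that more information is now kept for backtracking. A
minimal sketch of why a start position helps: once a character can span one or two
16-bit units, giving back one matched character can no longer be done by stepping the
cursor back by one. The names below (BackTrackInfo, characterWidth, matchGreedy,
backtrackGreedy) are hypothetical and only illustrate the idea of restarting from the
saved position and re-consuming one character fewer.

```cpp
#include <cassert>
#include <cstddef>

struct BackTrackInfo {
    std::size_t begin;       // input position where the term started matching
    std::size_t matchAmount; // number of characters (not 16-bit units) matched
};

inline bool isLead(char16_t c) { return (c & 0xFC00) == 0xD800; }
inline bool isTrail(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Width, in 16-bit units, of the character starting at pos.
inline std::size_t characterWidth(const char16_t* input, std::size_t length, std::size_t pos)
{
    return (isLead(input[pos]) && pos + 1 < length && isTrail(input[pos + 1])) ? 2 : 1;
}

// Greedily consumes up to maxCount characters and records where matching began.
// Returns the input position after the last consumed character.
inline std::size_t matchGreedy(const char16_t* input, std::size_t length, std::size_t pos, std::size_t maxCount, BackTrackInfo& info)
{
    info.begin = pos;
    info.matchAmount = 0;
    while (info.matchAmount < maxCount && pos < length) {
        pos += characterWidth(input, length, pos);
        ++info.matchAmount;
    }
    return pos;
}

// Gives back one character by restarting at begin and re-consuming
// matchAmount - 1 characters. Returns the new input position.
inline std::size_t backtrackGreedy(const char16_t* input, std::size_t length, BackTrackInfo& info)
{
    assert(info.matchAmount > 0);
    --info.matchAmount;
    std::size_t pos = info.begin;
    for (std::size_t i = 0; i < info.matchAmount; ++i)
        pos += characterWidth(input, length, pos);
    return pos;
}
```

Re-matching from the start costs extra work per backtrack, but it avoids having to decide
whether the unit before the cursor is the second half of a surrogate pair.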

LayoutTests:

Added a new test for the added unicode regular expression processing.

Updated several tests for the 'u' flag changes and the "unicode" property.

js/regexp-unicode-expected.txt: Added.
js/regexp-unicode.html: Added.
js/script-tests/regexp-unicode.js: Added.

New test.

js/Object-getOwnPropertyNames-expected.txt:
js/regexp-flags-expected.txt:
js/script-tests/Object-getOwnPropertyNames.js:
js/script-tests/regexp-flags.js:

(RegExp.prototype.hasOwnProperty):
Updated tests.

File:

1 edited

trunk/Source/JavaScriptCore/yarr/YarrInterpreter.cpp (modified) (25 diffs)

trunk/Source/JavaScriptCore/yarr/YarrInterpreter.cpp

-              r194496
+              r197426
 /*
- * Copyright (C) 2009 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013, 2016 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga ([email protected]), University of Szeged
+ *
 …
 #include "Yarr.h"
-#include "YarrCanonicalizeUCS2.h"
+#include "YarrCanonicalizeUnicode.h"
 #include <wtf/BumpPointerAllocator.h>
 #include <wtf/DataLog.h>
 …
     struct BackTrackInfoPatternCharacter {
+        uintptr_t begin; // Only needed for unicode patterns
         uintptr_t matchAmount;
     };
     struct BackTrackInfoCharacterClass {
+        uintptr_t begin; // Only needed for unicode patterns
         uintptr_t matchAmount;
     };
 …
     class InputStream {
     public:
-        InputStream(const CharType* input, unsigned start, unsigned length)
+        InputStream(const CharType* input, unsigned start, unsigned length, bool decodeSurrogatePairs)
             : input(input)
             , pos(start)
             , length(length)
+            , decodeSurrogatePairs(decodeSurrogatePairs)
+        {
+        }
 …
             unsigned p = pos - negativePositionOffest;
             ASSERT(p < length);
-            return input[p];
+            int result = input[p];
+            if (U16_IS_LEAD(result) && decodeSurrogatePairs && p + 1 < length
+                && U16_IS_TRAIL(input[p + 1])) {
+                if (atEnd())
+                    return -1;
+                result = U16_GET_SUPPLEMENTARY(result, input[p + 1]);
+                next();
+            }
+            return result;
+        }
+        int readSurrogatePairChecked(unsigned negativePositionOffest)
+        {
+            RELEASE_ASSERT(pos >= negativePositionOffest);
+            unsigned p = pos - negativePositionOffest;
+            ASSERT(p < length);
+            if (p + 1 >= length)
+                return -1;
+            int first = input[p];
+            if (U16_IS_LEAD(first) && U16_IS_TRAIL(input[p + 1]))
+                return U16_GET_SUPPLEMENTARY(first, input[p + 1]);
+            return -1;
+        }
 …
+        {
             ASSERT(from < length);
-            return input[from];
+            int result = input[from];
+            if (U16_IS_LEAD(result) && decodeSurrogatePairs && from + 1 < length
+                && U16_IS_TRAIL(input[from + 1])) {
+                result = U16_GET_SUPPLEMENTARY(result, input[from + 1]);
+            }
+            return result;
+        }
 …
         unsigned pos;
         unsigned length;
+        bool decodeSurrogatePairs;
     };
     bool testCharacterClass(CharacterClass* characterClass, int ch)
+    {
-        if (ch & 0xFF80) {
+        if (ch & 0x1FFF80) {
             for (unsigned i = 0; i < characterClass->m_matchesUnicode.size(); ++i)
                 if (ch == characterClass->m_matchesUnicode[i])
 …
+    }
+    bool checkSurrogatePair(int testUnicodeChar, unsigned negativeInputOffset)
+    {
+        return testUnicodeChar == input.readSurrogatePairChecked(negativeInputOffset);
+    }
     bool checkCasedCharacter(int loChar, int hiChar, unsigned negativeInputOffset)
+    {
 …
             return false;
+        if (pattern->m_ignoreCase) {
+            for (unsigned i = 0; i < matchSize; ++i) {
+                int oldCh = input.reread(matchBegin + i);
+                int ch = input.readChecked(negativeInputOffset + matchSize - i);
+                if (oldCh == ch)
+                    continue;
+                // The definition for canonicalize (see ES 5.1, 15.10.2.8) means that
+        for (unsigned i = 0; i < matchSize; ++i) {
+            int oldCh = input.reread(matchBegin + i);
+            int ch;
+            if (!U_IS_BMP(oldCh)) {
+                ch = input.readSurrogatePairChecked(negativeInputOffset + matchSize - i);
+                ++i;
+            } else
+                ch = input.readChecked(negativeInputOffset + matchSize - i);
+            if (oldCh == ch)
+                continue;
+            if (pattern->m_ignoreCase) {
+                // The definition for canonicalize (see ES 6.0, 15.10.2.8) means that
                 // unicode values are never allowed to match against ascii ones.
                 if (isASCII(oldCh) || isASCII(ch)) {
                     if (toASCIIUpper(oldCh) == toASCIIUpper(ch))
                         continue;
-                } else if (areCanonicallyEquivalent(oldCh, ch))
+                } else if (areCanonicallyEquivalent(oldCh, ch, unicode ? CanonicalMode::Unicode : CanonicalMode::UCS2))
                     continue;
+                input.uncheckInput(matchSize);
+                return false;
+            }
+        } else {
+            for (unsigned i = 0; i < matchSize; ++i) {
+                if (!checkCharacter(input.reread(matchBegin + i), negativeInputOffset + matchSize - i)) {
+                    input.uncheckInput(matchSize);
+                    return false;
+                }
+            }
+            }
+            input.uncheckInput(matchSize);
+            return false;
+        }
 …
             if (backTrack->matchAmount) {
                 --backTrack->matchAmount;
-                input.uncheckInput(1);
+                if (unicode && !U_IS_BMP(term.atom.patternCharacter))
+                    input.uncheckInput(2);
+                else
+                    input.uncheckInput(1);
                 return true;
+            }
 …
                     return true;
+            }
             input.uncheckInput(backTrack->matchAmount);
+            input.setPos(backTrack->begin);
             break;
+        }
 …
+    {
         ASSERT(term.type == ByteTerm::TypeCharacterClass);
-        BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + term.frameLocation);
+        BackTrackInfoCharacterClass* backTrack = reinterpret_cast<BackTrackInfoCharacterClass*>(context->frame + term.frameLocation);
         switch (term.atom.quantityType) {
         case QuantifierFixedCount: {
+            if (unicode) {
+                backTrack->begin = input.getPos();
+                unsigned matchAmount = 0;
+                for (matchAmount = 0; matchAmount < term.atom.quantityCount; ++matchAmount) {
+                    if (!checkCharacterClass(term.atom.characterClass, term.invert(), term.inputPosition - matchAmount)) {
+                        input.setPos(backTrack->begin);
+                        return false;
+                    }
+                }
+                return true;
+            }
             for (unsigned matchAmount = 0; matchAmount < term.atom.quantityCount; ++matchAmount) {
                 if (!checkCharacterClass(term.atom.characterClass, term.invert(), term.inputPosition - matchAmount))
 …
         case QuantifierGreedy: {
+            backTrack->begin = input.getPos();
             unsigned matchAmount = 0;
             while ((matchAmount < term.atom.quantityCount) && input.checkInput(1)) {
 …
         case QuantifierNonGreedy:
+            backTrack->begin = input.getPos();
             backTrack->matchAmount = 0;
             return true;
 …
+    {
         ASSERT(term.type == ByteTerm::TypeCharacterClass);
-        BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + term.frameLocation);
+        BackTrackInfoCharacterClass* backTrack = reinterpret_cast<BackTrackInfoCharacterClass*>(context->frame + term.frameLocation);
         switch (term.atom.quantityType) {
         case QuantifierFixedCount:
+            if (unicode)
+                input.setPos(backTrack->begin);
             break;
         case QuantifierGreedy:
             if (backTrack->matchAmount) {
+                if (unicode) {
+                    // Rematch one less match
+                    input.setPos(backTrack->begin);
+                    --backTrack->matchAmount;
+                    for (unsigned matchAmount = 0; (matchAmount < backTrack->matchAmount) && input.checkInput(1); ++matchAmount) {
+                        if (!checkCharacterClass(term.atom.characterClass, term.invert(), term.inputPosition + 1)) {
+                            input.uncheckInput(1);
+                            break;
+                        }
+                    }
+                    return true;
+                }
                 --backTrack->matchAmount;
                 input.uncheckInput(1);
 …
                     return true;
+            }
             input.uncheckInput(backTrack->matchAmount);
+            input.setPos(backTrack->begin);
             break;
+        }
 …
             return false;
-        // Successful match! Okay, what's next? - loop around and try to match moar!
+        // Successful match! Okay, what's next? - loop around and try to match more!
         context->term -= (term.atom.parenthesesWidth + 1);
         return true;
 …
         case ByteTerm::TypePatternCharacterOnce:
         case ByteTerm::TypePatternCharacterFixed: {
+            if (unicode) {
+                if (!U_IS_BMP(currentTerm().atom.patternCharacter)) {
+                    for (unsigned matchAmount = 0; matchAmount < currentTerm().atom.quantityCount; ++matchAmount) {
+                        if (!checkSurrogatePair(currentTerm().atom.patternCharacter, currentTerm().inputPosition - matchAmount)) {
+                            BACKTRACK();
+                        }
+                    }
+                    MATCH_NEXT();
+                }
+            }
+            unsigned position = input.getPos(); // May need to back out reading a surrogate pair.
             for (unsigned matchAmount = 0; matchAmount < currentTerm().atom.quantityCount; ++matchAmount) {
-                if (!checkCharacter(currentTerm().atom.patternCharacter, currentTerm().inputPosition - matchAmount))
+                if (!checkCharacter(currentTerm().atom.patternCharacter, currentTerm().inputPosition - matchAmount)) {
+                    input.setPos(position);
                     BACKTRACK();
+                }
+            }
             MATCH_NEXT();
 …
         case ByteTerm::TypePatternCharacterNonGreedy: {
             BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + currentTerm().frameLocation);
+            backTrack->begin = input.getPos();
             backTrack->matchAmount = 0;
             MATCH_NEXT();
 …
         case ByteTerm::TypePatternCasedCharacterOnce:
         case ByteTerm::TypePatternCasedCharacterFixed: {
+            if (unicode) {
+                // Case insensitive matching of unicode charaters are handled as TypeCharacterClass
+                ASSERT(U_IS_BMP(currentTerm().atom.patternCharacter));
+                unsigned position = input.getPos(); // May need to back out reading a surrogate pair.
+                for (unsigned matchAmount = 0; matchAmount < currentTerm().atom.quantityCount; ++matchAmount) {
+                    if (!checkCasedCharacter(currentTerm().atom.casedCharacter.lo, currentTerm().atom.casedCharacter.hi, currentTerm().inputPosition - matchAmount)) {
+                        input.setPos(position);
+                        BACKTRACK();
+                    }
+                }
+                MATCH_NEXT();
+            }
             for (unsigned matchAmount = 0; matchAmount < currentTerm().atom.quantityCount; ++matchAmount) {
                 if (!checkCasedCharacter(currentTerm().atom.casedCharacter.lo, currentTerm().atom.casedCharacter.hi, currentTerm().inputPosition - matchAmount))
 …
         case ByteTerm::TypePatternCasedCharacterGreedy: {
             BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + currentTerm().frameLocation);
+            // Case insensitive matching of unicode charaters are handled as TypeCharacterClass
+            ASSERT(!unicode || U_IS_BMP(currentTerm().atom.patternCharacter));
             unsigned matchAmount = 0;
             while ((matchAmount < currentTerm().atom.quantityCount) && input.checkInput(1)) {
 …
         case ByteTerm::TypePatternCasedCharacterNonGreedy: {
             BackTrackInfoPatternCharacter* backTrack = reinterpret_cast<BackTrackInfoPatternCharacter*>(context->frame + currentTerm().frameLocation);
+            // Case insensitive matching of unicode charaters are handled as TypeCharacterClass
+            ASSERT(!unicode || U_IS_BMP(currentTerm().atom.patternCharacter));
             backTrack->matchAmount = 0;
             MATCH_NEXT();
 …
     Interpreter(BytecodePattern* pattern, unsigned* output, const CharType* input, unsigned length, unsigned start)
         : pattern(pattern)
+        , unicode(pattern->m_unicode)
         , output(output)
-        , input(input, start, length)
+        , input(input, start, length, pattern->m_unicode)
         , allocatorPool(0)
         , remainingMatchCount(matchLimit)
 …
 private:
     BytecodePattern* pattern;
+    bool unicode;
     unsigned* output;
     InputStream input;
 …
+    }
-    void atomPatternCharacter(UChar ch, unsigned inputPosition, unsigned frameLocation, Checked<unsigned> quantityCount, QuantifierType quantityType)
+    void atomPatternCharacter(UChar32 ch, unsigned inputPosition, unsigned frameLocation, Checked<unsigned> quantityCount, QuantifierType quantityType)
+    {
         if (m_pattern.m_ignoreCase) {
-            ASSERT(u_tolower(ch) <= 0xFFFF);
-            ASSERT(u_toupper(ch) <= 0xFFFF);
-            UChar lo = u_tolower(ch);
-            UChar hi = u_toupper(ch);
+            ASSERT(u_tolower(ch) <= UCHAR_MAX_VALUE);
+            ASSERT(u_toupper(ch) <= UCHAR_MAX_VALUE);
+            UChar32 lo = u_tolower(ch);
+            UChar32 hi = u_toupper(ch);
             if (lo != hi) {

Changeset 197426 in webkit for trunk/Source/JavaScriptCore/yarr/YarrInterpreter.cpp