Timestamp:
Mar 1, 2016, 4:39:01 PM (9 years ago)
Author:
[email protected]
Message:

[ES6] Add support for Unicode regular expressions
https://bugs.webkit.org/show_bug.cgi?id=154842

Reviewed by Filip Pizlo.

Source/JavaScriptCore:

Added processing of Unicode regular expressions to the Yarr interpreter.

Changed parsing of regular expression patterns and PatternTerms to process characters as
UChar32 in the Yarr code. The parser converts matched surrogate pairs into the appropriate
Unicode character when the expression is parsed. When matching a Unicode expression and
reading source characters, we convert a proper surrogate pair into a Unicode character and
advance the source cursor, "pos", one more position. The exception is when we are
generating a fixed character atom and know that it must match a Unicode character
that doesn't fit in 16 bits. The code calls this an extendedUnicodeCharacter and has a
helper to determine this.
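
As a rough illustration of the surrogate pair handling described above, the following is a minimal standalone sketch, not the Yarr code itself. The names readCodePoint and needsSurrogatePair are hypothetical, and the input is assumed to be a UTF-16 string with a valid starting position.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only; these are not Yarr functions.
constexpr bool isLeadSurrogate(char16_t c) { return (c & 0xFC00) == 0xD800; }
constexpr bool isTrailSurrogate(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Reads one code point from UTF-16 input at `pos` and advances `pos`.
// A proper lead/trail pair is combined and consumes two code units;
// anything else (including a lone surrogate) consumes one code unit.
// Precondition (assumed): pos < input.size().
inline char32_t readCodePoint(const std::u16string& input, std::size_t& pos)
{
    char16_t first = input[pos++];
    if (isLeadSurrogate(first) && pos < input.size() && isTrailSurrogate(input[pos])) {
        char16_t second = input[pos++];
        return 0x10000 + ((static_cast<char32_t>(first) - 0xD800) << 10) + (second - 0xDC00);
    }
    return first;
}

// True when a code point does not fit in 16 bits, i.e. it needs a
// surrogate pair in UTF-16 (what the message calls an extended character).
constexpr bool needsSurrogatePair(char32_t c) { return c > 0xFFFF; }
```

Under these assumptions, reading U+1F600 from UTF-16 input consumes two code units, while a lone lead surrogate is returned as-is and consumes one.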

Added 'u' flag and 'unicode' identifier to regular expression classes. Added an "isUnicode"
parameter to YarrPattern pattern() and internal users of that function.
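
To make the flag handling concrete, here is a small hedged sketch of parsing a flags string that includes 'u'. The enum, its values, and parseFlags are hypothetical names for illustration; they are not the JSC::regExpFlags implementation, and only the flags g, i, m and u are modeled.

```cpp
#include <string>

// Hypothetical flag bits; names and values are illustrative only.
enum RegExpFlagBits {
    FlagGlobal = 1 << 0,
    FlagIgnoreCase = 1 << 1,
    FlagMultiline = 1 << 2,
    FlagUnicode = 1 << 3,
};

// Parses a flags string such as "giu" into a bit mask.
// Returns -1 for an unknown or repeated flag, which a real engine
// would report as a SyntaxError.
inline int parseFlags(const std::string& flags)
{
    int result = 0;
    for (char c : flags) {
        int bit = 0;
        switch (c) {
        case 'g': bit = FlagGlobal; break;
        case 'i': bit = FlagIgnoreCase; break;
        case 'm': bit = FlagMultiline; break;
        case 'u': bit = FlagUnicode; break;
        default: return -1; // unknown flag
        }
        if (result & bit)
            return -1; // repeated flag
        result |= bit;
    }
    return result;
}
```

A caller could then derive an isUnicode boolean from the mask (result & FlagUnicode) and pass it along, in the spirit of the "isUnicode" parameter mentioned above.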

Updated the generation of the canonicalization tables to include a new set of tables that
follow the ES 6.0, 21.2.2.8.2 Step 2. Renamed the YarrCanonicalizeUCS2.* files to
YarrCanonicalizeUnicode.*.

Added a new LayoutTests/js test that tests the added functionality. Updated other tests that
have minor ES6 Unicode checks and look for valid flags.

Ran the ChakraCore Unicode regular expression tests as well.

  • inspector/ContentSearchUtilities.cpp:

(Inspector::ContentSearchUtilities::findMagicComment):

  • yarr/RegularExpression.cpp:

(JSC::Yarr::RegularExpression::Private::compile):
Updated use of pattern().

  • runtime/CommonIdentifiers.h:
  • runtime/RegExp.cpp:

(JSC::regExpFlags):
(JSC::RegExpFunctionalTestCollector::outputOneTest):
(JSC::RegExp::finishCreation):
(JSC::RegExp::compile):
(JSC::RegExp::compileMatchOnly):

  • runtime/RegExp.h:
  • runtime/RegExpKey.h:
  • runtime/RegExpPrototype.cpp:

(JSC::regExpProtoFuncCompile):
(JSC::flagsString):
(JSC::regExpProtoGetterMultiline):
(JSC::regExpProtoGetterUnicode):
(JSC::regExpProtoGetterFlags):
Updated for the new 'u' (unicode) flag. Added a check to use the interpreter for Unicode regular expressions.

  • tests/es6.yaml:
  • tests/stress/static-getter-in-names.js:

Updated tests for the new flag and for the minimal ES6 regular expression processing that now passes.

  • yarr/Yarr.h: Updated the size of information now kept for backtracking.
  • yarr/YarrCanonicalizeUCS2.cpp: Removed.
  • yarr/YarrCanonicalizeUCS2.h: Removed.
  • yarr/YarrCanonicalizeUCS2.js: Removed.
  • yarr/YarrCanonicalizeUnicode.cpp: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp.
  • yarr/YarrCanonicalizeUnicode.h: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h.

(JSC::Yarr::canonicalCharacterSetInfo):
(JSC::Yarr::canonicalRangeInfoFor):
(JSC::Yarr::getCanonicalPair):
(JSC::Yarr::isCanonicallyUnique):
(JSC::Yarr::areCanonicallyEquivalent):
(JSC::Yarr::rangeInfoFor): Deleted.

  • yarr/YarrCanonicalizeUnicode.js: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.js.

(printHeader):
(printFooter):
(hex):
(canonicalize):
(canonicalizeUnicode):
(createUCS2CanonicalGroups):
(createUnicodeCanonicalGroups):
(cu.in.groupedCanonically.characters.sort): Deleted.
(cu.in.groupedCanonically.else): Deleted.
Refactored to output two sets of tables, one for UCS2 and one for Unicode. The UCS2 tables follow
the legacy canonicalization rules now specified in ES 6.0, 21.2.2.8.2 Step 3. The new Unicode
tables follow the rules specified in ES 6.0, 21.2.2.8.2 Step 2. Eliminated the unused Latin1 tables.
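
The difference between the two rule sets can be sketched as follows. This is a simplified illustration and not the generated table code: canonicalizeUCS2 takes the full uppercase mapping as a caller-supplied argument (an assumption of this sketch), and canonicalizeUnicode hard-codes only a tiny excerpt of the simple case foldings from CaseFolding.txt.

```cpp
#include <string>

// Legacy rule (ES 6.0, 21.2.2.8.2 Step 3). `upper` is the full uppercase
// mapping of `ch` as UTF-16, supplied by the caller; a real engine would
// consult Unicode data or generated tables instead.
inline char16_t canonicalizeUCS2(char16_t ch, const std::u16string& upper)
{
    if (upper.size() != 1)
        return ch; // uppercase form is not a single code unit
    char16_t cu = upper[0];
    if (ch >= 128 && cu < 128)
        return ch; // a non-ASCII character never maps into ASCII
    return cu;
}

// Unicode rule (Step 2): simple case folding. Only a few entries are
// modeled here; everything else is returned unchanged.
inline char32_t canonicalizeUnicode(char32_t ch)
{
    switch (ch) {
    case 0x017F: return 0x0073; // LATIN SMALL LETTER LONG S folds to 's'
    case 0x212A: return 0x006B; // KELVIN SIGN folds to 'k'
    default:
        if (ch >= U'A' && ch <= U'Z')
            return ch + 0x20; // ASCII uppercase folds to lowercase
        return ch;
    }
}
```

With these definitions, U+017F stays distinct from 's' under the legacy rule but folds to 's' under the Unicode rule, which is why the two modes need separate tables.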

  • yarr/YarrInterpreter.cpp:

(JSC::Yarr::Interpreter::InputStream::InputStream):
(JSC::Yarr::Interpreter::InputStream::readChecked):
(JSC::Yarr::Interpreter::InputStream::readSurrogatePairChecked):
(JSC::Yarr::Interpreter::InputStream::reread):
(JSC::Yarr::Interpreter::InputStream::prev):
(JSC::Yarr::Interpreter::testCharacterClass):
(JSC::Yarr::Interpreter::checkCharacter):
(JSC::Yarr::Interpreter::checkSurrogatePair):
(JSC::Yarr::Interpreter::checkCasedCharacter):
(JSC::Yarr::Interpreter::tryConsumeBackReference):
(JSC::Yarr::Interpreter::backtrackPatternCharacter):
(JSC::Yarr::Interpreter::matchCharacterClass):
(JSC::Yarr::Interpreter::backtrackCharacterClass):
(JSC::Yarr::Interpreter::matchParenthesesTerminalEnd):
(JSC::Yarr::Interpreter::matchDisjunction):
(JSC::Yarr::Interpreter::Interpreter):
(JSC::Yarr::ByteCompiler::assertionWordBoundary):
(JSC::Yarr::ByteCompiler::atomPatternCharacter):

  • yarr/YarrInterpreter.h:

(JSC::Yarr::ByteTerm::ByteTerm):
(JSC::Yarr::BytecodePattern::BytecodePattern):

  • yarr/YarrJIT.cpp:

(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClassRange):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::notAtEndOfInput):
(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
(JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):

  • yarr/YarrParser.h:

(JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::Parser):
(JSC::Yarr::Parser::parseEscape):
(JSC::Yarr::Parser::consumePossibleSurrogatePair):
(JSC::Yarr::Parser::parseCharacterClass):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::parse):
(JSC::Yarr::Parser::atEndOfPattern):
(JSC::Yarr::Parser::patternRemaining):
(JSC::Yarr::Parser::peek):
(JSC::Yarr::parse):

  • yarr/YarrPattern.cpp:

(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::append):
(JSC::Yarr::CharacterClassConstructor::putChar):
(JSC::Yarr::CharacterClassConstructor::putUnicodeIgnoreCase):
(JSC::Yarr::CharacterClassConstructor::putRange):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::YarrPatternConstructor::YarrPatternConstructor):
(JSC::Yarr::YarrPatternConstructor::assertionWordBoundary):
(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassBegin):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassAtom):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassRange):
(JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
(JSC::Yarr::YarrPattern::compile):
(JSC::Yarr::YarrPattern::YarrPattern):

  • yarr/YarrPattern.h:

(JSC::Yarr::CharacterRange::CharacterRange):
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::PatternTerm::PatternTerm):
(JSC::Yarr::YarrPattern::reset):

  • yarr/YarrSyntaxChecker.cpp:

(JSC::Yarr::SyntaxChecker::assertionBOL):
(JSC::Yarr::SyntaxChecker::assertionEOL):
(JSC::Yarr::SyntaxChecker::assertionWordBoundary):
(JSC::Yarr::SyntaxChecker::atomPatternCharacter):
(JSC::Yarr::SyntaxChecker::atomBuiltInCharacterClass):
(JSC::Yarr::SyntaxChecker::atomCharacterClassBegin):
(JSC::Yarr::SyntaxChecker::atomCharacterClassAtom):
(JSC::Yarr::checkSyntax):

LayoutTests:

Added a new test for the added unicode regular expression processing.

Updated several tests for the 'u' flag changes and the "unicode" property.

  • js/regexp-unicode-expected.txt: Added.
  • js/regexp-unicode.html: Added.
  • js/script-tests/regexp-unicode.js: Added.

New test.

  • js/Object-getOwnPropertyNames-expected.txt:
  • js/regexp-flags-expected.txt:
  • js/script-tests/Object-getOwnPropertyNames.js:
  • js/script-tests/regexp-flags.js:

(RegExp.prototype.hasOwnProperty):
Updated tests.

File:
1 edited

  • trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp

--- trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp	(r194496)
+++ trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp	(r197426)
@@ -1,4 +1,4 @@
 /*
- * Copyright (C) 2009, 2013 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013-2016 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga ([email protected]), University of Szeged
  *
@@ -29,5 +29,5 @@
 
 #include "Yarr.h"
-#include "YarrCanonicalizeUCS2.h"
+#include "YarrCanonicalizeUnicode.h"
 #include "YarrParser.h"
 #include <wtf/Vector.h>
@@ -41,6 +41,7 @@
 class CharacterClassConstructor {
 public:
-    CharacterClassConstructor(bool isCaseInsensitive = false)
+    CharacterClassConstructor(bool isCaseInsensitive, CanonicalMode canonicalMode)
         : m_isCaseInsensitive(isCaseInsensitive)
+        , m_canonicalMode(canonicalMode)
     {
     }
@@ -66,5 +67,5 @@
     }
 
-    void putChar(UChar ch)
+    void putChar(UChar32 ch)
     {
         // Handle ascii cases.
@@ -85,5 +86,5 @@
 
         // Add multiple matches, if necessary.
-        const UCS2CanonicalizationRange* info = rangeInfoFor(ch);
+        const CanonicalizationRange* info = canonicalRangeInfoFor(ch, m_canonicalMode);
         if (info->type == CanonicalizeUnique)
             addSorted(m_matchesUnicode, ch);
@@ -92,24 +93,23 @@
     }
 
-    void putUnicodeIgnoreCase(UChar ch, const UCS2CanonicalizationRange* info)
+    void putUnicodeIgnoreCase(UChar32 ch, const CanonicalizationRange* info)
     {
         ASSERT(m_isCaseInsensitive);
-        ASSERT(ch > 0x7f);
         ASSERT(ch >= info->begin && ch <= info->end);
         ASSERT(info->type != CanonicalizeUnique);
         if (info->type == CanonicalizeSet) {
-            for (const uint16_t* set = characterSetInfo[info->value]; (ch = *set); ++set)
-                addSorted(m_matchesUnicode, ch);
+            for (const UChar32* set = canonicalCharacterSetInfo(info->value, m_canonicalMode); (ch = *set); ++set)
+                addSorted(ch);
         } else {
-            addSorted(m_matchesUnicode, ch);
-            addSorted(m_matchesUnicode, getCanonicalPair(info, ch));
-        }
-    }
-
-    void putRange(UChar lo, UChar hi)
+            addSorted(ch);
+            addSorted(getCanonicalPair(info, ch));
+        }
+    }
+
+    void putRange(UChar32 lo, UChar32 hi)
     {
         if (lo <= 0x7f) {
             char asciiLo = lo;
-            char asciiHi = std::min(hi, (UChar)0x7f);
+            char asciiHi = std::min(hi, (UChar32)0x7f);
             addSortedRange(m_ranges, lo, asciiHi);
 
@@ -124,5 +124,5 @@
             return;
 
-        lo = std::max(lo, (UChar)0x80);
+        lo = std::max(lo, (UChar32)0x80);
         addSortedRange(m_rangesUnicode, lo, hi);
 
@@ -130,8 +130,8 @@
             return;
 
-        const UCS2CanonicalizationRange* info = rangeInfoFor(lo);
+        const CanonicalizationRange* info = canonicalRangeInfoFor(lo, m_canonicalMode);
         while (true) {
             // Handle the range [lo .. end]
-            UChar end = std::min<UChar>(info->end, hi);
+            UChar32 end = std::min<UChar32>(info->end, hi);
 
             switch (info->type) {
@@ -141,5 +141,5 @@
             case CanonicalizeSet: {
                 UChar ch;
-                for (const uint16_t* set = characterSetInfo[info->value]; (ch = *set); ++set)
+                for (const UChar32* set = canonicalCharacterSetInfo(info->value, m_canonicalMode); (ch = *set); ++set)
                     addSorted(m_matchesUnicode, ch);
                 break;
@@ -189,5 +189,10 @@
 
 private:
-    void addSorted(Vector<UChar>& matches, UChar ch)
+    void addSorted(UChar32 ch)
+    {
+        addSorted(ch <= 0x7f ? m_matches : m_matchesUnicode, ch);
+    }
+
+    void addSorted(Vector<UChar32>& matches, UChar32 ch)
     {
         unsigned pos = 0;
@@ -215,5 +220,5 @@
     }
 
-    void addSortedRange(Vector<CharacterRange>& ranges, UChar lo, UChar hi)
+    void addSortedRange(Vector<CharacterRange>& ranges, UChar32 lo, UChar32 hi)
     {
         unsigned end = ranges.size();
@@ -261,8 +266,9 @@
 
     bool m_isCaseInsensitive;
-
-    Vector<UChar> m_matches;
+    CanonicalMode m_canonicalMode;
+
+    Vector<UChar32> m_matches;
     Vector<CharacterRange> m_ranges;
-    Vector<UChar> m_matchesUnicode;
+    Vector<UChar32> m_matchesUnicode;
     Vector<CharacterRange> m_rangesUnicode;
 };
@@ -272,5 +278,5 @@
     YarrPatternConstructor(YarrPattern& pattern)
         : m_pattern(pattern)
-        , m_characterClassConstructor(pattern.m_ignoreCase)
+        , m_characterClassConstructor(pattern.m_ignoreCase, pattern.m_unicode ? CanonicalMode::Unicode : CanonicalMode::UCS2)
         , m_invertParentheticalAssertion(false)
     {
@@ -314,14 +320,14 @@
     }
 
-    void atomPatternCharacter(UChar ch)
+    void atomPatternCharacter(UChar32 ch)
     {
         // We handle case-insensitive checking of unicode characters which do have both
         // cases by handling them as if they were defined using a CharacterClass.
-        if (!m_pattern.m_ignoreCase || isASCII(ch)) {
+        if (!m_pattern.m_ignoreCase || (isASCII(ch) && !m_pattern.m_unicode)) {
             m_alternative->m_terms.append(PatternTerm(ch));
             return;
         }
 
-        const UCS2CanonicalizationRange* info = rangeInfoFor(ch);
+        const CanonicalizationRange* info = canonicalRangeInfoFor(ch, m_pattern.m_unicode ? CanonicalMode::Unicode : CanonicalMode::UCS2);
         if (info->type == CanonicalizeUnique) {
             m_alternative->m_terms.append(PatternTerm(ch));
@@ -358,10 +364,10 @@
     }
 
-    void atomCharacterClassAtom(UChar ch)
+    void atomCharacterClassAtom(UChar32 ch)
     {
         m_characterClassConstructor.putChar(ch);
     }
 
-    void atomCharacterClassRange(UChar begin, UChar end)
+    void atomCharacterClassRange(UChar32 begin, UChar32 end)
     {
         m_characterClassConstructor.putRange(begin, end);
@@ -597,4 +603,6 @@
                     currentCallFrameSize += YarrStackSpaceForBackTrackInfoPatternCharacter;
                     alternative->m_hasFixedSize = false;
+                } else if (m_pattern.m_unicode) {
+                    currentInputPosition += (!U_IS_BMP(term.patternCharacter) ? 2 : 1) * term.quantityCount;
                 } else
                     currentInputPosition += term.quantityCount;
@@ -606,4 +614,9 @@
                     term.frameLocation = currentCallFrameSize;
                     currentCallFrameSize += YarrStackSpaceForBackTrackInfoCharacterClass;
+                    alternative->m_hasFixedSize = false;
+                } else if (m_pattern.m_unicode) {
+                    term.frameLocation = currentCallFrameSize;
+                    currentCallFrameSize += YarrStackSpaceForBackTrackInfoCharacterClass;
+                    currentInputPosition += term.quantityCount;
                     alternative->m_hasFixedSize = false;
                 } else
@@ -833,5 +846,5 @@
     YarrPatternConstructor constructor(*this);
 
-    if (const char* error = parse(constructor, patternString))
+    if (const char* error = parse(constructor, patternString, m_unicode))
         return error;
 
@@ -847,5 +860,5 @@
         const char* error =
 #endif
-            parse(constructor, patternString, numSubpatterns);
+            parse(constructor, patternString, m_unicode, numSubpatterns);
 
         ASSERT(!error);
@@ -862,7 +875,8 @@
 }
 
-YarrPattern::YarrPattern(const String& pattern, bool ignoreCase, bool multiline, const char** error)
+YarrPattern::YarrPattern(const String& pattern, bool ignoreCase, bool multiline, bool unicode, const char** error)
     : m_ignoreCase(ignoreCase)
     , m_multiline(multiline)
+    , m_unicode(unicode)
     , m_containsBackreferences(false)
     , m_containsBOL(false)