Timestamp:
Mar 1, 2016, 4:39:01 PM
Author:
[email protected]
Message:

[ES6] Add support for Unicode regular expressions
https://bugs.webkit.org/show_bug.cgi?id=154842

Reviewed by Filip Pizlo.

Source/JavaScriptCore:

Added processing of Unicode regular expressions to the Yarr interpreter.

Changed parsing of regular expression patterns and PatternTerms to process characters as
UChar32 in the Yarr code. The parser converts matched surrogate pairs into the appropriate
Unicode character when the expression is parsed. When matching a Unicode expression and
reading source characters, we convert a proper surrogate pair into a Unicode character and
advance the source cursor, "pos", one more position. The exception is when we know, while
generating a fixed character atom, that we need to match a Unicode character that doesn't
fit in 16 bits. The code calls this an extendedUnicodeCharacter and has a helper to
determine this.
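
As a rough illustration of the reading step described above, the following sketch combines a lead/trail surrogate pair into one code point and advances the cursor by two code units. It is not the Yarr code: the names isLeadSurrogate, isTrailSurrogate, readCodePoint and needsSurrogatePair are hypothetical, and std::u16string stands in for the real input stream.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; not the actual Yarr implementation.
constexpr bool isLeadSurrogate(char16_t c) { return (c & 0xFC00) == 0xD800; }
constexpr bool isTrailSurrogate(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Reads one code point starting at `pos` and advances `pos` past it.
// A lead surrogate followed by a trail surrogate is combined and consumes two
// code units; anything else, including a lone surrogate, consumes one.
char32_t readCodePoint(const std::u16string& input, std::size_t& pos)
{
    assert(pos < input.size());
    char16_t first = input[pos++];
    if (isLeadSurrogate(first) && pos < input.size() && isTrailSurrogate(input[pos])) {
        char16_t second = input[pos++];
        return 0x10000
            + ((static_cast<char32_t>(first) - 0xD800) << 10)
            + (static_cast<char32_t>(second) - 0xDC00);
    }
    return first;
}

// True when a code point does not fit in 16 bits, i.e. it needs two UTF-16
// code units. This mirrors the "extended" case mentioned above.
constexpr bool needsSurrogatePair(char32_t c) { return c > 0xFFFF; }
```

A lone surrogate is returned unchanged here, which matches the "proper surrogate pair" wording above; how the real interpreter treats every edge case is not shown in this changeset view.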

Added 'u' flag and 'unicode' identifier to regular expression classes. Added an "isUnicode"
parameter to YarrPattern pattern() and internal users of that function.

Updated the generation of the canonicalization tables to include a new set of tables that
follow the ES 6.0, 21.2.2.8.2 Step 2. Renamed the YarrCanonicalizeUCS2.* files to
YarrCanonicalizeUnicode.*.
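
To make the difference between the two rule sets concrete: in ES 6.0, 21.2.2.8.2, Step 2 (Unicode mode) uses simple case folding, while Step 3 (legacy mode) uses the uppercase mapping and refuses to map a non-ASCII character to an ASCII one. The sketch below hard-codes a handful of code points to show the effect. The functions canonicalizeLegacy and canonicalizeUnicode are hypothetical stand-ins, not the generated tables, and they only cover ASCII letters plus U+017F and U+212A.

```cpp
// Hand-written stand-ins for the generated tables; only ASCII letters,
// U+017F (LATIN SMALL LETTER LONG S) and U+212A (KELVIN SIGN) are handled.

// Legacy rule (Step 3): map to uppercase, but never map a non-ASCII
// character to an ASCII one.
char32_t canonicalizeLegacy(char32_t ch)
{
    if (ch >= U'a' && ch <= U'z')
        return ch - 0x20;
    // U+017F uppercases to 'S', which is ASCII, so it is left unchanged.
    // U+212A is already uppercase, so it is left unchanged as well.
    return ch;
}

// Unicode rule (Step 2): simple case folding.
char32_t canonicalizeUnicode(char32_t ch)
{
    if (ch >= U'A' && ch <= U'Z')
        return ch + 0x20;
    if (ch == 0x017F)
        return U's'; // long s folds to 's'
    if (ch == 0x212A)
        return U'k'; // Kelvin sign folds to 'k'
    return ch;
}

// Two characters match case-insensitively when they canonicalize to the same value.
bool matchesIgnoringCase(char32_t a, char32_t b, bool unicode)
{
    return unicode
        ? canonicalizeUnicode(a) == canonicalizeUnicode(b)
        : canonicalizeLegacy(a) == canonicalizeLegacy(b);
}
```

Under these rules, U+212A matches 'k' only when the unicode flag is set, which is the kind of behavioral difference that calls for a separate set of tables.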

Added a new LayoutTests/js test that tests the added functionality. Updated other tests that
have minor es6 unicode checks and look for valid flags.

Ran the ChakraCore Unicode regular expression tests as well.

  • inspector/ContentSearchUtilities.cpp:

(Inspector::ContentSearchUtilities::findMagicComment):

  • yarr/RegularExpression.cpp:

(JSC::Yarr::RegularExpression::Private::compile):
Updated use of pattern().

  • runtime/CommonIdentifiers.h:
  • runtime/RegExp.cpp:

(JSC::regExpFlags):
(JSC::RegExpFunctionalTestCollector::outputOneTest):
(JSC::RegExp::finishCreation):
(JSC::RegExp::compile):
(JSC::RegExp::compileMatchOnly):

  • runtime/RegExp.h:
  • runtime/RegExpKey.h:
  • runtime/RegExpPrototype.cpp:

(JSC::regExpProtoFuncCompile):
(JSC::flagsString):
(JSC::regExpProtoGetterMultiline):
(JSC::regExpProtoGetterUnicode):
(JSC::regExpProtoGetterFlags):
Updated for new 'u' (unicode) flag. Added check to use the interpreter for unicode regular expressions.

  • tests/es6.yaml:
  • tests/stress/static-getter-in-names.js:

Updated tests for new flag and for passing the minimal es6 regular expression processing.

  • yarr/Yarr.h: Updated the size of information now kept for backtracking.
  • yarr/YarrCanonicalizeUCS2.cpp: Removed.
  • yarr/YarrCanonicalizeUCS2.h: Removed.
  • yarr/YarrCanonicalizeUCS2.js: Removed.
  • yarr/YarrCanonicalizeUnicode.cpp: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.cpp.
  • yarr/YarrCanonicalizeUnicode.h: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.h.

(JSC::Yarr::canonicalCharacterSetInfo):
(JSC::Yarr::canonicalRangeInfoFor):
(JSC::Yarr::getCanonicalPair):
(JSC::Yarr::isCanonicallyUnique):
(JSC::Yarr::areCanonicallyEquivalent):
(JSC::Yarr::rangeInfoFor): Deleted.

  • yarr/YarrCanonicalizeUnicode.js: Copied from Source/JavaScriptCore/yarr/YarrCanonicalizeUCS2.js.

(printHeader):
(printFooter):
(hex):
(canonicalize):
(canonicalizeUnicode):
(createUCS2CanonicalGroups):
(createUnicodeCanonicalGroups):
(cu.in.groupedCanonically.characters.sort): Deleted.
(cu.in.groupedCanonically.else): Deleted.
Refactored to output two sets of tables, one for UCS2 and one for Unicode. The UCS2 tables follow
the legacy canonicalization rules now specified in ES 6.0, 21.2.2.8.2 Step 3. The new Unicode
tables follow the rules specified in ES 6.0, 21.2.2.8.2 Step 2. Eliminated the unused Latin1 tables.

  • yarr/YarrInterpreter.cpp:

(JSC::Yarr::Interpreter::InputStream::InputStream):
(JSC::Yarr::Interpreter::InputStream::readChecked):
(JSC::Yarr::Interpreter::InputStream::readSurrogatePairChecked):
(JSC::Yarr::Interpreter::InputStream::reread):
(JSC::Yarr::Interpreter::InputStream::prev):
(JSC::Yarr::Interpreter::testCharacterClass):
(JSC::Yarr::Interpreter::checkCharacter):
(JSC::Yarr::Interpreter::checkSurrogatePair):
(JSC::Yarr::Interpreter::checkCasedCharacter):
(JSC::Yarr::Interpreter::tryConsumeBackReference):
(JSC::Yarr::Interpreter::backtrackPatternCharacter):
(JSC::Yarr::Interpreter::matchCharacterClass):
(JSC::Yarr::Interpreter::backtrackCharacterClass):
(JSC::Yarr::Interpreter::matchParenthesesTerminalEnd):
(JSC::Yarr::Interpreter::matchDisjunction):
(JSC::Yarr::Interpreter::Interpreter):
(JSC::Yarr::ByteCompiler::assertionWordBoundary):
(JSC::Yarr::ByteCompiler::atomPatternCharacter):

  • yarr/YarrInterpreter.h:

(JSC::Yarr::ByteTerm::ByteTerm):
(JSC::Yarr::BytecodePattern::BytecodePattern):

  • yarr/YarrJIT.cpp:

(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClassRange):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::notAtEndOfInput):
(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
(JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):

  • yarr/YarrParser.h:

(JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::Parser):
(JSC::Yarr::Parser::parseEscape):
(JSC::Yarr::Parser::consumePossibleSurrogatePair):
(JSC::Yarr::Parser::parseCharacterClass):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::parse):
(JSC::Yarr::Parser::atEndOfPattern):
(JSC::Yarr::Parser::patternRemaining):
(JSC::Yarr::Parser::peek):
(JSC::Yarr::parse):

  • yarr/YarrPattern.cpp:

(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::append):
(JSC::Yarr::CharacterClassConstructor::putChar):
(JSC::Yarr::CharacterClassConstructor::putUnicodeIgnoreCase):
(JSC::Yarr::CharacterClassConstructor::putRange):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::YarrPatternConstructor::YarrPatternConstructor):
(JSC::Yarr::YarrPatternConstructor::assertionWordBoundary):
(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassBegin):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassAtom):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassRange):
(JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
(JSC::Yarr::YarrPattern::compile):
(JSC::Yarr::YarrPattern::YarrPattern):

  • yarr/YarrPattern.h:

(JSC::Yarr::CharacterRange::CharacterRange):
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::PatternTerm::PatternTerm):
(JSC::Yarr::YarrPattern::reset):

  • yarr/YarrSyntaxChecker.cpp:

(JSC::Yarr::SyntaxChecker::assertionBOL):
(JSC::Yarr::SyntaxChecker::assertionEOL):
(JSC::Yarr::SyntaxChecker::assertionWordBoundary):
(JSC::Yarr::SyntaxChecker::atomPatternCharacter):
(JSC::Yarr::SyntaxChecker::atomBuiltInCharacterClass):
(JSC::Yarr::SyntaxChecker::atomCharacterClassBegin):
(JSC::Yarr::SyntaxChecker::atomCharacterClassAtom):
(JSC::Yarr::checkSyntax):

LayoutTests:

Added a new test for the added unicode regular expression processing.

Updated several tests for the 'u' flag changes and "unicode" property.

  • js/regexp-unicode-expected.txt: Added.
  • js/regexp-unicode.html: Added.
  • js/script-tests/regexp-unicode.js: Added.

New test.

  • js/Object-getOwnPropertyNames-expected.txt:
  • js/regexp-flags-expected.txt:
  • js/script-tests/Object-getOwnPropertyNames.js:
  • js/script-tests/regexp-flags.js:

(RegExp.prototype.hasOwnProperty):
Updated tests.

File:
1 edited

  • trunk/Source/JavaScriptCore/runtime/RegExp.cpp

--- trunk/Source/JavaScriptCore/runtime/RegExp.cpp	(r197379)
+++ trunk/Source/JavaScriptCore/runtime/RegExp.cpp	(r197426)
@@ -67,4 +67,10 @@
             break;
 
+        case 'u':
+            if (flags & FlagUnicode)
+                return InvalidFlags;
+            flags = static_cast<RegExpFlags>(flags | FlagUnicode);
+            break;
+
         default:
             return InvalidFlags;
@@ -127,4 +133,6 @@
         if (regExp->multiline())
             fputc('m', m_file);
+        if (regExp->unicode())
+            fputc('u', m_file);
         fprintf(m_file, "\n");
     }
@@ -241,5 +249,5 @@
 {
     Base::finishCreation(vm);
-    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), &m_constructionError);
+    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), unicode(), &m_constructionError);
     if (m_constructionError)
         m_state = ParseError;
@@ -281,5 +289,5 @@
 void RegExp::compile(VM* vm, Yarr::YarrCharSize charSize)
 {
-    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), &m_constructionError);
+    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), unicode(), &m_constructionError);
     if (m_constructionError) {
         RELEASE_ASSERT_NOT_REACHED();
@@ -298,5 +306,5 @@
 
 #if ENABLE(YARR_JIT)
-    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && vm->canUseRegExpJIT()) {
+    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && !unicode() && vm->canUseRegExpJIT()) {
         Yarr::jitCompile(pattern, charSize, vm, m_regExpJITCode);
         if (!m_regExpJITCode.isFallBack()) {
@@ -400,5 +408,5 @@
 void RegExp::compileMatchOnly(VM* vm, Yarr::YarrCharSize charSize)
 {
-    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), &m_constructionError);
+    Yarr::YarrPattern pattern(m_patternString, ignoreCase(), multiline(), unicode(), &m_constructionError);
     if (m_constructionError) {
         RELEASE_ASSERT_NOT_REACHED();
@@ -417,5 +425,5 @@
 
 #if ENABLE(YARR_JIT)
-    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && vm->canUseRegExpJIT()) {
+    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && !unicode() && vm->canUseRegExpJIT()) {
         Yarr::jitCompile(pattern, charSize, vm, m_regExpJITCode, Yarr::MatchOnly);
         if (!m_regExpJITCode.isFallBack()) {
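
The JIT guard changed in RegExp::compile() and RegExp::compileMatchOnly() can be read as a small decision function: a pattern is handed to the JIT only if it has no backreferences, no unsigned-length pattern, is not a unicode pattern, and the VM allows the regexp JIT. The sketch below restates that condition in isolation. PatternInfo, Engine and chooseEngine are hypothetical names introduced for illustration, and the sketch leaves out the ENABLE(YARR_JIT) build guard and the isFallBack() check that follows JIT compilation.

```cpp
// Hypothetical, simplified stand-ins for the state consulted by the guard.
struct PatternInfo {
    bool containsBackreferences;
    bool containsUnsignedLengthPattern;
    bool unicode;
};

enum class Engine { JIT, Interpreter };

// Mirrors the condition in the changed lines: unicode patterns are now
// excluded from the JIT path and therefore run in the interpreter.
Engine chooseEngine(const PatternInfo& pattern, bool canUseRegExpJIT)
{
    if (!pattern.containsBackreferences
        && !pattern.containsUnsignedLengthPattern
        && !pattern.unicode
        && canUseRegExpJIT)
        return Engine::JIT;
    return Engine::Interpreter;
}
```

Keeping unicode patterns out of the JIT is consistent with the commit message, which adds unicode processing to the Yarr interpreter only.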