Ignore:
Timestamp:
Mar 28, 2019, 11:05:55 PM (6 years ago)
Author:
[email protected]
Message:

[YARR] Precompute BMP / non-BMP status when constructing character classes
https://bugs.webkit.org/show_bug.cgi?id=196296

Reviewed by Keith Miller.

Changed CharacterClass::m_hasNonBMPCharacters into a character width bit field which
indicateis if the class includes characters from either BMP, non-BMP or both ranges.
This allows the recognizing code to eliminate checks for the width of a matched
characters when the class has only one width. The character width is needed to
determine if we advance 1 or 2 character. Also, the pre-computed width of character
classes that contains either all BMP or all non-BMP characters allows the parser to
use fixed widths for terms using those character classes. Changed both the code gen
scripts and Yarr compiler to compute this bit field during the construction of
character classes.

For JIT'ed code of character classes that contain either all BMP or all non-BMP
characters, we can eliminate the generic check we were doing do compute how much
to advance after sucessfully matching a character in the class.

Generic isBMP check BMP only non-BMP only
-------------- -------------- --------------
inc %r9d inc %r9d add $0x2, %r9d
cmp $0x10000, %eax
jl isBMP
cmp %edx, %esi
jz atEndOfString
inc %r9d
inc %esi

isBMP:

For character classes that contained non-BMP characters, we were always generating
the code in the left column. The middle column is the code we generate for character
classes that contain only BMP characters. The right column is the code we now
generate if the character class has only non-BMP characters. In the fix width cases,
we can eliminate both the isBMP check as well as the atEndOfString check. The
atEndOfstring check is eliminated since we know how many characters this character
class requires and that check can be factored out to the beginning of the current
alternative. For character classes that contain both BMP and non-BMP characters,
we still generate the generic left column.

This change is a ~8% perf progression on UniPoker and a ~2% improvement on RexBench
as a whole.

  • runtime/RegExp.cpp:

(JSC::RegExp::matchCompareWithInterpreter):

  • runtime/RegExpInlines.h:

(JSC::RegExp::matchInline):

  • yarr/YarrInterpreter.cpp:

(JSC::Yarr::Interpreter::checkCharacterClassDontAdvanceInputForNonBMP):
(JSC::Yarr::Interpreter::matchCharacterClass):

  • yarr/YarrJIT.cpp:

(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::advanceIndexAfterCharacterClassTermMatch):
(JSC::Yarr::YarrGenerator::tryReadUnicodeCharImpl):
(JSC::Yarr::YarrGenerator::generateCharacterClassOnce):
(JSC::Yarr::YarrGenerator::generateCharacterClassFixed):
(JSC::Yarr::YarrGenerator::generateCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::generateEnter):
(JSC::Yarr::YarrGenerator::YarrGenerator):
(JSC::Yarr::YarrGenerator::compile):

  • yarr/YarrPattern.cpp:

(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::reset):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::CharacterClassConstructor::hasNonBMPCharacters):
(JSC::Yarr::CharacterClassConstructor::characterWidths):
(JSC::Yarr::PatternTerm::dump):
(JSC::Yarr::anycharCreate):

  • yarr/YarrPattern.h:

(JSC::Yarr::operator|):
(JSC::Yarr::operator&):
(JSC::Yarr::operator|=):
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::CharacterClass::hasNonBMPCharacters):
(JSC::Yarr::CharacterClass::hasOneCharacterSize):
(JSC::Yarr::CharacterClass::hasOnlyNonBMPCharacters):
(JSC::Yarr::PatternTerm::invert const):
(JSC::Yarr::PatternTerm::invert): Deleted.

  • yarr/create_regex_tables:
  • yarr/generateYarrUnicodePropertyTables.py:
File:
1 edited

Legend:

Unmodified
Added
Removed
  • trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp

    r243237 r243642  
    7373    static const RegisterID initialStart = ARM64Registers::x11;
    7474    static const RegisterID supplementaryPlanesBase = ARM64Registers::x12;
    75     static const RegisterID surrogateTagMask = ARM64Registers::x13;
    76     static const RegisterID leadingSurrogateTag = ARM64Registers::x14;
    77     static const RegisterID trailingSurrogateTag = ARM64Registers::x15;
     75    static const RegisterID leadingSurrogateTag = ARM64Registers::x13;
     76    static const RegisterID trailingSurrogateTag = ARM64Registers::x14;
     77    static const RegisterID endOfStringAddress = ARM64Registers::x15;
    7878
    7979    static const RegisterID returnRegister = ARM64Registers::x0;
    8080    static const RegisterID returnRegister2 = ARM64Registers::x1;
    8181
     82    const TrustedImm32 surrogateTagMask = TrustedImm32(0xfffffc00);
    8283#define HAVE_INITIAL_START_REG
    8384#define JIT_UNICODE_EXPRESSIONS
     
    144145    static const RegisterID regUnicodeInputAndTrail = X86Registers::r13;
    145146    static const RegisterID leadingSurrogateTag = X86Registers::r14;
    146     static const RegisterID trailingSurrogateTag = X86Registers::r15;
     147    static const RegisterID endOfStringAddress = X86Registers::r15;
    147148
    148149    static const RegisterID returnRegister = X86Registers::eax;
     
    150151
    151152    const TrustedImm32 supplementaryPlanesBase = TrustedImm32(0x10000);
     153    const TrustedImm32 trailingSurrogateTag = TrustedImm32(0xdc00);
    152154    const TrustedImm32 surrogateTagMask = TrustedImm32(0xfffffc00);
    153155#define HAVE_INITIAL_START_REG
     
    320322            if ((term.type == PatternTerm::TypeCharacterClass)
    321323                && (term.quantityType == QuantifierFixedCount)
    322                 && (!m_decodeSurrogatePairs || (!term.characterClass->m_hasNonBMPCharacters && !term.m_invert))
     324                && (!m_decodeSurrogatePairs || (term.characterClass->hasOneCharacterSize() && !term.m_invert))
    323325                && (nextTerm.type == PatternTerm::TypePatternCharacter)
    324326                && (nextTerm.quantityType == QuantifierFixedCount)) {
     
    384386            return;
    385387        }
     388
    386389        JumpList unicodeFail;
    387390        if (charClass->m_matchesUnicode.size() || charClass->m_rangesUnicode.size()) {
     
    448451            unicodeFail.link(this);
    449452    }
     453
     454#ifdef JIT_UNICODE_EXPRESSIONS
     455    void advanceIndexAfterCharacterClassTermMatch(const PatternTerm* term, JumpList& failures, const RegisterID character)
     456    {
     457        ASSERT(term->type == PatternTerm::TypeCharacterClass);
     458
     459        if (term->characterClass->hasOneCharacterSize() && !term->invert())
     460            add32(TrustedImm32(term->characterClass->hasNonBMPCharacters() ? 2 : 1), index);
     461        else {
     462            add32(TrustedImm32(1), index);
     463            failures.append(atEndOfInput());
     464            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
     465            add32(TrustedImm32(1), index);
     466            isBMPChar.link(this);
     467        }
     468    }
     469#endif
    450470
    451471    // Jumps if input not available; will have (incorrectly) incremented already!
     
    521541
    522542        JumpList notUnicode;
     543
    523544        load16Unaligned(regUnicodeInputAndTrail, resultReg);
    524545        and32(surrogateTagMask, resultReg, regT2);
    525546        notUnicode.append(branch32(NotEqual, regT2, leadingSurrogateTag));
    526547        addPtr(TrustedImm32(2), regUnicodeInputAndTrail);
    527         getEffectiveAddress(BaseIndex(input, length, TimesTwo), regT2);
    528         notUnicode.append(branch32(AboveOrEqual, regUnicodeInputAndTrail, regT2));
     548        notUnicode.append(branchPtr(AboveOrEqual, regUnicodeInputAndTrail, endOfStringAddress));
    529549        load16Unaligned(Address(regUnicodeInputAndTrail), regUnicodeInputAndTrail);
    530550        and32(surrogateTagMask, regUnicodeInputAndTrail, regT2);
     
    17351755        }
    17361756#ifdef JIT_UNICODE_EXPRESSIONS
    1737         if (m_decodeSurrogatePairs) {
     1757        if (m_decodeSurrogatePairs && (!term->characterClass->hasOneCharacterSize() || term->invert())) {
    17381758            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
    17391759            add32(TrustedImm32(1), index);
     
    17691789
    17701790        move(index, countRegister);
    1771         sub32(Imm32(term->quantityMaxCount.unsafeGet()), countRegister);
     1791
     1792        Checked<unsigned> scaledMaxCount = term->quantityMaxCount;
     1793
     1794#ifdef JIT_UNICODE_EXPRESSIONS
     1795        if (m_decodeSurrogatePairs && term->characterClass->hasOnlyNonBMPCharacters() && !term->invert())
     1796            scaledMaxCount *= 2;
     1797#endif
     1798        sub32(Imm32(scaledMaxCount.unsafeGet()), countRegister);
    17721799
    17731800        Label loop(this);
    17741801        JumpList matchDest;
    1775         readCharacter(m_checkedOffset - term->inputPosition - term->quantityMaxCount, character, countRegister);
     1802        readCharacter(m_checkedOffset - term->inputPosition - scaledMaxCount, character, countRegister);
    17761803        // If we are matching the "any character" builtin class we only need to read the
    17771804        // character and don't need to match as it will always succeed.
     
    17871814        }
    17881815
    1789         add32(TrustedImm32(1), countRegister);
    17901816#ifdef JIT_UNICODE_EXPRESSIONS
    17911817        if (m_decodeSurrogatePairs) {
    1792             Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
    1793             op.m_jumps.append(atEndOfInput());
     1818            if (term->characterClass->hasOneCharacterSize() && !term->invert())
     1819                add32(TrustedImm32(term->characterClass->hasNonBMPCharacters() ? 2 : 1), countRegister);
     1820            else {
     1821                add32(TrustedImm32(1), countRegister);
     1822                Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
     1823                op.m_jumps.append(atEndOfInput());
     1824                add32(TrustedImm32(1), countRegister);
     1825                add32(TrustedImm32(1), index);
     1826                isBMPChar.link(this);
     1827            }
     1828        } else
     1829#endif
    17941830            add32(TrustedImm32(1), countRegister);
    1795             add32(TrustedImm32(1), index);
    1796             isBMPChar.link(this);
    1797         }
    1798 #endif
    17991831        branch32(NotEqual, countRegister, index).linkTo(loop, this);
    18001832    }
     
    18121844        const RegisterID countRegister = regT1;
    18131845
    1814         if (m_decodeSurrogatePairs)
     1846        if (m_decodeSurrogatePairs && (!term->characterClass->hasOneCharacterSize() || term->invert()))
    18151847            storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
    18161848        move(TrustedImm32(0), countRegister);
     
    18261858            JumpList matchDest;
    18271859            readCharacter(m_checkedOffset - term->inputPosition, character);
    1828             // If we are matching the "any character" builtin class we only need to read the
    1829             // character and don't need to match as it will always succeed.
     1860            // If we are matching the "any character" builtin class for non-unicode patterns,
     1861            // we only need to read the character and don't need to match as it will always succeed.
    18301862            if (!term->characterClass->m_anyCharacter) {
    18311863                matchCharacterClass(character, matchDest, term->characterClass);
     
    18351867        }
    18361868
    1837         add32(TrustedImm32(1), index);
    18381869#ifdef JIT_UNICODE_EXPRESSIONS
    1839         if (m_decodeSurrogatePairs) {
    1840             failures.append(atEndOfInput());
    1841             Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
     1870        if (m_decodeSurrogatePairs)
     1871            advanceIndexAfterCharacterClassTermMatch(term, failures, character);
     1872        else
     1873#endif
    18421874            add32(TrustedImm32(1), index);
    1843             isBMPChar.link(this);
    1844         }
    1845 #endif
    18461875        add32(TrustedImm32(1), countRegister);
    18471876
     
    18691898        m_backtrackingState.append(branchTest32(Zero, countRegister));
    18701899        sub32(TrustedImm32(1), countRegister);
     1900        storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
     1901
    18711902        if (!m_decodeSurrogatePairs)
    18721903            sub32(TrustedImm32(1), index);
     1904        else if (term->characterClass->hasOneCharacterSize() && !term->invert())
     1905            sub32(TrustedImm32(term->characterClass->hasNonBMPCharacters() ? 2 : 1), index);
    18731906        else {
     1907            // Rematch one less
    18741908            const RegisterID character = regT0;
    18751909
    18761910            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
    1877             // Rematch one less
    1878             storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
    18791911
    18801912            Label rematchLoop(this);
     
    19061938        move(TrustedImm32(0), countRegister);
    19071939        op.m_reentry = label();
    1908         if (m_decodeSurrogatePairs)
    1909             storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
    1910         storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
     1940        if (m_decodeSurrogatePairs) {
     1941            if (!term->characterClass->hasOneCharacterSize() || term->invert())
     1942                storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
     1943            storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
     1944        }
    19111945    }
    19121946
     
    19231957        m_backtrackingState.link(this);
    19241958
    1925         if (m_decodeSurrogatePairs)
    1926             loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
    1927         loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex(), countRegister);
     1959        if (m_decodeSurrogatePairs) {
     1960            if (!term->characterClass->hasOneCharacterSize() || term->invert())
     1961                loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
     1962            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex(), countRegister);
     1963        }
    19281964
    19291965        nonGreedyFailures.append(atEndOfInput());
     
    19321968        JumpList matchDest;
    19331969        readCharacter(m_checkedOffset - term->inputPosition, character);
    1934         // If we are matching the "any character" builtin class we only need to read the
    1935         // character and don't need to match as it will always succeed.
     1970        // If we are matching the "any character" builtin class for non-unicode patterns,
     1971        // we only need to read the character and don't need to match as it will always succeed.
    19361972        if (term->invert() || !term->characterClass->m_anyCharacter) {
    19371973            matchCharacterClass(character, matchDest, term->characterClass);
     
    19451981        }
    19461982
    1947         add32(TrustedImm32(1), index);
    19481983#ifdef JIT_UNICODE_EXPRESSIONS
    1949         if (m_decodeSurrogatePairs) {
    1950             nonGreedyFailures.append(atEndOfInput());
    1951             Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
     1984        if (m_decodeSurrogatePairs)
     1985            advanceIndexAfterCharacterClassTermMatch(term, nonGreedyFailures, character);
     1986        else
     1987#endif
    19521988            add32(TrustedImm32(1), index);
    1953             isBMPChar.link(this);
    1954         }
    1955 #endif
    19561989        add32(TrustedImm32(1), countRegister);
    19571990
     
    37013734
    37023735            move(TrustedImm32(0xd800), leadingSurrogateTag);
    3703             move(TrustedImm32(0xdc00), trailingSurrogateTag);
    37043736        }
    37053737        // The ABI doesn't guarantee the upper bits are zero on unsigned arguments, so clear them ourselves.
     
    37353767            pushPair(framePointerRegister, linkRegister);
    37363768            move(TrustedImm32(0x10000), supplementaryPlanesBase);
    3737             move(TrustedImm32(0xfffffc00), surrogateTagMask);
    37383769            move(TrustedImm32(0xd800), leadingSurrogateTag);
    37393770            move(TrustedImm32(0xdc00), trailingSurrogateTag);
     
    38163847        , m_decodeSurrogatePairs(m_charSize == Char16 && m_pattern.unicode())
    38173848        , m_unicodeIgnoreCase(m_pattern.unicode() && m_pattern.ignoreCase())
     3849        , m_fixedSizedAlternative(false)
    38183850        , m_canonicalMode(m_pattern.unicode() ? CanonicalMode::Unicode : CanonicalMode::UCS2)
    38193851#if ENABLE(YARR_JIT_ALL_PARENS_EXPRESSIONS)
     
    38693901        generateFailReturn();
    38703902        hasInput.link(this);
     3903
     3904#ifdef JIT_UNICODE_EXPRESSIONS
     3905        if (m_decodeSurrogatePairs)
     3906            getEffectiveAddress(BaseIndex(input, length, TimesTwo), endOfStringAddress);
     3907#endif
    38713908
    38723909#if ENABLE(YARR_JIT_ALL_PARENS_EXPRESSIONS)
     
    41644201    bool m_decodeSurrogatePairs;
    41654202    bool m_unicodeIgnoreCase;
     4203    bool m_fixedSizedAlternative;
    41664204    CanonicalMode m_canonicalMode;
    41674205#if ENABLE(YARR_JIT_ALL_PARENS_EXPRESSIONS)
Note: See TracChangeset for help on using the changeset viewer.