Ignore:
Timestamp:
Mar 28, 2019, 11:05:55 PM (6 years ago)
Author:
[email protected]
Message:

[YARR] Precompute BMP / non-BMP status when constructing character classes
https://bugs.webkit.org/show_bug.cgi?id=196296

Reviewed by Keith Miller.

Changed CharacterClass::m_hasNonBMPCharacters into a character width bit field which
indicateis if the class includes characters from either BMP, non-BMP or both ranges.
This allows the recognizing code to eliminate checks for the width of a matched
characters when the class has only one width. The character width is needed to
determine if we advance 1 or 2 character. Also, the pre-computed width of character
classes that contains either all BMP or all non-BMP characters allows the parser to
use fixed widths for terms using those character classes. Changed both the code gen
scripts and Yarr compiler to compute this bit field during the construction of
character classes.

For JIT'ed code of character classes that contain either all BMP or all non-BMP
characters, we can eliminate the generic check we were doing do compute how much
to advance after sucessfully matching a character in the class.

Generic isBMP check BMP only non-BMP only
-------------- -------------- --------------
inc %r9d inc %r9d add $0x2, %r9d
cmp $0x10000, %eax
jl isBMP
cmp %edx, %esi
jz atEndOfString
inc %r9d
inc %esi

isBMP:

For character classes that contained non-BMP characters, we were always generating
the code in the left column. The middle column is the code we generate for character
classes that contain only BMP characters. The right column is the code we now
generate if the character class has only non-BMP characters. In the fix width cases,
we can eliminate both the isBMP check as well as the atEndOfString check. The
atEndOfstring check is eliminated since we know how many characters this character
class requires and that check can be factored out to the beginning of the current
alternative. For character classes that contain both BMP and non-BMP characters,
we still generate the generic left column.

This change is a ~8% perf progression on UniPoker and a ~2% improvement on RexBench
as a whole.

  • runtime/RegExp.cpp:

(JSC::RegExp::matchCompareWithInterpreter):

  • runtime/RegExpInlines.h:

(JSC::RegExp::matchInline):

  • yarr/YarrInterpreter.cpp:

(JSC::Yarr::Interpreter::checkCharacterClassDontAdvanceInputForNonBMP):
(JSC::Yarr::Interpreter::matchCharacterClass):

  • yarr/YarrJIT.cpp:

(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::advanceIndexAfterCharacterClassTermMatch):
(JSC::Yarr::YarrGenerator::tryReadUnicodeCharImpl):
(JSC::Yarr::YarrGenerator::generateCharacterClassOnce):
(JSC::Yarr::YarrGenerator::generateCharacterClassFixed):
(JSC::Yarr::YarrGenerator::generateCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::generateEnter):
(JSC::Yarr::YarrGenerator::YarrGenerator):
(JSC::Yarr::YarrGenerator::compile):

  • yarr/YarrPattern.cpp:

(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::reset):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::CharacterClassConstructor::hasNonBMPCharacters):
(JSC::Yarr::CharacterClassConstructor::characterWidths):
(JSC::Yarr::PatternTerm::dump):
(JSC::Yarr::anycharCreate):

  • yarr/YarrPattern.h:

(JSC::Yarr::operator|):
(JSC::Yarr::operator&):
(JSC::Yarr::operator|=):
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::CharacterClass::hasNonBMPCharacters):
(JSC::Yarr::CharacterClass::hasOneCharacterSize):
(JSC::Yarr::CharacterClass::hasOnlyNonBMPCharacters):
(JSC::Yarr::PatternTerm::invert const):
(JSC::Yarr::PatternTerm::invert): Deleted.

  • yarr/create_regex_tables:
  • yarr/generateYarrUnicodePropertyTables.py:
File:
1 edited

Legend:

Unmodified
Added
Removed
  • trunk/Source/JavaScriptCore/yarr/generateYarrUnicodePropertyTables.py

    r236870 r243642  
    3636
    3737header = """/*
    38 * Copyright (C) 2017-2018 Apple Inc. All rights reserved.
     38* Copyright (C) 2017-2019 Apple Inc. All rights reserved.
    3939*
    4040* Redistribution and use in source and binary forms, with or without
     
    226226        self.aliases = []
    227227        self.index = len(PropertyData.allPropertyData)
     228        self.hasBMPCharacters = False
    228229        self.hasNonBMPCharacters = False
    229230        self.matches = []
     
    250251
    251252    def addMatch(self, codePoint):
    252         if codePoint > MaxBMP:
     253        if codePoint <= MaxBMP:
     254            self.hasBMPCharacters = True
     255        else:
    253256            self.hasNonBMPCharacters = True
    254257        if codePoint <= lastASCIICodePoint:
     
    282285
    283286    def addRange(self, lowCodePoint, highCodePoint):
     287        if lowCodePoint <= MaxBMP:
     288            self.hasBMPCharacters = True
    284289        if highCodePoint > MaxBMP:
    285290            self.hasNonBMPCharacters = True
     
    537542        file.write("        std::initializer_list<CharacterRange>(")
    538543        self.dumpMatchData(file, 4, self.unicodeRanges, lambda file, range: (file.write("{{{0:0=#6x}, {1:0=#6x}}}".format(range[0], range[1]))))
    539         file.write("));\n")
    540 
    541         file.write("    characterClass->m_hasNonBMPCharacters = {};\n".format(("false", "true")[self.hasNonBMPCharacters]))
     544        file.write("),\n")
     545
     546        file.write("        CharacterClassWidths::{});\n".format(("Unknown", "HasBMPChars", "HasNonBMPChars", "HasBothBMPAndNonBMP")[int(self.hasNonBMPCharacters) * 2 + int(self.hasBMPCharacters)]))
    542547        file.write("    return characterClass;\n}\n\n")
    543548
Note: See TracChangeset for help on using the changeset viewer.