Timestamp:
Mar 26, 2012, 1:13:39 PM (13 years ago)
Author:
[email protected]
Message:

Greek sigma is handled wrong in case independent regexp.
https://bugs.webkit.org/show_bug.cgi?id=82063

Reviewed by Oliver Hunt.

Source/JavaScriptCore:

The bug here is that we assume that any given codepoint has at most one additional value it
should match under a case insensitive match, and that the pair of codepoints that match (if
a codepoint does not only match itself) can be determined by calling toUpper/toLower on the
given codepoint. Life is not that simple: Greek sigma, for example, has three forms (Σ, σ and
final ς) that must all match one another.

Instead, pre-calculate a set of tables mapping from a UCS2 codepoint to the set of characters
it may match, under the ES5.1 case-insensitive matching rules. Since Unicode is fairly regular,
we can pack this table quite nicely and get it down to 364 entries. This means we can use a
simple binary search to find an entry in typically eight compares.
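
The lookup described above can be sketched as follows. This is a minimal illustration, not the generated table or the code in YarrCanonicalizeUCS2.h: the names Kind, Range, kRanges, findRange and areEquivalent are hypothetical, and the table is reduced to ASCII letters and the three Greek sigmas, with every other code unit treated as unique purely to keep the example short.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical, heavily reduced model of a packed canonicalization table.
// Only ASCII letters and the Greek sigmas are modelled; all other code
// units are (incorrectly, for brevity) treated as unique.
enum class Kind { Unique, Offset, Set };

struct Range {
    uint16_t begin;
    uint16_t end;  // inclusive
    Kind kind;
    int value;     // Offset: delta to the single partner; Set: index into kSets
};

// Zero-terminated set of mutually equivalent code units: Σ, σ, ς.
static const uint16_t kSigmaSet[] = { 0x03A3, 0x03C3, 0x03C2, 0 };
static const uint16_t* const kSets[] = { kSigmaSet };

// Sorted, non-overlapping, and covering 0x0000..0xFFFF without gaps.
static const Range kRanges[] = {
    { 0x0000, 0x0040, Kind::Unique, 0 },
    { 0x0041, 0x005A, Kind::Offset, 0x20 },   // A-Z pair with a-z
    { 0x005B, 0x0060, Kind::Unique, 0 },
    { 0x0061, 0x007A, Kind::Offset, -0x20 },  // a-z pair with A-Z
    { 0x007B, 0x03A2, Kind::Unique, 0 },
    { 0x03A3, 0x03A3, Kind::Set, 0 },         // Σ
    { 0x03A4, 0x03C1, Kind::Unique, 0 },
    { 0x03C2, 0x03C3, Kind::Set, 0 },         // ς, σ
    { 0x03C4, 0xFFFF, Kind::Unique, 0 },
};
static const size_t kRangeCount = sizeof(kRanges) / sizeof(kRanges[0]);

// Binary search for the range containing ch.
const Range& findRange(uint16_t ch)
{
    size_t lo = 0;
    size_t hi = kRangeCount;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < kRanges[mid].begin)
            hi = mid;
        else if (ch > kRanges[mid].end)
            lo = mid + 1;
        else
            return kRanges[mid];
    }
    assert(false && "table must cover every code unit");
    return kRanges[0];
}

// True if a and b match each other under this toy table.
bool areEquivalent(uint16_t a, uint16_t b)
{
    if (a == b)
        return true;
    const Range& range = findRange(a);
    switch (range.kind) {
    case Kind::Unique:
        return false;
    case Kind::Offset:
        return b == static_cast<uint16_t>(a + range.value);
    case Kind::Set:
        for (const uint16_t* p = kSets[range.value]; *p; ++p) {
            if (*p == b)
                return true;
        }
        return false;
    }
    return false;
}
```

With this table, areEquivalent(0x03C2, 0x03A3) holds even though final sigma is neither the toUpper nor the toLower partner that a two-way mapping would pick for σ, which is the case the old assumption could not express.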

  • CMakeLists.txt:
  • GNUmakefile.list.am:
  • JavaScriptCore.gypi:
  • JavaScriptCore.vcproj/JavaScriptCore/JavaScriptCore.vcproj:
  • JavaScriptCore.xcodeproj/project.pbxproj:
  • yarr/yarr.pri:
    • Added new files to build systems.
  • yarr/YarrCanonicalizeUCS2.cpp: Added.
    • New - autogenerated, UCS2 canonicalized comparison tables.
  • yarr/YarrCanonicalizeUCS2.h: Added.
    (JSC::Yarr::rangeInfoFor):
      • Look up the canonicalization info for a UCS2 character.
    (JSC::Yarr::getCanonicalPair):
      • For a UCS2 character with a single equivalent value, look it up.
    (JSC::Yarr::isCanonicallyUnique):
      • Returns true if no other UCS2 code points are canonically equal.
    (JSC::Yarr::areCanonicallyEquivalent):
      • Compare two values, under canonicalization rules.
  • yarr/YarrCanonicalizeUCS2.js: Added.
    • Script used to generate YarrCanonicalizeUCS2.cpp.
  • yarr/YarrInterpreter.cpp:
    (JSC::Yarr::Interpreter::tryConsumeBackReference):
      • Use isCanonicallyUnique, rather than Unicode toUpper/toLower.
  • yarr/YarrJIT.cpp:
    (JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
    (JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
    (JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
      • Use isCanonicallyUnique, rather than Unicode toUpper/toLower.
  • yarr/YarrPattern.cpp:
    (JSC::Yarr::CharacterClassConstructor::putChar):
      • Updated to determine canonical equivalents correctly.
    (JSC::Yarr::CharacterClassConstructor::putUnicodeIgnoreCase):
      • Added, used to put a non-ASCII, non-unique character in a case-insensitive match.
    (JSC::Yarr::CharacterClassConstructor::putRange):
      • Updated to determine canonical equivalents correctly.
    (JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
      • Changed to call putUnicodeIgnoreCase instead of putChar, to avoid a double lookup of rangeInfo.

LayoutTests:

  • fast/regex/script-tests/unicodeCaseInsensitive.js: Added.
    (shouldBeTrue.ucs2CodePoint):
  • fast/regex/unicodeCaseInsensitive-expected.txt: Added.
  • fast/regex/unicodeCaseInsensitive.html: Added.
    • Added test cases for case-insensitive matches of non-ASCII characters.
File:
1 edited

  • trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp

--- trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp (r110033)
+++ trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp (r112143)
@@ -30,4 +30,5 @@
 #include "LinkBuffer.h"
 #include "Yarr.h"
+#include "YarrCanonicalizeUCS2.h"
 
 #if ENABLE(YARR_JIT)
@@ -263,8 +264,8 @@
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || (Unicode::toLower(ch) == Unicode::toUpper(ch)));
+        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
         if (m_pattern.m_ignoreCase && isASCIIAlpha(ch)) {
-            or32(TrustedImm32(32), character);
-            ch = Unicode::toLower(ch);
+            or32(TrustedImm32(0x20), character);
+            ch |= 0x20;
         }
 
@@ -686,7 +687,7 @@
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || (Unicode::toLower(ch) == Unicode::toUpper(ch)));
-
-        if ((m_pattern.m_ignoreCase) && (isASCIIAlpha(ch)))
+        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
+
+        if (m_pattern.m_ignoreCase && isASCIIAlpha(ch))
             ignoreCaseMask |= 32;
 
@@ -714,5 +715,5 @@
             // For case-insesitive compares, non-ascii characters that have different
             // upper & lower case representations are converted to a character class.
-            ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(currentCharacter) || (Unicode::toLower(currentCharacter) == Unicode::toUpper(currentCharacter)));
+            ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(currentCharacter) || isCanonicallyUnique(currentCharacter));
 
             allCharacters |= (currentCharacter << shiftAmount);
@@ -791,8 +792,8 @@
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || (Unicode::toLower(ch) == Unicode::toUpper(ch)));
+        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
         if (m_pattern.m_ignoreCase && isASCIIAlpha(ch)) {
-            or32(TrustedImm32(32), character);
-            ch = Unicode::toLower(ch);
+            or32(TrustedImm32(0x20), character);
+            ch |= 0x20;
         }
 
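
The ASCII branch in the diff above folds case by setting bit 0x20 on both the pattern character and the input character before comparing them. A minimal standalone sketch of that idea is shown below; isAsciiLetter and matchesIgnoringAsciiCase are hypothetical helpers written for illustration, not functions from YarrJIT.cpp.

```cpp
#include <stdint.h>

// Hypothetical helper: true for 'A'-'Z' and 'a'-'z' only.
inline bool isAsciiLetter(uint16_t ch)
{
    uint16_t folded = ch | 0x20;
    return folded >= 'a' && folded <= 'z';
}

// Hypothetical helper: compares one pattern character against one input
// character, ignoring case only when the pattern character is an ASCII letter.
inline bool matchesIgnoringAsciiCase(uint16_t patternChar, uint16_t inputChar)
{
    if (!isAsciiLetter(patternChar))
        return patternChar == inputChar;
    // Upper- and lower-case ASCII letters differ only in bit 0x20, so
    // setting that bit on both sides folds them to lower case.
    return (inputChar | 0x20) == (patternChar | 0x20);
}
```

The fold is only valid once the pattern character is known to be an ASCII letter, which is why the check comes first; characters outside that range would otherwise be conflated with unrelated ones (for example '@' and '`').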