YarrJIT.cpp

Timestamp:

Mar 26, 2012, 1:13:39 PM (13 years ago)

Author:

Message:

Greek sigma is handled wrong in case independent regexp.
https://bugs.webkit.org/show_bug.cgi?id=82063

Reviewed by Oliver Hunt.

Source/JavaScriptCore:

The bug here is that we assume any given codepoint has at most one additional value it
should match under a case-insensitive match, and that the pair of matching codepoints (when
a codepoint does not match only itself) can be determined by calling toUpper/toLower on the
given codepoint. Life is not that simple: Greek sigma, for example, has three forms (U+03A3,
U+03C3 and final sigma U+03C2) that must all match one another.

Instead, pre-calculate a set of tables mapping from a UCS2 codepoint to the set of characters
it may match under the ES5.1 case-insensitive matching rules. Since Unicode is fairly regular,
we can pack this table quite nicely and get it down to 364 entries. This means we can use a
simple binary search to find an entry in typically eight compares.
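
As a rough illustration of the packed-table idea described above, here is a minimal sketch of a sorted range table searched by binary search. The type and function names (CanonType, CanonRange, findRange, partnerOf) and the five table entries are hypothetical and hand-written for this example; they are not copied from the generated YarrCanonicalizeUCS2.cpp, and the toy table treats everything outside ASCII letters as unique, which the real tables do not.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified model of a packed canonicalization table.
enum class CanonType { Unique, RangeLo, RangeHi };

struct CanonRange {
    uint16_t begin;  // first code unit covered (inclusive)
    uint16_t end;    // last code unit covered (inclusive)
    uint16_t delta;  // distance to the partner character, if any
    CanonType type;  // how to derive the partner from the character
};

// Toy table: sorted, non-overlapping, and covering 0x0000..0xFFFF.
// Assumption for illustration only: non-ASCII characters are all "unique".
static const CanonRange kRanges[] = {
    { 0x0000, 0x0040, 0x00, CanonType::Unique },
    { 0x0041, 0x005A, 0x20, CanonType::RangeLo }, // A-Z pair with a-z
    { 0x005B, 0x0060, 0x00, CanonType::Unique },
    { 0x0061, 0x007A, 0x20, CanonType::RangeHi }, // a-z pair with A-Z
    { 0x007B, 0xFFFF, 0x00, CanonType::Unique },
};
static const size_t kRangeCount = sizeof(kRanges) / sizeof(kRanges[0]);

// Binary search for the range containing ch. Because the table covers the
// whole 16-bit space, a containing range always exists.
inline const CanonRange* findRange(uint16_t ch)
{
    size_t lo = 0;
    size_t hi = kRangeCount;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < kRanges[mid].begin)
            hi = mid;
        else
            lo = mid;
    }
    return &kRanges[lo];
}

// Returns the single partner of ch, or ch itself if it is unique.
inline uint16_t partnerOf(uint16_t ch)
{
    const CanonRange* range = findRange(ch);
    switch (range->type) {
    case CanonType::RangeLo:
        return static_cast<uint16_t>(ch + range->delta);
    case CanonType::RangeHi:
        return static_cast<uint16_t>(ch - range->delta);
    case CanonType::Unique:
        break;
    }
    return ch;
}
```

With one entry per run of similarly behaving characters, a table of a few hundred entries needs roughly log2(n) probes per lookup, which is consistent with the "typically eight compares" figure quoted for 364 entries.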

CMakeLists.txt:
GNUmakefile.list.am:
JavaScriptCore.gypi:
JavaScriptCore.vcproj/JavaScriptCore/JavaScriptCore.vcproj:
JavaScriptCore.xcodeproj/project.pbxproj:
yarr/yarr.pri:
- Added new files to build systems.
yarr/YarrCanonicalizeUCS2.cpp: Added.
- New, autogenerated UCS2 canonicalized comparison tables.
yarr/YarrCanonicalizeUCS2.h: Added.

(JSC::Yarr::rangeInfoFor):

Look up the canonicalization info for a UCS2 character.

(JSC::Yarr::getCanonicalPair):

For a UCS2 character with a single equivalent value, look it up.

(JSC::Yarr::isCanonicallyUnique):

Returns true if no other UCS2 code points are canonically equal.

(JSC::Yarr::areCanonicallyEquivalent):

Compare two values, under canonicalization rules.

yarr/YarrCanonicalizeUCS2.js: Added.
- script used to generate YarrCanonicalizeUCS2.cpp.
yarr/YarrInterpreter.cpp:

(JSC::Yarr::Interpreter::tryConsumeBackReference):

Use isCanonicallyUnique, rather than Unicode toUpper/toLower.

yarr/YarrJIT.cpp:

(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):

Use isCanonicallyUnique, rather than Unicode toUpper/toLower.

yarr/YarrPattern.cpp:

(JSC::Yarr::CharacterClassConstructor::putChar):

Updated to determine canonical equivalents correctly.

(JSC::Yarr::CharacterClassConstructor::putUnicodeIgnoreCase):

Added, used to put a non-ascii, non-unique character in a case-insensitive match.

(JSC::Yarr::CharacterClassConstructor::putRange):

Updated to determine canonical equivalents correctly.

(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):

Changed to call putUnicodeIgnoreCase instead of putChar, to avoid a double lookup of rangeInfo.
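
To make the intent of the helpers listed under yarr/YarrCanonicalizeUCS2.h more concrete, the following toy model shows what "canonically unique" and "canonically equivalent" could mean for a character that belongs to a set of three. The function names mirror those in the change log, but the bodies are an illustrative assumption living in a hypothetical `sketch` namespace: the model knows only about ASCII letters and the Greek sigma set, and treats every other character as unique.

```cpp
#include <cstddef>
#include <cstdint>

namespace sketch {

// The three forms of Greek sigma: capital, final, and small.
constexpr uint16_t kSigma[3] = { 0x03A3, 0x03C2, 0x03C3 };

inline bool isASCIIAlpha(uint16_t ch)
{
    return (ch | 0x20) >= 'a' && (ch | 0x20) <= 'z';
}

// Writes every code unit that ch matches (itself included) into out and
// returns how many were written. Toy model: only sigma and ASCII letters
// have partners; everything else is assumed unique.
inline size_t matchSet(uint16_t ch, uint16_t (&out)[3])
{
    for (uint16_t member : kSigma) {
        if (member == ch) {
            for (size_t i = 0; i < 3; ++i)
                out[i] = kSigma[i];
            return 3;
        }
    }
    out[0] = ch;
    if (isASCIIAlpha(ch)) {
        out[1] = static_cast<uint16_t>(ch ^ 0x20); // flip the case bit
        return 2;
    }
    return 1;
}

// True if ch matches nothing but itself.
inline bool isCanonicallyUnique(uint16_t ch)
{
    uint16_t set[3];
    return matchSet(ch, set) == 1;
}

// True if a and b match each other under the toy rules above.
inline bool areCanonicallyEquivalent(uint16_t a, uint16_t b)
{
    uint16_t set[3];
    size_t count = matchSet(a, set);
    for (size_t i = 0; i < count; ++i) {
        if (set[i] == b)
            return true;
    }
    return false;
}

} // namespace sketch
```

The sigma set is the case a single toUpper/toLower pair cannot express: final sigma and small sigma must match each other as well as capital sigma, so a lookup that returns one partner per character is not enough.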

LayoutTests:

fast/regex/script-tests/unicodeCaseInsensitive.js: Added.

(shouldBeTrue.ucs2CodePoint):

fast/regex/unicodeCaseInsensitive-expected.txt: Added.
fast/regex/unicodeCaseInsensitive.html: Added.
- Added test cases for case-insensitive matches of non-ascii characters.

File: 1 edited

trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp (modified) (5 diffs)

trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp

-              r110033
+              r112143
 #include "LinkBuffer.h"
 #include "Yarr.h"
+#include "YarrCanonicalizeUCS2.h"
 #if ENABLE(YARR_JIT)
 …
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || (Unicode::toLower(ch) == Unicode::toUpper(ch)));
+        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
         if (m_pattern.m_ignoreCase && isASCIIAlpha(ch)) {
-            or32(TrustedImm32(32), character);
-            ch = Unicode::toLower(ch);
+            or32(TrustedImm32(0x20), character);
+            ch |= 0x20;
+        }
 …
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || (Unicode::toLower(ch) == Unicode::toUpper(ch)));
-        if ((m_pattern.m_ignoreCase) && (isASCIIAlpha(ch)))
+        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
+        if (m_pattern.m_ignoreCase && isASCIIAlpha(ch))
             ignoreCaseMask |= 32;
 …
             // For case-insesitive compares, non-ascii characters that have different
             // upper & lower case representations are converted to a character class.
-            ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(currentCharacter) || (Unicode::toLower(currentCharacter) == Unicode::toUpper(currentCharacter)));
+            ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(currentCharacter) || isCanonicallyUnique(currentCharacter));
             allCharacters |= (currentCharacter << shiftAmount);
 …
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || (Unicode::toLower(ch) == Unicode::toUpper(ch)));
+        ASSERT(!m_pattern.m_ignoreCase || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
         if (m_pattern.m_ignoreCase && isASCIIAlpha(ch)) {
-            or32(TrustedImm32(32), character);
-            ch = Unicode::toLower(ch);
+            or32(TrustedImm32(0x20), character);
+            ch |= 0x20;
+        }
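
The hunks above replace a Unicode toLower call with `ch |= 0x20` for ASCII letters. The sketch below illustrates why that works, in plain C++ rather than emitted machine code. The helper names (matchesOne, matchesPair) are hypothetical, isASCIIAlpha is a local stand-in, and the way the mask is shifted in the packed two-character version is an assumption, since the diff shows only `ignoreCaseMask |= 32` and `allCharacters |= (currentCharacter << shiftAmount)`.

```cpp
#include <cstdint>

// Local stand-in for an ASCII letter test.
inline bool isASCIIAlpha(uint16_t ch)
{
    return (ch | 0x20) >= 'a' && (ch | 0x20) <= 'z';
}

// Single-character compare. For ASCII letters, upper and lower case differ
// only in bit 0x20, so setting that bit on both sides folds case without
// calling toLower.
inline bool matchesOne(uint16_t input, uint16_t patternCh, bool ignoreCase)
{
    if (ignoreCase && isASCIIAlpha(patternCh)) {
        input |= 0x20;
        patternCh |= 0x20;
    }
    return input == patternCh;
}

// Two 16-bit characters packed into one 32-bit word and compared at once.
// Assumption: the case bit is shifted into each character's lane of the mask.
inline bool matchesPair(uint16_t in0, uint16_t in1,
                        uint16_t p0, uint16_t p1, bool ignoreCase)
{
    const uint16_t pattern[2] = { p0, p1 };
    uint32_t input = static_cast<uint32_t>(in0) | (static_cast<uint32_t>(in1) << 16);
    uint32_t allCharacters = 0;
    uint32_t ignoreCaseMask = 0;

    for (int i = 0; i < 2; ++i) {
        int shiftAmount = 16 * i;
        uint32_t currentCharacter = pattern[i];
        if (ignoreCase && isASCIIAlpha(pattern[i]))
            ignoreCaseMask |= 0x20u << shiftAmount;
        allCharacters |= currentCharacter << shiftAmount;
    }

    // Fold case only in the lanes that hold ASCII letters.
    return (input | ignoreCaseMask) == (allCharacters | ignoreCaseMask);
}
```

The fold is applied only when the pattern character is an ASCII letter, which is why the assertions in the diff require every other character to be canonically unique: such characters are compared exactly, and anything with non-ASCII case partners is expected to have been turned into a character class beforehand.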

Changeset 112143 in webkit for trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp
