Context Navigation

← Previous Change
Next Change →

YarrPattern.cpp

Timestamp:

Mar 26, 2012, 1:13:39 PM (13 years ago)

Author:

Message:

Greek sigma is handled wrong in case independent regexp.
https://bugs.webkit.org/show_bug.cgi?id=82063

Reviewed by Oliver Hunt.

Source/JavaScriptCore:

The bug here is that we assume that any given codepoint has at most one additional value it
should match under a case insensitive match, and that the pair of codepoints that match (if
a codepoint does not only match itself) can be determined by calling toUpper/toLower on the
given codepoint). Life is not that simple.

Instead, pre-calculate a set of tables mapping from a UCS2 codepoint to the set of characters
it may match, under the ES5.1 case-insensitive matching rules. Since unicode is fairly regular
we can pack this table quite nicely, and get it down to 364 entries. This means we can use a
simple binary search to find an entry in typically eight compares.

CMakeLists.txt:
GNUmakefile.list.am:
JavaScriptCore.gypi:
JavaScriptCore.vcproj/JavaScriptCore/JavaScriptCore.vcproj:
JavaScriptCore.xcodeproj/project.pbxproj:
yarr/yarr.pri:
- Added new files to build systems.
yarr/YarrCanonicalizeUCS2.cpp: Added.
- New - autogenerated, UCS2 canonicalized comparison tables.
yarr/YarrCanonicalizeUCS2.h: Added.

(JSC::Yarr::rangeInfoFor):

Look up the canonicalization info for a UCS2 character.

(JSC::Yarr::getCanonicalPair):

For a UCS2 character with a single equivalent value, look it up.

(JSC::Yarr::isCanonicallyUnique):

Returns true if no other UCS2 code points are canonically equal.

(JSC::Yarr::areCanonicallyEquivalent):

Compare two values, under canonicalization rules.

yarr/YarrCanonicalizeUCS2.js: Added.
- script used to generate YarrCanonicalizeUCS2.cpp.
yarr/YarrInterpreter.cpp:

(JSC::Yarr::Interpreter::tryConsumeBackReference):

Use isCanonicallyUnique, rather than Unicode toUpper/toLower.

yarr/YarrJIT.cpp:

(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):

Use isCanonicallyUnique, rather than Unicode toUpper/toLower.

yarr/YarrPattern.cpp:

(JSC::Yarr::CharacterClassConstructor::putChar):

Updated to determine canonical equivalents correctly.

(JSC::Yarr::CharacterClassConstructor::putUnicodeIgnoreCase):

Added, used to put a non-ascii, non-unique character in a case-insensitive match.

(JSC::Yarr::CharacterClassConstructor::putRange):

Updated to determine canonical equivalents correctly.

(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):

Changed to call putUnicodeIgnoreCase, instead of putChar, avoid a double lookup of rangeInfo.

LayoutTests:

fast/regex/script-tests/unicodeCaseInsensitive.js: Added.

(shouldBeTrue.ucs2CodePoint):

fast/regex/unicodeCaseInsensitive-expected.txt: Added.
fast/regex/unicodeCaseInsensitive.html: Added.
- Added test cases for case-insensitive matches of non-ascii characters.

File:

: 1 edited

trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp (modified) (5 diffs)

Legend:

: Unmodified
: Added
: Removed

trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp

-              r106748
+              r112143
 #include "Yarr.h"
+#include "YarrCanonicalizeUCS2.h"
 #include "YarrParser.h"
 #include <wtf/Vector.h>
 …
     void putChar(UChar ch)
+    {
+        // Handle ascii cases.
         if (ch <= 0x7f) {
             if (m_isCaseInsensitive && isASCIIAlpha(ch)) {
 …
             } else
                 addSorted(m_matches, ch);
+            return;
+        }
+        // Simple case, not a case-insensitive match.
+        if (!m_isCaseInsensitive) {
+            addSorted(m_matchesUnicode, ch);
+            return;
+        }
+        // Add multiple matches, if necessary.
+        UCS2CanonicalizationRange* info = rangeInfoFor(ch);
+        if (info->type == CanonicalizeUnique)
+            addSorted(m_matchesUnicode, ch);
+        else
+            putUnicodeIgnoreCase(ch, info);
+    }
+    void putUnicodeIgnoreCase(UChar ch, UCS2CanonicalizationRange* info)
+    {
+        ASSERT(m_isCaseInsensitive);
+        ASSERT(ch > 0x7f);
+        ASSERT(ch >= info->begin && ch <= info->end);
+        ASSERT(info->type != CanonicalizeUnique);
+        if (info->type == CanonicalizeSet) {
+            for (uint16_t* set = characterSetInfo[info->value]; (ch = *set); ++set)
+                addSorted(m_matchesUnicode, ch);
         } else {
+            UChar upper, lower;
+            if (m_isCaseInsensitive && ((upper = Unicode::toUpper(ch)) != (lower = Unicode::toLower(ch)))) {
+                addSorted(m_matchesUnicode, upper);
+                addSorted(m_matchesUnicode, lower);
+            } else
+                addSorted(m_matchesUnicode, ch);
+        }
+    }
+    // returns true if this character has another case, and 'ch' is the upper case form.
+    static inline bool isUnicodeUpper(UChar ch)
+    {
+        return ch != Unicode::toLower(ch);
+    }
+    // returns true if this character has another case, and 'ch' is the lower case form.
+    static inline bool isUnicodeLower(UChar ch)
+    {
+        return ch != Unicode::toUpper(ch);
+            addSorted(m_matchesUnicode, ch);
+            addSorted(m_matchesUnicode, getCanonicalPair(info, ch));
+        }
+    }
 …
+            }
+        }
+        if (hi >= 0x80) {
+            uint32_t unicodeCurr = std::max(lo, (UChar)0x80);
+            addSortedRange(m_rangesUnicode, unicodeCurr, hi);
+            if (m_isCaseInsensitive) {
+                while (unicodeCurr <= hi) {
+                    // If the upper bound of the range (hi) is 0xffff, the increments to
+                    // unicodeCurr in this loop may take it to 0x10000.  This is fine
+                    // (if so we won't re-enter the loop, since the loop condition above
+                    // will definitely fail) - but this does mean we cannot use a UChar
+                    // to represent unicodeCurr, we must use a 32-bit value instead.
+                    ASSERT(unicodeCurr <= 0xffff);
+                    if (isUnicodeUpper(unicodeCurr)) {
+                        UChar lowerCaseRangeBegin = Unicode::toLower(unicodeCurr);
+                        UChar lowerCaseRangeEnd = lowerCaseRangeBegin;
+                        while ((++unicodeCurr <= hi) && isUnicodeUpper(unicodeCurr) && (Unicode::toLower(unicodeCurr) == (lowerCaseRangeEnd + 1)))
+                            lowerCaseRangeEnd++;
+                        addSortedRange(m_rangesUnicode, lowerCaseRangeBegin, lowerCaseRangeEnd);
+                    } else if (isUnicodeLower(unicodeCurr)) {
+                        UChar upperCaseRangeBegin = Unicode::toUpper(unicodeCurr);
+                        UChar upperCaseRangeEnd = upperCaseRangeBegin;
+                        while ((++unicodeCurr <= hi) && isUnicodeLower(unicodeCurr) && (Unicode::toUpper(unicodeCurr) == (upperCaseRangeEnd + 1)))
+                            upperCaseRangeEnd++;
+                        addSortedRange(m_rangesUnicode, upperCaseRangeBegin, upperCaseRangeEnd);
+                    } else
+                        ++unicodeCurr;
+                }
+            }
+        }
+        if (hi <= 0x7f)
+            return;
+        lo = std::max(lo, (UChar)0x80);
+        addSortedRange(m_rangesUnicode, lo, hi);
+        if (!m_isCaseInsensitive)
+            return;
+        UCS2CanonicalizationRange* info = rangeInfoFor(lo);
+        while (true) {
+            // Handle the range [lo .. end]
+            UChar end = std::min(info->end, hi);
+            switch (info->type) {
+            case CanonicalizeUnique:
+                // Nothing to do - no canonical equivalents.
+                break;
+            case CanonicalizeSet: {
+                UChar ch;
+                for (uint16_t* set = characterSetInfo[info->value]; (ch = *set); ++set)
+                    addSorted(m_matchesUnicode, ch);
+                break;
+            }
+            case CanonicalizeRangeLo:
+                addSortedRange(m_rangesUnicode, lo + info->value, end + info->value);
+                break;
+            case CanonicalizeRangeHi:
+                addSortedRange(m_rangesUnicode, lo - info->value, end - info->value);
+                break;
+            case CanonicalizeAlternatingAligned:
+                // Use addSortedRange since there is likely an abutting range to combine with.
+                if (lo & 1)
+                    addSortedRange(m_rangesUnicode, lo - 1, lo - 1);
+                if (!(end & 1))
+                    addSortedRange(m_rangesUnicode, end + 1, end + 1);
+                break;
+            case CanonicalizeAlternatingUnaligned:
+                // Use addSortedRange since there is likely an abutting range to combine with.
+                if (!(lo & 1))
+                    addSortedRange(m_rangesUnicode, lo - 1, lo - 1);
+                if (end & 1)
+                    addSortedRange(m_rangesUnicode, end + 1, end + 1);
+                break;
+            }
+            if (hi == end)
+                return;
+            ++info;
+            lo = info->begin;
+        };
+    }
 …
         // We handle case-insensitive checking of unicode characters which do have both
         // cases by handling them as if they were defined using a CharacterClass.
+        if (m_pattern.m_ignoreCase && !isASCII(ch) && (Unicode::toUpper(ch) != Unicode::toLower(ch))) {
+            atomCharacterClassBegin();
+            atomCharacterClassAtom(ch);
+            atomCharacterClassEnd();
+        } else
+        if (!m_pattern.m_ignoreCase || isASCII(ch)) {
             m_alternative->m_terms.append(PatternTerm(ch));
+            return;
+        }
+        UCS2CanonicalizationRange* info = rangeInfoFor(ch);
+        if (info->type == CanonicalizeUnique) {
+            m_alternative->m_terms.append(PatternTerm(ch));
+            return;
+        }
+        m_characterClassConstructor.putUnicodeIgnoreCase(ch, info);
+        CharacterClass* newCharacterClass = m_characterClassConstructor.charClass();
+        m_pattern.m_userCharacterClasses.append(newCharacterClass);
+        m_alternative->m_terms.append(PatternTerm(newCharacterClass, false));
+    }

Note: See TracChangeset for help on using the changeset viewer.

Context Navigation

Changeset 112143 in webkit for trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp

Legend:

trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp

Download in other formats: