pcre_compile.cpp

Timestamp:

Dec 16, 2007, 8:19:25 PM

Author:

Darin Adler

Message:

Reviewed by Maciej.

http://bugs.webkit.org/show_bug.cgi?id=16438
removed some more unused code
changed quite a few more names to WebKit-style
moved more things out of pcre_internal.h
changed some indentation to WebKit-style
improved design of the functions for reading and writing 2-byte values from the opcode stream (in pcre_internal.h)

pcre/dftables.cpp: (main): Added the kjs prefix in the normal way instead of using macros.

pcre/pcre_compile.cpp: Moved some definitions here from pcre_internal.h.
(errorText): Name changes, fewer typedefs.
(checkEscape): Ditto. Changed uppercase conversion to use toASCIIUpper.
(isCountedRepeat): Name change.
(readRepeatCounts): Name change.
(firstSignificantOpcode): Got rid of the use of OP_lengths, which is very lightly used here. Hard-coded the length of OP_BRANUMBER.
(firstSignificantOpcodeSkippingAssertions): Ditto. Also changed to use the advanceToEndOfBracket function.
(getOthercaseRange): Name changes.
(encodeUTF8): Ditto.
(compileBranch): Name changes. Removed unused after_manual_callout and the code to handle it. Removed code to handle OP_ONCE since we never emit this opcode. Changed to use advanceToEndOfBracket in more places.
(compileBracket): Name changes.
(branchIsAnchored): Removed code to handle OP_ONCE since we never emit this opcode.
(bracketIsAnchored): Name changes.
(branchNeedsLineStart): More of the same.
(bracketNeedsLineStart): Ditto.
(branchFindFirstAssertedCharacter): Removed OP_ONCE code.
(bracketFindFirstAssertedCharacter): More of the same.
(calculateCompiledPatternLengthAndFlags): Ditto.
(returnError): Name changes.
(jsRegExpCompile): Ditto.

pcre/pcre_exec.cpp: Moved some definitions here from pcre_internal.h.
(matchRef): Updated names. Improved macros to use the do { } while (0) idiom so they expand to single statements rather than to blocks or multiple statements. Also refactored the recursive match macros.
(MatchStack::pushNewFrame): Name changes.
(getUTF8CharAndIncrementLength): Name changes.
(match): Name changes. Removed the ONCE opcode.
(jsRegExpExecute): Name changes.
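
The do { } while (0) idiom mentioned for pcre_exec.cpp can be illustrated with a small standalone sketch. The macro and function names below (RESET_BOTH_UNWRAPPED, RESET_BOTH, unwrappedReset, wrappedReset) are hypothetical and are not the macros in pcre_exec.cpp. They only show why a multi-statement macro needs the wrapper in order to behave as a single statement after an unbraced if.

```cpp
#include <cassert>

// Hypothetical macros for illustration only; not taken from pcre_exec.cpp.

// Unwrapped: after an unbraced if, only the first assignment is conditional.
#define RESET_BOTH_UNWRAPPED(a, b) a = 0; b = 0

// Wrapped: the macro plus its trailing semicolon forms exactly one statement.
#define RESET_BOTH(a, b) do { a = 0; b = 0; } while (0)

// Returns a + b after a conditional reset written with the unwrapped macro.
inline int unwrappedReset(bool condition)
{
    int a = 1;
    int b = 2;
    if (condition)
        RESET_BOTH_UNWRAPPED(a, b); // expands to: if (condition) a = 0; b = 0;
    return a + b; // b is always 0 here, even when condition is false
}

// Returns a + b after a conditional reset written with the wrapped macro.
inline int wrappedReset(bool condition)
{
    int a = 1;
    int b = 2;
    if (condition)
        RESET_BOTH(a, b);
    else
        a = 10; // the else still pairs with the if because the macro is one statement
    return a + b;
}
```

With condition false, unwrappedReset returns 1 because b was cleared unconditionally, while wrappedReset leaves b alone and returns 12. A macro that expands to a bare { } block would not compile in the if/else form at all, since the semicolon after the block ends the if statement before the else.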

pcre/pcre_internal.h: Removed quite a few unneeded includes. Rewrote quite a few comments. Removed the macros that add kjs prefixes to the functions with external linkage; instead renamed the functions. Removed the unneeded typedefs pcre_uint16, pcre_uint32, and uschar. Removed the dead and not-all-working code for LINK_SIZE values other than 2, although we aim to keep the abstraction working. Removed the OP_LENGTHS macro.
(put2ByteValue): Replaces put2ByteOpcodeValueAtOffset.
(get2ByteValue): Replaces get2ByteOpcodeValueAtOffset.
(put2ByteValueAndAdvance): Replaces put2ByteOpcodeValueAtOffsetAndAdvance.
(putLinkValueAllowZero): Replaces putOpcodeValueAtOffset; doesn't do the addition, since a comma is really no better than a plus sign. Added an assertion to catch out-of-range values and changed the parameter type to int rather than unsigned.
(getLinkValueAllowZero): Replaces getOpcodeValueAtOffset.
(putLinkValue): New function that most former callers of the putOpcodeValueAtOffset function can use; asserts the value that is being stored is non-zero and then calls putLinkValueAllowZero.
(getLinkValue): Ditto.
(putLinkValueAndAdvance): Replaces putOpcodeValueAtOffsetAndAdvance. No caller was using an offset, which makes sense given the advancing behavior.
(putLinkValueAllowZeroAndAdvance): Ditto.
(isBracketOpcode): Added. For use in an assertion.
(advanceToEndOfBracket): Renamed from moveOpcodePtrPastAnyAlternateBranches, and removed comments about how it's not well designed. This function takes a pointer to the beginning of a bracket and advances to the end of the bracket.
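
The new helper design described for pcre_internal.h could look roughly like the sketch below. The function names come from the entry above, but the bodies are assumptions: the high-byte-first storage order, the use of assert from <cassert> in place of WebKit's ASSERT, and the opcode numbers in the enum are all chosen for illustration and may not match the real header. The loop in advanceToEndOfBracket mirrors the do/while over OP_ALT that this change removes from pcre_compile.cpp.

```cpp
#include <cassert>

// Sketch only. Names follow the commit message; bodies, byte order, and
// opcode numbers are assumptions made for illustration.
namespace sketch {

enum { OP_ALT = 1, OP_KET = 2, OP_BRA = 3 }; // hypothetical opcode values

// Store a 16-bit value, high byte first (assumed byte order).
inline void put2ByteValue(unsigned char* opcodePtr, int value)
{
    assert(value >= 0 && value <= 0xFFFF);
    opcodePtr[0] = static_cast<unsigned char>(value >> 8);
    opcodePtr[1] = static_cast<unsigned char>(value & 0xFF);
}

inline int get2ByteValue(const unsigned char* opcodePtr)
{
    return (opcodePtr[0] << 8) | opcodePtr[1];
}

// Takes the pointer by reference so the caller's write position moves forward.
inline void put2ByteValueAndAdvance(unsigned char*& opcodePtr, int value)
{
    put2ByteValue(opcodePtr, value);
    opcodePtr += 2;
}

// With LINK_SIZE fixed at 2, a link value is stored as a 2-byte value.
inline void putLinkValueAllowZero(unsigned char* opcodePtr, int value)
{
    put2ByteValue(opcodePtr, value);
}

inline int getLinkValueAllowZero(const unsigned char* opcodePtr)
{
    return get2ByteValue(opcodePtr);
}

// Common entry points: a zero link is treated as a caller bug.
inline void putLinkValue(unsigned char* opcodePtr, int value)
{
    assert(value);
    putLinkValueAllowZero(opcodePtr, value);
}

inline int getLinkValue(const unsigned char* opcodePtr)
{
    int value = getLinkValueAllowZero(opcodePtr);
    assert(value);
    return value;
}

// Follow the link stored after each opcode until the opcode reached is not OP_ALT.
inline void advanceToEndOfBracket(const unsigned char*& opcodePtr)
{
    do
        opcodePtr += getLinkValue(opcodePtr + 1);
    while (*opcodePtr == OP_ALT);
}

} // namespace sketch
```

Callers pass the address of the value itself (for example previous + 1) rather than a base pointer plus an offset argument, which matches the call sites in the diff such as putLinkValue(previous + 1, code - previous).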

pcre/pcre_tables.cpp: Updated names.
pcre/pcre_ucp_searchfuncs.cpp: (kjs_pcre_ucp_othercase): Ditto.
pcre/pcre_xclass.cpp: (getUTF8CharAndAdvancePointer): Ditto. (kjs_pcre_xclass): Ditto.
pcre/ucpinternal.h: Ditto.

wtf/ASCIICType.h:
(WTF::isASCIIAlpha): Added an int overload, like the one we already have for isASCIIDigit.
(WTF::isASCIIAlphanumeric): Ditto.
(WTF::isASCIIHexDigit): Ditto.
(WTF::isASCIILower): Ditto.
(WTF::isASCIISpace): Ditto.
(WTF::toASCIIUpper): Ditto.
(WTF::toASCIILower): Ditto.
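
As a rough illustration of the int overloads and of how checkEscape uses toASCIIUpper for the \cx escape, here is a minimal sketch. The functions live in a hypothetical sketch namespace and are simplified stand-ins, not the code in wtf/ASCIICType.h; controlEscapeValue is an invented name for the single expression toASCIIUpper(c) ^ 0x40 that appears in the diff. The presumed benefit of the int overloads is that pattern characters, which are wider than char, can be classified without being narrowed first.

```cpp
#include <cassert>

// Sketch only: simplified stand-ins for the WTF helpers, not the real header.
namespace sketch {

inline bool isASCIILower(char c) { return c >= 'a' && c <= 'z'; }
inline bool isASCIILower(int c) { return c >= 'a' && c <= 'z'; }

inline char toASCIIUpper(char c)
{
    return static_cast<char>(isASCIILower(c) ? c - 0x20 : c);
}

// int overload: values outside the ASCII range pass through unchanged.
inline int toASCIIUpper(int c)
{
    return isASCIILower(c) ? c - 0x20 : c;
}

// Hypothetical helper name. Mirrors the \cx handling in checkEscape:
// a letter is upper-cased, then the 0x40 bit is flipped.
inline int controlEscapeValue(int c)
{
    return toASCIIUpper(c) ^ 0x40;
}

} // namespace sketch
```

Under these assumptions, controlEscapeValue('j') and controlEscapeValue('J') both give 0x0A, and a non-ASCII value such as 0x161 is left alone by toASCIIUpper rather than being truncated to a char.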

File:

1 edited

trunk/JavaScriptCore/pcre/pcre_compile.cpp (modified) (89 diffs)

trunk/JavaScriptCore/pcre/pcre_compile.cpp

-              r28785
+              r28793
 using namespace WTF;
+/* Negative values for the firstchar and reqchar variables */
+#define REQ_UNSET (-2)
+#define REQ_NONE  (-1)
 /*************************************************
 *      Code parameters and static tables         *
 …
 };
-/* Table of sizes for the fixed-length opcodes. It's defined in a macro so that
-the definition is next to the definition of the opcodes in pcre_internal.h. */
-static const uschar OP_lengths[] = { OP_LENGTHS };
 /* The texts of compile-time error messages. These are "char *" because they
 are passed to the outside world. */
 static const char* error_text(ErrorCode code)
+static const char* errorText(ErrorCode code)
+{
     static const char error_texts[] =
+    static const char errorTexts[] =
       /* 1 */
       "\\ at end of pattern\0"
 …
     int i = code;
     const char* text = error_texts;
+    const char* text = errorTexts;
     while (i > 1)
         i -= !*text++;
 …
         needOuterBracket = false;
+    }
     const uschar* start_code;   /* The start of the compiled code */
+    const unsigned char* start_code;   /* The start of the compiled code */
     const UChar* start_pattern; /* The start of the pattern */
     int top_backref;            /* Maximum back reference */
 …
 /* Definitions to allow mutual recursion */
 static bool compileBracket(int, int*, uschar**, const UChar**, const UChar*, ErrorCode*, int, int*, int*, CompileData&);
 static bool bracketIsAnchored(const uschar* code);
 static bool bracketNeedsLineStart(const uschar* code, unsigned captureMap, unsigned backrefMap);
 static int bracketFindFirstAssertedCharacter(const uschar* code, bool inassert);
+static bool compileBracket(int, int*, unsigned char**, const UChar**, const UChar*, ErrorCode*, int, int*, int*, CompileData&);
+static bool bracketIsAnchored(const unsigned char* code);
+static bool bracketNeedsLineStart(const unsigned char* code, unsigned captureMap, unsigned backrefMap);
+static int bracketFindFirstAssertedCharacter(const unsigned char* code, bool inassert);
 /*************************************************
 …
 */
 static int check_escape(const UChar** ptrptr, const UChar* patternEnd, ErrorCode* errorcodeptr, int bracount, bool isclass)
+static int checkEscape(const UChar** ptrptr, const UChar* patternEnd, ErrorCode* errorcodeptr, int bracount, bool isclass)
+{
     const UChar* ptr = *ptrptr + 1;
 …
     } else {
         switch (c) {
+        case '1':
+        case '2':
+        case '3':
+        case '4':
+        case '5':
+        case '6':
+        case '7':
+        case '8':
+        case '9':
+            /* Escape sequences starting with a non-zero digit are backreferences,
+             unless there are insufficient brackets, in which case they are octal
+             escape sequences. Those sequences end on the first non-octal character
+             or when we overflow 0-255, whichever comes first. */
+            if (!isclass) {
+                const UChar* oldptr = ptr;
+                c -= '0';
+                while ((ptr + 1 < patternEnd) && isASCIIDigit(ptr[1]) && c <= bracount)
+                    c = c * 10 + *(++ptr) - '0';
+                if (c <= bracount) {
+                    c = -(ESC_REF + c);
+            case '1':
+            case '2':
+            case '3':
+            case '4':
+            case '5':
+            case '6':
+            case '7':
+            case '8':
+            case '9':
+                /* Escape sequences starting with a non-zero digit are backreferences,
+                 unless there are insufficient brackets, in which case they are octal
+                 escape sequences. Those sequences end on the first non-octal character
+                 or when we overflow 0-255, whichever comes first. */
+                if (!isclass) {
+                    const UChar* oldptr = ptr;
+                    c -= '0';
+                    while ((ptr + 1 < patternEnd) && isASCIIDigit(ptr[1]) && c <= bracount)
+                        c = c * 10 + *(++ptr) - '0';
+                    if (c <= bracount) {
+                        c = -(ESC_REF + c);
+                        break;
+                    }
+                    ptr = oldptr;      /* Put the pointer back and fall through */
+                }
+                /* Handle an octal number following \. If the first digit is 8 or 9,
+                 this is not octal. */
+                if ((c = *ptr) >= '8')
                     break;
+                }
+                ptr = oldptr;      /* Put the pointer back and fall through */
+            }
+            /* Handle an octal number following \. If the first digit is 8 or 9,
+             this is not octal. */
+            if ((c = *ptr) >= '8')
+                break;
             /* \0 always starts an octal number, but we may drop through to here with a
              larger first octal digit. */
+        case '0': {
+            c -= '0';
+            int i;
+            for (i = 1; i <= 2; ++i) {
+                if (ptr + i >= patternEnd || ptr[i] < '0' || ptr[i] > '7')
+                    break;
+                int cc = c * 8 + ptr[i] - '0';
+                if (cc > 255)
+                    break;
+                c = cc;
+            case '0': {
+                c -= '0';
+                int i;
+                for (i = 1; i <= 2; ++i) {
+                    if (ptr + i >= patternEnd || ptr[i] < '0' || ptr[i] > '7')
+                        break;
+                    int cc = c * 8 + ptr[i] - '0';
+                    if (cc > 255)
+                        break;
+                    c = cc;
+                }
+                ptr += i - 1;
+                break;
+            }
+            ptr += i - 1;
+            break;
+        }
+        case 'x': {
+            c = 0;
+            int i;
+            for (i = 1; i <= 2; ++i) {
+                if (ptr + i >= patternEnd || !isASCIIHexDigit(ptr[i])) {
+                    c = 'x';
+                    i = 1;
+                    break;
+                }
+                int cc = ptr[i];
+                if (cc >= 'a')
+                    cc -= 32;             /* Convert to upper case */
+                c = c * 16 + cc - ((cc < 'A') ? '0' : ('A' - 10));
+            case 'x': {
+                c = 0;
+                int i;
+                for (i = 1; i <= 2; ++i) {
+                    if (ptr + i >= patternEnd || !isASCIIHexDigit(ptr[i])) {
+                        c = 'x';
+                        i = 1;
+                        break;
+                    }
+                    int cc = ptr[i];
+                    if (cc >= 'a')
+                        cc -= 32;             /* Convert to upper case */
+                    c = c * 16 + cc - ((cc < 'A') ? '0' : ('A' - 10));
+                }
+                ptr += i - 1;
+                break;
+            }
+            ptr += i - 1;
+            break;
+        }
+        case 'u': {
+            c = 0;
+            int i;
+            for (i = 1; i <= 4; ++i) {
+                if (ptr + i >= patternEnd || !isASCIIHexDigit(ptr[i])) {
+                    c = 'u';
+                    i = 1;
+                    break;
+                }
+                int cc = ptr[i];
+                if (cc >= 'a')
+                    cc -= 32;             /* Convert to upper case */
+                c = c * 16 + cc - ((cc < 'A') ? '0' : ('A' - 10));
+            case 'u': {
+                c = 0;
+                int i;
+                for (i = 1; i <= 4; ++i) {
+                    if (ptr + i >= patternEnd || !isASCIIHexDigit(ptr[i])) {
+                        c = 'u';
+                        i = 1;
+                        break;
+                    }
+                    int cc = ptr[i];
+                    if (cc >= 'a')
+                        cc -= 32;             /* Convert to upper case */
+                    c = c * 16 + cc - ((cc < 'A') ? '0' : ('A' - 10));
+                }
+                ptr += i - 1;
+                break;
+            }
+            ptr += i - 1;
+            break;
+            /* Other special escapes not starting with a digit are straightforward */
+        }
+        case 'c':
+            if (++ptr == patternEnd) {
+                *errorcodeptr = ERR2;
+                return 0;
+            case 'c':
+                if (++ptr == patternEnd) {
+                    *errorcodeptr = ERR2;
+                    return 0;
+                }
+                c = *ptr;
+                /* A letter is upper-cased; then the 0x40 bit is flipped. This coding
+                 is ASCII-specific, but then the whole concept of \cx is ASCII-specific. */
+                c = toASCIIUpper(c) ^ 0x40;
+                break;
+            }
-            c = *ptr;
-            /* A letter is upper-cased; then the 0x40 bit is flipped. This coding
-             is ASCII-specific, but then the whole concept of \cx is ASCII-specific. */
-            if (c >= 'a' && c <= 'z')
-                c -= 32;
-            c ^= 0x40;
-            break;
+        }
+    }
 …
     return c;
+}
 /*************************************************
 …
 */
 static bool is_counted_repeat(const UChar* p, const UChar* patternEnd)
+static bool isCountedRepeat(const UChar* p, const UChar* patternEnd)
+{
     if (p >= patternEnd || !isASCIIDigit(*p))
 …
+}
 /*************************************************
 *         Read repeat counts                     *
 …
 /* Read an item of the form {n,m} and return the values. This is called only
 after is_counted_repeat() has confirmed that a repeat-count quantifier exists,
+after isCountedRepeat() has confirmed that a repeat-count quantifier exists,
 so the syntax is guaranteed to be correct, but we need to check the values.
 …
 */
 static const UChar* read_repeat_counts(const UChar* p, int* minp, int* maxp, ErrorCode* errorcodeptr)
+static const UChar* readRepeatCounts(const UChar* p, int* minp, int* maxp, ErrorCode* errorcodeptr)
+{
     int min = 0;
 …
+}
 /*************************************************
 *      Find first significant op code            *
 …
 /* This is called by several functions that scan a compiled expression looking
 for a fixed first character, or an anchoring op code etc. It skips over things
+that do not influence this. For some calls, a change of option is important.
+For some calls, it makes sense to skip negative forward and all backward
+assertions, and also the \b assertion; for others it does not.
+that do not influence this.
 Arguments:
   code         pointer to the start of the group
-  skipassert   true if certain assertions are to be skipped
 Returns:       pointer to the first significant opcode
 */
 static const uschar* firstSignificantOpCode(const uschar* code)
+static const unsigned char* firstSignificantOpcode(const unsigned char* code)
+{
     while (*code == OP_BRANUMBER)
         code += OP_lengths[*code];
+        code += 3;
     return code;
+}
 static const uschar* firstSignificantOpCodeSkippingAssertions(const uschar* code)
+static const unsigned char* firstSignificantOpcodeSkippingAssertions(const unsigned char* code)
+{
     while (true) {
         switch (*code) {
         case OP_ASSERT_NOT:
             do {
                 code += getOpcodeValueAtOffset(code, 1);
             } while (*code == OP_ALT);
             code += OP_lengths[*code];
             break;
         case OP_WORD_BOUNDARY:
         case OP_NOT_WORD_BOUNDARY:
         case OP_BRANUMBER:
             code += OP_lengths[*code];
             break;
         default:
             return code;
+            case OP_ASSERT_NOT:
+                advanceToEndOfBracket(code);
+                code += 1 + LINK_SIZE;
+                break;
+            case OP_WORD_BOUNDARY:
+            case OP_NOT_WORD_BOUNDARY:
+                ++code;
+                break;
+            case OP_BRANUMBER:
+                code += 3;
+                break;
+            default:
+                return code;
+        }
+    }
-    ASSERT_NOT_REACHED();
+}
-/*************************************************
-*        Find the fixed length of a pattern      *
-*************************************************/
-/* Scan a pattern and compute the fixed length of subject that will match it,
-if the length is fixed. This is needed for dealing with backward assertions.
-In UTF8 mode, the result is in characters rather than bytes.
-Arguments:
-  code     points to the start of the pattern (the bracket)
-  options  the compiling options
-Returns:   the fixed length, or -1 if there is no fixed length,
-             or -2 if \C was encountered
-*/
-static int find_fixedlength(uschar* code, int options)
+{
-    int length = -1;
-    int branchlength = 0;
-    uschar* cc = code + 1 + LINK_SIZE;
-    /* Scan along the opcodes for this branch. If we get to the end of the
-     branch, check the length against that of the other branches. */
-    while (true) {
-        int d;
-        int op = *cc;
-        if (op >= OP_BRA)
-            op = OP_BRA;
-        switch (op) {
-            case OP_BRA:
-            case OP_ONCE:
-                d = find_fixedlength(cc, options);
-                if (d < 0)
-                    return d;
-                branchlength += d;
-                do {
-                    cc += getOpcodeValueAtOffset(cc, 1);
-                } while (*cc == OP_ALT);
-                cc += 1 + LINK_SIZE;
-                break;
-                /* Reached end of a branch; if it's a ket it is the end of a nested
-                 call. If it's ALT it is an alternation in a nested call. If it is
-                 END it's the end of the outer call. All can be handled by the same code. */
-            case OP_ALT:
-            case OP_KET:
-            case OP_KETRMAX:
-            case OP_KETRMIN:
-            case OP_END:
-                if (length < 0)
-                    length = branchlength;
-                else if (length != branchlength)
-                    return -1;
-                if (*cc != OP_ALT)
-                    return length;
-                cc += 1 + LINK_SIZE;
-                branchlength = 0;
-                break;
-                /* Skip over assertive subpatterns */
-            case OP_ASSERT:
-            case OP_ASSERT_NOT:
-                do {
-                    cc += getOpcodeValueAtOffset(cc, 1);
-                } while (*cc == OP_ALT);
-                /* Fall through */
-                /* Skip over things that don't match chars */
-            case OP_BRANUMBER:
-            case OP_CIRC:
-            case OP_DOLL:
-            case OP_NOT_WORD_BOUNDARY:
-            case OP_WORD_BOUNDARY:
-                cc += OP_lengths[*cc];
-                break;
-                /* Handle literal characters */
-            case OP_CHAR:
-            case OP_CHAR_IGNORING_CASE:
-            case OP_NOT:
-                branchlength++;
-                cc += 2;
-                while ((*cc & 0xc0) == 0x80)
-                    cc++;
-                break;
-            case OP_ASCII_CHAR:
-            case OP_ASCII_LETTER_IGNORING_CASE:
-                branchlength++;
-                cc += 2;
-                break;
-                /* Handle exact repetitions. The count is already in characters, but we
-                 need to skip over a multibyte character in UTF8 mode.  */
-            case OP_EXACT:
-                branchlength += get2ByteOpcodeValueAtOffset(cc,1);
-                cc += 4;
-                while((*cc & 0x80) == 0x80)
-                    cc++;
-                break;
-            case OP_TYPEEXACT:
-                branchlength += get2ByteOpcodeValueAtOffset(cc,1);
-                cc += 4;
-                break;
-                /* Handle single-char matchers */
-            case OP_NOT_DIGIT:
-            case OP_DIGIT:
-            case OP_NOT_WHITESPACE:
-            case OP_WHITESPACE:
-            case OP_NOT_WORDCHAR:
-            case OP_WORDCHAR:
-            case OP_NOT_NEWLINE:
-                branchlength++;
-                cc++;
-                break;
-                /* Check a class for variable quantification */
-            case OP_XCLASS:
-                cc += getOpcodeValueAtOffset(cc, 1) - 33;
-                /* Fall through */
-            case OP_CLASS:
-            case OP_NCLASS:
-                cc += 33;
-                switch (*cc) {
-                case OP_CRSTAR:
-                case OP_CRMINSTAR:
-                case OP_CRQUERY:
-                case OP_CRMINQUERY:
-                    return -1;
-                case OP_CRRANGE:
-                case OP_CRMINRANGE:
-                    if (get2ByteOpcodeValueAtOffset(cc, 1) != get2ByteOpcodeValueAtOffset(cc, 3))
-                        return -1;
-                    branchlength += get2ByteOpcodeValueAtOffset(cc, 1);
-                    cc += 5;
-                    break;
-                default:
-                    branchlength++;
+                }
-                break;
-                /* Anything else is variable length */
-            default:
-                return -1;
+        }
+    }
-    ASSERT_NOT_REACHED();
+}
-/*************************************************
-*         Complete a callout item                *
-*************************************************/
-/* A callout item contains the length of the next item in the pattern, which
-we can't fill in till after we have reached the relevant point. This is used
-for both automatic and manual callouts.
-Arguments:
-  previous_callout   points to previous callout item
-  ptr                current pattern pointer
-  cd                 pointers to tables etc
-*/
-static void complete_callout(uschar* previous_callout, const UChar* ptr, const CompileData& cd)
+{
-    int length = ptr - cd.start_pattern - getOpcodeValueAtOffset(previous_callout, 2);
-    putOpcodeValueAtOffset(previous_callout, 2 + LINK_SIZE, length);
+}
 /*************************************************
 …
 */
 static bool get_othercase_range(int* cptr, int d, int* ocptr, int* odptr)
+static bool getOthercaseRange(int* cptr, int d, int* ocptr, int* odptr)
+{
     int c, othercase = 0;
     for (c = *cptr; c <= d; c++) {
         if ((othercase = _pcre_ucp_othercase(c)) >= 0)
+        if ((othercase = kjs_pcre_ucp_othercase(c)) >= 0)
             break;
+    }
 …
     for (++c; c <= d; c++) {
         if (_pcre_ucp_othercase(c) != next)
+        if (kjs_pcre_ucp_othercase(c) != next)
             break;
         next++;
 …
  */
+// FIXME: This should be removed as soon as all UTF8 uses are removed from PCRE
+int _pcre_ord2utf8(int cvalue, uschar *buffer)
+static int encodeUTF8(int cvalue, unsigned char *buffer)
+{
     int i;
     for (i = 0; i < _pcre_utf8_table1_size; i++)
         if (cvalue <= _pcre_utf8_table1[i])
+    for (i = 0; i < kjs_pcre_utf8_table1_size; i++)
+        if (cvalue <= kjs_pcre_utf8_table1[i])
             break;
     buffer += i;
 …
         cvalue >>= 6;
+    }
     *buffer = _pcre_utf8_table2[i] | cvalue;
+    *buffer = kjs_pcre_utf8_table2[i] | cvalue;
     return i + 1;
+}
 …
 static bool
 compileBranch(int options, int* brackets, uschar** codeptr,
+compileBranch(int options, int* brackets, unsigned char** codeptr,
                const UChar** ptrptr, const UChar* patternEnd, ErrorCode* errorcodeptr, int *firstbyteptr,
                int* reqbyteptr, CompileData& cd)
 …
     int bravalue = 0;
     int reqvary, tempreqvary;
-    int after_manual_callout = 0;
     int c;
     uschar* code = *codeptr;
     uschar* tempcode;
+    unsigned char* code = *codeptr;
+    unsigned char* tempcode;
     bool groupsetfirstbyte = false;
     const UChar* ptr = *ptrptr;
     const UChar* tempptr;
+    uschar* previous = NULL;
+    uschar* previous_callout = NULL;
+    uschar classbits[32];
+    unsigned char* previous = NULL;
+    unsigned char classbits[32];
     bool class_utf8;
     uschar* class_utf8data;
     uschar utf8_char[6];
+    unsigned char* class_utf8data;
+    unsigned char utf8_char[6];
     /* Initialize no first byte, no required byte. REQ_UNSET means "no char
 …
         int subfirstbyte;
         int mclength;
         uschar mcbuffer[8];
+        unsigned char mcbuffer[8];
         /* Next byte in the pattern */
 …
          a quantifier. */
+        bool is_quantifier = c == '*' || c == '+' || c == '?' || (c == '{' && is_counted_repeat(ptr + 1, patternEnd));
+        if (!is_quantifier && previous_callout && after_manual_callout-- <= 0) {
+            complete_callout(previous_callout, ptr, cd);
+            previous_callout = NULL;
+        }
+        bool is_quantifier = c == '*' || c == '+' || c == '?' || (c == '{' && isCountedRepeat(ptr + 1, patternEnd));
         switch (c) {
 …
                  bit map. */
                 memset(classbits, 0, 32 * sizeof(uschar));
+                memset(classbits, 0, 32 * sizeof(unsigned char));
                 /* Process characters until ] is reached. The first pass
 …
                     if (c == '\\') {
                         c = check_escape(&ptr, patternEnd, errorcodeptr, *brackets, true);
+                        c = checkEscape(&ptr, patternEnd, errorcodeptr, *brackets, true);
                         if (c < 0) {
                             class_charcount += 2;     /* Greater than 1 is what matters */
 …
                         if (d == '\\') {
                             const UChar* oldptr = ptr;
                             d = check_escape(&ptr, patternEnd, errorcodeptr, *brackets, true);
+                            d = checkEscape(&ptr, patternEnd, errorcodeptr, *brackets, true);
                             /* \X is literal X; any other special means the '-' was literal */
 …
                                 int cc = c;
                                 int origd = d;
                                 while (get_othercase_range(&cc, origd, &occ, &ocd)) {
+                                while (getOthercaseRange(&cc, origd, &occ, &ocd)) {
                                     if (occ >= c && ocd <= d)
                                         continue;  /* Skip embedded ranges */
 …
                                     else {
                                         *class_utf8data++ = XCL_RANGE;
                                         class_utf8data += _pcre_ord2utf8(occ, class_utf8data);
+                                        class_utf8data += encodeUTF8(occ, class_utf8data);
+                                    }
                                     class_utf8data += _pcre_ord2utf8(ocd, class_utf8data);
+                                    class_utf8data += encodeUTF8(ocd, class_utf8data);
+                                }
+                            }
 …
                             *class_utf8data++ = XCL_RANGE;
                             class_utf8data += _pcre_ord2utf8(c, class_utf8data);
                             class_utf8data += _pcre_ord2utf8(d, class_utf8data);
+                            class_utf8data += encodeUTF8(c, class_utf8data);
+                            class_utf8data += encodeUTF8(d, class_utf8data);
                             /* With UCP support, we are done. Without UCP support, there is no
 …
                         class_utf8 = true;
                         *class_utf8data++ = XCL_SINGLE;
                         class_utf8data += _pcre_ord2utf8(c, class_utf8data);
+                        class_utf8data += encodeUTF8(c, class_utf8data);
                         if (options & IgnoreCaseOption) {
                             int othercase;
                             if ((othercase = _pcre_ucp_othercase(c)) >= 0) {
+                            if ((othercase = kjs_pcre_ucp_othercase(c)) >= 0) {
                                 *class_utf8data++ = XCL_SINGLE;
                                 class_utf8data += _pcre_ord2utf8(othercase, class_utf8data);
+                                class_utf8data += encodeUTF8(othercase, class_utf8data);
+                            }
+                        }
 …
                     /* Now fill in the complete length of the item */
                     putOpcodeValueAtOffset(previous, 1, code - previous);
+                    putLinkValue(previous + 1, code - previous);
                     break;   /* End of class handling */
+                }
 …
                 if (!is_quantifier)
                     goto NORMAL_CHAR;
                 ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorcodeptr);
+                ptr = readRepeatCounts(ptr + 1, &repeat_min, &repeat_max, errorcodeptr);
                 if (*errorcodeptr)
                     goto FAILED;
 …
                 /* Save start of previous item, in case we have to move it up to make space
                  for an inserted OP_ONCE for the additional '+' extension. */
+                /* FIXME: Probably don't need this because we don't use OP_ONCE. */
                 tempcode = previous;
 …
                     if (code[-1] & 0x80) {
                         uschar *lastchar = code - 1;
+                        unsigned char *lastchar = code - 1;
                         while((*lastchar & 0xc0) == 0x80)
                             lastchar--;
 …
                     int prop_value = -1;
                     uschar* oldcode = code;
+                    unsigned char* oldcode = code;
                     code = previous;                  /* Usually overwrite previous item */
 …
                         else {
                             *code++ = OP_UPTO + repeat_type;
                             put2ByteOpcodeValueAtOffsetAndAdvance(code, 0, repeat_max);
+                            put2ByteValueAndAdvance(code, repeat_max);
+                        }
+                    }
 …
                                 goto END_REPEAT;
                             *code++ = OP_UPTO + repeat_type;
                             put2ByteOpcodeValueAtOffsetAndAdvance(code, 0, repeat_max - 1);
+                            put2ByteValueAndAdvance(code, repeat_max - 1);
+                        }
+                    }
 …
                     else {
                         *code++ = OP_EXACT + op_type;  /* NB EXACT doesn't have repeat_type */
                         put2ByteOpcodeValueAtOffsetAndAdvance(code, 0, repeat_min);
+                        put2ByteValueAndAdvance(code, repeat_min);
                         /* If the maximum is unlimited, insert an OP_STAR. Before doing so,
 …
                             repeat_max -= repeat_min;
                             *code++ = OP_UPTO + repeat_type;
                             put2ByteOpcodeValueAtOffsetAndAdvance(code, 0, repeat_max);
+                            put2ByteValueAndAdvance(code, repeat_max);
+                        }
+                    }
 …
                     else {
                         *code++ = OP_CRRANGE + repeat_type;
                         put2ByteOpcodeValueAtOffsetAndAdvance(code, 0, repeat_min);
+                        put2ByteValueAndAdvance(code, repeat_min);
                         if (repeat_max == -1)
                             repeat_max = 0;  /* 2-byte encoding for max */
                         put2ByteOpcodeValueAtOffsetAndAdvance(code, 0, repeat_max);
+                        put2ByteValueAndAdvance(code, repeat_max);
+                    }
+                }
 …
                  cases. */
                 else if (*previous >= OP_BRA || *previous == OP_ONCE) {
+                else if (*previous >= OP_BRA) {
                     int ketoffset = 0;
                     int len = code - previous;
-                    uschar* bralink = NULL;
+                    unsigned char* bralink = NULL;
                     /* If the maximum repeat count is unlimited, find the end of the bracket
 …
                     if (repeat_max == -1) {
-                        uschar* ket = previous;
-                        do {
-                            ket += getOpcodeValueAtOffset(ket, 1);
-                        } while (*ket != OP_KET);
+                        const unsigned char* ket = previous;
+                        advanceToEndOfBracket(ket);
                         ketoffset = code - ket;
+                    }
 …
                             int offset = (!bralink) ? 0 : previous - bralink;
                             bralink = previous;
-                            putOpcodeValueAtOffsetAndAdvance(previous, 0, offset);
+                            putLinkValueAllowZeroAndAdvance(previous, offset);
+                        }
 …
                                 int offset = (!bralink) ? 0 : code - bralink;
                                 bralink = code;
-                                putOpcodeValueAtOffsetAndAdvance(code, 0, offset);
+                                putLinkValueAllowZeroAndAdvance(code, offset);
+                            }
 …
                         while (bralink) {
                             int offset = code - bralink + 1;
-                            uschar* bra = code - offset;
-                            int oldlinkoffset = getOpcodeValueAtOffset(bra, 1);
-                            bralink = oldlinkoffset ? bralink - oldlinkoffset : 0;
+                            unsigned char* bra = code - offset;
+                            int oldlinkoffset = getLinkValueAllowZero(bra + 1);
+                            bralink = (!oldlinkoffset) ? 0 : bralink - oldlinkoffset;
                             *code++ = OP_KET;
-                            putOpcodeValueAtOffsetAndAdvance(code, 0, offset);
-                            putOpcodeValueAtOffset(bra, 1, offset);
+                            putLinkValueAndAdvance(code, offset);
+                            putLinkValue(bra + 1, offset);
+                        }
+                    }
 …
                 if (*(++ptr) == '?') {
                     switch (*(++ptr)) {
-                    case ':':                 /* Non-extracting bracket */
-                        bravalue = OP_BRA;
-                        ptr++;
-                        break;
-                    case '=':                 /* Positive lookahead */
-                        bravalue = OP_ASSERT;
-                        ptr++;
-                        break;
-                    case '!':                 /* Negative lookahead */
-                        bravalue = OP_ASSERT_NOT;
-                        ptr++;
-                        break;
+                        case ':':                 /* Non-extracting bracket */
+                            bravalue = OP_BRA;
+                            ptr++;
+                            break;
+                        case '=':                 /* Positive lookahead */
+                            bravalue = OP_ASSERT;
+                            ptr++;
+                            break;
+                        case '!':                 /* Negative lookahead */
+                            bravalue = OP_ASSERT_NOT;
+                            ptr++;
+                            break;
                         /* Character after (? not specially recognized */
-                    default:                  /* Option setting */
-                        *errorcodeptr = ERR12;
-                        goto FAILED;
+                    }
+                        default:
+                            *errorcodeptr = ERR12;
+                            goto FAILED;
+                        }
+                }
 …
                         bravalue = OP_BRA + EXTRACT_BASIC_MAX + 1;
                         code[1 + LINK_SIZE] = OP_BRANUMBER;
-                        put2ByteOpcodeValueAtOffset(code, 2+LINK_SIZE, *brackets);
+                        put2ByteValue(code + 2 + LINK_SIZE, *brackets);
                         skipbytes = 3;
+                    }
 …
                  new setting for the ims options if they have changed. */
-                previous = (bravalue >= OP_ONCE) ? code : 0;
+                previous = (bravalue >= OP_BRAZERO) ? code : 0;
                 *code = bravalue;
                 tempcode = code;
 …
                 groupsetfirstbyte = false;
-                if (bravalue >= OP_BRA || bravalue == OP_ONCE) {
+                if (bravalue >= OP_BRA) {
                     /* If we have not yet set a firstbyte in this branch, take it from the
                      subpattern, remembering that it was set here so that a repeat of more
 …
             case '\\':
                 tempptr = ptr;
-                c = check_escape(&ptr, patternEnd, errorcodeptr, *brackets, false);
+                c = checkEscape(&ptr, patternEnd, errorcodeptr, *brackets, false);
                 /* Handle metacharacters introduced by \. For ones like \d, the ESC_ values
 …
                         previous = code;
                         *code++ = OP_REF;
-                        put2ByteOpcodeValueAtOffsetAndAdvance(code, 0, number);
+                        put2ByteValueAndAdvance(code, number);
+                    }
 …
+                    }
                 } else {
-                    mclength = _pcre_ord2utf8(c, mcbuffer);
+                    mclength = encodeUTF8(c, mcbuffer);
                     *code++ = (options & IgnoreCaseOption) ? OP_CHAR_IGNORING_CASE : OP_CHAR;
 …
     return false;
+}
 /*************************************************
 …
 static bool
-compileBracket(int options, int* brackets, uschar** codeptr,
+compileBracket(int options, int* brackets, unsigned char** codeptr,
     const UChar** ptrptr, const UChar* patternEnd, ErrorCode* errorcodeptr, int skipbytes,
     int* firstbyteptr, int* reqbyteptr, CompileData& cd)
+{
     const UChar* ptr = *ptrptr;
-    uschar* code = *codeptr;
-    uschar* last_branch = code;
-    uschar* start_bracket = code;
+    unsigned char* code = *codeptr;
+    unsigned char* last_branch = code;
+    unsigned char* start_bracket = code;
     int firstbyte = REQ_UNSET;
     int reqbyte = REQ_UNSET;
 …
     /* Offset is set zero to mark that this bracket is still open */
-    putOpcodeValueAtOffset(code, 1, 0);
+    putLinkValueAllowZero(code + 1, 0);
     code += 1 + LINK_SIZE + skipbytes;
 …
             int length = code - last_branch;
             do {
-                int prev_length = getOpcodeValueAtOffset(last_branch, 1);
-                putOpcodeValueAtOffset(last_branch, 1, length);
+                int prev_length = getLinkValueAllowZero(last_branch + 1);
+                putLinkValue(last_branch + 1, length);
                 length = prev_length;
                 last_branch -= length;
 …
             *code = OP_KET;
-            putOpcodeValueAtOffset(code, 1, code - start_bracket);
+            putLinkValue(code + 1, code - start_bracket);
             code += 1 + LINK_SIZE;
 …
         *code = OP_ALT;
-        putOpcodeValueAtOffset(code, 1, code - last_branch);
+        putLinkValue(code + 1, code - last_branch);
         last_branch = code;
         code += 1 + LINK_SIZE;
 …
     ASSERT_NOT_REACHED();
+}
 /*************************************************
 …
 */
-static bool branchIsAnchored(const uschar* code)
+static bool branchIsAnchored(const unsigned char* code)
+{
-    const uschar* scode = firstSignificantOpCode(code);
+    const unsigned char* scode = firstSignificantOpcode(code);
     int op = *scode;
     /* Brackets */
-    if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE)
+    if (op >= OP_BRA || op == OP_ASSERT)
         return bracketIsAnchored(scode);
 …
+}
-static bool bracketIsAnchored(const uschar* code)
+static bool bracketIsAnchored(const unsigned char* code)
+{
     do {
         if (!branchIsAnchored(code + 1 + LINK_SIZE))
             return false;
-        code += getOpcodeValueAtOffset(code, 1);
+        code += getLinkValue(code + 1);
     } while (*code == OP_ALT);   /* Loop for each alternative */
     return true;
 …
 */
-static bool branchNeedsLineStart(const uschar* code, unsigned captureMap, unsigned backrefMap)
+static bool branchNeedsLineStart(const unsigned char* code, unsigned captureMap, unsigned backrefMap)
+{
-    const uschar* scode = firstSignificantOpCode(code);
+    const unsigned char* scode = firstSignificantOpcode(code);
     int op = *scode;
 …
         int captureNum = op - OP_BRA;
         if (captureNum > EXTRACT_BASIC_MAX)
-            captureNum = get2ByteOpcodeValueAtOffset(scode, 2 + LINK_SIZE);
+            captureNum = get2ByteValue(scode + 2 + LINK_SIZE);
         int bracketMask = (captureNum < 32) ? (1 << captureNum) : 1;
         return bracketNeedsLineStart(scode, captureMap | bracketMask, backrefMap);
 …
     /* Other brackets */
-    if (op == OP_BRA || op == OP_ASSERT || op == OP_ONCE)
+    if (op == OP_BRA || op == OP_ASSERT)
         return bracketNeedsLineStart(scode, captureMap, backrefMap);
 …
+}
-static bool bracketNeedsLineStart(const uschar* code, unsigned captureMap, unsigned backrefMap)
+static bool bracketNeedsLineStart(const unsigned char* code, unsigned captureMap, unsigned backrefMap)
+{
     do {
         if (!branchNeedsLineStart(code + 1 + LINK_SIZE, captureMap, backrefMap))
             return false;
-        code += getOpcodeValueAtOffset(code, 1);
+        code += getLinkValue(code + 1);
     } while (*code == OP_ALT);  /* Loop for each alternative */
     return true;
 …
 */
-static int branchFindFirstAssertedCharacter(const uschar* code, bool inassert)
+static int branchFindFirstAssertedCharacter(const unsigned char* code, bool inassert)
+{
-    const uschar* scode = firstSignificantOpCodeSkippingAssertions(code);
+    const unsigned char* scode = firstSignificantOpcodeSkippingAssertions(code);
     int op = *scode;
 …
         case OP_BRA:
         case OP_ASSERT:
-        case OP_ONCE:
             return bracketFindFirstAssertedCharacter(scode, op == OP_ASSERT);
 …
+}
-static int bracketFindFirstAssertedCharacter(const uschar* code, bool inassert)
+static int bracketFindFirstAssertedCharacter(const unsigned char* code, bool inassert)
+{
     int c = -1;
 …
         else if (c != d)
             return -1;
-        code += getOpcodeValueAtOffset(code, 1);
+        code += getLinkValue(code + 1);
     } while (*code == OP_ALT);
     return c;
 …
     unsigned brastackptr = 0;
     int brastack[BRASTACK_SIZE];
-    uschar bralenstack[BRASTACK_SIZE];
+    unsigned char bralenstack[BRASTACK_SIZE];
     int bracount = 0;
 …
             case '\\':
-                c = check_escape(&ptr, patternEnd, &errorcode, bracount, false);
+                c = checkEscape(&ptr, patternEnd, &errorcode, bracount, false);
                 if (errorcode != 0)
                     return -1;
 …
                     if (c > 127) {
                         int i;
-                        for (i = 0; i < _pcre_utf8_table1_size; i++)
-                            if (c <= _pcre_utf8_table1[i]) break;
+                        for (i = 0; i < kjs_pcre_utf8_table1_size; i++)
+                            if (c <= kjs_pcre_utf8_table1[i]) break;
                         length += i;
                         lastitemlength += i;
 …
                         cd.top_backref = refnum;
                     length += 2;   /* For single back reference */
-                    if (safelyCheckNextChar(ptr, patternEnd, '{') && is_counted_repeat(ptr + 2, patternEnd)) {
-                        ptr = read_repeat_counts(ptr + 2, &minRepeats, &maxRepeats, &errorcode);
+                    if (safelyCheckNextChar(ptr, patternEnd, '{') && isCountedRepeat(ptr + 2, patternEnd)) {
+                        ptr = readRepeatCounts(ptr + 2, &minRepeats, &maxRepeats, &errorcode);
                         if (errorcode)
                             return -1;
 …
             case '{':
-                if (!is_counted_repeat(ptr+1, patternEnd))
+                if (!isCountedRepeat(ptr + 1, patternEnd))
                     goto NORMAL_CHAR;
-                ptr = read_repeat_counts(ptr+1, &minRepeats, &maxRepeats, &errorcode);
+                ptr = readRepeatCounts(ptr + 1, &minRepeats, &maxRepeats, &errorcode);
                 if (errorcode != 0)
                     return -1;
 …
                     if (*ptr == '\\') {
-                        c = check_escape(&ptr, patternEnd, &errorcode, bracount, true);
+                        c = checkEscape(&ptr, patternEnd, &errorcode, bracount, true);
                         if (errorcode != 0)
                             return -1;
 …
                             if (safelyCheckNextChar(ptr, patternEnd, '\\')) {
                                 ptr++;
-                                d = check_escape(&ptr, patternEnd, &errorcode, bracount, true);
+                                d = checkEscape(&ptr, patternEnd, &errorcode, bracount, true);
                                 if (errorcode != 0)
                                     return -1;
 …
                             if ((d > 255 || (ignoreCase && d > 127))) {
-                                uschar buffer[6];
+                                unsigned char buffer[6];
                                 if (!class_utf8)         /* Allow for XCLASS overhead */
+                                {
 …
                                     int cc = c;
                                     int origd = d;
-                                    while (get_othercase_range(&cc, origd, &occ, &ocd)) {
+                                    while (getOthercaseRange(&cc, origd, &occ, &ocd)) {
                                         if (occ >= c && ocd <= d)
                                             continue;   /* Skip embedded */
 …
                                         /* An extra item is needed */
-                                        length += 1 + _pcre_ord2utf8(occ, buffer) +
-                                        ((occ == ocd) ? 0 : _pcre_ord2utf8(ocd, buffer));
+                                        length += 1 + encodeUTF8(occ, buffer) +
+                                        ((occ == ocd) ? 0 : encodeUTF8(ocd, buffer));
+                                    }
+                                }
 …
                                 /* The length of the (possibly extended) range */
-                                length += 1 + _pcre_ord2utf8(c, buffer) + _pcre_ord2utf8(d, buffer);
+                                length += 1 + encodeUTF8(c, buffer) + encodeUTF8(d, buffer);
+                            }
 …
                         else {
                             if ((c > 255 || (ignoreCase && c > 127))) {
-                                uschar buffer[6];
+                                unsigned char buffer[6];
                                 class_optcount = 10;     /* Ensure > 1 */
                                 if (!class_utf8)         /* Allow for XCLASS overhead */
 …
                                     length += LINK_SIZE + 2;
+                                }
-                                length += (ignoreCase ? 2 : 1) * (1 + _pcre_ord2utf8(c, buffer));
+                                length += (ignoreCase ? 2 : 1) * (1 + encodeUTF8(c, buffer));
+                            }
+                        }
 …
                      we also need extra for wrapping the whole thing in a sub-pattern. */
-                    if (safelyCheckNextChar(ptr, patternEnd, '{') && is_counted_repeat(ptr+2, patternEnd)) {
-                        ptr = read_repeat_counts(ptr+2, &minRepeats, &maxRepeats, &errorcode);
+                    if (safelyCheckNextChar(ptr, patternEnd, '{') && isCountedRepeat(ptr + 2, patternEnd)) {
+                        ptr = readRepeatCounts(ptr + 2, &minRepeats, &maxRepeats, &errorcode);
                         if (errorcode != 0)
                             return -1;
 …
                 if (safelyCheckNextChar(ptr, patternEnd, '?')) {
                     switch (c = (ptr + 2 < patternEnd ? ptr[2] : 0)) {
-                            /* Non-referencing groups and lookaheads just move the pointer on, and
-                             then behave like a non-special bracket, except that they don't increment
-                             the count of extracting brackets. Ditto for the "once only" bracket,
-                             which is in Perl from version 5.005. */
+                        /* Non-referencing groups and lookaheads just move the pointer on, and
+                         then behave like a non-special bracket, except that they don't increment
+                         the count of extracting brackets. Ditto for the "once only" bracket,
+                         which is in Perl from version 5.005. */
                         case ':':
 …
                             break;
-                            /* Else loop checking valid options until ) is met. Anything else is an
-                             error. If we are without any brackets, i.e. at top level, the settings
-                             act as if specified in the options, so massage the options immediately.
-                             This is for backward compatibility with Perl 5.004. */
+                        /* Else loop checking valid options until ) is met. Anything else is an
+                         error. If we are without any brackets, i.e. at top level, the settings
+                         act as if specified in the options, so massage the options immediately.
+                         This is for backward compatibility with Perl 5.004. */
                         default:
 …
                     duplength = 0;
-                /* Leave ptr at the final char; for read_repeat_counts this happens
+                /* Leave ptr at the final char; for readRepeatCounts this happens
                  automatically; for the others we need an increment. */
-                if ((ptr + 1 < patternEnd) && (c = ptr[1]) == '{' && is_counted_repeat(ptr+2, patternEnd)) {
-                    ptr = read_repeat_counts(ptr+2, &minRepeats, &maxRepeats, &errorcode);
+                if ((ptr + 1 < patternEnd) && (c = ptr[1]) == '{' && isCountedRepeat(ptr + 2, patternEnd)) {
+                    ptr = readRepeatCounts(ptr + 2, &minRepeats, &maxRepeats, &errorcode);
                     if (errorcode)
                         return -1;
 …
                 if (c > 127) {
                     int i;
-                    for (i = 0; i < _pcre_utf8_table1_size; i++)
-                        if (c <= _pcre_utf8_table1[i])
+                    for (i = 0; i < kjs_pcre_utf8_table1_size; i++)
+                        if (c <= kjs_pcre_utf8_table1[i])
                             break;
                     length += i;
 …
 */
-static JSRegExp* returnError(ErrorCode errorcode, const char** errorptr)
+static inline JSRegExp* returnError(ErrorCode errorcode, const char** errorptr)
+{
-    *errorptr = error_text(errorcode);
+    *errorptr = errorText(errorcode);
     return 0;
+}
 …
      passed around in the compile data block. */
-    const uschar* codeStart = (const uschar*)(re + 1);
+    const unsigned char* codeStart = (const unsigned char*)(re + 1);
     cd.start_code = codeStart;
     cd.start_pattern = (const UChar*)pattern;
 …
     const UChar* ptr = (const UChar*)pattern;
     const UChar* patternEnd = pattern + patternLength;
-    uschar* code = (uschar*)codeStart;
+    unsigned char* code = (unsigned char*)codeStart;
     int firstbyte, reqbyte;
     int bracketCount = 0;
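
Note on the renamed accessors: the hunks above replace the offset-based calls (put2ByteOpcodeValueAtOffsetAndAdvance, getOpcodeValueAtOffset, putOpcodeValueAtOffset) with pointer-based ones (put2ByteValueAndAdvance, getLinkValue, putLinkValue, advanceToEndOfBracket). Their real definitions are in pcre_internal.h and are not part of this view. The following is only a minimal sketch of how such helpers might look, assuming 2-byte big-endian values and 2-byte links; the opcode numbers are placeholders, and the loop in advanceToEndOfBracket simply mirrors the do/while it replaces in compileBranch.

```cpp
#include <cassert>

// Placeholder opcode numbers for this sketch only; the real ones differ.
enum { OP_KET = 1, OP_ALT = 2, OP_BRA = 3 };

// Store a value in two bytes, most significant byte first (assumed layout).
static inline void put2ByteValue(unsigned char* p, int value)
{
    assert(value >= 0 && value <= 0xFFFF);
    p[0] = static_cast<unsigned char>(value >> 8);
    p[1] = static_cast<unsigned char>(value & 0xFF);
}

// Read back a value written by put2ByteValue.
static inline int get2ByteValue(const unsigned char* p)
{
    return (p[0] << 8) | p[1];
}

// Same as put2ByteValue, but also moves the caller's pointer past the value.
// Taking the pointer by reference is what lets call sites pass `code` directly.
static inline void put2ByteValueAndAdvance(unsigned char*& p, int value)
{
    put2ByteValue(p, value);
    p += 2;
}

// Links are assumed to be 2 bytes here. Zero is reserved to mean
// "bracket still open", so only the AllowZero variants accept it.
static inline void putLinkValueAllowZero(unsigned char* p, int value) { put2ByteValue(p, value); }
static inline int getLinkValueAllowZero(const unsigned char* p) { return get2ByteValue(p); }

static inline void putLinkValue(unsigned char* p, int value)
{
    assert(value != 0);
    putLinkValueAllowZero(p, value);
}

static inline int getLinkValue(const unsigned char* p)
{
    int value = getLinkValueAllowZero(p);
    assert(value != 0);
    return value;
}

// Follow the link stored after each opcode until the closing OP_KET is reached.
static inline void advanceToEndOfBracket(const unsigned char*& p)
{
    do {
        p += getLinkValue(p + 1);
    } while (*p != OP_KET);
}
```

Splitting the link accessors into strict and AllowZero variants would let an assertion catch a zero link anywhere except the one place where zero is meaningful (an open bracket), which is consistent with how the two variants are used in compileBracket above.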


Changeset 28793 in webkit for trunk/JavaScriptCore/pcre/pcre_compile.cpp