Context Navigation

← Previous Change
Next Change →

pcre_compile.cpp

Timestamp:

Nov 13, 2007, 9:25:26 AM (18 years ago)

Author:

Darin Adler

Message:

JavaScriptCore:

Reviewed by Geoff.

fix http://bugs.webkit.org/show_bug.cgi?id=11231 RegExp bug when handling newline characters and a number of other differences between PCRE behvior and JavaScript regular expressions:

+ single-digit sequences like \4 should be treated as octal

character constants, unless there is a sufficient number
of brackets for them to be treated as backreferences

+ \8 turns into the character "8", not a binary zero character

followed by "8" (same for 9)

+ only the first 3 digits should be considered part of an

octal character constant (the old behavior was to decode
an arbitrarily long sequence and then mask with 0xFF)

+ if \x is followed by anything other than two valid hex digits,

then it should simply be treated a the letter "x"; that includes
not supporting the \x{41} syntax

+ if \u is followed by anything less than four valid hex digits,

then it should simply be treated a the letter "u"

+ an extra "+" should be a syntax error, rather than being treated

as the "possessive quantifier"

+ if a "]" character appears immediately after a "[" character that

starts a character class, then that's an empty character class,
rather than being the start of a character class that includes a
"]" character

+ a "$" should not match a terminating newline; we could have gotten

PCRE to handle this the way we wanted by passing an appropriate option

Test: fast/js/regexp-no-extensions.html

pcre/pcre_compile.cpp: (check_escape): Check backreferences against bracount to catch both overflows and things that should be treated as octal. Rewrite octal loop to not go on indefinitely. Rewrite both hex loops to match and remove \x{} support. (compile_branch): Restructure loops so that we don't special-case a "]" at the beginning of a character class. Remove code that treated "+" as the possessive quantifier. (jsRegExpCompile): Change the "]" handling here too.

pcre/pcre_exec.cpp: (match): Changed CIRC to match the DOLL implementation. Changed DOLL to remove handling of "terminating newline", a Perl concept which we don't need.

tests/mozilla/expected.html: Two tests are fixed now: ecma_3/RegExp/regress-100199.js and ecma_3/RegExp/regress-188206.js. One test fails now: ecma_3/RegExp/perlstress-002.js -- our success before was due to a bug (we treated all 1-character numeric escapes as backreferences). The date tests also now both expect success -- whatever was making them fail before was probably due to the time being close to a DST shift; maybe we need to get rid of those tests.

LayoutTests:

Reviewed by Geoff.

test for http://bugs.webkit.org/show_bug.cgi?id=11231 RegExp bug when handling newline characters and other regular expression behavior that is different for JavaScript and PCRE

fast/js/regexp-no-extensions-expected.txt: Added.
fast/js/regexp-no-extensions.html: Added.
fast/js/resources/regexp-no-extensions.js: Added.

File:

: 1 edited

trunk/JavaScriptCore/pcre/pcre_compile.cpp (modified) (15 diffs)

Legend:

: Unmodified
: Added
: Removed

trunk/JavaScriptCore/pcre/pcre_compile.cpp

-              r27730
+              r27752
+{
 const pcre_uchar *ptr = *ptrptr + 1;
 int c, i;
+int i;
 /* If backslash is at the end of the pattern, it's an error. */
 …
+}
 c = *ptr;
+int c = *ptr;
 /* Non-alphamerics are literals. For digits or letters, do an initial lookup in
 …
 else
+  {
-  const pcre_uchar *oldptr;
   switch (c)
+    {
+    /* A number of Perl escapes are not handled by PCRE. We give an explicit
+    error. */
+    /* The handling of escape sequences consisting of a string of digits
+    starting with one that is not zero is not straightforward. By experiment,
+    the way Perl works seems to be as follows:
+    Outside a character class, the digits are read as a decimal number. If the
+    number is less than 10, or if there are that many previous extracting
+    left brackets, then it is a back reference. Otherwise, up to three octal
+    digits are read to form an escaped byte. Thus \123 is likely to be octal
+(cf \0123, which is octal 012 followed by the literal 3). If the octal
+    value is greater than 377, the least significant 8 bits are taken. Inside a
+    character class, \ followed by a digit is always an octal number. */
+    /* Escape sequences starting with a non-zero digit are backreferences,
+    unless there are insufficient brackets, in which case they are octal
+    escape sequences. Those sequences end on the first non-octal character
+    or when we overflow 0-255, whichever comes first. */
     case '1': case '2': case '3': case '4': case '5':
 …
     if (!isclass)
+      {
       oldptr = ptr;
+      const pcre_uchar *oldptr = ptr;
       c -= '0';
       while (ptr + 1 < patternEnd && isASCIIDigit(ptr[1]))
+      while (ptr + 1 < patternEnd && isASCIIDigit(ptr[1]) && c <= bracount)
         c = c * 10 + *(++ptr) - '0';
       if (c < 10 || c <= bracount)
+      if (c <= bracount)
+        {
         c = -(ESC_REF + c);
 …
+      }
+    /* Handle an octal number following \. If the first digit is 8 or 9, Perl
+    generates a binary zero byte and treats the digit as a following literal.
+    Thus we have to pull back the pointer by one. */
+    /* Handle an octal number following \. If the first digit is 8 or 9,
+    this is not octal. */
     if ((c = *ptr) >= '8')
+      {
-      ptr--;
-      c = 0;
       break;
+      }
     /* \0 always starts an octal number, but we may drop through to here with a
 …
     case '0':
     c -= '0';
+    while (i++ < 2 && ptr + 1 < patternEnd && ptr[1] >= '0' && ptr[1] <= '7')
+        c = c * 8 + *(++ptr) - '0';
+    c &= 255;     /* Take least significant 8 bits */
+    for (i = 1; i <= 2; ++i)
+      {
+      if (ptr + i >= patternEnd || ptr[i] < '0' || ptr[i] > '7')
+        break;
+      int cc = c * 8 + ptr[i] - '0';
+      if (cc > 255)
+        break;
+      c = cc;
+      }
+    ptr += i - 1;
     break;
-    /* \x is complicated. \x{ddd} is a character number which can be greater
-    than 0xff in utf8 mode, but only if the ddd are hex digits. If not, { is
-    treated as a data character. */
     case 'x':
+    if (ptr + 1 < patternEnd && ptr[1] == '{')
+      {
+      const pcre_uchar *pt = ptr + 2;
+      int count = 0;
+      c = 0;
+      while (pt < patternEnd && isASCIIHexDigit(*pt))
+        {
+        register int cc = *pt++;
+        if (c == 0 && cc == '0') continue;     /* Leading zeroes */
+        count++;
+        if (cc >= 'a') cc -= 32;               /* Convert to upper case */
+        c = (c << 4) + cc - ((cc < 'A')? '0' : ('A' - 10));
+        }
+      if (pt < patternEnd && *pt == '}')
+        {
+        if (c < 0 || count > 8) *errorcodeptr = ERR3;
+        else if (c >= 0xD800 && c <= 0xDFFF) *errorcodeptr = ERR3; // half of surrogate pair
+        else if (c >= 0xFDD0 && c <= 0xFDEF) *errorcodeptr = ERR3; // ?
+        else if (c == 0xFFFE) *errorcodeptr = ERR3; // not a character
+        else if (c == 0xFFFF)  *errorcodeptr = ERR3; // not a character
+        else if (c > 0x10FFFF) *errorcodeptr = ERR3; // out of Unicode character range
+        ptr = pt;
+    c = 0;
+    for (i = 1; i <= 2; ++i)
+      {
+      if (ptr + i >= patternEnd || !isASCIIHexDigit(ptr[i]))
+        {
+        c = 'x';
+        i = 1;
         break;
+        }
+      /* If the sequence of hex digits does not end with '}', then we don't
+      recognize this construct; fall through to the normal \x handling. */
+      }
+    /* Read just a single-byte hex-defined char */
+      int cc = ptr[i];
+      if (cc >= 'a') cc -= 32;             /* Convert to upper case */
+      c = c * 16 + cc - ((cc < 'A') ? '0' : ('A' - 10));
+      }
+    ptr += i - 1;
+    break;
+    case 'u':
     c = 0;
+    while (i++ < 2 && ptr + 1 < patternEnd && isASCIIHexDigit(ptr[1]))
+      {
+      int cc;                               /* Some compilers don't like ++ */
+      cc = *(++ptr);                        /* in initializers */
+      if (cc >= 'a') cc -= 32;              /* Convert to upper case */
+    for (i = 1; i <= 4; ++i)
+      {
+      if (ptr + i >= patternEnd || !isASCIIHexDigit(ptr[i]))
+        {
+        c = 'u';
+        i = 1;
+        break;
+        }
+      int cc = ptr[i];
+      if (cc >= 'a') cc -= 32;             /* Convert to upper case */
       c = c * 16 + cc - ((cc < 'A')? '0' : ('A' - 10));
+      }
+    ptr += i - 1;
     break;
-    case 'u': {
-    const pcre_uchar *pt = ptr;
-    c = 0;
-    while (i++ < 4)
+      {
-      if (pt + 1 >= patternEnd || !isASCIIHexDigit(pt[1]))
+        {
-        pt = ptr;
-        c = 'u';
-        break;
+        }
-      else
+        {
-        int cc;                              /* Some compilers don't like ++ */
-        cc = *(++pt);                        /* in initializers */
-        if (cc >= 'a') cc -= 32;             /* Convert to upper case */
-        c = c * 16 + cc - ((cc < 'A')? '0' : ('A' - 10));
+        }
+      }
-    ptr = pt;
-    break;
+    }
     /* Other special escapes not starting with a digit are straightforward */
 …
   BOOL negate_class;
   BOOL should_flip_negation; /* If a negative special such as \S is used, we should negate the whole class to properly support Unicode. */
-  BOOL possessive_quantifier;
   BOOL is_quantifier;
   int class_charcount;
 …
     /* If the first character is '^', set the negation flag and skip it. */
     if ((c = *(++ptr)) == '^')
+    if (ptr[1] == '^')
+      {
       negate_class = true;
       c = *(++ptr);
+      ++ptr;
+      }
     else
 …
     memset(classbits, 0, 32 * sizeof(uschar));
+    /* Process characters until ] is reached. By writing this as a "do" it
+    means that an initial ] is taken as a data character. The first pass
+    /* Process characters until ] is reached. The first pass
     through the regex checked the overall syntax, so we don't need to be very
     strict here. At the start of the loop, c contains the first byte of the
     character. */
     do
+    while ((c = *(++ptr)) != ']')
+      {
       if (c > 127)
 …
+      }
-    /* Loop until ']' reached; the check for end of string happens inside the
-    loop. This "while" is the end of the "do" above. */
-    while ((c = *(++ptr)) != ']');
     /* If class_charcount is 1, we saw precisely one character whose value is
     less than 256. In non-UTF-8 mode we can always optimize. In UTF-8 mode, we
 …
     op_type = 0;                    /* Default single-char op codes */
-    possessive_quantifier = false;  /* Default not possessive quantifier */
     /* Save start of previous item, in case we have to move it up to make space
 …
     repeat type to the non-default. */
+    if (ptr + 1 < patternEnd && ptr[1] == '+')
+      {
+      repeat_type = 0;                  /* Force greedy */
+      possessive_quantifier = true;
+      ptr++;
+      }
+    else if (ptr + 1 < patternEnd && ptr[1] == '?')
+    if (ptr + 1 < patternEnd && ptr[1] == '?')
+      {
       repeat_type = 1;
 …
       *errorcodeptr = ERR11;
       goto FAILED;
+      }
-    /* If the character following a repeat is '+', we wrap the entire repeated
-    item inside OP_ONCE brackets. This is just syntactic sugar, taken from
-    Sun's Java package. The repeated item starts at tempcode, not at previous,
-    which might be the first part of a string whose (former) last char we
-    repeated. However, we don't support '+' after a greediness '?'. */
-    if (possessive_quantifier)
+      {
-      int len = code - tempcode;
-      memmove(tempcode + 1+LINK_SIZE, tempcode, len);
-      code += 1 + LINK_SIZE;
-      len += 1 + LINK_SIZE;
-      tempcode[0] = OP_ONCE;
-      *code++ = OP_KET;
-      PUTINC(code, 0, len);
-      PUT(tempcode, 1, len);
+      }
 …
     class_utf8 = false;
+    /* Written as a "do" so that an initial ']' is taken as data */
+    if (*ptr != 0) do
+      {
+      /* Outside \Q...\E, check for escapes */
+    for (; ptr < patternEnd && *ptr != ']'; ++ptr)
+      {
+      /* Check for escapes */
       if (*ptr == '\\')
 …
+        }
+      }
-    while (++ptr < patternEnd && *ptr != ']'); /* Concludes "do" above */
     if (ptr >= patternEnd)                          /* Missing terminating ']' */

Note: See TracChangeset for help on using the changeset viewer.

Context Navigation

Changeset 27752 in webkit for trunk/JavaScriptCore/pcre/pcre_compile.cpp

Legend:

trunk/JavaScriptCore/pcre/pcre_compile.cpp

Download in other formats: