Timestamp: Nov 13, 2007, 9:25:26 AM
Author: Darin Adler
Message:

JavaScriptCore:

Reviewed by Geoff.

+ single-digit sequences like \4 should be treated as octal
  character constants, unless there is a sufficient number
  of brackets for them to be treated as backreferences
+ \8 turns into the character "8", not a binary zero character
  followed by "8" (same for 9)
+ only the first 3 digits should be considered part of an
  octal character constant (the old behavior was to decode
  an arbitrarily long sequence and then mask with 0xFF)
+ if \x is followed by anything other than two valid hex digits,
  then it should simply be treated as the letter "x"; that includes
  not supporting the \x{41} syntax
+ if \u is followed by fewer than four valid hex digits,
  then it should simply be treated as the letter "u"
+ an extra "+" should be a syntax error, rather than being treated
  as the "possessive quantifier"
+ if a "]" character appears immediately after a "[" character that
  starts a character class, then that's an empty character class,
  rather than being the start of a character class that includes a
  "]" character
+ a "$" should not match a terminating newline; we could have gotten
  PCRE to handle this the way we wanted by passing an appropriate option
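The first three rules above (backreference versus octal, \8 and \9 as plain digits, and the three-digit, 0-255 limit on octal constants) can be illustrated with a small standalone sketch. This is not the code from check_escape: the names NumericEscape and decodeNumericEscape are hypothetical, the function only covers escapes outside a character class, and bracketCount stands in for whatever bracket count the compiler compares against (bracount in the diff).

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch; not part of PCRE or JavaScriptCore.
struct NumericEscape {
    bool isBackreference;  // true: value is a group number; false: value is a character code
    int value;
    std::size_t length;    // digits consumed after the backslash
};

// Decodes the digits that follow a backslash outside a character class.
// `digits` starts at the first digit after the backslash.
NumericEscape decodeNumericEscape(std::string_view digits, int bracketCount)
{
    assert(!digits.empty() && digits[0] >= '0' && digits[0] <= '9');
    auto isDigit = [](char ch) { return ch >= '0' && ch <= '9'; };
    auto isOctal = [](char ch) { return ch >= '0' && ch <= '7'; };

    // Try a decimal backreference first (never for a leading zero).
    if (digits[0] != '0') {
        int ref = digits[0] - '0';
        std::size_t used = 1;
        while (used < digits.size() && isDigit(digits[used]) && ref <= bracketCount)
            ref = ref * 10 + (digits[used++] - '0');
        if (ref <= bracketCount)
            return { true, ref, used };
    }

    // \8 and \9 are not octal: they stand for the digit character itself.
    if (digits[0] >= '8')
        return { false, digits[0], 1 };

    // Octal: at most three digits, stopping before the value would exceed 255.
    int value = digits[0] - '0';
    std::size_t used = 1;
    while (used < 3 && used < digits.size() && isOctal(digits[used])) {
        int next = value * 8 + (digits[used] - '0');
        if (next > 255)
            break;
        value = next;
        ++used;
    }
    return { false, value, used };
}
```

With no capturing brackets, `decodeNumericEscape("4", 0)` yields character code 4, while the same input with one bracket available yields a backreference to group 1 only for "1". An input such as "400" stops after two digits because a third would overflow 255.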

Test: fast/js/regexp-no-extensions.html

  • pcre/pcre_compile.cpp:
    (check_escape): Check backreferences against bracount to catch both overflows and things that should be treated as octal. Rewrite octal loop to not go on indefinitely. Rewrite both hex loops to match and remove \x{} support.
    (compile_branch): Restructure loops so that we don't special-case a "]" at the beginning of a character class. Remove code that treated "+" as the possessive quantifier.
    (jsRegExpCompile): Change the "]" handling here too.
  • pcre/pcre_exec.cpp:
    (match): Changed CIRC to match the DOLL implementation. Changed DOLL to remove handling of "terminating newline", a Perl concept which we don't need.
  • tests/mozilla/expected.html: Two tests are fixed now: ecma_3/RegExp/regress-100199.js and ecma_3/RegExp/regress-188206.js. One test fails now: ecma_3/RegExp/perlstress-002.js -- our success before was due to a bug (we treated all 1-character numeric escapes as backreferences). The date tests also now both expect success -- whatever was making them fail before was probably due to the time being close to a DST shift; maybe we need to get rid of those tests.
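The check_escape note about making both hex loops match can be pictured as a single helper that takes the required digit count. The sketch below is an illustration under that assumption, not the committed code: HexEscape and decodeHexEscape are hypothetical names, and the real function works on pcre_uchar pointers rather than a string view.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch.
struct HexEscape {
    int value;           // decoded character code, or the escape letter itself on fallback
    std::size_t length;  // characters consumed after the backslash, including the letter
};

// `text` starts at the escape letter ('x' or 'u').
// `digitCount` is 2 for \x and 4 for \u.
HexEscape decodeHexEscape(std::string_view text, std::size_t digitCount)
{
    assert(!text.empty());
    auto hexValue = [](char ch) -> int {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        return -1;
    };

    int value = 0;
    for (std::size_t i = 1; i <= digitCount; ++i) {
        int digit = i < text.size() ? hexValue(text[i]) : -1;
        if (digit < 0)
            return { text[0], 1 };  // not enough hex digits: the escape is just the letter
        value = value * 16 + digit;
    }
    return { value, digitCount + 1 };
}
```

Under this sketch, "x41" with a digit count of 2 decodes to 0x41, while "x{41}" falls back to the letter "x" and leaves the brace for normal pattern handling.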

LayoutTests:

Reviewed by Geoff.

  • fast/js/regexp-no-extensions-expected.txt: Added.
  • fast/js/regexp-no-extensions.html: Added.
  • fast/js/resources/regexp-no-extensions.js: Added.
File:
--- trunk/JavaScriptCore/pcre/pcre_compile.cpp (r27730)
+++ trunk/JavaScriptCore/pcre/pcre_compile.cpp (r27752)
@@ -162,5 +162,5 @@
 {
 const pcre_uchar *ptr = *ptrptr + 1;
-int c, i;
+int i;
 
 /* If backslash is at the end of the pattern, it's an error. */
@@ -171,5 +171,5 @@
 }
 
-c = *ptr;
+int c = *ptr;
 
 /* Non-alphamerics are literals. For digits or letters, do an initial lookup in
@@ -184,21 +184,10 @@
 else
   {
-  const pcre_uchar *oldptr;
   switch (c)
     {
-    /* A number of Perl escapes are not handled by PCRE. We give an explicit
-    error. */
-
-    /* The handling of escape sequences consisting of a string of digits
-    starting with one that is not zero is not straightforward. By experiment,
-    the way Perl works seems to be as follows:
-
-    Outside a character class, the digits are read as a decimal number. If the
-    number is less than 10, or if there are that many previous extracting
-    left brackets, then it is a back reference. Otherwise, up to three octal
-    digits are read to form an escaped byte. Thus \123 is likely to be octal
-    123 (cf \0123, which is octal 012 followed by the literal 3). If the octal
-    value is greater than 377, the least significant 8 bits are taken. Inside a
-    character class, \ followed by a digit is always an octal number. */
+    /* Escape sequences starting with a non-zero digit are backreferences,
+    unless there are insufficient brackets, in which case they are octal
+    escape sequences. Those sequences end on the first non-octal character
+    or when we overflow 0-255, whichever comes first. */
 
     case '1': case '2': case '3': case '4': case '5':
@@ -207,9 +196,9 @@
     if (!isclass)
       {
-      oldptr = ptr;
+      const pcre_uchar *oldptr = ptr;
       c -= '0';
-      while (ptr + 1 < patternEnd && isASCIIDigit(ptr[1]))
+      while (ptr + 1 < patternEnd && isASCIIDigit(ptr[1]) && c <= bracount)
         c = c * 10 + *(++ptr) - '0';
-      if (c < 10 || c <= bracount)
+      if (c <= bracount)
         {
         c = -(ESC_REF + c);
@@ -219,14 +208,9 @@
       }
 
-    /* Handle an octal number following \. If the first digit is 8 or 9, Perl
-    generates a binary zero byte and treats the digit as a following literal.
-    Thus we have to pull back the pointer by one. */
+    /* Handle an octal number following \. If the first digit is 8 or 9,
+    this is not octal. */
 
     if ((c = *ptr) >= '8')
-      {
-      ptr--;
-      c = 0;
       break;
-      }
 
     /* \0 always starts an octal number, but we may drop through to here with a
@@ -235,80 +219,49 @@
     case '0':
     c -= '0';
-    while (i++ < 2 && ptr + 1 < patternEnd && ptr[1] >= '0' && ptr[1] <= '7')
-        c = c * 8 + *(++ptr) - '0';
-    c &= 255;     /* Take least significant 8 bits */
+    for (i = 1; i <= 2; ++i)
+      {
+      if (ptr + i >= patternEnd || ptr[i] < '0' || ptr[i] > '7')
+        break;
+      int cc = c * 8 + ptr[i] - '0';
+      if (cc > 255)
+        break;
+      c = cc;
+      }
+    ptr += i - 1;
     break;
 
-    /* \x is complicated. \x{ddd} is a character number which can be greater
-    than 0xff in utf8 mode, but only if the ddd are hex digits. If not, { is
-    treated as a data character. */
-
     case 'x':
-    if (ptr + 1 < patternEnd && ptr[1] == '{')
-      {
-      const pcre_uchar *pt = ptr + 2;
-      int count = 0;
-
-      c = 0;
-      while (pt < patternEnd && isASCIIHexDigit(*pt))
-        {
-        register int cc = *pt++;
-        if (c == 0 && cc == '0') continue;     /* Leading zeroes */
-        count++;
-
-        if (cc >= 'a') cc -= 32;               /* Convert to upper case */
-        c = (c << 4) + cc - ((cc < 'A')? '0' : ('A' - 10));
-        }
-
-      if (pt < patternEnd && *pt == '}')
-        {
-        if (c < 0 || count > 8) *errorcodeptr = ERR3;
-        else if (c >= 0xD800 && c <= 0xDFFF) *errorcodeptr = ERR3; // half of surrogate pair
-        else if (c >= 0xFDD0 && c <= 0xFDEF) *errorcodeptr = ERR3; // ?
-        else if (c == 0xFFFE) *errorcodeptr = ERR3; // not a character
-        else if (c == 0xFFFF)  *errorcodeptr = ERR3; // not a character
-        else if (c > 0x10FFFF) *errorcodeptr = ERR3; // out of Unicode character range
-        ptr = pt;
+    c = 0;
+    for (i = 1; i <= 2; ++i)
+      {
+      if (ptr + i >= patternEnd || !isASCIIHexDigit(ptr[i]))
+        {
+        c = 'x';
+        i = 1;
         break;
         }
-
-      /* If the sequence of hex digits does not end with '}', then we don't
-      recognize this construct; fall through to the normal \x handling. */
-      }
-
-    /* Read just a single-byte hex-defined char */
-
+      int cc = ptr[i];
+      if (cc >= 'a') cc -= 32;             /* Convert to upper case */
+      c = c * 16 + cc - ((cc < 'A') ? '0' : ('A' - 10));
+      }
+    ptr += i - 1;
+    break;
+
+    case 'u':
     c = 0;
-    while (i++ < 2 && ptr + 1 < patternEnd && isASCIIHexDigit(ptr[1]))
-      {
-      int cc;                               /* Some compilers don't like ++ */
-      cc = *(++ptr);                        /* in initializers */
-      if (cc >= 'a') cc -= 32;              /* Convert to upper case */
+    for (i = 1; i <= 4; ++i)
+      {
+      if (ptr + i >= patternEnd || !isASCIIHexDigit(ptr[i]))
+        {
+        c = 'u';
+        i = 1;
+        break;
+        }
+      int cc = ptr[i];
+      if (cc >= 'a') cc -= 32;             /* Convert to upper case */
       c = c * 16 + cc - ((cc < 'A')? '0' : ('A' - 10));
       }
+    ptr += i - 1;
     break;
-
-    case 'u': {
-    const pcre_uchar *pt = ptr;
-    c = 0;
-    while (i++ < 4)
-      {
-      if (pt + 1 >= patternEnd || !isASCIIHexDigit(pt[1]))
-        {
-        pt = ptr;
-        c = 'u';
-        break;
-        }
-      else
-        {
-        int cc;                              /* Some compilers don't like ++ */
-        cc = *(++pt);                        /* in initializers */
-        if (cc >= 'a') cc -= 32;             /* Convert to upper case */
-        c = c * 16 + cc - ((cc < 'A')? '0' : ('A' - 10));
-        }
-      }
-    ptr = pt;
-    break;
-    }
 
     /* Other special escapes not starting with a digit are straightforward */
@@ -934,5 +887,4 @@
   BOOL negate_class;
   BOOL should_flip_negation; /* If a negative special such as \S is used, we should negate the whole class to properly support Unicode. */
-  BOOL possessive_quantifier;
   BOOL is_quantifier;
   int class_charcount;
@@ -1026,8 +978,8 @@
     /* If the first character is '^', set the negation flag and skip it. */
 
-    if ((c = *(++ptr)) == '^')
+    if (ptr[1] == '^')
       {
       negate_class = true;
-      c = *(++ptr);
+      ++ptr;
       }
     else
@@ -1053,11 +1005,10 @@
     memset(classbits, 0, 32 * sizeof(uschar));
 
-    /* Process characters until ] is reached. By writing this as a "do" it
-    means that an initial ] is taken as a data character. The first pass
+    /* Process characters until ] is reached. The first pass
     through the regex checked the overall syntax, so we don't need to be very
     strict here. At the start of the loop, c contains the first byte of the
     character. */
 
-    do
+    while ((c = *(++ptr)) != ']')
       {
       if (c > 127)
@@ -1286,9 +1237,4 @@
       }
 
-    /* Loop until ']' reached; the check for end of string happens inside the
-    loop. This "while" is the end of the "do" above. */
-
-    while ((c = *(++ptr)) != ']');
-
     /* If class_charcount is 1, we saw precisely one character whose value is
     less than 256. In non-UTF-8 mode we can always optimize. In UTF-8 mode, we
@@ -1431,5 +1377,4 @@
 
     op_type = 0;                    /* Default single-char op codes */
-    possessive_quantifier = false;  /* Default not possessive quantifier */
 
     /* Save start of previous item, in case we have to move it up to make space
@@ -1444,11 +1389,5 @@
     repeat type to the non-default. */
 
-    if (ptr + 1 < patternEnd && ptr[1] == '+')
-      {
-      repeat_type = 0;                  /* Force greedy */
-      possessive_quantifier = true;
-      ptr++;
-      }
-    else if (ptr + 1 < patternEnd && ptr[1] == '?')
+    if (ptr + 1 < patternEnd && ptr[1] == '?')
       {
       repeat_type = 1;
@@ -1829,22 +1768,4 @@
       *errorcodeptr = ERR11;
       goto FAILED;
-      }
-
-    /* If the character following a repeat is '+', we wrap the entire repeated
-    item inside OP_ONCE brackets. This is just syntactic sugar, taken from
-    Sun's Java package. The repeated item starts at tempcode, not at previous,
-    which might be the first part of a string whose (former) last char we
-    repeated. However, we don't support '+' after a greediness '?'. */
-
-    if (possessive_quantifier)
-      {
-      int len = code - tempcode;
-      memmove(tempcode + 1+LINK_SIZE, tempcode, len);
-      code += 1 + LINK_SIZE;
-      len += 1 + LINK_SIZE;
-      tempcode[0] = OP_ONCE;
-      *code++ = OP_KET;
-      PUTINC(code, 0, len);
-      PUT(tempcode, 1, len);
       }
 
@@ -2736,9 +2657,7 @@
     class_utf8 = false;
 
-    /* Written as a "do" so that an initial ']' is taken as data */
-
-    if (*ptr != 0) do
-      {
-      /* Outside \Q...\E, check for escapes */
+    for (; ptr < patternEnd && *ptr != ']'; ++ptr)
+      {
+      /* Check for escapes */
 
       if (*ptr == '\\')
@@ -2890,5 +2809,4 @@
         }
       }
-    while (++ptr < patternEnd && *ptr != ']'); /* Concludes "do" above */
 
     if (ptr >= patternEnd)                          /* Missing terminating ']' */
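
The character-class change in the last hunks replaces a do/while scan, which always consumed the first character as data, with a loop that tests for "]" before consuming anything. A minimal standalone sketch of that scanning shape follows. The name findClassEnd is hypothetical, and escape handling is simplified to "backslash plus one character", which is an assumption of this sketch rather than what check_escape does.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the index of the ']' that closes the class opened at `open`,
// or std::string_view::npos if the class is unterminated.
std::size_t findClassEnd(std::string_view pattern, std::size_t open)
{
    assert(open < pattern.size() && pattern[open] == '[');
    std::size_t pos = open + 1;
    if (pos < pattern.size() && pattern[pos] == '^')
        ++pos;  // skip the negation marker

    // Test for ']' before consuming a character, so "[]" is an empty class.
    for (; pos < pattern.size() && pattern[pos] != ']'; ++pos) {
        if (pattern[pos] == '\\')
            ++pos;  // simplification: every escape is a backslash plus one character
    }
    return pos < pattern.size() ? pos : std::string_view::npos;
}
```

For the pattern "[]a]" this returns index 1, so the class is empty and "a]" is ordinary pattern text. A do/while version would have treated the first "]" as a member of the class and returned index 3.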