Ignore:
Timestamp:
Nov 11, 2007, 10:56:13 AM (18 years ago)
Author:
Darin Adler
Message:

JavaScriptCore:

Reviewed by Sam.

This is a combination of converting to C++, tweaking the API, and adding
some additional optimizations.

Future steps will involve getting rid of the use of UTF-8 completely
(we'll use UTF-16 exclusively instead), eliminating more source files,
and some more speed-ups.

SunSpider says the current round is an 0.9% speed-up overall, and a
5.3% speed-up for regexp.

  • kjs/regexp.cpp: (KJS::RegExp::RegExp): Changed to use the error message without calling strdup on it and to pass the new types and options. (KJS::RegExp::~RegExp): Removed the now-unneeded free of the error message. (KJS::RegExp::match): Pass the new types and options.
  • kjs/regexp.h: Update type of m_constructionError.
  • pcre/AUTHORS: Update to reflect the status of the project -- we don't include the Google parts, and this isn't the PCRE library, per se.
  • pcre/COPYING: Ditto.
  • pcre/dftables.cpp: Copied from JavaScriptCore/pcre/dftables.c. (main): Removed unneeded ctype_digit.
  • pcre/pcre.h: Convert to C++, tweak API a bit. Use UChar instead of JSRegExpChar.
  • pcre/pcre_compile.cpp: Copied from JavaScriptCore/pcre/pcre_compile.c. Moved a lot of private stuff used only within this file here from pcre_internal.h. Renumbered the error codes. (error_text): Use a single string with embedded nulls for the error text (I got this idea from newer versions of PCRE). (check_escape): Changed return type to be enum instead of int. Replaced ctype_digit uses with isASCIIDigit. (is_counted_repeat): Ditto. (read_repeat_counts): Ditto. (first_significant_code): Ditto. (find_fixedlength): Ditto. (could_be_empty_branch): Ditto. (compile_branch): Ditto. Also removed some code that handles changing options. JavaScript doesn't have any of the features that allow options to change. (compile_regex): Updated for change to options parameter. (is_anchored): Ditto. (find_firstassertedchar): Ditto. (jsRegExpCompile): Changed to take separate flags instead of an options int. Also changed to call new/delete instead of pcre_malloc/free. (jsRegExpFree): Ditto.
  • pcre/pcre_exec.cpp: Copied from JavaScriptCore/pcre/pcre_exec.c. Added a case that uses computed goto for the opcode loop, but did not turn it on. Changed the RMATCH macro to handle returns more efficiently by putting the where pointer in the new frame instead of the old one, allowing us to branch to the return with a single statement. Switched to new/delete from pcre_malloc/free. Changed many RRETURN callers to not set the return value since it's already set correctly. Replaced the rrc variable with an is_match variable. Values other than "match" and "no match" are now handled differently. This allows us to remove the code to check for those cases in various rules. (match): All the case statements use a macro BEGIN_OPCODE instead. And all the continue statements, or break statements that break out of the outer case use a macro NEXT_OPCODE instead. Replaced a few if statements with assertions. (jsRegExpExecute): Use new/delete instead of pcre_malloc/free. Removed unused start_match field from the match block.
  • pcre/pcre_internal.h: Moved the last few configuration macros from pcre-config.h in here. Removed various unused types. Converted from JSRegExpChar to UChar. Eliminated pcre_malloc/free. Replaced the opcode enum with a macro that can be used in multiple places. Unfortunately we lose the comments for each opcode; we should find a place to put those back. Removed ctype_digit.
  • pcre/pcre_maketables.cpp: Copied from JavaScriptCore/pcre/pcre_maketables.c. (pcre_maketables): Got rid of the conditional code that allows this to be compiled in -- it's only used for dftables now (and soon may be obsolete entirely). Changed code for cbit_digit to not use isdigit, and took the "_" case out of the loop. Removed ctype_digit.
  • pcre/pcre_ord2utf8.cpp: Copied from JavaScriptCore/pcre/pcre_ord2utf8.c.
  • pcre/pcre_tables.cpp: Copied from JavaScriptCore/pcre/pcre_tables.c. Moved _pcre_OP_lengths out of here into pcre_exec.cpp.
  • pcre/pcre_ucp_searchfuncs.cpp: Copied from JavaScriptCore/pcre/pcre_ucp_searchfuncs.c. Updated for other file name changes.
  • pcre/ucpinternal.h: Updated header.
  • wtf/ASCIICType.h: (WTF::isASCIIDigit): Removed a branch by changing from && to & for this operation. Also added an overload that takes an int because that's useful for PCRE. Later we could optimize for int and overload other functions in this file; stuck to this simple one for now.
  • wtf/unicode/icu/UnicodeIcu.h: Removed unused isUpper.
  • wtf/unicode/qt4/UnicodeQt4.h: Ditto.
  • pcre/LICENCE: Removed.
  • pcre/pcre-config.h: Removed.
  • wtf/FastMallocPCRE.cpp: Removed.
  • pcre/dftables.c: Renamed to cpp.
  • pcre/pcre_compile.c: Ditto.
  • pcre/pcre_exec.c: Ditto.
  • pcre/pcre_maketables.c: Ditto.
  • pcre/pcre_ord2utf8.c: Ditto.
  • pcre/pcre_tables.c: Ditto.
  • pcre/pcre_ucp_searchfuncs.c: Ditto.
  • pcre/pcre_xclass.c: Ditto.
  • pcre/ucptable.c: Ditto.

WebCore:

Reviewed by Sam.

  • updated for JSRegExp function changes
  • platform/RegularExpression.cpp: (WebCore::RegularExpression::Private::compile): (WebCore::RegularExpression::match):
File:
1 edited

Legend:

Unmodified
Added
Removed
  • trunk/JavaScriptCore/pcre/pcre_internal.h

    r27422 r27686  
    1 /*************************************************
    2 *      Perl-Compatible Regular Expressions       *
    3 *************************************************/
    4 
    5 
    6 /* PCRE is a library of functions to support regular expressions whose syntax
    7 and semantics are as close as possible to those of the Perl 5 language.
    8 
    9                        Written by Philip Hazel
     1/* This is JavaScriptCore's variant of the PCRE library. While this library
     2started out as a copy of PCRE, many of the features of PCRE have been
     3removed. This library now supports only the regular expression features
     4required by the JavaScript language specification, and has only the functions
     5needed by JavaScriptCore and the rest of WebKit.
     6
     7                 Originally written by Philip Hazel
    108           Copyright (c) 1997-2006 University of Cambridge
    11            Copyright (c) 2004, 2005 Apple Computer, Inc.
     9    Copyright (C) 2002, 2004, 2006, 2007 Apple Inc. All rights reserved.
    1210
    1311-----------------------------------------------------------------------------
     
    5452#endif
    5553
    56 #define _pcre_OP_lengths kjs_pcre_OP_lengths
     54/* The value of LINK_SIZE determines the number of bytes used to store links as
     55offsets within the compiled regex. The default is 2, which allows for compiled
     56patterns up to 64K long. This covers the vast majority of cases. However, PCRE
     57can also be compiled to use 3 or 4 bytes instead. This allows for longer
     58patterns in extreme cases. On systems that support it, "configure" can be used
     59to override this default. */
     60
     61#define LINK_SIZE   2
     62
     63/* The value of MATCH_LIMIT determines the default number of times the internal
     64match() function can be called during a single execution of pcre_exec(). There
     65is a runtime interface for setting a different limit. The limit exists in order
     66to catch runaway regular expressions that take for ever to determine that they
     67do not match. The default is set very large so that it does not accidentally
     68catch legitimate cases. On systems that support it, "configure" can be used to
     69override this default default. */
     70
     71#define MATCH_LIMIT 10000000
     72
     73/* The above limit applies to all calls of match(), whether or not they
     74increase the recursion depth. In some environments it is desirable to limit the
     75depth of recursive calls of match() more strictly, in order to restrict the
     76maximum amount of stack (or heap, if NO_RECURSE is defined) that is used. The
     77value of MATCH_LIMIT_RECURSION applies only to recursive calls of match(). To
     78have any useful effect, it must be less than the value of MATCH_LIMIT. There is
     79a runtime method for setting a different limit. On systems that support it,
     80"configure" can be used to override this default default. */
     81
     82#define MATCH_LIMIT_RECURSION MATCH_LIMIT
     83
    5784#define _pcre_default_tables kjs_pcre_default_tables
    5885#define _pcre_ord2utf8 kjs_pcre_ord2utf8
    59 #define _pcre_printint kjs_pcre_printint
    60 #define _pcre_try_flipped kjs_pcre_try_flipped
    61 #define _pcre_ucp_findchar kjs_pcre_ucp_findchar
    6286#define _pcre_utf8_table1 kjs_pcre_utf8_table1
    6387#define _pcre_utf8_table1_size  kjs_pcre_utf8_table1_size
     
    6589#define _pcre_utf8_table3 kjs_pcre_utf8_table3
    6690#define _pcre_utf8_table4 kjs_pcre_utf8_table4
    67 #define _pcre_utt kjs_pcre_utt
    68 #define _pcre_utt_size kjs_pcre_utt_size
    69 #define _pcre_valid_utf8 kjs_pcre_valid_utf8
    7091#define _pcre_xclass kjs_pcre_xclass
    7192
     
    86107#define DPRINTF(p) /*nothing*/
    87108#endif
    88 
    89 
    90 /* Get the definitions provided by running "configure" */
    91 
    92 #include "pcre-config.h"
    93109
    94110/* Standard C headers plus the external interface definition. The only time
     
    115131typedef JSRegExp pcre;
    116132
    117 typedef JSRegExpChar pcre_char;
    118 typedef JSRegExpChar pcre_uchar;
    119 typedef const JSRegExpChar* USPTR;
    120 
    121 /* Temporary fastMalloc/fastFree until we port to C++. */
    122 #ifdef __cplusplus
    123 extern "C" {
    124 #endif
    125 extern void* (*pcre_malloc)(size_t);
    126 extern void (*pcre_free)(void*);
    127 #ifdef __cplusplus
    128 } /* extern "C" */
    129 #endif
     133typedef UChar pcre_char;
     134typedef UChar pcre_uchar;
     135typedef const UChar* USPTR;
    130136
    131137/* PCRE keeps offsets in its compiled code as 2-byte quantities (always stored
     
    287293#define ISMIDCHAR(c) IS_TRAILING_SURROGATE(c)
    288294
    289 /* If the pointer is not at the start of a character, move it back until
    290 it is. Called only in UTF-8 mode. */
    291 
    292295#define BACKCHAR(eptr) while(ISMIDCHAR(*eptr)) eptr--;
    293 
    294 
    295 /* In case there is no definition of offsetof() provided - though any proper
    296 Standard C system should have one. */
    297 
    298 #ifndef offsetof
    299 #define offsetof(p_type,field) ((size_t)&(((p_type *)0)->field))
    300 #endif
    301 
    302 
    303 /* Private options flags start at the most significant end of the four bytes,
    304 but skip the top bit so we can use ints for convenience without getting tangled
    305 with negative values. The public options defined in pcre.h start at the least
    306 significant end. Make sure they don't overlap! */
    307296
    308297#define PCRE_FIRSTSET      0x40000000  /* first_byte is set */
     
    310299#define PCRE_STARTLINE     0x10000000  /* start after \n for multiline */
    311300#define PCRE_ANCHORED      0x02000000  /* can't use partial with this regex */
    312 #define PCRE_CASELESS      JS_REGEXP_CASELESS
    313 #define PCRE_MULTILINE     JS_REGEXP_MULTILINE
     301#define PCRE_CASELESS      0x00000001
     302#define PCRE_MULTILINE     0x00000002
    314303
    315304/* Negative values for the firstchar and reqchar variables */
     
    365354must also be updated to match. */
    366355
    367 enum {
    368   OP_END,                   /* End of pattern */
    369 
    370   /* Values corresponding to backslashed metacharacters */
    371 
    372   OP_NOT_WORD_BOUNDARY,     /* \B */
    373   OP_WORD_BOUNDARY,         /* \b */
    374   OP_NOT_DIGIT,             /* \D */
    375   OP_DIGIT,                 /* \d */
    376   OP_NOT_WHITESPACE,        /* \S */
    377   OP_WHITESPACE,            /* \s */
    378   OP_NOT_WORDCHAR,          /* \W */
    379   OP_WORDCHAR,              /* \w */
    380 
    381   OP_ANY,                   /* . -- Match any character */
    382 
    383   OP_CIRC,                  /* ^ */
    384   OP_DOLL,                  /* $ */
    385   OP_CHAR,                  /* Match one character, casefully */
    386   OP_CHARNC,                /* Match one character, caselessly */
    387   OP_ASCII_CHAR,            /* Match one ASCII (0-127) character. */
    388   OP_ASCII_LETTER_NC,       /* Match one ASCII letter, caselessly. */
    389   OP_NOT,                   /* Match anything but the following char */
    390 
    391   OP_STAR,                  /* The maximizing and minimizing versions of */
    392   OP_MINSTAR,               /* all these opcodes must come in pairs, with */
    393   OP_PLUS,                  /* the minimizing one second. */
    394   OP_MINPLUS,               /* This first set applies to single characters */
    395   OP_QUERY,
    396   OP_MINQUERY,
    397   OP_UPTO,                  /* From 0 to n matches */
    398   OP_MINUPTO,
    399   OP_EXACT,                 /* Exactly n matches */
    400 
    401   OP_NOTSTAR,               /* This set applies to "not" single characters */
    402   OP_NOTMINSTAR,
    403   OP_NOTPLUS,
    404   OP_NOTMINPLUS,
    405   OP_NOTQUERY,
    406   OP_NOTMINQUERY,
    407   OP_NOTUPTO,
    408   OP_NOTMINUPTO,
    409   OP_NOTEXACT,
    410 
    411   OP_TYPESTAR,              /* This set applies to character types such as \d */
    412   OP_TYPEMINSTAR,
    413   OP_TYPEPLUS,
    414   OP_TYPEMINPLUS,
    415   OP_TYPEQUERY,
    416   OP_TYPEMINQUERY,
    417   OP_TYPEUPTO,
    418   OP_TYPEMINUPTO,
    419   OP_TYPEEXACT,
    420 
    421   OP_CRSTAR,                /* These are for character classes and back refs */
    422   OP_CRMINSTAR,
    423   OP_CRPLUS,
    424   OP_CRMINPLUS,
    425   OP_CRQUERY,
    426   OP_CRMINQUERY,
    427   OP_CRRANGE,               /* These are different to the three sets above. */
    428   OP_CRMINRANGE,
    429 
    430   OP_CLASS,                 /* Match a character class, chars < 256 only */
    431   OP_NCLASS,                /* Same, but the bitmap was created from a negative
    432                                class - the difference is relevant when a UTF-8
    433                                character > 255 is encountered. */
    434 
    435   OP_XCLASS,                /* Extended class for handling UTF-8 chars within the
    436                                class. This does both positive and negative. */
    437 
    438   OP_REF,                   /* Match a back reference */
    439 
    440   OP_ALT,                   /* Start of alternation */
    441   OP_KET,                   /* End of group that doesn't have an unbounded repeat */
    442   OP_KETRMAX,               /* These two must remain together and in this */
    443   OP_KETRMIN,               /* order. They are for groups the repeat for ever. */
    444 
    445   /* The assertions must come before ONCE and COND */
    446 
    447   OP_ASSERT,                /* Positive lookahead */
    448   OP_ASSERT_NOT,            /* Negative lookahead */
    449 
    450   /* ONCE and COND must come after the assertions, with ONCE first, as there's
    451   a test for >= ONCE for a subpattern that isn't an assertion. */
    452 
    453   OP_ONCE,                  /* Once matched, don't back up into the subpattern */
    454 
    455   OP_BRAZERO,               /* These two must remain together and in this */
    456   OP_BRAMINZERO,            /* order. */
    457 
    458   OP_BRANUMBER,             /* Used for extracting brackets whose number is greater
    459                                than can fit into an opcode. */
    460 
    461   OP_BRA                    /* This and greater values are used for brackets that
    462                                extract substrings up to EXTRACT_BASIC_MAX. After
    463                                that, use is made of OP_BRANUMBER. */
    464 };
     356#define FOR_EACH_OPCODE(macro) \
     357    macro(END) \
     358    \
     359    macro(NOT_WORD_BOUNDARY) \
     360    macro(WORD_BOUNDARY) \
     361    macro(NOT_DIGIT) \
     362    macro(DIGIT) \
     363    macro(NOT_WHITESPACE) \
     364    macro(WHITESPACE) \
     365    macro(NOT_WORDCHAR) \
     366    macro(WORDCHAR) \
     367    \
     368    macro(ANY) \
     369    \
     370    macro(CIRC) \
     371    macro(DOLL) \
     372    macro(CHAR) \
     373    macro(CHARNC) \
     374    macro(ASCII_CHAR) \
     375    macro(ASCII_LETTER_NC) \
     376    macro(NOT) \
     377    \
     378    macro(STAR) \
     379    macro(MINSTAR) \
     380    macro(PLUS) \
     381    macro(MINPLUS) \
     382    macro(QUERY) \
     383    macro(MINQUERY) \
     384    macro(UPTO) \
     385    macro(MINUPTO) \
     386    macro(EXACT) \
     387    \
     388    macro(NOTSTAR) \
     389    macro(NOTMINSTAR) \
     390    macro(NOTPLUS) \
     391    macro(NOTMINPLUS) \
     392    macro(NOTQUERY) \
     393    macro(NOTMINQUERY) \
     394    macro(NOTUPTO) \
     395    macro(NOTMINUPTO) \
     396    macro(NOTEXACT) \
     397    \
     398    macro(TYPESTAR) \
     399    macro(TYPEMINSTAR) \
     400    macro(TYPEPLUS) \
     401    macro(TYPEMINPLUS) \
     402    macro(TYPEQUERY) \
     403    macro(TYPEMINQUERY) \
     404    macro(TYPEUPTO) \
     405    macro(TYPEMINUPTO) \
     406    macro(TYPEEXACT) \
     407    \
     408    macro(CRSTAR) \
     409    macro(CRMINSTAR) \
     410    macro(CRPLUS) \
     411    macro(CRMINPLUS) \
     412    macro(CRQUERY) \
     413    macro(CRMINQUERY) \
     414    macro(CRRANGE) \
     415    macro(CRMINRANGE) \
     416    \
     417    macro(CLASS) \
     418    macro(NCLASS) \
     419    macro(XCLASS) \
     420    \
     421    macro(REF) \
     422    \
     423    macro(ALT) \
     424    macro(KET) \
     425    macro(KETRMAX) \
     426    macro(KETRMIN) \
     427    \
     428    macro(ASSERT) \
     429    macro(ASSERT_NOT) \
     430    \
     431    macro(ONCE) \
     432    \
     433    macro(BRAZERO) \
     434    macro(BRAMINZERO) \
     435    macro(BRANUMBER) \
     436    macro(BRA)
     437
     438#define OPCODE_ENUM_VALUE(opcode) OP_##opcode,
     439enum { FOR_EACH_OPCODE(OPCODE_ENUM_VALUE) };
    465440
    466441/* WARNING WARNING WARNING: There is an implicit assumption in pcre.c and
     
    475450
    476451#define EXTRACT_BASIC_MAX  100
    477 
    478452
    479453/* This macro defines the length of fixed length operations in the compiled
     
    522496
    523497
    524 /* Error code numbers. They are given names so that they can more easily be
    525 tracked. */
    526 
    527 enum { ERR0,  ERR1,  ERR2,  ERR3,  ERR4,  ERR5,  ERR6,  ERR7,  ERR8,  ERR9,
    528        ERR10, ERR11, ERR12, ERR13, ERR14, ERR15, ERR16, ERR17, ERR18, ERR19,
    529        ERR20, ERR21, ERR22, ERR23, ERR24, ERR25, ERR26, ERR27, ERR28, ERR29,
    530        ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
    531        ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47 };
    532 
    533498/* The real format of the start of the pcre block; the index of names and the
    534499code vector run on as long as necessary after the end. We store an explicit
     
    573538} compile_data;
    574539
    575 /* When compiling in a mode that doesn't use recursive calls to match(),
    576 a structure is used to remember local variables on the heap. It is defined in
    577 pcre.c, close to the match() function, so that it is easy to keep it in step
    578 with any changes of local variable. However, the pointer to the current frame
    579 must be saved in some "static" place over a longjmp(). We declare the
    580 structure here so that we can put a pointer in the match_data structure.
    581 NOTE: This isn't used for a "normal" compilation of pcre. */
    582 
    583 struct heapframe;
    584 
    585 /* Structure for passing "static" information around between the functions
    586 doing traditional NFA matching, so that they are thread-safe. */
    587 
    588 typedef struct match_data {
    589   unsigned long int match_call_count;      /* As it says */
    590   int   *offset_vector;         /* Offset vector */
    591   int    offset_end;            /* One past the end */
    592   int    offset_max;            /* The maximum usable for return data */
    593   const uschar *lcc;            /* Points to lower casing table */
    594   const uschar *ctypes;         /* Points to table of type maps */
    595   BOOL   offset_overflow;       /* Set if too many extractions */
    596   USPTR  start_subject;         /* Start of the subject string */
    597   USPTR  end_subject;           /* End of the subject string */
    598   USPTR  start_match;           /* Start of this match attempt */
    599   USPTR  end_match_ptr;         /* Subject position at end match */
    600   int    end_offset_top;        /* Highwater mark at end of match */
    601   BOOL   multiline;
    602   BOOL   caseless;
    603 } match_data;
    604 
    605540/* Bit definitions for entries in the pcre_ctypes table. */
    606541
    607542#define ctype_space   0x01
    608 #define ctype_digit   0x04
    609543#define ctype_xdigit  0x08
    610544#define ctype_word    0x10   /* alphameric or '_' */
     
    651585extern const uschar _pcre_default_tables[];
    652586
    653 extern const uschar _pcre_OP_lengths[];
    654 
    655587
    656588/* Internal shared functions. These are functions that are used by more than
Note: See TracChangeset for help on using the changeset viewer.