Context Navigation

← Previous Change
Next Change →

Lexer.cpp

Timestamp:

Apr 29, 2015, 9:33:12 AM (10 years ago)

Author:

Darin Adler

Message:

[ES6] Implement Unicode code point escapes
https://bugs.webkit.org/show_bug.cgi?id=144377

Reviewed by Antti Koivisto.

Source/JavaScriptCore:

parser/Lexer.cpp: Moved the UnicodeHexValue class in here from

the header. Made it a non-member class so it doesn't need to be part
of a template. Made it use UChar32 instead of int for the value to
make it clearer what goes into this class.
(JSC::ParsedUnicodeEscapeValue::isIncomplete): Added. Replaces the
old type() function.
(JSC::Lexer<CharacterType>::parseUnicodeEscape): Renamed from
parseFourDigitUnicodeHex and added support for code point escapes.
(JSC::isLatin1): Added an overload for UChar32.
(JSC::isIdentStart): Changed this to take UChar32; no caller tries
to call it with a UChar, so no need to overload for that type for now.
(JSC::isNonLatin1IdentPart): Changed argument type to UChar32 for clarity.
Also added FIXME about a subtle ES6 change that we might want to make later.
(JSC::isIdentPart): Changed this to take UChar32; no caller tries
to call it with a UChar, so no need to overload for that type for now.
(JSC::isIdentPartIncludingEscapeTemplate): Made this a template so that we
don't need to repeat the code twice. Added code to handle code point escapes.
(JSC::isIdentPartIncludingEscape): Call the template instead of having the
code in line.
(JSC::Lexer<CharacterType>::recordUnicodeCodePoint): Added.
(JSC::Lexer<CharacterType>::parseIdentifierSlowCase): Made small tweaks and
updated to call parseUnicodeEscape instead of parseFourDigitUnicodeHex.
(JSC::Lexer<CharacterType>::parseComplexEscape): Call parseUnicodeEscape
instead of parseFourDigitUnicodeHex. Move the code to handle "\u" before
the code that handles the escapes, since the code point escape code now
consumes characters while parsing rather than peeking ahead. Test case
covers this: Symptom would be that "\u{" would evaluate to "u" instead of
giving a syntax error.

parser/Lexer.h: Updated for above changes.

runtime/StringConstructor.cpp:

(JSC::stringFromCodePoint): Use ICU's UCHAR_MAX_VALUE instead of writing
out 0x10FFFF; clearer this way.

Source/WebCore:

Test: js/unicode-escape-sequences.html

css/CSSParser.cpp:

(WebCore::CSSParser::parseEscape): Use ICU's UCHAR_MAX_VALUE instead of writing
out 0x10FFFF; clearer this way. Also use our replacementCharacter instead of
writing out 0xFFFD.

html/parser/HTMLEntityParser.cpp:

(WebCore::isAlphaNumeric): Deleted.
(WebCore::HTMLEntityParser::legalEntityFor): Use ICU's UCHAR_MAX_VALUE and
U_IS_SURROGATE instead of writing the code out. Didn't use U_IS_UNICODE_CHAR
because that also includes U_IS_UNICODE_NONCHAR and thus would change behavior,
but maye it's something we want to do in the future.
(WebCore::HTMLEntityParser::consumeNamedEntity): Use isASCIIAlphanumeric instead
of a the function in this file that does the same thing less efficiently.

html/parser/InputStreamPreprocessor.h:

(WebCore::InputStreamPreprocessor::processNextInputCharacter): Use
replacementCharacter from CharacterNames.h instead of writing out 0xFFFd.

xml/parser/CharacterReferenceParserInlines.h:

(WebCore::consumeCharacterReference): Use ICU's UCHAR_MAX_VALUE instead of
defining our own local highestValidCharacter constant.

LayoutTests:

js/script-tests/unicode-escape-sequences.js: Added.
js/unicode-escape-sequences-expected.txt: Added.
js/unicode-escape-sequences.html: Added. Generated with make-script-test-wrappers.

File:

: 1 edited

trunk/Source/JavaScriptCore/parser/Lexer.cpp (modified) (11 diffs)

Legend:

: Unmodified
: Added
: Removed

trunk/Source/JavaScriptCore/parser/Lexer.cpp

-              r183373
+              r183552
+}
+template <typename T>
+typename Lexer<T>::UnicodeHexValue Lexer<T>::parseFourDigitUnicodeHex()
+{
+    T char1 = peek(1);
+    T char2 = peek(2);
+    T char3 = peek(3);
+    if (UNLIKELY(!isASCIIHexDigit(m_current) || !isASCIIHexDigit(char1) || !isASCIIHexDigit(char2) || !isASCIIHexDigit(char3)))
+        return UnicodeHexValue((m_code + 4) >= m_codeEnd ? UnicodeHexValue::IncompleteHex : UnicodeHexValue::InvalidHex);
+    int result = convertUnicode(m_current, char1, char2, char3);
+struct ParsedUnicodeEscapeValue {
+    ParsedUnicodeEscapeValue(UChar32 value)
+        : m_value(value)
+    {
+        ASSERT(isValid());
+    }
+    enum SpecialValueType { Incomplete = -2, Invalid = -1 };
+    ParsedUnicodeEscapeValue(SpecialValueType type)
+        : m_value(type)
+    {
+    }
+    bool isValid() const { return m_value >= 0; }
+    bool isIncomplete() const { return m_value == Incomplete; }
+    UChar32 value() const
+    {
+        ASSERT(isValid());
+        return m_value;
+    }
+private:
+    UChar32 m_value;
+};
+template<typename CharacterType> ParsedUnicodeEscapeValue Lexer<CharacterType>::parseUnicodeEscape()
+{
+    if (m_current == '{') {
+        shift();
+        UChar32 codePoint = 0;
+        do {
+            if (!isASCIIHexDigit(m_current))
+                return m_current ? ParsedUnicodeEscapeValue::Invalid : ParsedUnicodeEscapeValue::Incomplete;
+            codePoint = (codePoint << 4) | toASCIIHexValue(m_current);
+            if (codePoint > UCHAR_MAX_VALUE)
+                return ParsedUnicodeEscapeValue::Invalid;
+            shift();
+        } while (m_current != '}');
+        shift();
+        return codePoint;
+    }
+    auto character2 = peek(1);
+    auto character3 = peek(2);
+    auto character4 = peek(3);
+    if (UNLIKELY(!isASCIIHexDigit(m_current) || !isASCIIHexDigit(character2) || !isASCIIHexDigit(character3) || !isASCIIHexDigit(character4)))
+        return (m_code + 4) >= m_codeEnd ? ParsedUnicodeEscapeValue::Incomplete : ParsedUnicodeEscapeValue::Invalid;
+    auto result = convertUnicode(m_current, character2, character3, character4);
     shift();
     shift();
     shift();
     shift();
     return UnicodeHexValue(result);
+    return result;
+}
 …
+}
+static ALWAYS_INLINE bool isLatin1(UChar32 c)
+{
+    return !(c & ~0xFF);
+}
 static inline bool isIdentStart(LChar c)
+{
 …
+}
 static inline bool isIdentStart(UChar c)
+static inline bool isIdentStart(UChar32 c)
+{
     return isLatin1(c) ? isIdentStart(static_cast<LChar>(c)) : isNonLatin1IdentStart(c);
+}
+static NEVER_INLINE bool isNonLatin1IdentPart(int c)
+{
+static NEVER_INLINE bool isNonLatin1IdentPart(UChar32 c)
+{
+    // FIXME: ES6 says this should be based on the Unicode property ID_Continue now instead.
     return (U_GET_GC_MASK(c) & (U_GC_L_MASK | U_GC_MN_MASK | U_GC_MC_MASK | U_GC_ND_MASK | U_GC_PC_MASK)) || c == 0x200C || c == 0x200D;
+}
 …
+}
+static ALWAYS_INLINE bool isIdentPart(UChar32 c)
+{
+    return isLatin1(c) ? isIdentPart(static_cast<LChar>(c)) : isNonLatin1IdentPart(c);
+}
 static ALWAYS_INLINE bool isIdentPart(UChar c)
+{
+    return isLatin1(c) ? isIdentPart(static_cast<LChar>(c)) : isNonLatin1IdentPart(c);
+}
+template <typename T>
+bool isUnicodeEscapeIdentPart(const T* code)
+{
+    T char1 = code[0];
+    T char2 = code[1];
+    T char3 = code[2];
+    T char4 = code[3];
+    if (!isASCIIHexDigit(char1) || !isASCIIHexDigit(char2) || !isASCIIHexDigit(char3) || !isASCIIHexDigit(char4))
+    return isIdentPart(static_cast<UChar32>(c));
+}
+template<typename CharacterType> ALWAYS_INLINE bool isIdentPartIncludingEscapeTemplate(const CharacterType* code, const CharacterType* codeEnd)
+{
+    if (isIdentPart(code[0]))
+        return true;
+    // Shortest sequence handled below is \u{0}, which is 5 characters.
+    if (!(code[0] == '\\' && codeEnd - code >= 5 && code[1] == 'u'))
         return false;
+    return isIdentPart(Lexer<T>::convertUnicode(char1, char2, char3, char4));
+    if (code[2] == '{') {
+        UChar32 codePoint = 0;
+        const CharacterType* pointer;
+        for (pointer = &code[3]; pointer < codeEnd; ++pointer) {
+            auto digit = *pointer;
+            if (!isASCIIHexDigit(digit))
+                break;
+            codePoint = (codePoint << 4) | toASCIIHexValue(digit);
+            if (codePoint > UCHAR_MAX_VALUE)
+                return false;
+        }
+        return isIdentPart(codePoint) && pointer < codeEnd && *pointer == '}';
+    }
+    // Shortest sequence handled below is \uXXXX, which is 6 characters.
+    if (codeEnd - code < 6)
+        return false;
+    auto character1 = code[2];
+    auto character2 = code[3];
+    auto character3 = code[4];
+    auto character4 = code[5];
+    return isASCIIHexDigit(character1) && isASCIIHexDigit(character2) && isASCIIHexDigit(character3) && isASCIIHexDigit(character4)
+        && isIdentPart(Lexer<LChar>::convertUnicode(character1, character2, character3, character4));
+}
 static ALWAYS_INLINE bool isIdentPartIncludingEscape(const LChar* code, const LChar* codeEnd)
+{
+    if (isIdentPart(*code))
+        return true;
+    return (*code == '\\' && ((codeEnd - code) >= 6) && code[1] == 'u' && isUnicodeEscapeIdentPart(code+2));
+    return isIdentPartIncludingEscapeTemplate(code, codeEnd);
+}
 static ALWAYS_INLINE bool isIdentPartIncludingEscape(const UChar* code, const UChar* codeEnd)
+{
+    if (isIdentPart(*code))
+        return true;
+    return (*code == '\\' && ((codeEnd - code) >= 6) && code[1] == 'u' && isUnicodeEscapeIdentPart(code+2));
+    return isIdentPartIncludingEscapeTemplate(code, codeEnd);
+}
 …
+}
+template<typename CharacterType> inline void Lexer<CharacterType>::recordUnicodeCodePoint(UChar32 codePoint)
+{
+    ASSERT(codePoint >= 0);
+    ASSERT(codePoint <= UCHAR_MAX_VALUE);
+    if (U_IS_BMP(codePoint))
+        record16(codePoint);
+    else {
+        UChar codeUnits[2] = { U16_LEAD(codePoint), U16_TRAIL(codePoint) };
+        append16(codeUnits, 2);
+    }
+}
 #if !ASSERT_DISABLED
 bool isSafeBuiltinIdentifier(VM& vm, const Identifier* ident)
 …
      * be used as a safety net while implementing builtins.
      */
+    // FIXME: How can a debug-only assertion be a safety net?
     if (*ident == vm.propertyNames->builtinNames().callPublicName())
         return false;
 …
+}
+template <typename T>
+template <bool shouldCreateIdentifier> JSTokenType Lexer<T>::parseIdentifierSlowCase(JSTokenData* tokenData, unsigned lexerFlags, bool strictMode)
+template<typename CharacterType> template<bool shouldCreateIdentifier> JSTokenType Lexer<CharacterType>::parseIdentifierSlowCase(JSTokenData* tokenData, unsigned lexerFlags, bool strictMode)
+{
     const ptrdiff_t remaining = m_codeEnd - m_code;
     const T* identifierStart = currentSourcePtr();
+    auto identifierStart = currentSourcePtr();
     bool bufferRequired = false;
 …
             return atEnd() ? UNTERMINATED_IDENTIFIER_ESCAPE_ERRORTOK : INVALID_IDENTIFIER_ESCAPE_ERRORTOK;
         shift();
         UnicodeHexValue character = parseFourDigitUnicodeHex();
+        auto character = parseUnicodeEscape();
         if (UNLIKELY(!character.isValid()))
+            return character.valueType() == UnicodeHexValue::IncompleteHex ? UNTERMINATED_IDENTIFIER_UNICODE_ESCAPE_ERRORTOK : INVALID_IDENTIFIER_UNICODE_ESCAPE_ERRORTOK;
+        UChar ucharacter = static_cast<UChar>(character.value());
+        if (UNLIKELY(m_buffer16.size() ? !isIdentPart(ucharacter) : !isIdentStart(ucharacter)))
+            return character.isIncomplete() ? UNTERMINATED_IDENTIFIER_UNICODE_ESCAPE_ERRORTOK : INVALID_IDENTIFIER_UNICODE_ESCAPE_ERRORTOK;
+        if (UNLIKELY(m_buffer16.size() ? !isIdentPart(character.value()) : !isIdentStart(character.value())))
             return INVALID_IDENTIFIER_UNICODE_ESCAPE_ERRORTOK;
         if (shouldCreateIdentifier)
             record16(ucharacter);
+            recordUnicodeCodePoint(character.value());
         identifierStart = currentSourcePtr();
+    }
     int identifierLength;
     const Identifier* ident = 0;
+    const Identifier* ident = nullptr;
     if (shouldCreateIdentifier) {
         if (!bufferRequired) {
 …
         tokenData->ident = ident;
     } else
         tokenData->ident = 0;
+        tokenData->ident = nullptr;
     if (LIKELY(!bufferRequired && !(lexerFlags & LexerFlagsIgnoreReservedWords))) {
 …
     if (m_current == 'u') {
         shift();
-        UnicodeHexValue character = parseFourDigitUnicodeHex();
-        if (character.isValid()) {
-            if (shouldBuildStrings)
-                record16(character.value());
-            return StringParsedSuccessfully;
+        }
         if (escapeParseMode == EscapeParseMode::String && m_current == stringQuoteCharacter) {
 …
+        }
+        auto character = parseUnicodeEscape();
+        if (character.isValid()) {
+            if (shouldBuildStrings)
+                recordUnicodeCodePoint(character.value());
+            return StringParsedSuccessfully;
+        }
         m_lexErrorMessage = ASCIILiteral("\\u can only be followed by a Unicode character sequence");
         return character.valueType() == UnicodeHexValue::IncompleteHex ? StringUnterminated : StringCannotBeParsed;
+        return character.isIncomplete() ? StringUnterminated : StringCannotBeParsed;
+    }

Note: See TracChangeset for help on using the changeset viewer.

Context Navigation

Changeset 183552 in webkit for trunk/Source/JavaScriptCore/parser/Lexer.cpp

Legend:

trunk/Source/JavaScriptCore/parser/Lexer.cpp

Download in other formats: