Description
This is how Rust identifiers' lexical syntax is defined: https://doc.rust-lang.org/reference/identifiers.html
This is how the lexer for Rust identifiers is implemented:
rust/compiler/rustc_lexer/src/lib.rs
Lines 264 to 297 in ce0d64e
/// True if `c` is valid as a first character of an identifier.
/// See [Rust language reference](https://doc.rust-lang.org/reference/identifiers.html) for
/// a formal definition of valid identifier name.
pub fn is_id_start(c: char) -> bool {
    // This is XID_Start OR '_' (which formally is not a XID_Start).
    // We also add fast-path for ascii idents
    ('a'..='z').contains(&c)
        || ('A'..='Z').contains(&c)
        || c == '_'
        || (c > '\x7f' && unicode_xid::UnicodeXID::is_xid_start(c))
}

/// True if `c` is valid as a non-first character of an identifier.
/// See [Rust language reference](https://doc.rust-lang.org/reference/identifiers.html) for
/// a formal definition of valid identifier name.
pub fn is_id_continue(c: char) -> bool {
    // This is exactly XID_Continue.
    // We also add fast-path for ascii idents
    ('a'..='z').contains(&c)
        || ('A'..='Z').contains(&c)
        || ('0'..='9').contains(&c)
        || c == '_'
        || (c > '\x7f' && unicode_xid::UnicodeXID::is_xid_continue(c))
}

/// The passed string is lexically an identifier.
pub fn is_ident(string: &str) -> bool {
    let mut chars = string.chars();
    if let Some(start) = chars.next() {
        is_id_start(start) && chars.all(is_id_continue)
    } else {
        false
    }
}
The specification says an identifier should start with an ASCII alphabetic character and continue with ASCII alphanumeric characters or underscores. The implementation, however, follows the Unicode default identifier syntax (http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax), which, as far as I understand, is much more general: any non-ASCII character with the XID_Start or XID_Continue property is accepted.
I think either the code or the language reference should be updated, but I'm not sure which one.
(I didn't check the lexing of other tokens; it might be useful to compare those with the language reference's definitions too.)
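To make the discrepancy concrete, here is a minimal sketch that compares the two rules. Both functions are hypothetical helpers written for this issue, not rustc code. `is_ascii_only_ident` follows the ASCII-only rule as summarized above. `is_unicode_like_ident` is only a rough stand-in for the lexer: it uses the standard library's `char::is_alphabetic` and `char::is_alphanumeric` instead of the `unicode_xid` crate, and those properties are not identical to XID_Start and XID_Continue, so treat it as an approximation.

```rust
/// ASCII-only rule as summarized in this issue (hypothetical helper):
/// the first character is ASCII alphabetic, the rest are ASCII
/// alphanumeric or '_'.
pub fn is_ascii_only_ident(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => {
            first.is_ascii_alphabetic()
                && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        // The empty string is never an identifier.
        None => false,
    }
}

/// Rough stand-in for the quoted lexer check (hypothetical helper).
/// ASSUMPTION: `char::is_alphabetic` / `char::is_alphanumeric` only
/// approximate XID_Start / XID_Continue; the real lexer uses `unicode_xid`.
pub fn is_unicode_like_ident(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => {
            (first == '_' || first.is_alphabetic())
                && chars.all(|c| c == '_' || c.is_alphanumeric())
        }
        None => false,
    }
}
```

With these definitions, `foo_1` passes both checks, while a name such as `été` passes only the Unicode-style one, which is the kind of input where the code and the reference disagree.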