Kornelius Kalnbach, 09/12/2010 12:42 PM
Scanner Requests
Scanners are the heart of CodeRay. They split input code into tokens and classify them.
Each language has its own scanner; you can see which languages are currently supported in the repository.
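To illustrate what "splitting into tokens and classifying them" means, here is a minimal sketch that uses only Ruby's standard StringScanner. The tiny_tokenize helper and its handful of token kinds are made up for this example and are not part of CodeRay; a real scanner is far more thorough.

```ruby
require 'strscan'

# Hypothetical helper (not CodeRay API): returns an array of [text, kind] pairs.
def tiny_tokenize(code)
  scanner = StringScanner.new(code)
  tokens = []
  until scanner.eos?
    if (text = scanner.scan(/\s+/))
      tokens << [text, :space]
    elsif (text = scanner.scan(/\d+/))
      tokens << [text, :integer]
    elsif (text = scanner.scan(/[A-Za-z_]\w*/))
      tokens << [text, :ident]
    elsif (text = scanner.scan(/[=+\-*\/]/))
      tokens << [text, :operator]
    else
      # Unknown input: consume one character so the loop always advances.
      tokens << [scanner.getch, :error]
    end
  end
  tokens
end

p tiny_tokenize('x = 42')
# => [["x", :ident], [" ", :space], ["=", :operator], [" ", :space], ["42", :integer]]
```

Every character of the input ends up in exactly one token, and unknown characters are marked as :error rather than dropped.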
Why is the CodeRay language support list so short?
CodeRay development is a slow process because there is only one active developer, and he insists on high software quality.
Special attention is paid to the scanners: every CodeRay scanner is tested carefully against lots of example source code, as well as randomized and junk code, to make it safe. A CodeRay scanner is not officially released unless it highlights very, very well.
I need a new Scanner - What can I do?
Here's what you can do to speed up the development of a new scanner:
- Request it! File a new ticket unless one already exists; in that case, add a +1 or something similar to the existing ticket to show your interest.
- Upload or link to example code in the ticket discussion.
- Typical code in large quantities is very helpful, also for benchmarking.
- But we also need the weirdest, strangest code you can find to make the scanner safe.
- Provide links to useful information about the language's lexical rules, such as:
- a list of reserved words (Did you know that "void" is a JavaScript keyword?)
- rules for string and number literals (Can a double quoted string contain a newline?)
- rules for comments and other token types (Does the language have a special syntax for multiline comments?)
- a description of any unusual syntactic features (There's this weird %w() thing in Ruby...)
- If there are different versions / implementations / dialects of this language: How do they differ?
- Give examples of good and bad highlighters / syntax definitions for the language (usually from editors or other libraries).
- Find more example code!
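As a small illustration of the kind of lexical detail meant above, here are two of the Ruby quirks mentioned in the list, written out in plain Ruby. The variable names are made up for this sketch; the point is only that a scanner author needs to know such rules exist.

```ruby
# %w() builds an array of strings without quotes or commas.
words = %w(true false null)
p words.length

# Other delimiters are allowed as well, so a Ruby scanner has to track them.
p %w[true false null] == words

# A double-quoted Ruby string literal may span lines; the newline is kept.
text = "first
second"
p text.include?("\n")
```

Links to the official rules for constructs like these save a lot of guessing when the scanner is written.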
Also, read the next section.
I want to write a Scanner myself
Wow, you're brave! Writing CodeRay scanners is not an easy task because:
- You need excellent knowledge about the language you want to scan. Every language has a dark side!
- You need good knowledge of (Ruby) regular expressions.
- There's no documentation to speak of.
- But this is a wiki hint hint ;o)
But it has been done before, so go and try it!
- You should still request the scanner (as described above) and announce that you are working on a patch yourself.
- Check out the repository and try the test suite (rake test:scanners[:lang]).
- Copy a scanner of your choice as a base. You would know what language comes closest.
- Create a test case directory in test/scanners.
- --- Advertisement --- (No, just kidding.)
- Write your scanner!
- Also, look into lib/coderay/scanners/_map.rb and lib/coderay/helpers/file_type.rb.
- Make a patch (scanner, test cases and other changes) and upload it to the ticket.
- Follow the ensuing discussion.
- Prepare to be added to the THX list.
Contact me (murphy rubychan de) if you have any questions.
What does a Scanner look like?
For example, the JSON scanner:
Warning: This is the 0.9 API; 1.0 will change it, see ticket #142 and look at the 1.0-compatible version of this example.
# Namespace; use this form instead of CodeRay::Scanners to avoid messages like
# "uninitialized constant CodeRay" when testing it.
module CodeRay
  module Scanners
    # Always inherit from CodeRay::Scanners::Scanner.
    #
    # Scanner inherits directly from StringScanner, the Ruby class for fast
    # string scanning. Read the documentation to understand what's going on here:
    #
    # http://www.ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
    class JSON < Scanner
      # This means your scanner is providing a token stream, that is, it doesn't
      # touch a token after it has been added to tokens. Streamable scanners
      # can be used in streaming mode to minimize memory usage.
      include Streamable

      # Scanners are plugins and must be registered like this.
      register_for :json

      # See the WordList documentation.
      CONSTANTS = %w( true false null )
      IDENT_KIND = WordList.new(:key).add(CONSTANTS, :reserved)

      ESCAPE = / [bfnrt\\"\/] /x
      UNICODE_ESCAPE = / u[a-fA-F0-9]{4} /x

      # This is the only method you need to define. It scans code.
      #
      # tokens is a Tokens or TokenStream object. Use tokens << [token, kind] to
      # add tokens to it.
      #
      # options is reserved for later use.
      #
      # scan_tokens must return the tokens variable it was given.
      #
      # You are completely free to use any style you want, just make sure tokens
      # gets what it needs. But typically, a Scanner follows this scheme.
      #
      # See http://json.org/ for a definition of the JSON lexical rules and grammar.
      def scan_tokens tokens, options
        # The scanner is always in a certain state, which is :initial by default.
        # We use local variables and symbols to maximize speed.
        state = :initial

        # Sometimes, you need a stack. Ruby arrays are perfect for this.
        stack = []

        # Define more flags and variables as you need them.
        string_delimiter = nil
        key_expected = false

        # The main loop. eos? is true when the end of the code is reached.
        until eos?
          # You can either add directly to tokens or set these variables. See
          # the end of this method to understand how they are used.
          kind = nil
          match = nil

          # Depending on the state, we want to do different things.
          case state

          # Normally, we use this case.
          when :initial
            # I like the / ... /x style regexps because white space makes them more
            # readable. x means white space is ignored.
            if match = scan(/ \s+ | \\\n /x)
              # White space and masked line ends are :space. We're using the tokens <<
              # style here instead of setting kind and match, because it is faster.
              # Just make sure you never send an empty token! /\s*/ for example would be
              # very bad (actually creating infinite loops).
              tokens << [match, :space]
              next
            elsif match = scan(/ [:,\[{\]}] /x)
              # Operators of JSON. stack is used to determine where we are. stack and
              # key_expected are set depending on which operator was found.
              # key_expected is used to decide whether a "quoted" thing should be
              # classified as key or string.
              kind = :operator
              case match
              when '{' then stack << :object; key_expected = true
              when '[' then stack << :array
              when ':' then key_expected = false
              when ',' then key_expected = true if stack.last == :object
              when '}', ']' then stack.pop # no error recovery, but works for valid JSON
              end
            elsif match = scan(/ true | false | null /x)
              # These are the only idents that are allowed in JSON. Normally, IDENT_KIND
              # would be used to tell keywords and idents apart.
              kind = IDENT_KIND[match]
            elsif match = scan(/-?(?:0|[1-9]\d*)/)
              # Pay attention to the details: JSON doesn't allow numbers like 00.
              kind = :integer
              if scan(/\.\d+(?:[eE][-+]?\d+)?|[eE][-+]?\d+/)
                match << matched
                kind = :float
              end
            elsif match = scan(/"/)
              # A "quoted" token was found, and we know whether it is a key or a string.
              state = key_expected ? :key : :string
              # This opens a token group.
              tokens << [:open, state]
              kind = :delimiter
            else
              # Don't forget to add this case: If we reach invalid code, we try to discard
              # chars one by one and mark them as :error.
              getch
              kind = :error
            end

          # String scanning is a bit more complicated, so we use another state for it.
          # The scanner stays in :string state until the string ends or an error occurs.
          #
          # JSON uses the same notation for strings and keys. We want keys to be in a
          # different color, but the lexical rules are the same. This is why we use this
          # case also for the :key state.
          when :string, :key
            # Another if-elsif-else-switch, for strings this time.
            if scan(/[^\\"]+/)
              # Everything that is not \ or " is just string content.
              kind = :content
            elsif scan(/"/)
              # A " is found, which means this string or key is ending here.
              # A special token class, :delimiter, is used for tokens like this one.
              tokens << ['"', :delimiter]
              # Always close your token groups!
              tokens << [:close, state]
              # We're going back to normal scanning here.
              state = :initial
              # Skip the rest of the loop, since we used tokens <<.
              next
            elsif scan(/ \\ (?: #{ESCAPE} | #{UNICODE_ESCAPE} ) /mox)
              # A valid special character should be classified as :char.
              kind = :char
            elsif scan(/\\./m)
              # Anything else that is escaped (including \n, we use the m modifier) is
              # just content.
              kind = :content
            elsif scan(/ \\ | $ /x)
              # A string that suddenly ends in the middle, or reaches the end of the
              # line. This is an error; we close the token group that was opened for
              # this string or key and go back to :initial now.
              tokens << [:close, state]
              kind = :error
              state = :initial
            else
              # Nice for debugging. Should never happen.
              raise_inspect "else case \" reached; %p not handled." % peek(1), tokens
            end

          else
            # Nice for debugging. Should never happen.
            raise_inspect 'Unknown state: %p' % [state], tokens
          end

          # Unless the match local variable was set, use matched.
          match ||= matched

          # Debugging. Empty tokens and undefined kind are bad.
          if $CODERAY_DEBUG and not kind
            raise_inspect 'Error token %p in line %d' %
              [[match, kind], line], tokens
          end
          raise_inspect 'Empty token', tokens unless match

          # Finally, add the token and loop.
          tokens << [match, kind]
        end

        # If we still have a string or key token group open, close it.
        if [:string, :key].include? state
          tokens << [:close, state]
        end

        # Return tokens. This is the only rule to follow.
        tokens
      end
    end
  end
end
Highlighted with CodeRay, yeah :D
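Two details from the comments in the listing can be tried with plain StringScanner, without CodeRay installed: the warning about patterns that match the empty string, and the rule that JSON integers have no leading zeros. This is only a sketch; the variable names are made up for illustration.

```ruby
require 'strscan'

# Pitfall 1: a pattern that can match the empty string.
s = StringScanner.new('abc')
empty = s.scan(/\s*/)   # succeeds even though there is no white space
p empty                 # an empty String, which is still truthy in Ruby
p s.pos                 # the scan position has not advanced

# Pitfall 2: JSON integers must not have leading zeros.
integer = /-?(?:0|[1-9]\d*)/
t = StringScanner.new('007')
first = t.scan(integer) # only the first "0" is consumed
p first
p t.rest                # the remaining "07" is left for the next iterations
```

This is why the scanner uses \s+ rather than \s*: a branch that succeeds without consuming any input produces an empty token, and the main loop never reaches eos?.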