Version 19 (Kornelius Kalnbach, 03/31/2012 11:33 PM) → Version 20/22 (Kornelius Kalnbach, 03/31/2012 11:37 PM)

h1. Scanner Requests

Scanners are the heart of CodeRay. They split input code into tokens and classify them.

Each language has its own scanner. You can see which languages are currently supported in the "repository":https://github.com/rubychan/coderay.
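
To give a feeling for what "split into tokens and classify them" means, here is a minimal sketch that uses only Ruby's @StringScanner@. The method @tiny_tokens@ and its three token kinds are made up for illustration and are not part of CodeRay; a real scanner is far more thorough.

```ruby
require 'strscan'

# Hypothetical helper, not part of CodeRay: splits a tiny arithmetic
# language into [text, kind] pairs.
def tiny_tokens(code)
  scanner = StringScanner.new(code)
  tokens = []
  until scanner.eos?
    if (match = scanner.scan(/\s+/))
      tokens << [match, :space]
    elsif (match = scanner.scan(/\d+/))
      tokens << [match, :integer]
    elsif (match = scanner.scan(/[-+*\/()]/))
      tokens << [match, :operator]
    else
      # Unknown input is consumed one character at a time.
      tokens << [scanner.getch, :error]
    end
  end
  tokens
end

p tiny_tokens('1 + 23')
# => [["1", :integer], [" ", :space], ["+", :operator], [" ", :space], ["23", :integer]]
```

Every character of the input ends up in exactly one token, even the ones the scanner does not understand.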

h2. Why is the CodeRay language support list so short?

CodeRay development is a slow process because there is only one active developer, and he insists on high software quality.

Special attention is paid to the scanners: Every CodeRay scanner is tested carefully against lots of example source code, as well as randomized and junk code, to make it safe. A CodeRay scanner is not officially released unless it highlights very, very well.
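
The idea behind testing with junk code can be sketched in a few lines. This is not the actual CodeRay test suite; @toy_scan@ is a hypothetical stand-in for a scanner, and the two invariants checked here (no empty tokens, and the token texts add up to the input again) are just reasonable assumptions about what "safe" should mean.

```ruby
require 'strscan'

# Hypothetical stand-in for a real scanner: returns [text, kind] pairs.
def toy_scan(code)
  scanner = StringScanner.new(code)
  tokens = []
  until scanner.eos?
    if (match = scanner.scan(/\s+/))
      tokens << [match, :space]
    elsif (match = scanner.scan(/\w+/))
      tokens << [match, :ident]
    else
      tokens << [scanner.getch, :error]
    end
  end
  tokens
end

# Feed random junk to the scanner and check two invariants:
# no token is empty, and the token texts add up to the input again.
def survives_junk?(rounds: 200, seed: 42)
  random = Random.new(seed)
  alphabet = ['a', 'Z', '0', '_', ' ', "\n", '"', '\\', '{', '#', '%']
  rounds.times.all? do
    junk = Array.new(random.rand(0..40)) { alphabet.sample(random: random) }.join
    tokens = toy_scan(junk)
    tokens.none? { |text, _| text.empty? } && tokens.map(&:first).join == junk
  end
end

puts survives_junk?
```

A fixed seed keeps a failing run reproducible, which matters a lot once the junk actually finds a bug.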

h2. I need a new Scanner - What can I do?

Here's what you can do to speed up the development of a new scanner:

# Request it! File a "new ticket":http://odd-eyed-code.org/projects/coderay/issues/new unless one already "exists":http://odd-eyed-code.org/projects/coderay/issues?query_id=3; in that case, add a +1 or something to the existing ticket to show your interest.
# Upload or link to *example code* in the ticket discussion.
#* Typical code in large quantities is very helpful, also for benchmarking.
#* But we also need the most *weird and strange code* you can find to make the scanner safe.
# Provide links to useful *information about the lexical rules of the language*, such as:
#* a list of reserved words (Did you know that "void" is a JavaScript keyword?)
#* rules for string and number literals (Can a double quoted string contain a newline?)
#* rules for comments and other token types (Does the language have a special syntax for multiline comments?)
#* a description of any unusual syntactic features (There's this weird %w() thing in Ruby...)
#* If there are different versions / implementations / dialects of this language: How do they differ?
# Give examples of *good and bad highlighters / syntax definitions* for the language (usually from editors or other libraries).
# Find *more example code*!

Also, read the next section.

h2. I want to write a Scanner myself

Wow, you're brave! Writing CodeRay scanners is not an easy task because:

* You need excellent knowledge about the language you want to scan. Every language has a dark side!
* You need good knowledge of (Ruby) regular expressions.
* There's no documentation to speak of.
** But this is a wiki ^hint hint^ ;o)

But it has been done before, so go and try it!

# You should still request the scanner (as described above) and announce that you are working on a patch yourself.
# Check out the [[Repository]] and try the [[Test Suite]].
# Copy a scanner of your choice as a base. You know best which language comes closest.
# Make sure you have run @rake test:scanners@ to get the scanner test suite.
# Create a test case directory in @test/scanners/<lang>@ and add example files for your language.
# Write your scanner, and run your test cases with @rake test:scanner:<lang>@. It is your friend.
# --- Advertisement --- (No, just kidding.)
# Also, look into @lib/coderay/scanners/_map.rb@ and @lib/coderay/helpers/file_type.rb@.
# Make a patch (scanner, test cases and other changes) and upload it to the ticket.
# Follow the ensuing discussion.
# Prepare to be added to the THX list.

Contact me (murphy rubychan de) if you have any questions.

h2. What does a Scanner look like?

For example, the JSON scanner:

<pre><code class="ruby">
# Namespace; use this form instead of CodeRay::Scanners to avoid messages like
# "uninitialized constant CodeRay" when testing it.
module CodeRay
module Scanners

  # Always inherit from CodeRay::Scanners::Scanner.
  #
  # Scanner inherits directly from StringScanner, the Ruby class for fast
  # string scanning. Read the documentation to understand what's going on here:
  #
  #   http://www.ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
  class JSON < Scanner

    # Deprecation notice: The Streamable module is gone.

    # Scanners are plugins and must be registered like this:
    register_for :json

    # You can provide a file extension associated with this language.
    file_extension 'json'

    # List all token kinds that are not considered to be running code
    # in this language. For a typical language, this would just be
    # :comment, but for a data or markup language like JSON, no tokens
    # should count as lines of code.
    KINDS_NOT_LOC = [
      :float, :char, :content, :delimiter,
      :error, :integer, :operator, :value,
    ]  # :nodoc:

    # See the WordList documentation.
    CONSTANTS = %w( true false null )
    IDENT_KIND = WordList.new(:key).add(CONSTANTS, :value)

    ESCAPE = / [bfnrt\\"\/] /x
    UNICODE_ESCAPE = / u[a-fA-F0-9]{4} /x

    # This is the only method you need to define. It scans code.
    #
    # encoder is an object which encodes tokens. It provides the following API:
    # * encoder.text_token(text, kind) for tokens
    # * encoder.begin_group(kind) and encoder.end_group(kind) for token groups
    # * encoder.begin_line(kind) and encoder.end_line(kind) for line tokens
    #
    # options is a hash. Standard options are:
    # * keep_state: Try to save the current scanner state and restore it in the
    #   next call of scan_tokens.
    #
    # scan_tokens must return the encoder variable it was given.
    #
    # You are completely free to use any style you want, just make sure encoder
    # gets what it needs. But typically, a Scanner follows this scheme:
    def scan_tokens encoder, options

      # The scanner is always in a certain state, which is :initial by default.
      # We use local variables and symbols to maximize speed.
      state = :initial

      # Sometimes, you need a stack. Ruby arrays are perfect for this.
      stack = []

      # Define more flags and variables as you need them.
      key_expected = false

      # The main loop; eos? is true when the end of the code is reached.
      until eos?

        # Deprecation notice: The use of the local variables kind and match is
        # no longer recommended.

        # Depending on the state, we want to do different things.
        case state

        # Normally, we use this case.
        when :initial
          # I like the / ... /x style regexps because white space makes them more
          # readable. x means white space is ignored.
          if match = scan(/ \s+ /x)
            # White space and masked line ends are :space.
            # Make sure you never send an empty token! /\s*/ for example would be
            # very bad (actually creating an infinite loop).
            encoder.text_token match, :space
          elsif match = scan(/ [:,\[{\]}] /x)
            # Operators of JSON. stack is used to determine where we are. stack and
            # key_expected are set depending on which operator was found.
            # key_expected is used to decide whether a "quoted" thing should be
            # classified as key or string.
            encoder.text_token match, :operator
            case match
            when '{' then stack << :object; key_expected = true
            when '[' then stack << :array
            when ':' then key_expected = false
            when ',' then key_expected = true if stack.last == :object
            when '}', ']' then stack.pop  # no error recovery, but works for valid JSON
            end
          elsif match = scan(/ true | false | null /x)
            # These are the only idents that are allowed in JSON. Normally, IDENT_KIND
            # would be used to tell keywords and idents apart.
            encoder.text_token match, IDENT_KIND[match]
          elsif match = scan(/ -? (?: 0 | [1-9]\d* ) /x)
            # Pay attention to the details: JSON doesn't allow numbers like 00.
            if scan(/ \.\d+ (?:[eE][-+]?\d+)? | [eE][-+]? \d+ /x)
              match << matched
              encoder.text_token match, :float
            else
              encoder.text_token match, :integer
            end
          elsif match = scan(/"/)
            # A "quoted" token was found, and we know whether it is a key or a string.
            state = key_expected ? :key : :string
            # This opens a token group and encodes the delimiter token.
            encoder.begin_group state
            encoder.text_token match, :delimiter
          else
            # Don't forget to add this case: If we reach invalid code, we try to discard
            # chars one by one and mark them as :error.
            encoder.text_token getch, :error
          end

        # String scanning is a bit more complicated, so we use another state for it.
        # The scanner stays in :string state until the string ends or an error occurs.
        #
        # JSON uses the same notation for strings and keys. We want keys to be in a
        # different color, but the lexical rules are the same. This is why we use this
        # case also for the :key state.
        when :string, :key
          # Another if-elsif-else-switch, for strings this time.
          if match = scan(/[^\\"]+/)
            # Everything that is not \ or " is just string content.
            encoder.text_token match, :content
          elsif match = scan(/"/)
            # A " is found, which means this string or key is ending here.
            # A special token class, :delimiter, is used for tokens like this one.
            encoder.text_token match, :delimiter
            # Always close your token groups using the right token kind!
            encoder.end_group state
            # We're going back to normal scanning here.
            state = :initial
            # Deprecation notice: Don't use "next" any more.
          elsif match = scan(/ \\ (?: #{ESCAPE} | #{UNICODE_ESCAPE} ) /mox)
            # A valid special character should be classified as :char.
            encoder.text_token match, :char
          elsif match = scan(/\\./m)
            # Anything else that is escaped (including \n, we use the m modifier) is
            # just content.
            encoder.text_token match, :content
          elsif match = scan(/ \\ | $ /x)
            # A string that suddenly ends in the middle, or reaches the end of the
            # line. This is an error; we go back to :initial now.
            # $ matches the empty string, so skip the token in that case.
            encoder.end_group state
            encoder.text_token match, :error unless match.empty?
            state = :initial
          else
            # Nice for debugging. Should never happen.
            raise_inspect "else case \" reached; %p not handled." % [peek(1)], encoder
          end

        else
          # Nice for debugging. Should never happen.
          raise_inspect 'Unknown state: %p' % [state], encoder

        end

        # Deprecation notice: The block that used the match local variable is gone.
      end

      # If we still have a string or key token group open, close it.
      if [:string, :key].include? state
        encoder.end_group state
      end

      # Return the encoder.
      encoder
    end

  end

end
end
</code></pre>

Highlighted with CodeRay, yeah :D
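
To see what the scanner hands to the encoder, it can help to picture an encoder that does nothing but write the calls down. The following is only a sketch: @RecordingEncoder@ is a hypothetical class, not part of CodeRay, and it covers just @text_token@, @begin_group@ and @end_group@ from the API listed in the comments above.

```ruby
# Hypothetical stand-in for a CodeRay encoder: it only records the calls
# that scan_tokens would make and checks that groups are balanced.
class RecordingEncoder
  attr_reader :events

  def initialize
    @events = []
    @open_groups = []
  end

  def text_token(text, kind)
    raise ArgumentError, 'empty token' if text.empty?
    @events << [:text, text, kind]
  end

  def begin_group(kind)
    @open_groups.push kind
    @events << [:begin_group, kind]
  end

  def end_group(kind)
    opened = @open_groups.pop
    unless opened == kind
      raise ArgumentError, "closing #{kind.inspect}, but #{opened.inspect} is open"
    end
    @events << [:end_group, kind]
  end

  def balanced?
    @open_groups.empty?
  end
end

# The calls the JSON scanner above makes for the input "hi" (including
# the quotes) when a string, not a key, is expected.
encoder = RecordingEncoder.new
encoder.begin_group :string
encoder.text_token '"', :delimiter
encoder.text_token 'hi', :content
encoder.text_token '"', :delimiter
encoder.end_group :string

p encoder.events
puts encoder.balanced?
```

An encoder like this makes the two rules from the comments easy to check while developing: never send an empty token, and always close a group with the kind it was opened with.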