Kornelius Kalnbach, 09/12/2010 12:42 PM

Scanner Requests¶

Scanners are the heart of CodeRay. They split input code into tokens and classify them.

Each language has its own scanner; you can see which languages are currently supported in the repository.

Why is the CodeRay language support list so short?¶

CodeRay development is a slow process, because there is only one active developer and he insists on high software quality.

Special attention is paid to the scanners: every CodeRay scanner is tested carefully against lots of example source code, as well as randomized and junk code, to make it safe. A CodeRay scanner is not officially released unless it highlights very, very well.

I need a new Scanner - What can I do?¶

Here's what you can do to speed up the development of a new scanner:

Request it! File a new ticket unless one already exists; in that case, add a +1 or something similar to the existing ticket to show your interest.
Upload or link to example code in the ticket discussion.
- Typical code in large quantities is very helpful, also for benchmarking.
- But we also need the weirdest, strangest code you can find to make the scanner robust.
Provide links to useful information about the language's lexical structure, such as:
- a list of reserved words (Did you know that "void" is a JavaScript keyword?)
- rules for string and number literals (Can a double quoted string contain a newline?)
- rules for comments and other token types (Does the language have a special syntax for multiline comments?)
- a description of any unusual syntactic features (There's this weird %w() thing in Ruby...)
- If there are different versions / implementations / dialects of this language: How do they differ?
Give examples of good and bad highlighters / syntax definitions for the language (usually from editors or other libraries).
Find more example code!
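
To illustrate why such lexical details matter, here is a small sketch of the %w() literal mentioned above. It is plain Ruby and shows only the language feature, not any CodeRay code; the helper name percent_w_samples is made up for this example.

```ruby
# Hypothetical helper that collects a few %w literals for inspection.
def percent_w_samples
  {
    # %w builds an array of whitespace-separated words, no quotes or commas.
    plain: %w(true false null),
    # Other delimiters are allowed, too.
    brackets: %w[alpha beta],
    # Paired delimiters nest: the inner parentheses end up inside the words.
    nested: %w(a (b c) d),
    # A backslash-escaped space does not split the word.
    escaped: %w(hello\ world again),
  }
end

percent_w_samples.each { |name, words| puts "#{name}: #{words.inspect}" }
```

A scanner for Ruby has to get all of these cases right, which is why links to precise rules and lots of odd example code are so valuable.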

Also, read the next section.

I want to write a Scanner myself¶

Wow, you're brave! Writing CodeRay scanners is not an easy task because:

You need excellent knowledge about the language you want to scan. Every language has a dark side!
You need good knowledge of (Ruby) regular expressions.
There's no documentation to speak of.
- But this is a wiki (hint hint) ;o)

But it has been done before, so go and try it!
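
CodeRay scanners are built on Ruby's StringScanner (from the strscan standard library), so it helps to know how regular expressions behave there before you start. The following is a minimal sketch using only strscan, not CodeRay itself; the helper try_scan is hypothetical and exists only for this demonstration.

```ruby
require 'strscan'

# Hypothetical helper: try a pattern at the start of the input and return
# [matched_text_or_nil, scanner_position_after_the_attempt].
def try_scan(input, pattern)
  scanner = StringScanner.new(input)
  [scanner.scan(pattern), scanner.pos]
end

p try_scan('  x', /\s+/)  # match: the two spaces are consumed
p try_scan('x', /\s+/)    # no match: nil, the position does not move
p try_scan('ab', /b/)     # scan only matches at the current position: nil
p try_scan('x', /\s*/)    # empty match: "" is truthy, the position does not move
```

The last case is the dangerous one: a pattern that can match the empty string succeeds without consuming anything, so a scanner loop built on it may never reach the end of the input.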

You should still request the scanner (as described above) and announce that you are working on a patch yourself.
Check out the repository and try the test suite (rake test:scanners[:lang]).
Copy a scanner of your choice as a base. You know best which language comes closest.
Create a test case directory in test/scanners.
--- Advertisement --- (No, just kidding.)
Write your scanner!
Also, look into lib/coderay/scanners/_map.rb and lib/coderay/helpers/file_type.rb.
Make a patch (scanner, test cases and other changes) and upload it to the ticket.
Follow the discussion that ensues.
Prepare to be added to the THX list.

Contact me (murphy rubychan de) if you have any questions.

What does a Scanner look like?¶

For example, the JSON scanner:

Warning: This is the 0.9 API; 1.0 will change it. See ticket #142 and the 1.0-compatible version of this example.

# Namespace; use this form instead of CodeRay::Scanners to avoid messages like
# "uninitialized constant CodeRay" when testing it.
module CodeRay
module Scanners

  # Always inherit from CodeRay::Scanners::Scanner.
  # 
  # Scanner inherits directly from StringScanner, the Ruby class for fast
  # string scanning. Read the documentation to understand what's going on here:
  # 
  #   http://www.ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
  class JSON < Scanner

    # This means your scanner is providing a token stream, that is, it doesn't
    # touch a token after it has been added to tokens. Streamable scanners
    # can be used in streaming mode to minimize memory usage.
    include Streamable

    # Scanners are plugins and must be registered like this.
    register_for :json

    # See the WordList documentation.
    CONSTANTS = %w( true false null )
    IDENT_KIND = WordList.new(:key).add(CONSTANTS, :reserved)

    ESCAPE = / [bfnrt\\"\/] /x
    UNICODE_ESCAPE =  / u[a-fA-F0-9]{4} /x

    # This is the only method you need to define. It scans code.
    # 
    # tokens is a Tokens or TokenStream object. Use tokens << [token, kind] to
    # add tokens to it.
    # 
    # options is reserved for later use.
    # 
    # scan_tokens must return the tokens variable it was given.
    # 
    # You are completely free to use any style you want, just make sure tokens
    # gets what it needs. But typically, a Scanner follows this scheme.
    # 
    # See http://json.org/ for a definition of the JSON lexical rules and grammar.
    def scan_tokens tokens, options

      # The scanner is always in a certain state, which is :initial by default.
      # We use local variables and symbols to maximize speed.
      state = :initial

      # Sometimes, you need a stack. Ruby arrays are perfect for this.
      stack = []

      # Define more flags and variables as you need them.
      string_delimiter = nil
      key_expected = false

      # The main loop. eos? is true when the end of the code is reached.
      until eos?

        # You can either add directly to tokens or set these variables. See
        # the end of this method to understand how they are used.
        kind = nil
        match = nil

        # Depending on the state, we want to do different things.
        case state

        # Normally, we use this case.
        when :initial
          # I like the / ... /x style regexps because white space makes them more
          # readable. x means white space is ignored.
          if match = scan(/ \s+ | \\\n /x)
            # White space and masked line ends are :space. We're using the tokens <<
            # style here instead of setting kind and match, because it is faster.
            # Just make sure you never send an empty token! /\s*/ for example would be
            # very bad (actually creating infinite loops).
            tokens << [match, :space]
            next
          elsif match = scan(/ [:,\[{\]}] /x)
            # Operators of JSON. stack is used to determine where we are. stack and
            # key_expected are set depending on which operator was found.
            # key_expected is used to decide whether a "quoted" thing should be
            # classified as key or string.
            kind = :operator
            case match
            when '{' then stack << :object; key_expected = true
            when '[' then stack << :array
            when ':' then key_expected = false
            when ',' then key_expected = true if stack.last == :object
            when '}', ']' then stack.pop  # no error recovery, but works for valid JSON
            end
          elsif match = scan(/ true | false | null /x)
            # These are the only idents that are allowed in JSON. Normally, IDENT_KIND
            # would be used to tell keywords and idents apart.
            kind = IDENT_KIND[match]
          elsif match = scan(/-?(?:0|[1-9]\d*)/)
            # Pay attention to the details: JSON doesn't allow numbers like 00.
            kind = :integer
            if scan(/\.\d+(?:[eE][-+]?\d+)?|[eE][-+]?\d+/)
              match << matched
              kind = :float
            end
          elsif match = scan(/"/)
            # A "quoted" token was found, and we know whether it is a key or a string.
            state = key_expected ? :key : :string
            # This opens a token group.
            tokens << [:open, state]
            kind = :delimiter
          else
            # Don't forget to add this case: If we reach invalid code, we try to discard
            # chars one by one and mark them as :error.
            getch
            kind = :error
          end

        # String scanning is a bit more complicated, so we use another state for it.
        # The scanner stays in :string state until the string ends or an error occurs.
        # 
        # JSON uses the same notation for strings and keys. We want keys to be in a
        # different color, but the lexical rules are the same. This is why we use this
        # case also for the :key state.
        when :string, :key
          # Another if-elsif-else-switch, for strings this time.
          if scan(/[^\\"]+/)
            # Everything that is not \ or " is just string content.
            kind = :content
          elsif scan(/"/)
            # A " is found, which means this string or key is ending here.
            # A special token class, :delimiter, is used for tokens like this one.
            tokens << ['"', :delimiter]
            # Always close your token groups!
            tokens << [:close, state]
            # We're going back to normal scanning here.
            state = :initial
            # Skip the rest of the loop, since we used tokens <<.
            next
          elsif scan(/ \\ (?: #{ESCAPE} | #{UNICODE_ESCAPE} ) /mox)
            # A valid special character should be classified as :char.
            kind = :char
          elsif scan(/\\./m)
            # Anything else that is escaped (including \n, we use the m modifier) is
            # just content.
            kind = :content
          elsif scan(/ \\ | $ /x)
            # A string that suddenly ends in the middle, or reaches the end of the
            # line. This is an error; we go back to :initial now.
            # Close the token group that was opened for this string or key.
            tokens << [:close, state]
            kind = :error
            state = :initial
          else
            # Nice for debugging. Should never happen.
            raise_inspect "else case \" reached; %p not handled." % peek(1), tokens
          end

        else
          # Nice for debugging. Should never happen.
          raise_inspect 'Unknown state: %p' % [state], tokens

        end

        # Unless the match local variable was set, use matched.
        match ||= matched
        # Debugging. Empty tokens and undefined kind are bad.
        if $CODERAY_DEBUG and not kind
          raise_inspect 'Error token %p in line %d' %
            [[match, kind], line], tokens
        end
        raise_inspect 'Empty token', tokens unless match

        # Finally, add the token and loop.
        tokens << [match, kind]

      end

      # If we still have a string or key token group open, close it.
      if [:string, :key].include? state
        tokens << [:close, state]
      end

      # Return tokens. This is the only rule to follow.
      tokens
    end

  end

end
end

Highlighted with CodeRay, yeah :D
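
If you want to experiment with the state-machine scheme without setting up CodeRay, the following sketch reproduces its core idea with nothing but strscan. It is a simplified stand-in, not the CodeRay API: the method name tiny_scan and the :ident kind are assumptions made for this example, and it leaves out token groups (:open / :close), WordList, and streaming.

```ruby
require 'strscan'

# Simplified stand-in for a scanner: returns [text, kind] pairs for input
# made of words, white space, and "double-quoted" strings.
def tiny_scan(code)
  scanner = StringScanner.new(code)
  tokens = []
  state = :initial

  until scanner.eos?
    case state
    when :initial
      if (match = scanner.scan(/\s+/))
        tokens << [match, :space]
      elsif (match = scanner.scan(/\w+/))
        tokens << [match, :ident]
      elsif scanner.scan(/"/)
        tokens << ['"', :delimiter]
        state = :string
      else
        # Unknown input: consume one character so the loop always advances.
        tokens << [scanner.getch, :error]
      end
    when :string
      if (match = scanner.scan(/[^"\\]+/))
        tokens << [match, :content]
      elsif (match = scanner.scan(/\\./m))
        tokens << [match, :char]
      elsif scanner.scan(/"/)
        tokens << ['"', :delimiter]
        state = :initial
      else
        # Only a lone backslash at the very end of the input gets here.
        tokens << [scanner.getch, :error]
      end
    end
  end

  tokens
end

tiny_scan('say "hi\n" !').each { |text, kind| puts "#{kind}: #{text.inspect}" }
```

Every branch either consumes at least one character or falls through to getch, so the loop always terminates, and joining the token texts gives back the original input.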
