ScannerRequests » History » Version 11

Kornelius Kalnbach, 09/12/2010 12:42 PM

h1. Scanner Requests

Scanners are the heart of CodeRay. They split input code into tokens and classify them.

Each language has its own scanner; you can see which languages are currently supported in the "repository":http://code.licenser.net/repositories/browse/coderay/trunk/lib/coderay/scanners.

h2. Why is the CodeRay language support list so short?

CodeRay development is a slow process, because there is only one active developer and he insists on high software quality.

Special attention is paid to the scanners: every CodeRay scanner is tested carefully against lots of example source code, as well as randomized and junk code, to make it safe. A CodeRay scanner is not officially released unless it highlights very, very well.

h2. I need a new Scanner - What can I do?

Here's what you can do to speed up the development of a new scanner:

# Request it! File a "new ticket":http://code.licenser.net/projects/coderay/issues/new unless it already "exists":http://code.licenser.net/projects/coderay/issues?query_id=3; add a +1 or something to existing tickets to show your interest.
# Upload or link to *example code* in the ticket discussion.
#* Typical code in large quantities is very helpful, also for benchmarking.
#* But we also need the most *weird and strange code* you can find to make the scanner safe.
# Provide links to useful *information about the lexical structure of the language*, such as:
#* a list of reserved words (Did you know that "void" is a JavaScript keyword?)
#* rules for string and number literals (Can a double-quoted string contain a newline?)
#* rules for comments and other token types (Does the language have a special syntax for multiline comments?)
#* a description of any unusual syntactic features (There's this weird %w() thing in Ruby...)
#* If there are different versions / implementations / dialects of this language: How do they differ?
# Give examples of *good and bad highlighters / syntax definitions* for the language (usually from editors or other libraries).
# Find *more example code*!

Also, read the next section.

h2. I want to write a Scanner myself

Wow, you're brave! Writing CodeRay scanners is not an easy task because:

* You need excellent knowledge about the language you want to scan. Every language has a dark side!
* You need good knowledge of (Ruby) regular expressions.
* There's no documentation to speak of.
** But this is a wiki ^hint hint^ ;o)

But it has been done before, so go and try it!

# You should still request the scanner (as described above) and announce that you are working on a patch yourself.
# Check out the "repository":http://code.licenser.net/wiki/coderay/Repository and try the test suite (@rake test:scanners[:lang]@).
# Copy a scanner of your choice as a base. You know best which language comes closest.
# Create a test case directory in @test/scanners@.
# --- Advertisement --- (No, just kidding.)
# Write your scanner!
# Also, look into @lib/coderay/scanners/_map.rb@ and @lib/coderay/helpers/file_type.rb@.
# Make a patch (scanner, test cases and other changes) and upload it to the ticket.
# Follow the discussion that ensues.
# Prepare to be added to the THX list.

Contact me (murphy rubychan de) if you have any questions.

h2. What does a Scanner look like?

For example, the JSON scanner:

_Warning: This is the 0.9 API; 1.0 will change it, see ticket #142 and look at the [[ScannerNewAPI|1.0-compatible version of this example]]._

<pre><code class="ruby">
# Namespace; use this form instead of CodeRay::Scanners to avoid messages like
# "uninitialized constant CodeRay" when testing it.
module CodeRay
module Scanners

  # Always inherit from CodeRay::Scanners::Scanner.
  #
  # Scanner inherits directly from StringScanner, the Ruby class for fast
  # string scanning. Read the documentation to understand what's going on here:
  #
  #   http://www.ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
  class JSON < Scanner

    # This means your scanner is providing a token stream, that is, it doesn't
    # touch a token after it has been added to tokens. Streamable scanners
    # can be used in streaming mode to minimize memory usage.
    include Streamable

    # Scanners are plugins and must be registered like this.
    register_for :json

    # See the WordList documentation.
    CONSTANTS = %w( true false null )
    IDENT_KIND = WordList.new(:key).add(CONSTANTS, :reserved)

    ESCAPE = / [bfnrt\\"\/] /x
    UNICODE_ESCAPE = / u[a-fA-F0-9]{4} /x

    # This is the only method you need to define. It scans code.
    #
    # tokens is a Tokens or TokenStream object. Use tokens << [token, kind] to
    # add tokens to it.
    #
    # options is reserved for later use.
    #
    # scan_tokens must return the tokens variable it was given.
    #
    # You are completely free to use any style you want, just make sure tokens
    # gets what it needs. But typically, a Scanner follows the scheme below.
    #
    # See http://json.org/ for a definition of the JSON lexical structure and grammar.
    def scan_tokens tokens, options

      # The scanner is always in a certain state, which is :initial by default.
      # We use local variables and symbols to maximize speed.
      state = :initial

      # Sometimes, you need a stack. Ruby arrays are perfect for this.
      stack = []

      # Define more flags and variables as you need them.
      string_delimiter = nil
      key_expected = false

      # The main loop. eos? is true when the end of the code is reached.
      until eos?

        # You can either add directly to tokens or set these variables. See
        # the end of this method to understand how they are used.
        kind = nil
        match = nil

        # Depending on the state, we want to do different things.
        case state

        # Normally, we use this case.
        when :initial
          # I like the / ... /x style regexps because white space makes them more
          # readable. x means white space is ignored.
          if match = scan(/ \s+ | \\\n /x)
            # White space and masked line ends are :space. We're using the tokens <<
            # style here instead of setting kind and match, because it is faster.
            # Just make sure you never send an empty token! /\s*/ for example would be
            # very bad (actually creating infinite loops).
            tokens << [match, :space]
            next
          elsif match = scan(/ [:,\[{\]}] /x)
            # Operators of JSON. stack is used to determine where we are. stack and
            # key_expected are set depending on which operator was found.
            # key_expected is used to decide whether a "quoted" thing should be
            # classified as key or string.
            kind = :operator
            case match
            when '{' then stack << :object; key_expected = true
            when '[' then stack << :array
            when ':' then key_expected = false
            when ',' then key_expected = true if stack.last == :object
            when '}', ']' then stack.pop  # no error recovery, but works for valid JSON
            end
          elsif match = scan(/ true | false | null /x)
            # These are the only idents that are allowed in JSON. Normally, IDENT_KIND
            # would be used to tell keywords and idents apart.
            kind = IDENT_KIND[match]
          elsif match = scan(/-?(?:0|[1-9]\d*)/)
            # Pay attention to the details: JSON doesn't allow numbers like 00.
            kind = :integer
            if scan(/\.\d+(?:[eE][-+]?\d+)?|[eE][-+]?\d+/)
              match << matched
              kind = :float
            end
          elsif match = scan(/"/)
            # A "quoted" token was found, and we know whether it is a key or a string.
            state = key_expected ? :key : :string
            # This opens a token group.
            tokens << [:open, state]
            kind = :delimiter
          else
            # Don't forget to add this case: If we reach invalid code, we try to discard
            # chars one by one and mark them as :error.
            getch
            kind = :error
          end

        # String scanning is a bit more complicated, so we use another state for it.
        # The scanner stays in :string state until the string ends or an error occurs.
        #
        # JSON uses the same notation for strings and keys. We want keys to be in a
        # different color, but the lexical rules are the same. This is why we use this
        # case also for the :key state.
        when :string, :key
          # Another if-elsif-else-switch, for strings this time.
          if scan(/[^\\"]+/)
            # Everything that is not \ or " is just string content.
            kind = :content
          elsif scan(/"/)
            # A " is found, which means this string or key is ending here.
            # A special token class, :delimiter, is used for tokens like this one.
            tokens << ['"', :delimiter]
            # Always close your token groups!
            tokens << [:close, state]
            # We're going back to normal scanning here.
            state = :initial
            # Skip the rest of the loop, since we used tokens <<.
            next
          elsif scan(/ \\ (?: #{ESCAPE} | #{UNICODE_ESCAPE} ) /mox)
            # A valid special character should be classified as :char.
            kind = :char
          elsif scan(/\\./m)
            # Anything else that is escaped (including \n, we use the m modifier) is
            # just content.
            kind = :content
          elsif scan(/ \\ | $ /x)
            # A string that suddenly ends in the middle, or reaches the end of the
            # line. This is an error; we go back to :initial now.
            # Close the group that was opened with [:open, state] (:string or :key).
            tokens << [:close, state]
            kind = :error
            state = :initial
          else
            # Nice for debugging. Should never happen.
            raise_inspect "else case \" reached; %p not handled." % peek(1), tokens
          end

        else
          # Nice for debugging. Should never happen.
          raise_inspect 'Unknown state: %p' % [state], tokens

        end

        # Unless the match local variable was set, use matched.
        match ||= matched
        # Debugging. Empty tokens and undefined kind are bad.
        if $CODERAY_DEBUG and not kind
          raise_inspect 'Error token %p in line %d' %
            [[match, kind], line], tokens
        end
        raise_inspect 'Empty token', tokens unless match

        # Finally, add the token and loop.
        tokens << [match, kind]

      end

      # If we still have a string or key token group open, close it.
      if [:string, :key].include? state
        tokens << [:close, state]
      end

      # Return tokens. This is the only rule to follow.
      tokens
    end

  end

end
end
</code></pre>

Highlighted with CodeRay, yeah :D
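
If you want to play with the scanning loop before checking out CodeRay, the same scheme can be sketched on top of plain StringScanner. This is only an illustration under assumptions: @MiniScanner@ and its @tokenize@ method are hypothetical names, not part of CodeRay, it handles just a small subset of JSON, and it collects tokens in a plain Array instead of a Tokens object.

```ruby
require 'strscan'

# Hypothetical stand-alone sketch; MiniScanner is not a CodeRay class.
# It recognizes white space, JSON operators, the three JSON constants and
# integers, and marks everything else as :error.
class MiniScanner < StringScanner

  # Returns an Array of [text, kind] pairs covering the whole input.
  def tokenize
    tokens = []
    until eos?
      if match = scan(/\s+/)
        # \s+ rather than \s*, so a successful match is never empty.
        kind = :space
      elsif match = scan(/[:,\[\]{}]/)
        kind = :operator
      elsif match = scan(/true|false|null/)
        kind = :reserved
      elsif match = scan(/-?(?:0|[1-9]\d*)/)
        kind = :integer
      else
        # Invalid input: consume exactly one character so the loop advances.
        match = getch
        kind = :error
      end
      tokens << [match, kind]
    end
    tokens
  end

end

tokens = MiniScanner.new('[1, true, ?]').tokenize
tokens.each { |text, kind| puts '%-10s %p' % [kind, text] }
```

Every branch consumes at least one character, so the loop always terminates. Replacing @\s+@ with @\s*@ would let the first branch succeed with an empty match, and the scanner would never advance.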