Kornelius Kalnbach, 02/13/2010 01:27 AM
1 | 1 | Kornelius Kalnbach | h1. Scanner Requests |
---|---|---|---|
2 | 1 | Kornelius Kalnbach | |
3 | 1 | Kornelius Kalnbach | Scanners are the heart of CodeRay. They split input code into tokens and classify them. |
4 | 1 | Kornelius Kalnbach | |
5 | 1 | Kornelius Kalnbach | Each language has its own scanner; you can see which languages are currently supported in the "repository":http://code.licenser.net/repositories/browse/coderay/trunk/lib/coderay/scanners. |
6 | 1 | Kornelius Kalnbach | |
7 | 1 | Kornelius Kalnbach | h2. Why is the CodeRay language support list so short? |
8 | 1 | Kornelius Kalnbach | |
9 | 1 | Kornelius Kalnbach | CodeRay development is a slow process, because there is only one active developer and he insists on high software quality. |
10 | 1 | Kornelius Kalnbach | |
11 | 2 | Kornelius Kalnbach | Special attention is paid to the scanners: every CodeRay scanner is tested carefully against lots of example source code, and also against randomized and junk code, to make it safe. A CodeRay scanner is not officially released unless it highlights very, very well. |
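
To give an idea of what testing against junk code can mean, here is a minimal sketch. It is not CodeRay's actual test suite: @toy_tokens@ is a hypothetical stand-in tokenizer built on plain StringScanner, and @junk_check@ is a hypothetical helper that feeds it random input and checks two invariants (no empty tokens, and the token texts joined together reproduce the input).

```ruby
require 'strscan'

# Hypothetical toy tokenizer standing in for a real scanner: white space,
# words, and everything else one character at a time as :error.
def toy_tokens(code)
  scanner = StringScanner.new(code)
  tokens = []
  until scanner.eos?
    if match = scanner.scan(/\s+/)
      tokens << [match, :space]
    elsif match = scanner.scan(/\w+/)
      tokens << [match, :ident]
    else
      # Consume exactly one character so the loop always advances.
      tokens << [scanner.getch, :error]
    end
  end
  tokens
end

# Hypothetical junk test: random strings from a small alphabet of ordinary
# and awkward characters. Returns true if every run keeps both invariants.
def junk_check(runs: 200, max_length: 40, seed: 42)
  rng = Random.new(seed)
  alphabet = ['a', 'Z', '0', '_', ' ', "\n", "\t", '"', '\\', '{', '}', '#', '%']
  runs.times do
    length = rng.rand(max_length + 1)
    input = Array.new(length) { alphabet[rng.rand(alphabet.size)] }.join
    tokens = toy_tokens(input)
    # Invariant 1: a scanner must never emit an empty token.
    return false if tokens.any? { |text, _kind| text.empty? }
    # Invariant 2: no input may be lost or duplicated.
    return false unless tokens.map(&:first).join == input
  end
  true
end

puts junk_check
```

A fixed seed keeps the run reproducible, so a failing input can be found again and turned into a regular test case.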
12 | 1 | Kornelius Kalnbach | |
13 | 1 | Kornelius Kalnbach | h2. I need a new Scanner - What can I do? |
14 | 1 | Kornelius Kalnbach | |
15 | 1 | Kornelius Kalnbach | Here's what you can do to speed up the development of a new scanner: |
16 | 1 | Kornelius Kalnbach | |
17 | 1 | Kornelius Kalnbach | # Request it! File a "new ticket":http://code.licenser.net/projects/coderay/issues/new unless it already "exists":http://code.licenser.net/projects/coderay/issues?query_id=3; add a +1 or something to existing tickets to show your interest. |
18 | 1 | Kornelius Kalnbach | # Upload or link to *example code* in the ticket discussion. |
19 | 1 | Kornelius Kalnbach | #* Typical code in large quantities is very helpful, also for benchmarking. |
20 | 1 | Kornelius Kalnbach | #* But we also need the most *weird and strange code* you can find to make the scanner robust. |
21 | 1 | Kornelius Kalnbach | # Provide links to useful *information about the language's lexical rules*, such as: |
22 | 1 | Kornelius Kalnbach | #* a list of reserved words (Did you know that "void" is a JavaScript keyword?) |
23 | 1 | Kornelius Kalnbach | #* rules for string and number literals (Can a double quoted string contain a newline?) |
24 | 1 | Kornelius Kalnbach | #* rules for comments and other token types (Does XYZ have a special syntax for multiline comments?) |
25 | 1 | Kornelius Kalnbach | #* a description of any unusual syntactic features (There's this weird %w() thing in Ruby...) |
26 | 1 | Kornelius Kalnbach | #* If there are different versions / implementations / dialects of this language: How do they differ? |
27 | 1 | Kornelius Kalnbach | # Give examples of *good and bad highlighters / syntax definitions* for the language (usually from editors or other libraries) |
28 | 1 | Kornelius Kalnbach | # Find *more example code*! |
29 | 1 | Kornelius Kalnbach | |
30 | 1 | Kornelius Kalnbach | Also, read the next section. |
31 | 1 | Kornelius Kalnbach | |
32 | 1 | Kornelius Kalnbach | h2. I want to write a Scanner myself |
33 | 1 | Kornelius Kalnbach | |
34 | 1 | Kornelius Kalnbach | Wow, you're brave! Writing CodeRay scanners is not an easy task because: |
35 | 1 | Kornelius Kalnbach | |
36 | 1 | Kornelius Kalnbach | * You need excellent knowledge about the language you want to scan. Every language has a dark side! |
37 | 1 | Kornelius Kalnbach | * You need good knowledge of (Ruby) regular expressions. |
38 | 1 | Kornelius Kalnbach | * There's no documentation to speak of. |
39 | 1 | Kornelius Kalnbach | ** But this is a wiki ^hint hint^ ;o) |
40 | 1 | Kornelius Kalnbach | |
41 | 1 | Kornelius Kalnbach | But it has been done before, so go and try it! |
42 | 1 | Kornelius Kalnbach | |
43 | 1 | Kornelius Kalnbach | # You should still request the scanner (as described above) and announce that you are working on a patch yourself. |
44 | 6 | Kornelius Kalnbach | # Check out the "repository":http://code.licenser.net/wiki/coderay/Repository and try the test suite (@rake test:scanners[:lang]@). |
45 | 1 | Kornelius Kalnbach | # Copy a scanner of your choice as a base. You know best which language comes closest. |
46 | 1 | Kornelius Kalnbach | # Create a test case directory in @test/scanners@. |
47 | 1 | Kornelius Kalnbach | # --- Advertisement --- (No, just kidding.) |
48 | 1 | Kornelius Kalnbach | # Write your scanner! |
49 | 1 | Kornelius Kalnbach | # Also, look into @lib/coderay/scanners/_map.rb@ and @lib/coderay/helpers/file_type.rb@. |
50 | 1 | Kornelius Kalnbach | # Make a patch (scanner, test cases and other changes) and upload it to the ticket. |
51 | 1 | Kornelius Kalnbach | # Follow the ensuing discussion. |
52 | 1 | Kornelius Kalnbach | # Prepare to be added to the THX list. |
53 | 1 | Kornelius Kalnbach | |
54 | 7 | Kornelius Kalnbach | Contact me (murphy rubychan de) if you have any questions. |
55 | 3 | Kornelius Kalnbach | |
56 | 3 | Kornelius Kalnbach | h2. What does a Scanner look like? |
57 | 3 | Kornelius Kalnbach | |
58 | 3 | Kornelius Kalnbach | For example, the JSON scanner: |
59 | 3 | Kornelius Kalnbach | |
60 | 3 | Kornelius Kalnbach | <pre><code class="ruby"> |
61 | 4 | Kornelius Kalnbach | # Namespace; use this form instead of CodeRay::Scanners to avoid messages like |
62 | 4 | Kornelius Kalnbach | # "uninitialized constant CodeRay" when testing it. |
63 | 3 | Kornelius Kalnbach | module CodeRay |
64 | 3 | Kornelius Kalnbach | module Scanners |
65 | 3 | Kornelius Kalnbach | |
66 | 4 | Kornelius Kalnbach | # Always inherit from CodeRay::Scanners::Scanner. |
67 | 4 | Kornelius Kalnbach | # |
68 | 4 | Kornelius Kalnbach | # Scanner inherits directly from StringScanner, the Ruby class for fast |
69 | 4 | Kornelius Kalnbach | # string scanning. Read the documentation to understand what's going on here: |
70 | 4 | Kornelius Kalnbach | # |
71 | 4 | Kornelius Kalnbach | # http://www.ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html |
72 | 3 | Kornelius Kalnbach | class JSON < Scanner |
73 | 3 | Kornelius Kalnbach | |
74 | 4 | Kornelius Kalnbach | # This means your scanner is providing a token stream, that is, it doesn't |
75 | 4 | Kornelius Kalnbach | # touch a token after it has been added to tokens. Streamable scanners |
76 | 4 | Kornelius Kalnbach | # can be used in streaming mode to minimize memory usage. |
77 | 3 | Kornelius Kalnbach | include Streamable |
78 | 3 | Kornelius Kalnbach | |
79 | 4 | Kornelius Kalnbach | # Scanners are plugins and must be registered like this. |
80 | 3 | Kornelius Kalnbach | register_for :json |
81 | 3 | Kornelius Kalnbach | |
82 | 4 | Kornelius Kalnbach | # See the WordList documentation. |
83 | 3 | Kornelius Kalnbach | CONSTANTS = %w( true false null ) |
84 | 3 | Kornelius Kalnbach | IDENT_KIND = WordList.new(:key).add(CONSTANTS, :reserved) |
85 | 3 | Kornelius Kalnbach | |
86 | 3 | Kornelius Kalnbach | ESCAPE = / [bfnrt\\"\/] /x |
87 | 3 | Kornelius Kalnbach | UNICODE_ESCAPE = / u[a-fA-F0-9]{4} /x |
88 | 3 | Kornelius Kalnbach | |
89 | 4 | Kornelius Kalnbach | # This is the only method you need to define. It scans code. |
90 | 4 | Kornelius Kalnbach | # |
91 | 4 | Kornelius Kalnbach | # tokens is a Tokens or TokenStream object. Use tokens << [token, kind] to |
92 | 4 | Kornelius Kalnbach | # add tokens to it. |
93 | 4 | Kornelius Kalnbach | # |
94 | 4 | Kornelius Kalnbach | # options is reserved for later use. |
95 | 4 | Kornelius Kalnbach | # |
96 | 4 | Kornelius Kalnbach | # scan_tokens must return the tokens variable it was given. |
97 | 4 | Kornelius Kalnbach | # |
98 | 4 | Kornelius Kalnbach | # You are completely free to use any style you want; just make sure tokens |
99 | 4 | Kornelius Kalnbach | # gets what it needs. Typically, though, a Scanner follows this scheme. |
100 | 4 | Kornelius Kalnbach | # |
101 | 4 | Kornelius Kalnbach | # See http://json.org/ for a definition of the JSON lexic/grammar. |
102 | 3 | Kornelius Kalnbach | def scan_tokens tokens, options |
103 | 3 | Kornelius Kalnbach | |
104 | 4 | Kornelius Kalnbach | # The scanner is always in a certain state, which is :initial by default. |
105 | 4 | Kornelius Kalnbach | # We use local variables and symbols to maximize speed. |
106 | 3 | Kornelius Kalnbach | state = :initial |
107 | 4 | Kornelius Kalnbach | |
108 | 4 | Kornelius Kalnbach | # Sometimes, you need a stack. Ruby arrays are perfect for this. |
109 | 3 | Kornelius Kalnbach | stack = [] |
110 | 4 | Kornelius Kalnbach | |
111 | 4 | Kornelius Kalnbach | # Define more flags and variables as you need them. |
112 | 3 | Kornelius Kalnbach | string_delimiter = nil |
113 | 3 | Kornelius Kalnbach | key_expected = false |
114 | 3 | Kornelius Kalnbach | |
115 | 4 | Kornelius Kalnbach | # The main loop. eos? is true when the end of the code is reached. |
116 | 3 | Kornelius Kalnbach | until eos? |
117 | 3 | Kornelius Kalnbach | |
118 | 4 | Kornelius Kalnbach | # You can either add directly to tokens or set these variables. See |
119 | 4 | Kornelius Kalnbach | # the end of this method to understand how they are used. |
120 | 3 | Kornelius Kalnbach | kind = nil |
121 | 3 | Kornelius Kalnbach | match = nil |
122 | 3 | Kornelius Kalnbach | |
123 | 4 | Kornelius Kalnbach | # Depending on the state, we want to do different things. |
124 | 3 | Kornelius Kalnbach | case state |
125 | 3 | Kornelius Kalnbach | |
126 | 4 | Kornelius Kalnbach | # Normally, we use this case. |
127 | 3 | Kornelius Kalnbach | when :initial |
128 | 4 | Kornelius Kalnbach | # I like the / ... /x style regexps because white space makes them more |
129 | 4 | Kornelius Kalnbach | # readable. x means white space is ignored. |
130 | 3 | Kornelius Kalnbach | if match = scan(/ \s+ | \\\n /x) |
131 | 4 | Kornelius Kalnbach | # White space and masked line ends are :space. We're using the tokens << |
132 | 4 | Kornelius Kalnbach | # style here instead of setting kind and match, because it is faster. |
133 | 4 | Kornelius Kalnbach | # Just make sure you never send an empty token! /\s*/ for example would be |
134 | 4 | Kornelius Kalnbach | # very bad (actually creating infinite loops). |
135 | 3 | Kornelius Kalnbach | tokens << [match, :space] |
136 | 3 | Kornelius Kalnbach | next |
137 | 3 | Kornelius Kalnbach | elsif match = scan(/ [:,\[{\]}] /x) |
138 | 4 | Kornelius Kalnbach | # Operators of JSON. stack is used to determine where we are. stack and |
139 | 4 | Kornelius Kalnbach | # key_expected are set depending on which operator was found. |
140 | 4 | Kornelius Kalnbach | # key_expected is used to decide whether a "quoted" thing should be |
141 | 4 | Kornelius Kalnbach | # classified as key or string. |
142 | 3 | Kornelius Kalnbach | kind = :operator |
143 | 3 | Kornelius Kalnbach | case match |
144 | 3 | Kornelius Kalnbach | when '{' then stack << :object; key_expected = true |
145 | 3 | Kornelius Kalnbach | when '[' then stack << :array |
146 | 3 | Kornelius Kalnbach | when ':' then key_expected = false |
147 | 3 | Kornelius Kalnbach | when ',' then key_expected = true if stack.last == :object |
148 | 3 | Kornelius Kalnbach | when '}', ']' then stack.pop # no error recovery, but works for valid JSON |
149 | 3 | Kornelius Kalnbach | end |
150 | 3 | Kornelius Kalnbach | elsif match = scan(/ true | false | null /x) |
151 | 4 | Kornelius Kalnbach | # These are the only idents that are allowed in JSON. Normally, IDENT_KIND |
152 | 4 | Kornelius Kalnbach | # would be used to tell keywords and idents apart. |
153 | 3 | Kornelius Kalnbach | kind = IDENT_KIND[match] |
154 | 3 | Kornelius Kalnbach | elsif match = scan(/-?(?:0|[1-9]\d*)/) |
155 | 4 | Kornelius Kalnbach | # Pay attention to the details: JSON doesn't allow numbers like 00. |
156 | 3 | Kornelius Kalnbach | kind = :integer |
157 | 3 | Kornelius Kalnbach | if scan(/\.\d+(?:[eE][-+]?\d+)?|[eE][-+]?\d+/) |
158 | 3 | Kornelius Kalnbach | match << matched |
159 | 3 | Kornelius Kalnbach | kind = :float |
160 | 3 | Kornelius Kalnbach | end |
161 | 3 | Kornelius Kalnbach | elsif match = scan(/"/) |
162 | 4 | Kornelius Kalnbach | # A "quoted" token was found, and we know whether it is a key or a string. |
163 | 3 | Kornelius Kalnbach | state = key_expected ? :key : :string |
164 | 4 | Kornelius Kalnbach | # This opens a token group. |
165 | 3 | Kornelius Kalnbach | tokens << [:open, state] |
166 | 3 | Kornelius Kalnbach | kind = :delimiter |
167 | 3 | Kornelius Kalnbach | else |
168 | 4 | Kornelius Kalnbach | # Don't forget to add this case: If we reach invalid code, we try to discard |
169 | 4 | Kornelius Kalnbach | # chars one by one and mark them as :error. |
170 | 3 | Kornelius Kalnbach | getch |
171 | 3 | Kornelius Kalnbach | kind = :error |
172 | 3 | Kornelius Kalnbach | end |
173 | 3 | Kornelius Kalnbach | |
174 | 4 | Kornelius Kalnbach | # String scanning is a bit more complicated, so we use another state for it. |
175 | 4 | Kornelius Kalnbach | # The scanner stays in :string state until the string ends or an error occurs. |
176 | 4 | Kornelius Kalnbach | # |
177 | 4 | Kornelius Kalnbach | # JSON uses the same notation for strings and keys. We want keys to be in a |
178 | 4 | Kornelius Kalnbach | # different color, but the lexical rules are the same. This is why we use this |
179 | 4 | Kornelius Kalnbach | # case also for the :key state. |
180 | 3 | Kornelius Kalnbach | when :string, :key |
181 | 4 | Kornelius Kalnbach | # Another if-elsif-else-switch, for strings this time. |
182 | 3 | Kornelius Kalnbach | if scan(/[^\\"]+/) |
183 | 4 | Kornelius Kalnbach | # Everything that is not \ or " is just string content. |
184 | 3 | Kornelius Kalnbach | kind = :content |
185 | 3 | Kornelius Kalnbach | elsif scan(/"/) |
186 | 4 | Kornelius Kalnbach | # A " is found, which means this string or key is ending here. |
187 | 4 | Kornelius Kalnbach | # A special token class, :delimiter, is used for tokens like this one. |
188 | 3 | Kornelius Kalnbach | tokens << ['"', :delimiter] |
189 | 4 | Kornelius Kalnbach | # Always close your token groups! |
190 | 3 | Kornelius Kalnbach | tokens << [:close, state] |
191 | 4 | Kornelius Kalnbach | # We're going back to normal scanning here. |
192 | 3 | Kornelius Kalnbach | state = :initial |
193 | 4 | Kornelius Kalnbach | # Skip the rest of the loop, since we used tokens <<. |
194 | 3 | Kornelius Kalnbach | next |
195 | 3 | Kornelius Kalnbach | elsif scan(/ \\ (?: #{ESCAPE} | #{UNICODE_ESCAPE} ) /mox) |
196 | 4 | Kornelius Kalnbach | # A valid special character should be classified as :char. |
197 | 3 | Kornelius Kalnbach | kind = :char |
198 | 3 | Kornelius Kalnbach | elsif scan(/\\./m) |
199 | 4 | Kornelius Kalnbach | # Anything else that is escaped (including \n, we use the m modifier) is |
200 | 4 | Kornelius Kalnbach | # just content. |
201 | 3 | Kornelius Kalnbach | kind = :content |
202 | 3 | Kornelius Kalnbach | elsif scan(/ \\ | $ /x) |
203 | 4 | Kornelius Kalnbach | # A string that suddenly ends in the middle, or reaches the end of the |
204 | 4 | Kornelius Kalnbach | # line. This is an error; we go back to :initial now. |
205 | 3 | Kornelius Kalnbach | tokens << [:close, state] |
206 | 3 | Kornelius Kalnbach | kind = :error |
207 | 3 | Kornelius Kalnbach | state = :initial |
208 | 3 | Kornelius Kalnbach | else |
209 | 4 | Kornelius Kalnbach | # Nice for debugging. Should never happen. |
210 | 3 | Kornelius Kalnbach | raise_inspect "else case \" reached; %p not handled." % peek(1), tokens |
211 | 3 | Kornelius Kalnbach | end |
212 | 3 | Kornelius Kalnbach | |
213 | 1 | Kornelius Kalnbach | else |
214 | 4 | Kornelius Kalnbach | # Nice for debugging. Should never happen. |
215 | 4 | Kornelius Kalnbach | raise_inspect 'Unknown state: %p' % [state], tokens |
216 | 3 | Kornelius Kalnbach | |
217 | 3 | Kornelius Kalnbach | end |
218 | 1 | Kornelius Kalnbach | |
219 | 4 | Kornelius Kalnbach | # Unless the match local variable was set, use matched. |
220 | 1 | Kornelius Kalnbach | match ||= matched |
221 | 4 | Kornelius Kalnbach | # Debugging. Empty tokens and undefined kind are bad. |
222 | 5 | Kornelius Kalnbach | if $CODERAY_DEBUG and not kind |
223 | 3 | Kornelius Kalnbach | raise_inspect 'Error token %p in line %d' % |
224 | 3 | Kornelius Kalnbach | [[match, kind], line], tokens |
225 | 3 | Kornelius Kalnbach | end |
226 | 3 | Kornelius Kalnbach | raise_inspect 'Empty token', tokens unless match |
227 | 1 | Kornelius Kalnbach | |
228 | 4 | Kornelius Kalnbach | # Finally, add the token and loop. |
229 | 3 | Kornelius Kalnbach | tokens << [match, kind] |
230 | 3 | Kornelius Kalnbach | |
231 | 3 | Kornelius Kalnbach | end |
232 | 1 | Kornelius Kalnbach | |
233 | 4 | Kornelius Kalnbach | # If we still have a string or key token group open, close it. |
234 | 3 | Kornelius Kalnbach | if [:string, :key].include? state |
235 | 3 | Kornelius Kalnbach | tokens << [:close, state] |
236 | 3 | Kornelius Kalnbach | end |
237 | 1 | Kornelius Kalnbach | |
238 | 4 | Kornelius Kalnbach | # Return tokens. This is the only rule to follow. |
239 | 3 | Kornelius Kalnbach | tokens |
240 | 3 | Kornelius Kalnbach | end |
241 | 3 | Kornelius Kalnbach | |
242 | 3 | Kornelius Kalnbach | end |
243 | 3 | Kornelius Kalnbach | |
244 | 3 | Kornelius Kalnbach | end |
245 | 3 | Kornelius Kalnbach | end |
246 | 1 | Kornelius Kalnbach | </code></pre> |
247 | 4 | Kornelius Kalnbach | |
248 | 4 | Kornelius Kalnbach | Highlighted with CodeRay, yeah :D |
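
If you want to play with the scheme without checking out CodeRay first, the following is a minimal stand-alone sketch of the same state machine idea, using only StringScanner from the Ruby standard library. @tiny_json_tokens@ is a hypothetical name, not part of CodeRay; it returns a plain array of @[text, kind]@ pairs, leaves out token groups, floats and key detection, and treats every escaped character as @:char@.

```ruby
require 'strscan'

# Hypothetical stand-alone sketch; not part of CodeRay.
# Returns an array of [text, kind] pairs, mirroring tokens << [match, kind].
def tiny_json_tokens(code)
  scanner = StringScanner.new(code)
  tokens  = []
  state   = :initial

  until scanner.eos?
    case state
    when :initial
      if match = scanner.scan(/\s+/)
        tokens << [match, :space]
      elsif match = scanner.scan(/[:,\[\]{}]/)
        tokens << [match, :operator]
      elsif match = scanner.scan(/true|false|null/)
        tokens << [match, :reserved]
      elsif match = scanner.scan(/-?(?:0|[1-9]\d*)/)
        tokens << [match, :integer]
      elsif match = scanner.scan(/"/)
        # A string starts; stay in :string until the closing quote.
        tokens << [match, :delimiter]
        state = :string
      else
        # Invalid input: consume exactly one character so the loop advances.
        tokens << [scanner.getch, :error]
      end
    when :string
      if match = scanner.scan(/[^\\"]+/)
        tokens << [match, :content]
      elsif match = scanner.scan(/"/)
        tokens << [match, :delimiter]
        state = :initial
      elsif match = scanner.scan(/\\./m)
        # Simplification: any escaped character counts as :char here.
        tokens << [match, :char]
      else
        # Only a lone backslash at the very end of the input gets here.
        tokens << [scanner.getch, :error]
      end
    end
  end

  tokens
end

tiny_json_tokens('{"a": 1}').each { |text, kind| puts "#{kind}: #{text.inspect}" }
```

Every branch consumes at least one character, which is what keeps the @until eos?@ loop from running forever. A pattern that can match the empty string, such as @/\s*/@, would break that guarantee.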