ScannerRequests » History » Version 21
Kornelius Kalnbach, 06/24/2012 01:47 PM
1 | 1 | Kornelius Kalnbach | h1. Scanner Requests |
---|---|---|---|
2 | 1 | Kornelius Kalnbach | |
3 | 1 | Kornelius Kalnbach | Scanners are the heart of CodeRay. They split input code into tokens and classify them. |
4 | 1 | Kornelius Kalnbach | |
5 | 16 | Kornelius Kalnbach | Each language has its own scanner: You can see what languages are currently supported in the "repository":https://github.com/rubychan/coderay. |
6 | 1 | Kornelius Kalnbach | |
7 | 1 | Kornelius Kalnbach | h2. Why is the CodeRay language support list so short? |
8 | 1 | Kornelius Kalnbach | |
9 | 1 | Kornelius Kalnbach | CodeRay development is a slow process because there is only one active developer, and he insists on high software quality. |
10 | 1 | Kornelius Kalnbach | |
11 | 21 | Kornelius Kalnbach | Special attention is paid to the scanners: Every "CodeRay scanner":https://github.com/rubychan/coderay/tree/master/lib/coderay/scanners is tested carefully against lots of example source code, as well as randomized and junk code, to make it safe. A CodeRay scanner is not officially released unless it highlights very, very well. |
12 | 1 | Kornelius Kalnbach | |
13 | 1 | Kornelius Kalnbach | h2. I need a new Scanner - What can I do? |
14 | 1 | Kornelius Kalnbach | |
15 | 1 | Kornelius Kalnbach | Here's what you can do to speed up the development of a new scanner: |
16 | 1 | Kornelius Kalnbach | |
17 | 18 | Kornelius Kalnbach | # Request it! File a "new ticket":http://odd-eyed-code.org/projects/coderay/issues/new unless one already "exists":http://odd-eyed-code.org/projects/coderay/issues?query_id=3. If it does, add a +1 to the existing ticket to show your interest. |
18 | 1 | Kornelius Kalnbach | # Upload or link to *example code* in the ticket discussion. |
19 | 1 | Kornelius Kalnbach | #* Typical code in large quantities is very helpful, also for benchmarking. |
20 | 1 | Kornelius Kalnbach | #* But we also need the most *weird and strange code* you can find to make the scanner robust. |
21 | 1 | Kornelius Kalnbach | # Provide links to useful *information about the lexical rules of the language*, such as: |
22 | 1 | Kornelius Kalnbach | #* a list of reserved words (Did you know that "void" is a JavaScript keyword?) |
23 | 1 | Kornelius Kalnbach | #* rules for string and number literals (Can a double quoted string contain a newline?) |
24 | 8 | Kornelius Kalnbach | #* rules for comments and other token types (Does the language have a special syntax for multiline comments?) |
25 | 1 | Kornelius Kalnbach | #* a description of any unusual syntactic features (There's this weird %w() thing in Ruby...) |
26 | 1 | Kornelius Kalnbach | #* If there are different versions / implementations / dialects of this language: How do they differ? |
27 | 8 | Kornelius Kalnbach | # Give examples of *good and bad highlighters / syntax definitions* for the language (usually from editors or other libraries). |
28 | 1 | Kornelius Kalnbach | # Find *more example code*! |
29 | 1 | Kornelius Kalnbach | |
30 | 14 | Kornelius Kalnbach | Also, read the next section. |
31 | 1 | Kornelius Kalnbach | |
32 | 1 | Kornelius Kalnbach | h2. I want to write a Scanner myself |
33 | 1 | Kornelius Kalnbach | |
34 | 1 | Kornelius Kalnbach | Wow, you're brave! Writing CodeRay scanners is not an easy task because: |
35 | 1 | Kornelius Kalnbach | |
36 | 1 | Kornelius Kalnbach | * You need excellent knowledge about the language you want to scan. Every language has a dark side! |
37 | 1 | Kornelius Kalnbach | * You need good knowledge of (Ruby) regular expressions. |
38 | 1 | Kornelius Kalnbach | * There's no documentation to speak of. |
39 | 1 | Kornelius Kalnbach | ** But this is a wiki ^hint hint^ ;o) |
40 | 1 | Kornelius Kalnbach | |
41 | 1 | Kornelius Kalnbach | But it has been done before, so go and try it! |
42 | 1 | Kornelius Kalnbach | |
43 | 1 | Kornelius Kalnbach | # You should still request the scanner (as described above) and announce that you are working on a patch yourself. |
44 | 19 | Kornelius Kalnbach | # Check out the [[Repository]] and try the [[Test Suite]]. |
45 | 1 | Kornelius Kalnbach | # Copy an existing scanner as a base. You know best which language comes closest to yours. |
46 | 20 | Kornelius Kalnbach | # Make sure you have run @rake test:scanners@ to get the scanner test suite. |
47 | 20 | Kornelius Kalnbach | # Create a test case directory in @test/scanners/<lang>@ and add example files for your language. |
48 | 20 | Kornelius Kalnbach | # Run your test cases with @rake test:scanner:<lang>@ and write your scanner! |
49 | 1 | Kornelius Kalnbach | # Also, look into @lib/coderay/scanners/_map.rb@ and @lib/coderay/helpers/file_type.rb@. |
50 | 1 | Kornelius Kalnbach | # Make a patch (scanner, test cases and other changes) and upload it to the ticket. |
51 | 1 | Kornelius Kalnbach | # Follow the ensuing discussion on the ticket. |
52 | 1 | Kornelius Kalnbach | # Prepare to be added to the THX list. |
53 | 1 | Kornelius Kalnbach | |
54 | 7 | Kornelius Kalnbach | Contact me (murphy rubychan de) if you have any questions. |
55 | 3 | Kornelius Kalnbach | |
56 | 3 | Kornelius Kalnbach | h2. What does a Scanner look like? |
57 | 3 | Kornelius Kalnbach | |
58 | 11 | Kornelius Kalnbach | For example, the JSON scanner: |
59 | 11 | Kornelius Kalnbach | |
60 | 3 | Kornelius Kalnbach | <pre><code class="ruby"> |
61 | 4 | Kornelius Kalnbach | # Namespace; use this form instead of CodeRay::Scanners to avoid messages like |
62 | 4 | Kornelius Kalnbach | # "uninitialized constant CodeRay" when testing it. |
63 | 3 | Kornelius Kalnbach | module CodeRay |
64 | 3 | Kornelius Kalnbach | module Scanners |
65 | 3 | Kornelius Kalnbach | |
66 | 4 | Kornelius Kalnbach | # Always inherit from CodeRay::Scanners::Scanner. |
67 | 4 | Kornelius Kalnbach | # |
68 | 4 | Kornelius Kalnbach | # Scanner inherits directly from StringScanner, the Ruby class for fast |
69 | 4 | Kornelius Kalnbach | # string scanning. Read the documentation to understand what's going on here: |
70 | 4 | Kornelius Kalnbach | # |
71 | 4 | Kornelius Kalnbach | # http://www.ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html |
72 | 1 | Kornelius Kalnbach | class JSON < Scanner |
73 | 3 | Kornelius Kalnbach | |
74 | 17 | Kornelius Kalnbach | # Deprecation notice: The Streamable module is gone. |
75 | 1 | Kornelius Kalnbach | |
76 | 17 | Kornelius Kalnbach | # Scanners are plugins and must be registered like this: |
77 | 1 | Kornelius Kalnbach | register_for :json |
78 | 1 | Kornelius Kalnbach | |
79 | 17 | Kornelius Kalnbach | # You can provide a file extension associated with this language. |
80 | 17 | Kornelius Kalnbach | file_extension 'json' |
81 | 17 | Kornelius Kalnbach | |
82 | 17 | Kornelius Kalnbach | # List all token kinds that are not considered to be running code |
83 | 17 | Kornelius Kalnbach | # in this language. For a typical language, this would just be |
84 | 17 | Kornelius Kalnbach | # :comment, but for a data or markup language like JSON, no tokens |
85 | 17 | Kornelius Kalnbach | # should count as lines of code. |
86 | 17 | Kornelius Kalnbach | KINDS_NOT_LOC = [ |
87 | 17 | Kornelius Kalnbach | :float, :char, :content, :delimiter, |
88 | 17 | Kornelius Kalnbach | :error, :integer, :operator, :value, |
89 | 17 | Kornelius Kalnbach | ] # :nodoc: |
90 | 17 | Kornelius Kalnbach | |
91 | 1 | Kornelius Kalnbach | # See the WordList documentation. |
92 | 1 | Kornelius Kalnbach | CONSTANTS = %w( true false null ) |
93 | 17 | Kornelius Kalnbach | IDENT_KIND = WordList.new(:key).add(CONSTANTS, :value) |
94 | 1 | Kornelius Kalnbach | |
95 | 1 | Kornelius Kalnbach | ESCAPE = / [bfnrt\\"\/] /x |
96 | 17 | Kornelius Kalnbach | UNICODE_ESCAPE = / u[a-fA-F0-9]{4} /x |
97 | 4 | Kornelius Kalnbach | |
98 | 4 | Kornelius Kalnbach | # This is the only method you need to define. It scans code. |
99 | 4 | Kornelius Kalnbach | # |
100 | 17 | Kornelius Kalnbach | # encoder is an object which encodes tokens. It provides the following API: |
101 | 17 | Kornelius Kalnbach | # * encoder.text_token(text, kind) for tokens |
102 | 17 | Kornelius Kalnbach | # * encoder.begin_group(kind) and encoder.end_group(kind) for token groups |
103 | 17 | Kornelius Kalnbach | # * encoder.begin_line(kind) and encoder.end_line(kind) for line tokens |
104 | 1 | Kornelius Kalnbach | # |
105 | 17 | Kornelius Kalnbach | # options is a hash. Standard options are: |
106 | 17 | Kornelius Kalnbach | # * keep_state: Try to save the current scanner state and restore it in the |
107 | 17 | Kornelius Kalnbach | # next call of scan_tokens. |
108 | 1 | Kornelius Kalnbach | # |
109 | 17 | Kornelius Kalnbach | # scan_tokens must return the encoder variable it was given. |
110 | 1 | Kornelius Kalnbach | # |
111 | 17 | Kornelius Kalnbach | # You are completely free to use any style you want, just make sure encoder |
112 | 17 | Kornelius Kalnbach | # gets what it needs. But typically, a Scanner uses the following scheme: |
113 | 17 | Kornelius Kalnbach | def scan_tokens encoder, options |
114 | 3 | Kornelius Kalnbach | |
115 | 3 | Kornelius Kalnbach | # The scanner is always in a certain state, which is :initial by default. |
116 | 1 | Kornelius Kalnbach | # We use local variables and symbols to maximize speed. |
117 | 4 | Kornelius Kalnbach | state = :initial |
118 | 3 | Kornelius Kalnbach | |
119 | 3 | Kornelius Kalnbach | # Sometimes, you need a stack. Ruby arrays are perfect for this. |
120 | 4 | Kornelius Kalnbach | stack = [] |
121 | 4 | Kornelius Kalnbach | |
122 | 3 | Kornelius Kalnbach | # Define more flags and variables as you need them. |
123 | 4 | Kornelius Kalnbach | key_expected = false |
124 | 4 | Kornelius Kalnbach | |
125 | 17 | Kornelius Kalnbach | # The main loop; eos? is true when the end of the code is reached. |
126 | 1 | Kornelius Kalnbach | until eos? |
127 | 1 | Kornelius Kalnbach | |
128 | 17 | Kornelius Kalnbach | # Deprecation notice: The use of local variables kind and match is no longer |
129 | 17 | Kornelius Kalnbach | # recommended. |
130 | 4 | Kornelius Kalnbach | |
131 | 1 | Kornelius Kalnbach | # Depending on the state, we want to do different things. |
132 | 1 | Kornelius Kalnbach | case state |
133 | 1 | Kornelius Kalnbach | |
134 | 4 | Kornelius Kalnbach | # Normally, we use this case. |
135 | 1 | Kornelius Kalnbach | when :initial |
136 | 3 | Kornelius Kalnbach | # I like the / ... /x style regexps because white space makes them more |
137 | 3 | Kornelius Kalnbach | # readable. x means white space is ignored. |
138 | 17 | Kornelius Kalnbach | if match = scan(/ \s+ /x) |
139 | 17 | Kornelius Kalnbach | # White space and masked line ends are :space. |
140 | 17 | Kornelius Kalnbach | # Make sure you never send an empty token! /\s*/ for example would be |
141 | 17 | Kornelius Kalnbach | # very bad (actually creating an infinite loop). |
142 | 17 | Kornelius Kalnbach | encoder.text_token match, :space |
143 | 3 | Kornelius Kalnbach | elsif match = scan(/ [:,\[{\]}] /x) |
144 | 3 | Kornelius Kalnbach | # Operators of JSON. stack is used to determine where we are. stack and |
145 | 1 | Kornelius Kalnbach | # key_expected are set depending on which operator was found. |
146 | 1 | Kornelius Kalnbach | # key_expected is used to decide whether a "quoted" thing should be |
147 | 1 | Kornelius Kalnbach | # classified as key or string. |
148 | 17 | Kornelius Kalnbach | encoder.text_token match, :operator |
149 | 1 | Kornelius Kalnbach | case match |
150 | 1 | Kornelius Kalnbach | when '{' then stack << :object; key_expected = true |
151 | 1 | Kornelius Kalnbach | when '[' then stack << :array |
152 | 1 | Kornelius Kalnbach | when ':' then key_expected = false |
153 | 1 | Kornelius Kalnbach | when ',' then key_expected = true if stack.last == :object |
154 | 1 | Kornelius Kalnbach | when '}', ']' then stack.pop # no error recovery, but works for valid JSON |
155 | 3 | Kornelius Kalnbach | end |
156 | 3 | Kornelius Kalnbach | elsif match = scan(/ true | false | null /x) |
157 | 1 | Kornelius Kalnbach | # These are the only idents that are allowed in JSON. Normally, IDENT_KIND |
158 | 4 | Kornelius Kalnbach | # would be used to tell keywords and idents apart. |
159 | 17 | Kornelius Kalnbach | encoder.text_token match, IDENT_KIND[match] |
160 | 17 | Kornelius Kalnbach | elsif match = scan(/ -? (?: 0 | [1-9]\d* ) /x) |
161 | 1 | Kornelius Kalnbach | # Pay attention to the details: JSON doesn't allow numbers like 00. |
162 | 17 | Kornelius Kalnbach | if scan(/ \.\d+ (?:[eE][-+]?\d+)? | [eE][-+]? \d+ /x) |
163 | 4 | Kornelius Kalnbach | match << matched |
164 | 17 | Kornelius Kalnbach | encoder.text_token match, :float |
165 | 17 | Kornelius Kalnbach | else |
166 | 17 | Kornelius Kalnbach | encoder.text_token match, :integer |
167 | 3 | Kornelius Kalnbach | end |
168 | 4 | Kornelius Kalnbach | elsif match = scan(/"/) |
169 | 3 | Kornelius Kalnbach | # A "quoted" token was found, and we know whether it is a key or a string. |
170 | 3 | Kornelius Kalnbach | state = key_expected ? :key : :string |
171 | 17 | Kornelius Kalnbach | # This opens a token group and encodes the delimiter token. |
172 | 17 | Kornelius Kalnbach | encoder.begin_group state |
173 | 17 | Kornelius Kalnbach | encoder.text_token match, :delimiter |
174 | 4 | Kornelius Kalnbach | else |
175 | 3 | Kornelius Kalnbach | # Don't forget to add this case: If we reach invalid code, we try to discard |
176 | 4 | Kornelius Kalnbach | # chars one by one and mark them as :error. |
177 | 17 | Kornelius Kalnbach | encoder.text_token getch, :error |
178 | 4 | Kornelius Kalnbach | end |
179 | 3 | Kornelius Kalnbach | |
180 | 3 | Kornelius Kalnbach | # String scanning is a bit more complicated, so we use another state for it. |
181 | 4 | Kornelius Kalnbach | # The scanner stays in :string state until the string ends or an error occurs. |
182 | 3 | Kornelius Kalnbach | # |
183 | 3 | Kornelius Kalnbach | # JSON uses the same notation for strings and keys. We want keys to be in a |
184 | 4 | Kornelius Kalnbach | # different color, but the lexical rules are the same. This is why we use this |
185 | 4 | Kornelius Kalnbach | # case also for the :key state. |
186 | 3 | Kornelius Kalnbach | when :string, :key |
187 | 3 | Kornelius Kalnbach | # Another if-elsif-else-switch, for strings this time. |
188 | 17 | Kornelius Kalnbach | if match = scan(/[^\\"]+/) |
189 | 4 | Kornelius Kalnbach | # Everything that is not \ or " is just string content. |
190 | 17 | Kornelius Kalnbach | encoder.text_token match, :content |
191 | 17 | Kornelius Kalnbach | elsif match = scan(/"/) |
192 | 3 | Kornelius Kalnbach | # A " is found, which means this string or key is ending here. |
193 | 3 | Kornelius Kalnbach | # A special token class, :delimiter, is used for tokens like this one. |
194 | 17 | Kornelius Kalnbach | encoder.text_token match, :delimiter |
195 | 17 | Kornelius Kalnbach | # Always close your token groups using the right token kind! |
196 | 17 | Kornelius Kalnbach | encoder.end_group state |
197 | 3 | Kornelius Kalnbach | # We're going back to normal scanning here. |
198 | 1 | Kornelius Kalnbach | state = :initial |
199 | 17 | Kornelius Kalnbach | # Deprecation notice: Don't use "next" any more. |
200 | 17 | Kornelius Kalnbach | elsif match = scan(/ \\ (?: #{ESCAPE} | #{UNICODE_ESCAPE} ) /mox) |
201 | 3 | Kornelius Kalnbach | # A valid special character should be classified as :char. |
202 | 17 | Kornelius Kalnbach | encoder.text_token match, :char |
203 | 17 | Kornelius Kalnbach | elsif match = scan(/\\./m) |
204 | 4 | Kornelius Kalnbach | # Anything else that is escaped (including \n, we use the m modifier) is |
205 | 1 | Kornelius Kalnbach | # just content. |
206 | 17 | Kornelius Kalnbach | encoder.text_token match, :content |
207 | 17 | Kornelius Kalnbach | elsif match = scan(/ \\ | $ /x) |
208 | 3 | Kornelius Kalnbach | # A string that suddenly ends in the middle, or reaches the end of the |
209 | 3 | Kornelius Kalnbach | # line. This is an error; we go back to :initial now. |
210 | 17 | Kornelius Kalnbach | encoder.end_group state |
211 | 17 | Kornelius Kalnbach | encoder.text_token match, :error |
212 | 1 | Kornelius Kalnbach | state = :initial |
213 | 4 | Kornelius Kalnbach | else |
214 | 3 | Kornelius Kalnbach | # Nice for debugging. Should never happen. |
215 | 17 | Kornelius Kalnbach | raise_inspect "else case \" reached; %p not handled." % [peek(1)], encoder |
216 | 3 | Kornelius Kalnbach | end |
217 | 1 | Kornelius Kalnbach | |
218 | 4 | Kornelius Kalnbach | else |
219 | 3 | Kornelius Kalnbach | # Nice for debugging. Should never happen. |
220 | 17 | Kornelius Kalnbach | raise_inspect 'Unknown state: %p' % [state], encoder |
221 | 3 | Kornelius Kalnbach | |
222 | 1 | Kornelius Kalnbach | end |
223 | 4 | Kornelius Kalnbach | |
224 | 17 | Kornelius Kalnbach | # Deprecation notice: The block using the match local variable is gone. |
225 | 3 | Kornelius Kalnbach | end |
226 | 3 | Kornelius Kalnbach | |
227 | 3 | Kornelius Kalnbach | # If we still have a string or key token group open, close it. |
228 | 3 | Kornelius Kalnbach | if [:string, :key].include? state |
229 | 17 | Kornelius Kalnbach | encoder.end_group state |
230 | 3 | Kornelius Kalnbach | end |
231 | 17 | Kornelius Kalnbach | |
232 | 17 | Kornelius Kalnbach | # Return the encoder. |
233 | 17 | Kornelius Kalnbach | encoder |
234 | 1 | Kornelius Kalnbach | end |
235 | 1 | Kornelius Kalnbach | |
236 | 1 | Kornelius Kalnbach | end |
237 | 1 | Kornelius Kalnbach | |
238 | 1 | Kornelius Kalnbach | end |
239 | 1 | Kornelius Kalnbach | end |
240 | 1 | Kornelius Kalnbach | </code></pre> |
241 | 1 | Kornelius Kalnbach | |
242 | 1 | Kornelius Kalnbach | Highlighted with CodeRay, yeah :D |
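
To try the scheme above without a CodeRay checkout, here is a minimal sketch of the same main loop that uses only Ruby's @StringScanner@. It is not part of CodeRay. @RecordingEncoder@ is a hypothetical stand-in for a real encoder and implements only @text_token@. The plain Hash named @IDENT_KIND@ is an assumed simplification of WordList, and the toy token rules are invented for illustration.

```ruby
require 'strscan'

# Hypothetical stand-in for a CodeRay encoder: it only records [text, kind] pairs.
class RecordingEncoder
  attr_reader :tokens

  def initialize
    @tokens = []
  end

  def text_token(text, kind)
    # Mirrors the rule from the JSON scanner comments: never send an empty token.
    raise ArgumentError, 'empty token' if text.empty?
    @tokens << [text, kind]
  end
end

# Assumed simplification of WordList: a Hash with a default kind.
IDENT_KIND = Hash.new(:ident)
%w[true false null].each { |word| IDENT_KIND[word] = :value }

# Toy scanner with a single state; every branch consumes at least one character.
def scan_tokens(code, encoder)
  scanner = StringScanner.new(code)
  until scanner.eos?
    if (match = scanner.scan(/\s+/))
      encoder.text_token match, :space
    elsif (match = scanner.scan(/[a-z]+/))
      encoder.text_token match, IDENT_KIND[match]
    elsif (match = scanner.scan(/-?(?:0|[1-9]\d*)/))
      encoder.text_token match, :integer
    else
      # Invalid input: discard one character and mark it as :error.
      encoder.text_token scanner.getch, :error
    end
  end
  encoder
end

tokens = scan_tokens('true 42 foo ?', RecordingEncoder.new).tokens

# Why /\s*/ is dangerous in the main loop: it matches the empty string
# without advancing, so eos? would never become true.
probe = StringScanner.new('x')
empty = probe.scan(/\s*/)

# Junk input check: the token texts should add up to the input again.
junk = Array.new(50) { rand(32..126).chr }.join
junk_tokens = scan_tokens(junk, RecordingEncoder.new).tokens
raise 'round trip failed' unless junk_tokens.map(&:first).join == junk
```

The last lines hint at one property a test with junk code can check: the loop terminates, and joining the token texts reproduces the input. This is only an illustration of the idea and says nothing about how the actual CodeRay test suite is built.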