RFC - compiler refactoring - spans

OK, I won’t have time to write the blog post soon, so I’ll try to write up the short version of my findings here.

Getting rid of expansion ids

The idea is actually pretty simple: Instead of recording expansion information in the ExpansionInfo list in the CodeMap, we can just treat each expansion as a new (virtual) file in the codebase and allocate spans from there. So for every call to assert!() or any other macro we create a new FileMap in the CodeMap, which also stores the information currently found in ExpansionInfo. All spans from this particular expansion are then allocated within the range of this virtual file map. This way we don’t have to set aside a fixed number of bits for the expansion id.

Encoding Spans

I had the idea that we could encode spans like strings with small-string optimization: Let’s say we want a span to be X bits large. If we can represent the whole span in X-1 bits, we do so. Else, we set a tag bit that tells us that the other X-1 are to be interpreted as an index into a side-table in the codemap. This side table stores spans in a format that can represent all possibilities (like our current span format).

So, let’s say we want spans as they occur in the AST to take up only 32 bits. Then we could encode them as:

........ ........ ........ ........
TPPPPPPP PPPPPPPP PPPPPPPP LLLLLLLL 

That is, we have the tag bit, 23 bits for the position of the span within the codemap and 7 bits for the length. Everything that cannot be represented this way because its position is greater than 2^23 or its length is greater than 2^7 is stored in the side table.

Note that the side table is needed when actually interpreting the span (e.g. when reporting an error) but not when copying the span around. A potential Span struct could thus conveniently derive(Copy). Also note that this kind of side table does not have the negative effects of the one mentioned in the OP, as it doesn’t rely on node IDs or memory addresses.

More Flexible Span Encodings

We don’t have to restrict ourselves to just one tag bit. If we use two bits, we can use the value 0 as the “side table marker” and the values 1, 2, and 3 as identifiers for different encodings. That is, we could have an encoding that looks like this:

........ ........ ........ ........
00IIIIII IIIIIIII IIIIIIII IIIIIIII or
01PPPPPP PPPPPPPP PPPPPPPP LLLLLLLL or
10PPPPPP PPPPPPPP PPPPPPPP PPPPLLLL or
11PPPPPP PPPPPPPP PPPLLLLL LLLLLLLL 

An encoding like the above would catch a few more spans, because it can represent all short spans (length < 15) with positions up to 2^26.

Finding an Optimal Encoding

In order to find the best encoding (at least regarding to my sample codebases) I have …

(1) … modified rustc so it would dump all spans found in the ast-map into a simple binary file. Spans are dumped after macro expansion and importing other crates, so this should give a pretty accurate picture of what spans can occur, and …

(2) … written a program that analyzes the dumped span data with regards to different encodings.

The program can take an encoding as described above and test it against a code base (i.e. the span data dumped by the modified rustc). It will then tell you what percentage of the spans can be represented without a side table entry.

Given a number of bits to use in total and a number of bits to use for the encoding tag, it can also generate all possible encodings for those numbers. If the tag is one bit, we can exhaustively search the problem space to find the best encoding. For more than one bit, the problem space gets too large to search exhaustively, but the program uses a simple randomized tabu search that finds good results pretty quickly.

I once let my Core i7 run at full CPU utilization on all 8 hardware threads for 10 hours—which was just enough to try out all ~20 billion possible encodings for a 2 bit tag. The tabu search yielded the optimal result in 10 seconds :).

Results

My test used the complete rustc+std code base (everything that gets compiled by make rustc) as reference. The data also includes imported filemaps (see PR #22235) and encodings are tested with and without the above mentioned expansion-id optimization.

The following table shows the best results with different settings. Encodings are represented as LxPyEz which means x length-bits, y position-bits, and z expansion-id-bits. The number next to the encoding is the percentage of spans in the codebase that can be encoded without side table entry by the given encoding.

|               | 32 bit (1 bit tag) | 32 bit (2 bit tag)                |
| ------------- | ------------------ | --------------------------------- |
| w/ ExpnId     | (L7P24E0) (65%)    | (L4P17E9, L6P24E0, L9P21E0) (69%) |
| w/o ExpnId    | (L7P24) (94%)      | (L6P24, L7P23, L9P21) (96%)       |

As can be seen, using the expansion-id optimizations allows for more than 90% of all spans to be represented with only 32 bits, which sounds pretty good to me. The flexible encoding also brings a few percent, but not much. However, I suspect that the flexible encoding can better keep up with even larger code bases.

Conclusion

If I haven’t made a mistake in my analysis somewhere (which is entirely possible) we should be able to store spans in 32 bits. Also, with the virtual-filemap/expansion-id optimization it should be possible to store all spans within 64bit without needing a side-table.

EDIT: Meh, discourse can’t do html tables… EDIT2: And it swallows stuff …

7 Likes