Testing the compiler

The work on new error formats has raised some interesting questions about how to test the compiler. The current testing framework works roughly as follows:

  • run-pass: the test must compile and run successfully, but compilation is allowed to issue warnings
  • compile-fail:
    • you must annotate errors/warnings that are expected
    • any unexpected errors/warnings will make the test fail
    • if you add a NOTE or HELP annotation, then you must annotate all notes and helps
      • in that case, any unexpected notes/helps will also make the test fail
    • however, if you don’t have any NOTE or HELP annotations in the test, then notes/helps are generally ignored
    • you can also add arbitrary annotations that just look for specific lines
  • ui (this one is relatively new)
    • a file alongside the test captures the exact output of the compiler
    • intended for testing the “round-trip” experience, including all formatting details
    • for example, a test’s Rust file sits alongside a file with its expected output (a made-up sketch of both styles appears just after this list)
    • there is a script to automatically capture the current output and update the reference files en masse
      • the idea is that you can then browse the git diff to see the effect of your change
      • also easy for people reviewing your PR, since they’ll see the diffs

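As a concrete (made-up) illustration of the two styles, assuming a hypothetical test file and reference file name:

```rust
// compile-fail style: the expected diagnostic is annotated inline on the
// line where it should be reported (the message text is rustc's real E0308
// wording, but the test itself is invented for illustration).
fn main() {
    let _x: u32 = "hello"; //~ ERROR mismatched types
}

// ui style: the same source would carry no annotations at all; a sibling
// reference file (e.g. `mismatched.stderr`) would instead hold the exact
// rendered diagnostic, spans, notes, and all.
```
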
This setup has, I think, largely evolved following a particular philosophy, and it’s one that I have actively encouraged thus far. Specifically, that we should be able to write tests that test precisely the thing we intend to test, without depending on incidental presentation details. So, for example, if we are writing a type-checking test, we might want to check that a particular bit of code fails to compile for the expected reason, but we don’t necessarily care what error message it gives – and if we do want to test that, we should write a test targeting specifically the error message (ideally a ui test, which captures the full presentation).

Until recently, I’ve been advocating this POV for a couple of reasons. For one thing, it avoids a lot of painful redundancy in the tests. For example, consider a test like this one, which covers a number of different scenarios, all of which generate the same error message. And that test is just one of a set of many. Now, that particular message (“cannot infer”) has a lot of notes and things attached to it – but is it really valuable to test all of those notes on each of these variations? Also, it sometimes takes a lot of time to keep those annotations up to date, particularly if you make a small change to wording. Is it worth it?

On the other hand, I’ve been rethinking things a bit as a result of the error message update. In particular, it is handy sometimes to have a bunch of tests for random scenarios, even if those tests were not intended to test notes. I think we caught at least a few (maybe more than a few?) bugs around custom notes as a result of having those extra annotations. Essentially, checking the full details (all notes etc) lets you get the most out of your existing test base, in some sense.

But by that logic, it would seem that we should move things to the ui test suite entirely, which captures the compiler’s output in full. For example, subtle changes to a span can change which columns are highlighted – this can have a very big impact on how usable the compiler is, especially since a lot of error messages rely on the span to give context – but it doesn’t get tested at all by the compile-fail setup, which only checks line numbers.

The scripts that update the ui tests are also an interesting factor, since they mean that keeping the test suite in sync after a big change can actually be less work than with compile-fail, which can require a lot of laborious edits. (Though some argue that it makes it too easy to update the expected output, since you don’t necessarily think carefully about why something changed; I think that’s largely a problem that review will catch, though.)

So, what do people think? Should we continue having a “targeted” testing philosophy? Should we go for maximal testing? Does that imply that ui tests are the way we should go, or is that too much detail (and, if so, why)?

cc @jntrnr

This is related, more than a direct answer, but…

I would still like to see pluggable test runners. With work done on them out of tree, we could at some point see great features we'd like to use for testing rustc itself, and then adopt them.

Testing binaries is one of the weakest parts of Rust's overall testing setup, and rustc is just another Rust binary.

FWIW, Cargo has used a “maximal testing” approach from the beginning: it literally spawns cargo as a subprocess and matches the exact stdout/stderr/exit status of the child. This is fantastic for seeing when changes have rippling effects, actually seeing what’ll get presented to users, and preventing accidental regressions. Cargo’s test suite is less easy to blanket-update than the ui test suite, but with a script that concern should be moot.
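
A rough sketch of that style (not Cargo's actual harness; the binary path and the expected message below are placeholders):

```rust
use std::process::Command;

#[test]
fn reports_missing_manifest() {
    // Spawn the binary under test as a real subprocess.
    let output = Command::new("target/debug/mytool")
        .arg("build")
        .output()
        .expect("failed to spawn process");

    let stderr = String::from_utf8_lossy(&output.stderr);

    // Match the exit status and the exact text a user would see, so any
    // change to the presentation shows up as a test failure (or diff).
    assert!(!output.status.success());
    assert_eq!(stderr, "error: could not find `Config.toml`\n");
}
```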

I’d personally be in favor of more ui tests, as they’re a great way to see exactly what is getting printed out to the user, and you can evaluate it visually rather than settling for “well, at least the information is there”.

Throw in my +1 for basically moving the compile-fail directory to ui tests. Not testing the spans, especially when (as you say) they are becoming a more important part of the error experience, just doesn’t seem smart. By putting everything in ui tests, we can catch any span regressions.

Do we know how this will impact “make check” time? Are UI tests slower or faster? So long as it’s in the same ballpark, it’s probably a net win.

For the purposes of making rustc as polished as possible I think it makes sense to move toward the ‘ui’ testing model of capturing all of rustc’s output for every failure case.

There are two things, though, that I’d like you to keep in mind.

First, alternate implementations. As more implementations of Rust appear they will need to test themselves against the reference implementation. Realistically, this means they will be using our test suite and our test suite needs to evolve into a usable reference test suite. The JetBrains people are already using our test suite this way (cc @matklad). Exact UI testing is in opposition to this goal - it seems likely to me that alternate implementations would not be inclined (or maybe even able) to reproduce rustc’s output, and certainly it would not be the initial focus of any implementation of Rust. Some simple affordances can make this work though - e.g. the test harness can have a more forgiving mode where it’s only checking error codes (it’s reasonable for our error codes to be part of the language spec that other implementations have to match).
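
For what it’s worth, a forgiving mode could start by comparing only the set of error codes. Assuming diagnostics are available as JSON in roughly the shape rustc’s `--error-format=json` produces (one object per diagnostic, with an optional nested `"code"` field; this shape is an assumption here), and using `serde_json`, a sketch might look like:

```rust
use std::collections::BTreeSet;

/// Extract just the error codes (e.g. "E0308") from a stream of JSON
/// diagnostics, one JSON object per line. The `{"code": {"code": ...}}`
/// nesting is an assumption about the JSON error format.
fn error_codes(json_lines: &str) -> BTreeSet<String> {
    json_lines
        .lines()
        .filter_map(|line| serde_json::from_str::<serde_json::Value>(line).ok())
        .filter_map(|diag| diag.get("code")?.get("code")?.as_str().map(str::to_owned))
        .collect()
}

/// "Forgiving" comparison: the implementation under test passes if it
/// reports the same error codes, regardless of wording, spans, or notes.
fn codes_match(expected_json: &str, actual_json: &str) -> bool {
    error_codes(expected_json) == error_codes(actual_json)
}
```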

Second, internationalization. Although it isn’t certain that we will internationalize the compiler, there will continue to be pressure to do so. I don’t know what that would mean for testing - it may not even be worth running the test suite in other languages, or i18n might have its own set of tests.

So I had some thoughts that are relevant to internationalization, I think.

I was thinking that for a given .rs input, it’d be nice to be able to have multiple possible “reference” files, corresponding to distinct output modes. For example, one might have a .json reference, to show the JSON output we expect, and another for testing just the formatted human-readable output. If we add additional --error-format options, such as some for accessibility or i18n, those would be relevant too.

I guess that we’d have to compile a single .rs file once per kind of output, unless we can generate more than one at a time from a single run.
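
A sketch of how a harness might map modes to reference files (the `Mode` enum and the extensions are made up for illustration):

```rust
use std::path::{Path, PathBuf};

/// Hypothetical output modes, each tied to a compiler flag and a
/// reference-file extension sitting next to the test's .rs file.
enum Mode {
    Human, // default rendering, compared against foo.stderr
    Json,  // --error-format=json, compared against foo.json
}

impl Mode {
    fn reference_file(&self, test: &Path) -> PathBuf {
        match self {
            Mode::Human => test.with_extension("stderr"),
            Mode::Json => test.with_extension("json"),
        }
    }
}

/// Only the modes whose reference file actually exists get checked, so a
/// single .rs file can opt into one, some, or all output formats.
fn modes_to_check(test: &Path) -> Vec<Mode> {
    vec![Mode::Human, Mode::Json]
        .into_iter()
        .filter(|mode| mode.reference_file(test).exists())
        .collect()
}
```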

For alternate implementations, that’s a good point, and I hadn’t thought about it much – it seems like if we had the JSON output it’d be pretty easy to transform that into something minimal that other implementations can test against (and it also seems like something other implementations might be able to match). But basically, yeah, a kind of “fuzzy mode” makes sense.

To clarify, we are using the test suite in a very coarse way: we test that the parser does not produce errors for valid files and does not do something weird like indexing out of bounds for invalid ones. Waiting for https://github.com/rust-lang/rust/issues/30942 to make something more intellijent :slight_smile:

Regarding alternative implementations, it might be useful to be able to dump not only the compiler’s output, but also some intermediate artifacts, like parse trees, resolution info, and inferred types. Not that it needs to be done today: as long as you have the test files and a working compiler, you should be able to create such an implementation-agnostic test suite on demand.

EDIT: what could also be useful for alternative implementations is a “progressive test suite”, ordered from simple tests to complex ones, to allow developing an alternative compiler in a TDD style.

Ah, that reminds me of something else. I was thinking that I would like to unify run-pass and compile-fail. I want to provide some kind of header indicating whether compilation should succeed and whether the test should be executed. This could also be done just with reference files. I.e., if each test could have multiple reference files (a sketch of this dispatch appears after the list), then:

  • at least one reference file must exist.
  • if a “compiler output” reference file exists, we will compare the compiler output (this lets you test for warnings on a run-pass test, for example, as well as capturing errors);
  • if a “runtime output” reference file exists, we will check that the output when executed matches.

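A sketch of how that dispatch could look (the file extensions and names are illustrative, not a settled proposal):

```rust
use std::path::Path;

/// What a unified test should do, derived purely from which reference
/// files sit next to its .rs file.
struct TestPlan {
    check_compiler_output: bool, // a foo.stderr reference exists
    run_binary: bool,            // a foo.run.stdout reference exists
}

fn plan_for(test: &Path) -> Option<TestPlan> {
    let check_compiler_output = test.with_extension("stderr").exists();
    let run_binary = test.with_extension("run.stdout").exists();

    // At least one reference file must exist, per the rules above.
    if !check_compiler_output && !run_binary {
        return None;
    }
    Some(TestPlan { check_compiler_output, run_binary })
}
```
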
Often, when working on a feature, I would like to lump all the tests related to that feature together – some of them will be positive (code runs as it is supposed to), some of them will be negative (compilation fails when it is supposed to).

I guess in general I think it’s more useful to group tests by “feature” or “part of the code they test” than by how they should be executed.

Yeah, I was just thinking… if we switch compile-fail to ui tests, then it falls out pretty well. We already use *.stderr files to show the expected error in a ui test. We could also do the same for *.stdout. Then the rest follows your logic.

So @pnkfelix and I were talking about this today and we thought of an interesting idea for the ui testing harness. In particular, for some tests it IS nice to have comments inline where you expect the errors to occur (e.g., this test). However, compile-fail alone is not so great for the reasons given previously (e.g., we’re not testing that the precise spans don’t regress, just the line numbers, and we can’t capture all the details of the presentation).

So perhaps we should improve the ui tests so that they do both:

  • check the .stderr files as today
  • but also if there are //~ ERROR annotations, check those against the output as well

The idea would be to support a more limited version of said annotations. In particular, we’d scrape the output just to find the “main line number” and “main message” (i.e., scan for r"^ *--> <filename>:<line>"), and then we’d match the error against that. I could even imagine just supporting //~ ERROR with no details (since the stderr already captures the error text itself).

This just gives us a lightweight way to make the test more robust by asserting, in the test itself, where you expect errors to occur. The idea would be that if you have ERROR annotations, we will assert that errors occur on those lines (and, I guess, on no other lines?).
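
A sketch of that scraping (using the `regex` crate; the helper names are made up, and the pattern mirrors the one above):

```rust
use regex::Regex;
use std::collections::BTreeSet;

/// Collect the "main" line numbers from rendered stderr by scanning for
/// lines of the form `  --> <filename>:<line>:<col>`.
fn reported_lines(stderr: &str, filename: &str) -> BTreeSet<usize> {
    let pattern = format!(r"^ *--> {}:(\d+)", regex::escape(filename));
    let re = Regex::new(&pattern).expect("valid regex");
    stderr
        .lines()
        .filter_map(|line| re.captures(line))
        .filter_map(|caps| caps[1].parse().ok())
        .collect()
}

/// The check: every line carrying a `//~ ERROR` annotation shows up as a
/// main error line in the output. Requiring the two sets to be equal would
/// enforce the stricter "and no other lines" reading.
fn annotations_hold(annotated: &BTreeSet<usize>, reported: &BTreeSet<usize>) -> bool {
    annotated.is_subset(reported)
}
```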

Some questions:

  • should we allow/require you to include text from the main message? error code?
  • should one //~ ERROR match any number of errors from the main line?

As a side note, is there any interest in using fuzz testing on the compiler?

If so, is there any work you know of on automatically generating Rust programs?

@carols10cents was working on that a while ago, and produced a Twitter bot: everyrust. It can only generate a subset of the language though.

Yup, code here. The reference and grammar aren’t up to date, so I got distracted from the code generation to work on those, then I got distracted by… lots of other stuff and haven’t been back :slight_smile:

Yes, fuzz testing is very desirable. Please do it and find lots of bugs.
