Testing Research Code: Sometimes Worth It

Machine learning researchers often don’t write tests for their code. They’re not software engineers, and their code needs only to train a model or prove out an experiment. Plus, their code changes rapidly, and it’s hard to write tests in that context that don’t immediately need to be rewritten. However, at Hop, we’ve found that adding certain kinds of tests can actually accelerate research and increase confidence in results by improving code quality and encouraging reuse.

Testing Code is Great

Testing is a critical component of software engineering, allowing developers to build software with more speed and confidence. Methodologies like Test-Driven Development (TDD) can improve code even if you don't practice them dogmatically. Much has been written about testing software generally, but here I would like to discuss testing in a research setting, where quality still matters. Though we primarily write software for machine learning at Hop, much of what I’ll share here is applicable to any code for experimental research – hopefully useful for researchers trying to bolster their code with some additional engineering, or for software engineers who find themselves supporting researchers (who may or may not have any interest in testing).

Research code isn't always well tested

This makes sense. Research code is often intended only to help uncover some insights or prove out an idea, so it might not pay off to invest in rigorous testing. The research idea may not work out, or perhaps the code will be tossed out or rewritten once you get what you need out of it. It's usually not clear up front what parts of the code are going to be useful long-term, so it's risky to put effort into testing. The expected return on investment for tests on research code is generally low.

Why would you test research code?

Often, though, parts of the code live on in some form. Snippets or utility functions can get copied from project to project. Moreover, there can be a lot of value in building common components around datasets or training regimes. Doing this can be a great way to accumulate knowledge about datasets and formats as well as useful techniques or algorithms. However, you can’t trust reusable components without tests.

These considerations are particularly relevant if code is shared amongst a team of researchers working on similar problems – both the potential value of shared code and the need for assurance that it works correctly. Even a solo researcher working on a disparate set of problems likely has some minimal set of helper functions worth cleaning up and maintaining with some tests. Silent bugs in subtle algorithms (like “off-by-one” mistakes) or performance degradation due to model architecture misconfigurations can stay hidden for a long time if you don’t put some assurances in place. Even loud errors can be slow to debug if your code isn’t well factored, and they can interrupt long-running calculations and model training, wasting time and resources.

Start simple

The good news is that some tests are better than no tests, and you can get a lot of value for minimal effort right from the start. Some areas are worth far more of your test-writing budget than others: even if you don’t reuse your code across researchers or projects, we’ve found that very narrowly scoped tests and very broad tests are great places to start. Narrow unit tests around small utility functions – functions like “apply_tensor_coordinate_transform” – tend to have a long useful life. This is also where my TDD spidey-sense tingles, because the desire to test code like this often leads to better factoring and more reusability generally. On the other side of things, broad tests that barely check anything are the other easy starting point – this can be as simple as testing that your main script runs in some abbreviated form (dummy input, mocked external dependencies) without crashing. I’ll say more on both of these cases below.

Small tests are nice tests

Tests of small snippets of code are usually the easiest to write. In the context of research, they provide confidence in the correctness of your results. This is where you make sure you don’t get an off-by-one error while manipulating tensor indices, or have a spurious negative sign in one of your equations. Those kinds of errors can take a long time to debug, or worse, can silently skew your results. An extra benefit is that you get some free documentation of how to use the code – a collaborator (who could be a future version of yourself) can see simplified versions of the inputs and outputs and get some insight into any conventions or assumptions (e.g., “Are the coordinates UTM or some local reference frame?”, “Are these units in m/s or km/hr?”).
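
Here’s a minimal sketch of the kind of narrow test I mean – everything in it is hypothetical (a small coordinate-shifting helper in the spirit of the “apply_tensor_coordinate_transform”-style utilities above), but it shows how a few lines of pytest-style testing can pin down indexing behaviour while documenting the units convention:

import numpy as np


def apply_coordinate_offset(points, origin):
    """Shift local-frame (x, y) points into a global frame.

    points: array of shape (N, 2), in metres, local reference frame.
    origin: array of shape (2,), global coordinates of the local origin, in metres.
    """
    points = np.asarray(points, dtype=float)
    return points + np.asarray(origin, dtype=float)


def test_apply_coordinate_offset_shape_and_units():
    # Two points one metre apart along x, in the local frame.
    local = np.array([[0.0, 0.0], [1.0, 0.0]])
    origin = np.array([500_000.0, 4_649_776.0])  # a UTM-style origin, in metres

    result = apply_coordinate_offset(local, origin)

    # The shape is preserved and the offset is applied per point (row), not per axis.
    assert result.shape == (2, 2)
    np.testing.assert_allclose(result[0], origin)
    np.testing.assert_allclose(result[1], origin + [1.0, 0.0])

The fixture values double as documentation: a collaborator can see at a glance that everything is in metres and that the offset is applied per point.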

Broad tests are easy tests

The other easy way to get started with tests is to write very broad but very simple ones. Think end-to-end or integration-level tests, but try to remove all the complicated external dependencies. Simply put, exercise as much of the code as you can and see that it runs without crashing. The goal is not to boost confidence in your results, but to catch breaking changes as soon as possible so you can keep changing your code easily. This works best if you can run the code on some dummy input in a way that finishes quickly. For example, if you have a training script, a test could train on a very small dataset for only one or two epochs and verify that training completes. You won’t learn anything about how well your model works this way, but it will serve as a tripwire, telling you as soon as something breaks (hopefully before you kick off that big hyperparameter sweep over the weekend). This test should be loosely coupled to the inner workings of your code, so that it won’t care if you tweak the architecture of your model – either training runs or it doesn’t.
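
As a concrete sketch, a tripwire test might look something like the following. The train() entry point here is a hypothetical stand-in (a toy model on a tiny random dataset), but the shape of the test is the point: run an abbreviated version of training end to end and only assert that it finishes:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train(dataset, epochs=1, hidden_dim=8):
    """Hypothetical stand-in for a project's real training entry point."""
    model = nn.Sequential(nn.Linear(4, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in DataLoader(dataset, batch_size=4):
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model


def test_training_smoke():
    # Tiny dummy dataset (16 random samples) so the whole test runs in seconds.
    x = torch.randn(16, 4)
    y = torch.randn(16, 1)
    model = train(TensorDataset(x, y), epochs=2)
    # No claims about model quality – only that training completed and the
    # resulting model can run a forward pass of the expected shape.
    assert model(x).shape == (16, 1)

Because the test only touches the entry point and checks a forward pass, you can rework the model internals without breaking it – either training runs or it doesn’t.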

This technique can also be used to isolate stable components while you work on other aspects of the code. Take a stable piece of our training script example above, say an encoder or decoder: pickle its weights, run some fixed input through it, and verify that you always get the same output. This is basically snapshot testing from frontend web development, and it will produce brittle tests. The trade-off is that you can be sure your encoder or decoder is stable while you work on other components. Additionally, this sort of technical debt will nudge you into eventually writing better tests once you get tired of updating the pickled model weights (which you will need to do whenever you make an intentional change to the encoder/decoder).
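
A sketch of that kind of snapshot check is below. The Encoder is a hypothetical placeholder, and for simplicity this version snapshots the output for a fixed input and seed (rather than pickled weights), recording it on the first run and comparing against it afterwards:

from pathlib import Path

import torch
from torch import nn

SNAPSHOT = Path("encoder_snapshot.pt")


class Encoder(nn.Module):
    """Hypothetical stand-in for the encoder under development."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))

    def forward(self, x):
        return self.net(x)


def test_encoder_output_matches_snapshot():
    torch.manual_seed(0)  # fixed seed so the randomly initialised weights are reproducible
    encoder = Encoder()
    x = torch.linspace(-1.0, 1.0, 8).reshape(2, 4)  # fixed, deterministic input
    with torch.no_grad():
        out = encoder(x)

    if not SNAPSHOT.exists():
        torch.save(out, SNAPSHOT)  # first run: record the snapshot
    expected = torch.load(SNAPSHOT)
    # Deliberately brittle: any change to the encoder (or the seed) fails this test,
    # which is exactly the nudge towards better tests described above.
    torch.testing.assert_close(out, expected)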

Testing can be worth it

Though often neglected, testing can lead to better research code that enables more experiments and accelerates insights. It can provide more confidence in the correctness of experimental results while making it easier to reuse or change code. Avoid the middle ground between very narrow unit tests and broadly scoped tests, and you’ll also avoid the constant breaking of your tests, allowing you to enjoy the benefits of automated testing. 

— Mark Flanagan, ML Engineer @ Hop