Code Quality for Research

I reconnected with a friend at SRI the other day, and in the middle of a larger conversation, he expressed some frustration around balancing code quality and research progress.

This is a common frustration, and I think we as a community are still finding the right answers. Our conversation drifted to other topics, but I wanted to better articulate my thoughts around this.

I view research (and especially applied research of the type that Hop does) as a kind of multi-armed bandit problem. A gambler surrounded by slot machines (each a 'one-armed bandit') tries to balance playing a 'lucky' slot machine against trying new ones. More generally, multi-armed bandit problems are the class of problems that balance trying new approaches (exploration) against sticking with successful approaches (exploitation). (Yes, the term exploitation can appear problematic here -- sometimes I prefer to use the terms learning vs leveraging, but those are non-standard, so I'll stick to the more commonly used terms for now.)
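
To make the tradeoff concrete, here is a minimal sketch of one classic bandit strategy, epsilon-greedy, in illustrative Python (the function and parameter names are mine, not taken from any real codebase): with a small probability you explore a random machine, and otherwise you exploit the machine with the best observed payout.

```python
import random

def epsilon_greedy(arms, pulls=1000, epsilon=0.1):
    """With probability `epsilon`, explore a random arm; otherwise exploit the
    arm with the best observed average reward. Unpulled arms are treated
    optimistically so each gets tried at least once."""
    counts = [0] * len(arms)    # times each arm has been pulled
    totals = [0.0] * len(arms)  # cumulative reward per arm

    def mean(j):
        return totals[j] / counts[j] if counts[j] else float("inf")

    for _ in range(pulls):
        if random.random() < epsilon:
            i = random.randrange(len(arms))      # explore: try any arm
        else:
            i = max(range(len(arms)), key=mean)  # exploit: best arm so far
        counts[i] += 1
        totals[i] += arms[i]()

    return counts, totals

# Three "slot machines" with different (hidden) payout probabilities.
machines = [lambda p=p: 1.0 if random.random() < p else 0.0
            for p in (0.2, 0.5, 0.7)]
counts, _ = epsilon_greedy(machines)
print(counts)  # pulls should concentrate on the 0.7 machine over time
```

The interesting part is the single `epsilon` knob: turn it up and you learn more about the space; turn it down and you squeeze more out of what you already know.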

If you only stick to successful approaches (exploit only), you may entirely miss much better approaches to the problem space. If you constantly try new approaches (explore only), you may never actually get around to using the useful things you discover. (Unrelated, but IMHO this is also a useful way to think about your career.)

The code quality/technical debt conversation is usually a bit muddled because both research teams and engineering teams these days work with code, the cloud, large-scale datasets, and the like. It's sometimes challenging to know where to draw the line. It becomes a bit easier to think about if you articulate where on the exploration/exploitation spectrum you currently are.

"Exploiting" an idea typically involves translating it into production systems, where it can positively influence the wider world. All of the standard software engineering and distributed systems principles apply. There are also some subtle but important things to keep in mind that are specific to production-grade ML systems (as opposed to standard distributed systems). Nonetheless, the orientation is consistent -- you want reliable, scalable, long-lived systems that deliver consistent (and continually improving) real-world impact based off of the research idea you have discovered.

On the other hand, if you're "exploring" a space to see if there's a useful approach there, slightly different tradeoffs make sense. The output of your work is not really the code as much as the insights you have about what will and won't work. The code is just a way to find those insights, and to substantiate those insights to peers. Consequently:

  1. Your code does not need to be scalable beyond what your experiments require.
  2. Your code does not need to be optimized, or even particularly performant.
  3. One important purpose of your code is to convince you. Correctness matters, and research code is not intrinsically any more likely to be correct than production code, so appropriate tests are just as important in research code as in production code. (Joel Grus has some really useful pointers on testing research code; a minimal sketch of such a test follows this list.)
  4. Another important purpose of your code is to convince others. The more readable and well-factored your code is, the easier it is for others (and for you, a year later) to follow the chain of your reasoning and accept your conclusions, collaborate with you, and build upon your work.
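
To make that concrete, here is a minimal sketch of the kind of cheap sanity tests I have in mind, in the spirit of Joel Grus's advice. The `normalize_features` helper is purely illustrative; the point is the shape and invariant checks around it.

```python
import numpy as np

def normalize_features(x):
    """Zero-mean, unit-variance scaling per feature column (illustrative helper)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def test_normalize_features_shapes_and_stats():
    # The kind of cheap sanity check that catches silent broadcasting bugs.
    x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))
    z = normalize_features(x)
    assert z.shape == x.shape
    np.testing.assert_allclose(z.mean(axis=0), 0.0, atol=1e-6)
    np.testing.assert_allclose(z.std(axis=0), 1.0, atol=1e-3)

def test_normalize_features_handles_constant_column():
    # A degenerate input that often breaks research code in subtle ways.
    x = np.ones((10, 2))
    z = normalize_features(x)
    assert np.isfinite(z).all()
```

Tests like these run under pytest (or can simply be called directly), cost minutes to write, and catch the silent broadcasting and degenerate-input bugs that would otherwise undermine your conclusions.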

Again, your work product is the insights you deliver. The hard-earned lessons about the things you tried that seemed reasonable at the time but did not work, for subtle or not-so-subtle reasons, are even more important to capture. Capturing negative experiments thoughtfully serves three useful purposes:

  1. It helps other researchers understand the space better, by shedding light on nuances that were not obvious at first glance.
  2. It gives you the means to revisit the experiment as you discover more about the space -- perhaps a novel embedding approach (or a different neural architecture, or an alternative data source, or similar) has shown drastic improvements in other experiments. The easier it is for you to try that approach with your prior reasonable-but-unsuccessful experiments, the more likely you are to be able to exploit the value of your previous explorations.
  3. It serves as a system of record for future researchers onboarding to the project. On a much more tactical level than research reports or publications, the archive of experiments that were tried (and the corresponding insights) helps future researchers understand the line of your research inquiry. (A minimal sketch of one such record follows this list.)
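
As a sketch of what capturing an experiment thoughtfully might look like in practice -- every field name and entry below is invented purely for illustration -- a lightweight, structured record can be enough:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ExperimentRecord:
    """One entry in a lightweight experiment log (illustrative structure)."""
    name: str
    hypothesis: str           # what we expected, and why it seemed reasonable
    setup: dict               # enough config to re-run it: data, model, seeds
    outcome: str              # what actually happened, including key metrics
    insight: str              # the lesson learned, especially for negative results
    worth_revisiting_if: str  # the trigger that would make this worth retrying
    run_date: date = field(default_factory=date.today)

# A made-up example entry, purely to show the shape of the record.
log = [
    ExperimentRecord(
        name="embeddings-vs-tfidf-baseline",
        hypothesis="Pretrained embeddings should beat TF-IDF on our small corpus",
        setup={"data": "corpus-v2", "model": "tfidf+logreg", "seed": 7},
        outcome="The TF-IDF baseline won; the embedding model overfit",
        insight="With very little labeled data, simpler features generalized better here",
        worth_revisiting_if="we get substantially more labeled data or a domain-adapted encoder",
    ),
]
```

Whether this lives in a dataclass, a YAML file, or a wiki page matters far less than that the hypothesis, the outcome, and the conditions for revisiting are captured while they're still fresh.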

This is very different from production code! In production codebases, the only things that ought to get merged to mainline are those that are ready to be deployed to customers. Things you tried that didn't work never get merged in the first place. Things that used to work but no longer do get ripped out, and though they live in version-control history, they are intentionally and appropriately hidden from new team members because they are no longer relevant. Understanding the data structure or algorithm that worked at an earlier stage of the system but no longer does isn't particularly useful, and it's unlikely that older approaches will become relevant again as the system grows.

Our industry is changing fast, and practitioners often end up wearing many hats. Both exploring the space and exploiting an approach are necessary, and sometimes they are done by the same people. Frustratingly, they can look rather similar day to day in terms of tasks and tools, and it's easy to forget which of the two you're doing. It's only a quick heuristic, but I've found that identifying where on the continuum you currently are can really help clarify the level of code quality that is appropriate.

I do think we as a community are still trying to find the right answers here, but I hope this sheds some light on my current thinking. I'm super curious what other folks think about this. Join the conversation on LinkedIn.

— Ankur Kalra, Founder/CEO @ Hop