eteppo

Research Software Engineering: 3 Things Any Researcher Should Know

Published: 2024-07-14

I don't want to cover general problem-solving principles in this post, such as "keep it simple", "plan well", "don't do unnecessary things", and "don't repeat yourself". Such general problem-solving and efficiency-increasing methods are crucial in any context.

Instead, I'm here to list three practical topics from software engineering that can improve the quality of science.

  1. Software design

Big ball of mud. Spaghetti code. These are some words used for code that is awfully unstructured and a nightmare to read and understand. To avoid some nasty consequences, researchers should know a bit about different ways to structure code.

First, the structure of the code should obviously mirror the structure of the problem and its solution (or model the domain, so to speak). The overall structure can therefore be designed by understanding the problem as well as possible, from the most abstract concepts to the most specific steps. In science, the problem could be a data science workflow involving importing, tidying, summarizing, modelling, tabulating, and visualizing. Apart from initial exploratory looks, a plan for the whole analysis is crucial and should guide the programming.

So, the researcher first needs to understand the structure of data analysis, visualization, and so on, more generally before writing any code. Good research software is based on good concepts too, like 'tidy data' or 'grammar of graphics' or the framework in marginaleffects.

Second, it's often the case that someone has already solved some of your problems sufficiently well and shared their code somewhere. A good stack of programming languages, domain-specific sublanguages, frameworks, libraries, and similar projects helps structure your code. So, any training in research software engineering would (obviously) include learning a good programming language and some of its current best libraries, including the conventions of that ecosystem. Ideally you'd be comfortable reading some source code so that you could learn straight from the masters how they structure things.

That said, you still have a lot of room to think about how to structure your own code. So how and where to start?

In science, functional programming patterns should be a great place to start:
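As a minimal sketch of what this could look like in Python: each step of the workflow is a small pure function that takes data in and returns new data, and the analysis is the composition of those steps. The function names and the toy records are hypothetical and only serve as an illustration.

```python
from functools import reduce
from statistics import mean

# Hypothetical toy records; a real analysis would import these from a file.
RAW = [
    {"group": "a", "value": "1.5"},
    {"group": "a", "value": "2.5"},
    {"group": "b", "value": None},  # missing measurement
    {"group": "b", "value": "4.0"},
]


def tidy(rows):
    """Drop rows with missing values and convert strings to floats."""
    return [
        {"group": r["group"], "value": float(r["value"])}
        for r in rows
        if r["value"] is not None
    ]


def summarize(rows):
    """Return the mean value per group as a new dict."""
    groups = sorted({r["group"] for r in rows})
    return {g: mean(r["value"] for r in rows if r["group"] == g) for g in groups}


def tabulate(summary):
    """Render the summary as tab-separated text lines."""
    return [f"{group}\t{value:.2f}" for group, value in summary.items()]


def pipeline(*steps):
    """Compose the steps left to right into a single function."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


analyse = pipeline(tidy, summarize, tabulate)
report = analyse(RAW)  # RAW itself is left unchanged
```

Because no step modifies its input, each one can be run, inspected, and tested on its own.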

However, object-oriented programming is very common (especially in Python), and you could also consider whether your problem can be naturally represented as separate objects.
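For comparison, here is a small sketch of the object-oriented view, assuming a made-up domain where a study collects measurements. The class names `Measurement` and `Study` are hypothetical; the point is only that data and the operations on it live together in one object.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass(frozen=True)
class Measurement:
    """One observed value for one subject (immutable)."""
    subject_id: str
    value: float


@dataclass
class Study:
    """A named collection of measurements with its own summary method."""
    name: str
    measurements: list = field(default_factory=list)

    def add(self, measurement):
        """Record one more measurement in this study."""
        self.measurements.append(measurement)

    def mean_value(self):
        """Return the mean of all recorded values."""
        return mean(m.value for m in self.measurements)


study = Study("pilot")
study.add(Measurement("s1", 1.0))
study.add(Measurement("s2", 3.0))
pilot_mean = study.mean_value()
```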

  2. Version control

Software engineers use version control systems that allow multiple people to work on complicated projects in parallel branches at the same time while leaving an exact trace of history. Simple usage of these systems is enough for any research project and should probably be included in any curriculum for researchers.

Version control comes with one major restriction: files like Microsoft Word documents cannot be easily tracked. The project should instead consist of plain text. But this is actually a good thing, as plain text is universal and easy to use without any special software, and it can be rendered into good-looking documents when needed (see for example pandoc). This way the focus also falls on the content and on how the results are generated. Styles and such can be decided later, as the last step.

Another, less obvious benefit is that using version control (like Git) nudges you to do some project management and open collaborative science (on GitHub) that you might otherwise tend to skip.

Here's a very quick summary of how (the infamously confusing) Git works:
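To make the core ideas concrete, here is a toy model in Python. It is not how Git is implemented and it does not call Git; the class `ToyRepo` and its methods are made up for illustration. It only imitates three ideas Git is built on: a snapshot is identified by a hash of its content, each commit points to its parent, and a branch is just a movable name for a commit.

```python
import hashlib
import json


class ToyRepo:
    """A toy, in-memory imitation of a commit graph (hypothetical API)."""

    def __init__(self):
        self.commits = {}               # commit id -> commit record
        self.branches = {"main": None}  # branch name -> commit id
        self.head = "main"              # name of the checked-out branch

    def commit(self, files, message):
        """Store a snapshot of `files` whose parent is the current branch tip."""
        record = {
            "files": dict(files),
            "message": message,
            "parent": self.branches[self.head],
        }
        # The id is a hash of the content, so identical records get identical ids.
        payload = json.dumps(record, sort_keys=True).encode()
        commit_id = hashlib.sha1(payload).hexdigest()
        self.commits[commit_id] = record
        self.branches[self.head] = commit_id  # move the branch forward
        return commit_id

    def branch(self, name):
        """Create a new branch pointing at the current commit."""
        self.branches[name] = self.branches[self.head]

    def checkout(self, name):
        """Switch HEAD to another existing branch."""
        if name not in self.branches:
            raise KeyError(f"no such branch: {name}")
        self.head = name

    def log(self):
        """Return commit messages from the current tip back to the first commit."""
        messages, commit_id = [], self.branches[self.head]
        while commit_id is not None:
            messages.append(self.commits[commit_id]["message"])
            commit_id = self.commits[commit_id]["parent"]
        return messages


repo = ToyRepo()
first = repo.commit({"analysis.py": "print('v1')"}, "Add analysis script")
repo.branch("experiment")
repo.checkout("experiment")
repo.commit({"analysis.py": "print('v2')"}, "Try a new model")
experiment_log = repo.log()  # two commits on the experiment branch
repo.checkout("main")
main_log = repo.log()        # main still only has the first commit
```

The methods loosely correspond to the real commands `git commit`, `git branch`, `git checkout`, and `git log`, although real Git also tracks a staging area, directory trees, and remotes, which this sketch leaves out.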

An important method in software engineering is CI/CD, which refers to continuous integration, delivery, and deployment. This means that small changes are added to the common codebase very frequently so that problems are caught quickly and everyone shares a common view of the codebase. A small committed and pushed change can be set up to trigger automated tests, and if the tests pass, the update can go all the way to the production software that is being served to customers. This creates a quick loop for improving the product step by step.

  3. Testing

In science, correctness is really everything. You can't trust the results if you can't trust the code all the way. The only way to gain confidence that software works as expected is by testing it – both manually and automatically. A curriculum in research software engineering should therefore include routine use of simple testing methods.

The basic form of automated testing is called unit testing, where individual functions are run on some test inputs and the outputs are compared with what is expected. Different programming languages usually have their own conventions and tools for unit testing (especially for package developers). Often tests are not included in the code itself (testing while running); instead, they are run separately when major changes are made. For most research code, though, testing while running works well.
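A minimal sketch of a unit test in Python might look like this. The function `standardize` is a hypothetical example, and the `test_` prefix follows the naming convention that the pytest tool uses to discover tests; the same functions can also simply be called by hand.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    center = mean(values)
    spread = pstdev(values)
    return [(v - center) / spread for v in values]


def test_standardize_centers_and_scales():
    # Known input, known properties of the expected output.
    result = standardize([2.0, 4.0, 6.0])
    assert abs(mean(result)) < 1e-9
    assert abs(pstdev(result) - 1.0) < 1e-9


def test_standardize_keeps_order():
    # Rescaling must not change which value is smallest or largest.
    result = standardize([1.0, 3.0, 2.0])
    assert result[0] < result[2] < result[1]
```

Each test checks one expectation, so a failure points directly at what broke.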

A simple method is just to add assertions into the code while programming. This is especially needed when the programming language lacks a proper type system, when specific types are not used, or when types cannot express some complicated condition. For example, the inputs could be checked at the start of a function so that invalid inputs don't trigger far stranger errors later on. Similarly, if you know what the intermediate or final results should look like, they can be checked before moving on. Assertions or expectations within the code also make the code more readable: they clearly show what the code is supposed to be doing, and they clearly show that the code wouldn't even run if these conditions weren't met.
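As a sketch of this idea in Python, the hypothetical function `log_odds` below checks its inputs before doing any work and checks its result before returning it.

```python
import math


def log_odds(probabilities):
    """Convert probabilities to log-odds, checking inputs and outputs."""
    # Check inputs up front so that bad data fails here with a clear message,
    # not later with a division by zero or a math domain error.
    assert len(probabilities) > 0, "expected at least one probability"
    assert all(0.0 < p < 1.0 for p in probabilities), (
        "probabilities must be strictly between 0 and 1"
    )

    result = [math.log(p / (1.0 - p)) for p in probabilities]

    # Check what we know about the result before handing it on.
    assert len(result) == len(probabilities)
    assert all(math.isfinite(x) for x in result)
    return result


values = log_odds([0.25, 0.5, 0.75])
```

Note that Python skips `assert` statements when run with the `-O` flag, so for checks that must never be disabled, raising an explicit exception such as `ValueError` is the safer choice.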

CC BY-SA 4.0 Eero Teppo. Last modified: March 23, 2025. Website built with Franklin.jl and the Julia programming language.