Research Software Engineering: 3 Things Any Researcher Should Know
I don’t want to cover general problem-solving principles in this post. These are the likes of “keep it simple”, “plan well”, “don’t do unnecessary things”, and “don’t repeat yourself”. General problem-solving and efficiency-increasing methods are crucial in any context.
Instead, I’m here to list three practical topics from software engineering that can improve the quality of science.
1. Software design
Big ball of mud. Spaghetti code. These are some words used for code that is awfully unstructured and a nightmare to read and understand. To avoid some nasty consequences, researchers should know a bit about different ways to structure code.
First, the structure of the code should mirror the structure of the problem and its solution (or model the domain, so to speak). The overall structure can therefore be designed by understanding the problem as well as possible, from the most abstract concepts to the most specific steps. In science, the problem could be a data science workflow involving importing, tidying, summarizing, modelling, tabulating, and visualizing. Apart from initial exploratory looks, a plan for the whole analysis is crucial and should guide the programming.
So, the researcher first needs to understand the structure of data analysis, visualization, and so on, more generally before writing any code. Good research software is based on good concepts too, like ‘tidy data’ or ‘grammar of graphics’ or the framework in marginaleffects.
Second, it’s often the case that someone has already solved some of your problems sufficiently well and shared their code somewhere. A good stack of programming languages, domain-specific sublanguages, frameworks, libraries, and similar projects helps structure your code. So, any training in research software engineering would (obviously) include learning a good programming language and some of its current best libraries, including conventions in that ecosystem. Ideally you’d be comfortable reading some source code so that you could learn straight from the masters how they structure things.
That said, you still have a lot of room to think about how to structure your own code. So how and where to start?
In science, functional programming patterns should be a great place to start:
- Structure your code into composable functions (verbs). Think of a fractal of functions as you move from your abstract concepts to the lower level operations. Think of chains of functions within each function. You can also input and output functions, and you can replace (repeating) control flow statements with functions.
- Use types to define valid inputs and outputs of functions. The function must be able to handle all inputs of that type but no more. The function must return only outputs of that type and no more. Note that good types are automatically documentation for the functions.
- Embracing functions, types, and composition leads you to more complex wrapped data types, such as a `Result` type that contains either a value wrapped in `Success` or some error values wrapped in `Failure` (the related `Option` type wraps either a value or nothing). To compose different functions together in error-prone contexts, you’ll then need adapter functions like `bind`, `map`, and `tee`. General functional programming tools such as these are available in most programming languages or their libraries, so this is not a big problem.
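The three points above can be sketched in a few lines of Python. This is only a minimal illustration, not a library API: the names `Result`, `Step`, `map_`, `parse_number`, `require_positive`, `pipeline`, and `seen` are assumptions made up for this example, and the wrapped type is called `Result` here because it holds either a `Success` or a `Failure`.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass(frozen=True)
class Success:
    value: Any


@dataclass(frozen=True)
class Failure:
    error: str


# Hypothetical names for this sketch: a wrapped result and a pipeline step.
Result = Union[Success, Failure]
Step = Callable[[Result], Result]


def bind(f: Callable[[Any], Result]) -> Step:
    """Adapt a function that may fail so that it accepts a wrapped input."""
    def step(result: Result) -> Result:
        return f(result.value) if isinstance(result, Success) else result
    return step


def map_(f: Callable[[Any], Any]) -> Step:
    """Adapt a function that cannot fail so that it works on wrapped values."""
    def step(result: Result) -> Result:
        return Success(f(result.value)) if isinstance(result, Success) else result
    return step


def tee(f: Callable[[Any], Any]) -> Step:
    """Run a side effect on the value and pass the result through unchanged."""
    def step(result: Result) -> Result:
        if isinstance(result, Success):
            f(result.value)
        return result
    return step


def parse_number(text: str) -> Result:
    try:
        return Success(float(text))
    except ValueError:
        return Failure(f"not a number: {text!r}")


def require_positive(x: float) -> Result:
    return Success(x) if x > 0 else Failure(f"not positive: {x}")


seen = []  # filled by the tee step, standing in for logging


def pipeline(text: str) -> Result:
    """Chain the steps; the first Failure skips everything after it."""
    result = parse_number(text)
    for step in (bind(require_positive), map_(math.sqrt), tee(seen.append)):
        result = step(result)
    return result
```

With these definitions, `pipeline("16")` returns `Success(4.0)`, while `pipeline("abc")` returns a `Failure` from the first step and none of the later steps run.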
However, object-oriented programming is very common (especially in Python), and you could also consider whether your problem could be naturally represented as separate objects.
- Objects package data and functions together (nouns). Often you would define a class of objects and use it to create new instances of objects as needed. Think of a `Person` which has `age` and can `run()`.
- Classes of objects may form hierarchies where the more specific class inherits contents from the more abstract class and then extends (or modifies) it somehow. Think of `Employee extends Person` who also has `salary` and can `work()`. (If something can be done to many different classes of objects, your function can accept any class that implements an interface, which is just a set of methods that the class has to include. Think of an interface `Countable` with methods `length` and `add`.)
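The same examples could look roughly like this in Python. This is a sketch under a few assumptions: the `name` attribute, the `Team` class, and the `add_all` function are invented for illustration, and the interface is expressed with `typing.Protocol`, which is one of several ways to describe an interface in Python.

```python
from typing import Iterable, Protocol


class Person:
    """A class packages data (name, age) with behaviour (run)."""

    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age

    def run(self) -> str:
        return f"{self.name} is running"


class Employee(Person):
    """A more specific class that inherits from Person and extends it."""

    def __init__(self, name: str, age: int, salary: float) -> None:
        super().__init__(name, age)
        self.salary = salary

    def work(self) -> str:
        return f"{self.name} is working"


class Countable(Protocol):
    """An interface: any class with these two methods qualifies."""

    def length(self) -> int: ...

    def add(self, item: object) -> None: ...


class Team:
    """Hypothetical class that satisfies Countable without inheriting from it."""

    def __init__(self) -> None:
        self.members = []

    def length(self) -> int:
        return len(self.members)

    def add(self, item: object) -> None:
        self.members.append(item)


def add_all(target: Countable, items: Iterable[object]) -> int:
    """Works with any object that implements the Countable methods."""
    for item in items:
        target.add(item)
    return target.length()
```

An `Employee` can do everything a `Person` can, so `Employee("Grace", 45, 5000.0).run()` works without `Employee` defining `run` itself.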
2. Version control
Software engineers use version control systems that allow multiple people to work on complicated projects in parallel branches at the same time while leaving an exact trace of history. Simple usage of these systems is enough for any research project and should probably be included in any curriculum for researchers.
Version control comes with one major restriction: files like Microsoft Word documents cannot be easily tracked. The project should instead consist of plain text. But this is actually a good thing as plain text is universal and easy to use without any special software, and plain text can be turned into good-looking documents by rendering when needed (see for example pandoc). This way focus also falls on the content and how the results are generated. Styles and such can be decided later, as the last step.
Another less obvious benefit is that using version control (like git) nudges you to do some project management and open collaborative science (on GitHub) that you might otherwise tend to skip.
Here’s a very quick summary of how (the infamously confusing) Git works:
- A Git project (`git init`) exists in four major areas. Locally you have a working directory, a staging area, and the local repository, and remotely you have a remote repository (`git remote`, usually just one). (A stash area also exists but we ignore it here.)
- The repositories include directed acyclic graphs of changes (commit trees). Every committed change defines a new version of the project. Each commit belongs to some `git branch`. You can `git merge` a branch into some other branch if they don’t have conflicting changes (conflicts have to be resolved by hand first). (Moving between branches is done with `git checkout`.)
- After you make changes in your local working directory, you `git add` them to the staging area, and then finally `git commit` all staged changes to the local repository. This commit reflects a new local version of your project. Then you `git push` the committed changes to the remote repository to keep them in sync (if the remote is controlled by someone else, you typically open a pull request on a platform like GitHub instead, and they decide if they want to merge your changes or not). To move in the other direction, you usually `git pull` changes from the remote to the local repository and working directory all in one step. You can also `git clone` the remote to get a fresh local copy, or you can separately `git fetch` remote changes and then `git merge` them into your local state.
- One of the best parts of exact version control is that you can move back and forth in history. You can `git checkout` old versions or other branches, you can `git revert` old committed changes, you can `git clean` untracked files from the working directory, and you can `git reset` your state to start fresh from a point where everything still worked fine.
- So it is not surprising that Git can get complicated. You need to navigate graphs and branches, multiple areas, and local and remote. You need to work with others (and with yourself) so that things stay in sync without hellish conflicts. But it is a time machine after all.
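To make the local part of this cycle concrete, here is a small Python sketch that drives the real `git` command line in a throwaway directory. It assumes Git is installed and returns `None` otherwise. The file name `analysis.txt`, the commit messages, the identity settings, and the helper names `git` and `demo_local_workflow` are all made up for the example. It leaves out remotes and branches entirely.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, cwd: Path) -> str:
    """Run one git command in the given directory and return its output."""
    done = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return done.stdout.strip()


def demo_local_workflow():
    """Edit, add, commit twice, then revert the second commit."""
    if shutil.which("git") is None:
        return None  # Git is not installed, nothing to demonstrate

    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        draft = repo / "analysis.txt"  # hypothetical project file

        git("init", cwd=repo)
        # Settings for this throwaway repository only (placeholder identity).
        git("config", "user.name", "Example Researcher", cwd=repo)
        git("config", "user.email", "researcher@example.org", cwd=repo)
        git("config", "commit.gpgsign", "false", cwd=repo)

        # Working directory -> staging area -> local repository.
        draft.write_text("first draft\n")
        git("add", "analysis.txt", cwd=repo)
        git("commit", "-m", "Add first draft", cwd=repo)

        draft.write_text("second draft\n")
        git("add", "analysis.txt", cwd=repo)
        git("commit", "-m", "Revise draft", cwd=repo)

        # The time machine: undo the latest commit with a new commit.
        git("revert", "--no-edit", "HEAD", cwd=repo)

        subjects = git("log", "--format=%s", cwd=repo).splitlines()
        return subjects, draft.read_text()
```

After the revert, the file contains the first draft again, and the log still lists both earlier commits plus the revert commit, newest first. Nothing is erased from history.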
An important method in software engineering is CI/CD, which refers to continuous integration and continuous delivery or deployment. This means that small changes are added to the common codebase very frequently so that problems are caught quickly and everyone has a common view of the codebase. A small committed and pushed change can be set up to trigger automated tests, and if the tests pass, further updates can be made even in the production software that is being served to the customers. This creates a quick loop for improving the product step by step.
3. Testing
In science, correctness is really everything. You can’t trust the results if you can’t trust the code all the way. The only way to gain confidence that software works as expected is by testing it — both manually and automatically. A curriculum in research software engineering should therefore include routine use of simple testing methods.
The basic form of automated testing is called unit testing where individual functions are run on some test inputs and the outputs are compared with what is expected. Different programming languages usually have their own conventions and tools for unit testing (especially for package developers). Often tests are not included in the code itself (testing while running) but rather the tests are run separately when major changes are made. But for most research, testing while running is good.
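As a minimal sketch of a unit test, the example below uses Python’s built-in `unittest` module. The `mean` function and the test names are invented for illustration, and other languages and ecosystems have their own equivalents.

```python
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_known_values(self):
        # Compare the output with a result worked out by hand.
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input_is_rejected(self):
        # Invalid input should fail loudly rather than return nonsense.
        with self.assertRaises(ValueError):
            mean([])


# Run the tests separately from the analysis itself.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would usually live in their own file and be run from the command line or by a CI service, rather than at the bottom of the same script.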
A simple method is just to add assertions into the code while programming. This is needed especially when the programming language doesn’t have proper types or specific typing is not used or cannot be used to cover some complicated condition. For example, the inputs could be checked at the start of a function so that invalid inputs don’t trigger far stranger errors later on. Similarly, if you know what the intermediate or final results should look like, they can be checked before moving on. Assertions or expectations within the code also make the code more readable as they clearly show what the code is supposed to be doing — and they clearly show that the code wouldn’t even run if these conditions weren’t met.
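Here is a small sketch of what such assertions might look like in Python. The `standardize` function and its specific checks are assumptions chosen for illustration. Note that Python skips `assert` statements when run with the `-O` flag, so they suit research scripts better than production input validation.

```python
import math
import statistics


def standardize(values):
    """Convert values to z-scores, checking assumptions along the way."""
    # Check the inputs first so bad data fails here, not somewhere stranger.
    assert len(values) >= 2, "need at least two observations"
    assert all(math.isfinite(v) for v in values), "values must be finite numbers"

    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    assert sd > 0, "values must not all be identical"

    z_scores = [(v - mean) / sd for v in values]

    # Check the result: standardized values should average to (about) zero.
    assert math.isclose(statistics.fmean(z_scores), 0.0, abs_tol=1e-9)
    return z_scores
```

Calling `standardize([1.0, 1.0])` stops immediately with the message "values must not all be identical" instead of failing later with a division by zero.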