Source: https://blog.f1000.com/2014/04/04/reproducibility-tweetchat-recap/

Ever since we started working on data science projects at Wellcome Data Labs we have been thinking a lot about reproducibility. As a team of data scientists, we wanted to ensure that the results of our work can be recreated by any member of the team. Among other things, this allows us to collaborate on projects easily and offer support to each other. It also reduces the time it takes us to transition from an experimental idea to production.

How do we do that?

Our process has evolved over the years but the core ideas are heavily inspired by the Cookiecutter Data Science template. Even though we do not initialise our projects with the template, we re-create the structure as needed in each project. The structure we follow is:

PROJECT_NAME
|-- data/
|   |-- raw/
|   |-- processed/
|-- models/
|-- [notebooks/]       # Jupyter notebooks
|-- [docs/]            # documentation can go here, including results
|-- [configs/]
|-- PROJECT_NAME/      # source code goes here, e.g. train.py
|-- requirements.txt
|-- Makefile
|-- README.md
|-- [setup.py]

The files and folders in square brackets are ones most projects have but that we do not enforce.

Data

We store our raw data in a dedicated folder and keep it unchanged, so that we never invalidate any processing that depends on it. Whenever we apply any processing to a raw or processed file, the result goes to either the processed folder (never raw) or the models folder, depending on the output. Even though we do not apply any versioning to the data or models, something I will touch on again later in the post, this convention allows us to recreate processed files and models by running our code on the raw data.

Our data and models are synced to AWS S3 (note that S3 provides some form of versioning). Our code is versioned through git and shared through GitHub. This setup means that any member of the team can switch to and work on any project easily by cloning a repo, or by navigating to the project folder and pulling the data and models from S3.

Makefile

We make the process of syncing data and models even easier through the use of a Makefile.

The Makefile allows us to define shorthand commands for commonly performed actions, and ensures that there is a consistent way to run these commands, which means they can be reproduced with the same result.
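As an illustration, a pair of sync targets might look like the sketch below (the bucket name is a placeholder, and we assume the AWS CLI is installed and configured):

# Hypothetical S3 bucket; replace with the project's own
BUCKET = s3://my-project-bucket

sync_data_to_s3:   ## upload local data and models to S3
	aws s3 sync data/ $(BUCKET)/data/
	aws s3 sync models/ $(BUCKET)/models/

sync_data_from_s3: ## download data and models from S3
	aws s3 sync $(BUCKET)/data/ data/
	aws s3 sync $(BUCKET)/models/ models/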

Virtualenv

One of the most important components of a data science project which we want to be able to recreate is the virtualenv that contains all the dependencies of the project (for example, the Python library pandas). The dependencies are typically defined in a requirements.txt. To avoid different ways of creating the virtualenv (e.g. python -m venv venv, virtualenv -p python3 venv) and different Python and pip versions, we use another make command to standardise the creation of the virtualenv.
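A minimal sketch of such a make target (the interpreter and virtualenv name are assumptions) could be:

virtualenv: ## create a virtualenv with the pinned dependencies
	python3 -m venv venv
	venv/bin/pip install --upgrade pip
	venv/bin/pip install -r requirements.txt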

Getting a reproducible virtualenv still requires that the dependencies you define are pinned to exact versions. Otherwise, every time someone creates a virtualenv with the above command they may get a different set of dependencies. A common way to ensure all required dependencies are pinned is through the use of an unpinned_requirements.txt and pip freeze.

The unpinned requirements file consists of the names of unpinned libraries (e.g. pandas), for which you don't have a preference over which version is installed, and pinned libraries, for which it is important that a specific version or version range is installed (e.g. spacy>2 or spacy==3.0). You can install the unpinned requirements into your virtualenv and then create the requirements.txt with pip freeze, which outputs all dependencies pinned to the versions installed. You can automate updating the requirements.txt file with yet another make command.
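One possible sketch, reusing the virtualenv target above:

update_requirements: virtualenv ## re-pin requirements.txt from the unpinned file
	venv/bin/pip install -r unpinned_requirements.txt
	venv/bin/pip freeze > requirements.txt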

Configs

Having uncorrupted raw data and a reproducible virtualenv is only the minimum requirement for reproducing a data science project. We also need a way to replicate the actual analysis or modelling. Fortunately, data science projects often follow a standard flow of steps in which data is preprocessed, features are extracted, and a model is trained and evaluated. Each of these steps can be defined in a separate file, for example preprocess.py or train.py, and receive command line arguments for the various parameters needed to reproduce its outputs. It is important to avoid hardcoding any parameters, even those that seem unlikely to change, because when they do change your previous results will no longer be reproducible. An example train.py (a minimal sketch; the parameter names and defaults are illustrative) would look like:
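import argparse

def train(data_path, model_path, learning_rate, epochs):
    # Load processed data from data_path, fit a model with the given
    # hyperparameters, and save it to model_path. The actual training
    # logic depends on the project.
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a model")
    parser.add_argument("--data_path", default="data/processed/train.csv")
    parser.add_argument("--model_path", default="models/model.pkl")
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args()
    train(args.data_path, args.model_path, args.learning_rate, args.epochs)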