Dockerizing your R analysis

Phil Chapman

2019/02/02

Introduction

This post is the first in a series of blog posts about getting the things that we do with R out into the world beyond the normal R ecosystem. I’ll be talking about exploratory analyses, shiny apps, models, serverless functions, R packages, and Docker. A lot about Docker.

Why should you care?

Why should you care about this? Well, if, as an R user, you can only really collaborate with other R users, you're limiting yourself as a data scientist. If you send someone in IT an R script, what do you expect them to be able to do with it? If you develop a model in R, how will you use that in production when your DevOps colleagues have never used R? What happens if everyone else in your team uses Python?

What solutions already exist?

There are already some very good solutions to this problem: basically the software that RStudio provides. RStudio Server Pro is a terrific collaboration tool for R users that gets an organisation away from the horrors of having R installed on every desktop, whilst RStudio Connect takes R users further by letting them publish their R Markdown files, shiny apps, Plumber APIs and so on.

So what’s the problem?

The problem is that RStudio Connect and Server Pro are commercial, enterprise-grade tools. This means that within your organisation you will need to find a) someone to set them up and maintain them and (most likely) b) someone to agree to pay for them. This is hard in small organisations, non-profit organisations, or even commercial organisations where you might be the only R user. In addition, non-R users just won't be familiar with these tools; they don't translate into the world of a DevOps engineer.

Where Docker comes in

Docker is a containerization technology that I’ve blogged about before, written a tutorial on, and recorded a YouTube screencast about. There are also some other excellent blog posts and tutorials in the ‘other resources’ section at the end of this post. Docker basically allows you to wrap up your R product in a self-contained mini computer that can then be easily shared and run in a variety of different environments.

Docker has fast become the lingua franca of software development and DevOps, so if you can translate your R data science product into a Docker data science product, it makes collaborating easier.

An example

In these blog posts, I am going to use the famous gapminder dataset to carry out a very similar analysis to that used as an example in Hadley Wickham’s R for Data Science book: Chapter 25 Many Models. The scenario is that as an R user we have written an analysis of this dataset, and now want to share that analysis with someone else who isn’t familiar with R but is technical enough to know Docker and Git. So we’re not sharing the analysis output with a non-technical colleague (the html or pdf knitted report would be fine for that), we want to share the analysis with a technical collaborator.
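
The analysis itself lives in the repo, but to give a flavour of what is being Dockerized, here is a minimal sketch of the kind of R4DS-style many-models workflow the post refers to (the actual script in the repo may well differ):

library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# fit a separate linear model of life expectancy against year for each country
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(lifeExp ~ year, data = .x)),
         r_sq  = map_dbl(model, ~ glance(.x)$r.squared))

# countries where a simple linear trend fits worst
by_country %>%
  ungroup() %>%
  arrange(r_sq) %>%
  select(country, continent, r_sq) %>%
  head()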

The git repo

The code for these blog posts is on GitHub at https://github.com/chapmandu2/gapminder-pipeline/. In this post we are considering the code in the directory 01-exploratory-analysis. Within this directory there are three files: the Dockerfile, the Makefile, and the analysis itself. These three files contain all of the information needed to run the analysis.

Running instructions

To run the analysis:
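
Assuming Docker, make and git are available on the host machine, the steps come down to something like the following, run from a terminal (the Makefile targets used here are described below):

# clone the repo and move into the directory for this post
git clone https://github.com/chapmandu2/gapminder-pipeline.git
cd gapminder-pipeline/01-exploratory-analysis

# build the image, start the container and open RStudio Server at 127.0.0.1:8787
make run

# stop and remove the container when finished
make remove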

The Dockerfile

The Dockerfile is very simple:

FROM rocker/verse:3.5.2

################
#install linux deps
################

RUN apt-get update -y && \
	apt-get install -y \
		curl

################
#install R packages
################

RUN R -e "install.packages(c('gapminder'))"

This specifies that the base image is the rocker/verse image from the Rocker project. This amazing project develops and maintains Docker images that contain different versions of R and different sets of standard packages, as well as the open source versions of RStudio Server and Shiny Server. The rocker/verse image contains the tidyverse packages as well as RStudio Server.

We then run two commands: one to install any required Linux utilities and another to install the gapminder package. This gives us a Docker image on which we can run our analysis.

The Makefile

Makefiles are really useful for providing the user with the (version-controlled) Docker commands required to build the Docker image and run the Docker containers.

build:
	docker build --file=./Dockerfile --tag=gapminder-01 .

run: build
	docker run -d -p 8787:8787 \
		-e DISABLE_AUTH=true \
		--name='gapminder-01-ct' \
		-v ${HOME}:/home/rstudio/hostdata \
		gapminder-01;

	sleep 3;
	firefox 127.0.0.1:8787;

stop:
	docker stop gapminder-01-ct

start:
	docker start gapminder-01-ct

remove: stop
	docker rm gapminder-01-ct

The make build command is fairly straightforward, but the make run command is a little more complex:

- -d runs the container in detached mode so that it keeps running in the background
- -p 8787:8787 maps port 8787 in the container, where RStudio Server is listening, to port 8787 on the host
- -e DISABLE_AUTH=true sets an environment variable that switches off the RStudio Server login
- --name gives the container a predictable name that the other make targets can refer to
- -v ${HOME}:/home/rstudio/hostdata mounts the host's home directory into the container at /home/rstudio/hostdata so that files can be shared between the two
- finally, after a short pause to let the container start, a browser is pointed at 127.0.0.1:8787 where RStudio Server is running

The other targets are convenience commands to start, stop and delete the container. In addition, the run target re-builds the image before running the container, and the remove target stops the container before removing it.

Portability vs reproducibility

The three files in the 01-exploratory-analysis directory provide everything required to run the analysis. This makes the analysis easily portable to other computers, virtual machines, cloud environments and so on. It is somewhat reproducible but not entirely so. The Rocker project provides versioned images, which helps, but there is no guarantee that if you build the image in January the same versions of system dependencies and other R packages will be installed when you rebuild it in April.

There are ways that you can specify versions, for example using the remotes package:

RUN R -e "install.packages('remotes'); \
  remotes::install_version('gapminder', '0.3.0')"

However, it is more difficult to do this for system libraries. For example, I had a situation where a pipeline broke on importing an Excel file, not because the version of the readxl R package was different, but because the version of the libxls system library had changed!

The best way to absolutely guarantee reproducibility is to save the Docker image in a container registry such as Docker Hub so that you can re-use it. This is beyond the scope of this blog post but you can read more about it in a previous post.
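
As a rough sketch, and assuming a Docker Hub account (yourname below is a placeholder), pushing and later re-using the image might look like this:

# after docker login, tag the locally built image against the account and push it
docker tag gapminder-01 yourname/gapminder-01:2019-02
docker push yourname/gapminder-01:2019-02

# the exact same image can then be pulled and run anywhere
docker pull yourname/gapminder-01:2019-02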

Tools such as packrat permit project-specific management of package dependencies, although I’m not sure how well they deal with system dependency issues!
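
For reference, the basic packrat workflow, run from within the project directory, is roughly as follows:

# initialise a project-specific package library for this project
install.packages("packrat")
packrat::init()

# record the exact package versions in use in packrat/packrat.lock
packrat::snapshot()

# packrat::restore() rebuilds that library from the lock file on another machine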

Conclusions

This blog post has provided a basic template for Dockerizing an exploratory analysis in R so that it is easily portable and relatively reproducible. This approach allows analyses to be easily shared between colleagues, even if a colleague isn’t familiar with R, and also allows different projects to maintain their own project-specific environments. The overhead is not large in terms of understanding: a simple Dockerfile and a Makefile to script the Docker commands are all that is required.

Other resources