R, the free and open source program for statistical computing, poses a substantial threat to the reproducibility of published research. This post explains the problem and introduces a solution.
The Problem: Packages
R itself has some reproducibility problems (see the example in the footnote), but the big problem is its packages: the add-on scripts that users install to enable R to do things like run meta-analyses, scrape the web, cluster standard errors, format numbers, etc. The problem is that packages are constantly being updated, and sometimes those updates are not backwards compatible. This means that the R code you write and run today may no longer work in the (near or far) future because one of the packages your code relies on has been updated. Worse, R packages depend on other packages. Your code could break after a package you don't know you are using updates a function you have never even used.
The very popular dplyr package is part of the tidyverse empire. It is used by hundreds of other packages, and by perhaps millions of scripts. The dplyr package is used to manipulate data. For example, its distinct() function eliminates rows with duplicate values in a dataset (e.g., with the same ID).
On June 24th, 2016, the creators of dplyr changed what this distinct() function does by default (I owe this example to Mark Brandt. Thanks Mark!). Before that date, distinct() would keep all variables in the dataset, but after that date it would only keep the variables you are checking to look for duplicates. The figure below illustrates:
The change probably broke every R script written before June 24, 2016 that relied on the function distinct(). And even scripts that do not use the distinct() function may stop working if they rely on a function from another package that uses distinct().
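To make the change concrete, here is a minimal sketch in base R that mimics the two behaviors on a made-up dataset (the data frame and its column names are invented for illustration). In dplyr itself, the old default corresponds to distinct(df, id, .keep_all = TRUE) and the new default to distinct(df, id).

```r
# Toy data, invented for illustration: participant 1 appears twice
df <- data.frame(id = c(1, 1, 2), score = c(10, 11, 20))

# Old default (before June 24, 2016): drop duplicate ids, keep all variables.
# In current dplyr this is distinct(df, id, .keep_all = TRUE)
old_default <- df[!duplicated(df$id), ]

# New default: keep only the variable(s) checked for duplicates.
# In current dplyr this is distinct(df, id)
new_default <- unique(df["id"])

names(old_default)  # "id" "score"
names(new_default)  # "id"
```

Any later line that uses the score column works on the first result but not on the second, which is how an unchanged script can stop working after an update.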
Packages get updated a lot
For illustration, consider the PNAS paper discussed in Colada. The authors posted their R Code, allowing others to reproduce their results. At least that's the goal. The code relies on 8 packages. The figure below shows how many times those 8 packages, and their 'dependencies,' the packages those packages depend on, have been updated since January 1st 2019. As of Nov 17th 2020: 123 times.
We don't know the probability that a package update breaks a script; but, as a quick calibration, say it is 1% and that updates break scripts independently of one another. The probability that the code for the PNAS paper breaks after 123 changes would then be 71%. If the base rate were just 1/1000 per change, then after 123 changes there is a ~12% chance of failure. How confident are you now that this code will run in 5 years?
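The arithmetic behind those two numbers can be checked with a short sketch. It assumes each update breaks the script independently with the same probability; the function name p_break is made up for this example.

```r
# Probability that at least one of n independent updates breaks the script,
# when each update breaks it with probability p
p_break <- function(p, n) 1 - (1 - p)^n

n_updates <- 123  # updates counted in the post as of Nov 17th, 2020

round(p_break(0.01, n_updates), 2)   # 0.71
round(p_break(0.001, n_updates), 2)  # 0.12
```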
The solution to R's package problem is a new R package: groundhog.
If the package name seems strange, read the footnote.
All you need to do to fix this huge reproducibility problem is, instead of loading packages with the built-in library() command, load them with the groundhog.library() command. That is it.
The groundhog.library() command takes two values. Like library(), you indicate which package you want to load. In addition, you enter a date. Any date. Groundhog will load the most recent version of that package available on CRAN on that date. It will also load all dependencies of that package as current on that date.
So instead of loading dplyr with library(), you load it with groundhog.library() and a date.
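A minimal sketch of the swap, assuming groundhog is already installed from CRAN. The date is an arbitrary choice for illustration, picked to fall just before the distinct() change described earlier.

```r
# Before: loads whichever version of dplyr happens to be installed
# library("dplyr")

# After: loads the version of dplyr that was current on CRAN on the given date
library("groundhog")
groundhog.library("dplyr", "2016-06-20")
```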
Loaded that way, the distinct() function will do the exact same thing on every computer, and 'forever'.
Update January 6th, 2021
A reader alerted me to a bug in the current groundhog (version 1.1.0): you cannot set the groundhog library to be a folder with spaces in its name, e.g., "c:/dropbox/groundhog library". This will be fixed in a few days; in the meantime, you can do this:
When starting a new script, choose a recent date (say, the first of the current or previous month), and use it to load all packages. You can assign the date to a variable to make it easy to update. A sensible name is 'groundhog.day', but it can be anything you like.
So, my new scripts now start with something like this:
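As a rough sketch of such a header (the date and the two package names below are placeholders chosen for illustration, and groundhog is assumed to be installed):

```r
library("groundhog")

# One date for the whole script; change it here to update every package at once
groundhog.day <- "2020-11-01"

groundhog.library("dplyr", groundhog.day)
groundhog.library("ggplot2", groundhog.day)
```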
A nice feature of groundhog is that it makes 'retrofitting' existing code quite easy. If you come across a script that no longer works, you can replace its library() statements with groundhog.library() ones, using as the groundhog.day the date the code was probably written (say, when it was posted on the internet), and it may work again.
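For a script with many library() calls, the replacement can be done mechanically. The sketch below is a hypothetical helper, not part of groundhog: it only handles simple calls of the form library(pkg), library("pkg") or library('pkg'), and leaves anything else (for example, calls with extra arguments) untouched.

```r
# Hypothetical helper (not part of groundhog): rewrite simple library(pkg)
# calls in a script's lines as groundhog.library("pkg", "date") calls.
retrofit_library_calls <- function(lines, date) {
  # Matches library(pkg), library("pkg") or library('pkg'); the lookbehind
  # skips calls that are already groundhog.library(...)
  pattern <- "(?<![.\\w])library\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?\\s*\\)"
  replacement <- paste0("groundhog.library(\"\\1\", \"", date, "\")")
  gsub(pattern, replacement, lines, perl = TRUE)
}

old_script <- c("library(dplyr)", "library('metafor')", "x <- 1")
retrofit_library_calls(old_script, "2016-06-20")
```

The rewritten script still needs a library("groundhog") line at the top before the first groundhog.library() call.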
Bonus: no more install.packages()
When you use library() to ask for a package that you have not installed, you get an error. That's annoying.
When you use install.packages() to ask for a package you already have, the existing one gets deleted without warning. That's dangerous.
When you use groundhog.library() to ask for a package you don't have, it gets installed automatically, and saved alongside any existing versions of it. That's convenient.
You could have all 32 versions of dplyr side-by-side on your computer; groundhog will load the one you need for the date you enter. For example, this is what my groundhog folder looks like right now:
When I started working on groundhog I was aware of three existing solutions to R's reproducibility problem: the packages renv and checkpoint, and the more general solution, Docker. These solutions are sophisticated, powerful, and versatile, but they lack 3 features I thought wide adoption by researchers would require. An ideal solution would:
(1) Work within self-contained individual R scripts (e.g., not require projects, or additional files).
(2) Make it so that the code itself reveals which version of which packages were being used.
(3) Involve trivial adoption costs.
In a footnote, I discuss how these features are missing in existing solutions.
For most people: https://groundhogR.com
For those who self-identify as github users: https://github.com/CredibilityLab/groundhog
Groundhog was funded by the Wharton School of the University of Pennsylvania.
Like AsPredicted and ResearchBox, groundhog is brought to you by the Wharton Credibility Lab.
I wrote the first version of groundhog and then collaborated with Hugo Gruson, an evolutionary biologist, to refine it, improve it, and turn it into a CRAN package.
To use groundhog: