Writing R Packages I

If you write more than two functions, you need a package – this will remind you what functions do and how they interact with each other, help you keep track of inputs and outputs, and, if you want to share you code, allow you to do so in a standard format. This module will take us from a few functions to a complete package.

This is the first module in the Writing R Packages topic; the relevant slack channel is here.

Some slides

Writing R Packages I from Jeff Goldsmith.

Example

Rather than jumping into how to organize this material, we’ll start by motivating the need for an R package by revisiting an example from Iteration. There, we wrote a short function to simulate data from a simple linear regression; this function returns a data frame containing the sample size and estimated coefficients for the simualted dataset.

sim_regression = function(n_samp, beta0, beta1) {
  
  sim_data = tibble(
    x = rnorm(n_samp, mean = 1, sd = 1),
    y = beta0 + beta1 * x + rnorm(n_samp, 0, 1)
  )
  
  ls_fit = lm(y ~ x, data = sim_data) %>% 
    tidy()
  
  tibble(
    n = n_samp,
    beta0_hat = pull(ls_fit, estimate)[1],
    beta1_hat = pull(ls_fit, estimate)[2]
  )
}

We also wrote a wrapper function to repeat the simulation many times and collapse the output to a data frame.

simulate_nrep = function(n_rep, n_samp, beta0, beta1) {
  
  rerun(n_rep, sim_regression(n_samp, beta0, beta1)) %>% 
    bind_rows()
  
}

You can download R scripts containing sim_regression and simulate_nrep.

We used these functions to examine the properties of ordinary least squares estimates in a simple linear regression.

library(broom)
library(tidyverse)

sim_results = 
  map_df(list(30, 60, 120, 240), simulate_nrep,
         n_rep = 100, beta0 = 2, beta1 = 3) 

sim_results %>% 
  mutate(n = as.factor(n),
         n = fct_relabel(n, relabel_function)) %>% 
  ggplot(aes(x = beta0_hat, y = beta1_hat)) + 
  geom_point(alpha = .2) + 
  facet_grid(~n)

It’s really easy to forget what these functions do and how to work with them, even in the space of a couple of weeks. In this case it’s not too bad to look back at the code, but for more complex functions it can be very challenging. That’s why writing packages is helpful.

`create()`

I’ll create a new package skeleton called example.package using devtools::create().

library(devtools)
create("~/Desktop/example.package")

(Note: you might have thought I was going to call this example_package in keeping with literally everything else in the whole course. You’d be right; that’s exactly what I was going to do. But early on CRAN decided not to let underscores in package names and that’s one of those things that just probably won’t change. Putting .s in package names causes other problems and I would usuall avoid it, but it seemed like the best option in this case. So here we are, stuck with example.package.)

Take a look at the directory that create() creates. This contains DESCRIPTION and NAMESPACE files, an empty R directory, and an R Project – everything you need to get a package started.

Open the R Project, and notice the build tab. RStudio has a lot of helpful built-in tools for package development, which we’ll use frequently. Like other things we’ve seen, these are often shortcuts for commands that do the same stuff.

Add functions

My “package” currently doesn’t do anything, so I’m going to copy the .R files above into the R directory. Then I’ll install this version of the package and restart the R session, and run the code below.

library(broom)
library(tidyverse)
library(example.package)

rm(list = ls())

simulate_nrep(n_rep = 10, n_samp = 30, beta0 = 2, beta1 = 3)

I’ve removed the functions from the global environment, but can still run the code above. That’s because we’ve made those functions accessible to R through example.package, which is great!!

If you try ?sim_regression you’ll find there’s nothing there; that’s because we haven’t written any documentation. Putting all your functions in a package keeps them from cluttering up your scripts and R Markdown documents, but packages aren’t really useful without documentation.

Add documentation

Open sim_regression.R and put your cursor somewhere inside the function. Go to Code > Insert Roxygen Skeleton; this will add a bunch of commented lines above the function. We’ll use the roxygen2 package to convert these specially formatted lines into the file /man/sim_regression.Rd, which will become the help page accessed using ?sim_regression.

After updating the title, writing a short summary of the function, describing input parameters and the return object (ignore @export and @examples for now), you should have something like the following:

#' Simulate from an SLR
#'
#' Simualtes data from a simple linear regression model with user-defined
#' sample size and regression coefficients. Fits a SLR to the simulated 
#' data and returns regression coefficients.
#'
#' @param n_samp sample size
#' @param beta0 intercept
#' @param beta1 slope
#'
#' @return tbl_df with one row, containing sample size and coef estimates
#' @export
#'
#' @examples

Repeat this process to create a roxygen comments for simulate_nrep.

Once the roxygen comments are in place, use devtools::document() or Build > More > Document to create help pages for the two functions. Install and restart, and try ?sim_regression again. Now you have help pages accessible in the usual way – neat!!

Open the directory for the package. You should find a new /man/ subdirectory, with files for each function with the extension .Rd. These are the documentation files; you can open them if you want, but don’t edit them! Always change your documentation in the roxygen comments – doing so keeps your functions and documentation in the same place, which makes it easier to stay organized and up to date.

These descriptions are helpful, but illustrative examples are great too! Edit the roxygen comments to include the lines below, then devtools::document(), install + rebuild, and check out the new help page.

#' @examples
#' # simulate a single dataset
#' sim_regression(30, 2, 3)

As you writing documentation, keep in mind that you’ll need to understand how your functions work later – this is another case where you’re collaborating with “future you”, so anything you do now to make this clear will save you time in the long run! If you intend to share this with others, you probably need to be even more explicit in your documentation.

Package description

The DESCRIPTION file should mostly be edited by hand. This is higher-level information about the package, and there’s a lot here. Some of this really only matters if you plan to share your package with others, but it’s quick and easy to fill in (and you never know when a personal package will become public). A few quick points:

There are rules for the title and description
At this point, you’re probably the only author
Start with version number 0.1.0; Jeff Leek’s guide has some nice thoughts about versioning based on Bioconductor.
Add a licence – run devtools::use_mit_license() and change LICENSE.txt to your name and the current year.

Once you’ve done all this, install and restart.

More functions

Right now, example.package consists of a function to simulate data from a SLR, estimate model parameters, and return the results, and a second function that reruns the first, formats the results, and returns a dataframe. To build on this framework, we’ll add a new function that simulates from a different data generating mechanism and edit simulate_nrep to work with both simulation functions.

The sim_bern_mean function has the same general structure as sim_regression() but simulates data from a Bernoulli distribution and returns the sample average (download the script here.

sim_bern_mean = function(n_samp, prob) {
  
  sim_data = tibble(
    y = rbinom(n_samp, 1, prob)
  )
  
  tibble(
    n = n_samp,
    samp_avg = mean(sim_data$y)
  )
}

To add this to example.package, we need to copy the script to /R/, write some documentation using roxygen comments, use document() to create the .Rd file, and install + restart. After these steps, I can check that the function works the way I intended.

library(broom)
library(tidyverse)
library(example.package)

rm(list = ls())

sim_bern_mean(25, .9)

A problem now is that simulate_nrep doesn’t take advantage of this new function, so I need to do some edits there. Specifically, I’ll add sim_func as an argument and use the mysterious but useful ... to pass additional arguments on to other functions – this is helpful because we can still send arguments to sim_regression or sim_bern_mean without listing everything in the definition of simulate_nrep. The function after making these edits follows.

simulate_nrep = function(n_rep, sim_func, ...) {
  
  rerun(n_rep, sim_func(...)) %>% 
    bind_rows()
  
}

After making these edits, I’ll need to change the documentation accordingly, run document(), install and restart, and check that the functions work as expected.

library(broom)
library(tidyverse)
library(example.package)

rm(list = ls())

simulate_nrep(10, sim_regression, n_samp = 30, beta0 = 2, beta1 = 3)
simulate_nrep(10, sim_bern_mean, n_samp = 25, prob = .9)

At this point, you’d probably want to add some default values to sim_regression and sim_bern_mean (and add these to the documentation!).

Dependencies

Our functions depend on other functions in broom, tibble, magrittr, purrr, and dplyr, but we haven’t made this explicit. As a consequence, opening a new R session, loading only example.package, and running the functions as above will produce an error. We’ll talk more about the details of this process later, but for now we simply need to know that our package depends on those packages.

The best way to make your package’s dependencies known to R is to use package::function() everywhere in your code (e.g. broom::tidy()). This makes it clear which functions exist outside your package and can help prevent errors if multiple packages have functions with the same name.

That strategy can be a bit heavy-handed if you use the same function a lot, or if you use lots of functions from the same package. In these cases, you can edit the roxygen comments to include @importFrom package function or @import package. These make it increasingly less clear which functions come from which package (especially @import) but will make the relevant functions available. In example.package, I’ll use broom::tidy() to identify tidy(); importFrom tibble tibble to add tibble() and importFrom magrittr "%>%" to add the pipe; and @import dplyr for bind_rows() and @import purrr for rerun().

These steps make it clear which packages you depend on, but you still need R to include them when you load your package. To address this point, add dependencies to the Imports field of the DESCRIPTION. This is a step you could do “by hand” since we’ve made edits to the file previously; instead we’ll run the following lines in the console. Check the DESCRIPTION before and after!

devtools::use_package("broom")
devtools::use_package("dplyr")
devtools::use_package("magrittr")
devtools::use_package("purrr")
devtools::use_package("tibble")

To check that this has worked, restart RStudio and run the following lines.

library(example.package)

simulate_nrep(10, sim_regression, n_samp = 30, beta0 = 2, beta1 = 3)
simulate_nrep(10, sim_bern_mean, n_samp = 25, prob = .9)

Figuring out exactly what you depend on can take some getting used to, but it’s a critical step for ensuring your package really is self-contained. You should also try to limit your dependencies, both to keep your package simple and to make it easier to maintain if the dependencies are updated.

Deploy on GitHub

Throughout this example, I’ve been using git and GitHub because that’s how I do all my work. In the case of package development, this turns out to have benefits beyond the usual version control stuff – you can install R packages directly from GitHub!!

Try installing example.package using the lines below.

devtools::install_github(repo = "jeff-goldsmith/example.package")

GitHub has become a common way to make R packages publicly available. Packages that exist only on GitHub have a lot fewer guarantees than packages on CRAN and may be less stable, but can be useful anyway (especially if they’re from a developer you trust). You can also frequently access the “developer versions” of CRAN packages using GitHub to get the latest features (again with the caveat that these might be less stable).

README

The information in the package documentation can be difficult for someone to find and read quickly, so it’s helpful to add a short description if you intend to make your package useful for others. Files named README.md are handled in a special way by GitHub, and are perfect for this purpose.

Create a template README using devtools::use_readme_rmd(), and then edit so that someone who happens across the repo will know what the package does and how it works. I also like to include instructions for installing from GitHub that can be copy-and-pasted, so folks don’t have to track down your user and repo names.

After making edits, knit the README, commit, push to remote, and check out the GitHub repo!

Other materials

Writing packages is a major activity for folks who use R. It seems intimidating at first, and really understanding everything can be a lot of work. Thankfully lots of people have tried to make things easy and clear.

Hilary Parker’s blog post on writing a blog post from scratch makes it clear how easy writing a package can be – this is usually where I start new packages!
Jeff Leek’s package development guide has a lot of great tips about design, but isn’t really a how-to
Jenny Bryan has, not surprisingly, some slides, a short illustrated example, and a longer illustrated example, all of which are excellent
Hadley Wickham has a whole book on R packages; for now, the most relevant chapters are the intro and package overview; the chapters on R code, the DESCRIPTION, and the help pages may also be useful.
The usethis package should automate a lot of the package writing process, although I haven’t used it myself
Finally, RStudio has a devtools cheatsheet

The code that I produced working examples in lecture is here.