Writing R Packages II

If you write more than two functions, you need a package – this will remind you what functions do and how they interact with each other, help you keep track of inputs and outputs, and, if you want to share your code, allow you to do so in a standard format. The first part of this module covered getting to a complete package from scratch; this module covers some important but more advanced issues in R package development.

This is the second module in the Writing R Packages topic; the relevant Slack channel is here.

Some slides

Writing R Packages II from Jeff Goldsmith.

Example

For today’s example, I’ll continue working on example.package, the R package we started in writing R packages I.

Library path

You can find the path to your package library using .libPaths() – opening this directory on your computer will show you what you’ve installed.
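As a quick sketch, you can also look inside the library from R itself rather than opening the directory by hand. The object names lib_dirs and installed are just illustrative choices.

```r
# .libPaths() returns every library directory R knows about;
# the first entry is where install.packages() puts packages by default
lib_dirs = .libPaths()
lib_dirs

# each installed package is a subdirectory of a library directory
installed = list.files(lib_dirs[1])
head(installed)
```

What you see will depend entirely on your own installation.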

Search path

Before jumping into new package development stuff, we’re going to take a closer look at R’s search path. You can see your current search path at any time using search().

search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

There’s not much here yet, since we haven’t loaded anything – mostly we have default packages and the global environment.

When you call a function, R has to find it. We’ve often made the location of a function explicit using package::function() which tells R specifically where to look but doesn’t affect the search path.

iris = janitor::clean_names(iris)
search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

The iris dataset is included in the datasets package, which is in the search path. We can also use the clean_names() function since we’ve been very clear about where R should find it. We didn’t do anything to the search path, though!
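If you want to convince yourself of this, here is a small sketch that compares the search path before and after a package::function() call. I’m using tools::file_ext() only because the tools package ships with R and isn’t attached by default; the object names are my own.

```r
# record the search path, then call a function from a package that is
# installed but not attached
path_before = search()
ext = tools::file_ext("report.Rmd")
path_after = search()

ext                                  # "Rmd"
identical(path_before, path_after)   # TRUE: the search path did not change
"tools" %in% loadedNamespaces()      # TRUE: the namespace is loaded, not attached
```

The package had to be loaded for the call to work, but loading is not the same as attaching.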

If you don’t specify the package for a specific function, R will look for it in the global environment and then in attached packages – that is, in the search path. The library() function attaches a package to the search path, including it in the collection of packages R searches when trying to find a function. For example, to call clean_names() without specifying the package, we can use library(janitor) to attach the package to the search path.

library(janitor)
search()
##  [1] ".GlobalEnv"        "package:janitor"   "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"

iris = clean_names(iris)

Why not just attach everything? In part, at least, to avoid naming conflicts. Both MASS and dplyr have functions called select(), for example, and they do really different things. If you load both packages, which version you get depends on the order in which they’re loaded.

To use dplyr::select(), we load that package second.

library(MASS)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

iris %>% 
  as_tibble() %>% 
  select(sepal_length)
## # A tibble: 150 x 1
##    sepal_length
##           <dbl>
##  1          5.1
##  2          4.9
##  3          4.7
##  4          4.6
##  5          5  
##  6          5.4
##  7          4.6
##  8          5  
##  9          4.4
## 10          4.9
## # ... with 140 more rows

Note the message that dplyr::select() masks MASS::select() – these messages are easy to overlook but are really important!

I’ll detach both packages, then reverse the order in which I attach them and try again.

detach("package:dplyr", unload = TRUE)
## Warning: 'dplyr' namespace cannot be unloaded:
##   namespace 'dplyr' is imported by 'janitor' so cannot be unloaded
detach("package:MASS", unload = TRUE)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

iris %>% 
  as_tibble() %>% 
  select(sepal_length)
## Error in select(., sepal_length): unused argument (sepal_length)

iris %>% 
  as_tibble() %>% 
  dplyr::select(sepal_length)
## # A tibble: 150 x 1
##    sepal_length
##           <dbl>
##  1          5.1
##  2          4.9
##  3          4.7
##  4          4.6
##  5          5  
##  6          5.4
##  7          4.6
##  8          5  
##  9          4.4
## 10          4.9
## # ... with 140 more rows

The command that just uses select() produces an error because it’s implicitly using MASS::select(); the second is explicit about using dplyr::select() and works as desired.

As you work more in R you will run into search path issues (if you haven’t already), and understanding how attaching packages affects the search path will help you resolve them. This discussion also illustrates why it’s best to only attach the packages you need, and to use package::function() notation in cases where a package isn’t used repeatedly.
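When you suspect masking, find() reports every place on the search path that defines a name. The sketch below fakes a conflict using only base R, by defining a global object called sum; that setup is purely an illustration so the code runs without MASS or dplyr.

```r
# temporarily mask base::sum() with a global object of the same name
sum = function(...) "masked!"

where_sum   = find("sum")       # places on the search path that define sum
masked_call = sum(1, 2)         # R finds the masking version first
explicit    = base::sum(1, 2)   # package::function() skips the search

rm(sum)                         # remove the mask again

where_sum     # run at top level: ".GlobalEnv" "package:base"
masked_call   # "masked!"
explicit      # 3
```

The same approach works for real conflicts; with MASS and dplyr both attached, find("select") should list both packages in the order R searches them.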

`NAMESPACE`

The search path discussion is particularly relevant in the context of writing your own packages. In particular, the NAMESPACE file determines the search path associated with your package. The NAMESPACE file for example.package is shown below.

# Generated by roxygen2: do not edit by hand

export(sim_bern_mean)
export(sim_regression)
export(simulate_nrep)
import(dplyr)
import(purrr)
importFrom(magrittr,"%>%")
importFrom(tibble,tibble)

We used @import dplyr and @import purrr in our roxygen comments, which add the statements import(dplyr) and import(purrr) to the NAMESPACE. As a result, code in our package will search the functions exported by dplyr and purrr when looking for functions.

We also used @importFrom tibble tibble and @importFrom magrittr "%>%" in our roxygen comments, which add the statements importFrom(tibble,tibble) and importFrom(magrittr,"%>%") to the NAMESPACE. As a result, code in our package can find these specific functions when it runs.

There’s an important but confusing distinction between import directives in the NAMESPACE and the Imports field in the DESCRIPTION (shown below). Although they share a name, these mean different things: roughly, in the NAMESPACE we’re listing packages that need to be included in the search path, while in the DESCRIPTION we’re listing packages that have to be installed for our package to work.

Package: example.package
Title: Simulates Data and Summarizes
Version: 0.1.0
Authors@R: person("Jeff", "Goldsmith", email = "ajg2202@cumc.columbia.edu", role = c("aut", "cre"))
Description: What the package does (one paragraph).
Depends: R (>= 3.4.1)
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
Imports: broom,
    dplyr,
    magrittr,
    purrr,
    tibble,

To illustrate this distinction, recall that we used broom::tidy() in our functions rather than including @import broom in the roxygen comments. This makes it very clear where tidy() comes from, and means we don’t need broom in the search path; thus, broom doesn’t appear in the NAMESPACE. We do still need the package, though, so it’s listed as a dependency in the DESCRIPTION.

The NAMESPACE and roxygen comments also include exports, which identify functions that are visible when your package is attached. In bigger, more complex packages you may have functions you don’t want users to have access to; for those, remove @export from the roxygen comments.
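A rough sketch of what that looks like in a source file is below. Both functions are hypothetical (they are not part of example.package); the only difference that matters is whether the comment block contains @export.

```r
#' Simulate a normal sample and return its mean
#'
#' @param n sample size
#' @return a data frame with columns n and samp_avg
#' @export
sim_norm_mean = function(n) {
  summarize_sample(stats::rnorm(n))
}

# internal helper: ordinary comments and no @export tag, so roxygen2
# does not write an export() line for it in the NAMESPACE
summarize_sample = function(x) {
  data.frame(n = length(x), samp_avg = mean(x))
}
```

Code inside the package can still call summarize_sample(); it just isn’t made visible when the package is attached.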

Checks

Checking your package for common issues – things like the presence of all needed files, the completeness of documentation, whether the code and examples run – is critical to making sure your package is complete and self-contained. You can perform these checks using devtools::check() or a button in RStudio. This is going to be frustrating, at least until you start to recognize that this is a helpful process. The checks really get into the corners of your package and find things you wouldn’t expect. The messages take some practice to understand. Correcting issues will force you to complete all your documentation.

You don’t have to do this for packages written for yourself, although I do recommend it. You do have to do this for packages that go on CRAN, which is part of the reason that CRAN packages are a bit more trustworthy. Many packages on GitHub have passed checks; look for a happy build | passing sticker at the top of the README!

Below is the (redacted) output of checking example.package.

Updating example.package documentation
Loading example.package
Warning: @examples [sim_bern_mean.R#12]: requires a value
Warning: @param [simulate_nrep.R#4]: requires name and description
Warning: @examples [simulate_nrep.R#13]: requires a value

...

* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK

...

* checking R code for possible problems ... NOTE
sim_bern_mean: no visible global function definition for ‘rbinom’
sim_regression: no visible global function definition for ‘rnorm’
sim_regression: no visible binding for global variable ‘x’

...

* checking examples ... ERROR
Running examples in ‘example.package-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: sim_regression
> ### Title: Simulate from an SLR
> ### Aliases: sim_regression
> 
> ### ** Examples
> 
> # simulate a dataset and return estimates
> sim_regression(30, 2, 3)
Error in lm(y ~ x, data = sim_data) %>% broom::tidy() : 
  could not find function "%>%"
Calls: sim_regression
Execution halted
* DONE
Status: 1 ERROR, 1 WARNING, 2 NOTEs

...

We did pretty well! There are some warnings about our documentation (incomplete @param and @examples roxygen comments), a note about being clear about where some functions come from, and an error in one of our examples (which needs %>% to run, but doesn’t load the necessary package). This gives a sense of the kinds of issues that checking your package will turn up.
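The note about rbinom and rnorm usually goes away once the package states where those functions come from. Here is a minimal sketch of one way to do that, using a simplified stand-in for sim_bern_mean(); the name sim_bern_mean_sketch and the data frame it returns are my own illustration, not the code in example.package.

```r
#' Simulate Bernoulli draws and return the sample mean
#'
#' @param n sample size
#' @param prob success probability
#' @importFrom stats rbinom
#' @export
sim_bern_mean_sketch = function(n, prob) {
  # with @importFrom above, roxygen2 writes importFrom(stats,rbinom)
  # to the NAMESPACE, so the check can see where rbinom() is defined
  sim_data = rbinom(n, 1, prob)
  data.frame(n = n, samp_avg = mean(sim_data))
}
```

Writing stats::rbinom() in the function body is an equally reasonable alternative, and is closer to how we handled broom::tidy().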

Tests

Checking a package is a process that doesn’t vary from one package to the next because this process isn’t concerned with whether a function or package works as intended – the built-in checks are for things like documentation, namespace, installation, etc.

Testing, in contrast, is package specific because it is concerned with whether functions work as intended. This is important for at least two reasons:

- If your code depends on other packages and those change, you should find out and fix the issue
- As you add or edit functions, you should be sure that changes don’t break existing code in unexpected ways

Informal testing is common during development – you run your functions on code snippets and make sure they give the results you expect. Formal testing makes this process more rigorous by saving the informal tests and running them each time you check your package. Like other best practices for development, this takes some time to get used to but guards against future trouble and improves your software.

The testthat package does as much as possible to facilitate formal testing. To set this up for your package, run devtools::use_testthat() (in devtools 2.0.0 and later, this helper lives in the usethis package as usethis::use_testthat()). Doing so will create a directory /tests/testthat/ to hold tests and a file /tests/testthat.R to run tests. It’s your job to write the tests!

The file tests/testthat/test_sim_bern.R is shown below (note: test files have to start with test and be in the right directory):

context("sim_bern_mean")

test_that("simulation returns a 1x2 data frame", {

  output = sim_bern_mean(30, .5)

  expect_is(output, "tbl_df")
  expect_equal(ncol(output), 2)
  expect_equal(nrow(output), 1)

})

test_that("simulation gives anticipated results", {

  set.seed(1)
  output = sim_bern_mean(30, .5)

  set.seed(1)
  sample = rbinom(30, 1, .5)

  expect_equal(output$n, length(sample))
  expect_equal(output$samp_avg, mean(sample))

})

These are pretty contrived tests, but give you an idea of how testing in general might work. Use devtools::test() to run your tests (these will also run when you check your package); output for my tests is shown below.

Loading example.package
Testing example.package
sim_bern_mean: .....

This output contains a . for each passed expectation in each context (five here, across the two tests), and will indicate when a test fails.
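The second test above only works because resetting the seed makes the random draws repeat. A small base R sketch of that idea is below (the object names are my own).

```r
set.seed(1)
first_draw = rbinom(30, 1, .5)

# resetting the seed restarts the random number stream
set.seed(1)
second_draw = rbinom(30, 1, .5)

identical(first_draw, second_draw)   # TRUE: same seed, same draws

# without resetting the seed the stream has moved on, so this draw
# will generally differ from the first two
third_draw = rbinom(30, 1, .5)
```

That’s why the test calls set.seed(1) immediately before both sim_bern_mean() and rbinom(); without the second call the comparison would fail for reasons unrelated to any bug.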

Vignettes

Help pages for functions are great, but assume users know roughly how a package works and only need a reminder about some specifics. To give a more general introduction to a package – what functions do, how they interact, and why you wrote it – you need the longer-form documentation found in package vignettes. Fortunately, these can be written using R Markdown.

To build the infrastructure needed to include a vignette in example.package, I’ll run the line below (in devtools 2.0.0 and later, the same helper is available as usethis::use_vignette()).

devtools::use_vignette("sim_vignette")

This makes several changes in the package directory.

- Adds knitr and rmarkdown to Suggests in DESCRIPTION; these packages aren’t dependencies, but will be needed for someone else to knit your vignette.
- Creates a new file, /vignettes/sim_vignette.Rmd, with template vignette content.
- Adds /inst/doc to .gitignore.

You’ll need to edit /vignettes/sim_vignette.Rmd. There are some things you have to do:

- Replace both instances of “Vignette Title” in the YAML with an actual title, using the same title in both places.
- List yourself as author or remove that line.

Then edit the rest of the R Markdown document to give an overview of the package. This often consists of organizing some of the code you’ve used elsewhere – either in the examples or in the code you have that uses the package.

The vignette I wrote for example.package can be downloaded here.

Disseminating your vignette gets complicated, unfortunately – the knitted RMD is in /inst/doc/, which git is ignoring. Building a package (going from source to bundle) using devtools::build() will compile the vignette and include it in the bundle, so packages installed from a bundle or binary will have vignettes available. That means you can check out vignettes for packages you’ve installed from CRAN; see what’s available with browseVignettes(), or go straight to a vignette using vignette("dplyr").
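To see what’s available for one specific package without opening a browser, something like the sketch below should work. I’m using grid only because it’s part of base R; whether its vignettes are actually present depends on how R was installed, so the listing may be empty on some systems.

```r
# list the vignettes that ship with an installed package
grid_vignettes = vignette(package = "grid")

# the results component holds one row per vignette
grid_vignettes$results[, c("Item", "Title"), drop = FALSE]
```

The Item column gives the name you would pass to vignette() to open a particular one.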

Installing from GitHub first builds the package bundle and then installs that; by default, this won’t knit vignettes in case they are time consuming or complex. You can force the inclusion of a vignette using devtools::install_github(build_vignettes = TRUE).

For packages I put on GitHub, I usually include a code chunk like the one below in my README to let users know how to include and access the vignette.

devtools::install_github("jeff-goldsmith/example.package", 
                         build_vignettes = TRUE)
# vignette("sim_vignette")

Other materials

Many of these topics are touched on in the other materials for writing R packages I; below we reiterate some of those and add some new resources.

- Jenny Bryan’s longer illustrated example covers many of these topics in a reasonable level of detail
- The R Packages book is a more complete reference. There are chapters on
  - What is a package?
  - NAMESPACE
  - Checks
  - Tests
  - Vignettes
  - Data
  - Compiled code
- The usethis package should automate a lot of the package writing process, although I haven’t used it myself
- The official guide to Writing R Extensions also exists, although I’m not sure I exactly recommend it …

The code that I produced while working through examples in lecture is here.