Writing functions

If you use the same code twice, you need a function – this will improve code readability, facilitate troubleshooting, and reduce chances for mistakes. This content looks at the best approaches for writing R functions.

This is the first module in the Iteration topic; the relevant slack channel is here.

Some slides

Writing Functions from Jeff Goldsmith.

Example

For this topic, I’ll create a directory called example_iteration and take the steps to make it a public GitHub repo. I’ll write code for today’s content in a new R Markdown document called functions.Rmd, and I’m going to load the usual packages. I’m also setting the seed so that the output on this page is fixed.

library(tidyverse)

theme_set(theme_bw())
theme_update(legend.position = "bottom")

set.seed(1)

My first function

The best way to build up a function is to start with code you’ve written outside a function. To see how this might work, I’ll start with a simple example: the chunk below takes a sample from a standard normal and then computes the Z score for the sample.

x = rnorm(25)

mean(x) / sd(x)
## [1] 0.1775222

If I want to repeat this (admittedly simple) process for lots of samples, I might want to have a function that takes the sample as an argument, computes the Z score in the body, and returns the result. I define such a function below.

z_score = function(x) {
  
  z = mean(x) / sd(x)
  z
  
}

z_score(x)
## [1] 0.1775222

I can try this with a few samples and confirm that it works. I should also try to think of ways this code might break; the attempts below try a variety of inputs to see what happens.

z_score(3)
## [1] NA
z_score("my name is jeff")
## Warning in mean.default(x): argument is not numeric or logical: returning
## NA
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm
## = na.rm): NAs introduced by coercion
## [1] NA
z_score(iris)
## Warning in mean.default(x): argument is not numeric or logical: returning
## NA
## Error in is.data.frame(x): (list) object cannot be coerced to type 'double'
z_score(sample(c(TRUE, FALSE), 25, replace = TRUE))
## [1] 1.428286

These all did something I didn’t want, but only two returned errors. To avoid behavior you don’t want (i.e. to “fail noisily and as soon as possible”) we’ll add some checks on the argument values using conditional statements.

z_score = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Z scores cannot be computed for length 1 vectors")
  }
  
  z = mean(x) / sd(x)
  
  z
}

Fantastic – we have a pretty solid function for computing Z scores!

Multiple outputs

In some cases it might be good to return not just the Z score, but the mean and standard deviation as well. A first option is to store each of the values in a named list, and to return that list.

z_score = function(x) {
  
  if (!is.numeric(x)) {
    stop("argument x should be numeric")
  } else if (length(x) == 1) {
    stop("z-scores cannot be computed for length 1 vectors")
  }
  
  mean_x = mean(x)
  sd_x = sd(x)
  z = mean_x / sd_x
  
  list(mean = mean_x, 
       sd = sd_x,
       z = z)
}

Alternatively, we might store values in a data frame.

z_score = function(x) {
  
  if (!is.numeric(x)) {
    stop("argument x should be numeric")
  } else if (length(x) == 1) {
    stop("z-scores cannot be computed for length 1 vectors")
  }
  
  mean_x = mean(x)
  sd_x = sd(x)
  z = mean_x / sd_x
  
  tibble(
    mean = mean_x, 
    sd = sd_x,
    z = z
  )
}

In general, either of these will be fine; which one you choose will depend on what kind of values you want to return, and what you plan to do with the function itself. If you want to return the original sample along with the computed values, a list might make sense; if you plan to run your function a lot and study the results, having a data frame will make it easier to use other tools. We’ll see more of that in iteration and simulation.

Multiple inputs

As exciting as it is to compute Z scores, let’s start setting our sights higher. I’d like to have a function that takes a given sample size and true simple linear regression line, simulates data from that model, and returns the estimated intercept and slope. I’ll start from the code below.

sim_data = tibble(
  x = rnorm(30, mean = 1, sd = 1),
  y = 2 + 3 * x + rnorm(30, 0, 1)
)

ls_fit = lm(y ~ x, data = sim_data)
  
beta0_hat = coef(ls_fit)[1]
beta1_hat = coef(ls_fit)[2]

You should take a few minutes to examine this code – make a plot of the simulated data to make sure it “makes sense”, take a look at the result of the lm call, etc.

Once you’re satisfied, it’s time to wrap things up in a function. I’d like to be able to change the sample size and regression parameters, so those will be my arguments; the code that simulates data and fits the regression goes in the body; and the return statement should include the intercept and slope. A function that does all this is below.

sim_regression = function(n, beta0, beta1) {
  
  sim_data = tibble(
    x = rnorm(n, mean = 1, sd = 1),
    y = beta0 + beta1 * x + rnorm(n, 0, 1)
  )
  
  ls_fit = lm(y ~ x, data = sim_data)
  
  tibble(
    beta0_hat = coef(ls_fit)[1],
    beta1_hat = coef(ls_fit)[2]
  )
}

Repeated calls to sim_regression() will give a sense of sampling variability in regression coefficients in SLR; take a few minutes to run sim_regression(30, 2, 3) a few times, and then to run sim_regression(300, 2, 3), and think about the results. We’ll examine this more formally in simulation

This is also a good time to point out how R handles argument matching. We can use positional matching, meaning the first value supplied is taken to be the first argument, the second value is the second argument, and so on. We do this with tidyverse functions a lot; the first argument is always a dataframe, and we just supply that dataframe in the first position. We also use positional matching when we call mean(x) or sim_regression(30, 2, 3).

Alternatively, we can use named matching, which uses the argument name in the function call. Named matching can be a bit more stable when you’re writing your own functions (in case you decide to change the order of the inputs, for example) but isn’t strictly necessary. Named arguments can be supplied in any order: sim_regression(n = 30, beta0 = 2, beta1 = 3) is equivalent to `sim_regression(beta1 = 3, n = 30, beta0 = 2).

Revisiting past examples

There have been a couple of times in this class that I’ve had to write code I didn’t like, because it would have made sense to write a function. We’ll revisit those quickly to see how we could improve our code.

Scraping Amazon

In reading data from the web, we wrote code that allowed us to scrape information in Amazon reviews. That code is below.

url = "https://www.amazon.com/Philips-Sonicare-rechargeable-toothbrush-HX6211/product-reviews/B00YAR7ZFM/ref=cm_cr_arp_d_viewopt_srt?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber=1"

toothbrush_html = read_html(url)

review_titles = toothbrush_html %>%
  html_nodes("#cm_cr-review_list .review-title") %>%
  html_text()

review_stars = toothbrush_html %>%
  html_nodes("#cm_cr-review_list .review-rating") %>%
  html_text()

reviews = data_frame(
  title = review_titles,
  stars = review_stars
)

Let’s write a quick function to scrape review information for any URL to an Amazon review page. Hint: this code appears in Homework 5 …

Loading LoTR data

In tidy data, we broke the “only copy code twice” rule when we used the code below to process the LoTR words data:

fellowship_ring = read_excel("./data/LotR_Words.xlsx", range = "B3:D6") %>%
  clean_names() %>%
  gather(key = sex, value = words, female:male) %>%
  mutate(race = tolower(race),
         movie = "Fellowship")

two_towers = read_excel("./data/LotR_Words.xlsx", range = "F3:H6") %>%
  clean_names() %>%
  gather(key = sex, value = words, female:male) %>%
  mutate(race = tolower(race),
         movie = "Two Towers")

return_king = read_excel("./data/LotR_Words.xlsx", range = "J3:L6") %>%
  clean_names() %>%
  gather(key = sex, value = words, female:male) %>%
  mutate(race = tolower(race),
         movie = "Return")

Try to write a function that can be used to abstract the data loading and cleaning process. Use this function to recreate the tidied LoTR dataset.

Having a function that handles the loading and cleaning is great – if I decide I want to change the tidying step, I only have to do it once, and I don’t have to worry about mistakes creeping in through copy-and-paste errors!

Functions as arguments

One powerful tool is the ability to pass functions as arguments into functions. This might seem like a weird thing to do, but it has a lot of handy applications – we’ll see just how far it goes in the next modules in the topic.

As a quick example, suppose we wanted to get a sense of how similar or different values in a vector are to each other. There are lots of ways to measure this – variance, standard deviation, range, inter-quartile range – and some are more appropriate in some cases than in others. The function below allows you to input a vector and a function, and returns the result of applying the specified function to the vector input.

x = rnorm(25, 0, 1)

my_summary = function(x, summ_func) {
  summ_func(x)
}

my_summary(x, sd)
## [1] 0.8988712
my_summary(x, IQR)
## [1] 1.271572
my_summary(x, var)
## [1] 0.8079694

This example is pretty trivial – you could just apply those functions directly to x and skip the hassle – but in many cases the idea of passing functions as arguments is really powerful. As an example, remember that you can reorder factors according to different summaries in fct_reorder!

Scoping and names

Take a look at the code below. Will the call f(x = y) work? If so, what will it produce? What is the current value of x, y, and z?

f = function(x) {
  z = x + y
  z
}

x = 1
y = 2

f(x = y)

Examples like this are tricky, but emphasize an issue that comes up a lot in writing functions: you define a variable in your global environment and use it in your function, but it isn’t passed as an argument. This is easy to miss, especially when you go from written code to a function, and can be hard to track down if you empty your working directory or change a variable name. The best advice I have is to give your arguments useful names and think carefully about where everything is defined, and to periodically empty your environment and try everything again!

Other materials

There are chapters on functions in R for Data Science and Advanced R
Jenny Bryan has a three part series on writing functions (each part is short)
R Programming for Data Science has nice chapters on functions and scoping
The Basics of UNIX Philosophy, linked to above, apply broadly to designing code

The code that I produced working examples in lecture is here.