Once you’ve imported data, you’re going to need to do some cleaning up.

This is the secode module in the Data Wrangling I topic; the relevant slack channel is here.

Some slides

Data Manipulation from Jeff Goldsmith.


Example

Start by figuring out how this example fits in with what you’ve done and plan to do. I’d suggest adding a new R Markdown file to the project we started last time; that way you can load example data sets using the code we wrote last time.

Once again we’re going to be using the tidyverse, so we’ll load that at the outset; I’ll also load the janitor package because I like clean names. We’re going to be looking at a lot of output, so I’ll print only three lines of each tibble by default. Lastly, we’ll focus on the data in FAS_litters.csv and FAS_pups.csv, so we’ll load those data and clean up the column names using what we learned in data import.

library(tidyverse)
library(janitor)

options(tibble.print_min = 3)

litters_data = read_csv("./data/FAS_litters.csv",
  col_types = "ccddiiii")
litters_data = clean_names(litters_data)

pups_data = read_csv("./data/FAS_pups.csv",
  col_types = "ciiiii")
pups_data = clean_names(pups_data)

select

For a given analysis, you may only need a subset of the columns in a data table; extracting only what you need can helpfully de-clutter, especially when you have large datasets. You can select columns using select.

You can specify the columns you want to keep by naming all of them:

select(litters_data, group, litter_number, gd0_weight, pups_born_alive)
## # A tibble: 49 x 4
##   group litter_number gd0_weight pups_born_alive
##   <chr> <chr>              <dbl>           <int>
## 1 Con7  #85                 19.7               3
## 2 Con7  #1/2/95/2           27                 8
## 3 Con7  #5/5/3/83/3-3       26                 6
## # ... with 46 more rows

You can specify the specify a range of columns to keep:

select(litters_data, group:gd_of_birth)
## # A tibble: 49 x 5
##   group litter_number gd0_weight gd18_weight gd_of_birth
##   <chr> <chr>              <dbl>       <dbl>       <int>
## 1 Con7  #85                 19.7        34.7          20
## 2 Con7  #1/2/95/2           27          42            19
## 3 Con7  #5/5/3/83/3-3       26          41.4          19
## # ... with 46 more rows

You can also specify columns you’d like to remove:

select(litters_data, -pups_survive)
## # A tibble: 49 x 7
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 Con7  #85                 19.7        34.7          20               3
## 2 Con7  #1/2/95/2           27          42            19               8
## 3 Con7  #5/5/3/83/3-3       26          41.4          19               6
## # ... with 46 more rows, and 1 more variable: pups_dead_birth <int>

You can rename variables as part of this process:

select(litters_data, GROUP = group, LiTtEr_NuMbEr = litter_number)
## # A tibble: 49 x 2
##   GROUP LiTtEr_NuMbEr
##   <chr> <chr>        
## 1 Con7  #85          
## 2 Con7  #1/2/95/2    
## 3 Con7  #5/5/3/83/3-3
## # ... with 46 more rows

If all you want to do is rename something, you can use rename instead of select. This will rename the variables you care about, and keep everything else:

rename(litters_data, GROUP = group, LiTtEr_NuMbEr = litter_number)
## # A tibble: 49 x 8
##   GROUP LiTtEr_NuMbEr gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 Con7  #85                 19.7        34.7          20               3
## 2 Con7  #1/2/95/2           27          42            19               8
## 3 Con7  #5/5/3/83/3-3       26          41.4          19               6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## #   pups_survive <int>

There are some handy helper functions for select; read about all of them using ?select_helpers. The one I use most frequently is everything(), which is handy for reorganizing columns without discarding anything:

select(litters_data, litter_number, group, everything())
## # A tibble: 49 x 8
##   litter_number group gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr>         <chr>      <dbl>       <dbl>       <int>           <int>
## 1 #85           Con7        19.7        34.7          20               3
## 2 #1/2/95/2     Con7        27          42            19               8
## 3 #5/5/3/83/3-3 Con7        26          41.4          19               6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## #   pups_survive <int>

In larger datasets, I also use starts_with(), ends_with(), and contains() often.

Lastly, like other functions in dplyr, select will export a dataframe even if you only select one column. Mostly this is fine, but sometimes you want the vector stored in the column. To pull a single variable, use pull.

Learning Assessment: In the pups data, select the columns containing litter number, sex, and PD ears.

filter

Some data tables will include rows you don’t need for your current analysis. Although you could remove specific row numbers using base R, you shouldn’t – this might break if the raw data are updated, and the thought process isn’t transparent. Instead, you should filter rows based on logical expressions using the filter function. Like select, the first argument to filter is the dataframe you’re filtering; all subsequent arguments are logical expressions.

You will often filter using comparison operators (>, >=, <, <=, ==, and !=). You may also use %in% to detect if values appear in a set, and is.na() to find missing values. The results of comparisons are logical – the statement is TRUE or FALSE depending on the values you compare – and can be combined with other comparisons using the logical operators & and |, or negated using !.

Some ways you might filter the litters data are:

  • gd_of_birth == 20
  • pups_born_alive >= 2
  • pups_survive != 4
  • !(pups_survive == 4)
  • group %in% c("Con7", "Con8")
  • group == "Con7" & gd_of_birth == 20
  • !is.na(wt_increase)

Filtering can be helpful for limiting a dataset to only those observations needed for an analysis. However, I recommend against the creation of many data subsets (e.g. one for each group). This can clutter up your workspace, and we’ll see good tools for the analysis of subsets before long.

Learning Assessment: In the pups data:

  • Filter the data to include only pups with sex 1
  • Filter the data to include only pups with PD walk less than 11 and sex 2

mutate

Sometimes you need to select columns; sometimes you need to change them or create new ones. You can do this using mutate.

The example below creates a new variable measuring the difference between gd18_weight and gd0_weight and modifies the existing group variable.

mutate(litters_data,
  wt_gain = gd18_weight - gd0_weight,
  group = tolower(group)
)
## # A tibble: 49 x 9
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 con7  #85                 19.7        34.7          20               3
## 2 con7  #1/2/95/2           27          42            19               8
## 3 con7  #5/5/3/83/3-3       26          41.4          19               6
## # ... with 46 more rows, and 3 more variables: pups_dead_birth <int>,
## #   pups_survive <int>, wt_gain <dbl>

A few things in this example are worth noting:

  • Your new variables can be functions of old variables
  • New variables appear at the end of the dataset in the order that they are created
  • You can overwrite old variables
  • You can create a new variable and immediately refer to (or change) it

Creating a new variable that does exactly what you need can be a challenge; the more functions you know about, the easier this gets.

Learning Assessment: In the pups data:

  • Create a variable that subtracts 7 from PD pivot
  • Create a variable that is the sum of all the PD variables

arrange

In comparison to the preceding, arranging is pretty straightforward. You can arrange the rows in your data according to the values in one or more columns:

head(arrange(litters_data, group, pups_born_alive), 10)
## # A tibble: 10 x 8
##    group litter_number  gd0_weight gd18_weight gd_of_birth pups_born_alive
##    <chr> <chr>               <dbl>       <dbl>       <int>           <int>
##  1 Con7  #85                  19.7        34.7          20               3
##  2 Con7  #5/4/2/95/2          28.5        44.1          19               5
##  3 Con7  #5/5/3/83/3-3        26          41.4          19               6
##  4 Con7  #4/2/95/3-3          NA          NA            20               6
##  5 Con7  #2/2/95/3-2          NA          NA            20               6
##  6 Con7  #1/2/95/2            27          42            19               8
##  7 Con7  #1/5/3/83/3-3…       NA          NA            20               9
##  8 Con8  #2/2/95/2            NA          NA            19               5
##  9 Con8  #1/6/2/2/95-2        NA          NA            20               7
## 10 Con8  #3/6/2/2/95-3        NA          NA            20               7
## # ... with 2 more variables: pups_dead_birth <int>, pups_survive <int>

%>%

We’ve seen several commands you will use regularly for data manipulation and cleaning. You will rarely use them in isolation. For example, suppose you want to load the data, clean the column names, remove pups_survive, and create wt_gain. There are a couple of options for this kind of multi-step data manipulation:

  • define intermediate datasets (or overwrite data at each stage)
  • nest function calls

The following is an example of the first option:

litters_data_raw = read_csv("./data/FAS_litters.csv",
  col_types = "ccddiiii")
litters_data_clean_names = clean_names(litters_data_raw)
litters_data_selected_cols = select(litters_data_clean_names, -pups_survive)
litters_data_with_vars = mutate(litters_data_selected_cols, 
  wt_gain = gd18_weight - gd0_weight,
  group = tolower(group))
litters_data_with_vars
## # A tibble: 49 x 8
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 con7  #85                 19.7        34.7          20               3
## 2 con7  #1/2/95/2           27          42            19               8
## 3 con7  #5/5/3/83/3-3       26          41.4          19               6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## #   wt_gain <dbl>

Below, we try the second option:

litters_data_clean = 
  mutate(
    select(
      clean_names(
        read_csv("./data/FAS_litters.csv", col_types = "ccddiiii")
        ), 
    -pups_survive
    ),
  wt_gain = gd18_weight - gd0_weight,
  group = tolower(group)
  )
litters_data_clean
## # A tibble: 49 x 8
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 con7  #85                 19.7        34.7          20               3
## 2 con7  #1/2/95/2           27          42            19               8
## 3 con7  #5/5/3/83/3-3       26          41.4          19               6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## #   wt_gain <dbl>

These are both confusing and bad: the first gets confusing and clutters our workspace, and the second has to be read inside out.

Piping solves this problem. It allows you to turn the nested approach into a sequential chain by passing the result of one function call as an argument to the next function call:

litters_data = 
  read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
  clean_names() %>%
  select(-pups_survive) %>%
  mutate(
    wt_gain = gd18_weight - gd0_weight,
    group = tolower(group))

litters_data
## # A tibble: 49 x 8
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 con7  #85                 19.7        34.7          20               3
## 2 con7  #1/2/95/2           27          42            19               8
## 3 con7  #5/5/3/83/3-3       26          41.4          19               6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## #   wt_gain <dbl>

All three approaches result in the same dataset, but the piped commands are by far the most straightforward. The easiest way to read %>% is “then”; the keyboard shortcuts are Ctrl + Shift + M (Windows) and Cmd + Shift + M (Mac).

The functions in dplyr (and much of the tidyverse) are designed to work smoothly with the pipe operator. By default, the pipe will take the result of one function call and use that as the first argument of the next function call; by design, functions in dplyr will take a tibble as an input and return a tibble as a result. As a consequence, functions in dplyr are easy to connect in a data cleaning chain. You can make this more explicit by using . as a placeholder for the result of the preceding call:

litters_data = 
  read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
  clean_names(dat = .) %>%
  select(.data = ., -pups_survive) %>%
  mutate(.data = .,
    wt_gain = gd18_weight - gd0_weight,
    group = tolower(group))

Usually you’ll just elide the first argument and be happy with life, but there are some cases where the placeholder will help. For example, to regress wt_gain on pups_born_alive, you might use:

litters_data %>%
  lm(wt_gain ~ pups_born_alive, data = .) %>%
  broom::tidy()
##              term   estimate std.error statistic      p.value
## 1     (Intercept) 13.0833132 1.2707022 10.296128 3.390028e-11
## 2 pups_born_alive  0.6050786 0.1732314  3.492892 1.553186e-03

There are limitations to the pipe. You shouldn’t have sequences that are too long; there isn’t a great way to deal with multiple inputs and outputs; and (since it’s not base R) not everyone will know what %>% means or does. That said, compared to days when R users only had the first two options, life is much better!

Learning Assessment: Write a chain of commands that:

  • loads the pups data
  • cleans the variable names
  • filters the data to include only pups with sex 1
  • removes the PD ears variable
  • creates a variable that indicates whether PD pivot is 7 or more days

Other materials

There’s lots of stuff out there on how to clean your data using dplyr.

The code that I produced working examples in lecture is here.