dplyr
Once you’ve imported data, you’re going to need to do some cleaning up.
This is the second module in the Data Wrangling I topic; the relevant Slack channel is here.
Start by figuring out how this example fits in with what you’ve done and plan to do. I’d suggest adding a new R Markdown file to the project we started last time; that way you can load the example datasets using the code we’ve already written.
Once again we’re going to be using the tidyverse, so we’ll load that at the outset; I’ll also load the janitor package because I like clean names. We’re going to be looking at a lot of output, so I’ll print only three lines of each tibble by default. Lastly, we’ll focus on the data in FAS_litters.csv and FAS_pups.csv, so we’ll load those data and clean up the column names using what we learned in data import.
library(tidyverse)
library(janitor)
options(tibble.print_min = 3)
litters_data = read_csv("./data/FAS_litters.csv",
col_types = "ccddiiii")
litters_data = clean_names(litters_data)
pups_data = read_csv("./data/FAS_pups.csv",
col_types = "ciiiii")
pups_data = clean_names(pups_data)
select
For a given analysis, you may only need a subset of the columns in a data table; extracting only what you need can helpfully de-clutter, especially when you have large datasets. You can select columns using select.
You can specify the columns you want to keep by naming all of them:
select(litters_data, group, litter_number, gd0_weight, pups_born_alive)
## # A tibble: 49 x 4
## group litter_number gd0_weight pups_born_alive
## <chr> <chr> <dbl> <int>
## 1 Con7 #85 19.7 3
## 2 Con7 #1/2/95/2 27 8
## 3 Con7 #5/5/3/83/3-3 26 6
## # ... with 46 more rows
You can specify a range of columns to keep:
select(litters_data, group:gd_of_birth)
## # A tibble: 49 x 5
## group litter_number gd0_weight gd18_weight gd_of_birth
## <chr> <chr> <dbl> <dbl> <int>
## 1 Con7 #85 19.7 34.7 20
## 2 Con7 #1/2/95/2 27 42 19
## 3 Con7 #5/5/3/83/3-3 26 41.4 19
## # ... with 46 more rows
You can also specify columns you’d like to remove:
select(litters_data, -pups_survive)
## # A tibble: 49 x 7
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Con7 #85 19.7 34.7 20 3
## 2 Con7 #1/2/95/2 27 42 19 8
## 3 Con7 #5/5/3/83/3-3 26 41.4 19 6
## # ... with 46 more rows, and 1 more variable: pups_dead_birth <int>
You can rename variables as part of this process:
select(litters_data, GROUP = group, LiTtEr_NuMbEr = litter_number)
## # A tibble: 49 x 2
## GROUP LiTtEr_NuMbEr
## <chr> <chr>
## 1 Con7 #85
## 2 Con7 #1/2/95/2
## 3 Con7 #5/5/3/83/3-3
## # ... with 46 more rows
If all you want to do is rename something, you can use rename instead of select. This will rename the variables you care about, and keep everything else:
rename(litters_data, GROUP = group, LiTtEr_NuMbEr = litter_number)
## # A tibble: 49 x 8
## GROUP LiTtEr_NuMbEr gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Con7 #85 19.7 34.7 20 3
## 2 Con7 #1/2/95/2 27 42 19 8
## 3 Con7 #5/5/3/83/3-3 26 41.4 19 6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## # pups_survive <int>
There are some handy helper functions for select; read about all of them using ?select_helpers. The one I use most frequently is everything(), which is handy for reorganizing columns without discarding anything:
select(litters_data, litter_number, group, everything())
## # A tibble: 49 x 8
## litter_number group gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 #85 Con7 19.7 34.7 20 3
## 2 #1/2/95/2 Con7 27 42 19 8
## 3 #5/5/3/83/3-3 Con7 26 41.4 19 6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## # pups_survive <int>
In larger datasets, I also use starts_with(), ends_with(), and contains() often.
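For instance, starts_with() keeps every column whose name begins with a given prefix. A minimal sketch on a made-up toy tibble (the same call works on litters_data):

```r
library(dplyr)
library(tibble)

# Made-up toy tibble with column names like those in litters_data
df = tibble(
  group        = "Con7",
  gd0_weight   = 19.7,
  gd18_weight  = 34.7,
  pups_survive = 3L
)

# Keep every column whose name starts with "gd"
select(df, starts_with("gd"))
```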
Lastly, like other functions in dplyr, select will return a dataframe even if you only select one column. Mostly this is fine, but sometimes you want the vector stored in the column. To pull a single variable, use pull.
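As a quick illustration of the difference, on a made-up toy tibble:

```r
library(dplyr)
library(tibble)

# Made-up toy tibble
df = tibble(group = c("Con7", "Con8"), gd_of_birth = c(20L, 19L))

select(df, gd_of_birth)  # returns a one-column tibble
pull(df, gd_of_birth)    # returns the underlying integer vector
```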
Learning Assessment: In the pups data, select the columns containing litter number, sex, and PD ears.
filter
Some data tables will include rows you don’t need for your current analysis. Although you could remove specific row numbers using base R, you shouldn’t – this might break if the raw data are updated, and the thought process isn’t transparent. Instead, you should filter rows based on logical expressions using the filter function. Like select, the first argument to filter is the dataframe you’re filtering; all subsequent arguments are logical expressions.
You will often filter using comparison operators (>, >=, <, <=, ==, and !=). You may also use %in% to detect if values appear in a set, and is.na() to find missing values. The results of comparisons are logical – the statement is TRUE or FALSE depending on the values you compare – and can be combined with other comparisons using the logical operators & and |, or negated using !.
Some ways you might filter the litters data are:
gd_of_birth == 20
pups_born_alive >= 2
pups_survive != 4
!(pups_survive == 4)
group %in% c("Con7", "Con8")
group == "Con7" & gd_of_birth == 20
!is.na(wt_increase)
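In practice, each expression above goes inside a filter() call. A minimal sketch on a made-up toy tibble (litters_toy is invented here; the same call works on litters_data):

```r
library(dplyr)
library(tibble)

# Made-up stand-in for litters_data
litters_toy = tibble(
  group           = c("Con7", "Con7", "Mod8"),
  gd_of_birth     = c(20L, 19L, 20L),
  pups_born_alive = c(3L, 8L, 6L)
)

# Keep Con7 litters that gave birth on gestational day 20
filter(litters_toy, group == "Con7" & gd_of_birth == 20)
```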
Filtering can be helpful for limiting a dataset to only those observations needed for an analysis. However, I recommend against the creation of many data subsets (e.g. one for each group). This can clutter up your workspace, and we’ll see good tools for the analysis of subsets before long.
Learning Assessment: In the pups data:
mutate
Sometimes you need to select columns; sometimes you need to change them or create new ones. You can do this using mutate.
The example below creates a new variable measuring the difference between gd18_weight and gd0_weight and modifies the existing group variable.
mutate(litters_data,
wt_gain = gd18_weight - gd0_weight,
group = tolower(group)
)
## # A tibble: 49 x 9
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 con7 #85 19.7 34.7 20 3
## 2 con7 #1/2/95/2 27 42 19 8
## 3 con7 #5/5/3/83/3-3 26 41.4 19 6
## # ... with 46 more rows, and 3 more variables: pups_dead_birth <int>,
## # pups_survive <int>, wt_gain <dbl>
One thing in this example worth noting: creating a new variable that does exactly what you need can be a challenge; the more functions you know about, the easier this gets.
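For instance, base R functions such as round() and tolower() can be combined with arithmetic inside a single mutate() call. A sketch on a made-up toy tibble:

```r
library(dplyr)
library(tibble)

# Made-up stand-in for litters_data
litters_toy = tibble(
  group       = c("Con7", "Mod8"),
  gd0_weight  = c(19.7, 27.0),
  gd18_weight = c(34.7, 42.0)
)

mutate(litters_toy,
  wt_gain = round(gd18_weight - gd0_weight, digits = 1),  # new variable
  group   = tolower(group)                                # overwrite existing one
)
```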
Learning Assessment: In the pups data:
arrange
In comparison to the preceding, arranging is pretty straightforward. You can arrange the rows in your data according to the values in one or more columns:
head(arrange(litters_data, group, pups_born_alive), 10)
## # A tibble: 10 x 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 Con7 #85 19.7 34.7 20 3
## 2 Con7 #5/4/2/95/2 28.5 44.1 19 5
## 3 Con7 #5/5/3/83/3-3 26 41.4 19 6
## 4 Con7 #4/2/95/3-3 NA NA 20 6
## 5 Con7 #2/2/95/3-2 NA NA 20 6
## 6 Con7 #1/2/95/2 27 42 19 8
## 7 Con7 #1/5/3/83/3-3… NA NA 20 9
## 8 Con8 #2/2/95/2 NA NA 19 5
## 9 Con8 #1/6/2/2/95-2 NA NA 20 7
## 10 Con8 #3/6/2/2/95-3 NA NA 20 7
## # ... with 2 more variables: pups_dead_birth <int>, pups_survive <int>
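arrange() sorts in ascending order by default; wrapping a column in desc() reverses that. A sketch on a made-up toy tibble:

```r
library(dplyr)
library(tibble)

# Made-up stand-in for litters_data
litters_toy = tibble(
  group           = c("Con7", "Con8", "Con7"),
  pups_born_alive = c(3L, 5L, 8L)
)

# Sort by group (ascending), then by pups_born_alive (descending) within group
arrange(litters_toy, group, desc(pups_born_alive))
```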
%>%
We’ve seen several commands you will use regularly for data manipulation and cleaning. You will rarely use them in isolation. For example, suppose you want to load the data, clean the column names, remove pups_survive, and create wt_gain. There are a couple of options for this kind of multi-step data manipulation: you can define a new intermediate dataset at each step, or you can nest the function calls.
The following is an example of the first option:
litters_data_raw = read_csv("./data/FAS_litters.csv",
col_types = "ccddiiii")
litters_data_clean_names = clean_names(litters_data_raw)
litters_data_selected_cols = select(litters_data_clean_names, -pups_survive)
litters_data_with_vars = mutate(litters_data_selected_cols,
wt_gain = gd18_weight - gd0_weight,
group = tolower(group))
litters_data_with_vars
## # A tibble: 49 x 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 con7 #85 19.7 34.7 20 3
## 2 con7 #1/2/95/2 27 42 19 8
## 3 con7 #5/5/3/83/3-3 26 41.4 19 6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## # wt_gain <dbl>
Below, we try the second option:
litters_data_clean =
mutate(
select(
clean_names(
read_csv("./data/FAS_litters.csv", col_types = "ccddiiii")
),
-pups_survive
),
wt_gain = gd18_weight - gd0_weight,
group = tolower(group)
)
litters_data_clean
## # A tibble: 49 x 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 con7 #85 19.7 34.7 20 3
## 2 con7 #1/2/95/2 27 42 19 8
## 3 con7 #5/5/3/83/3-3 26 41.4 19 6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## # wt_gain <dbl>
Both options are problematic: the first clutters our workspace with intermediate datasets, and the second has to be read inside out.
Piping solves this problem. It allows you to turn the nested approach into a sequential chain by passing the result of one function call as an argument to the next function call:
litters_data =
read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
clean_names() %>%
select(-pups_survive) %>%
mutate(
wt_gain = gd18_weight - gd0_weight,
group = tolower(group))
litters_data
## # A tibble: 49 x 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 con7 #85 19.7 34.7 20 3
## 2 con7 #1/2/95/2 27 42 19 8
## 3 con7 #5/5/3/83/3-3 26 41.4 19 6
## # ... with 46 more rows, and 2 more variables: pups_dead_birth <int>,
## # wt_gain <dbl>
All three approaches result in the same dataset, but the piped commands are by far the most straightforward. The easiest way to read %>% is “then”; the keyboard shortcuts are Ctrl + Shift + M (Windows) and Cmd + Shift + M (Mac).
The functions in dplyr (and much of the tidyverse) are designed to work smoothly with the pipe operator. By default, the pipe will take the result of one function call and use that as the first argument of the next function call; by design, functions in dplyr take a tibble as an input and return a tibble as a result. As a consequence, functions in dplyr are easy to connect in a data cleaning chain. You can make this more explicit by using . as a placeholder for the result of the preceding call:
litters_data =
read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
clean_names(dat = .) %>%
select(.data = ., -pups_survive) %>%
mutate(.data = .,
wt_gain = gd18_weight - gd0_weight,
group = tolower(group))
Usually you’ll just elide the first argument and be happy with life, but there are some cases where the placeholder will help. For example, to regress wt_gain on pups_born_alive, you might use:
litters_data %>%
lm(wt_gain ~ pups_born_alive, data = .) %>%
broom::tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 13.0833132 1.2707022 10.296128 3.390028e-11
## 2 pups_born_alive 0.6050786 0.1732314 3.492892 1.553186e-03
There are limitations to the pipe. You shouldn’t have sequences that are too long; there isn’t a great way to deal with multiple inputs and outputs; and (since it’s not base R) not everyone will know what %>% means or does. That said, compared to the days when R users only had the first two options, life is much better!
Learning Assessment: Write a chain of commands that:
There’s lots of stuff out there on how to clean your data using dplyr. The code that I used to produce working examples in lecture is here.