The overarching goal of data wrangling is to have a tidy, easy-to-use dataset.
This is the third module in the Data Wrangling I topic; the relevant slack channel is here.
I’ll keep using the same R Project as in data import and data manipulation, but create a new .Rmd for tidying. I’m also going to load some relevant packages and limit the number of lines printed in a tibble. Lastly, there’s another dataset we’ll use in examples today; I’m going to download it here and add it to my data subdirectory.
library(tidyverse)
library(haven)
library(readxl)
library(janitor)
options(tibble.print_min = 5)
gather
In data import, we used the haven package to load the PULSE biomarkers dataset from a .sas7bdat file. Let’s reload those data and take a closer look:
pulse_data = read_sas("./data/public_pulse_data.sas7bdat") %>%
clean_names()
pulse_data
## # A tibble: 1,087 x 7
## id age sex bdi_score_bl bdi_score_01m bdi_score_06m bdi_score_12m
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 10003 48.0 male 7 1 2 0
## 2 10015 72.5 male 6 NA NA NA
## 3 10022 58.5 male 14 3 8 NA
## 4 10026 72.7 male 20 6 18 16
## 5 10035 60.4 male 4 0 1 2
## # ... with 1,082 more rows
With our new understanding of tidy data, we quickly recognize a problem: the BDI score is spread across four columns, which correspond to four observation times. We can fix this problem using gather:
pulse_tidy_data = gather(pulse_data, key = visit, value = bdi, bdi_score_bl:bdi_score_12m)
## Warning: attributes are not identical across measure variables;
## they will be dropped
pulse_tidy_data
## # A tibble: 4,348 x 5
## id age sex visit bdi
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 10003 48.0 male bdi_score_bl 7
## 2 10015 72.5 male bdi_score_bl 6
## 3 10022 58.5 male bdi_score_bl 14
## 4 10026 72.7 male bdi_score_bl 20
## 5 10035 60.4 male bdi_score_bl 4
## # ... with 4,343 more rows
This looks much better! However, you should notice that this produced a warning. The warning seemed pretty specific but I didn’t understand it, so I googled “tidyr gather attributes are not identical across measure variables; they will be dropped” and found this. After some reading (and maybe a bit more googling), it started to sound like the columns containing BDI scores had different attributes, and indeed they do:
str(pulse_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1087 obs. of 7 variables:
## $ id : num 10003 10015 10022 10026 10035 ...
## $ age : num 48 72.5 58.5 72.7 60.4 ...
## ..- attr(*, "label")= chr "Age at baseline"
## ..- attr(*, "format.sas")= chr "AGEA"
## $ sex : chr "male" "male" "male" "male" ...
## $ bdi_score_bl : num 7 6 14 20 4 2 4 5 0 10 ...
## ..- attr(*, "label")= chr "BDI, BL IMPUTED SCORE"
## ..- attr(*, "format.sas")= chr "F"
## $ bdi_score_01m: num 1 NA 3 6 0 10 0 NA 3 2 ...
## ..- attr(*, "label")= chr "BDI, 1 MONTH IMPUTED SCORE"
## ..- attr(*, "format.sas")= chr "F"
## $ bdi_score_06m: num 2 NA 8 18 1 12 NA 0 4 11 ...
## ..- attr(*, "label")= chr "BDI, 6 MONTH IMPUTED SCORE"
## ..- attr(*, "format.sas")= chr "F"
## $ bdi_score_12m: num 0 NA NA 16 2 8 NA 2 0 6 ...
## ..- attr(*, "label")= chr "BDI, 12 MONTH IMPUTED SCORE"
## ..- attr(*, "format.sas")= chr "F"
Each BDI score column has a specific label in the SAS dataset; these don’t match, so gather dropped them when creating the bdi column. Not a problem here, but dropping attributes could be an issue if you wanted to preserve dates, factors, or some other feature.
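If you know the attributes are safe to discard and want to avoid the warning, one option (just a sketch; we don’t need this here) is to strip them from the measure columns before gathering:
pulse_data %>%
  mutate_at(vars(starts_with("bdi")), as.numeric) %>%  # as.numeric drops the SAS label / format attributes
  gather(key = visit, value = bdi, bdi_score_bl:bdi_score_12m)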
separate
The other issue with the PULSE data, now, is visit. The original column names were informative but are not necessarily the format we would need for analysis. I’m going to split visit into two columns using separate:
separate(pulse_tidy_data, visit, into = c("remove", "visit"), sep = "_")
## Warning: Expected 2 pieces. Additional pieces discarded in 4348 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## # A tibble: 4,348 x 6
## id age sex remove visit bdi
## <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 10003 48.0 male bdi score 7
## 2 10015 72.5 male bdi score 6
## 3 10022 58.5 male bdi score 14
## 4 10026 72.7 male bdi score 20
## 5 10035 60.4 male bdi score 4
## # ... with 4,343 more rows
This isn’t really the optimal way to isolate “visit”: the warning tells us that separate expected two pieces but found three, so the third piece (the actual visit label: bl, 01m, and so on) was discarded, and I’m also left with an unnecessary character variable remove that requires an additional select step. It does illustrate the separate function, though, and we haven’t yet learned about str_replace and other functions from the stringr package that are better suited to this problem. In the complete pipeline below I’ll split at a character position instead, which keeps the visit label intact.
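As a preview of that stringr-based approach (just a sketch, using a function we haven’t formally covered yet), you could strip the common prefix directly rather than separating at all:
pulse_tidy_data %>%
  mutate(visit = str_replace(visit, "bdi_score_", ""))  # "bdi_score_bl" becomes "bl", etc.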
In the preceding I’ve saved intermediate datasets to make each step clear. While this can be a helpful crutch as you’re trying out code, it is generally bad practice. There are also some additional transformations needed to wrap up the data wrangling process, like changing bl to 00m for consistency across visits and converting visit to a factor variable. (It’s possible that you would want visit to be a numeric variable instead, which could be done with a different call to mutate.) Lastly, it’s nice to organize the data into a reasonable order.
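For reference, a sketch of that numeric alternative (starting from the gathered data above) might recode the original column names directly to months:
pulse_tidy_data %>%
  mutate(visit = recode(visit,
                        "bdi_score_bl"  = 0,
                        "bdi_score_01m" = 1,
                        "bdi_score_06m" = 6,
                        "bdi_score_12m" = 12))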
Altogether, then, the code below will import, tidy, and transform the PULSE dataset into a usable format:
pulse_data = read_sas("./data/public_pulse_data.sas7bdat") %>%
  clean_names() %>%
  gather(key = visit, value = bdi, bdi_score_bl:bdi_score_12m) %>%
  separate(visit, into = c("remove", "visit"), sep = 10) %>%  # split just after the "bdi_score_" prefix
  select(id, visit, everything(), -remove) %>%
  mutate(visit = replace(visit, visit == "bl", "00m"),
         visit = factor(visit, levels = paste0(c("00", "01", "06", "12"), "m"))) %>%
  arrange(id, visit)
## Warning: attributes are not identical across measure variables;
## they will be dropped
print(pulse_data, n = 12)
## # A tibble: 4,348 x 5
## id visit age sex bdi
## <dbl> <fct> <dbl> <chr> <dbl>
## 1 10003 00m 48.0 male 7
## 2 10003 01m 48.0 male 1
## 3 10003 06m 48.0 male 2
## 4 10003 12m 48.0 male 0
## 5 10015 00m 72.5 male 6
## 6 10015 01m 72.5 male NA
## 7 10015 06m 72.5 male NA
## 8 10015 12m 72.5 male NA
## 9 10022 00m 58.5 male 14
## 10 10022 01m 58.5 male 3
## 11 10022 06m 58.5 male 8
## 12 10022 12m 58.5 male NA
## # ... with 4,336 more rows
Before moving on, let’s revisit the group variable in the litters dataset:
read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
clean_names() %>%
pull(group) %>%
table()
## .
## Con7 Con8 Low7 Low8 Mod7 Mod8
## 7 8 8 7 12 7
These data are also untidy: group encodes both dose and day of treatment! Time to fix that …
litters_data =
read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
clean_names() %>%
separate(group, into = c("dose", "day_of_tx"), sep = 3) %>%
mutate(dose = tolower(dose),
wt_gain = gd18_weight - gd0_weight) %>%
arrange(litter_number)
litters_data
## # A tibble: 49 x 10
## dose day_of_tx litter_number gd0_weight gd18_weight gd_of_birth
## <chr> <chr> <chr> <dbl> <dbl> <int>
## 1 con 7 #1/2/95/2 27 42 19
## 2 con 7 #1/5/3/83/3-3/2 NA NA 20
## 3 con 8 #1/6/2/2/95-2 NA NA 20
## 4 mod 7 #1/82/3-2 NA NA 19
## 5 low 8 #100 20 39.2 20
## # ... with 44 more rows, and 4 more variables: pups_born_alive <int>,
## # pups_dead_birth <int>, pups_survive <int>, wt_gain <dbl>
Now we’re in pretty good shape :-).
Learning Assessment: In the litters data, the variables gd0_weight and gd18_weight give the weight of the mother mouse on gestational days 0 and 18. Write a data cleaning chain that retains only litter_number and these columns; produces new variables gd and weight; and makes gd a numeric variable taking values 0 and 18 (for the last part, you might want to use recode …). Is this version “tidy”?
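One possible solution sketch (there are certainly others; this assumes the litters CSV is read in as before):
read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
  clean_names() %>%
  select(litter_number, gd0_weight, gd18_weight) %>%
  gather(key = gd, value = weight, gd0_weight:gd18_weight) %>%
  mutate(gd = recode(gd, "gd0_weight" = 0, "gd18_weight" = 18))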
spread
We’ve been exclusively interested in tidying data, but we’ve admitted that sometimes untidy is better for human consumption. For that reason we’re going to take a short digression into untidying your tidy data.
The code below creates a tidy dataset that could result from an analysis. This is the correct format for additional analysis or visualization, but doesn’t facilitate quick comparisons for human readers.
analysis_result = tibble(
group = c("treatment", "treatment", "placebo", "placebo"),
time = c("pre", "post", "pre", "post"),
mean = c(4, 8, 3.5, 4)
)
analysis_result
## # A tibble: 4 x 3
## group time mean
## <chr> <chr> <dbl>
## 1 treatment pre 4
## 2 treatment post 8
## 3 placebo pre 3.5
## 4 placebo post 4
An alternative presentation of the same data might have groups in rows, times in columns, and mean values in table cells. This is decidedly non-tidy; to get there from here we’ll need to use spread, which is the inverse of gather:
spread(analysis_result, key = time, value = mean)
## # A tibble: 2 x 3
## group post pre
## <chr> <dbl> <dbl>
## 1 placebo 4 3.5
## 2 treatment 8 4
We’re pretty much there now, although you could use select to reorder columns, and (depending on your goal) use knitr::kable() to produce a nicer table for reading.
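For example, a quick sketch of that final polishing step:
analysis_result %>%
  spread(key = time, value = mean) %>%
  select(group, pre, post) %>%  # put "pre" before "post"
  knitr::kable()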
We’ve looked at single-table non-tidy data, but non-tidiness often stems from relevant data spread across multiple tables. In the simplest case, these tables are basically the same and can be stacked to produce a tidy dataset. That’s the setting in LotR_Words.xlsx, where the word counts for different races and sexes in each movie in the trilogy are spread across distinct data rectangles (these data are based on this example).
To produce the desired tidy dataset, we first need to read each table and do some cleaning.
fellowship_ring = read_excel("./data/LotR_Words.xlsx", range = "B3:D6") %>%
clean_names() %>%
gather(key = sex, value = words, female:male) %>%
mutate(race = tolower(race),
movie = "Fellowship")
two_towers = read_excel("./data/LotR_Words.xlsx", range = "F3:H6") %>%
clean_names() %>%
gather(key = sex, value = words, female:male) %>%
mutate(race = tolower(race),
movie = "Two Towers")
return_king = read_excel("./data/LotR_Words.xlsx", range = "J3:L6") %>%
clean_names() %>%
gather(key = sex, value = words, female:male) %>%
mutate(race = tolower(race),
movie = "Return")
This illustrates that there’s often some work needed to ensure the separate tables can be reasonably stacked. As an aside, the three code snippets above are all basically the same except for the range and the movie name – later we’ll see a better way to handle cases like this by writing our own functions, but this works for now.
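As a preview of that function-based approach (a sketch only; lotr_reader is a hypothetical helper, and we’ll cover writing functions properly later):
lotr_reader = function(cell_range, movie_name) {
  # read one data rectangle, tidy it, and label it with the movie name
  read_excel("./data/LotR_Words.xlsx", range = cell_range) %>%
    clean_names() %>%
    gather(key = sex, value = words, female:male) %>%
    mutate(race = tolower(race),
           movie = movie_name)
}

fellowship_ring = lotr_reader("B3:D6", "Fellowship")
two_towers = lotr_reader("F3:H6", "Two Towers")
return_king = lotr_reader("J3:L6", "Return")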
Once each table is ready to go, we can stack them up using bind_rows:
lotr_tidy = bind_rows(fellowship_ring, two_towers, return_king) %>%
select(movie, everything())
lotr_tidy
## # A tibble: 18 x 4
## movie race sex words
## <chr> <chr> <chr> <dbl>
## 1 Fellowship elf female 1229
## 2 Fellowship hobbit female 14
## 3 Fellowship man female 0
## 4 Fellowship elf male 971
## 5 Fellowship hobbit male 3644
## 6 Fellowship man male 1995
## 7 Two Towers elf female 331
## 8 Two Towers hobbit female 0
## 9 Two Towers man female 401
## 10 Two Towers elf male 513
## 11 Two Towers hobbit male 2463
## 12 Two Towers man male 3589
## 13 Return elf female 183
## 14 Return hobbit female 2
## 15 Return man female 268
## 16 Return elf male 510
## 17 Return hobbit male 2673
## 18 Return man male 2459
Having the data in this form will make it easier to make comparisons across movies, aggregate within races across the trilogy, and perform other analyses.
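For instance, a quick sketch of one such summary, aggregating words within each race across the trilogy:
lotr_tidy %>%
  group_by(race) %>%
  summarize(total_words = sum(words))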
Data can be spread across multiple related tables, in which case it is necessary to combine or join them prior to analysis. We’ll focus on the problem of combining two tables only; combining three or more is done step-by-step using the same ideas.
There are four major ways to join dataframes x and y:

* Inner: keeps data from rows that exist in both x and y
* Left: keeps data from rows that exist in x
* Right: keeps data from rows that exist in y
* Full: keeps data from rows that exist in either x or y

Left joins are the most common, because they add data from a smaller table y into a larger table x without removing anything from x.
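The toy sketch below (made-up tibbles, not part of our data) shows how the four verbs differ:
x = tibble(id = c(1, 2, 3), value_x = c("a", "b", "c"))
y = tibble(id = c(2, 3, 4), value_y = c("d", "e", "f"))

inner_join(x, y, by = "id")  # rows with id 2, 3 (in both tables)
left_join(x, y, by = "id")   # rows with id 1, 2, 3 (everything in x)
right_join(x, y, by = "id")  # rows with id 2, 3, 4 (everything in y)
full_join(x, y, by = "id")   # rows with id 1, 2, 3, 4 (everything in either)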
As an example, consider the data tables in FAS_pups.csv and FAS_litters.csv, which are related through the Litter Number variable. The former contains data unique to each pup, and the latter contains data unique to each litter. We can combine these using a left join of litter data into pup data; doing so retains data on each pup and adds data in new columns.
pup_data = read_csv("./data/FAS_pups.csv", col_types = "ciiiii") %>%
  clean_names() %>%
  mutate(sex = recode(sex, `1` = "male", `2` = "female"))

litter_data = read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
  clean_names() %>%
  select(-pups_survive) %>%
  mutate(
    wt_gain = gd18_weight - gd0_weight,
    group = tolower(group))
FAS_data = left_join(pup_data, litter_data, by = "litter_number")
FAS_data
## # A tibble: 313 x 13
## litter_number sex pd_ears pd_eyes pd_pivot pd_walk group gd0_weight
## <chr> <chr> <int> <int> <int> <int> <chr> <dbl>
## 1 #85 male 4 13 7 11 con7 19.7
## 2 #85 male 4 13 7 12 con7 19.7
## 3 #1/2/95/2 male 5 13 7 9 con7 27
## 4 #1/2/95/2 male 5 13 8 10 con7 27
## 5 #5/5/3/83/3-3 male 5 13 8 10 con7 26
## # ... with 308 more rows, and 5 more variables: gd18_weight <dbl>,
## # gd_of_birth <int>, pups_born_alive <int>, pups_dead_birth <int>,
## # wt_gain <dbl>
We made the key explicit in the join. By default, the *_join functions in dplyr will try to determine the key(s) based on variable names in the datasets you want to join. This is often but not always sufficient, and an extra step to make the key clear will help you and others reading your code.
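Being explicit also helps when the key columns have different names in the two tables. A hypothetical sketch (litter_id doesn’t exist in these files; it’s only for illustration):
litter_data_renamed = rename(litter_data, litter_id = litter_number)
left_join(pup_data, litter_data_renamed, by = c("litter_number" = "litter_id"))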
Note that joining is not particularly amenable to the %>% operator because it is fundamentally non-linear: two separate datasets are coming together, rather than a single dataset being processed in a step-by-step fashion.
As a final point, the *_join functions are very much related to SQL syntax, but emphasize operations common to data analysis.