Tidy data and relational datasets

The overarching goal of data wrangling is to have a tidy, easy-to-use dataset.

This is the third module in the Data Wrangling I topic; the relevant slack channel is here.

Some slides

Tidy Data from Jeff Goldsmith.

Example

I’ll keep using the same R Project as in data import and data manipulation, but create a new .Rmd for tidying. I’m also going to load some relevant packages, and limit the number of lines printed in a tibble. Lastly, there’s another dataset we’ll use in examples today; I’m going to download it here and add it to my data subdirectory.

library(tidyverse)
library(haven)
library(readxl)
library(janitor)

options(tibble.print_min = 5)

`gather`

In data import, we used the haven package to load the PULSE biomarkers dataset from a .sas7bdat. Let’s reload those data and take a closer look:

pulse_data = read_sas("./data/public_pulse_data.sas7bdat") %>%
  clean_names()
pulse_data
## # A tibble: 1,087 x 7
##      id   age sex   bdi_score_bl bdi_score_01m bdi_score_06m bdi_score_12m
##   <dbl> <dbl> <chr>        <dbl>         <dbl>         <dbl>         <dbl>
## 1 10003  48.0 male             7             1             2             0
## 2 10015  72.5 male             6            NA            NA            NA
## 3 10022  58.5 male            14             3             8            NA
## 4 10026  72.7 male            20             6            18            16
## 5 10035  60.4 male             4             0             1             2
## # ... with 1,082 more rows

With our new understanding of tidy data, we quickly recognize a problem: the BDI score is spread across four columns, which correspond to four observation times. We can fix this problem using gather:

pulse_tidy_data = gather(pulse_data, key = visit, value = bdi, bdi_score_bl:bdi_score_12m)
## Warning: attributes are not identical across measure variables;
## they will be dropped
pulse_tidy_data
## # A tibble: 4,348 x 5
##      id   age sex   visit          bdi
##   <dbl> <dbl> <chr> <chr>        <dbl>
## 1 10003  48.0 male  bdi_score_bl     7
## 2 10015  72.5 male  bdi_score_bl     6
## 3 10022  58.5 male  bdi_score_bl    14
## 4 10026  72.7 male  bdi_score_bl    20
## 5 10035  60.4 male  bdi_score_bl     4
## # ... with 4,343 more rows

This looks much better! However, you should notice that this produced a warning. The warning seemed pretty specific but I didn’t understand it, so I googled “tidyr gather attributes are not identical across measure variables; they will be dropped” and found this. After some reading (and maybe a bit more googling), it started to sound like the columns containing BDI scores had different attributes, and indeed they do:

str(pulse_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1087 obs. of  7 variables:
##  $ id           : num  10003 10015 10022 10026 10035 ...
##  $ age          : num  48 72.5 58.5 72.7 60.4 ...
##   ..- attr(*, "label")= chr "Age at baseline"
##   ..- attr(*, "format.sas")= chr "AGEA"
##  $ sex          : chr  "male" "male" "male" "male" ...
##  $ bdi_score_bl : num  7 6 14 20 4 2 4 5 0 10 ...
##   ..- attr(*, "label")= chr "BDI, BL IMPUTED SCORE"
##   ..- attr(*, "format.sas")= chr "F"
##  $ bdi_score_01m: num  1 NA 3 6 0 10 0 NA 3 2 ...
##   ..- attr(*, "label")= chr "BDI, 1 MONTH IMPUTED SCORE"
##   ..- attr(*, "format.sas")= chr "F"
##  $ bdi_score_06m: num  2 NA 8 18 1 12 NA 0 4 11 ...
##   ..- attr(*, "label")= chr "BDI, 6 MONTH IMPUTED SCORE"
##   ..- attr(*, "format.sas")= chr "F"
##  $ bdi_score_12m: num  0 NA NA 16 2 8 NA 2 0 6 ...
##   ..- attr(*, "label")= chr "BDI, 12 MONTH IMPUTED SCORE"
##   ..- attr(*, "format.sas")= chr "F"

Each BDI score column has a specific label in the SAS dataset; these don’t match, so gather dropped them when creating the bdi column. Not a problem here, but dropping attributes could be an issue if you wanted to preserve dates, factors, or some other feature.

`separate`

The other issue with the PULSE data, now, is visit. The original column names were informative but are not necessarily the format we would need for analysis. I’m going to split visit into two columns using separate:

separate(pulse_tidy_data, visit, into = c("remove", "visit"), sep = "_")
## Warning: Expected 2 pieces. Additional pieces discarded in 4348 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## # A tibble: 4,348 x 6
##      id   age sex   remove visit   bdi
##   <dbl> <dbl> <chr> <chr>  <chr> <dbl>
## 1 10003  48.0 male  bdi    score     7
## 2 10015  72.5 male  bdi    score     6
## 3 10022  58.5 male  bdi    score    14
## 4 10026  72.7 male  bdi    score    20
## 5 10035  60.4 male  bdi    score     4
## # ... with 4,343 more rows

This isn’t really the optimal way to isolate “visit” because I now need an additional select step to remove the unnecessary character variable remove. However, it does illustrate the separate function, and we haven’t yet learned about str_replace and other functions from the stringr package that are better suited to this problem, so …

In the preceding I’ve saved intermediate datasets to make each step clear. While this can be a helpful crutch as you’re trying out code, it is generally bad practice. There are also some additional transformations needed to wrap up the data wrangling process, like changing bl to 00m for consistency across visits and converting visit to a factor variable. (It’s possible that you would want visit to be a numeric variable instead, which could be done with a different call to mutate.) Lastly, it’s nice to organize the data into a reasonable order.

Altogether, then, the code below will import, tidy, and transform the PULSE dataset into a usable format:

pulse_data = read_sas("./data/public_pulse_data.sas7bdat") %>%
  clean_names() %>%
  gather(key = visit, value = bdi, bdi_score_bl:bdi_score_12m) %>%
  separate(visit, into = c("remove", "visit"), sep = "_") %>%
  select(id, visit, everything(), -remove) %>%
  mutate(visit = replace(visit, visit == "bl", "00m"),
         visit = factor(visit, levels = paste0(c("00", "01", "06", "12"), "m"))) %>%
  arrange(id, visit)
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Expected 2 pieces. Additional pieces discarded in 4348 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

print(pulse_data, n = 12)
## # A tibble: 4,348 x 5
##       id visit   age sex     bdi
##    <dbl> <fct> <dbl> <chr> <dbl>
##  1 10003 <NA>   48.0 male      7
##  2 10003 <NA>   48.0 male      1
##  3 10003 <NA>   48.0 male      2
##  4 10003 <NA>   48.0 male      0
##  5 10015 <NA>   72.5 male      6
##  6 10015 <NA>   72.5 male     NA
##  7 10015 <NA>   72.5 male     NA
##  8 10015 <NA>   72.5 male     NA
##  9 10022 <NA>   58.5 male     14
## 10 10022 <NA>   58.5 male      3
## 11 10022 <NA>   58.5 male      8
## 12 10022 <NA>   58.5 male     NA
## # ... with 4,336 more rows

Before moving on, let’s revisit the group variable in the litters dataset:

read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>% 
  clean_names() %>%
  pull(group) %>%
  table()
## .
## Con7 Con8 Low7 Low8 Mod7 Mod8 
##    7    8    8    7   12    7

These data are also untidy: group encodes both dose and day of treatment! Time to fix that …

litters_data = 
  read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>% 
  clean_names() %>%
  separate(group, into = c("dose", "day_of_tx"), sep = 3) %>%
  mutate(dose = tolower(dose),
         wt_gain = gd18_weight - gd0_weight) %>%
  arrange(litter_number)

litters_data
## # A tibble: 49 x 10
##   dose  day_of_tx litter_number   gd0_weight gd18_weight gd_of_birth
##   <chr> <chr>     <chr>                <dbl>       <dbl>       <int>
## 1 con   7         #1/2/95/2               27        42            19
## 2 con   7         #1/5/3/83/3-3/2         NA        NA            20
## 3 con   8         #1/6/2/2/95-2           NA        NA            20
## 4 mod   7         #1/82/3-2               NA        NA            19
## 5 low   8         #100                    20        39.2          20
## # ... with 44 more rows, and 4 more variables: pups_born_alive <int>,
## #   pups_dead_birth <int>, pups_survive <int>, wt_gain <dbl>

Now we’re in pretty good shape :-).

Learning Assessment: In the litters data, the variables gd0_weight and gd18_weight give the weight of the mother mouse on gestational days 0 and 18. Write a data cleaning chain that retains only litter_number and these columns; produces new variables gd and weight; and makes gd a numeric variable taking values 0 and 18 (for the last part, you might want to use recode …). Is this version “tidy”?

`spread`

We’ve been exclusively interested in tidying data, but we’ve admitted that sometimes untidy is better for human consumption. For that reason we’re going to take a short digression into untidying your tidy data.

The code below creates a tidy dataset that could result from an analysis. This is the correct format for additional analysis or visualization, but doesn’t facilitate quick comparisons for human readers.

analysis_result = tibble(
  group = c("treatment", "treatment", "placebo", "placebo"),
  time = c("pre", "post", "pre", "post"),
  mean = c(4, 8, 3.5, 4)
)

analysis_result
## # A tibble: 4 x 3
##   group     time   mean
##   <chr>     <chr> <dbl>
## 1 treatment pre     4  
## 2 treatment post    8  
## 3 placebo   pre     3.5
## 4 placebo   post    4

An alternative presentation of the same data might have groups in rows, times in columns, and mean values in table cells. This is decidedly non-tidy; to get there from here we’ll need to use spread, which is the inverse of gather:

spread(analysis_result, key = time, value = mean)
## # A tibble: 2 x 3
##   group      post   pre
##   <chr>     <dbl> <dbl>
## 1 placebo       4   3.5
## 2 treatment     8   4

We’re pretty much there now, although you could use select to reorder columns, and (depending on your goal) use knitr::kable() to produce a nicer table for reading.

Binding rows

We’ve looked at single-table non-tidy data, but non-tidiness often stems from relevant data spread across multiple tables. In the simplest case, these tables are basically the same and can be stacked to produce a tidy dataset. That’s the setting in LotR_words.xlsx, where the word counts for different races and sexes in each movie in the trilogy are spread across distinct data rectangles (these data are based on this example).

To produce the desired tidy dataset, we first need to read each table and do some cleaning.

fellowship_ring = read_excel("./data/LotR_Words.xlsx", range = "B3:D6") %>%
  clean_names() %>%
  gather(key = sex, value = words, female:male) %>%
  mutate(race = tolower(race),
         movie = "Fellowship")

two_towers = read_excel("./data/LotR_Words.xlsx", range = "F3:H6") %>%
  clean_names() %>%
  gather(key = sex, value = words, female:male) %>%
  mutate(race = tolower(race),
         movie = "Two Towers")

return_king = read_excel("./data/LotR_Words.xlsx", range = "J3:L6") %>%
  clean_names() %>%
  gather(key = sex, value = words, female:male) %>%
  mutate(race = tolower(race),
         movie = "Return")

This illustrates that there’s often some work needed to ensure the separate tables can be reasonably stacked. As an aside, the three code snippets above are all basically the same except for the range and the movie name – later we’ll see a better way to handle cases like this by writing our own functions, but this works for now.

Once each table is ready to go, we can stack them up using bind_rows:

lotr_tidy = bind_rows(fellowship_ring, two_towers, return_king) %>%
  select(movie, everything()) 

lotr_tidy
## # A tibble: 18 x 4
##    movie      race   sex    words
##    <chr>      <chr>  <chr>  <dbl>
##  1 Fellowship elf    female  1229
##  2 Fellowship hobbit female    14
##  3 Fellowship man    female     0
##  4 Fellowship elf    male     971
##  5 Fellowship hobbit male    3644
##  6 Fellowship man    male    1995
##  7 Two Towers elf    female   331
##  8 Two Towers hobbit female     0
##  9 Two Towers man    female   401
## 10 Two Towers elf    male     513
## 11 Two Towers hobbit male    2463
## 12 Two Towers man    male    3589
## 13 Return     elf    female   183
## 14 Return     hobbit female     2
## 15 Return     man    female   268
## 16 Return     elf    male     510
## 17 Return     hobbit male    2673
## 18 Return     man    male    2459

Having the data in this form will make it easier to make comparisons across movies, aggregate within races across the trilogy, and perform other analyses.

Joining datasets

Data can be spread across multiple related tables, in which case it is necessary to combine or join them prior to analysis. We’ll focus on the problem of combining two tables only; combining three or more is done step-by-step using the same ideas.

There are four major ways join dataframes x and y:

Inner: keeps data that appear in both x and y
Left: keeps data that appear in x
Right: keeps data that appear in y
Full: keeps data that appear in either x or y

Left joins are the most common, because they add data from a smaller table y into a larger table x without removing anything from x.

As an example, consider the data tables in FAS_pups.csv and FAS_litters.csv, which are related through the Litter Number variable. The former contains data unique to each pup, and the latter contains data unique to each litter. We can combine these using a left join of litter data into pup data; doing so retains data on each pup and adds data in new columns.

pup_data = read_csv("./data/FAS_pups.csv", col_types = "ciiiii") %>%
  clean_names() %>%
  mutate(sex = recode(sex, `1` = "male", `2` = "female")) 

litter_data = read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>%
  clean_names %>%
  select(-pups_survive) %>%
  mutate(
    wt_gain = gd18_weight - gd0_weight,
    group = tolower(group))

FAS_data = left_join(pup_data, litter_data, by = "litter_number")
FAS_data
## # A tibble: 313 x 13
##   litter_number sex   pd_ears pd_eyes pd_pivot pd_walk group gd0_weight
##   <chr>         <chr>   <int>   <int>    <int>   <int> <chr>      <dbl>
## 1 #85           male        4      13        7      11 con7        19.7
## 2 #85           male        4      13        7      12 con7        19.7
## 3 #1/2/95/2     male        5      13        7       9 con7        27  
## 4 #1/2/95/2     male        5      13        8      10 con7        27  
## 5 #5/5/3/83/3-3 male        5      13        8      10 con7        26  
## # ... with 308 more rows, and 5 more variables: gd18_weight <dbl>,
## #   gd_of_birth <int>, pups_born_alive <int>, pups_dead_birth <int>,
## #   wt_gain <dbl>

We made the key explicit in the join. By default, the *_join functions in dplyr will try to determine the key(s) based on variable names in the datasets you want to join. This is often but not always sufficient, and an extra step to make the key clear will help you and others reading your code.

Note that joining is not particularly amenable to the %>% operator because it is fundamentally non-linear: two separate datasets are coming together, rather than a single dataset being processed in a step-by-step fashion.

As a final point, the *_join functions are very much related to SQL syntax, but emphasize operations common to data analysis.

Other materials

R for Data Science, of course, has a chapter on tidy data. The paper that lays out the underlying ideas may also be useful.
R for Data Science also has an excellent and very detailed presentation of joins.
Jenny Bryan’s Stat 545 class has content on tidy data – parts 1, 2, 3, and 4 are all good (some of the content above is very much related to this).
You should revisit the data import cheatsheet, which also has some tips for tidying.