Importing is the first step in wrangling.

This is the first module in the Data Wrangling I topic; the relevant Slack channel is here.

Some slides

Data Import from Jeff Goldsmith.


Example

We’re going to figure out how to import the three datasets in this zip file. First, create a directory for today’s work (I’m calling mine data_wrangling_i), add an R Project, move the data to your directory (perhaps in a sub-directory called data), and start an R Markdown file. We’re also going to be making extensive use of the tidyverse package, so go ahead and load that. Lastly, take a moment to install the packages janitor, haven, and readxl.
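
If you haven’t installed those yet, a one-time setup (run in the console, not in your R Markdown file) will do it:

install.packages(c("janitor", "haven", "readxl"))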

library(tidyverse)
## ── Attaching packages ─────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Paths

We’ve mentioned paths and your working directory in passing (when talking about best practices). A firmer understanding is helpful when you start loading data into R, because R has to know where a file lives before it can load it.

There are two kinds of paths:

  • Absolute: a file or folder’s “full address” on your computer
  • Relative: directions to a file or folder from your current working directory

Absolute paths are often easier, because you don’t really have to think about them – you’re just giving the complete address, starting from the root directory. These work from any current working directory on the machine. However, absolute paths can take up a lot of space if you use nested directories. And they aren’t portable: someone with the same directory and contents on their machine won’t have the same path to the directory. An absolute path example is below:

"/Users/jeffgoldsmith/Dropbox/Work/Teaching/P8105/DSI/data/FAS_litters.csv"

Relative paths, meanwhile, start from your current working directory. These are often much shorter, since the files you want to access are in subdirectories or at least not too far away. They also are portable, in that someone else with the same directory will have the same relative path to the related files. For both of these reasons, relative paths are preferred in almost every setting. The code below finds my current working directory:

getwd()
## [1] "/Users/jeffgoldsmith/Dropbox/Work/Teaching/P8105/DSI"

A relative path to the same file as above is

"./data/FAS_litters.csv"

The table below, copied from R Programming for Research, gives useful shorthand notation for relative pathnames.

Shorthand   Meaning
~           Home directory
.           Current working directory
..          One directory up from current working directory
../..       Two directories up from current working directory
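
As a quick illustration of the shorthand, both checks below point at the same file; the second assumes your working directory has moved one level deeper into the project:

file.exists("./data/FAS_litters.csv")   # relative to the current working directory
file.exists("../data/FAS_litters.csv")  # same file, from a directory one level deeper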

One note: autocomplete in RStudio works with both absolute and relative paths.

One caveat: if you’re going to both share code and use confidential data, you have to take extra steps to ensure data security. That can mean storing data outside of your shared project directory and using absolute paths. In these cases, set a “parent directory” at the outset of your code and use paths relative to it subsequently.
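
A minimal sketch of that pattern (the path below is a made-up example of a secure location outside the shared project):

parent_dir = "/Users/jeffgoldsmith/secure_data"  # absolute path, set once
litters_data = read_csv(file.path(parent_dir, "FAS_litters.csv"))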

Importing data tables

Now that we have paths handled, we can start loading data. We’re going to start with rectangular data tables (data in rows and columns, with data separated by a delimiter) saved in a plain text format. Of these, csv (comma separated value) files are most common, and others are handled in basically the same way. To import a csv, we’ll use a function from readr:

litters_data = readr::read_csv(file = "./data/FAS_litters.csv")
## Parsed with column specification:
## cols(
##   Group = col_character(),
##   `Litter Number` = col_character(),
##   `GD0 weight` = col_double(),
##   `GD18 weight` = col_double(),
##   `GD of Birth` = col_integer(),
##   `Pups born alive` = col_integer(),
##   `Pups dead @ birth` = col_integer(),
##   `Pups survive` = col_integer()
## )

Great – we’ve imported data! The first argument to read_csv is the path to the data, and the line above assigns the result of read_csv to litters_data. This function call also prints information about the column parsing. We’ll talk more about both of these shortly.

I (almost always) use janitor::clean_names() to clean up variable names after importing data. Doing so will take whatever the column names are and convert them to lower snake case.

names(litters_data)
## [1] "Group"             "Litter Number"     "GD0 weight"       
## [4] "GD18 weight"       "GD of Birth"       "Pups born alive"  
## [7] "Pups dead @ birth" "Pups survive"
litters_data = janitor::clean_names(litters_data)
names(litters_data)
## [1] "group"           "litter_number"   "gd0_weight"      "gd18_weight"    
## [5] "gd_of_birth"     "pups_born_alive" "pups_dead_birth" "pups_survive"

Learning Assessment: Make sure you are able to load the FAS_litters.csv dataset. Use both absolute and relative paths. Quit RStudio and move the directory containing your project, data, and R Markdown document. Repeat the previous data import process; do both absolute and relative paths still work?

Looking at data

The first thing to do after importing the data (unless read_csv gives warnings) is to look at it. If there are unexpected results during data import, you’ll catch a lot of them here. In addition to printing the data, I often use str, head, and tail:

litters_data
## # A tibble: 49 x 8
##    group litter_number  gd0_weight gd18_weight gd_of_birth pups_born_alive
##    <chr> <chr>               <dbl>       <dbl>       <int>           <int>
##  1 Con7  #85                  19.7        34.7          20               3
##  2 Con7  #1/2/95/2            27          42            19               8
##  3 Con7  #5/5/3/83/3-3        26          41.4          19               6
##  4 Con7  #5/4/2/95/2          28.5        44.1          19               5
##  5 Con7  #4/2/95/3-3          NA          NA            20               6
##  6 Con7  #2/2/95/3-2          NA          NA            20               6
##  7 Con7  #1/5/3/83/3-3…       NA          NA            20               9
##  8 Con8  #3/83/3-3            NA          NA            20               9
##  9 Con8  #2/95/3              NA          NA            20               8
## 10 Con8  #3/5/2/2/95          28.5        NA            20               8
## # ... with 39 more rows, and 2 more variables: pups_dead_birth <int>,
## #   pups_survive <int>
tail(litters_data, 5)
## # A tibble: 5 x 8
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 Low8  #100                20          39.2          20               8
## 2 Low8  #4/84               21.8        35.2          20               4
## 3 Low8  #108                25.6        47.5          20               8
## 4 Low8  #99                 23.5        39            20               6
## 5 Low8  #110                25.5        42.7          20               7
## # ... with 2 more variables: pups_dead_birth <int>, pups_survive <int>

Another tool that I use often is skimr::skim. The output is too wide to render well here, but it’s pretty slick! Note this isn’t available from CRAN, so installation is a bit more involved.

skimr::skim(litters_data)
## Skim summary statistics
##  n obs: 49 
##  n variables: 8 
## 
## Variable type: character 
##       variable missing complete  n min max empty n_unique
##          group       0       49 49   4   4     0        6
##  litter_number       0       49 49   3  15     0       49
## 
## Variable type: integer 
##         variable missing complete  n  mean   sd p0 p25 p50 p75 p100
##      gd_of_birth       0       49 49 19.65 0.48 19  19  20  20   20
##  pups_born_alive       0       49 49  7.35 1.76  3   6   8   8   11
##  pups_dead_birth       0       49 49  0.33 0.75  0   0   0   0    4
##     pups_survive       0       49 49  6.41 2.05  1   5   7   8    9
##      hist
##  ▅▁▁▁▁▁▁▇
##  ▂▂▃▃▇▅▁▁
##  ▇▂▁▁▁▁▁▁
##  ▂▂▃▃▅▇▇▅
## 
## Variable type: numeric 
##     variable missing complete  n  mean   sd   p0   p25   p50   p75 p100
##   gd0_weight      15       34 49 24.38 3.28 17   22.3  24.1  26.67 33.4
##  gd18_weight      17       32 49 41.52 4.05 33.4 38.88 42.25 43.8  52.7
##      hist
##  ▁▃▇▇▇▆▁▁
##  ▂▃▃▇▆▂▁▁

Arguments to read_*

In the best case, the data are stored in the csv without any weirdness – there are no blank lines or columns, the first row contains the variable names, and missing values are stored in sensible ways. When this isn’t the case, arguments to read_csv are helpful. The ones I use most frequently are:

  • col_names: usually TRUE. If FALSE, column names are X1, X2, … . You can also supply column names.
  • na: string vector containing character expressions for missing values.
  • skip: number of rows to skip before reading data.

For example, the call below will skip the first 10 lines of the file and not treat the first row as variable names:

litters_data = read_csv(file = "./data/FAS_litters.csv",
  skip = 10, col_names = FALSE)
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_double(),
##   X4 = col_double(),
##   X5 = col_integer(),
##   X6 = col_integer(),
##   X7 = col_integer(),
##   X8 = col_integer()
## )
head(litters_data)
## # A tibble: 6 x 8
##   X1    X2                 X3    X4    X5    X6    X7    X8
##   <chr> <chr>           <dbl> <dbl> <int> <int> <int> <int>
## 1 Con8  #3/5/2/2/95      28.5    NA    20     8     0     8
## 2 Con8  #5/4/3/83/3      28      NA    19     9     0     8
## 3 Con8  #1/6/2/2/95-2    NA      NA    20     7     0     6
## 4 Con8  #3/5/3/83/3-3-2  NA      NA    20     8     0     8
## 5 Con8  #2/2/95/2        NA      NA    19     5     0     4
## 6 Con8  #3/6/2/2/95-3    NA      NA    20     7     0     7
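
The na argument works the same way; the sketch below treats empty strings, "NA", and "." as missing (treating "." as missing is an assumption about how a file codes missingness, not a property of this dataset):

litters_data = read_csv(file = "./data/FAS_litters.csv",
  na = c("", "NA", "."))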

These arguments generally work for other members of the read_* family of functions.

Parsing columns

I skipped the col_types argument because it’s worth talking about in some detail. The read_* functions will attempt to guess the data type stored in each column; by default, these guesses are based on the first 1000 rows. The guesses are also usually pretty good. In some cases, though, you’ll want to give explicit column specifications. This is done using the cols function, and each column is given a column type:

litters_data = read_csv(file = "./data/FAS_litters.csv",
  col_types = cols(
    Group = col_character(),
    `Litter Number` = col_character(),
    `GD0 weight` = col_double(),
    `GD18 weight` = col_double(),
    `GD of Birth` = col_integer(),
    `Pups born alive` = col_integer(),
    `Pups dead @ birth` = col_integer(),
    `Pups survive` = col_integer()
  )
)
tail(litters_data)
## # A tibble: 6 x 8
##   Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth`
##   <chr> <chr>                  <dbl>         <dbl>         <int>
## 1 Low8  #79                     25.4          43.8            19
## 2 Low8  #100                    20            39.2            20
## 3 Low8  #4/84                   21.8          35.2            20
## 4 Low8  #108                    25.6          47.5            20
## 5 Low8  #99                     23.5          39              20
## 6 Low8  #110                    25.5          42.7            20
## # ... with 3 more variables: `Pups born alive` <int>, `Pups dead @
## #   birth` <int>, `Pups survive` <int>

Column parsing also allows a shorthand for almost every data type (e.g. c for character, d for double, i for integer, l for logical).

litters_data = read_csv(file = "./data/FAS_litters.csv",
  col_types = "ccddiiii"
)
litters_data
## # A tibble: 49 x 8
##    Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth`
##    <chr> <chr>                  <dbl>         <dbl>         <int>
##  1 Con7  #85                     19.7          34.7            20
##  2 Con7  #1/2/95/2               27            42              19
##  3 Con7  #5/5/3/83/3-3           26            41.4            19
##  4 Con7  #5/4/2/95/2             28.5          44.1            19
##  5 Con7  #4/2/95/3-3             NA            NA              20
##  6 Con7  #2/2/95/3-2             NA            NA              20
##  7 Con7  #1/5/3/83/3-3/2         NA            NA              20
##  8 Con8  #3/83/3-3               NA            NA              20
##  9 Con8  #2/95/3                 NA            NA              20
## 10 Con8  #3/5/2/2/95             28.5          NA              20
## # ... with 39 more rows, and 3 more variables: `Pups born alive` <int>,
## #   `Pups dead @ birth` <int>, `Pups survive` <int>

Learning Assessment: Repeat the data import process above for the file FAS_pups.csv. Make sure the column names are reasonable, and take some quick looks at the dataset. What happens if your specifications for column parsing aren’t reasonable (e.g. character instead of double, or vice versa)?

Other file formats

Non-csv plain text files (e.g. tab delimited files) can be handled using read_delim. This is very similar to read_csv, but you have to specify a delimiter.
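
For example, a tab-delimited version of the litters data (a hypothetical file, for illustration) could be imported with:

litters_data = read_delim("./data/FAS_litters.txt", delim = "\t")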

CSV format is great, but you’ll encounter a lot of Excel files too. Although you can export these to a csv, don’t – use the readxl package instead! This is part of the tidyverse, but isn’t loaded by library(tidyverse), so you’ll have to load it separately. The read_excel function in this package has many of the same arguments as read_csv, including col_names, na, skip, and col_types, and can be used in basically the same way. There is also a sheet option (useful when there are multiple sheets in the Excel file) and a range option (when you want to read in a specific data rectangle). Lastly, in RStudio you can use File > Import Dataset > From Excel for a GUI. The code below illustrates read_excel.

library(readxl)
mlb11_data = read_excel("data/mlb11.xlsx", n_max = 20)
head(mlb11_data, 5)
## # A tibble: 5 x 12
##   team   runs at_bats  hits homeruns bat_avg strikeouts stolen_bases  wins
##   <chr> <dbl>   <dbl> <dbl>    <dbl>   <dbl>      <dbl>        <dbl> <dbl>
## 1 Texa…   855    5659  1599      210   0.283        930          143    96
## 2 Bost…   875    5710  1600      203   0.28        1108          102    90
## 3 Detr…   787    5563  1540      169   0.277       1143           49    95
## 4 Kans…   730    5672  1560      129   0.275       1006          153    71
## 5 St. …   762    5532  1513      162   0.273        978           57    90
## # ... with 3 more variables: new_onbase <dbl>, new_slug <dbl>,
## #   new_obs <dbl>
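
To read a specific data rectangle instead, the range option takes a cell range; the range below is purely illustrative:

mlb11_subset = read_excel("data/mlb11.xlsx", range = "A1:D6")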

The last tidyverse package for data import we’ll note is haven, which is used to import data files from SAS, Stata, and SPSS into R. An example for reading data from SAS follows.

library(haven)
pulse_data = read_sas("./data/public_pulse_data.sas7bdat")
head(pulse_data, 5)
## # A tibble: 5 x 7
##      ID   age Sex   BDIScore_BL BDIScore_01m BDIScore_06m BDIScore_12m
##   <dbl> <dbl> <chr>       <dbl>        <dbl>        <dbl>        <dbl>
## 1 10003  48.0 male            7            1            2            0
## 2 10015  72.5 male            6           NA           NA           NA
## 3 10022  58.5 male           14            3            8           NA
## 4 10026  72.7 male           20            6           18           16
## 5 10035  60.4 male            4            0            1            2
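
haven has analogous functions for Stata and SPSS files; the file names below are hypothetical:

# pulse_stata = read_dta("./data/public_pulse_data.dta")
# pulse_spss = read_sav("./data/public_pulse_data.sav")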

You can also read in data that doesn’t come as a flat file, but that’s beyond the scope of this course.

Comparison with Base R

The functions in readr are relatively new, and can be used in place of base R’s read.csv, read.table, and so on. The base R versions tend to be slower (very noticeably for large datasets), and the default options can make less sense for modern datasets. Meanwhile, the readr functions export tibbles, and keep characters as characters (instead of converting to factors …).

Learning Assessment: Import the FAS_litters.csv dataset using read.csv. Compare the class of this dataset to the one produced by read_csv. Try printing both in the console – what happens? After cleaning up the names, try accessing the group variable using gr (e.g., litters_data$gr). What happens?

Exporting data

As a final point, you will sometimes need to export data after you have imported and cleaned it. The write_* functions in readr address this problem.
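
For example, you could save a cleaned version of the litters data with the sketch below (the file name is my own invention):

write_csv(litters_data, "./data/FAS_litters_clean.csv")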

Other materials

The content in this page draws heavily from several sources; each of the things below goes into more detail in one way or another.

The code that I used to produce working examples in lecture is here.