First I’ll load some packages.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(ggridges)
library(ggthemes)

Load the data

The code below downloads the weather data for today’s example. Note that I cache this chunk because it takes a while to run (also note that I loaded the other packages I need in a separate chunk – caching will keep the data but won’t reload any packages from the cached chunk).

library(rnoaa)

weather = 
  meteo_pull_monitors(c("USW00094728", "USC00519397", "USS0023B17S"),
                      var = c("PRCP", "TMIN", "TMAX"), 
                      date_min = "2016-01-01",
                      date_max = "2016-12-31") %>%
  mutate(
    name = recode(id, USW00094728 = "CentralPark_NY", 
                      USC00519397 = "Waikiki_HA",
                      USS0023B17S = "Waterhole_WA"),
    tmin = tmin / 10,
    tmax = tmax / 10,
    month = lubridate::floor_date(date, unit = "month")) %>%
  select(name, id, date, month, everything())
weather
## # A tibble: 1,098 x 7
##              name          id       date      month  prcp  tmax  tmin
##             <chr>       <chr>     <date>     <date> <dbl> <dbl> <dbl>
##  1 CentralPark_NY USW00094728 2016-01-01 2016-01-01     0   5.6   1.1
##  2 CentralPark_NY USW00094728 2016-01-02 2016-01-01     0   4.4   0.0
##  3 CentralPark_NY USW00094728 2016-01-03 2016-01-01     0   7.2   1.7
##  4 CentralPark_NY USW00094728 2016-01-04 2016-01-01     0   2.2  -9.9
##  5 CentralPark_NY USW00094728 2016-01-05 2016-01-01     0  -1.6 -11.6
##  6 CentralPark_NY USW00094728 2016-01-06 2016-01-01     0   5.0  -3.8
##  7 CentralPark_NY USW00094728 2016-01-07 2016-01-01     0   7.8  -0.5
##  8 CentralPark_NY USW00094728 2016-01-08 2016-01-01     0   7.8  -0.5
##  9 CentralPark_NY USW00094728 2016-01-09 2016-01-01     0   8.3   4.4
## 10 CentralPark_NY USW00094728 2016-01-10 2016-01-01   457  15.0   4.4
## # ... with 1,088 more rows

Counting things

We have a couple of ways of counting the how many observations we have in various groups.

weather %>%
  group_by(name, month) %>%
  summarize(n_obs = n())
## # A tibble: 36 x 3
## # Groups:   name [?]
##              name      month n_obs
##             <chr>     <date> <int>
##  1 CentralPark_NY 2016-01-01    31
##  2 CentralPark_NY 2016-02-01    29
##  3 CentralPark_NY 2016-03-01    31
##  4 CentralPark_NY 2016-04-01    30
##  5 CentralPark_NY 2016-05-01    31
##  6 CentralPark_NY 2016-06-01    30
##  7 CentralPark_NY 2016-07-01    31
##  8 CentralPark_NY 2016-08-01    31
##  9 CentralPark_NY 2016-09-01    30
## 10 CentralPark_NY 2016-10-01    31
## # ... with 26 more rows
weather %>%
  group_by(month) %>%
  summarize(n_obs = n(),
            n_days = n_distinct(date))
## # A tibble: 12 x 3
##         month n_obs n_days
##        <date> <int>  <int>
##  1 2016-01-01    93     31
##  2 2016-02-01    87     29
##  3 2016-03-01    93     31
##  4 2016-04-01    90     30
##  5 2016-05-01    93     31
##  6 2016-06-01    90     30
##  7 2016-07-01    93     31
##  8 2016-08-01    93     31
##  9 2016-09-01    90     30
## 10 2016-10-01    93     31
## 11 2016-11-01    90     30
## 12 2016-12-01    93     31

General summaries

The same pipeline can be used for more general summaries. The summarized data is included in a tibble, and can be operated on using dplyr / tidyr functions or plotted using ggplot.

weather %>%
  group_by(name, month) %>%
  summarize(mean_tmax = mean(tmax),
            mean_prec = mean(prcp, na.rm = TRUE),
            median_tmax = median(tmax),
            sd_tmax = sd(tmax)) %>% 
  ggplot(aes(x = month, y = mean_tmax, color = name)) + geom_point()

weather %>%
  group_by(name) %>%
  mutate(centered_tmax = tmax - mean(tmax)) %>% 
  ggplot(aes(x = date, y = centered_tmax, color = name)) + 
    geom_point() 

weather %>%
  group_by(name, month) %>%
  mutate(temp_ranking = min_rank(tmax)) %>% 
  filter(temp_ranking < 2)
## # A tibble: 47 x 8
## # Groups:   name, month [36]
##              name          id       date      month  prcp  tmax  tmin
##             <chr>       <chr>     <date>     <date> <dbl> <dbl> <dbl>
##  1 CentralPark_NY USW00094728 2016-01-23 2016-01-01   587  -2.7  -4.3
##  2 CentralPark_NY USW00094728 2016-02-14 2016-02-01     0  -9.3 -18.2
##  3 CentralPark_NY USW00094728 2016-03-03 2016-03-01     0   2.2  -3.2
##  4 CentralPark_NY USW00094728 2016-04-05 2016-04-01     0   6.1  -3.2
##  5 CentralPark_NY USW00094728 2016-04-09 2016-04-01    28   6.1   2.2
##  6 CentralPark_NY USW00094728 2016-05-01 2016-05-01    41  10.6   7.2
##  7 CentralPark_NY USW00094728 2016-06-08 2016-06-01   114  19.4  11.1
##  8 CentralPark_NY USW00094728 2016-07-09 2016-07-01   135  22.2  18.3
##  9 CentralPark_NY USW00094728 2016-08-02 2016-08-01     0  26.1  20.0
## 10 CentralPark_NY USW00094728 2016-08-22 2016-08-01     0  26.1  18.3
## # ... with 37 more rows, and 1 more variables: temp_ranking <int>
weather %>%
  group_by(name, month) %>%
  mutate(temp_rank = min_rank(desc(tmax))) %>%
  filter(temp_rank < 4)
## # A tibble: 151 x 8
## # Groups:   name, month [36]
##              name          id       date      month  prcp  tmax  tmin
##             <chr>       <chr>     <date>     <date> <dbl> <dbl> <dbl>
##  1 CentralPark_NY USW00094728 2016-01-10 2016-01-01   457  15.0   4.4
##  2 CentralPark_NY USW00094728 2016-01-16 2016-01-01    61  11.1   5.6
##  3 CentralPark_NY USW00094728 2016-01-31 2016-01-01     0  13.3   2.2
##  4 CentralPark_NY USW00094728 2016-02-20 2016-02-01     0  16.1   3.9
##  5 CentralPark_NY USW00094728 2016-02-25 2016-02-01     5  16.1   2.8
##  6 CentralPark_NY USW00094728 2016-02-29 2016-02-01    13  16.1   8.3
##  7 CentralPark_NY USW00094728 2016-03-09 2016-03-01     0  25.0   6.7
##  8 CentralPark_NY USW00094728 2016-03-10 2016-03-01     0  26.1  17.2
##  9 CentralPark_NY USW00094728 2016-03-31 2016-03-01     0  22.8   9.4
## 10 CentralPark_NY USW00094728 2016-04-01 2016-04-01     5  26.1  16.1
## # ... with 141 more rows, and 1 more variables: temp_rank <int>

This EDA process can be used to answer surprisingly complex questions. The code below examines the day-to-day variability in max temperatures in the three weather stations we’ve been examining.

weather %>%
  group_by(name) %>%
  mutate(temp_change = tmax - lag(tmax)) %>% 
  summarize(sd_temp_change = sd(temp_change, na.rm = TRUE))
## # A tibble: 3 x 2
##             name sd_temp_change
##            <chr>          <dbl>
## 1 CentralPark_NY       4.278321
## 2     Waikiki_HA       1.146472
## 3   Waterhole_WA       2.817583

Finally, we compare the summarize function to summary – the main difference is that summary operates on a vector and produces a table rather than a data.frame.

weather %>% 
  pull(tmax) %>%
  summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -10.80    8.10   19.75   18.41   29.40   35.60