First I’ll load some packages.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggridges)
library(ggthemes)
The code below downloads the weather data for today’s example. Note that I cache this chunk because it takes a while to run (also note that I loaded the other packages I need in a separate chunk – caching will keep the data but won’t reload any packages from the cached chunk).
library(rnoaa)
weather =
meteo_pull_monitors(c("USW00094728", "USC00519397", "USS0023B17S"),
var = c("PRCP", "TMIN", "TMAX"),
date_min = "2016-01-01",
date_max = "2016-12-31") %>%
mutate(
name = recode(id, USW00094728 = "CentralPark_NY",
USC00519397 = "Waikiki_HA",
USS0023B17S = "Waterhole_WA"),
tmin = tmin / 10,
tmax = tmax / 10,
month = lubridate::floor_date(date, unit = "month")) %>%
select(name, id, date, month, everything())
weather
## # A tibble: 1,098 x 7
## name id date month prcp tmax tmin
## <chr> <chr> <date> <date> <dbl> <dbl> <dbl>
## 1 CentralPark_NY USW00094728 2016-01-01 2016-01-01 0 5.6 1.1
## 2 CentralPark_NY USW00094728 2016-01-02 2016-01-01 0 4.4 0.0
## 3 CentralPark_NY USW00094728 2016-01-03 2016-01-01 0 7.2 1.7
## 4 CentralPark_NY USW00094728 2016-01-04 2016-01-01 0 2.2 -9.9
## 5 CentralPark_NY USW00094728 2016-01-05 2016-01-01 0 -1.6 -11.6
## 6 CentralPark_NY USW00094728 2016-01-06 2016-01-01 0 5.0 -3.8
## 7 CentralPark_NY USW00094728 2016-01-07 2016-01-01 0 7.8 -0.5
## 8 CentralPark_NY USW00094728 2016-01-08 2016-01-01 0 7.8 -0.5
## 9 CentralPark_NY USW00094728 2016-01-09 2016-01-01 0 8.3 4.4
## 10 CentralPark_NY USW00094728 2016-01-10 2016-01-01 457 15.0 4.4
## # ... with 1,088 more rows
We have a couple of ways of counting the how many observations we have in various groups.
weather %>%
group_by(name, month) %>%
summarize(n_obs = n())
## # A tibble: 36 x 3
## # Groups: name [?]
## name month n_obs
## <chr> <date> <int>
## 1 CentralPark_NY 2016-01-01 31
## 2 CentralPark_NY 2016-02-01 29
## 3 CentralPark_NY 2016-03-01 31
## 4 CentralPark_NY 2016-04-01 30
## 5 CentralPark_NY 2016-05-01 31
## 6 CentralPark_NY 2016-06-01 30
## 7 CentralPark_NY 2016-07-01 31
## 8 CentralPark_NY 2016-08-01 31
## 9 CentralPark_NY 2016-09-01 30
## 10 CentralPark_NY 2016-10-01 31
## # ... with 26 more rows
weather %>%
group_by(month) %>%
summarize(n_obs = n(),
n_days = n_distinct(date))
## # A tibble: 12 x 3
## month n_obs n_days
## <date> <int> <int>
## 1 2016-01-01 93 31
## 2 2016-02-01 87 29
## 3 2016-03-01 93 31
## 4 2016-04-01 90 30
## 5 2016-05-01 93 31
## 6 2016-06-01 90 30
## 7 2016-07-01 93 31
## 8 2016-08-01 93 31
## 9 2016-09-01 90 30
## 10 2016-10-01 93 31
## 11 2016-11-01 90 30
## 12 2016-12-01 93 31
The same pipeline can be used for more general summaries. The summarized data is included in a tibble, and can be operated on using dplyr
/ tidyr
functions or plotted using ggplot
.
weather %>%
group_by(name, month) %>%
summarize(mean_tmax = mean(tmax),
mean_prec = mean(prcp, na.rm = TRUE),
median_tmax = median(tmax),
sd_tmax = sd(tmax)) %>%
ggplot(aes(x = month, y = mean_tmax, color = name)) + geom_point()
weather %>%
group_by(name) %>%
mutate(centered_tmax = tmax - mean(tmax)) %>%
ggplot(aes(x = date, y = centered_tmax, color = name)) +
geom_point()
weather %>%
group_by(name, month) %>%
mutate(temp_ranking = min_rank(tmax)) %>%
filter(temp_ranking < 2)
## # A tibble: 47 x 8
## # Groups: name, month [36]
## name id date month prcp tmax tmin
## <chr> <chr> <date> <date> <dbl> <dbl> <dbl>
## 1 CentralPark_NY USW00094728 2016-01-23 2016-01-01 587 -2.7 -4.3
## 2 CentralPark_NY USW00094728 2016-02-14 2016-02-01 0 -9.3 -18.2
## 3 CentralPark_NY USW00094728 2016-03-03 2016-03-01 0 2.2 -3.2
## 4 CentralPark_NY USW00094728 2016-04-05 2016-04-01 0 6.1 -3.2
## 5 CentralPark_NY USW00094728 2016-04-09 2016-04-01 28 6.1 2.2
## 6 CentralPark_NY USW00094728 2016-05-01 2016-05-01 41 10.6 7.2
## 7 CentralPark_NY USW00094728 2016-06-08 2016-06-01 114 19.4 11.1
## 8 CentralPark_NY USW00094728 2016-07-09 2016-07-01 135 22.2 18.3
## 9 CentralPark_NY USW00094728 2016-08-02 2016-08-01 0 26.1 20.0
## 10 CentralPark_NY USW00094728 2016-08-22 2016-08-01 0 26.1 18.3
## # ... with 37 more rows, and 1 more variables: temp_ranking <int>
weather %>%
group_by(name, month) %>%
mutate(temp_rank = min_rank(desc(tmax))) %>%
filter(temp_rank < 4)
## # A tibble: 151 x 8
## # Groups: name, month [36]
## name id date month prcp tmax tmin
## <chr> <chr> <date> <date> <dbl> <dbl> <dbl>
## 1 CentralPark_NY USW00094728 2016-01-10 2016-01-01 457 15.0 4.4
## 2 CentralPark_NY USW00094728 2016-01-16 2016-01-01 61 11.1 5.6
## 3 CentralPark_NY USW00094728 2016-01-31 2016-01-01 0 13.3 2.2
## 4 CentralPark_NY USW00094728 2016-02-20 2016-02-01 0 16.1 3.9
## 5 CentralPark_NY USW00094728 2016-02-25 2016-02-01 5 16.1 2.8
## 6 CentralPark_NY USW00094728 2016-02-29 2016-02-01 13 16.1 8.3
## 7 CentralPark_NY USW00094728 2016-03-09 2016-03-01 0 25.0 6.7
## 8 CentralPark_NY USW00094728 2016-03-10 2016-03-01 0 26.1 17.2
## 9 CentralPark_NY USW00094728 2016-03-31 2016-03-01 0 22.8 9.4
## 10 CentralPark_NY USW00094728 2016-04-01 2016-04-01 5 26.1 16.1
## # ... with 141 more rows, and 1 more variables: temp_rank <int>
This EDA process can be used to answer surprisingly complex questions. The code below examines the day-to-day variability in max temperatures in the three weather stations we’ve been examining.
weather %>%
group_by(name) %>%
mutate(temp_change = tmax - lag(tmax)) %>%
summarize(sd_temp_change = sd(temp_change, na.rm = TRUE))
## # A tibble: 3 x 2
## name sd_temp_change
## <chr> <dbl>
## 1 CentralPark_NY 4.278321
## 2 Waikiki_HA 1.146472
## 3 Waterhole_WA 2.817583
Finally, we compare the summarize
function to summary
– the main difference is that summary
operates on a vector and produces a table
rather than a data.frame
.
weather %>%
pull(tmax) %>%
summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.80 8.10 19.75 18.41 29.40 35.60