Data import

I’ll load the tidyverse and the dataset. While importing, I’ll also halve review_scores_location to create a rating variable and rename neighbourhood_group to boro, to make my life easier.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
airbnb_data = read_csv("./data/nyc_airbnb.zip") %>%
  mutate(rating = review_scores_location / 2) %>%
  rename(boro = neighbourhood_group)
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   review_scores_location = col_integer(),
##   name = col_character(),
##   host_id = col_integer(),
##   host_name = col_character(),
##   neighbourhood_group = col_character(),
##   neighbourhood = col_character(),
##   latitude = col_double(),
##   longitude = col_double(),
##   room_type = col_character(),
##   price = col_integer(),
##   minimum_nights = col_integer(),
##   number_of_reviews = col_integer(),
##   last_review = col_date(format = ""),
##   reviews_per_month = col_double(),
##   calculated_host_listings_count = col_integer(),
##   availability_365 = col_integer()
## )
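
To make the two cleaning steps concrete, here is a minimal sketch on a made-up three-row tibble. The object names toy_raw and toy_clean and all values are hypothetical, not taken from the dataset; only the mutate() and rename() calls mirror the import code above.

```r
library(dplyr)

# Hypothetical toy data standing in for the raw file; values are made up.
toy_raw = tibble::tibble(
  id = 1:3,
  review_scores_location = c(10L, NA, 9L),
  neighbourhood_group = c("Bronx", "Bronx", "Queens")
)

toy_clean = toy_raw %>%
  mutate(rating = review_scores_location / 2) %>%  # halve the score; NA stays NA
  rename(boro = neighbourhood_group)               # syntax is new_name = old_name

toy_clean
```

Note that rename() keeps the column in its original position, while mutate() adds rating at the end, which matches the column order in the str() output further down.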

Understanding variables

Next I’ll do a bit of inspection to make sure I understand the data structure.

# View(airbnb_data)
str(airbnb_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    40753 obs. of  18 variables:
##  $ id                            : int  7949480 16042478 1886820 6627449 5557381 9147025 11675715 715270 17876530 182177 ...
##  $ review_scores_location        : int  10 NA NA 10 10 10 10 9 10 9 ...
##  $ name                          : chr  "City Island Sanctuary relaxing BR & Bath w Parking" "WATERFRONT STUDIO APARTMENT" "Quaint City Island Community." "Large 1 BDRM in Great location" ...
##  $ host_id                       : int  119445 9117975 9815788 13886510 28811542 403032 56714504 3684360 11305944 873273 ...
##  $ host_name                     : chr  "Linda & Didier" "Collins" "Steve" "Arlene" ...
##  $ boro                          : chr  "Bronx" "Bronx" "Bronx" "Bronx" ...
##  $ neighbourhood                 : chr  "City Island" "City Island" "City Island" "City Island" ...
##  $ latitude                      : num  40.9 40.9 40.8 40.8 40.9 ...
##  $ longitude                     : num  -73.8 -73.8 -73.8 -73.8 -73.8 ...
##  $ room_type                     : chr  "Private room" "Private room" "Entire home/apt" "Entire home/apt" ...
##  $ price                         : int  99 200 300 125 69 125 85 39 95 125 ...
##  $ minimum_nights                : int  1 7 7 3 3 2 1 2 3 2 ...
##  $ number_of_reviews             : int  25 0 0 12 86 41 74 114 5 206 ...
##  $ last_review                   : Date, format: "2017-04-23" NA ...
##  $ reviews_per_month             : num  1.59 NA NA 0.54 3.63 2.48 5.43 2.06 5 2.98 ...
##  $ calculated_host_listings_count: int  1 1 1 1 1 1 1 4 3 4 ...
##  $ availability_365              : int  170 180 365 335 352 129 306 306 144 106 ...
##  $ rating                        : num  5 NA NA 5 5 5 5 4.5 5 4.5 ...
airbnb_data %>%
  count(boro)
## # A tibble: 5 x 2
##            boro     n
##           <chr> <int>
## 1         Bronx   649
## 2      Brooklyn 16810
## 3     Manhattan 19212
## 4        Queens  3821
## 5 Staten Island   261
airbnb_data %>%
  count(room_type)
## # A tibble: 3 x 2
##         room_type     n
##             <chr> <int>
## 1 Entire home/apt 19937
## 2    Private room 19626
## 3     Shared room  1190

Brainstorming questions

Does rating vary by borough, room type, or both?

First I’ll create a table to look at rating by boro.

airbnb_data %>%
  group_by(boro) %>% 
  filter(!is.na(rating)) %>%
  summarize(median_rating = median(rating),
            mean_rating = mean(rating),
            sd_rating = sd(rating))
## # A tibble: 5 x 4
##            boro median_rating mean_rating sd_rating
##           <chr>         <dbl>       <dbl>     <dbl>
## 1         Bronx           4.5    4.444444 0.4832429
## 2      Brooklyn           5.0    4.645007 0.4519313
## 3     Manhattan           5.0    4.785744 0.3645893
## 4        Queens           4.5    4.651240 0.4303368
## 5 Staten Island           4.5    4.618280 0.4119686
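
The table above drops missing ratings with filter() before summarizing. A possible alternative, sketched here on made-up data, is to keep every row and pass na.rm = TRUE to each summary function. The tibble toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy = tibble::tibble(
  boro = c("Bronx", "Bronx", "Bronx", "Queens", "Queens"),
  rating = c(4.5, NA, 5, 4, NA)
)

# Approach used above: drop rows with missing ratings, then summarize.
by_filter = toy %>%
  filter(!is.na(rating)) %>%
  group_by(boro) %>%
  summarize(mean_rating = mean(rating), n_rated = n())

# Alternative: keep all rows and tell each summary function to skip NAs.
by_na_rm = toy %>%
  group_by(boro) %>%
  summarize(mean_rating = mean(rating, na.rm = TRUE),
            n_rated = sum(!is.na(rating)))
```

The two agree here. They would differ for a group whose ratings are all missing: filter() removes that group from the table entirely, while na.rm = TRUE keeps it with a mean of NaN.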

Next I’ll try some visual displays.

airbnb_data %>%
  ggplot(aes(x = boro, y = rating)) + 
  geom_violin(bw = .1)
## Warning: Removed 10037 rows containing non-finite values (stat_ydensity).

airbnb_data %>%
  ggplot(aes(x = room_type, y = rating)) + 
  geom_violin(bw = .1)
## Warning: Removed 10037 rows containing non-finite values (stat_ydensity).

Finally, I’m interested in the room_type variable. Rating didn’t seem to vary much across room types in the plot above, but I’d still like to examine room_type and boro at the same time.

The chunk below creates a table, and then uses knitr::kable() to format the output.

airbnb_data %>%
  group_by(boro, room_type) %>% 
  filter(!is.na(rating)) %>%
  summarize(mean_rating = mean(rating)) %>%
  spread(key = boro, value = mean_rating) %>%
  knitr::kable()
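| room_type       |    Bronx | Brooklyn | Manhattan |   Queens | Staten Island |
|:----------------|---------:|---------:|----------:|---------:|--------------:|
| Entire home/apt | 4.445652 | 4.676650 |  4.815377 | 4.689802 |      4.637500 |
| Private room    | 4.451613 | 4.620142 |  4.741297 | 4.636719 |      4.604762 |
| Shared room     | 4.325000 | 4.516791 |  4.782297 | 4.532143 |      4.500000 |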
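
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The sketch below shows how the same reshaping might look with the newer function; it assumes tidyr >= 1.0.0 and dplyr >= 1.0.0 (for the .groups argument) and uses a made-up tibble named toy.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; values are made up for illustration.
toy = tibble::tibble(
  boro = c("Bronx", "Bronx", "Queens", "Queens"),
  room_type = c("Private room", "Shared room", "Private room", "Shared room"),
  rating = c(4.5, 4, 5, 4.5)
)

wide = toy %>%
  group_by(boro, room_type) %>%
  summarize(mean_rating = mean(rating), .groups = "drop") %>%
  # names_from plays the role of spread()'s key, values_from the role of value
  pivot_wider(names_from = boro, values_from = mean_rating)

wide
```

The result has one row per room type and one column per boro, and can be piped into knitr::kable() in the same way as the chunk above.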

Do hosts with more listings have higher or lower ratings?

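First I’ll look at the number of properties listed by each host. This code gave me some trouble in class – the first try and the better approach are both shown below. The first chunk counts listings (rows) at each value of calculated_host_listings_count, so a host with two listings contributes two rows. The second chunk counts listings per host_id and then counts hosts at each total, which is why it shows 2743 hosts with two listings where the first shows 5486 listings.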

airbnb_data %>% 
  count(calculated_host_listings_count)
## # A tibble: 21 x 2
##    calculated_host_listings_count     n
##                             <int> <int>
##  1                              1 30356
##  2                              2  5486
##  3                              3  2040
##  4                              4  1008
##  5                              5   545
##  6                              6   360
##  7                              7   245
##  8                              8   208
##  9                              9    72
## 10                             10    90
## # ... with 11 more rows
airbnb_data %>%
  count(host_id) %>%
  count(n)
## # A tibble: 21 x 2
##        n    nn
##    <int> <int>
##  1     1 30356
##  2     2  2743
##  3     3   680
##  4     4   252
##  5     5   109
##  6     6    60
##  7     7    35
##  8     8    26
##  9     9     8
## 10    10     9
## # ... with 11 more rows

We’ll use calculated_host_listings_count moving forward for this question – that gives the total number of rentals that are hosted by the host in the row.
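
The difference between counting listings and counting hosts is easy to see on a tiny example. This is a sketch on made-up data: the tibble toy and the column names n_listings and n_hosts are hypothetical, and the name argument to count() assumes dplyr >= 0.8.0.

```r
library(dplyr)

# Hypothetical toy data: host "a" has 1 listing, hosts "b" and "c" have 2 each.
toy = tibble::tibble(
  host_id = c("a", "b", "b", "c", "c"),
  calculated_host_listings_count = c(1L, 2L, 2L, 2L, 2L)
)

# Counts listings (rows): 4 listings belong to hosts with 2 listings.
listing_level = toy %>%
  count(calculated_host_listings_count)

# Counts hosts: first one row per host, then the number of hosts at each total.
host_level = toy %>%
  count(host_id, name = "n_listings") %>%
  count(n_listings, name = "n_hosts")
```

Here listing_level reports 4 rows at a count of two, while host_level reports 2 hosts with two listings, the same factor-of-two gap seen in the real output.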

I’ll try to answer this using a scatterplot.

airbnb_data %>% 
  ggplot(aes(x = calculated_host_listings_count, y = rating)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 10037 rows containing non-finite values (stat_smooth).
## Warning: Removed 10037 rows containing missing values (geom_point).

I’ll also try to answer this using a numeric summary for a binary lots_of_houses variable.

airbnb_data %>%
  filter(!is.na(rating)) %>%
  mutate(lots_of_houses = (calculated_host_listings_count > 1)) %>% 
  group_by(lots_of_houses) %>% 
  summarize(median_rating = median(rating),
            mean_rating = mean(rating),
            sd_rating = sd(rating))
## # A tibble: 2 x 4
##   lots_of_houses median_rating mean_rating sd_rating
##            <lgl>         <dbl>       <dbl>     <dbl>
## 1          FALSE           5.0    4.745917 0.4016954
## 2           TRUE           4.5    4.617015 0.4433053

If anything, hosts with more than one listing have slightly lower ratings (mean 4.62, versus 4.75 for hosts with a single listing).
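
A binary split hides any difference between hosts with two listings and hosts with ten. One possible refinement, sketched on made-up data, is to bin the count into a few groups with cut(). The cut points (1, 2-4, 5+) and the column name listing_bin are arbitrary choices for illustration, not something the data suggest.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy = tibble::tibble(
  calculated_host_listings_count = c(1L, 1L, 2L, 3L, 6L, 10L),
  rating = c(5, 4.5, 4.5, 5, 4, 4.5)
)

by_bin = toy %>%
  # Intervals are (0, 1], (1, 4], (4, Inf], labelled "1", "2-4", "5+".
  mutate(listing_bin = cut(calculated_host_listings_count,
                           breaks = c(0, 1, 4, Inf),
                           labels = c("1", "2-4", "5+"))) %>%
  group_by(listing_bin) %>%
  summarize(mean_rating = mean(rating), n = n())

by_bin
```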

Where are rentals located?

I’m going to use latitude and longitude to plot the location of rentals. I always get lat and long mixed up, and it seems like the person who put these data together did as well …

airbnb_data %>%
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point()

Next I’ll try to clean this up a bit and use other aesthetic mappings. For simplicity, I’ll also focus on Manhattan.

airbnb_data %>%
  filter(boro == "Manhattan") %>% 
  ggplot(aes(x = longitude, y = latitude, color = rating)) +
  geom_point(alpha = .1) + 
  scale_colour_gradient(low = "red", high = "green",
                        space = "Lab", na.value = "grey50", guide = "colourbar") +
  coord_cartesian() + 
  theme_classic()

We can learn quite a bit from this: there are fewer rentals in some neighborhoods of Manhattan than in others, and ratings seem to decrease slightly for rentals farther north of downtown.
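
A variation on this plot, offered as a sketch rather than a fix: coord_quickmap() approximates a map projection for small areas so the island is not stretched, and scale_colour_viridis_c() (ggplot2 >= 3.0.0) swaps the red-to-green gradient for a palette that is easier to read with red-green colour blindness. The tibble toy and its coordinates are made up. Unlike the plot above, this version drops unrated rentals instead of drawing them in grey.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy points roughly inside Manhattan; values are made up.
toy = tibble::tibble(
  boro = "Manhattan",
  longitude = c(-74.00, -73.98, -73.96, -73.94),
  latitude = c(40.71, 40.75, 40.79, 40.83),
  rating = c(5, 4.5, NA, 4)
)

manhattan_map = toy %>%
  filter(boro == "Manhattan", !is.na(rating)) %>%  # drop unrated rentals up front
  ggplot(aes(x = longitude, y = latitude, color = rating)) +
  geom_point(alpha = .5) +
  scale_colour_viridis_c() +  # continuous viridis palette
  coord_quickmap() +          # approximate map aspect ratio for a small area
  theme_classic()

manhattan_map
```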

Other questions

There are several questions we didn’t get to:

  • How is price related to other variables?
  • How are descriptions related to number of reviews?
  • Which rentals make the most money?
  • What’s different about low-availability rooms?

Of these, the question dealing with descriptions is one we’re not in a position to handle yet – that involves parsing character strings, which we’ll get to shortly. The others we could at least explore using the tools we have at our disposal now.
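
As one illustration of exploring these with the current tools, the sketch below takes a rough first pass at the "most money" question. It rests on a strong assumption that is labeled here and in the code: every night a rental is not available is treated as a booked night, which will overstate revenue for listings the host has simply blocked off. The tibble toy, its values, and the columns nights_unavailable and est_revenue are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy = tibble::tibble(
  id = 1:3,
  price = c(100L, 250L, 80L),
  availability_365 = c(365L, 165L, 0L)
)

toy_revenue = toy %>%
  mutate(
    # ASSUMPTION: unavailable nights are booked nights (a crude proxy).
    nights_unavailable = 365 - availability_365,
    est_revenue = price * nights_unavailable
  ) %>%
  arrange(desc(est_revenue))  # highest estimated revenue first

toy_revenue
```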