We have, by now, established some fundamental tools for doing data science. It’s important to revisit our definition, and especially our discussion of connotation, before moving forward.

The slack channel for today’s example is here.

Example

This example is based entirely on live-coding and uses the NYC Airbnb data. Once downloaded, the data can be imported using:

airbnb_data = read_csv("./data/nyc_airbnb.zip")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   review_scores_location = col_integer(),
##   name = col_character(),
##   host_id = col_integer(),
##   host_name = col_character(),
##   neighbourhood_group = col_character(),
##   neighbourhood = col_character(),
##   latitude = col_double(),
##   longitude = col_double(),
##   room_type = col_character(),
##   price = col_integer(),
##   minimum_nights = col_integer(),
##   number_of_reviews = col_integer(),
##   last_review = col_date(format = ""),
##   reviews_per_month = col_double(),
##   calculated_host_listings_count = col_integer(),
##   availability_365 = col_integer()
## )

As always, I’ll do today’s coding in a R Markdown file, sitting in an R Project with a data directory.

Understanding variables

First, let’s take a few minutes to understand the dataset and the variables it contains.

# View(airbnb_data)
str(airbnb_data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    40753 obs. of  17 variables:
##  $ id                            : int  7949480 16042478 1886820 6627449 5557381 9147025 11675715 715270 17876530 182177 ...
##  $ review_scores_location        : int  10 NA NA 10 10 10 10 9 10 9 ...
##  $ name                          : chr  "City Island Sanctuary relaxing BR & Bath w Parking" "WATERFRONT STUDIO APARTMENT" "Quaint City Island Community." "Large 1 BDRM in Great location" ...
##  $ host_id                       : int  119445 9117975 9815788 13886510 28811542 403032 56714504 3684360 11305944 873273 ...
##  $ host_name                     : chr  "Linda & Didier" "Collins" "Steve" "Arlene" ...
##  $ neighbourhood_group           : chr  "Bronx" "Bronx" "Bronx" "Bronx" ...
##  $ neighbourhood                 : chr  "City Island" "City Island" "City Island" "City Island" ...
##  $ latitude                      : num  40.9 40.9 40.8 40.8 40.9 ...
##  $ longitude                     : num  -73.8 -73.8 -73.8 -73.8 -73.8 ...
##  $ room_type                     : chr  "Private room" "Private room" "Entire home/apt" "Entire home/apt" ...
##  $ price                         : int  99 200 300 125 69 125 85 39 95 125 ...
##  $ minimum_nights                : int  1 7 7 3 3 2 1 2 3 2 ...
##  $ number_of_reviews             : int  25 0 0 12 86 41 74 114 5 206 ...
##  $ last_review                   : Date, format: "2017-04-23" NA ...
##  $ reviews_per_month             : num  1.59 NA NA 0.54 3.63 2.48 5.43 2.06 5 2.98 ...
##  $ calculated_host_listings_count: int  1 1 1 1 1 1 1 4 3 4 ...
##  $ availability_365              : int  170 180 365 335 352 129 306 306 144 106 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 17
##   .. ..$ id                            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ review_scores_location        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ name                          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ host_id                       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ host_name                     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ neighbourhood_group           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ neighbourhood                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ latitude                      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ longitude                     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ room_type                     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ price                         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ minimum_nights                : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ number_of_reviews             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ last_review                   :List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_date" "collector"
##   .. ..$ reviews_per_month             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ calculated_host_listings_count: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ availability_365              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

airbnb_data %>%
  count(room_type)
## # A tibble: 3 x 2
##   room_type           n
##   <chr>           <int>
## 1 Entire home/apt 19937
## 2 Private room    19626
## 3 Shared room      1190

airbnb_data %>%
  count(neighbourhood_group)
## # A tibble: 5 x 2
##   neighbourhood_group     n
##   <chr>               <int>
## 1 Bronx                 649
## 2 Brooklyn            16810
## 3 Manhattan           19212
## 4 Queens               3821
## 5 Staten Island         261

Brainstorming questions

A major element of data science is to ask questions, and this dataset provides some rich opportunities. For example, we might ask:

  • Does rating vary by neighborhood, room type, or both?
  • Do hosts with more listings have higher or lower ratings?
  • How is price related to other variables?
  • Where are rentals located?

We’ll take a few minutes as a class to brainstorm some additional questions, and then try to answer some of them.

Other materials

The code that I produced working examples in lecture is here.