I’ll load the tidyverse and the dataset. I’m also doing a bit of editing and renaming to make my life easier.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
airbnb_data = read_csv("./data/nyc_airbnb.zip") %>%
  mutate(rating = review_scores_location / 2) %>%
  rename(boro = neighbourhood_group)
## Parsed with column specification:
## cols(
## id = col_integer(),
## review_scores_location = col_integer(),
## name = col_character(),
## host_id = col_integer(),
## host_name = col_character(),
## neighbourhood_group = col_character(),
## neighbourhood = col_character(),
## latitude = col_double(),
## longitude = col_double(),
## room_type = col_character(),
## price = col_integer(),
## minimum_nights = col_integer(),
## number_of_reviews = col_integer(),
## last_review = col_date(format = ""),
## reviews_per_month = col_double(),
## calculated_host_listings_count = col_integer(),
## availability_365 = col_integer()
## )
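To make the editing and renaming step concrete, here is a minimal sketch on a made-up three-row tibble. The object names toy_raw and toy_clean and the values are assumptions for illustration; only the column names come from the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for the raw file; the values are invented.
toy_raw = tibble(
  review_scores_location = c(10L, NA, 9L),
  neighbourhood_group = c("Bronx", "Bronx", "Queens")
)

toy_clean = toy_raw %>%
  mutate(rating = review_scores_location / 2) %>%  # halve the score; NA stays NA
  rename(boro = neighbourhood_group)               # rename() takes new_name = old_name

toy_clean
```

The renamed column keeps its original position, and the new rating column is appended at the end, which matches the order in the str() output further down.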
Next I’ll do a bit of inspection to make sure I understand the data structure.
# View(airbnb_data)
str(airbnb_data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 40753 obs. of 18 variables:
## $ id : int 7949480 16042478 1886820 6627449 5557381 9147025 11675715 715270 17876530 182177 ...
## $ review_scores_location : int 10 NA NA 10 10 10 10 9 10 9 ...
## $ name : chr "City Island Sanctuary relaxing BR & Bath w Parking" "WATERFRONT STUDIO APARTMENT" "Quaint City Island Community." "Large 1 BDRM in Great location" ...
## $ host_id : int 119445 9117975 9815788 13886510 28811542 403032 56714504 3684360 11305944 873273 ...
## $ host_name : chr "Linda & Didier" "Collins" "Steve" "Arlene" ...
## $ boro : chr "Bronx" "Bronx" "Bronx" "Bronx" ...
## $ neighbourhood : chr "City Island" "City Island" "City Island" "City Island" ...
## $ latitude : num 40.9 40.9 40.8 40.8 40.9 ...
## $ longitude : num -73.8 -73.8 -73.8 -73.8 -73.8 ...
## $ room_type : chr "Private room" "Private room" "Entire home/apt" "Entire home/apt" ...
## $ price : int 99 200 300 125 69 125 85 39 95 125 ...
## $ minimum_nights : int 1 7 7 3 3 2 1 2 3 2 ...
## $ number_of_reviews : int 25 0 0 12 86 41 74 114 5 206 ...
## $ last_review : Date, format: "2017-04-23" NA ...
## $ reviews_per_month : num 1.59 NA NA 0.54 3.63 2.48 5.43 2.06 5 2.98 ...
## $ calculated_host_listings_count: int 1 1 1 1 1 1 1 4 3 4 ...
## $ availability_365 : int 170 180 365 335 352 129 306 306 144 106 ...
## $ rating : num 5 NA NA 5 5 5 5 4.5 5 4.5 ...
airbnb_data %>%
count(boro)
## # A tibble: 5 x 2
## boro n
## <chr> <int>
## 1 Bronx 649
## 2 Brooklyn 16810
## 3 Manhattan 19212
## 4 Queens 3821
## 5 Staten Island 261
airbnb_data %>%
count(room_type)
## # A tibble: 3 x 2
## room_type n
## <chr> <int>
## 1 Entire home/apt 19937
## 2 Private room 19626
## 3 Shared room 1190
First I’ll create a table to look at rating by boro.
airbnb_data %>%
  group_by(boro) %>%
  filter(!is.na(rating)) %>%
  summarize(median_rating = median(rating),
            mean_rating = mean(rating),
            sd_rating = sd(rating))
## # A tibble: 5 x 4
## boro median_rating mean_rating sd_rating
## <chr> <dbl> <dbl> <dbl>
## 1 Bronx 4.5 4.444444 0.4832429
## 2 Brooklyn 5.0 4.645007 0.4519313
## 3 Manhattan 5.0 4.785744 0.3645893
## 4 Queens 4.5 4.651240 0.4303368
## 5 Staten Island 4.5 4.618280 0.4119686
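An alternative to filtering out missing ratings first is to pass na.rm = TRUE to each summary function. The sketch below uses a small invented tibble (toy and toy_summary are hypothetical names) and also counts how many ratings each group actually has, which the filter approach hides.

```r
library(dplyr)

# Invented data: each boro has one missing rating.
toy = tibble(
  boro = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
  rating = c(4.5, NA, 5, 4, NA)
)

toy_summary = toy %>%
  group_by(boro) %>%
  summarize(
    n_rated = sum(!is.na(rating)),                 # non-missing ratings per group
    median_rating = median(rating, na.rm = TRUE),  # NAs dropped inside the call
    mean_rating = mean(rating, na.rm = TRUE)
  )

toy_summary
```

Both approaches should give the same medians and means; the choice is mostly about whether you want the missing rows to remain available for other summaries in the same call.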
Next I’ll try some visual displays.
airbnb_data %>%
  ggplot(aes(x = boro, y = rating)) +
  geom_violin(bw = .1)
## Warning: Removed 10037 rows containing non-finite values (stat_ydensity).
airbnb_data %>%
  ggplot(aes(x = room_type, y = rating)) +
  geom_violin(bw = .1)
## Warning: Removed 10037 rows containing non-finite values (stat_ydensity).
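The warnings above come from listings with a missing rating. One way to make that explicit, sketched here on invented data with hypothetical names (toy, toy_rated, violin_plot), is to drop those rows before plotting so that the number removed is something you computed yourself rather than something reported by ggplot.

```r
library(dplyr)
library(ggplot2)

# Invented data: one missing rating per boro.
toy = tibble(
  boro = c("Bronx", "Bronx", "Bronx", "Queens", "Queens", "Queens"),
  rating = c(4.5, 5, NA, 4, 5, NA)
)

n_missing = sum(is.na(toy$rating))   # rows that would trigger the warning

toy_rated = toy %>%
  filter(!is.na(rating))             # keep only rated listings

violin_plot = toy_rated %>%
  ggplot(aes(x = boro, y = rating)) +
  geom_violin(bw = .1)
```

Printing violin_plot draws the figure; because the missing values are already gone, the "non-finite values" warning should not appear.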
Finally, I’m interested in the room_type variable. Rating didn’t seem to vary much across this variable, but I’d still like to examine room_type and boro at the same time.
The chunk below creates a table, and then uses knitr::kable() to format the output.
airbnb_data %>%
  group_by(boro, room_type) %>%
  filter(!is.na(rating)) %>%
  summarize(mean_rating = mean(rating)) %>%
  spread(key = boro, value = mean_rating) %>%
  knitr::kable()
| room_type | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|
| Entire home/apt | 4.445652 | 4.676650 | 4.815377 | 4.689802 | 4.637500 |
| Private room | 4.451613 | 4.620142 | 4.741297 | 4.636719 | 4.604762 |
| Shared room | 4.325000 | 4.516791 | 4.782297 | 4.532143 | 4.500000 |
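The reshaping step that turns one row per boro and room type into one column per boro can also be written with tidyr::pivot_wider(), which newer versions of tidyr (1.0.0 and later) recommend over spread(). This is a sketch on invented means, not a rerun of the analysis; toy_means and toy_wide are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Invented long-format summary: one row per boro / room_type pair.
toy_means = tibble(
  boro = c("Bronx", "Bronx", "Queens", "Queens"),
  room_type = c("Private room", "Shared room", "Private room", "Shared room"),
  mean_rating = c(4.4, 4.3, 4.6, 4.5)
)

# names_from plays the role of spread()'s key, values_from the role of value.
toy_wide = toy_means %>%
  pivot_wider(names_from = boro, values_from = mean_rating)

toy_wide
```

The result has one row per room type and one column per boro, the same shape as the kable output above.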
First I’ll look at the number of properties by each host. This code gave me some trouble in class – the first try and the better approach are both shown below.
airbnb_data %>%
  count(calculated_host_listings_count)
## # A tibble: 21 x 2
## calculated_host_listings_count n
## <int> <int>
## 1 1 30356
## 2 2 5486
## 3 3 2040
## 4 4 1008
## 5 5 545
## 6 6 360
## 7 7 245
## 8 8 208
## 9 9 72
## 10 10 90
## # ... with 11 more rows
airbnb_data %>%
  count(host_id) %>%
  count(n)
## # A tibble: 21 x 2
## n nn
## <int> <int>
## 1 1 30356
## 2 2 2743
## 3 3 680
## 4 4 252
## 5 5 109
## 6 6 60
## 7 7 35
## 8 8 26
## 9 9 8
## 10 10 9
## # ... with 11 more rows
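We’ll use calculated_host_listings_count moving forward for this question – it gives the total number of rentals hosted by the host in each row. Note that the two tables above count different things: the first counts listings and the second counts hosts, so the 5486 listings in the second row of the first table belong to the 2743 hosts with two listings each in the second.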
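The difference between counting listings and counting hosts is easier to see on a tiny example. The sketch below invents three hosts with one, two, and three listings; the object names are hypothetical, and the name argument to count() assumes dplyr 0.8.0 or later.

```r
library(dplyr)

# Invented data: host 1 has one listing, host 2 has two, host 3 has three.
toy = tibble(
  host_id = c(1L, 2L, 2L, 3L, 3L, 3L),
  calculated_host_listings_count = c(1L, 2L, 2L, 3L, 3L, 3L)
)

# Counting the precomputed column tallies rows, that is, listings.
listings_by_size = toy %>%
  count(calculated_host_listings_count, name = "n_listings")

# Counting host_id first, then counting those counts, tallies hosts.
hosts_by_size = toy %>%
  count(host_id, name = "n_listings") %>%
  count(n_listings, name = "n_hosts")

listings_by_size
hosts_by_size
```

Here the first table reports 1, 2, and 3 listings, while the second reports one host in each size class; dividing the first table's counts by the host size recovers the second.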
I’ll try to answer this using a scatterplot.
airbnb_data %>%
  ggplot(aes(x = calculated_host_listings_count, y = rating)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 10037 rows containing non-finite values (stat_smooth).
## Warning: Removed 10037 rows containing missing values (geom_point).
I’ll also try to answer this using a numeric summary for a binary lots_of_houses variable.
airbnb_data %>%
  filter(!is.na(rating)) %>%
  mutate(lots_of_houses = (calculated_host_listings_count > 1)) %>%
  group_by(lots_of_houses) %>%
  summarize(median_rating = median(rating),
            mean_rating = mean(rating),
            sd_rating = sd(rating))
## # A tibble: 2 x 4
## lots_of_houses median_rating mean_rating sd_rating
## <lgl> <dbl> <dbl> <dbl>
## 1 FALSE 5.0 4.745917 0.4016954
## 2 TRUE 4.5 4.617015 0.4433053
If anything, hosts with more than one listing have slightly lower ratings (a mean of about 4.62, versus 4.75 for hosts with a single listing).
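A binary split treats a host with two listings the same as a host with twenty. One possible refinement, sketched here with invented data and arbitrary cut points (the host_size labels and the thresholds of 1 and 5 are assumptions, not part of the original analysis), is to bin hosts into a few size classes with case_when().

```r
library(dplyr)

# Invented listings with their host's listing count and a rating.
toy = tibble(
  calculated_host_listings_count = c(1L, 1L, 2L, 4L, 12L),
  rating = c(5, 4.5, 4.5, 5, 4)
)

toy_binned = toy %>%
  mutate(host_size = case_when(
    calculated_host_listings_count == 1 ~ "1",    # single-listing hosts
    calculated_host_listings_count <= 5 ~ "2-5",  # arbitrary middle bin
    TRUE ~ "6+"                                   # everything larger
  )) %>%
  group_by(host_size) %>%
  summarize(n = n(),
            mean_rating = mean(rating))

toy_binned
```

Reporting n alongside the mean helps when the larger bins contain only a handful of listings.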
I’m going to use latitude and longitude to plot the location of rentals. I always get lat and long mixed up, and it seems like the person who put these data together did as well …
airbnb_data %>%
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point()
Next I’ll try to clean this up a bit and use other aesthetic mappings. For simplicity, I’ll also focus on Manhattan.
airbnb_data %>%
  filter(boro == "Manhattan") %>%
  ggplot(aes(x = longitude, y = latitude, color = rating)) +
  geom_point(alpha = .1) +
  scale_colour_gradient(low = "red", high = "green",
                        space = "Lab", na.value = "grey50", guide = "colourbar") +
  coord_cartesian() +
  theme_classic()
We can learn quite a bit from this – some neighborhoods of Manhattan have fewer rentals than others, and ratings seem to be slightly lower for rentals farther north than for those downtown.
There are several questions we didn’t get to:
Of these, the question dealing with descriptions is one we’re not in a position to handle yet – that involves parsing character strings, which we’ll get to shortly. The others we could at least explore using the tools we have at our disposal now.