This assignment reinforces ideas in Visualization and EDA. A PDF of this assignment is here.
Due: October 11 at 5:00pm.
This “problem” focuses on the structure of your submission, including the use of R Markdown to write reproducible reports, the use of R Projects to organize your work, the use of relative paths to load data, and the naming structure for your files.
To that end:
Name your directory p8105_hw3_YOURUNI (e.g. p8105_hw3_ajg2202 for Jeff).
Name your R Markdown file p8105_hw3_YOURUNI.Rmd.
Some of the datasets used in this homework are large, so you should not include raw data files in your directory. Instead, create a separate directory called data and use relative paths starting with ../data/ to load data. We’ll have a similar directory and should be able to knit your R Markdown file. The screenshot below illustrates this configuration.
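As a minimal sketch of the relative-path setup described above: the file name example.csv and the toy values are hypothetical, and the folder layout is built in a temporary location only so the example runs anywhere. In your own work the data directory already exists next to your homework directory.

```r
library(readr)

# Hypothetical layout, created in a temporary folder for illustration:
#   <root>/data/example.csv
#   <root>/p8105_hw3_YOURUNI/   (where the .Rmd would live)
root = tempfile("hw3_demo_")
dir.create(file.path(root, "data"), recursive = TRUE)
dir.create(file.path(root, "p8105_hw3_YOURUNI"))

# Write a small made-up csv into the data directory.
write_csv(
  data.frame(id = 1:3, value = c(2.5, 3.1, 4.0)),
  file.path(root, "data", "example.csv")
)

# Knitting runs from the folder that holds the .Rmd, so mimic that here.
old_wd = setwd(file.path(root, "p8105_hw3_YOURUNI"))

# The relative path climbs one level up, then into data/.
example_df = read_csv("../data/example.csv")

# Restore the original working directory.
setwd(old_wd)
```

Because the path is relative, the same line works on any machine that has the same two sibling directories, which is what makes the report knit for someone else.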
Your solutions to Problems 1+ should be included in your .Rmd file, and your submission for this assignment will be a zip file of the directory named p8105_hw3_YOURUNI. The required structure is shown below.
We will assess adherence to the instructions above and whether we are able to knit your .Rmd – that is, whether your work is reproducible – in the grading of this problem. Adherence to appropriate styling and clarity of code will be assessed in Problems 1+. This homework includes figures; the readability of your embedded plots (e.g. font sizes, axis labels, titles) will be assessed in Problems 1+.
This problem uses the PULSE dataset, which appeared as an example in Data Wrangling I.
Read and clean the PULSE dataset; omit observations for which BDI score wasn’t measured.
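One way the cleaning step might look is sketched below on a small made-up table. The column names (id, bdi_score_bl, bdi_score_01m, bdi_score_06m, bdi_score_12m) are an assumption about how the PULSE data looks after name cleaning, and the values are invented; the actual file would first need to be read in.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the PULSE data; column names are assumed, values made up.
pulse_toy = tibble(
  id = c(1, 2, 3),
  bdi_score_bl = c(7, 12, NA),
  bdi_score_01m = c(5, NA, 9),
  bdi_score_06m = c(4, 10, 8),
  bdi_score_12m = c(NA, 11, 6)
)

pulse_tidy =
  pulse_toy %>%
  # One row per subject per visit instead of one column per visit.
  pivot_longer(
    bdi_score_bl:bdi_score_12m,
    names_to = "visit",
    names_prefix = "bdi_score_",
    values_to = "bdi"
  ) %>%
  # Omit observations for which BDI was not measured.
  filter(!is.na(bdi)) %>%
  # Keep visits in chronological order for later plotting.
  mutate(visit = factor(visit, levels = c("bl", "01m", "06m", "12m")))
```

Putting the data in long format first means the missing-value filter drops individual visits rather than whole subjects.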
geom_path() will be useful. Based on this plot, comment on the stability of BDI score within a person over time – do subjects with high BDI scores at baseline tend to have high BDI scores at 12 months?
This problem uses the Instacart data. Note that the data can be loaded as a zipped csv file by read_csv() – no need to unzip the data first.
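The following sketch illustrates reading a compressed csv directly with read_csv(). It uses a gzip file rather than a zip file, only because base R can write gzip without an external program; read_csv() decompresses .zip files in the same way. The data frame and its column names are made up for illustration and are not the assignment's file.

```r
library(readr)

# Toy data standing in for a large file; values are invented.
toy = data.frame(
  order_id = 1:4,
  aisle = c("fresh fruits", "yogurt", "fresh fruits", "packaged cheese")
)

# Write a gzip-compressed csv through a base R connection.
gz_path = tempfile(fileext = ".csv.gz")
con = gzfile(gz_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read_csv() decompresses on the fly; no manual unzipping step is needed.
toy_df = read_csv(gz_path)
```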
The goal is to do some exploration of this dataset. To that end, answer or address the following:
This problem uses the NY NOAA data. Note that the data can be loaded as a zipped csv file by read_csv() – no need to unzip the data first.
The goal is to do some exploration of this dataset. To that end, answer or address the following:
tmax and snow? Does this vary by station?
tmax against tmin. For these data, you might try a scatterplot, but it is unlikely to be “useful”.
tmax. Make a spaghetti plot showing the average tmax curve for each station. Comment on your plot.
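The two plotting ideas above can be sketched on simulated data. Everything here is an assumption made for illustration: the station labels, the values, and in particular the choice of year as the time scale for averaging. geom_bin2d() is one possible alternative to a scatterplot, not necessarily the intended one.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the NY NOAA data; stations and values are made up.
set.seed(1)
noaa_toy = tibble(
  id = rep(c("station_a", "station_b", "station_c"), each = 40),
  year = rep(rep(2001:2004, each = 10), times = 3),
  tmin = rnorm(120, mean = 5, sd = 8)
) %>%
  mutate(tmax = tmin + runif(120, min = 2, max = 12))

# Binned 2D counts avoid the overplotting a raw scatterplot suffers from
# when there are very many points.
tmax_tmin_plot =
  ggplot(noaa_toy, aes(x = tmin, y = tmax)) +
  geom_bin2d() +
  labs(x = "Minimum temperature", y = "Maximum temperature")

# Average tmax per station and year (year is an assumed time scale).
station_means =
  noaa_toy %>%
  group_by(id, year) %>%
  summarize(mean_tmax = mean(tmax, na.rm = TRUE), .groups = "drop")

# Spaghetti plot: one line per station, drawn via the group aesthetic.
spaghetti_plot =
  ggplot(station_means, aes(x = year, y = mean_tmax, group = id)) +
  geom_line(alpha = 0.5) +
  labs(x = "Year", y = "Average tmax")
```

The group aesthetic is what produces one line per station; without it, ggplot2 would connect all points into a single path. With many stations, a low alpha keeps the overall pattern visible.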