Context

At this point, we’ve covered Building Blocks, Data Wrangling I, and Visualization and EDA. These three topics give a broad introduction into the commonly-used tools of data science, and are the main focus of this project.

Independence

In contrast to Homework assignments, you must work completely independently on this project – do not discuss your approach, your code, or your results with any other students, and do not use the discussion board for questions related to this project. If questions do arise, please email the instructor.

Due date

Due: October 23 at 1:00pm.

Data

Accelerometers have become an appealing alternative to self-report techniques for studying physical activity in observational studies and clinical trials, largely because of their relative objectivity. During observation periods, the devices measure electrical signals that are a proxy for acceleration. ``Activity counts" are then devised by summarizing the voltage signals across a short period known as an epoch; one-minute epochs are common. Because accelerometers can be worn comfortably and unobtrusively, they produce around-the-clock observations of many kinds of activity.

Study participants were recruited from 50 Head Start centers in northern Manhattan, the Bronx, and Brooklyn, in neighborhoods with high rates of pediatric asthma. Researchers used a survey instrument to collect data on the child’s age, sex, asthma symptoms and other medical conditions, family-related factors, and features of the home environment. Field staff measured the child’s height, weight, and skin-fold thicknesses. The staff then attached the accelerometer to the child’s non-dominant wrist with a hospital band. From the accelerometer data, we observe minute-by-minute activity data over 24 hours. Data are aggregated into 10-minute epochs by averaging across minutes in 10-minute intervals; Epoch 1 corresponds to the 10-minute interval beginning at midnight.

A description of the data collection process can be found in this paper. The data can be downloaded here.

Reproducibility

You’re asked to submit a zipped R Project that contains a reproducible report covering the topics below. Use the following structure:

  • create a directory named p8105_midterm_YOURUNI (e.g. p8105_midterm_ajg2202 for Jeff)
  • put an R project in the directory
  • create a single .Rmd file named p8105_midterm_YOURUNI.Rmd

Do not include raw data files in your directory. Instead, create a separate directory called data and use relative paths starting with ../data/ to load data.

We will assess adherence to the instructions above and whether we are able to knit your .Rmd – that is, whether your work is reproducible – in the grading of this problem. Adherence to appropriate styling and clarity of code will be assessed. This project includes figures; the readability of your embedded plots (e.g. font sizes, axis labels, titles) will be assessed.

Problem

Throughout, you should describe your work in writing that targets a reasonably sophisticated collaborator – not an expert data scientist, but an interested observer. Include code and output that you think is relevant or useful. Write in a reproducible way (e.g. using inline R code where necessary). Include only relevant information – 500 words (plus figures and tables) should suffice.

First, load, tidy, and wrangle the data; your final dataset should include all originally observed covariates (recoded to contain informative values, where appropriate) and should restrict activity observations to epochs between 6:00am and midnight. Describe the resulting dataset (e.g. how many subjects, the percentage that is female, etc). Discuss any additional exploratory analyses you conduct, but only include results you think are interesting (e.g. visually inspect distributions for outliers, but include only if these are informative).

Traditional analyses of accelerometer data focus on the total activity over the day. Using your tidied dataset, aggregate accross minutes to create a total activity variable. Using these data, explore the hypothesis that boys are more active than girls. You may want to want to make comparisons visually and / or quantitatively, or using formal statistical analyses. Additionally, examine the possibility that season (warm vs cold) affects the relationship between sex and activity.

One advantage of accelerometer data is the ability to inspect activity over the course of the data, rather than in aggregate. Make a plot that shows the 18-hour activity “profiles” for each child. Build on this plot to examine covariate effects including (although not limited to) the difference between boys and girls; incorporating smooth estimates of the mean activity profile may clarify these effects. Comment on relationships you think are interesting. (No formal statistical analyses are needed.)