---
output:
pdf_document: default
html_document: default
word_document: default
---
**Biostatistics P8111, Spring 2016**
**Homework 4 Solutions**
Due April 8 by 5:00pm
Please submit an electronic copy (PDF) of your solutions using CourseWorks. Use the title
"P8111HW4_Lastname_Firstname.pdf".
Solutions to Problems 1-2 can be typed or handwritten and scanned; Problems 3-4 should be completed using R markdown. In addition to your PDF, please also submit the .Rmd that produces your written solutions to Problems 3-4 with the title "P8111HW4_Lastname_Firstname.Rmd".
```{r, echo=FALSE, message=FALSE, warning=FALSE}
## load all needed packages at the beginning of the document
library(dplyr)
library(SemiPar)
library(mgcv)
library(glmnet)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(broom)
library(caret)
```
### Problem 1. [7 points]
#### Solution:
\begin{align*}
\hat{\beta} & = \arg \min _{\beta} (y - X \beta)^{T} (y - X \beta) + \lambda \beta^{T} P \beta \\
& = \arg \min_{\beta} (y^T y - 2\beta^T X^T y + \beta^T X^T X \beta) + \lambda \beta^{T} P \beta
\end{align*}
If we take a derivative w.r.t. $\beta$ and set it equal to 0, then
$$ \frac{\partial }{\partial \beta} \left[ (y^T y - 2\beta^T X^T y + \beta^T X^T X \beta) + \lambda \beta^{T} P \beta \right] = - 2 X^T y + 2 X^T X \beta + 2 \lambda P \beta = 0.$$
Then it follows that
$$ (X^T X + \lambda P) \beta = X^T y. $$
Therefore,
$$ \hat{\beta} = (X^T X + \lambda P)^{-1} X^T y. $$
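As a quick numerical sanity check (a sketch on simulated data, not part of the required solution), the closed form can be computed directly in R; with $\lambda = 0$ it should reduce to the ordinary least squares estimate:

```r
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # design matrix with intercept
beta_true <- c(1, 2, -1)
y <- as.vector(X %*% beta_true + rnorm(n))

P <- diag(p)       # identity penalty matrix (standard ridge)
lambda <- 0.5

## closed-form penalized estimate: (X'X + lambda P)^{-1} X'y
beta_ridge <- solve(t(X) %*% X + lambda * P, t(X) %*% y)

## with lambda = 0 the formula reduces to ordinary least squares,
## which should match the coefficients from lm()
beta_ols <- solve(t(X) %*% X, t(X) %*% y)
all.equal(as.vector(beta_ols), unname(coef(lm(y ~ X - 1))))
```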
\pagebreak
### Problem 2. [8 points]
Consider the multiple linear regression model
$$ y = X \beta + \epsilon$$
with response vector $y$, design matrix $X$, coefficient vector $\beta$ and error vector $\epsilon$. In Problem 1, we found ridge regression estimates through
$$\hat{\beta}_{R} = \arg \min_{\beta} (y - X \beta)^{T} (y - X \beta) + \lambda \beta^{T} P \beta $$
with penalty matrix $P$ and tuning parameter $\lambda$.
We now make the additional assumptions
(a) That errors are multivariate normal: $\epsilon \sim \mbox{N}(0, \sigma^2_{\epsilon} I)$.
(b) That regression coefficients are also multivariate normal: $\beta \sim \mbox{N}(0, \sigma^2_{\beta} P^{-1})$.
Show that, under these assumptions, the MLE for $\beta$ is the same as the ridge regression estimate (it is sufficient to show that maximizing the likelihood is equivalent to minimizing the objective function above). What is the relationship between $\lambda$, $\sigma^2_{\epsilon}$, and $\sigma^2_{\beta}$?
#### Solution:
In this formulation, the likelihood for $\beta$ is
$$
\begin{aligned}
L(\beta | y) & \propto p(y | \beta) p(\beta) \\
& \propto \exp\left\{-\frac{1}{2\sigma^2_\epsilon} (y - X\beta)^{T}(y- X\beta) \right\} \exp\left\{-\frac{1}{2\sigma^2_\beta} \beta^{T}P\beta \right\} \\
& = \exp\left\{-\frac{1}{2}\left[\frac{1}{\sigma^2_\epsilon} (y - X\beta)^{T}(y- X\beta) + \frac{1}{\sigma^2_\beta} \beta^{T}P\beta \right]\right\}
\end{aligned}
$$
and the log likelihood is, up to an additive constant,
$$
l(\beta | y) = -\frac{1}{2}\left[\frac{1}{\sigma^2_\epsilon} (y - X\beta)^{T}(y- X\beta) + \frac{1}{\sigma^2_\beta} \beta^{T}P\beta \right].
$$
Maximizing $l(\beta | y)$ is therefore equivalent to minimizing
$$ (y - X\beta)^{T}(y- X\beta) + \frac{\sigma^2_\epsilon}{\sigma^2_\beta} \beta^{T}P\beta. $$
This corresponds to the ridge regression loss function with $\lambda = \frac{\sigma^2_\epsilon}{\sigma^2_\beta}$.
A couple of points on this (not required for the homework solutions, but maybe interesting):
* When $\sigma^2_\epsilon \gg \sigma^2_\beta$, $\frac{\sigma^2_\epsilon}{\sigma^2_\beta}$ is large; this roughly translates to a large penalty and lots of smoothness. Intuitively, the amount of residual variation makes it difficult to estimate a smooth underlying function, even if the true mean varies.
* If some regression coefficients aren't penalized, $P$ has some zeros on the main diagonal and $P^{-1}$ does not exist. There are at least two ways to think of this. First, one can "make peace" with the idea that, even though $\mbox{N}(0, \sigma^2_{\beta} P^{-1})$ isn't a well-defined distribution, the likelihood exists and can be maximized; in a Bayesian framework, we are using an "improper" prior distribution and finding that the posterior exists and is proper. Alternatively, one can partition the vector $\beta$ into penalized and unpenalized elements and define the assumption in (b) only to the penalized components. This method was alluded to in Lecture 20 and is mathematically equivalent.
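The equivalence can also be checked numerically (again, a sketch on simulated data, not required for the homework): with $\lambda = \sigma^2_\epsilon/\sigma^2_\beta$, the closed-form ridge estimate should coincide with the numerical maximizer of the likelihood above.

```r
set.seed(2)
n <- 200; p <- 4
X <- matrix(rnorm(n * p), n, p)
sigma2_eps <- 4; sigma2_beta <- 1
beta_true <- rnorm(p, sd = sqrt(sigma2_beta))
y <- as.vector(X %*% beta_true + rnorm(n, sd = sqrt(sigma2_eps)))

P <- diag(p)
lambda <- sigma2_eps / sigma2_beta   # implied tuning parameter

## ridge estimate in closed form
beta_ridge <- as.vector(solve(t(X) %*% X + lambda * P, t(X) %*% y))

## negative log likelihood under assumptions (a) and (b), up to a constant
neg_log_lik <- function(b) {
  sum((y - X %*% b)^2) / (2 * sigma2_eps) + c(t(b) %*% P %*% b) / (2 * sigma2_beta)
}

## numerical maximizer of the likelihood matches the ridge estimate
beta_mle <- optim(rep(0, p), neg_log_lik, method = "BFGS")$par
max(abs(beta_mle - beta_ridge))
```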
\pagebreak
### Problem 3. [10+5+5=20 points]
Solutions to this assignment are drawn from student submissions. We include four solutions, each of which took a different approach to model building. At the top of each solution there is a short paragraph describing some key strengths. Please note that many other approaches are possible.
#### Solution 1:
\textcolor{red}{
Solution 1 takes an approach that is useful for hypothesis-generating or secondary data analyses. After thorough exploratory analyses, including examining tables and graphs of the data, a subset of known confounders is selected. Following that, a guided stepwise procedure that focuses on plausible predictors and that incorporates scientific interpretation at each step is used to construct a final model. Interpreting coefficients within the summary paragraph (rather than as a list) may be more helpful to collaborators, but the presentation is clear throughout.
}
```{r load_data, echo=FALSE, message = FALSE}
rm(list = ls())
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW4_Data_BWT.RDA"))
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW4_Data_Accel.RDA"))
```
```{r tidy = TRUE, message= FALSE, echo = FALSE}
birth_data = tbl_df(data.bwt)
```
The data analysis will begin with an exploration of the variables of interest. For continuous variables, the bivariate relationship between each variable and birthweight will be examined. For categorical predictors, the frequency and mean birthweight of each category will be explored.
```{r tidy = TRUE, message= FALSE, warning = FALSE}
#data exploration
birth_data = birth_data %>%
dplyr::select(c(babysex, bhead, blength, bwt,
delwt, fincome, frace, gaweek,
malform, menarche, mheight, momage,
mrace, parity, ppbmi, ppwt,
smoken, wtgain, pnumlbw, pnumsga))
summary(birth_data)
ggplot(data = birth_data, aes(bwt)) + geom_density()
bhead_plot = ggplot(data = birth_data, aes(bhead, bwt)) + geom_point()
blength_plot = ggplot(data = birth_data, aes(blength, bwt)) + geom_point()
delwt_plot = ggplot(data = birth_data, aes(delwt, bwt)) + geom_point()
fincome_plot = ggplot(data = birth_data, aes(fincome, bwt)) + geom_point()
gaweek_plot = ggplot(data = birth_data, aes(gaweek, bwt)) + geom_point()
menarche_plot = ggplot(data = birth_data, aes(menarche, bwt)) + geom_point()
parity_plot = ggplot(data = birth_data, aes(parity, bwt)) + geom_point()
pnumlbw_plot = ggplot(data = birth_data, aes(pnumlbw, bwt)) + geom_point()
pnumsga_plot = ggplot(data = birth_data, aes(pnumsga, bwt)) + geom_point()
ppbmi_plot = ggplot(data = birth_data, aes(ppbmi, bwt)) + geom_point()
ppwt_plot = ggplot(data = birth_data, aes(ppwt, bwt)) + geom_point()
smoken_plot = ggplot(data = birth_data, aes(smoken, bwt)) + geom_point()
wtgain_plot = ggplot(data = birth_data, aes(wtgain, bwt)) + geom_point()
momage_plot = ggplot(data = birth_data, aes(momage, bwt)) + geom_point()
summary(birth_data$menarche)
birth_data %>% group_by(mrace) %>% summarize(n = n(), mean = mean(bwt, na.rm = T))
birth_data %>% group_by(frace) %>% summarize(n = n(), mean = mean(bwt, na.rm = T))
birth_data %>% group_by(malform) %>% summarize(n = n(), mean = mean(bwt, na.rm = T))
birth_data %>% group_by(babysex) %>% summarize(n = n(), mean = mean(bwt, na.rm = T))
grid.arrange(bhead_plot, blength_plot, delwt_plot, fincome_plot, gaweek_plot, menarche_plot, parity_plot, pnumlbw_plot, pnumsga_plot, ppbmi_plot, ppwt_plot, smoken_plot, wtgain_plot, momage_plot)
```
For the continuous variables, the following seem to have a linear relationship with birthweight: baby's head circumference, baby length, delivery weight, gestational age, pre-pregnancy BMI, pre-pregnancy weight, and weight gain. Mother's age seems to have a slight but not definitive relationship; its significance will be assessed later on. The plots also reveal that previous number of low birth weights and number of prior small-for-gestational-age babies are all 0, so these variables will not be considered as potential predictors. Another observation from the plots is that menarche has some extreme values; on further examination, menarche values of 99 and 0 are not possible, so these observations will be removed in the data cleaning step. After looking at the categorical variable frequencies and means per group, nothing jumps out as unusual except that there are `r sum(is.na(birth_data$frace))` missing values of father's race, which is extremely high. These, along with all the other NAs, will be temporarily removed from the dataset so that linear models can be fit; however, this removal will be reconsidered throughout the model selection process. Other data cleaning steps that will be performed are converting categorical variables into factors and creating categorical versions of smoken and momage to possibly be considered in the model.
```{r tidy = TRUE, message= FALSE}
#data cleaning
birth_data_clean =
birth_data %>%
mutate(babysex = factor(babysex, c(1:2), labels = c("male", "female")),
frace = factor(frace, c(1, 2, 3, 4, 8, 9),
labels = c("White", "Black", "Asian",
"Puerto Rican", "Other", "Unknown")),
mrace = factor(mrace, c(1, 2, 3, 4, 8),
labels = c("White", "Black", "Asian",
"Puerto Rican", "Other")),
malform = factor(malform, c(0, 1),labels = c("Absent", "Present"))) %>%
dplyr::select(c(babysex, bhead, blength, bwt,
delwt, fincome, frace, gaweek,
malform, menarche, mheight, momage, mrace, parity, ppbmi,
ppwt, smoken, wtgain, pnumlbw, pnumsga)) %>%
filter(menarche != 99, menarche != 0, !is.na(babysex), !is.na(bhead),
!is.na(bwt), !is.na(gaweek), !is.na(ppbmi), !is.na(smoken),
!is.na(wtgain), !is.na(blength), !is.na(frace)) %>%
mutate(smoken_cat = cut(smoken, c(-1,0,20,60), right = TRUE, labels = c("none","light","heavy"))) %>%
mutate(momage_cat = cut(momage, c(12, 20, 30, 45), labels = c("young","normal" ,"old"), right = FALSE))
```
### (a) Building a predictive model
Before the model building process begins, note that some variables in the dataset are highly correlated with, or are functions of, other variables. BMI is calculated from weight and height, so including ppbmi, ppwt, and mheight together may be an issue of perfect collinearity. This is confirmed by a correlation of `r cor(x = birth_data_clean$ppbmi, y = birth_data_clean$ppwt/(birth_data_clean$mheight^2))` between ppbmi and ppwt/mheight^2, which is exceptionally high; for the purposes of model building, only ppbmi will be considered. Similarly, weight gain was calculated by the investigators by subtracting pre-pregnancy weight from delivery weight, leading to perfect correlation among those variables. Only weight gain will be considered in the model, because the relationship between the mother's weight gain and birthweight is an interesting scientific question.
The model building process will begin with a base model that includes predictors that are scientifically known to be associated with birthweight. These predictors will remain in the model regardless of future steps. The predictors in the base model are the size of the head of the baby, the length of the baby, and the gestational age. All these variables are clearly associated with baby weight.
```{r tidy = TRUE, message= FALSE, warning = FALSE}
#the analysis will begin with a base model of known predictors of birth weight
base_model = lm(bwt ~ bhead + blength + gaweek, data = birth_data_clean)
summary(base_model)
```
After considering this base model and looking at the coefficients other interesting variables will be added to the model to determine if they are significantly associated with birthweight. The first variable considered is mother's age. One would expect mother's age to be somewhat related to birthweight because one would expect younger mothers to have babies with lower birthweights.
```{r tidy = TRUE, message= FALSE, warning = FALSE}
#The age of the mother seems to be a possible predictor of birthweight so it will be included in the model and then the estimates and significance of the model will be assessed
model.1 = lm(bwt ~ bhead + blength + gaweek + momage, data = birth_data_clean)
summary(model.1)
```
As predicted, mother's age seems to have a significant relationship with birthweight, so it will be kept in the model temporarily. While it is a significant predictor, the values of the other coefficients don't change, signaling that it is probably not a confounder. The next two variables added to the model will be weight gain and family income.
```{r tidy = TRUE, message= FALSE, warning = FALSE, results = "hide"}
#next wtgain will be included in the model
model.2 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain, data = birth_data_clean)
summary(model.2)
#next fincome will be included in the model to determine if it is a significant predictor of birthweight
model.3 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome, data = birth_data_clean)
summary(model.3)
```
Weight gain and fincome were added independently, one at a time. Weight gain is significantly associated with birthweight, and all other variables remained significant. When family income was added to the model, it was determined to be significant based on its p-value, and the values of the coefficients changed, signaling that it is a possible confounder. While fincome is currently significant, it is not as significant as the other variables, so it may not be included in the final model; this will be examined as the process continues.
The next two variables to be added will be smoken and ppbmi.
```{r tidy = TRUE, message= FALSE, warning = FALSE, results = "hide"}
#next smoken is added to the model to determine if it is significant
model.4 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome + smoken, data = birth_data_clean)
summary(model.4)
#smoken not significantly related to birthweight so it will not be included in the final model, all other variables remain significant
#next ppbmi
model.5 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome + ppbmi, data = birth_data_clean)
summary(model.5)
#ppbmi significant; ppwt will not be included because it's perfectly correlated with ppbmi
```
While smoking is known to cause malformations in babies, it is not found to be a significant predictor of birthweight here. Later in the analysis, smoken as a categorical variable will be added to see if that significantly predicts birthweight. Ppbmi is found to be a significant predictor of birthweight, so it will be added to the model. It is noteworthy that the p-values of momage and fincome have increased.
Next babysex will be added to the model.
```{r tidy = TRUE, message= FALSE, warning = FALSE, results= "hide"}
#next baby sex
model.6 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome + ppbmi + babysex, data = birth_data_clean)
summary(model.6)
```
Baby sex is also a significant predictor of birthweight. Fincome and momage have remained stable.
Father's race will now be considered.
```{r tidy = TRUE, message= FALSE, warning = FALSE}
#next father race will be included
model.7 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome + ppbmi + babysex + frace, data = birth_data_clean)
summary(model.7)
```
Father's race is not significant at any level, so it will not be included in the final model. Because many observations were removed due to the NAs of frace, these observations will be restored into the model dataset, as they carry a lot of important information. The model building process will be restarted because these new observations can change the coefficients.
```{r tidy = TRUE, message= FALSE, warning = FALSE, echo = FALSE, results= "hide"}
birth_data_clean =
birth_data %>%
mutate(babysex = factor(babysex, c(1:2), labels = c("male", "female")),
frace = factor(frace, c(1, 2, 3, 4, 8, 9),
labels = c("White", "Black", "Asian",
"Puerto Rican", "Other", "Unknown")),
mrace = factor(mrace, c(1, 2, 3, 4, 8),
labels = c("White", "Black", "Asian",
"Puerto Rican", "Other")),
malform = factor(malform, c(0, 1),labels = c("Absent", "Present"))) %>%
dplyr::select(c(babysex, bhead, blength, bwt,
delwt, fincome, frace, gaweek,
malform, menarche, mheight, momage, mrace, parity, ppbmi,
ppwt, smoken, wtgain, pnumlbw, pnumsga)) %>%
filter(menarche != 99, menarche != 0, !is.na(babysex),
!is.na(bhead), !is.na(bwt), !is.na(delwt),
!is.na(gaweek), !is.na(ppbmi), !is.na(smoken),
!is.na(wtgain), !is.na(blength)) %>%
mutate(smoken_cat = cut(smoken, c(-1,0,20,60), right = TRUE, labels = c("none","light","heavy")))
```
```{r tidy = TRUE, message= FALSE, warning = FALSE, echo = FALSE, results = "hide"}
#the analysis will begin with a base model of known predictors of birth weight
base_model = lm(bwt ~ bhead + blength + gaweek, data = birth_data_clean)
summary(base_model)
model.1 = lm(bwt ~ bhead + blength + gaweek + momage, data = birth_data_clean)
summary(model.1)
model.2 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain, data = birth_data_clean)
summary(model.2)
model.3 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome, data = birth_data_clean)
summary(model.3)
model.4 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome + smoken, data = birth_data_clean)
summary(model.4)
model.5 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome + ppbmi, data = birth_data_clean)
summary(model.5)
#ppbmi significant, ppwt will not be included because its perfectly correlated to ppbmi
#next sex
model.6 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome + ppbmi + babysex, data = birth_data_clean)
summary(model.6)
```
The model building process with the restored observations yields results similar to those obtained previously. The process will now continue.
```{r tidy = TRUE, message= FALSE, warning = FALSE, results = "hide"}
#next mother race will be included
model.7 = lm(bwt ~ bhead + blength + gaweek + momage + wtgain + fincome + ppbmi + babysex + mrace, data = birth_data_clean)
summary(model.7)
#certain levels of mother race are significant, but levels with sparse observations are not; fincome and momage no longer significant
```
Most levels of mrace are significant, so it will be kept in the model. Next, parity and malform will be considered.
```{r tidy = TRUE, message= FALSE, warning = FALSE, results = "hide"}
model.8 = lm(bwt ~ bhead + blength + gaweek + wtgain + ppbmi + babysex + mrace + parity, data = birth_data_clean)
summary(model.8)
#parity not significant
model.9 = lm(bwt ~ bhead + blength + gaweek + wtgain + ppbmi + babysex + mrace + malform, data = birth_data_clean)
summary(model.9)
```
Neither of these variables is significant, and the remaining variables have not changed; a preliminary model is now formed.
```{r tidy = TRUE, message= FALSE, warning = FALSE}
model_prelim = lm(bwt ~ bhead + blength + gaweek + wtgain + ppbmi + babysex + mrace, data = birth_data_clean)
summary(model_prelim)
```
Because one of the levels of race is not significant, we will use an ANOVA to test whether the race variable as a whole is significant. Additionally, I compared the predictive ability of the models, one with race and one without.
```{r tidy = TRUE, message= FALSE, warning = FALSE}
model_prelim_norace = lm(bwt ~ bhead + blength + gaweek + wtgain + ppbmi + babysex, data = birth_data_clean)
anova(model_prelim_norace, model_prelim)
AIC(model_prelim)
AIC(model_prelim_norace)
```
Race as a variable was found to be significant. The model with race has a lower AIC and is therefore a better predictive model, so race will remain in the model.
Now we will explore some scientifically interesting interactions, starting with the interaction between sex and baby length.
```{r tidy = TRUE, message= FALSE, warning = FALSE, results="hide"}
model_inter_sex = lm(bwt ~ bhead + blength + gaweek + wtgain + ppbmi + babysex + mrace + blength * babysex, data = birth_data_clean)
summary(model_inter_sex)
AIC(model_inter_sex)
AIC(model_prelim)
```
Race does not seem to modify the relationship between baby length and birthweight adjusting for the other variables. The interaction between sex and baby length is significant and the model with this interaction has a lower AIC than the preliminary model. So the final model will include the interaction between baby sex and baby length.
```{r tidy = TRUE, message= FALSE, warning = FALSE}
model.final = lm(bwt ~ bhead + blength + gaweek + wtgain + ppbmi + babysex + mrace + blength * babysex, data = birth_data_clean)
summary(model.final)
```
### (b) Regression Diagnostics
```{r tidy = TRUE, message= FALSE, warning = FALSE}
## save fitted values and residuals
residual_data = birth_data_clean %>%
mutate(fitted = fitted(model.final), resid1 = resid(model.final)) %>%
dplyr::select(fitted, resid1)
resid_plot = ggplot(residual_data, aes(x = fitted, y = resid1)) + geom_point() +
ggtitle("Residual Plot")
qq_plot = ggplot(residual_data, aes(sample = resid1)) + stat_qq() +
ggtitle("QQ-Plot")
grid.arrange(resid_plot, qq_plot, ncol = 2)
## plot hat values
leverage_data = mutate(birth_data_clean,
hatvals = hatvalues(model.final),
index = seq_along(hatvals))
ggplot(leverage_data, aes(x=index, y=hatvals)) + geom_point() +
ylim(0,0.065) + geom_hline(yintercept = 2*(7+1)/dim(birth_data_clean)[1]) +
theme_bw() + ggtitle("Leverage Plot")
cooks_data = birth_data_clean %>%
mutate(fitted = fitted(model.final), resid = resid(model.final), cooks = cooks.distance(model.final), index = seq_along(cooks)) %>%
mutate(influence = ifelse(cooks >= 0.025, TRUE, FALSE))
ggplot(cooks_data, aes(x=index, y=cooks)) + geom_point() +
geom_hline(yintercept = 2*(7+1)/dim(birth_data_clean)[1]) +
theme_bw() + ggtitle("Cooks Distance")
influence_data = cooks_data %>%
filter(influence == TRUE) %>%
dplyr::select(c(bwt, blength, bhead, gaweek, wtgain, mrace, ppbmi, babysex, mrace))
```
### (c) Summary of Findings
The dataset analyzed contains `r dim(birth_data)[1]` observations of 20 variables. Some of the variables under consideration have a high number of NAs (>200): baby's head size, baby length, ppbmi, weight gain, delivery weight, father's race, and mother's height. Father's race has an abnormally large number of missing values: `r sum(is.na(birth_data$frace))`. Observations with any NAs were initially removed so that the analysis could be performed. For variables that were not included in the final model, these observations were restored, because all observations hold important information and should only be removed if absolutely necessary or if all values are missing. The distribution of birthweights displayed some extremely low birth weights; this was addressed with model diagnostics.
Once the data were cleaned, the model building process began with a base model of predictors that are scientifically known to be associated with birthweight: baby's head size, baby's length, and gestational age. Not all variables were considered for the final model; many were linear combinations of, or highly (even perfectly) correlated with, other variables. Interesting variables were added one at a time, and their significance and the significance of the other variables were evaluated, using 0.05 as the criterion for significance. Predictors found to be insignificant at each stage were removed, unless they were part of the starting model. This process yielded a model with 7 predictors: the base variables, weight gain, pre-pregnancy BMI, sex of the baby, and mother's race. This was the preliminary model. Next, the possibility that number of cigarettes as a continuous variable is not the ideal form of the variable was considered, because the effect of one cigarette on birthweight is not as interesting as the effect of being a heavy smoker. Therefore, smoken was converted into a categorical variable with 3 levels, and this new variable was tested for significance; it was determined to be still not significant. Finally, a scientifically interesting interaction was considered: does the relationship between the length of the baby and birthweight vary by sex, adjusting for the other variables? The interaction between length and sex was found to be significant. To determine whether including this interaction improves prediction, AICs were compared between a model with the interaction and a model without; the model with the interaction had a lower AIC, so the interaction was added to the model.
The final model included 7 predictors and 8 terms: baby length, baby head size, gestational age, weight gain, pre-pregnancy BMI, sex of the baby, mother's race, and the interaction between sex and baby length. Diagnostic tests were then run on this model to ensure that the assumptions are met and that there are not too many influential points. The plot of the residuals exhibited randomness around 0, so the assumption of constant variance is met. Additionally, the QQ-plot was linear, meaning that the error terms are normally distributed and inference can be conducted on the model. Next, leverage points were considered. While the leverage plot exhibited many high-leverage points, the influence of these points needed to be determined; this was done using a plot of Cook's distance. Five observations were exerting a lot of influence on the model. These 5 observations were examined and found to be important, so they were not removed. The two points that exerted the most influence had exceptionally low birth weights, but their gestational ages were also quite small, so they are interesting points and were not removed from the dataset.
Interpretation of Coefficients:
* bhead: For a one cm increase in the circumference of a baby's head, the expected birthweight of a baby would increase by 0.27 pounds, adjusting for length, gestational age, weight gain, BMI, sex, and race.
* blength: For a one cm increase in the length of a baby, the expected birthweight would increase by 0.188 pounds, adjusting for head circumference, gestational age, weight gain, BMI, sex, and race (not so meaningful because of the interaction).
* gaweek: For a one week increase in gestational age, the expected birthweight of a baby would increase by 0.019 pounds, adjusting for head circumference, length, weight gain, BMI, sex, and race.
* wtgain: For a one pound increase in weight gain, the expected birthweight of a baby would increase by 0.011 pounds, adjusting for head circumference, length, gestational age, BMI, sex, and race.
* ppbmi: For a one unit increase in ppbmi, the expected birthweight of a baby would increase by 0.019 pounds, adjusting for head circumference, length, weight gain, gestational age, sex, and race.
* mraceBlack: The expected birthweight of a baby born to a black mother is 0.20 pounds less than that of a baby born to a white mother, adjusting for head circumference, length, weight gain, gestational age, BMI, and sex.
* mraceAsian: The expected birthweight of a baby born to an Asian mother is 0.33 pounds less than that of a baby born to a white mother, adjusting for head circumference, length, weight gain, gestational age, BMI, and sex.
* mracePuertoRican: The expected birthweight of a baby born to a Puerto Rican mother is 0.20 pounds less than that of a baby born to a white mother, adjusting for head circumference, length, weight gain, gestational age, BMI, and sex.
* mraceOther: The expected birthweight of a baby born to a mother of race category "other" is 0.15 pounds less than that of a baby born to a white mother, adjusting for head circumference, length, weight gain, gestational age, BMI, and sex.
* Interaction between babysex and blength: The difference between males and females in the rate of change of birthweight with baby length is -0.024, adjusting for the other variables.
\pagebreak
#### Solution 2:
\textcolor{red}{
Solution 2 did an excellent job of defining important predictors and stating expectations at the outset; it also identifies possible confounders and collinear variables at the outset. As in Solution 1, these first steps were augmented by careful exploratory analyses. This approach compared backward and forward selection on the subset of important predictors; comparison using AIC isn't all that helpful here, since the underlying datasets are different, but the qualitative comparison of included predictors can be helpful in an exploratory analysis. This solution also carefully evaluated effect modification and confounding by the smoking variable, using the tools learned in categorical data analysis.
}
```{r setup1, echo=FALSE, message= FALSE, warning= FALSE, results='hide'}
rm(list = ls())
library(dplyr)
library(ggplot2)
library(tidyr)
library(gridExtra)
library(MASS)
library(glmnet)
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW4_Data_BWT.RDA"))
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW4_Data_Accel.RDA"))
```
The following dataset contains 20 variables of interest as potential predictors of birthweight.
```{r, echo=TRUE, results='hide'}
summary(data.bwt)
#Clean the Data
clean_bwt= dplyr::select(data.bwt, c(babysex, bhead, blength, bwt,
delwt, fincome, frace, gaweek,
malform, menarche, mheight, momage, mrace, parity, ppbmi,
ppwt, smoken, pnumlbw, pnumsga, wtgain)) %>%
mutate(., babysex= factor(babysex, levels=c("1", "2"),
labels=c("Male", "Female"))) %>%
mutate(., mrace= factor(mrace, levels=c("1", "2", "3", "4", "8"),
labels=c("White", "Black", "Asian", "Puerto Rican", "Other"))) %>%
dplyr::select(., -c(pnumlbw, pnumsga))%>%
mutate(., malform= factor(malform, levels=c("0", "1"),
labels=c("Absent", "Present"))) %>%
filter(., !is.na(bwt))
summary(clean_bwt)
```
I cleaned the data above by converting some of the variables from numeric to factor, and selecting the 20 variables of interest from the greater dataset, which contains 69 variables. I removed the pnumlbw and pnumsga variables completely from the birthweight dataset because all of the observations of both variables were 0, which I found in the summary of the dataset. I also removed all of the missing values of our outcome, birthweight, because if the outcome is missing then the observation cannot tell us anything. I chose not to remove anything else from the dataset until I produce a model of interest. I also noted that one variable in particular, father's race, had a very large number of missing values, over 2000 observations.
### Part A
* Step 1: Univariable Analysis
* Step 2: Model Selection Process
* Step 3: Interactions
* Step 4: Confounding
Step 1: Univariable Analysis
To begin the selection process, I identified variables in the model that were of interest, and thus likely to be associated with birthweight. The variables I was most interested in were: sex of the baby, head circumference, length of the baby, gestational age in weeks, pre-pregnancy weight of the mother, mother's race, and number of cigarettes smoked by the mother on average per day during pregnancy. I would expect birthweight to be greater for male babies than for female babies, babies with a larger head circumference to be heavier, greater length to imply a heavier baby, higher gestational age to lead to heavier babies, and women who weighed more pre-pregnancy to have larger babies, perhaps because they eat more overall throughout their pregnancy. Pre-pregnancy mother's weight may also be correlated with several other variables of interest, like pre-pregnancy BMI, mother's height, and delivery weight of the mother, all three of which had more missing observations than pre-pregnancy weight; therefore, I chose to start with pre-pregnancy weight of the mother. Mother's race is an important variable, since individuals of one race have a different genetic combination than those of another race, and this would be an important association to find in the dataset. Lastly, I would expect a greater number of cigarettes smoked to lead to smaller birthweights, because smoking is typically associated with birth defects or changes for the baby. Father's race was also of interest initially, but since over 2000 observations of this variable were missing, I decided to exclude it from the list of variables of interest. Each of the predictors was graphed against the outcome, birthweight, and a univariable analysis was performed on each of the variables of interest to determine whether there is any association between each individual variable and the birthweight of the baby.
```{r, echo=TRUE, eval=FALSE}
#Plot the data to see the associations
g1= ggplot(clean_bwt, aes(x=babysex, y=bwt)) + geom_point() + theme_bw()
g2= ggplot(clean_bwt,aes(x=bhead, y=bwt)) + geom_point() + theme_bw()
g3= ggplot(clean_bwt,aes(x=blength, y=bwt)) + geom_point() + theme_bw()
g4= ggplot(clean_bwt,aes(x=delwt, y=bwt)) + geom_point() + theme_bw()
g5= ggplot(clean_bwt,aes(x=fincome, y=bwt)) + geom_point() + theme_bw()
g6= ggplot(clean_bwt,aes(x=frace, y=bwt)) + geom_point() + theme_bw()
g7= ggplot(clean_bwt,aes(x=gaweek, y=bwt)) + geom_point() + theme_bw()
g8= ggplot(clean_bwt,aes(x=malform, y=bwt)) + geom_point() + theme_bw()
g9= ggplot(clean_bwt,aes(x=menarche, y=bwt)) + geom_point() + theme_bw()
g10= ggplot(clean_bwt,aes(x=mheight, y=bwt)) + geom_point() + theme_bw()
g11= ggplot(clean_bwt,aes(x=momage, y=bwt)) + geom_point() + theme_bw()
g12= ggplot(clean_bwt,aes(x=mrace, y=bwt)) + geom_point() + theme_bw()
g13= ggplot(clean_bwt,aes(x=parity, y=bwt)) + geom_point() + theme_bw()
g14= ggplot(clean_bwt,aes(x=ppbmi, y=bwt)) + geom_point() + theme_bw()
g15= ggplot(clean_bwt,aes(x=ppwt, y=bwt)) + geom_point() + theme_bw()
g16= ggplot(clean_bwt,aes(x=smoken, y=bwt)) + geom_point() + theme_bw()
g17= ggplot(clean_bwt,aes(x=wtgain, y=bwt)) + geom_point() + theme_bw()
p= grid.arrange(g1, g2, g3, g4, g5, g6, g7, g8, g9, nrow=3, ncol=3)
p1= grid.arrange(g10, g11, g12, g13, g14, g15, g16, g17, nrow=3, ncol=3)
```
```{r, echo=TRUE, results='hide'}
#check correlation of similar variables
#Clean data to remove NA values for variables of interest
clean_bwt1= filter(clean_bwt, !is.na(babysex), !is.na(bhead), !is.na(blength),
!is.na(mrace), !is.na(ppwt), !is.na(smoken), !is.na(gaweek))
#Univariable analysis of variables of interest
u1= lm(bwt ~ babysex, data=clean_bwt1)
summary(u1)
u2= lm(bwt ~ bhead, data=clean_bwt1)
summary(u2)
u3= lm(bwt ~ blength, data=clean_bwt1)
summary(u3)
u4= lm(bwt ~ ppwt, data=clean_bwt1)
summary(u4)
u5= lm(bwt ~ mrace, data=clean_bwt1)
summary(u5)
u6= lm(bwt ~ smoken, data=clean_bwt1)
summary(u6)
u7= lm(bwt ~ gaweek, data=clean_bwt1)
summary(u7)
```
Looking at the graphs, I saw linear trends in the associations between baby head circumference and birthweight, baby length and birthweight, and gestational age in weeks and birthweight. The univariable analyses for baby sex, head circumference, baby length, pre-pregnancy weight, number of cigarettes, and gestational age all had significant p-values. One of the four categorical levels of mother's race had a non-significant p-value: the comparison of birthweight between babies of mothers of other races and babies of white mothers had a p-value of 0.3985. However, since the associations with birthweight for the other three races (Black, Asian, and Puerto Rican) were significant, and because mother's race is an interesting variable for the study of birthweight, I chose to continue to analyze this variable.
Step 2: Model Selection Process
I proceeded with forward selection using the 7 variables of interest, to see if a multivariable model with all of my predictors of interest would have significant terms.
```{r, echo=FALSE, results="hide"}
#Multivariable model
modelA= lm(bwt~ babysex + bhead + blength + ppwt + mrace + gaweek + smoken,
data=clean_bwt1)
summary(modelA)
#Anova for mrace
lmfull= lm(bwt~ babysex + bhead + blength + ppwt + gaweek + mrace + smoken,
data=clean_bwt1)
lmred= lm(bwt~ babysex + bhead + blength + ppwt + gaweek + smoken,
data=clean_bwt1)
anova(lmfull, lmred)
```
The multivariable model fitted with the 7 variables of interest showed significance for nearly all variables at the 5% significance level: 9 of the 10 coefficients in the model were significant. One of the coefficients for mother's race was non-significant, so I ran an ANOVA comparing models with and without mother's race. The test had a significant p-value < 0.05, showing that mother's race is a significant predictor of birthweight and that I should keep the mother's race term in the model.
Next, I ran a backward selection process to see if my variables of interest were also included in a model that began with all of the variables in the tidied dataset. The variable for weight gain was removed from the model first because it produced an NA coefficient in the selection below, probably because it is collinear with other variables in the dataset, such as pre-pregnancy weight of the mother or delivery weight.
```{r, echo=TRUE, eval=FALSE, results="hide"}
#Backward Selection
#Fit model with all variables of interest
model1= lm(bwt ~., data=clean_bwt)
summary(model1)
AIC(model1)
#Remove wtgain because NA
model2= lm(bwt~. -wtgain, data=clean_bwt)
summary(model2)
AIC(model2)
#Remove frace
model3= lm(bwt~. -(wtgain + frace), data=clean_bwt)
summary(model3)
AIC(model3)
#remove mheight
model4= lm(bwt~. -(wtgain+ frace+ mheight), data=clean_bwt)
summary(model4)
AIC(model4)
#remove malform
model5= lm(bwt~. -(wtgain + frace + mheight + malform), data=clean_bwt)
summary(model5)
AIC(model5)
#remove momage
model6= lm(bwt~. -(wtgain + frace+ mheight + malform + momage), data=clean_bwt)
summary(model6)
AIC(model6)
#remove parity
model7= lm(bwt~. -(wtgain + frace+ mheight + malform + momage + parity),
data=clean_bwt)
summary(model7)
AIC(model7)
#remove ppwt
model8= lm(bwt~. -(wtgain + frace+ mheight + malform + momage + parity +
ppwt), data=clean_bwt)
summary(model8)
AIC(model8)
#remove fincome
model9= lm(bwt~. -(wtgain + frace+ mheight + malform + momage + parity +
ppwt + fincome), data=clean_bwt)
summary(model9)
AIC(model9)
#remove menarche
model10= lm(bwt~. -(wtgain + frace+ mheight + malform + momage + parity +
ppwt + fincome + menarche), data=clean_bwt)
summary(model10)
AIC(model10)
#Remove smoken
model11= lm(bwt~. -(wtgain + frace+ mheight + malform + momage + parity +
ppwt + fincome + menarche + smoken), data=clean_bwt)
summary(model11)
AIC(model11)
#Test for possible confounding between delivery weight and ppbmi
#Remove delwt
model12= lm(bwt~. -(wtgain + frace+ mheight + malform + momage + parity +
ppwt + fincome + menarche + smoken + delwt), data=clean_bwt)
summary(model12)
AIC(model12)
```
```{r, echo=TRUE, results="hide"}
#Final model comparison
#Backward final
model13= lm(bwt~. -(wtgain + frace+ mheight + malform + momage + parity +
ppwt + fincome + menarche + ppbmi + smoken), data=clean_bwt)
summary(model13)
AIC(model13)
#Final Model
modelA= lm(bwt~ babysex + bhead + blength + ppwt + gaweek + mrace + smoken,
data=clean_bwt1)
summary(modelA)
#Clean dataset for this new model
clean_bwt2= filter(clean_bwt, !is.na(babysex), !is.na(bhead), !is.na(blength),
!is.na(mrace),!is.na(smoken), !is.na(gaweek), !is.na(delwt))
```
The backward selection produced a model with 6 variables instead of 7, but all 6 terms were among the variables I found of interest in my original multivariable model. The AIC for the backward selection model is `r AIC(model13)`, which is much less than the AIC for the forward selection model, `r AIC(modelA)`. However, the two values cannot be compared directly: the models were fit to different subsets of observations because of differing patterns of missingness, and AIC is only comparable between models fit to the same data. I tested for correlation between delivery weight and pre-pregnancy BMI, since both were left in this final model but I had originally hypothesized that one may be correlated with the other. When delivery weight is removed, the pre-pregnancy BMI p-value becomes non-significant and is rather high, so I decided to keep only delivery weight in the backward selection model, because it has a significant, smaller p-value when in the model alone, and it reduced the AIC more.
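To illustrate why the two AIC values are not directly comparable, the following hedged sketch (using simulated data, not the birthweight dataset) shows that AIC depends on which observations are used: `lm` silently drops rows with missing predictors, so models with different missingness patterns are fit to different data. Refitting both candidates on the shared complete cases puts them on equal footing.

```r
## Simulated example: AIC is only comparable between models fit to the
## same observations.
set.seed(8111)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
x2[sample(n, 40)] <- NA                 # mimic missingness in one predictor

fit_small <- lm(y ~ x1)                 # uses all 200 rows
fit_large <- lm(y ~ x1 + x2)            # silently drops the 40 NA rows
c(AIC(fit_small), AIC(fit_large))       # NOT comparable: different n

## fair comparison: restrict both fits to the common complete cases
keep <- complete.cases(x1, x2, y)
fit_small_cc <- lm(y ~ x1, subset = keep)
fit_large_cc <- lm(y ~ x1 + x2, subset = keep)
c(AIC(fit_small_cc), AIC(fit_large_cc)) # comparable: same observations
```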
Since the backward selection result differed slightly from my original model, I made an alteration: instead of including pre-pregnancy weight in my original model, I included delivery weight and removed pre-pregnancy weight, as delivery weight of the mother was more significant. This still captures the idea that the weight of the mother impacts the birthweight of the baby, but with a different variable. In addition, I chose my original model for examining interactions and confounding because smoking was a variable of interest and was significant in my multivariable model.
Step 3: Interactions
Interactions between variables can be informative and important for interpreting the data (stratified by the effect modifier or not), so I proceeded to check interaction terms one by one in the model. I hypothesized that the number of cigarettes smoked could affect the association between baby head circumference and birthweight, since smoking during pregnancy can sometimes cause birth defects. I also hypothesized that the number of cigarettes smoked could affect the association between delivery weight and birthweight, since smoking alters the weight of the smoker and may change how the mother's weight explains the weight of the baby.
```{r, echo=TRUE}
#Interaction between bhead and smoken
i2= lm(bwt~ babysex + bhead + blength + delwt + gaweek + mrace + smoken + bhead*smoken,
data=clean_bwt2)
summary(i2)
#Interaction between delwt and smoken
i3= lm(bwt~ babysex + bhead + blength + delwt + gaweek + mrace + smoken + delwt*smoken,
data=clean_bwt2)
summary(i3)
```
I tested two interactions that I thought would be interesting and potentially informative, but both had non-significant p-values (p > 0.05). Other interactions could be tested in future studies, such as interactions between mother's race and some of the other variables of interest.
Step 4: Confounding
This process started with my most recent model and examined whether smoking is a confounder, since smoking is known to alter the weight of the smoker and, when done during pregnancy, the weight of the baby. A potential confounder must satisfy three requirements: it must be associated with the outcome, it must be associated with the explanatory variable (predictor), and it must not be in the causal pathway between the predictor and outcome.
```{r, echo=TRUE}
#Smoking and mrace
c1= ggplot(clean_bwt2, aes(x=mrace, y=smoken)) + geom_point()
#Smoking and delivery weight
c2= ggplot(clean_bwt2, aes(x=delwt, y=smoken)) + geom_point()
#Smoking and babylength
c3= ggplot(clean_bwt2, aes(x=blength, y=smoken)) + geom_point()
#Smoking and babyhead
c4= ggplot(clean_bwt2, aes(x=bhead, y=smoken)) + geom_point()
#Smoking and gaweek
c5= ggplot(clean_bwt2, aes(x=gaweek, y=smoken)) + geom_point()
```
Using the graphs, I hoped to identify relationships or possible correlations between smoking and several variables (I did not include the graphs in the output because they are exploratory). If smoking is associated with any of these variables, then, since smoking is also associated with the outcome, birthweight, it could be identified as a possible confounder. From the graphs, smoking may be associated with race, does not appear strongly associated with delivery weight, may be associated with baby length and baby head circumference, and is possibly associated with gestational age in weeks, but the graphs are difficult to interpret definitively. In further analysis, confounding could be quantified by checking whether adjusting for the potential confounder changes the coefficient of the confounded variable by more than 10%. The best way to handle these confounders is to leave them in the model and to stratify by the confounding variable, so that the effects can be interpreted without bias.
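The >10% change-in-coefficient rule mentioned above can be carried out directly. The hedged sketch below uses simulated data with illustrative variable names (not the birthweight variables): the exposure coefficient is compared with and without adjustment for the potential confounder.

```r
## Simulated change-in-coefficient check; variable names are illustrative.
set.seed(8111)
n        <- 500
smoke    <- rpois(n, 4)                      # potential confounder
exposure <- 0.3 * smoke + rnorm(n)           # associated with the confounder
outcome  <- 1 - 0.5 * exposure - 0.4 * smoke + rnorm(n)

fit_adj   <- lm(outcome ~ exposure + smoke)  # adjusted for the confounder
fit_crude <- lm(outcome ~ exposure)          # confounder omitted

b_adj   <- coef(fit_adj)["exposure"]
b_crude <- coef(fit_crude)["exposure"]
pct_change <- 100 * abs(b_crude - b_adj) / abs(b_adj)
pct_change > 10   # a change above 10% flags possible confounding
```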
### Part B.
```{r, echo=TRUE}
#Fit the final model on the cleaned data
finalbwt= dplyr::select(clean_bwt2, -c(wtgain, frace, mheight, malform, momage, parity,
                        ppwt, fincome, menarche, ppbmi))
finalmod= lm(bwt~., data=finalbwt)
#Residuals
finalbwt= mutate(finalbwt, fitted= fitted(finalmod), resid=resid(finalmod))
#Plot residuals against fitted values
ggplot(finalbwt, aes(x=fitted, y=resid)) + geom_point() + theme_bw() +
labs(title="Residuals against Fitted Values")
```
```{r, echo=TRUE, fig.width=10}
#Graphs of final variables
r1= ggplot(finalbwt, aes(x=babysex, y=bwt)) + geom_point() + theme_bw()
r2= ggplot(finalbwt, aes(x=bhead, y=bwt)) + geom_point() + theme_bw()
r3= ggplot(finalbwt, aes(x=blength, y=bwt)) + geom_point() + theme_bw()
r4= ggplot(finalbwt, aes(x=gaweek, y=bwt)) + geom_point() + theme_bw()
r5= ggplot(finalbwt, aes(x=mrace, y=bwt)) + geom_point() + theme_bw()
r6= ggplot(finalbwt, aes(x=delwt, y=bwt)) + geom_point() + theme_bw()
r7= ggplot(finalbwt, aes(x=smoken, y=bwt)) + geom_point() + theme_bw()
#Arrange in grid
grid.arrange(r1, r2, r3, r4, r5, r6, r7, nrow=2, ncol=4)
```
```{r, echo=TRUE}
# QQ plot
ggplot(finalbwt, aes(sample = resid)) + stat_qq() + theme_bw()+ labs(title="QQPlot")
#Leverage plots
set.seed(1)
#Create hat values
finalbwt= mutate(finalbwt, hatvals= hatvalues(finalmod),index= seq_along(hatvals))
#Graph leverage points and decide on a cutoff point
ggplot(finalbwt, aes(x=index, y=hatvals)) + geom_point() +
geom_hline(yintercept = 6*8/3546) + theme_bw() + labs(title="Possible Leverage Points")
#Quantify number of leverage points
sum(finalbwt$hatvals>=48/3546)
#Cook's distance
finalbwt= mutate(finalbwt, cooks=cooks.distance(finalmod), index= seq_along(cooks))
#Graph cook's distance and cutoff point
ggplot(finalbwt, aes(x=index, y=cooks)) + geom_point() +
geom_hline(yintercept = 2*8/3546) + theme_bw() + labs(title="Possible Influential Points")
#Quantify number of influential points
sum(finalbwt$cooks>=16/3546)
```
The graphs above provided information about the residual plots, leverage points and influential points.
1. Residual Plots/QQplot
The residual plots and QQ plot were used to check whether the assumptions for linear regression hold. The QQ plot showed a very linear trend, indicating that the normality assumption holds. The plot of residuals against fitted values was overall spread evenly around the y = 0 line, with some extreme exceptions on the left side of the graph that may be potential outliers. To determine which variables these outliers may come from, I also created plots of each individual variable against the outcome, birthweight. Based on these plots, I tried to identify potential outliers, or points that did not make sense given the variable. Most of the points, however, made sense with the data, and the individual plots do not take multivariable relationships into account. The model reasonably meets the assumptions of constant variance (homoscedasticity, shown by the even spread) and linearity (shown by the randomness of points around zero).
2. Leverage Points
I created a graph of the leverage points for the final linear model of 7 variables against birthweight. There were many leverage points well above the rest of the data. When I used the cutoff 2(p+1)/n, where p is the number of parameters and n is the number of observations, the cutoff was too low and flagged around 500 data points. I increased the cutoff to a more reasonable level (6(p+1)/n) and then counted the observations above it, which is `r sum(finalbwt$hatvals>=48/3546)`. The plots of each variable against the outcome were used to see whether any variable produced an unusually large number of leverage points.
3. Cook's Plots
The same protocol used for leverage points was applied to the Cook's distance plot to look for highly influential points in the final dataset. I originally used a cutoff of 4(p+1)/n, but later lowered it to 2(p+1)/n because the original cutoff did not capture enough influential points. I found that there were `r sum(finalbwt$cooks>=16/3546)` influential points in the final dataset.
I did not delete any of the values from the final birthweight model because some of these potential outliers and influential points may be important in describing the association between the variables and birthweight. Perhaps, a more detailed analysis could be done on the particular subjects who produced these variables, to determine if removing the influential or leverage points is worthwhile for the analysis.
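The leverage and Cook's distance cutoffs used above follow standard rules of thumb. As a hedged, self-contained sketch (simulated data, not the birthweight model), the following shows how `hatvalues` and `cooks.distance` flag a deliberately planted extreme point:

```r
## Simulated example of textbook leverage/influence cutoffs:
## hat values above 2(p+1)/n flag high leverage; Cook's distance
## above 4(p+1)/n flags potentially influential points.
set.seed(8111)
n <- 300
x <- rnorm(n)
x[1] <- 8                          # plant one extreme predictor value
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

p <- length(coef(fit)) - 1         # number of slope parameters
lev_cut  <- 2 * (p + 1) / n
cook_cut <- 4 * (p + 1) / n

sum(hatvalues(fit) > lev_cut)      # count of high-leverage points
sum(cooks.distance(fit) > cook_cut) # count of flagged influential points
which.max(hatvalues(fit))          # the planted point (index 1) has the top hat value
```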
### Part C.
```{r, echo=TRUE}
#final model summary
summary(finalmod)
```
The birthweight dataset in this problem was used to model the outcome, birthweight, against a set of scientifically important predictors. The original dataset contains `r NROW(data.bwt)` observations and `r length(data.bwt)` variables, with a total of `r sum(!complete.cases(data.bwt))` incomplete cases. Several variables of interest for model building have a large number of missing values (more than 300 observations each). Frace, the variable describing father's race, had the largest number of missing values, `r sum(!complete.cases(data.bwt$frace))`. There were `r length(clean_bwt)` variables of interest, and after cleaning the data the number of observations in the tidied dataset was `r NROW(clean_bwt)`. The initially selected variables of interest were baby sex, baby length, baby head circumference, mother's race, gestational age in weeks, average number of cigarettes smoked per day by the mother during pregnancy, and delivery weight. The final model contains these 7 predictors. The model was built by putting the variables of interest into a model; a backward selection was then run to see whether the variables of interest appeared in the backward-selected model as well. Following these two selection methods, interactions were tested in the finalized model, but none were statistically significant, and thus none were included in the final model. Scatterplots were generated to explore a few interesting potential confounders left in the model. The regression diagnostics indicate that the model generally meets the linear regression assumptions.
The leverage and Cook's distance plots indicated some values that could potentially be removed from the dataset for being high leverage or highly influential points, but upon looking at the graphs of each variable against the outcome it was difficult to tell which points were truly leverage or influential points, so no observations were removed. Compared to male babies, female babies weigh approximately 0.05796 pounds more on average, holding all else constant. For every 1 centimeter increase in baby's head circumference at birth, birthweight increases by 0.2732 pounds on average, holding all else constant. For every 1 centimeter increase in baby's length at birth, birthweight increases by 0.1762 pounds on average, holding all else constant. For every 100 pound increase in delivery weight of the mother, birthweight increases by 0.5698 pounds on average, holding all else constant. For every 10 week increase in gestational age, birthweight increases by 0.2044 pounds on average, holding all else constant. Compared to white mothers, Black mothers have babies with birthweights approximately 0.2070 pounds less on average, holding all else constant. Compared to white mothers, Puerto Rican mothers have babies with birthweights approximately 0.1911 pounds less on average, holding all else constant. For every 10 cigarette increase in the average number of cigarettes smoked per day during pregnancy, birthweight decreases by 0.0579 pounds on average, holding all else constant.
Note: I interpreted all of the statistically significant coefficients in the final model; these did not include the terms comparing Asian mothers or other-race mothers to white mothers.
\pagebreak
#### Solution 3:
\textcolor{red}{
Solution 3 thought carefully about the motivation for this study and about variables in the dataset. As an example, in addition to noting the high degree of missingness for father's race, this Solution pointed out the general concordance between father's and mother's race for observed data. Additionally, a conscious choice was made to omit birth length and head circumference, since those would be measured at the same time as weight. The adjustment for known confounders was based on a literature review, and the analysis was guided by hypotheses regarding socioeconomic status. Showing the raw data for subjects with high Cook's distance is helpful, but hard to contextualize; Solution 4 has a neat approach to this aspect.
}
```{r setup2, echo=FALSE, message=FALSE, warning=FALSE}
rm(list = ls())
library(dplyr)
library(ggplot2)
library(broom)
library(tidyr)
library(MASS)
library(glmnet)
library(knitr)
library(reshape2)
opts_chunk$set(tidy.opts=list(width.cutoff=60))
options(scipen = 999)
```
```{r clean, echo=FALSE}
#load in data
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW4_Data_BWT.RDA"))
#clean data
data = mutate(data.bwt,
              babysex = factor(babysex, labels = c("male", "female")),
              frace = ifelse(frace == 9, NA, frace),
              frace = factor(frace, labels = c("White", "Black", "Asian", "Puerto Rican", "Other")),
              mrace = factor(mrace, labels = c("White", "Black", "Asian", "Puerto Rican", "Other")),
              malform = factor(malform, labels = c("absent", "present")),
              fincome = ifelse(fincome == 99, NA, fincome),
              menarche = ifelse(menarche == 99, NA, menarche)) %>%
  dplyr::select(babysex, bhead, blength, bwt, fincome, frace, gaweek, menarche,
                mheight, momage, mrace, parity, ppbmi, ppwt, smoken, wtgain)
```
*a) Build a regression model with the goal of understanding and interpreting scientifically meaningful relationships in the data. To construct and evaluate your model, you should focus on establishing statistical significance while maintaining interpretability. Clearly describe the models you consider and your model selection process.*
The following variables were excluded from the analysis:
1) previous number of low birthweight babies (pnumlbw) - all responses in data set = 0
2) previous number of small for gestational age babies (pnumsga) - all responses in data set = 0
3) presence of malformations (malform) - only 8 babies in data set had malformations, the small proportion makes it unlikely we'd have the power to detect a significant association
4) mother's delivery weight (delwt) - since mother's pre-pregnancy weight (ppwt) and mother's weight gain during pregnancy (wtgain) are also variables of interest, this variable is unnecessary
5) father's race (frace) - `r NROW(filter(data, is.na(frace)))` subjects were missing this variable. Furthermore, for cases where neither mother's nor father's race was missing, mother's race was the same as father's `r round(100*NROW(filter(data, frace==mrace))/NROW(filter(data, !is.na(frace))),2)`% of the time. This indicates that mother's and father's race are highly concordant, so including father's race in addition to mother's does not add much information.
```{r clean2, echo=FALSE}
#drop frace
data = dplyr::select(data, -frace)
#create a function for correlation reporting in panel graphic
panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
# correlation coefficient
r <- cor(x, y, use="pairwise.complete.obs")
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste("r= ", txt, sep = "")
text(0.5, 0.6, txt)
# p-value calculation
p <- cor.test(x, y, use="pairwise.complete.obs")$p.value
txt2 <- format(c(p, 0.123456789), digits = digits)[1]
txt2 <- paste("p= ", txt2, sep = "")
if(p<0.01) txt2 <- paste("p= ", "<0.01", sep = "")
text(0.5, 0.4, txt2)
}
```
Smoking is known to be associated with low birthweight. I am interested in how the amount of smoking (as quantified by the average number of cigarettes smoked per day during pregnancy) affects birthweight.
A scatter plot of birthweight and number of cigarettes smoked per day is given below. Based on this univariate assessment, there appears to be a small, negative correlation between the average number of cigarettes smoked per day during pregnancy and birthweight.
```{r, echo=FALSE, results='markup', tidy=TRUE}
pairs(~bwt+smoken, data=data, upper.panel=panel.cor)
```
In addition to smoking, socioeconomic status, maternal age and maternal nutrition have also been previously published as factors associated with birthweight. In order to assess how the amount of smoking is associated with birthweight, I will need to adjust my analysis for these other known factors.
Before building the model, I will first look at the univariate associations between the socioeconomic variables in the dataset and birthweight. Family income, race, and parity may all serve as indicators of socioeconomic status. Tables of birthweight stratified by these variables are given below. On average, birthweight seems to increase with family income. Due to the high number of categories for this variable, I think it is reasonable to treat family income as continuous, assigning each category the midpoint of the category's range. A scatter plot and the correlation between birthweight and family income are given below as well; both indicate a positive correlation. Differences in average birthweight between maternal races may also exist: based on the table below, White mothers have a higher mean birthweight than Black, Asian, Puerto Rican, and other mothers. Parity may also be associated with birthweight; however, there are few data points for mothers with parity > 2, and the scatter plot of parity shows no apparent trend between parity and birthweight. Most mothers in this dataset have never had a child before (parity = 0), so it is likely we do not have the power to detect an association if there is one. Since race and family income can serve as surrogates for socioeconomic status, I will not consider parity for the full model.
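The midpoint recoding described above might look like the following hedged sketch. The income category codes and dollar ranges here are purely hypothetical, chosen only to illustrate the recoding; the actual coding of `fincome` is not reproduced.

```r
## Hypothetical recoding of an ordinal income category to the midpoint of
## its range (category codes and midpoints are illustrative only).
income_code <- c(1, 2, 3, 2, 1)                     # hypothetical category codes
midpoints   <- c("1" = 250, "2" = 750, "3" = 1500)  # hypothetical range midpoints
income_cont <- unname(midpoints[as.character(income_code)])
income_cont                                         # 250 750 1500 750 250
```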
```{r, echo=FALSE, results='markup', tidy=TRUE}
filter(data, !is.na(fincome)) %>% group_by(fincome) %>% summarise(n = n(), mean.bwt = mean(bwt, na.rm=TRUE), sd.bwt = sd(bwt, na.rm=TRUE), median.bwt = median(bwt, na.rm=TRUE))
pairs(~bwt+fincome, data=data, upper.panel=panel.cor)
filter(data, !is.na(mrace)) %>% group_by(mrace) %>% summarise(n = n(), mean.bwt = mean(bwt, na.rm=TRUE), sd.bwt = sd(bwt, na.rm=TRUE), median.bwt = median(bwt, na.rm=TRUE))
filter(data, !is.na(parity)) %>% group_by(parity) %>% summarise(n = n(), mean.bwt = mean(bwt, na.rm=TRUE), sd.bwt = sd(bwt, na.rm=TRUE), median.bwt = median(bwt, na.rm=TRUE))
pairs(~bwt+parity, data=data, upper.panel=panel.cor)
```
A scatter plot and correlation for birthweight and maternal age is given below. Younger mothers appear more likely to have lower birthweight babies compared to older mothers and there is a small, but positive correlation between age and birthweight.
```{r, echo=FALSE, results='markup', tidy=TRUE}
pairs(~bwt+momage, data=data, upper.panel=panel.cor)
```
Scatter plots to explore the relationship between mother's nutrition and birthweight are given below. There is a small but positive correlation between baby weight and pre-pregnancy BMI. Baby weight is more highly correlated with pre-pregnancy weight, and even more highly correlated with pregnancy weight gain. Looking at the scatter plot matrix below, pre-pregnancy BMI and pre-pregnancy weight are strongly correlated (r=0.84), so there may be a multicollinearity issue with having both in the model. Arguably BMI is more informative than weight alone, so I will include BMI rather than weight in my model. However, weight gain without a corresponding height is not as informative as a change in BMI would be, so I have created a new variable called bmi_chg (calculated as wtgain/(mheight × mheight) × 703), which is also positively correlated with birthweight. Pre-pregnancy BMI and change in BMI during pregnancy will serve as indicators of mother's nutrition in the model.
```{r, echo=FALSE, results='markup', tidy=TRUE}
data = mutate(data, bmi_chg=(wtgain/(mheight*mheight))*703)
pairs(~bwt+ppbmi+ppwt+wtgain+mheight+bmi_chg, data=data, upper.panel=panel.cor)
```
Variables in the dataset that are not associated with socioeconomic status, maternal age and nutrition may also be important predictors of birthweight. Other variables in this dataset are baby's head circumference, baby's length, gestational age in weeks and menarche. A scatter plot matrix of all of these variables with birthweight is given below. Mother's age of menarche does not appear to be associated with birthweight (correlation close to 0). Gestational age is positively associated with weight. This is an important variable to include in the final model since gestation time directly corresponds to baby's development and babies born pre-maturely will naturally weigh less than they would have had they not been born pre-maturely. Both baby's head circumference and baby length are highly associated with baby weight. However, I would argue that these variables are not appropriate for the model. Since we are aiming to predict birthweight, temporality wise, knowing the baby's length and head circumference does not help us as they are measured at the same time birthweight is. Furthermore, one could argue that all 3 are measuring the same thing, baby size. Therefore, of the 4 variables mentioned above, only gestational age will be included in the model.
```{r, echo=FALSE, results='markup'}
pairs(~bwt+bhead+blength+gaweek+menarche, data=data, upper.panel=panel.cor)
```
Based on the univariate analyses and hypothesis driven selection of variables, the final model of birthweight and average number of cigarettes smoked per day will be adjusted for family income, mother's race, mother's age, pre-pregnancy BMI, BMI change during pregnancy and gestational age. The results of the model are given below.
```{r, echo=FALSE, results='markup'}
mult_fit = lm(bwt ~ smoken + fincome + mrace + momage + ppbmi + bmi_chg + gaweek, data = data)
summary(mult_fit)
```
**b) Conduct appropriate regression diagnostics for your selected model. Show residual plots; identify high leverage points and influential points, if they exist. If necessary, update your final model based on these diagnostics.**
A plot of the residuals versus the fitted values is given below (Figure 1). Based on this figure, there does not appear to be any pattern in the residuals, indicating the constant variance assumption has not been violated. A QQ plot of the residuals is given in Figure 2. The points fall reasonably well on the line, indicating the normality assumption has not been violated.
Figure 1: Residuals against fitted values
```{r, echo=FALSE, results='markup'}
data_for_plots = dplyr::select(data, bwt, smoken, fincome, mrace, momage, ppbmi, bmi_chg, gaweek)
data_for_plots = na.omit(data_for_plots)
## save fitted values and residuals
plotset = data_for_plots %>% mutate(fitted = fitted(mult_fit), resid = resid(mult_fit))
ggplot(plotset, aes(x = fitted, y = resid)) + geom_point() + theme_bw()
```
\pagebreak
Figure 2: QQ Plot of Residuals
```{r, echo=FALSE, results='markup'}
ggplot(plotset, aes(sample = resid)) + stat_qq() + theme_bw()
```
```{r, echo=FALSE, results='markup'}
## plot hat values
plotset = mutate(plotset,
hatvals = hatvalues(mult_fit),
index = seq_along(hatvals),
cooks = cooks.distance(mult_fit),
index2 = seq_along(cooks))
highlev = filter(plotset, hatvals>3*(11)/NROW(plotset))
highinf = filter(plotset, cooks>4/NROW(plotset))
reallyhighinf = filter(plotset, cooks>0.01) %>% dplyr::select(-index2)
```
A leverage plot is given in Figure 3. The horizontal line is 3(p+1)/n; points above the line are considered high leverage. There are `r NROW(highlev)` high leverage points. A Cook's distance plot is given in Figure 4. The horizontal line is 4/n; some texts consider points above the line influential. There are `r NROW(highinf)` points above the line. Other texts state that Cook's D values > 1 indicate influential points; we have no values > 1 (likely due to our large N). Still other texts say you should look at your plot rather than depend on a cutoff, and instead identify the points that stand out from the rest of the data. I looked at subjects who had Cook's D values greater than 0.01. A table of the data for these subjects is given below (Table 1). None of these data points stand out as unreasonable. Several have very low birthweights (0 to 1.52 pounds), but these babies all have low gestational ages. Others have a high average number of cigarettes smoked daily (20-30); however, values up to 60 were observed in the dataset, so values of 20-30 are not unreasonable. Some of the pre-pregnancy BMIs seem extreme (16 and 44), but these values are plausible. Because none of the data stands out as implausible, and because our interest is in capturing associations across the entire population of births, I will not drop any of the points from the dataset.
\pagebreak
Figure 3: Leverage Plot
```{r, echo=FALSE, results='markup'}
## plot hat values
ggplot(plotset, aes(x=index, y=hatvals)) + geom_point() +
ylim(0,0.075) + geom_hline(yintercept = 3*(11)/NROW(plotset)) +
theme_bw()
```
\pagebreak
Figure 4: Cook's D Plot
```{r, echo=FALSE, results='markup'}
ggplot(plotset, aes(x=index2, y=cooks)) + geom_point() +
ylim(0,0.025) + geom_hline(yintercept = 4/NROW(plotset)) +
theme_bw()
```
Table 1: Data for subjects with Cook's D > 0.01
```{r, echo=FALSE, results='markup'}
reallyhighinf
```
**c) Write a one-paragraph summary of your findings for a scientific collaborator. You should give some summary information (size of dataset, number of missing values, variables for which missingness is relatively high), succinctly describe your model building process and regression diagnostics, and interpret coefficients for variables of interest.**
The birthweight dataset contains `r NROW(data.bwt)` subjects. Our goal was to assess the degree to which the amount of smoking (quantified as the average number of cigarettes smoked per day during pregnancy) affects birthweight. Previous studies have indicated that socioeconomic status, maternal age, and nutrition are associated with birthweight. To assess the association between the amount of smoking and birthweight, we fit a multivariable linear regression model with mother's race, family's monthly income, mother's age, mother's pre-pregnancy BMI, mother's change in BMI during pregnancy, and gestational age as covariates. Of the `r NROW(data.bwt)` subjects in the full dataset, `r NROW(data_for_plots)` had complete data for all variables in the multivariable model. The most commonly missing variables were family income (580 missing), BMI change (524 missing), and pre-pregnancy BMI (379 missing). Using the complete cases only, our model yielded an R-squared of `r summary(mult_fit)$r.squared`. Residual diagnostics revealed no deviations from the assumptions of constant variance or normality of errors. Highly influential observations were detected; however, none of the data points for these observations were unreasonable, and in the interest of reporting a general population association, no observations were removed from the analysis. Consistent with previous reports, the socioeconomic measures (family income and race) were significantly associated with birthweight at the 0.05 level: higher income was associated with higher birthweight, and on average babies of Black, Asian, and Puerto Rican mothers had significantly lower birthweights than babies of white mothers. Interestingly, maternal age was not significantly associated with birthweight. However, pre-pregnancy BMI and increase in BMI during pregnancy were both positively and significantly associated with birthweight, as was gestational age.
Estimates, standard errors, and p-values for all coefficients are given in the table below. For our primary hypothesis, adjusting for all other covariates in the model, the average number of cigarettes smoked daily during pregnancy was significantly associated with birthweight at the 0.05 level. On average, a one-cigarette increase in daily average cigarettes smoked was associated with a `r mult_fit$coef[2]` gram decrease in birthweight (p=`r round(summary(mult_fit)$coefficients[2,4], 20)`).
```{r, echo=FALSE, results='markup', tidy=TRUE}
tableout = tidy(mult_fit)
tableout = mutate(tableout, p.value= round(p.value,16))
tableout
```
\pagebreak
#### Solution 4:
\textcolor{red}{
Solution 4 made several choices motivated by scientific interpretation at the outset; for example, income was dichotomized at the poverty line and smoking status was categorized into three levels. This may lose some information, but there are clear gains in interpretability. The presentation of possible predictors in three contextual domains helped to formalize the sources of information about birthweight. One very useful element was revisiting some of the exploratory plots after building the model and coloring these points according to their Cook's distance; this made it clear in what sense the data points were influential.
}
```{r setup3, echo=FALSE, tidy=TRUE, message=FALSE, warning=FALSE}
## this code chunk loads necessary packages, data, and clears the workspace.
library(ggplot2)
library(dplyr)
library(knitr)
library(rmarkdown)
library(SemiPar)
library(splines)
library(broom)
library(gridExtra)
library(boot)
library(tidyr)
library(car)
library(parcor)
library(grid)
rm(list = ls())
opts_chunk$set(tidy.opts=list(width.cutoff=10000))
```
## Part a
Preliminary data cleaning: keep only variables of interest; convert categorical variables into factors; dichotomize fincome into those making less than $3000 a month and those making $3000 or more a month (below and above the poverty line); categorize the smoken variable into light (up to 10 cigarettes a day), medium (11-20 a day), and heavy (more than 20 a day) smokers, with those who do not smoke classified as non-smoking; and remove any observations with missing values for bwt (birthweight). We do not drop observations with NAs in all variables, because some variables have large numbers of NAs:
```{r, echo=TRUE, tidy=TRUE, message=FALSE, warning=FALSE}
#load data:
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW4_Data_BWT.RDA"))
bwt1 = data.frame(dplyr::select(data.bwt, c(babysex, bhead, blength, bwt, delwt, fincome, frace, gaweek, malform, menarche, mheight, momage, mrace, parity, pnumlbw, pnumsga, ppbmi, ppwt, smoken, wtgain)))%>%
mutate(., babysex=factor(babysex, c(1, 2),label=c("Male", "Female")))%>%
mutate(., malform=factor(malform, c(0, 1),label=c("absent", "present")))%>%
mutate(., frace=factor(frace, c(1, 2, 3, 4, 8, 9),label=c("White", "Black", "Asian", "Puerto Rican", "Other", "Unknown")))%>%
mutate(., mrace=factor(mrace, c(1, 2, 3, 4, 8),label=c("White", "Black", "Asian", "Puerto Rican", "Other")))%>%
mutate(., fincomecat = fincome)%>%
mutate(., fincomecat=replace(fincomecat, fincome==99, NA))%>%
mutate(., fincomecat=replace(fincomecat, fincome<=25, "low"))%>%
mutate(., fincomecat=replace(fincomecat, fincome>25&fincome<=96, "high"))%>%
mutate(., smokencat = smoken)%>%
mutate(., smokencat=replace(smokencat, smoken==0, "none"))%>%
mutate(., smokencat=replace(smokencat, smoken>0&smoken<=10, "light"))%>%
mutate(., smokencat=replace(smokencat, smoken>10&smoken<=20, "medium"))%>%
mutate(., smokencat=replace(smokencat, smoken>20, "heavy"))%>%
filter(., !is.na(bwt))
#frequencies of categorical data:
bwt1 %>%group_by(babysex) %>%summarize(n = n())
bwt1 %>%group_by(malform) %>%summarize(n = n())
bwt1 %>%group_by(frace) %>%summarize(n = n())
bwt1 %>%group_by(mrace) %>%summarize(n = n())
bwt1 %>%group_by(fincomecat) %>%summarize(n = n())
bwt1 %>%group_by(smokencat) %>%summarize(n = n())
#visually inspect continuous data:
var1 = ggplot(data.bwt, aes(bhead, bwt)) + geom_point() + ggtitle("bhead")
var2 = ggplot(data.bwt, aes(blength, bwt)) + geom_point() + ggtitle("blength")
var3 = ggplot(data.bwt, aes(delwt, bwt)) + geom_point() + ggtitle("delwt")
var4 = ggplot(data.bwt, aes(gaweek, bwt)) + geom_point() + ggtitle("gaweek")
var5 = ggplot(data.bwt, aes(menarche, bwt)) + geom_point() + ggtitle("menarche")
var6 = ggplot(data.bwt, aes(mheight, bwt)) + geom_point() + ggtitle("mheight")
var7 = ggplot(data.bwt, aes(momage, bwt)) + geom_point() + ggtitle("momage")
var8 = ggplot(data.bwt, aes(parity, bwt)) + geom_point() + ggtitle("parity")
var9 = ggplot(data.bwt, aes(pnumlbw, bwt)) + geom_point() + ggtitle("pnumlbw")
var10 = ggplot(data.bwt, aes(pnumsga, bwt)) + geom_point() + ggtitle("pnumsga")
var11 = ggplot(data.bwt, aes(ppbmi, bwt)) + geom_point() + ggtitle("ppbmi")
var12 = ggplot(data.bwt, aes(ppwt, bwt)) + geom_point() + ggtitle("ppwt")
var13 = ggplot(data.bwt, aes(wtgain, bwt)) + geom_point() + ggtitle("wtgain")
grid.arrange(var1, var2, var3, var4,var5, var6, ncol=3)
grid.arrange(var7, var8, var9, var10, var11, var12, var13, ncol=3)
#explore the correlation of numerical variables
cor(dplyr::select(bwt1, -c(babysex, fincome, fincomecat, smoken, smokencat, frace, malform, mrace)), use="complete.obs")
```
Birthweight of a newborn can be predicted by numerous factors, which can be separated into characteristics of the baby and characteristics of its mother.

In our dataset, variables that describe the baby are (not counting bwt):

1. babysex
2. bhead
3. blength

Variables that describe the mother are:

1. delwt
2. gaweek
3. malform
4. menarche
5. mheight
6. momage
7. mrace
8. parity
9. pnumlbw
10. pnumsga
11. ppbmi
12. ppwt
13. smoken
14. wtgain

Variables that describe other potential factors impacting the mother or baby:

1. fincome
2. frace

From the initial exploration of the data, we see that:

1. over 99% of subjects have no prior malformation
2. over 50% of the values for father's race are missing or unknown
3. pnumlbw and pnumsga only take on the value 0
4. for over 99% of subjects, age at menarche was around 13

For the reasons stated above, malform, frace, pnumlbw, pnumsga, and menarche will be excluded from model building due to poor data quality on these potentially interesting factors.
*Building a preliminary model*
In terms of a baby's characteristics, one can argue that its head circumference and body length would be good predictors of its weight. The strength of the relationships of bhead and blength with bwt can also be seen in the scatterplots above, and these variables are therefore included in the preliminary model.
In terms of a mother's characteristics, the three factors that most directly dictate the size of the baby are the length of time it was allowed to grow, how much it grew as measured by the change in the mother's body, and the health of the mother. The first factor can be measured by gaweek. The data offer several options to measure the second factor: delwt, mheight, ppbmi, ppwt, and wtgain. We saw in the correlation matrix above that delivery weight was highly correlated with pre-pregnancy size (BMI and weight), so delwt is not an ideal candidate. Mother's pre-pregnancy BMI and weight do not tell us how much the baby has grown during the course of the pregnancy, which makes these two variables less than ideal. Mother's height does not change due to pregnancy, which makes this variable unhelpful. The most useful variable is therefore wtgain. While there are no variables documenting the mother's health at the time of pregnancy, we can use mother's age as a proxy, because one can argue that the younger you are, the more likely you are to have a healthy pregnancy.
Therefore, our preliminary model is:
$$bwt_i = \beta_0 + \beta_1 bhead_i + \beta_2 blength_i + \beta_3 gaweek_i + \beta_4 wtgain_i + \beta_5 momage_i + \epsilon_i$$
```{r, echo=TRUE, tidy=TRUE, message=FALSE, warning=FALSE}
model1 = lm(bwt~bhead+blength+gaweek+wtgain+momage, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage))%>%na.omit())
```
*Consider other risk factors, modifiers, and confounders*
Of the eligible variables left, we consider the following risk factors that could be associated with the baby's birthweight:

1. number of live births prior to this pregnancy, because one can argue that the more healthy live births you have had, the more likely you are to have another healthy-weight baby
2. mother's smoking status (none, light, medium, or heavy smoker), because it is known that women who smoke have an increased chance of producing low birthweight babies
3. family's monthly income (below the poverty line of $3000 a month, or above)
4. baby's sex, as the impact of other variables may differ between female and male babies
5. mother's race, because mothers of certain ethnicities are more prone to having low birthweight babies

We fit five more models incorporating the variables above; for each new variable added, we look at its associated p-value to answer the question of whether there is a significant association between the new variable and birthweight.
```{r, echo=TRUE, tidy=TRUE, message=FALSE, warning=FALSE}
model2=lm(bwt~bhead+blength+gaweek+wtgain+momage+parity, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, parity))%>%na.omit())
model3=lm(bwt~bhead+blength+gaweek+wtgain+momage+smokencat, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, smokencat))%>%filter(., smokencat!="NA")%>%na.omit())
model4=lm(bwt~bhead+blength+gaweek+wtgain+momage+fincomecat, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, fincomecat, momage))%>%filter(., fincomecat!="NA")%>%na.omit())
model5=lm(bwt~bhead+blength+gaweek+wtgain+momage+babysex, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, babysex, momage))%>%na.omit())
model6=lm(bwt~bhead+blength+gaweek+wtgain+momage+babysex+mrace, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, babysex, mrace))%>%na.omit())
```
From the process above, it appears that while financial status is not significant, its inclusion changed the coefficient estimate for momage by more than 10%. However, one can argue that financial status is on the causal pathway between age and birthweight, because one usually accumulates more wealth as one ages, and more wealth could lead to a healthier pregnancy and thus higher birthweight. So our current preliminary main effects model is the following (model 6):
$$bwt_i = \beta_0 + \beta_1 bhead_i + \beta_2 blength_i + \beta_3 gaweek_i + \beta_4 wtgain_i + \beta_5 momage_i + \beta_6 babysex_i + \beta_7 mrace_i + \epsilon_i$$
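As an aside, the informal 10% change-in-estimate rule for confounding applied above can be written out explicitly: comparing the coefficient for momage with and without the candidate variable in the model, the candidate is flagged when
$$ \left| \frac{\hat{\beta}_{\mbox{adjusted}} - \hat{\beta}_{\mbox{unadjusted}}}{\hat{\beta}_{\mbox{unadjusted}}} \right| > 0.10. $$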
We reexamine the five variables above as potential modifiers of the variables included in model1:
```{r, echo=TRUE, tidy=TRUE, message=FALSE, warning=FALSE}
model7=lm(bwt~bhead+blength+gaweek+wtgain+momage+parity+parity*bhead+parity*blength+parity*gaweek+parity*wtgain+parity*momage, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, parity))%>%na.omit())
model8=lm(bwt~bhead+blength+gaweek+wtgain+momage+smokencat+smokencat*bhead+smokencat*blength+smokencat*gaweek+smokencat*wtgain+smokencat*momage, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, smokencat))%>%filter(., smokencat!="NA")%>%na.omit())
model9=lm(bwt~bhead+blength+gaweek+wtgain+momage+fincomecat+fincomecat*bhead+fincomecat*blength+fincomecat*gaweek+fincomecat*wtgain+fincomecat*momage, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, fincomecat))%>%filter(., fincomecat!="NA")%>%na.omit())
model10=lm(bwt~bhead+blength+gaweek+wtgain+momage+babysex+babysex*bhead+babysex*blength+babysex*gaweek+babysex*wtgain+babysex*momage, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, babysex))%>%na.omit())
model11=lm(bwt~bhead+blength+gaweek+wtgain+momage+mrace+mrace*bhead+mrace*blength+mrace*gaweek+mrace*wtgain+mrace*momage, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, mrace))%>%na.omit())
model12=lm(bwt~bhead+blength+gaweek+wtgain+momage+babysex+babysex*blength, data=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, babysex))%>%na.omit())
AIC(model1,model2,model3,model4,model5,model6,model7,model8,model9, model10, model11, model12)
model12.coef=tidy(model12)
model12.beta <- round(model12.coef["estimate"], 2)
```
It appears that the baby's sex is a significant potential modifier of the effect of the baby's length, while the other factors were not as compelling as modifiers overall. Therefore, our population model after considering potential modifiers and confounders can be expressed as follows (model 12):
$$bwt_i = \beta_0 + \beta_1 bhead_i + \beta_2 blength_i + \beta_3 gaweek_i + \beta_4 wtgain_i + \beta_5 momage_i + \beta_6 babysex_i + \beta_7 babysex_i*blength_i + \epsilon_i$$
## Part b
```{r, echo=TRUE, tidy=TRUE, message=FALSE, warning=FALSE}
#plot residuals of model12:
plot.dat=data.frame(resid = resid(model12), fitted = fitted(model12))
ggplot(plot.dat, aes(x = fitted, y = resid)) + geom_point() + theme_bw()
#QQ plot of residuals:
ggplot(plot.dat, aes(sample = resid)) + stat_qq() + theme_bw() + ylab("resid")
#multicollinearity
vif(model12) # variance inflation factors
sqrt(vif(model12)) > 2
#identify leverage and calculate cook's distance
bwt2=dplyr::select(bwt1, c(bwt, bhead, blength, gaweek, wtgain, momage, babysex))%>%na.omit()
bwt2 = mutate(bwt2,
hatvals = hatvalues(model12),
indexhat = seq_along(hatvals),
cooks = cooks.distance(model12),
indexcook = seq_along(cooks))%>%
mutate(., testhat = hatvals<0.018, testcook=cooks<0.01)%>%
mutate(., testhat = factor(testhat, c(TRUE,FALSE), label=c("normal","high")))%>%
mutate(., testcook = factor(testcook, c(TRUE,FALSE), label=c("normal","high")))
ggplot(bwt2, aes(x=indexhat, y=hatvals)) + geom_point(aes(colour = factor(testhat))) + geom_hline(yintercept = 0.018) + theme_bw()
ggplot(bwt2, aes(x=indexcook, y=cooks)) + geom_point(aes(colour = factor(testcook))) + geom_hline(yintercept = 0.01) + theme_bw()
#look at where they are for each continuous variable we've included:
plot1=ggplot(bwt2, aes(x=bhead, y=bwt)) + geom_point(aes(colour = factor(testhat))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot2=ggplot(bwt2, aes(x=blength, y=bwt)) + geom_point(aes(colour = factor(testhat))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot3=ggplot(bwt2, aes(x=gaweek, y=bwt)) + geom_point(aes(colour = factor(testhat))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot4=ggplot(bwt2, aes(x=wtgain, y=bwt)) + geom_point(aes(colour = factor(testhat))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot5=ggplot(bwt2, aes(x=momage, y=bwt)) + geom_point(aes(colour = factor(testhat))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot6=ggplot(bwt2, aes(x=bhead, y=bwt)) + geom_point(aes(colour = factor(testcook))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot7=ggplot(bwt2, aes(x=blength, y=bwt)) + geom_point(aes(colour = factor(testcook))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot8=ggplot(bwt2, aes(x=gaweek, y=bwt)) + geom_point(aes(colour = factor(testcook))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot9=ggplot(bwt2, aes(x=wtgain, y=bwt)) + geom_point(aes(colour = factor(testcook))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
plot10=ggplot(bwt2, aes(x=momage, y=bwt)) + geom_point(aes(colour = factor(testcook))) +
stat_smooth(method = "lm", se = FALSE) + theme(legend.position = "none")
grid.arrange(plot1, plot2, plot3,plot4, plot5, plot6, plot7, plot8, plot9, plot10, ncol=5)
#removing potential trouble points:
model14=lm(bwt~bhead+blength+gaweek+wtgain+momage+babysex+babysex*blength, data=subset(bwt2, testhat!="high"))
model15=lm(bwt~bhead+blength+gaweek+wtgain+momage+babysex+babysex*blength, data=subset(bwt2, testcook!="high"))
```
We see from the residual plot that the points are fairly symmetrically distributed, tending to cluster towards the middle of the plot, with no clear patterns. The QQ plot is more or less a straight line, so the assumption of normality of errors appears reasonable. We also checked for signs of multicollinearity using VIF.
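For reference, the variance inflation factor for the $j$th predictor is computed from the $R^2_j$ obtained by regressing that predictor on all of the others:
$$ VIF_j = \frac{1}{1 - R^2_j}, $$
so the $\sqrt{VIF_j} > 2$ rule of thumb used in the code corresponds to $VIF_j > 4$, i.e. $R^2_j > 0.75$.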
In terms of influential points: we identified a handful of these using leverage and Cook's distance calculations. We can see from the graphs above that these points tend to be babies with exceptionally high weight given their head circumference or body length, or exceptionally heavy given gestational age, mother's weight gain, or mother's age. Removing these points decreased the significance of the interaction between the baby's sex and length but did not have any other major impacts. Therefore, it is perhaps not necessary to remove these points from our final model.
## Part c
The original "data.bwt" dataset has `r dim(data.bwt)[1]` observations and `r dim(data.bwt)[2]` variables. After an initial cleaning of the data, in which categorical variables were converted into factor variables and observations with missing birthweights were excluded, we are left with `r dim(bwt1)[1]` observations and `r dim(bwt1)[2]` variables. These variables are `r ls(bwt1)`, with the following numbers of missing values:
```{r echo=FALSE}
sapply(bwt1, function(x) sum(is.na(x)))
```
We see that father's race is missing or unknown for over half of the subjects, which is the main reason it was not used in our analysis. Malformation, pnumlbw, pnumsga, and menarche age were also excluded due to lack of variation.
In the model building process, we first constructed a preliminary main effects model from the factors that directly impact a baby's weight, both specific to the baby itself and to its mother. The factors included in the preliminary main effects model were: baby's length, head circumference, mother's weight gain, gestational length, and mother's age. We then explored other risk factors that could be confounders or modifiers, including number of prior live births, smoking status, income status, baby's sex, and mother's race. In the end, we found that the baby's sex is a significant modifier of the association between the baby's length and birthweight. With our modified model, we performed model diagnostics and found that normality of errors was reasonable and there was no issue with multicollinearity. We identified a few influential points but decided they were not worth excluding from the data. Our final fitted model is:
$$\hat{bwt}_i = (`r model12.beta[1,1]`) + (`r model12.beta[2,1]`) bhead_i + (`r model12.beta[3,1]`) blength_i + (`r model12.beta[4,1]`) gaweek_i + (`r model12.beta[5,1]`) wtgain_i + $$
$$ (`r model12.beta[6,1]`) momage_i + (`r model12.beta[7,1]`) babysex_i + (`r model12.beta[8,1]`) babysex_i*blength_i$$
With every centimeter increase in head circumference, a baby's weight increases by `r model12.beta[2,1]` pounds on average, adjusting for all other variables. With every additional gestation week, a baby's weight increases by `r model12.beta[4,1]` pounds on average, adjusting for all other variables. With every pound gained in pregnancy, a baby's weight increases by `r model12.beta[5,1]` pounds on average, adjusting for all other variables. With every year increase in mother's age, a baby's weight increases by `r model12.beta[6,1]` pounds on average, adjusting for all other variables. For female babies, every centimeter increase in length results in an increase of `r model12.beta[3,1]+ model12.beta[8,1]` pounds on average, adjusting for all other variables. For male babies, every centimeter increase in length results in an increase of `r model12.beta[3,1]` pounds on average, adjusting for all other variables. All coefficients were significant at the 5% level. In the future, it may be worth revisiting this dataset, performing LASSO on it for model selection, and using cross-validation to compare the quality of competing models.
\pagebreak
### Problem 4. [7+7+6=20 points]
#### Solution:
Arrange the training and testing sets first:
```{r, echo=TRUE, cache=TRUE}
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW4_Data_Accel.RDA")) # load data
dat = cbind(age, activity) # arrange data
num.fold = 5 # number of folds
n = length(age) # sample size
num_predictor = ncol(activity) # predictor size
# cv train/test allocation, kept fixed
set.seed(50)
flds = createFolds(1:n, num.fold)
dat_train = vector(mode="list",length=num.fold)
dat_test = vector(mode="list",length=num.fold)
for (ii in 1:num.fold){
dat_test[[ii]] = dat[flds[[ii]], ]
dat_train[[ii]] = dat[-flds[[ii]],]
}
```
#### (a)
Build predictive model by ridge regression:
```{r, echo=TRUE, cache=TRUE}
lambda_candidate = 2^seq(-1, 8, length.out=10) # lambda candidates
len = length(lambda_candidate) # lambda candidate size
# cv computation function, ridge
fit_pred_ridge = function(train, test, parameter){
m.ridge = glmnet(train[, -1], train[, 1], alpha=0, lambda=parameter)
pred = predict(m.ridge, newx=test[, -1])
return(mean((pred - test[, 1])^2))
}
# cv, ridge
compare_ridge = rep(0, len)
for (l in 1:len){
para <- rep(lambda_candidate[l], num.fold)
compare_ridge[l] <- mean(mapply(fit_pred_ridge, dat_train, dat_test, para))
}
# cv curve, ridge
ggplot(data.frame(compare_ridge), aes(x=lambda_candidate,y=compare_ridge)) +
geom_path() +
xlab('Lambda') +
ylab('Cross Validation Score') + ggtitle('Cross Validation Curve of Ridge')
# final model, ridge
opt_lambda_ridge = lambda_candidate[which.min(compare_ridge)]
final.ridge = glmnet(dat[, -1], dat[, 1], alpha=0, lambda=opt_lambda_ridge)
# show regression coefficients (no intercept)
ggplot(data.frame(as.numeric(coef(final.ridge))[-1]),
aes(x=1:num_predictor,y=as.numeric(coef(final.ridge))[-1])) +
geom_point() +
xlab('Predictor') +
ylab('Coefficient') + ggtitle('Regression Coefficients of Ridge')
final.pred = predict(final.ridge, newx=dat[, -1])
final.MSE = mean((final.pred - dat[, 1])^2) # quantify prediction accuracy
```
The final MSE of ridge, with optimal $\lambda$ `r opt_lambda_ridge`, is `r final.MSE`.
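As a sanity check on the hand-rolled cross-validation above, a minimal sketch of glmnet's built-in `cv.glmnet`, which runs K-fold CV over a lambda sequence directly, is shown below. The matrix `x` and response `y` here are simulated stand-ins for the real `activity` and `age` data.

```r
## Hedged sketch: cv.glmnet automates the fold loop written out above.
## x and y are toy data standing in for activity (predictors) and age (response).
library(glmnet)
set.seed(50)
x = matrix(rnorm(100 * 20), nrow = 100)            # 100 subjects, 20 toy predictors
y = as.numeric(x %*% rnorm(20) + rnorm(100))       # toy response
cv.fit = cv.glmnet(x, y, alpha = 0, nfolds = 5)    # alpha = 0 gives the ridge penalty
cv.fit$lambda.min                                  # lambda minimizing the CV error
plot(cv.fit)                                       # CV curve analogous to the ggplot above
```

One design difference worth noting: `cv.glmnet` chooses its own internal fold assignment unless a `foldid` vector is supplied, whereas the manual approach above keeps the folds fixed across all candidate lambdas.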
#### (b)
Build predictive model by lasso regression:
```{r, echo=TRUE, cache=TRUE}
lambda_candidate = seq(0, 1.8, length.out=10) # lambda candidates
len = length(lambda_candidate) # lambda candidate size
# cv computation function, lasso
fit_pred_lasso = function(train, test, parameter){
m.lasso = glmnet(train[, -1], train[, 1], lambda=parameter)
pred = predict(m.lasso, newx=test[, -1])
return(mean((pred - test[, 1])^2))
}
# cv, lasso
compare_lasso = rep(0, len)
for (l in 1:len){
para <- rep(lambda_candidate[l], num.fold)
compare_lasso[l] <- mean(mapply(fit_pred_lasso, dat_train, dat_test, para))
}
# cv curve, lasso
ggplot(data.frame(compare_lasso), aes(x=lambda_candidate,y=compare_lasso)) +
geom_path() +
xlab('Lambda') +
ylab('Cross Validation Score') + ggtitle('Cross Validation Curve of Lasso')
# final model, lasso
opt_lambda_lasso = lambda_candidate[which.min(compare_lasso)]
final.lasso = glmnet(dat[, -1], dat[, 1], lambda=opt_lambda_lasso)
# show regression coefficients (no intercept)
ggplot(data.frame(as.numeric(coef(final.lasso))[-1]),
aes(x=1:num_predictor,y=as.numeric(coef(final.lasso))[-1])) +
geom_point() +
xlab('Predictor') +
ylab('Coefficient') + ggtitle('Regression Coefficients of Lasso')
final.pred = predict(final.lasso, newx=dat[, -1])
final.MSE = mean((final.pred - dat[, 1])^2) # quantify prediction accuracy
```
The final MSE of lasso, with optimal $\lambda$ `r opt_lambda_lasso`, is `r final.MSE`.
#### (c)
Compare the cross-validation scores from (a) and (b) with those of the out-of-sample mean and unpenalized least squares:
```{r, echo=TRUE, cache=TRUE}
cv.ridge = min(compare_ridge) # cv of selected model in (a)
cv.lasso = min(compare_lasso) # cv of selected model in (b)
# cv computation function, out-of-sample mean
fit_pred_outofsample = function(train, test){
return(mean((rep(mean(train[, 1]), dim(test)[1]) - test[, 1])^2))
}
# cv computation function, ols
fit_pred_ols = function(train, test){
m.ols = lm(age~., data=data.frame(train))
pred = predict(m.ols, data.frame(test[, -1]))
return(mean((pred - test[, 1])^2))
}
# cv of out-of-sample mean
cv.outofsample = mean(mapply(fit_pred_outofsample, dat_train, dat_test))
# cv of ols
cv.ols = mean(mapply(fit_pred_ols, dat_train, dat_test))
```
The cross-validation scores of optimal ridge, optimal lasso, the out-of-sample mean, and unpenalized least squares are `r cv.ridge`, `r cv.lasso`, `r cv.outofsample`, and `r cv.ols`, respectively. Unsurprisingly, the out-of-sample mean is the worst: it uses no information from the predictors. Unpenalized regression is not as good as the penalized fits; because some of the predictors are highly collinear, penalization reduces overfitting. In this round of cross-validation, ridge is slightly better than lasso, but the difference is not large, so we regard both as good predictive models.
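For concreteness, the cross-validation score computed for each method above is the fold-averaged out-of-sample MSE: writing $F_1, \ldots, F_K$ for the test folds and $\hat{y}_i^{(-k)}$ for the prediction of observation $i$ from the model fit without fold $k$,
$$ CV = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{i \in F_k} \left( y_i - \hat{y}_i^{(-k)} \right)^2. $$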