---
output:
pdf_document: default
html_document: default
word_document: default
widgets:
- mathjax
- bootstrap
- quiz
- shiny
- interactive
- latex_macros
---
**Biostatistics P8111, Spring 2016**
**Homework 5 Solutions**
Due April 22 by 5:00pm
Please submit an electronic copy (PDF) of your solutions using CourseWorks. Use the title
"P8111HW5_Lastname_Firstname.pdf".
Solutions to Problems 1-2 can be typed or handwritten and scanned; Problems 3-5 should be completed using R Markdown. In addition to your PDF, please also submit the .Rmd that produces your written solutions to Problems 3-5, with the title "P8111HW5_Lastname_Firstname.Rmd".
```{r, echo=FALSE, message=FALSE, warning=FALSE}
## load all needed packages at the beginning of the document
library(dplyr)
library(SemiPar)
library(mgcv)
library(glmnet)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(broom)
library(lme4)
library(splines)
```
### Problem 1. [7 points]
\begin{eqnarray*}
Var(y) &=& E[Var(y| b)] + Var[E(y| b)] \\
&=& E( \sigma_{\epsilon} ^2 I) + Var(X\beta + Zb)\\
&=& \sigma_{\epsilon} ^2 I + ZVar(b)Z^T\\
&=&\sigma_{\epsilon} ^2 I +\sigma_{b} ^2ZZ^T
\end{eqnarray*}
For a random intercept model, $ZZ^T$ is a block diagonal matrix with all entries inside the diagonal blocks 1, and the remaining entries 0. Thus, $\sigma_{b} ^2ZZ^T$ is a block diagonal matrix in which entries of diagonal blocks are all $\sigma_{b}^2$. Meanwhile, $\sigma_{\epsilon} ^2 I$ is a diagonal matrix with all diagonal elements $\sigma_{\epsilon} ^2$. Therefore, $\sigma_{\epsilon} ^2 I +\sigma_{b} ^2ZZ^T$ is a block diagonal matrix with blocks for each subject and entries off the diagonal are zero matrices; within the subject blocks, all the diagonal elements are $\sigma_{\epsilon} ^2 +\sigma_{b} ^2$, and all the off-diagonal elements are $\sigma_{b} ^2$, indicating a uniform covariance structure.
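As a concrete illustration, for a subject with $J_i = 3$ observations, the corresponding diagonal block of $Var(y)$ is
$$\sigma_{\epsilon}^2 I + \sigma_{b}^2 \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} \sigma_{\epsilon}^2 + \sigma_{b}^2 & \sigma_{b}^2 & \sigma_{b}^2 \\ \sigma_{b}^2 & \sigma_{\epsilon}^2 + \sigma_{b}^2 & \sigma_{b}^2 \\ \sigma_{b}^2 & \sigma_{b}^2 & \sigma_{\epsilon}^2 + \sigma_{b}^2 \end{pmatrix},$$
so any two observations on the same subject have correlation $\sigma_{b}^2 / (\sigma_{\epsilon}^2 + \sigma_{b}^2)$, the intraclass correlation.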
### Problem 2. [13 points]
\begin{eqnarray*}
f ( b_i \mid y_i ) & \propto & f (b_i, y_i) \\
& \propto & \exp \left\{ - \frac{1}{2 \sigma_y^2} \sum_{j = 1}^{J_i} (y_{ij} - b_i)^2 - \frac{1}{2 \sigma_b^2} b_i^2\right\} \\
& \propto & \exp \left\{ \left(- \frac{1}{2 \sigma_y^2} J_i b_i^2 - \frac{1}{2 \sigma_b^2} b_i^2\right) + \frac{1}{\sigma_y^2} \sum_{j = 1}^{J_i} y_{ij} b_i\right\} \\
& = & \exp \left\{ - \frac{1}{2} \frac{J_i \sigma_b^2 + \sigma_y^2}{\sigma_y^2 \sigma_b^2} b_i^2 + \frac{J_i}{\sigma_y^2} \bar{y}_i b_i\right\} \\
& \propto & \exp \left\{ - \frac{1}{2} \frac{1}{ \frac{ \frac{\sigma_y^2}{J_i} \sigma_b^2 } {\frac{\sigma_y^2}{J_i} + \sigma_b^2 }} \left[b_i - \frac{ \sigma_b^2 } { \frac{\sigma_y^2}{J_i} + \sigma_b^2 } \bar{y}_i \right]^2\right\} \\
\end{eqnarray*}
So by the kernel argument,
$$b_i \mid y_i \sim N \left(\frac{ \sigma_b^2 } { \frac{\sigma_y^2}{J_i} + \sigma_b^2 } \bar{y}_i + \frac{ \frac{\sigma_y^2}{J_i} } { \frac{\sigma_y^2}{J_i} + \sigma_b^2 } 0, \frac{ \frac{\sigma_y^2}{J_i} \sigma_b^2 } {\frac{\sigma_y^2}{J_i} + \sigma_b^2 } \right)$$
\noindent The idea of borrowing strength from the population in a random effects model is to combine subject-specific observations (the $y_{ij}$) with information about the population from which subject $i$ comes. It may be that the data points for a subject indicate that his values are large with respect to the population; in that case, shrinking toward the population average results in a better estimate of the subject mean than simply averaging that subject's observations.
\noindent This phenomenon is also closely related to the mathematical equivalence of ridge regression and BLUPs. Ridge estimates are shrunk toward zero and tend to have better MSEs; similarly, BLUPs, which borrow from the population (and have the same form as in this question), tend to be better overall estimates.
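To make the ridge connection explicit: in this model the BLUP solves a ridge-type criterion. Minimizing $\sum_{j=1}^{J_i} (y_{ij} - b_i)^2 + \lambda b_i^2$ with $\lambda = \sigma_y^2 / \sigma_b^2$ gives
$$\hat{b}_i = \frac{J_i}{J_i + \lambda} \bar{y}_i = \frac{ \sigma_b^2 } { \frac{\sigma_y^2}{J_i} + \sigma_b^2 } \bar{y}_i,$$
which is exactly the posterior mean derived above.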
\noindent In the formula above, the population mean for the $b_i$ is 0. So, the following generally hold:
\begin{itemize}
\item Better information at the subject level, either through smaller $\sigma^2_y$ or larger $J_i$, means that you trust the subject average $\bar{y}_i$ more
\item Worse information at the subject level, either through larger $\sigma^2_y$ or smaller $J_i$, means that you trust the population average 0 more
\end{itemize}
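As a quick numerical sketch of these points (the variances and observations below are hypothetical, not from the homework data), the shrinkage weight and shrunken estimate can be computed directly:

```r
## Shrinkage of the subject mean toward the population mean (0).
## sigma_y2, sigma_b2, and y_i below are hypothetical values.
sigma_y2 <- 4                        # within-subject variance
sigma_b2 <- 1                        # between-subject variance
y_i <- c(2.0, 3.0, 2.5)              # observations for subject i
J_i <- length(y_i)
lambda <- sigma_b2 / (sigma_y2 / J_i + sigma_b2)  # weight on the subject mean
b_hat <- lambda * mean(y_i)          # shrunken estimate of b_i
c(lambda = lambda, b_hat = b_hat)
```

Increasing `J_i` (or decreasing `sigma_y2`) pushes `lambda` toward 1, so the estimate relies more on the subject's own mean.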
### Problem 3. [10 points]
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE}
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW5_Data_BWT.RDA"))
data.bwt <- data.bwt %>%
  select(., babysex, bhead, blength, bwt, delwt, fincome, frace, gaweek, malform,
         menarche, mheight, momage, mrace, parity, pnumlbw, pnumsga, ppbmi, ppwt,
         smoken, wtgain) %>%
  mutate(fincome = replace(fincome, data.bwt$fincome == 99, NA),
         menarche = replace(menarche, (data.bwt$menarche == 99) | (data.bwt$menarche == 4), NA),
         babysex = factor(babysex, labels = c("Male", "Female")),
         mrace = factor(mrace, levels = c(1:4, 8),
                        labels = c("White", "Black", "Asian",
                                   "Puerto Rican", "Others")),
         frace = factor(frace, levels = c(1:4, 8, 9),
                        labels = c("White", "Black", "Asian",
                                   "Puerto Rican", "Others", "Unknown")),
         malform = factor(malform, labels = c("Absent", "Present")))
```
Responses to this problem will vary widely, based on the original model fitting goal and procedure. Here is an example of rerunning the model in which _bhead_, _blength_, _delwt_, _gaweek_, _babysex_, _mrace_, _smoken_ and interaction term _mrace_ * _blength_ are used as predictors of birth weight _bwt_.
### Descriptive Statistics
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE, fig.height=5, fig.width=10, fig.align="center"}
p <- list()
p[[1]] <- ggplot(data = data.bwt, aes(x = bhead, y = bwt)) + geom_point(size = 0.2)
p[[2]] <- ggplot(data = data.bwt, aes(x = blength, y = bwt)) + geom_point(size = 0.2)
p[[3]] <- ggplot(data = data.bwt, aes(x = delwt, y = bwt)) + geom_point(size = 0.2)
p[[4]] <- ggplot(data = data.bwt, aes(x = gaweek, y = bwt)) + geom_point(size = 0.2)
p[[5]] <- ggplot(data = data.bwt, aes(x = momage, y = bwt)) + geom_point(size = 0.2)
p[[6]] <- ggplot(data = data.bwt, aes(x = smoken, y = bwt)) + geom_point(size = 0.2)
p[[7]] <- ggplot(data = data.bwt, aes(x = mrace, y = bwt)) + geom_boxplot(outlier.size = 0.2)
p[[8]] <- ggplot(data = data.bwt, aes(x = babysex, y = bwt)) + geom_boxplot(outlier.size = 0.2)
grid.arrange(grobs = p, ncol = 4, nrow = 2, bottom = "Pairwise correlation: bwt against other covariates")
```
### Model Fitting
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE}
data.model <- data.bwt %>% select(bwt, blength, bhead, gaweek, delwt, babysex, mrace, smoken) %>%
na.omit()
### Full model
linmod <- lm(bwt ~ . + blength * mrace , data = data.model)
summary(linmod)
```
Most of the coefficients kept the same sign and similar significance, except for race.
### Model Diagnostics
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE, fig.height=5, fig.width=10, fig.align="center"}
### Model diagnosis
p <- list()
diagnosis <- data.frame(fitted = linmod$fitted.values,
resid = linmod$residuals,
hat_value = hatvalues(linmod),
cook_d = cooks.distance(linmod),
index = seq_along(hatvalues(linmod)),
resid_t = rstudent(linmod))
X.des <- model.matrix(linmod)
n <- dim(X.des)[1]
np <- dim(X.des)[2]
### Residual plot
p[[1]] <- ggplot(data = diagnosis, aes(x = fitted, y = resid)) + geom_point(size = 0.5) + ggtitle("Residual Plot")
p[[2]] <- ggplot(data = diagnosis, aes(sample = resid)) + stat_qq(size = 0.5) + ggtitle("QQ Plot of Residuals")
grid.arrange(p[[1]], p[[2]], nrow = 1, ncol = 2)
```
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE, fig.height=5, fig.width=10}
### Leverage, influence, and outlier plots
p[[3]] <- ggplot(data = diagnosis, aes(x = index, y = hat_value)) + geom_point(size = 0.5) +
  geom_hline(yintercept = 2 * np / n, color = "red") + xlab("Index (i)") + ylab("Hat Value (h_ii)") + ggtitle("Leverage Plot")
p[[4]] <- ggplot(data = diagnosis, aes(x = index, y = cook_d)) + geom_point(size = 0.5) +
  geom_hline(yintercept = 2 * np / n, color = "red") + xlab("Index (i)") + ylab("Cook's Distance (D_i)") + ggtitle("Influence Plot")
p[[5]] <- ggplot(data = diagnosis, aes(x = index, y = resid_t)) + geom_point(size = 0.5) +
  geom_hline(yintercept = qt(0.005, n - np), color = "red") + geom_hline(yintercept = qt(1 - 0.005, n - np), color = "red") +
  xlab("Index (i)") + ylab("Studentized Residuals (t_i)") + ggtitle("Outlier Plot")
grid.arrange(p[[3]], p[[4]],p[[5]], nrow = 1, ncol = 3)
```
There are some outliers and a few influential points. Upon examining those observations, a white male baby with birth weight 7.63 pounds and 38 weeks of gestational age has a length of only 20 cm, which seems odd:
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE}
filter(data.model, blength == 20)
```
I removed this point and refit the model. The coefficients did not change appreciably (model estimates not reported here).
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE}
### Remove the unusual point and refit the model
data.model <- filter(data.model, blength > 20)
linmod <- update(linmod, data = data.model)
```
### Analysis of model estimation: Training - test validation
I use the new data as test data and the old data as training data to assess the stability of the model estimates. To do so, I compare coefficient estimates from the test data to those from the training data:
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE}
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW4_Data_BWT.RDA"))
data.bwt.old <- data.bwt %>%
  mutate(fincome = replace(fincome, data.bwt$fincome == 99, NA),
         menarche = replace(menarche, (data.bwt$menarche == 99) | (data.bwt$menarche == 0), NA),
         babysex = factor(babysex, labels = c("Male", "Female")),
         mrace = factor(mrace, levels = c(1:4, 8),
                        labels = c("White", "Black", "Asian",
                                   "Puerto Rican", "Others")),
         frace = factor(frace, levels = c(1:4, 8, 9),
                        labels = c("White", "Black", "Asian",
                                   "Puerto Rican", "Others", "Unknown")),
         malform = factor(malform, labels = c("Absent", "Present")))
# summary(data.bwt)
data.bwt.old <- data.bwt.old %>% select(bwt, blength, bhead, gaweek, delwt, babysex, mrace, smoken) %>%
na.omit()
linmod.old <- lm(bwt ~ . + blength * mrace , data = data.bwt.old)
coef <- round(data.frame(linmod.old$coefficients, confint(linmod.old), linmod$coefficients), 2)
colnames(coef) <- c("Coef.Train", " 2.5% Train", " 97.5% Train", " Coef.Test")
coef
```
According to the table above, except for _mrace = Puerto Rican_, _mrace = Others_, and their interaction terms, all other coefficients from the test-data fit fall within the 95% CIs of the coefficient estimates from the training data. The training model provides reasonable coefficient estimates, but the effect of race on birth weight may be sensitive to the particular sample.
### Problem 4 [20 points]
### Step 1. Data Cleaning
Some very large values of the variables _ht_, _wt_, and _arm_ should be coded as missing; see the .Rmd for the complete cleaning code.
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE}
load(url("http://jeffgoldsmith.com/P8111/P8111_HWs/P8111HW5_Data_Nepal.RDA"))
# summary(data.long)
### Data cleaning
data.long <- data.long %>%
mutate(ht = replace(ht, ht >300, NA),
wt = replace(wt, wt > 40, NA),
arm = replace(arm, arm > 40, NA),
sex = factor(sex, labels = c("Male","Female"))) %>% na.omit()
```
### Model selection
There are many possible "correct" solutions to this problem; here is an example of building a model to examine the weight / arm circumference association, adjusting for height, age, and gender as known confounders with a linear relationship with arm circumference.
From the scatter plot below, I see a positive association between weight and arm circumference, with the possibility of non-linearity; this is similar to the cross-sectional model fit in lecture. I start with a random intercept model and compare linear and nonlinear effects of weight.
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE, fig.height=6, fig.width=10, fig.align="center"}
ggplot(data.long, aes(x = wt, y = arm, group = id)) + geom_line(alpha = .7,size = 1.1, aes(color = sex)) + geom_point() +
labs(x = "Weight", y = "Arm Circumference") + ggtitle("Spaghetti plot of weight vs. arm circumference ")
### random effect: linear vs polynomial
ranmod <- lmer(arm ~ (1 | id) + wt + ht + age + sex , data = data.long)
tidy(ranmod)
AIC_model <- data.frame(Random_Intercept_linear = AIC(ranmod))
ranmod <- lmer(arm ~ (1 | id) + bs(wt, df = 4)+ ht + age + sex , data = data.long)
tidy(ranmod)
AIC_model <- mutate(AIC_model, Random_Intercept_nonlin = AIC(ranmod))
round(AIC_model, 2)
#### Use polynomial
```
The AIC of the linear model is `r round(AIC_model[1, 1], 2)`, while that of the B-spline fit is `r round(AIC_model[1, 2], 2)`; the B-spline model fits substantially better according to this criterion. Next, I use the B-spline model to compare random intercepts to random intercepts and slopes.
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE, fig.height=6, fig.width=10, fig.align="center"}
### random slope vs random intercept only
ranint.mod <- lmer(arm ~ (1 | id) + bs(wt, df = 4) + ht + age + sex , data = data.long)
fitted <- data.frame(wt = data.long$wt, sex = data.long$sex, fitted = getME(ranint.mod,'X') %*% fixef(ranint.mod))
data.long$ranmod.fit = fitted(ranint.mod)
ggplot(data.long, aes(x = wt, y = arm, group = id)) +
geom_point() + geom_path(alpha = .4) +
labs(x = "Weight", y = "Arm Circumference") +
ggtitle("Random Intercept (fitted by ignoring confounders)") +
geom_line(data = fitted, aes(group = sex, y = fitted, color = sex), size = 1.1) +
geom_line(aes(y = ranmod.fit, color = sex), alpha = .5)
ranslope.mod <- lmer(arm ~ (1 + wt| id) + bs(wt, df = 4) + ht + age + sex, data = data.long)
data.long$ranslope.mod <- fitted(ranslope.mod)
fitted <- data.frame(wt = data.long$wt, sex = data.long$sex, fitted = getME(ranslope.mod,'X') %*% fixef(ranslope.mod))
```
Model parameter estimates and inference for fitting the random slope model are reported below:
```{r,echo=FALSE,message=FALSE,cache=TRUE,warning=FALSE}
random_slope <- lmer(arm ~ (1 + wt| id) + bs(wt, df = 4)+ ht + age + sex , data = data.long)
tidy(random_slope)
AIC_model <- mutate(AIC_model, Random_slope_bs = AIC(random_slope))
```
The model with random intercepts and slopes has AIC `r round(AIC_model[1, 3], 2)`, which is smaller than that of the random-intercept-only model. However, the difference is small and may not justify the additional complexity of the model.
\pagebreak
### Problem 5 [9 + 6 = 15 points]
In HW2, we examined a collection of approaches to model non-linearity in the association between log ratio and range in the LIDAR dataset. Now, we will address the non-constant variance in those data. Throughout, use a cubic B spline with 6 degrees of freedom to model the mean structure.
### Solution (a):
The variance appears constant for low values of range; after a certain point, say range = 600, the variance is non-constant and increases with range. I defined the weights to be 1 for the first 141 observations (range below roughly 600) and $1/\sqrt{\text{range}}$ for the remaining observations. The plot below shows the original data, the fitted value curve from WLS (dark blue), and the 95% pointwise confidence interval for the fitted value curve (blue dashed). The 95% pointwise confidence interval appears reasonable, since the intervals for higher values of range are wider.
```{r, echo=FALSE, warning=FALSE, fig.height=4, fig.width=8, fig.align="center"}
#---------------------- (a)
set.seed(1)
data(lidar)
bs.lidar <- lidar %>% bind_cols(., data.frame(bs(.[['range']], df = 6))) %>%
rename(sp.1 = X1, sp.2 = X2, sp.3 = X3, sp.4 = X4, sp.5 = X5, sp.6 = X6)
# Define weights: 1 for the first 141 observations (range below ~600),
# 1/sqrt(range) for the rest
w <- rep(1, nrow(lidar))
w[142:221] <- 1 / sqrt(lidar$range[142:221])
wbs.lidar = mutate(bs.lidar, w = w)
wbspline.6 = lm(logratio ~ sp.1 + sp.2 + sp.3 + sp.4 + sp.5 + sp.6,
weights = w, data = wbs.lidar)
wbs.fitted = data.frame(lidar, wfit.6 = wbspline.6 %>% fitted())
# Pointwise 95% C.I. for wls
SP = model.matrix( ~ sp.1 + sp.2 + sp.3 + sp.4 + sp.5 + sp.6, data = wbs.lidar)
var_mat = (summary(wbspline.6)$sigma)^2 * summary(wbspline.6)$cov.unscaled
fitted.dat = lidar[, c("range", "logratio")]
w.fitted.dat = mutate(fitted.dat, w.fitted = fitted(wbspline.6),
w.UB = w.fitted + 2 * sqrt(diag(SP %*% var_mat %*% t(SP))),
w.LB = w.fitted - 2 * sqrt(diag(SP %*% var_mat %*% t(SP))))
p1 = ggplot(w.fitted.dat, aes(x = range, y = logratio)) + geom_point(alpha = .2) + theme_bw() +
geom_path(aes(y = w.fitted), color = "dark blue", size = 1.2) +
geom_path(aes(y = w.UB), color = "blue", size = .5, linetype=2) +
geom_path(aes(y = w.LB), color = "blue", size = .5, linetype=2) +
labs(title = "WLS")
p1
```
### Solution (b):
For each of 1000 iterations, we sampled our data with replacement to obtain a bootstrap sample with 221 observations; refit the B spline model; and saved the estimated coefficients. The 95% pointwise confidence interval, based on the fitted values and covariance of estimated coefficients across bootstrap samples, is shown below.
```{r, echo=FALSE, warning=FALSE, fig.height=4, fig.width=8, fig.align="center"}
#---------------------- (b)
# OLS
bspline.6 <- lm(logratio ~ sp.1 + sp.2 + sp.3 + sp.4 + sp.5 +
sp.6, data = bs.lidar)
summary(bspline.6)
bs.fitted <- data.frame(lidar, fit.6 = bspline.6 %>% fitted())
ggplot(data = bs.fitted, aes(x = range)) +
geom_point(aes(y = logratio), alpha = 0.5) +
geom_line(aes(y = fit.6), color="dark red", size=1) +
xlab("Range") + ylab("Logratio") + ggtitle("OLS Fitted B-spline Plot")
# Bootstrap OLS: resample rows with replacement, refit the B-spline model,
# and save the estimated coefficients from each of 1000 bootstrap samples
boot.coefs = matrix(NA, nrow = 1000, ncol = 7)
for(i in 1:1000){
  boot.lidar = lidar[sort(sample(1:221, 221, replace = TRUE)), ]
  boot.bs.lidar <- boot.lidar %>% bind_cols(., data.frame(bs(.[['range']], df = 6))) %>%
    rename(sp.1 = X1, sp.2 = X2, sp.3 = X3, sp.4 = X4, sp.5 = X5, sp.6 = X6)
  model.bs = lm(logratio ~ sp.1 + sp.2 + sp.3 + sp.4 + sp.5 + sp.6, data = boot.bs.lidar)
  boot.coefs[i, ] = as.numeric(coef(model.bs))
}
var.boot = cov(boot.coefs)
SP = model.matrix( ~ sp.1 + sp.2 + sp.3 + sp.4 + sp.5 + sp.6, data = bs.lidar)
var.boot
fitted.dat = lidar[, c("range", "logratio")]
o.fitted.dat = mutate(fitted.dat, o.fitted = fitted(bspline.6),
o.UB = o.fitted + 2 * sqrt(diag(SP %*% var.boot %*% t(SP))),
o.LB = o.fitted - 2 * sqrt(diag(SP %*% var.boot %*% t(SP))))
p2 = ggplot(o.fitted.dat, aes(x = range, y = logratio)) + geom_point(alpha = .2) + theme_bw() +
geom_path(aes(y = o.fitted), color = "dark red", size = 1.2) +
geom_path(aes(y = o.UB), color = "red", size = .5, linetype=2) +
geom_path(aes(y = o.LB), color = "red", size = .5, linetype=2) +
labs(title = "BS")
p2
```
Lastly, as a frame of reference, we show the CI constructed from the OLS approach.
```{r, echo=FALSE, warning=FALSE, fig.height=4, fig.width=8, fig.align="center"}
# Pointwise 95% C.I. for ols
SP = model.matrix( ~ sp.1 + sp.2 + sp.3 + sp.4 + sp.5 + sp.6, data = bs.lidar)
var_mat = (summary(bspline.6)$sigma)^2 * summary(bspline.6)$cov.unscaled
fitted.dat = lidar[, c("range", "logratio")]
o.fitted.dat = mutate(fitted.dat, o.fitted = fitted(bspline.6),
o.UB = o.fitted + 2 * sqrt(diag(SP %*% var_mat %*% t(SP))),
o.LB = o.fitted - 2 * sqrt(diag(SP %*% var_mat %*% t(SP))))
p3 = ggplot(o.fitted.dat, aes(x = range, y = logratio)) + geom_point(alpha = .2) + theme_bw() +
geom_path(aes(y = o.fitted), color = "dark green", size = 1.2) +
geom_path(aes(y = o.UB), color = "dark green", size = .5, linetype=2) +
geom_path(aes(y = o.LB), color = "dark green", size = .5, linetype=2) +
labs(title = "OLS")
p3
```
```{r, echo=FALSE, warning=FALSE, fig.align="center", fig.height=4, fig.width=12}
grid.arrange(p1, p2, p3, nrow = 1, ncol = 3)
```
Both WLS and the bootstrap give more plausible intervals than OLS. The intervals are narrower for lower values of range and wider for higher values, which reflects the variability in the observed data. Meanwhile, the pointwise interval based on OLS incorrectly assumes roughly constant error variance over the domain of the predictor, which is reflected in the roughly constant width of those intervals.
The comparison of WLS and the bootstrap is less clear-cut. The bootstrap intervals appear to be wider, which may reflect less dependence on the assumed form of the non-constant variance. However, this may also stem from using OLS estimates within the bootstrap -- OLS estimates are still inefficient, and the bootstrap may not by itself account for this. An alternative (not examined here) is to obtain weights from a less structured approach (e.g., kernel estimation). This could be compared to the bootstrap using OLS, or even to the bootstrap using the same less structured approach; the second method would use (efficient) WLS and account for uncertainty in the estimated weights.
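As a minimal sketch of the less structured weighting idea (assuming the `lidar` data from `SemiPar`; the `loess` span of 0.5 is a hypothetical, untuned choice), one could smooth the squared OLS residuals against range to estimate the variance function and use its inverse as WLS weights:

```r
## Sketch: estimate a variance function from squared OLS residuals,
## then refit by WLS with inverse-variance weights.
library(SemiPar)
library(splines)
data(lidar)
ols.fit <- lm(logratio ~ bs(range, df = 6), data = lidar)
## smooth squared residuals against range; span = 0.5 is a hypothetical choice
var.fit <- loess(resid(ols.fit)^2 ~ range, data = lidar, span = 0.5)
var.hat <- pmax(fitted(var.fit), 1e-6)  # guard against non-positive estimates
wls.fit <- lm(logratio ~ bs(range, df = 6), data = lidar, weights = 1 / var.hat)
```

This keeps the same mean structure but lets the data, rather than an assumed $1/\sqrt{\text{range}}$ form, determine how the weights change over the domain.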