---
output:
  pdf_document: default
  html_document: default
---
**Biostatistics P8111, Spring 2016**
**Homework 2 Solutions**
Due February 19 by 6:00pm
Please submit an electronic copy (PDF) of your solutions using CourseWorks. Use the title "P8111HW2_Lastname_Firstname.pdf".
Solutions to Problems 1-4 can be typed or handwritten and scanned; Problems 5-7 should be completed using R markdown. In addition to your PDF, please also submit the .Rmd that produces your written solutions to Problems 5-7 with the title "P8111HW2_Lastname_Firstname.Rmd".
```{r Setup, echo=FALSE, message=FALSE, warning=FALSE}
## load all needed packages at the beginning of the document and dataset
library(MASS)
library(ggplot2)
library(magrittr)
library(dplyr)
library(broom)
library(SemiPar)
```
### Problem 1. [5 points]
#### Solution:
\noindent For SLR, $\displaystyle \sum e_i = 0$ and $\displaystyle \sum x_i e_i = 0$ (these identities arise from setting the partial derivatives of the RSS to zero in the derivation of the LSEs). Therefore
\begin{align*}
SS_{tot} &= \sum (y_i - \bar{y})^2 = \sum (y_i - \widehat{y_i} + \widehat{y_i} - \bar{y})^2\\
&= SS_{res} + SS_{reg} + 2 \sum e_i ( \widehat{y_i} - \bar{y})\\
&= SS_{res} + SS_{reg} + 2 (\widehat{\beta_0} \sum e_i + \widehat{\beta_1} \sum x_i e_i - \bar{y} \sum e_i) \\
&= SS_{res} + SS_{reg}
\end{align*}
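As a quick numerical sanity check (not part of the required derivation), the two residual identities and the resulting decomposition can be verified on a small simulated dataset; all values below are arbitrary:

```{r Prob1check}
## simulate a small SLR dataset (arbitrary values) and verify the identities
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
fit <- lm(y ~ x)
e <- resid(fit)
## the two identities from the normal equations
c(sum(e), sum(x * e))
## the decomposition SS_tot = SS_res + SS_reg
SS.tot <- sum((y - mean(y))^2)
SS.res <- sum(e^2)
SS.reg <- sum((fitted(fit) - mean(y))^2)
SS.tot - (SS.res + SS.reg)
```

All displayed quantities are zero up to floating-point error.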
### Problem 2. [6 + 4 = 10 points]
#### Solution:
#### (a)
\noindent With $p(x, y)$ stated in the problem and
$$\displaystyle p(x) = \frac{1}{\sqrt{2 \pi} \sigma_x} \exp\{- \frac{1}{2} \frac{(x - \mu_x)^2}{\sigma_x^2}\} $$ (which can be derived by integrating the joint density over $y$),
\noindent The conditional density is computed by
\begin{align*}
p(y | x) &= \frac{p(x, y)}{p(x)}\\
&= \frac{\frac{1}{2 \pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\{- \frac{1}{2 (1 - \rho^2)} [\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} - \frac{2 \rho (x - \mu_x) (y - \mu_y)}{\sigma_x \sigma_y}]\} }{\frac{1}{\sqrt{2 \pi} \sigma_x} \exp\{- \frac{1}{2} \frac{(x - \mu_x)^2}{\sigma_x^2}\} } \\
&= \frac{1}{\sqrt{2 \pi} \sigma_y \sqrt{1 - \rho^2}} \exp\{- \frac{1}{2 (1 - \rho^2) \sigma_y^2} [(y - \mu_y)^2 + \frac{\rho^2 (x - \mu_x)^2 \sigma_y^2}{\sigma_x^2} - \frac{2 \rho (x - \mu_x) (y - \mu_y) \sigma_y}{\sigma_x}]\} \\
&= \frac{1}{\sqrt{2 \pi} \sigma_y \sqrt{1 - \rho^2}} \exp\{- \frac{1}{2 (1 - \rho^2) \sigma_y^2} [y - \mu_y - \rho \frac{\sigma_y}{\sigma_x}(x - \mu_x)]^2\} \\
\end{align*}
Since a pdf characterizes its distribution, we conclude that
$$ y | x \sim N (\mu_y + \rho \frac{\sigma_y}{\sigma_x} (x - \mu_x), \sigma_y^2 (1 - \rho^2)) $$
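This conditional distribution can also be checked by simulation (a sketch, not required for the solution; the parameter values below are arbitrary): draw from the bivariate normal with `MASS::mvrnorm` and condition on $x$ lying in a narrow window.

```{r Prob2check}
library(MASS)
set.seed(1)
## arbitrary parameter values for the bivariate normal
mu.x <- 1; mu.y <- 2; s.x <- 1.5; s.y <- 2; rho <- 0.6
Sigma <- matrix(c(s.x^2, rho * s.x * s.y, rho * s.x * s.y, s.y^2), 2, 2)
xy <- mvrnorm(100000, mu = c(mu.x, mu.y), Sigma = Sigma)
## condition on x near a fixed value x0
x0 <- 1.5
y.sub <- xy[abs(xy[, 1] - x0) < 0.05, 2]
## empirical conditional moments vs. the formulas above
c(mean(y.sub), mu.y + rho * (s.y / s.x) * (x0 - mu.x))
c(var(y.sub), s.y^2 * (1 - rho^2))
```

The empirical conditional mean and variance agree with the derived formulas up to Monte Carlo error.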
#### (b)
\noindent In the SLR model $y_i | x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$. To match this to the solution in (a), we find
\begin{align*}
\beta_0 + \beta_1 x_i &= \mu_y + \rho \frac{\sigma_y}{\sigma_x} (x_i - \mu_x) \\
&= (\mu_y - \rho \frac{\sigma_y}{\sigma_x} \mu_x) + \rho \frac{\sigma_y}{\sigma_x} x_i
\end{align*}
\noindent and $\epsilon_i$'s follow Gaussian distribution with
$$\sigma^2 = \sigma_y^2 (1 - \rho^2)$$
\noindent So the model is
$$ y_i = (\mu_y - \rho \frac{\sigma_y}{\sigma_x} \mu_x) + \rho \frac{\sigma_y}{\sigma_x} x_i + \epsilon_i, \qquad \epsilon_i \stackrel{iid}{\sim} N (0, \sigma_y^2 (1 - \rho^2))$$
Note that the population intercept and slope mirror the forms of the LSEs $\hat{\beta}_0$ and $\hat{\beta}_1$ in a simple linear regression, with moments replaced by their sample analogues.
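The sample version of this correspondence holds exactly: for any dataset, the LSE slope equals $r \, s_y / s_x$ and the intercept equals $\bar{y} - \hat{\beta}_1 \bar{x}$. A quick check on arbitrary simulated data (not required by the problem):

```{r Prob2bcheck}
set.seed(2)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)
slope <- cor(x, y) * sd(y) / sd(x)      # sample analogue of rho * sigma_y / sigma_x
intercept <- mean(y) - slope * mean(x)  # sample analogue of mu_y - rho (sigma_y / sigma_x) mu_x
c(coef(fit)[2] - slope, coef(fit)[1] - intercept)
```

Both differences are zero up to floating-point error.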
### Problem 3. [4 + 4 + 2 = 10 points]
#### Solution:
#### (a)
By the linearity of expectation
\begin{align*}
E (\widehat{\beta_1}) &= E (\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}) = \frac{\sum (x_i - \bar{x})E (y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \\
&= \frac{\sum (x_i - \bar{x}) (\beta_0 + \beta_1 x_i - \beta_0 - \beta_1 \bar{x})}{\sum (x_i - \bar{x})^2}\\
&= \frac{\sum (x_i - \bar{x})^2 \beta_1}{\sum (x_i - \bar{x})^2} = \beta_1\\
\end{align*}
#### (b)
Since the $\epsilon_i$'s are iid, the $y_i$'s are independent; therefore
\begin{align*}
Var (\widehat{\beta_1}) &= Var (\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}) \\
&= Var (\frac{\sum (x_i - \bar{x})y_i - \bar{y} \sum(x_i - \bar{x}) }{\sum (x_i - \bar{x})^2}) \\
&= Var (\frac{\sum (x_i - \bar{x})y_i }{\sum (x_i - \bar{x})^2}) \\
&= \frac{\sum (x_i - \bar{x})^2 Var(y_i)}{\{\sum (x_i - \bar{x})^2\}^2} = \frac{\sum (x_i - \bar{x})^2 \sigma^2}{\{\sum (x_i - \bar{x})^2\}^2} \\
&= \frac{\sigma^2}{\sum (x_i - \bar{x})^2}
\end{align*}
#### (c)
By the formula derived in (b)
\begin{align*}
Var(\widehat{\beta_1^*}) &= \frac{\sigma^2}{\sum (c x_i - c \bar{x})^2} \\
&= \frac{\sigma^2}{c^2 \sum (x_i - \bar{x})^2}
\end{align*}
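Equivalently, rescaling $x$ by $c$ divides the standard error of the slope by $|c|$. This relationship is exact, since the fitted values (and hence $\hat{\sigma}^2$) are unchanged by the rescaling; a sketch with arbitrary values:

```{r Prob3check}
set.seed(3)
x <- 1:25
y <- 2 + 4 * x + rnorm(25, sd = 3)
cc <- 10
se.x  <- summary(lm(y ~ x))$coef[2, 2]          # SE of slope with predictor x
se.cx <- summary(lm(y ~ I(cc * x)))$coef[2, 2]  # SE of slope with predictor c * x
se.x / se.cx
```

The ratio equals $c$ exactly.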
### Problem 4. [6 + 4 = 10 points]
Assume a "true" MLR model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i.$$
Suppose we ignore $x_2$ and fit the simple linear regression
$$y_i = \beta_0^* + \beta_1^* x_{i1} + \epsilon_i^* $$
(a) Show that
$$ \hat{\beta_1^*} = \beta_1 + \beta_2 \frac{ \sum (x_{i1} - \bar{x}_1) (x_{i2} - \bar{x}_2) }{ \sum (x_{i1} - \bar{x}_1)^2 }
+ \frac{ \sum (x_{i1} - \bar{x}_1) (\epsilon_i - \bar{\epsilon} ) }{\sum (x_{i1} - \bar{x}_1)^2} $$
and
$$ E( \hat{\beta_1^*} )
= \beta_1 + \beta_2 \frac{ \sum (x_{i1} - \bar{x}_1) (x_{i2} - \bar{x}_2) }{ \sum (x_{i1} - \bar{x}_1)^2 } $$
\noindent (b) In Lecture 6, we gave two conditions under which $E( \hat{\beta_1^*} ) = \beta_1$. Discuss these conditions in terms of your solution to part (a). Relate the definition of confounding to the omitted variable bias found above.
#### Solution:
#### (a)
We use the formula for the LSE of $\beta_1$ in a SLR. Since $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$ and $\bar{y} = \beta_0 + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2 + \bar{\epsilon}$
\begin{align*}
\hat{\beta_1^*}
& = \frac{\sum (y_i - \bar{y})(x_{i1} - \bar{x}_1) }{ \sum ( x_{i1} - \bar{x}_1 )^2 } \\
& = \frac{ \sum [ \beta_1 (x_{i1} - \bar{x}_1) + \beta_2 (x_{i2} - \bar{x}_2) + (\epsilon_i - \bar{\epsilon} ) ] (x_{i1} - \bar{x}_1) }{ \sum ( x_{i1} - \bar{x}_1 )^2 } \\
& = \beta_1 + \beta_2 \frac{ \sum (x_{i1} - \bar{x}_1) (x_{i2} - \bar{x}_2) }{ \sum (x_{i1} - \bar{x}_1)^2 }
+ \frac{ \sum (x_{i1} - \bar{x}_1) (\epsilon_i - \bar{\epsilon} ) }{\sum (x_{i1} - \bar{x}_1)^2}.
\end{align*}
\noindent Also, since $E(\epsilon_i - \bar{\epsilon}) = 0$,
\begin{align*}
E( \hat{\beta_1^*} )
= \beta_1 + \beta_2 \frac{ \sum (x_{i1} - \bar{x}_1) (x_{i2} - \bar{x}_2) }{ \sum (x_{i1} - \bar{x}_1)^2 }.
\end{align*}
#### (b)
There are two conditions under which $E( \hat{\beta_1^*} ) = \beta_1$, given the expression in (a). First, $\beta_2 = 0$: the omitted variable is unrelated to the outcome. Second, $\sum (x_{i1} - \bar{x}_1) (x_{i2} - \bar{x}_2) = 0$, i.e. the sample covariance of $x_1$ and $x_2$ is zero: the omitted variable is uncorrelated with the retained variable.
To be a confounder, a variable must be related to both the predictor of interest and the outcome; in that case neither condition above holds. The derivation in (a) shows that omitting a confounder makes the OLS estimator biased for the true $\beta_1$.
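A simulation sketch of this omitted-variable bias (all values arbitrary, and not required by the problem): with $x_2$ correlated with $x_1$ and both held fixed, the average of $\hat{\beta}_1^*$ over repeated error draws matches $\beta_1$ plus the bias term from part (a).

```{r Prob4check}
set.seed(4)
n <- 200; beta1 <- 1; beta2 <- 2
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)  # omitted variable, correlated with x1
## bias term from part (a), computed from the fixed x's
bias <- beta2 * sum((x1 - mean(x1)) * (x2 - mean(x2))) / sum((x1 - mean(x1))^2)
## average of the SLR slope over repeated error draws
b1.star <- mean(replicate(2000, {
  y <- beta1 * x1 + beta2 * x2 + rnorm(n)
  coef(lm(y ~ x1))[2]
}))
c(b1.star - beta1, bias)  # the two agree, and neither is zero
```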
### Problem 5. [10 points]
Conduct a simulation to empirically demonstrate the properties of the LSEs $\hat{\beta}_0$ and $\hat{\beta}_1$ in a SLR. Specifically, illustrate the findings that LSEs are unbiased and have the expected covariance matrix. A suggested structure for your simulation is:
* Set parameter values of your choice for $\beta_0$, $\beta_1$, $\sigma_{\epsilon}^2$, and choose a sample size $n$.
* Select values for $x_i$ (recall that in our derivations, the $x$'s have not been considered random - you
should generate only one collection of $x$'s which remain fixed throughout your simulations).
* Repeat the following many times to find the empirical distribution of $\hat{\beta}_1$.
+ Simulate errors $\epsilon_i$ from a mean-zero distribution with variance $\sigma_{\epsilon}^2$.
+ Construct observed responses $y_i$ according to the simple linear regression model.
+ Fit the simple linear regression and save the LSEs.
You should describe how this simulation illustrates the properties of LSEs using numeric summaries and visual displays.
#### Solution:
```{r Prob5, echo=FALSE, cache=TRUE}
set.seed(1234)
N.iter = 10000
n = 30
beta0 = 2
beta1 = 4
sigma2 = 9
x = 1:n
X = model.matrix(~ x)
varX = solve(t(X) %*% X) * sigma2  ## theoretical covariance matrix of the LSEs
betahat0 = rep(0, N.iter)
betahat1 = rep(0, N.iter)
for (i in 1:N.iter) {
  y = beta0 + beta1 * x + rnorm(n, 0, sd = sqrt(sigma2))
  fit = lm(y ~ x)
  betahat0[i] = fit$coef[1]
  betahat1[i] = fit$coef[2]
}
```
We set parameter values to be $n=30$, $\beta_0=2$, $\beta_1=4$, $\sigma_{\epsilon}^2 = 9$ and $x = 1,2,3, \ldots, 30$. We generate errors from a mean-zero distribution with variance $\sigma_{\epsilon}^2$.
We expect that $E(\hat{\beta}_0) =$ `r beta0`, $E(\hat{\beta}_1)$ = `r beta1`, $Var(\hat{\beta}_0) = \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar x)^2} \right] \sigma^2 =$ `r sigma2* (1/n + mean(x)^2 / sum((x-mean(x))^2))`, and $Var(\hat{\beta}_1) = \frac{\sigma^2}{ \sum_{i=1}^n (x_i-\bar x)^2}=$ `r sigma2 / sum((x-mean(x))^2)`. Across simulated datasets, the mean of $\hat{\beta}_0$ was `r round(mean(betahat0),4)` and the mean of $\hat{\beta}_1$ was `r round(mean(betahat1),4)`. Also, the variance of $\hat{\beta}_0$ was `r var(betahat0)`, and that of $\hat{\beta}_1$ was `r var(betahat1)`.
In matrix notation, let $X$ be the design matrix for our regression. Then the expected value and covariance matrix of $[\hat{\beta}_0, \hat{\beta}_1]^{T}$ are
$$ E\begin{pmatrix}
\hat{\beta}_0 \\
\hat{\beta}_1
\end{pmatrix} = \begin{pmatrix}
2 \\
4
\end{pmatrix}, \quad Var \begin{pmatrix}
\hat{\beta}_0 \\
\hat{\beta}_1
\end{pmatrix} = \sigma^2 (X^TX)^{-1} =
\begin{pmatrix}
1.26206897 & -0.062068966 \\
-0.06206897 & 0.004004449
\end{pmatrix} $$
Across simulated datasets, the covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$ was `r cov(betahat0, betahat1)`.
We show the empirical distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$ using several visual displays, including density plots and scatterplots.
```{r Prob5plots, echo=FALSE, fig.width=2, fig.height=2}
dat = data.frame(betahat0, betahat1)
ggplot(dat, aes(x = betahat0)) + geom_density(fill = "blue", alpha = .3)
ggplot(dat, aes(x = betahat1)) + geom_density(fill = "blue", alpha = .3)
ggplot(dat, aes(x = betahat0, y = betahat1)) + geom_point()
```
The simulation supports the unbiasedness and approximate normality of the LSEs, and the empirical means, variances, and covariance agree closely with the theoretical values.
### Problem 6. [2 + 4 + 4 = 10 points]
(a) Using the iris dataset, focus on plants with species versicolor and virginica and on Petal.Length
as an outcome. Do a two sample t-test assuming equal variances; in particular, give the standard error
for the difference in means.
(b) For the same dataset and species, pose a regression model in which one parameter is interpretable as
the mean difference in Petal.Length. Fit the model and show a summary of the results. Comment on
the relationship between your fitted model and your t-test in part (a).
(c) Pose a model that is similar to that in (b), but include all species; make versicolor the reference group
for your categorical predictor. What parameter is interpretable as the mean difference in Petal.Length
comparing versicolor to virginica? Comment on similarities and differences between your fitted
models in parts (b) and (c) for this parameter.
#### Solution:
#### (a)
Denote $\mu_1$ as the mean petal length for species "virginica" and $\mu_2$ as the mean petal length for species "versicolor".
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \ne \mu_2$$
```{r Prob6a, echo=FALSE, cache=TRUE}
data(iris)
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]
#iris %>% filter(Species == "versicolor") %>% select(Petal.Length) %>% unlist()
virginica <- iris$Petal.Length[iris$Species == "virginica"]
#iris %>% filter(Species == "virginica") %>% select(Petal.Length) %>% unlist()
t.test(virginica, versicolor, var.equal = TRUE)
sd.err <- sqrt(((length(versicolor) - 1) * var(versicolor) + (length(virginica) - 1) * var(virginica)) /
                 (length(versicolor) + length(virginica) - 2) *
                 (1 / length(versicolor) + 1 / length(virginica)))
mean.diff <- mean(virginica) - mean(versicolor)
```
At significance level $\alpha = 0.05$, $H_0$ is rejected. The data provide enough evidence to conclude that the mean petal length of species "versicolor" differs from the mean petal length of species "virginica".
Difference in sample means: $\bar{X}_1 - \bar{X}_2 = `r round(mean.diff, 3)`$
Standard error for the difference in means (assuming equal variance): `r round(sd.err, 3)`.
#### (b)
Our regression model is
$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where $x_i$ is a binary indicator taking the value 1 when unit $i$ is of type virginica and the value 0 otherwise.
```{r Prob6b, echo=FALSE, cache=TRUE}
iris %>% filter(Species %in% c("versicolor", "virginica")) %>%
lm(data = ., Petal.Length ~ Species) %>%
tidy()
```
As we can see from the regression summary above, the following relationships are observed:
1. The intercept ($\hat{\beta}_0$) is equal to the mean petal length of versicolor.
2. The coefficient of virginica ($\hat{\beta}_1$) is equal to the mean difference in petal length comparing virginica to versicolor.
3. The standard error of $\hat{\beta}_1$ is equal to the standard error of the difference in means in part (a).
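These relationships can be confirmed directly (a sketch, not required by the problem; the `stderr` component of `htest` objects is available in R >= 3.6):

```{r Prob6check}
data(iris)
sub <- droplevels(iris[iris$Species %in% c("versicolor", "virginica"), ])
fit <- summary(lm(Petal.Length ~ Species, data = sub))
tt <- t.test(Petal.Length ~ Species, data = sub, var.equal = TRUE)
## slope = mean(virginica) - mean(versicolor); its SE matches the pooled-t SE
c(fit$coef[2, 1], unname(diff(tt$estimate)))
c(fit$coef[2, 2], tt$stderr)
```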
#### (c)
Our regression model is
$$ y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i$$
where $x_{i,1}$ is a binary indicator taking the value 1 when unit $i$ is of type setosa and taking the value 0 otherwise, and $x_{i,2}$ is a binary indicator taking the value 1 when unit $i$ is of type virginica and taking the value 0 otherwise.
```{r Problem6c, echo=FALSE, cache=TRUE}
iris %>%
mutate(Species = relevel(Species, ref = "versicolor")) %>%
lm(data = ., Petal.Length ~ Species) %>% tidy()
```
The mean difference in Petal.Length comparing versicolor to virginica is still the coefficient of virginica.
When a categorical variable has more than two levels, each coefficient in the fitted linear model is still the mean difference comparing that category to the reference group. However, the standard error is no longer equal to the standard error from the two-sample t-test in part (a), because in this model the residual standard deviation is estimated using data from all groups.
### Problem 7. [1 + 4 + 4 + 4 + 2 = 15 points]
LIDAR - light detection and ranging - is a technique similar to radar that measures distances using the
reflection of light. In this problem, we will analyze a dataset consisting of 221 observations of range (distance)
and log ratio of light received by the sensor. This problem uses the LIDAR dataset available in the SemiPar
package.
(a) Make a plot of log ratio against range, and describe the relationship in the data.
(b) Construct a sequence of polynomial models with degree 3, 6, and 8 for the association between log ratio
and range. Make a plot illustrating your models. Comment on their fit, and indicate which polynomial
model you prefer.
(c) Construct a piece-wise linear model for the association between log ratio and range. How did you decide
on knot location?
(d) Construct a sequence of models for the association between log ratio and range using B-splines, using
4, 6, and 8 basis functions. Make a plot illustrating your models. Comment on their fit, and indicate
which spline model you prefer.
(e) Of the possible approaches, which do you prefer? Are there any issues in the data that these regressions
do not address?
#### Solution:
#### (a)
```{r Prob7a, cache=TRUE, echo=FALSE, message=FALSE, warning=FALSE, fig.width=5, fig.height=3, fig.align="center"}
data(lidar)
ggplot(data=lidar,aes(x = range,y = logratio)) +
geom_point() + xlab("Range") + ylab("Logratio") + ggtitle("Scatter Plot of Logratio against Range")
```
A plot of log ratio against range is shown above. From this plot, we see that there is a clear
relationship between log ratio and range, but that the relationship changes over the domain of range. For
values of range up to 550, the mean of log ratio is essentially constant; following this, there is a sharp decrease
in log ratio for values of range between 550 and 650, after which the relationship largely levels off. There is
low variance for small values of range and high variance for large values of range.
#### (b)
```{r Prob7b, cache=TRUE, echo=FALSE, message=FALSE, warning=FALSE, fig.width=5, fig.height=3, fig.align="center"}
poly.3 <- lm(data = lidar, logratio~poly(range,3))
poly.6 <- lm(data = lidar, logratio~poly(range,6))
poly.8 <- lm(data = lidar, logratio~poly(range,8))
poly.fitted <- data.frame(lidar,
                          fit.3 = poly.3 %>% fitted(),
                          fit.6 = poly.6 %>% fitted(),
                          fit.8 = poly.8 %>% fitted())
ggplot(data=poly.fitted,aes(x = range)) +
geom_point(aes(y = logratio), alpha = 0.5) +
geom_line(aes(y = fit.3), color = "sky blue", size=1)+
geom_line(aes(y = fit.6), color="dark blue", size=1)+
geom_line(aes(y = fit.8), color="red", size=1) +
xlab("Range") + ylab("Logratio") + ggtitle("Fitted Polynomial Plot")
```
The fitted lines of degree 3 (sky blue), degree 6 (dark blue) and degree 8 (red) are shown above. Both the degree-6 and degree-8 models look like reasonable fits to the data. We can use an F-test to compare the three nested models (although this is not required for the problem).
```{r Prob7b_2, cache=TRUE, echo=FALSE, message=FALSE, warning=FALSE, fig.align="center"}
anova(poly.3, poly.6, poly.8)
```
The ANOVA table above indicates that degree 6 significantly improves on degree 3, and that degree 8 significantly improves on degree 6, so the degree-8 model is preferred.
#### (c)
One can set knot locations by visually inspecting the data. Based on the scatter plot in part (a), putting knots at ranges 550 and 650 seems reasonable. Alternatively, one can distribute knots evenly across the range of the data and use a variable selection procedure to pick the most relevant knots. In this case I set 5 evenly distributed knots between ranges 400 and 700, and used a stepwise selection function to choose among them. The best model by the AIC criterion has 3 knots, at range = 550, 600, and 650. (Please see the .Rmd file for complete code; also, note that several solutions to this problem are possible.)
The summary of the final piecewise model is shown below, along with a visualization:
```{r Prob7c,cache=TRUE,echo=FALSE,message=FALSE,warning=FALSE, fig.width=5, fig.height=3, fig.align="center"}
pw.lidar <- mutate(lidar,
                   spline_450 = (range - 450) * (range >= 450),
                   spline_500 = (range - 500) * (range >= 500),
                   spline_550 = (range - 550) * (range >= 550),
                   spline_600 = (range - 600) * (range >= 600),
                   spline_650 = (range - 650) * (range >= 650))
piecewise <- step(object = lm(data = pw.lidar, logratio ~ range),
                  scope = list(upper = lm(data = pw.lidar, logratio ~ .)),
                  direction = "both", trace = 0)
tidy(piecewise)
pw.fitted <- data.frame(lidar, fit = piecewise %>% fitted())
ggplot(data=pw.fitted,aes(x = range)) +
geom_point(aes(y = logratio), alpha = 0.5) +
geom_line(aes(y = fit), color = "sky blue", size=1) +
xlab("Range") + ylab("Logratio") + ggtitle("Fitted Piecewise Model")
```
#### (d)
The graph below shows the fitted curves for B-spline models using 4, 6, and 8 basis functions: the sky blue line is the 4-basis B-spline, the dark blue line the 6-basis B-spline, and the red line the 8-basis B-spline.
```{r Prob7d, cache=TRUE, echo=FALSE, message=FALSE, warning=FALSE, fig.width=5, fig.height=3, fig.align="center"}
library(splines)
bs.lidar <- lidar %>% bind_cols(., data.frame(bs(.[['range']], df = 4))) %>%
  rename(sp.1 = X1, sp.2 = X2, sp.3 = X3, sp.4 = X4)
bspline.4 <- lm(logratio ~ sp.1 + sp.2 + sp.3 + sp.4, data = bs.lidar)
bs.lidar <- lidar %>% bind_cols(., data.frame(bs(.[['range']], df = 6))) %>%
  rename(sp.1 = X1, sp.2 = X2, sp.3 = X3, sp.4 = X4, sp.5 = X5, sp.6 = X6)
bspline.6 <- lm(logratio ~ sp.1 + sp.2 + sp.3 + sp.4 + sp.5 + sp.6, data = bs.lidar)
bs.lidar <- lidar %>% bind_cols(., data.frame(bs(.[['range']], df = 8))) %>%
  rename(sp.1 = X1, sp.2 = X2, sp.3 = X3, sp.4 = X4, sp.5 = X5, sp.6 = X6, sp.7 = X7, sp.8 = X8)
bspline.8 <- lm(logratio ~ sp.1 + sp.2 + sp.3 + sp.4 + sp.5 + sp.6 + sp.7 + sp.8, data = bs.lidar)
bs.fitted <- data.frame(lidar,
                        fit.4 = bspline.4 %>% fitted(),
                        fit.6 = bspline.6 %>% fitted(),
                        fit.8 = bspline.8 %>% fitted())
ggplot(data=bs.fitted,aes(x = range)) +
geom_point(aes(y = logratio), alpha = 0.5) +
geom_line(aes(y = fit.4), color = "sky blue", size=1)+
geom_line(aes(y = fit.6), color="dark blue", size=1)+
geom_line(aes(y = fit.8), color="red", size=1) +
xlab("Range") + ylab("Logratio") + ggtitle("Fitted B-spline Plot ")
```
The B-spline with 4 basis functions (sky blue) doesn't fit the data well. The B-spline with 6 basis functions (dark blue) looks too wavy and fits poorly at the boundaries. I prefer 8 basis functions (red): the fit is good without overfitting the data.
#### (e)
A fitted plot comparing the three models is shown below; here I chose the best model from each approach. The sky blue line is the degree-8 polynomial, the dark blue line the piecewise model with 3 knots, and the red line the B-spline with 8 basis functions. Based on the plot, all three approaches fit the data well.
```{r Prob7e, cache=TRUE, echo=FALSE, message=FALSE, warning=FALSE, fig.width=5, fig.height=3, fig.align="center"}
plot.compare <- data.frame(lidar,
                           poly = poly.8 %>% fitted(),
                           piecewise = piecewise %>% fitted(),
                           bspline = bspline.8 %>% fitted())
ggplot(data=plot.compare,aes(x = range)) +
geom_point(aes(y = logratio), alpha = 0.5) +
geom_line(aes(y = poly), color = "sky blue", size=1)+
geom_line(aes(y = piecewise), color="dark blue", size=1)+
geom_line(aes(y = bspline), color="red", size=1) +
xlab("Range") + ylab("Logratio") + ggtitle("Comparing Three Approaches")
```
We'll soon learn that AIC can provide a method to compare different modeling approaches, with smaller values indicating better fits. For these models, the AICs were:
\begin{table}[ht]
\centering
\begin{tabular}{lrrr}
\hline
& Polynomial & Piecewise & B-spline \\
\hline
AIC & `r round(AIC(poly.8), 2)` & `r round(AIC(piecewise), 2)` & `r round(AIC(bspline.8), 2)` \\
\hline
\end{tabular}
\end{table}
All three models have similar AICs, but the piecewise model has the smallest. Moreover, the interpretation of a piecewise model can be more straightforward than that of the other approaches. For these reasons I prefer the piecewise approach: it fits the data well and is relatively simple.
From the scatter plot, it is apparent that the variance increases with range; this non-constant variance is not addressed by any of the three approaches above.