This problem focuses on the collinearity problem
> set.seed(1)
> x1=runif(100)
> x2=0.5*x1+rnorm(100)/10
> y=2+2*x1+0.3*x2+rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat{\beta_0}, \hat{\beta_1}\), and \(\hat{\beta_2}\)? How do these relate to the true \(\beta_0, \beta_1\), and \(\beta_2\)? Can you reject the null hypothesis \(H_0: \beta_1 = 0\)? How about the null hypothesis \(H_0 : \beta_2 = 0\)?
Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis \(H_0: \beta_1 = 0\)?
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis \(H_0: \beta_1 = 0\)?
Do the results obtained in (c)–(e) contradict each other? Explain your answer.
Now suppose we obtain one additional observation, which was unfortunately mismeasured.
> x1=c(x1, 0.1)
> x2=c(x2, 0.8)
> y=c(y,6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
set.seed(0) # Setting the random seed
Generating x1 using runif(), which draws from the uniform distribution on [0, 1]
x1 = runif(100)
Generating x2 as 0.5*x1 plus noise drawn from a normal distribution
x2 = 0.5*x1 + rnorm(100)/10
Creating the response: y is a linear function of x1 and x2, \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\), with regression coefficients \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\)
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
plot(x1, x2)
Based on the scatterplot, there is a strong positive linear relationship between x1 and x2
cor(x1, x2) # 0.8324
## [1] 0.8323673
Fitting a least squares regression to predict y using x1 and x2
lm.fit = lm(y~x1+x2)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0787 -0.7651 -0.1705 0.8081 2.4605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0966 0.2289 9.160 8.75e-15 ***
## x1 2.5370 0.7044 3.602 0.000501 ***
## x2 -0.8687 1.1899 -0.730 0.467116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.049 on 97 degrees of freedom
## Multiple R-squared: 0.2345, Adjusted R-squared: 0.2187
## F-statistic: 14.86 on 2 and 97 DF, p-value: 2.353e-06
The coefficient estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are close to the true coefficients \(\beta_0\) and \(\beta_1\), each within about 0.5 of the true value, and the small p-value for x1 (0.0005) means we can reject \(H_0: \beta_1 = 0\).
However, the estimate \(\hat{\beta_2}\) is off by more than a unit from the true \(\beta_2 = 0.3\), and its large p-value (0.467) means we cannot reject \(H_0: \beta_2 = 0\): once x1 is in the model, there is little evidence of a relationship between y and x2.
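As a quick cross-check (added here, not part of the original output), the 95% confidence intervals tell the same story: the interval for the x2 coefficient comfortably contains 0, while the interval for x1 does not.
confint(lm.fit) # 95% confidence intervals for the coefficients in the joint fit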
library(car)
## Loading required package: carData
vif(lm.fit)
## x1 x2
## 3.255583 3.255583
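For context (this check is an addition, not part of the original output), the VIF of a predictor equals \(1/(1 - R^2)\) from regressing that predictor on the other predictors, so the values above can be reproduced by hand; VIFs above about 3 here reflect the strong overlap between x1 and x2 that inflates the standard error of \(\hat{\beta_2}\).
r2.x2 = summary(lm(x2 ~ x1))$r.squared # R^2 from regressing x2 on x1
1/(1 - r2.x2) # approximately 3.26, matching vif(lm.fit) above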
lm.x1 = lm(y~x1)
summary(lm.x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0009 -0.6889 -0.1621 0.7807 2.5362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1015 0.2282 9.207 6.36e-15 ***
## x1 2.1089 0.3895 5.415 4.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.047 on 98 degrees of freedom
## Multiple R-squared: 0.2303, Adjusted R-squared: 0.2224
## F-statistic: 29.32 on 1 and 98 DF, p-value: 4.373e-07
We can reject the null hypothesis \(H_0: \beta_1 = 0\) because the p-value (4.37e-07) is far below 0.05.
lm.x2 = lm(y~x2)
summary(lm.x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4823 -0.8368 -0.1330 0.8120 2.5029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5227 0.2076 12.154 < 2e-16 ***
## x2 2.6985 0.6986 3.863 0.000201 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.112 on 98 degrees of freedom
## Multiple R-squared: 0.1321, Adjusted R-squared: 0.1233
## F-statistic: 14.92 on 1 and 98 DF, p-value: 0.0002014
We can reject the null hypothesis that the coefficient on x2 is zero because the p-value (0.0002) is far below 0.05.
The results in (c)–(e) do not contradict each other. In (c), the coefficient on x2 was not significant, yet in (e), when x2 is fit on its own, it becomes highly significant, and the estimate \(\hat{\beta_2}\) changes by almost 2 units between the two fits. This is exactly what collinearity produces: because x2 was generated as roughly 0.5*x1 plus noise, the two predictors carry nearly the same information, so in the joint model their standard errors are inflated and x2 adds little once x1 is included, while on its own x2 serves as a proxy for x1.
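One way to see this directly (an added comparison, not in the original output) is to place the x1 coefficient from the joint fit next to the one from the x1-only fit: the estimates are of similar size, but the standard error in the joint model is roughly twice as large.
coef(summary(lm.fit))["x1", c("Estimate", "Std. Error")] # joint model: inflated standard error
coef(summary(lm.x1))["x1", c("Estimate", "Std. Error")] # x1-only model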
Adding the mismeasured observation to the data
x1 = c(x1, 0.1)
x2 = c(x2, 0.8)
y = c(y, 6)
lm.new.fit = lm(y~x1+x2)
summary(lm.new.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8386 -0.7164 -0.2169 0.7466 2.5947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2092 0.2358 9.368 2.85e-15 ***
## x1 1.2177 0.5870 2.075 0.0407 *
## x2 1.5187 0.9491 1.600 0.1128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.095 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1862
## F-statistic: 12.44 on 2 and 98 DF, p-value: 1.536e-05
lm.new.x1 = lm(y~x1)
summary(lm.new.x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9627 -0.7071 -0.1782 0.7696 3.5646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2453 0.2366 9.491 1.41e-15 ***
## x1 1.9013 0.4056 4.687 8.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.103 on 99 degrees of freedom
## Multiple R-squared: 0.1816, Adjusted R-squared: 0.1733
## F-statistic: 21.97 on 1 and 99 DF, p-value: 8.861e-06
lm.new.x2 = lm(y~x2)
summary(lm.new.x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5156 -0.8391 -0.1422 0.7896 2.5563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4708 0.2025 12.199 < 2e-16 ***
## x2 2.9518 0.6616 4.462 2.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.113 on 99 degrees of freedom
## Multiple R-squared: 0.1674, Adjusted R-squared: 0.159
## F-statistic: 19.91 on 1 and 99 DF, p-value: 2.152e-05
With the new observation, the \(R^2\) of the joint model and of the x1-only model drops, and in the joint model the coefficient estimate \(\hat{\beta_1}\) is farther from the true value than before (1.22 versus 2.54).
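A compact way to see the shift (an added comparison, not in the original output) is to line up the joint-model coefficient estimates before and after the extra observation.
rbind(old = coef(lm.fit), new = coef(lm.new.fit)) # coefficients without and with the added point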
plot(hatvalues(lm.new.fit))
Based on the leverage plot above, the new data point (observation 101) has high leverage.
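A rough numeric check (added here, not part of the original output) is to compare the leverage of the added observation, index 101, with the average leverage \((p + 1)/n\); a value several times the average marks a high-leverage point.
hatvalues(lm.new.fit)[101] # leverage of the added observation
(2 + 1)/length(y) # average leverage with p = 2 predictors and n = 101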
par(mfrow = c(2,2)) # 4 plots in same picture
plot(lm.new.fit)
Since the studentized residual for the new observation is not greater than 3 in absolute value, it is not an outlier in this model, even though it is a high-leverage point.
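The same conclusion can be verified numerically (an added check, not in the original output): the studentized residual of the added observation can be inspected directly, and a value beyond about 3 in absolute value would indicate an outlier.
rstudent(lm.new.fit)[101] # studentized residual of the added observation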