p122
This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
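Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
Is there a relationship between the predictors and the response?
Which predictors appear to have a statistically significant relationship to the response?
What does the coefficient for the year variable suggest?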
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
Try a few different transformations of the variables, such as log(X), \(\sqrt{X}, X^2\). Comment on your findings.
library(ISLR)
library(tidyverse)
library(GGally)
library(car) # scatterplotMatrix
Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)
auto <- as_tibble(Auto)
auto <- select(auto, -name)
colnames(auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin"
names(auto)[names(auto) == "displacement"] <- "displ"
names(auto)[names(auto) == "horsepower"] <- "hp"
names(auto)[names(auto) == "acceleration"] <- "accel"
ggpairs(auto)
scatterplotMatrix(auto, smooth = FALSE, main="Scatter Plot Matrix")
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
options(digits=2)
cor(auto) # name was already dropped from auto above
## mpg cylinders displ hp weight accel year origin
## mpg 1.00 -0.78 -0.81 -0.78 -0.83 0.42 0.58 0.57
## cylinders -0.78 1.00 0.95 0.84 0.90 -0.50 -0.35 -0.57
## displ -0.81 0.95 1.00 0.90 0.93 -0.54 -0.37 -0.61
## hp -0.78 0.84 0.90 1.00 0.86 -0.69 -0.42 -0.46
## weight -0.83 0.90 0.93 0.86 1.00 -0.42 -0.31 -0.59
## accel 0.42 -0.50 -0.54 -0.69 -0.42 1.00 0.29 0.21
## year 0.58 -0.35 -0.37 -0.42 -0.31 0.29 1.00 0.18
## origin 0.57 -0.57 -0.61 -0.46 -0.59 0.21 0.18 1.00
ggcorr(auto)
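The correlation matrix above is easier to scan if the strongly correlated pairs are pulled out into a short table. The following is a minimal sketch in base R, not part of the original solution; the 0.8 cutoff and the names num_auto, cors, threshold, and strong are assumptions chosen only for illustration.

```r
library(ISLR)  # provides the Auto data frame

# Correlations among the numeric columns (name is qualitative, so drop it).
num_auto <- Auto[, names(Auto) != "name"]
cors <- cor(num_auto)

# Keep each pair once (upper triangle) and flag |r| above a cutoff.
threshold <- 0.8  # assumed cutoff, for illustration only
idx <- which(upper.tri(cors) & abs(cors) > threshold, arr.ind = TRUE)

strong <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)

# Strongest pairs first.
strong <- strong[order(-abs(strong$r)), ]
print(strong, row.names = FALSE)
```

Pairs of predictors that show up here (for example cylinders and displacement) are candidates for collinearity, which is worth keeping in mind when reading the individual p-values in the regression output.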
Fit a multiple linear regression (MLR) of mpg on all predictors except name
auto.mlr = lm(mpg ~ . -name, data=Auto)
summary(auto.mlr)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.590 -2.157 -0.117 1.869 13.060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.72e+01 4.64e+00 -3.71 0.00024 ***
## cylinders -4.93e-01 3.23e-01 -1.53 0.12780
## displacement 1.99e-02 7.51e-03 2.65 0.00844 **
## horsepower -1.70e-02 1.38e-02 -1.23 0.21963
## weight -6.47e-03 6.52e-04 -9.93 < 2e-16 ***
## acceleration 8.06e-02 9.88e-02 0.82 0.41548
## year 7.51e-01 5.10e-02 14.73 < 2e-16 ***
## origin 1.43e+00 2.78e-01 5.13 4.7e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.3 on 384 degrees of freedom
## Multiple R-squared: 0.821, Adjusted R-squared: 0.818
## F-statistic: 252 on 7 and 384 DF, p-value: <2e-16
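Yes, there is a relationship between the predictors and the response: the F-statistic is 252 on 7 and 384 degrees of freedom with a p-value below 2e-16, so we reject the hypothesis that all coefficients are zero.
The predictors displacement, weight, year, and origin have a statistically significant relationship with mpg; cylinders, horsepower, and acceleration are not significant at the 5% level.
The coefficient of year (0.751) suggests that, holding the other predictors fixed, mpg increases by about 0.75 per model year, or roughly 3 mpg every 4 years.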
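As a quick check on the reading of the year coefficient, the four-year effect and a 95% confidence interval can be computed directly from the fitted model. This is a sketch that refits the same model under the assumed name fit; year_coef, four_year, and year_ci are likewise illustrative names.

```r
library(ISLR)  # provides the Auto data frame

# Same model as above: mpg on every predictor except name.
fit <- lm(mpg ~ . - name, data = Auto)

# Estimated change in mpg per model year, other predictors held fixed.
year_coef <- unname(coef(fit)["year"])

# Implied change over four model years (about 3 mpg, per the summary above).
four_year <- 4 * year_coef
print(four_year)

# 95% confidence interval for the per-year change.
year_ci <- confint(fit, "year", level = 0.95)
print(year_ci)
```

The interval gives a sense of how precisely the per-year improvement is estimated, which the point estimate alone does not show.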
(9d) Use the plot() function to produce diagnostic plots of the linear regression fit.
par(mfrow=c(2,2))
plot(auto.mlr)
#qplot(auto.mlr)
Non-Linearity: The Residuals vs. Fitted plot shows a U-shaped pattern in the residuals, which suggests that the relationship between the predictors and mpg is not fully linear.
Non-constant Variance: The same plot shows that the variance is not constant. The residuals fan out into a funnel shape at the larger fitted values, which indicates heteroscedasticity (non-constant variance).
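Outliers: An observation is usually flagged as an outlier when its studentized residual falls outside the range [-3, 3]. The residual summary above shows a maximum residual of 13.06 against a residual standard error of 3.3, which is close to 4 standard errors, so at least one observation is a potential outlier. Note that the Scale-Location plot shows the square root of the absolute standardized residuals, so it cannot be read on a [-2, 2] scale.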
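High Leverage Points: Based on the Residuals vs. Leverage plot, no observation appears to have unusually high leverage. A useful benchmark is the average leverage, (p + 1)/n = 8/392 ≈ 0.02; points far above it deserve a closer look.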
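A numeric check can complement the visual reading of the diagnostic plots. The sketch below uses the base R functions rstudent() and hatvalues() on a refit of the same model; the name fit, the [-3, 3] band, and the "three times the average leverage" rule are assumptions for illustration, not thresholds taken from the original solution.

```r
library(ISLR)  # provides the Auto data frame

fit <- lm(mpg ~ . - name, data = Auto)

# Studentized residuals: values outside [-3, 3] are candidate outliers.
stud <- rstudent(fit)
outliers <- which(abs(stud) > 3)
print(stud[outliers])

# Leverage (hat values). Their sum equals the number of coefficients,
# so the average leverage is (p + 1) / n.
lev <- hatvalues(fit)
avg_lev <- length(coef(fit)) / nobs(fit)

# Assumed rule of thumb: flag points above three times the average.
high_lev <- which(lev > 3 * avg_lev)
print(lev[high_lev])
```

Any row that appears in both outliers and high_lev would be the most influential on the fit and is worth inspecting individually.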
Use the * and : symbols to fit linear regression models with interaction effects.
Do any interactions appear to be statistically significant?
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
interact.fit = lm(mpg ~ . -name + horsepower*displacement, data=Auto)
origin.hp = lm(mpg ~ . -name + horsepower*origin, data=Auto)
summary(origin.hp)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower * origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.277 -1.875 -0.225 1.570 12.080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.20e+01 4.40e+00 -5.00 8.9e-07 ***
## cylinders -5.28e-01 3.03e-01 -1.74 0.082 .
## displacement -1.49e-03 7.61e-03 -0.20 0.845
## horsepower 8.17e-02 1.86e-02 4.40 1.4e-05 ***
## weight -4.71e-03 6.55e-04 -7.19 3.5e-12 ***
## acceleration -1.12e-01 9.62e-02 -1.17 0.243
## year 7.33e-01 4.78e-02 15.33 < 2e-16 ***
## origin 7.70e+00 8.86e-01 8.69 < 2e-16 ***
## horsepower:origin -7.95e-02 1.07e-02 -7.40 8.4e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.1 on 383 degrees of freedom
## Multiple R-squared: 0.844, Adjusted R-squared: 0.841
## F-statistic: 259 on 8 and 383 DF, p-value: <2e-16
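To read the horsepower:origin coefficient in the output above, it helps to compute the implied slope of mpg on horsepower at each value of origin. The sketch below assumes, as the fitted model does, that origin is treated as a numeric code (1, 2, 3) rather than as a factor; the names fit, b, origins, and slopes are illustrative.

```r
library(ISLR)  # provides the Auto data frame

fit <- lm(mpg ~ . - name + horsepower * origin, data = Auto)
b <- coef(fit)

# With an interaction, the horsepower slope depends on origin:
#   slope(origin) = beta_horsepower + beta_horsepower:origin * origin
origins <- 1:3
slopes <- unname(b["horsepower"] + b["horsepower:origin"] * origins)

print(setNames(slopes, paste0("origin=", origins)))
```

Because the interaction coefficient is negative, the estimated horsepower slope becomes more negative as the origin code increases. Recoding origin with factor() would be a natural alternative, since the codes label regions rather than measure a quantity.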
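Statistically significant interaction terms: horsepower:origin is highly significant in the fit above (p = 8.4e-13). The next model adds horsepower:displacement alongside it.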
inter.fit = lm(mpg ~ . -name + horsepower:origin + horsepower:displacement,
data=Auto)
summary(inter.fit)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower:origin + horsepower:displacement,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.722 -1.525 -0.097 1.355 12.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.71e+00 4.69e+00 -1.00 0.316
## cylinders 5.14e-01 3.14e-01 1.64 0.102
## displacement -6.97e-02 1.14e-02 -6.10 2.6e-09 ***
## horsepower -1.54e-01 3.55e-02 -4.34 1.8e-05 ***
## weight -3.08e-03 6.48e-04 -4.76 2.7e-06 ***
## acceleration -2.28e-01 9.10e-02 -2.50 0.013 *
## year 7.35e-01 4.46e-02 16.48 < 2e-16 ***
## origin 2.28e+00 1.09e+00 2.09 0.037 *
## horsepower:origin -1.92e-02 1.28e-02 -1.50 0.134
## displacement:horsepower 4.67e-04 6.13e-05 7.61 2.1e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.9 on 382 degrees of freedom
## Multiple R-squared: 0.864, Adjusted R-squared: 0.861
## F-statistic: 271 on 9 and 382 DF, p-value: <2e-16
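Adding more interaction terms can reduce the significance of terms that were significant before: with displacement:horsepower in the model (p = 2.1e-13), horsepower:origin is no longer significant (p = 0.134).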
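Since individual p-values shift as terms are added, a nested-model F-test is one way to ask whether each interaction improves the fit as a whole. The following sketch uses anova() on three nested fits; the object names base, one, two, and tab are assumptions introduced for illustration.

```r
library(ISLR)  # provides the Auto data frame

# Three nested models: no interaction, one interaction, two interactions.
base <- lm(mpg ~ . - name, data = Auto)
one  <- lm(mpg ~ . - name + horsepower:origin, data = Auto)
two  <- lm(mpg ~ . - name + horsepower:origin + horsepower:displacement,
           data = Auto)

# Each row tests whether the added term reduces the residual sum of squares
# by more than chance would explain.
tab <- anova(base, one, two)
print(tab)
```

A small p-value in the second and third rows indicates that each added interaction improves on the previous model, even though horsepower:origin loses individual significance once horsepower:displacement is present.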
Try a few different transformations of the variables, such as log(X), \(\sqrt{X}, X^2\)
summary(lm(mpg ~ . -name + log(acceleration), data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + log(acceleration), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.793 -2.005 -0.128 1.930 13.108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.55e+01 1.48e+01 3.08 0.0022 **
## cylinders -2.80e-01 3.19e-01 -0.88 0.3817
## displacement 8.04e-03 7.81e-03 1.03 0.3034
## horsepower -3.43e-02 1.40e-02 -2.45 0.0147 *
## weight -5.34e-03 6.85e-04 -7.79 6.1e-14 ***
## acceleration 2.17e+00 4.78e-01 4.53 7.8e-06 ***
## year 7.56e-01 4.98e-02 15.19 < 2e-16 ***
## origin 1.33e+00 2.72e-01 4.88 1.6e-06 ***
## log(acceleration) -3.51e+01 7.89e+00 -4.46 1.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.2 on 383 degrees of freedom
## Multiple R-squared: 0.83, Adjusted R-squared: 0.827
## F-statistic: 234 on 8 and 383 DF, p-value: <2e-16
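With log(acceleration) added, both acceleration (p = 7.8e-06) and log(acceleration) (p = 1.1e-05) are highly significant, with the log term marginally less so. In the original fit, acceleration alone was not significant (p = 0.415).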
summary(lm(mpg ~ . -name + log(horsepower), data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + log(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.578 -1.662 -0.121 1.491 12.023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.67e+01 1.11e+01 7.84 4.5e-14 ***
## cylinders -5.53e-02 2.91e-01 -0.19 0.84923
## displacement -4.61e-03 7.11e-03 -0.65 0.51729
## horsepower 1.76e-01 2.27e-02 7.77 7.0e-14 ***
## weight -3.37e-03 6.56e-04 -5.13 4.6e-07 ***
## acceleration -3.28e-01 9.67e-02 -3.39 0.00078 ***
## year 7.42e-01 4.53e-02 16.37 < 2e-16 ***
## origin 8.98e-01 2.53e-01 3.55 0.00043 ***
## log(horsepower) -2.69e+01 2.65e+00 -10.13 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3 on 383 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.856
## F-statistic: 292 on 8 and 383 DF, p-value: <2e-16
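log(horsepower) (t = -10.13) is more significant than horsepower (t = 7.77), and both are highly significant, although horsepower was not significant in the original fit (p = 0.22). R-squared rises from 0.821 to 0.859.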
summary(lm(mpg ~ . -name + I(horsepower^2), data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.550 -1.731 -0.224 1.588 11.995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.323656 4.624770 0.29 0.77487
## cylinders 0.348906 0.304831 1.14 0.25309
## displacement -0.007565 0.007373 -1.03 0.30555
## horsepower -0.319463 0.034345 -9.30 < 2e-16 ***
## weight -0.003271 0.000679 -4.82 2.1e-06 ***
## acceleration -0.330598 0.099185 -3.33 0.00094 ***
## year 0.735341 0.045992 15.99 < 2e-16 ***
## origin 1.014413 0.254555 3.99 8.1e-05 ***
## I(horsepower^2) 0.001006 0.000106 9.45 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3 on 383 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.852
## F-statistic: 283 on 8 and 383 DF, p-value: <2e-16
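Adding the squared term changes the picture for horsepower: it was not significant in the original fit (p = 0.22), but here both horsepower and I(horsepower^2) are highly significant (p < 2e-16), and R-squared rises from 0.821 to 0.855.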
summary(lm(mpg~ . -name + I(weight^2), data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.471 -1.670 -0.149 1.638 12.543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.48e+00 4.61e+00 0.32 0.7487
## cylinders -2.84e-01 2.92e-01 -0.97 0.3308
## displacement 1.37e-02 6.79e-03 2.02 0.0442 *
## horsepower -2.43e-02 1.24e-02 -1.96 0.0508 .
## weight -2.05e-02 1.58e-03 -12.97 <2e-16 ***
## acceleration 6.57e-02 8.90e-02 0.74 0.4606
## year 8.00e-01 4.62e-02 17.33 <2e-16 ***
## origin 7.42e-01 2.60e-01 2.85 0.0046 **
## I(weight^2) 2.24e-06 2.34e-07 9.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3 on 383 degrees of freedom
## Multiple R-squared: 0.856, Adjusted R-squared: 0.853
## F-statistic: 284 on 8 and 383 DF, p-value: <2e-16
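Adding I(weight^2) leaves weight highly significant, and the squared term is itself highly significant (p < 2e-16). R-squared rises from 0.821 to 0.856, while displacement becomes only marginally significant (p = 0.044).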
lm.fit = lm(mpg ~ . -name + I(cylinders^2), data=Auto)
par(mfrow = c(2,2))
plot(lm.fit)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name + I(cylinders^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.426 -2.028 -0.161 1.717 12.876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.017889 5.729087 -0.35 0.7249
## cylinders -5.817956 1.264357 -4.60 5.7e-06 ***
## displacement 0.019789 0.007346 2.69 0.0074 **
## horsepower -0.031265 0.013872 -2.25 0.0248 *
## weight -0.006291 0.000639 -9.85 < 2e-16 ***
## acceleration 0.104852 0.096778 1.08 0.2793
## year 0.745314 0.049840 14.95 < 2e-16 ***
## origin 1.227920 0.275660 4.45 1.1e-05 ***
## I(cylinders^2) 0.464469 0.106791 4.35 1.8e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.2 on 383 degrees of freedom
## Multiple R-squared: 0.83, Adjusted R-squared: 0.826
## F-statistic: 234 on 8 and 383 DF, p-value: <2e-16
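Adding I(cylinders^2) makes both cylinders (p = 5.7e-06) and horsepower (p = 0.025) significant, neither of which was significant in the original fit. The squared term is also significant (p = 1.8e-05), though R-squared rises only from 0.821 to 0.830.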
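The transformations tried above can be compared side by side on adjusted R-squared rather than by reading each summary in turn. This is a minimal sketch, assuming the same five added terms as in the fits above; extra_terms, adj_r2, and baseline are illustrative names, and adjusted R-squared is only one of several reasonable comparison criteria.

```r
library(ISLR)  # provides the Auto data frame

# The transformed terms added, one at a time, to the full linear model.
extra_terms <- c("log(acceleration)", "log(horsepower)",
                 "I(horsepower^2)", "I(weight^2)", "I(cylinders^2)")

# Fit one model per added term and collect its adjusted R-squared.
adj_r2 <- sapply(extra_terms, function(term) {
  f <- as.formula(paste("mpg ~ . - name +", term))
  summary(lm(f, data = Auto))$adj.r.squared
})

# Adjusted R-squared of the model without any transformed term.
baseline <- summary(lm(mpg ~ . - name, data = Auto))$adj.r.squared

# Best-fitting variants first.
print(sort(c(baseline = baseline, adj_r2), decreasing = TRUE))
```

In line with the summaries above, the horsepower and weight transformations give the largest gains over the baseline, while the acceleration and cylinders terms help only slightly.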