In this exercise you will create some simulated data and fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
# Setting the seed to keep random values consistent
set.seed(1)
# Creating a vector with values drawn from a standard normal distribution
x = rnorm(100)
# Creating a vector of error terms drawn from a normal distribution with mean 0 and standard deviation 0.25
eps = rnorm(100, 0, 0.25)
# Creating Y from the population model: intercept -1, slope 0.5
Y = -1 + 0.5*x + eps
plot(x, Y)
The scatterplot shows a clear linear relationship between x and Y, with the scatter around the line coming from the error terms.
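As a quick numeric check on that visual impression (a small addition beyond the original output), the sample correlation can be computed; a value near 1 supports a linear fit:

# Sample correlation between x and Y
cor(x, Y)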
# Fitting a least squares regression of Y onto x
lm.fit = lm(Y~x)
summary(lm.fit)
##
## Call:
## lm(formula = Y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
The estimated coefficients are very close to the true values: the intercept estimate is -1.00942 versus the true -1, and the slope estimate is 0.49973 versus the true 0.5, so the fit is off only in the second or third decimal place.
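To see this directly, here is a small sketch (not part of the original write-up) that puts the estimates next to the true parameters:

# Estimated coefficients alongside the true values beta0 = -1, beta1 = 0.5
rbind(estimated = coef(lm.fit), true = c(-1, 0.5))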
plot(x, Y)
# Plotting the least squares regression line
abline(lm.fit, col='red')
# Plotting the population regression line
abline(-1, 0.5, col="blue")
# Creating a legend for the plot (colors matched to the lines above)
legend(-1, c("Predicted", "Actual"), col=c("red", "blue"), lwd=2)
# Fitting a model that adds a quadratic term
lm.poly.fit = lm(Y~x+I(x^2))
summary(lm.poly.fit)
##
## Call:
## lm(formula = Y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4913 -0.1563 -0.0322 0.1451 0.5675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.98582 0.02941 -33.516 <2e-16 ***
## x 0.50429 0.02700 18.680 <2e-16 ***
## I(x^2) -0.02973 0.02119 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2395 on 97 degrees of freedom
## Multiple R-squared: 0.7828, Adjusted R-squared: 0.7784
## F-statistic: 174.8 on 2 and 97 DF, p-value: < 2.2e-16
Adding the quadratic term does NOT improve the model much: R-squared rises from 0.7784 to 0.7828 and adjusted R-squared from 0.7762 to 0.7784, a gain of only about 0.002. The quadratic term is also not statistically significant (t = -1.403, p = 0.164).
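The same conclusion follows from a formal F-test between the two nested models (an added check, not in the original output); with a single added term it is equivalent to the t-test on I(x^2):

# F-test comparing the linear fit against the fit with the quadratic term
anova(lm.fit, lm.poly.fit)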
# Generating new data for x from a standard normal distribution
x.less = rnorm(100)
# Generating random error terms from a normal distribution with a small standard deviation (0.05)
eps.less = rnorm(100, 0, 0.05) # epsilon
# Creating the Y value given x and eps
Y.less = -1 + 0.5*x.less + eps.less
# Fitting a least squares regression
lm.less.fit = lm(Y.less ~ x.less)
summary(lm.less.fit)
##
## Call:
## lm(formula = Y.less ~ x.less)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.137090 -0.028070 -0.000874 0.033987 0.092421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.997578 0.004955 -201.3 <2e-16 ***
## x.less 0.505311 0.004813 105.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04953 on 98 degrees of freedom
## Multiple R-squared: 0.9912, Adjusted R-squared: 0.9911
## F-statistic: 1.102e+04 on 1 and 98 DF, p-value: < 2.2e-16
plot(x.less, Y.less)
# Plotting the least squares regression line
abline(lm.less.fit, col='red')
# Plotting the population regression line
abline(-1, 0.5, col="blue")
# Creating a legend for plot
legend(-1, c("Predicted", "Actual"), col=c("red", "blue"), lwd=2)
When the standard deviation of the error terms is small, the observations lie very close to the population regression line, so the least squares fit is highly accurate: R-squared is 0.9912, the residual standard error is only 0.04953, and the fitted line nearly coincides with the true line.
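One way to quantify the improvement (an added check, not part of the original solution) is to compare the residual standard errors of the two fits directly:

# Residual standard error shrinks as the noise in the data shrinks
c(original = sigma(lm.fit), less.noise = sigma(lm.less.fit))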
# Generating new data for x from a standard normal distribution
x.more = rnorm(100)
# Generating random error terms from a normal distribution with a large standard deviation (2)
eps.more = rnorm(100, 0, 2)
# Creating the Y values given x and eps
Y.more = -1 + 0.5*x.more + eps.more
# Fitting a least squares regression
lm.more.fit = lm(Y.more~x.more)
summary(lm.more.fit)
##
## Call:
## lm(formula = Y.more ~ x.more)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0203 -1.2110 0.0413 1.4097 4.1796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0949 0.1935 -5.658 1.52e-07 ***
## x.more 0.3501 0.1662 2.106 0.0377 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.934 on 98 degrees of freedom
## Multiple R-squared: 0.04332, Adjusted R-squared: 0.03355
## F-statistic: 4.437 on 1 and 98 DF, p-value: 0.03772
plot(x.more, Y.more)
# Plotting the least squares regression line
abline(lm.more.fit, col='red')
# Plotting the population regression line
abline(-1, 0.5, col="blue")
# Creating a legend for plot
legend(-1, c("Predicted", "Actual"), col=c("red", "blue"), lwd=2)
When the standard deviation of the error terms is large, the observations scatter widely around the population regression line: the relationship is much harder to detect (R-squared drops to 0.04332), and the slope estimate (0.3501) is noticeably further from the true value of 0.5.
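To make the pattern explicit, here is a short sketch (not part of the original solution; the grid of sd values is illustrative) that refits the same population model at several noise levels and records R-squared. The exact numbers depend on the random draws, but R-squared falls steadily as the error sd grows:

# Refit Y = -1 + 0.5*x + eps for several error sds and record R^2
sds = c(0.05, 0.25, 1, 2)
r2 = sapply(sds, function(s) {
  y.sim = -1 + 0.5*x + rnorm(100, 0, s)
  summary(lm(y.sim ~ x))$r.squared
})
data.frame(error.sd = sds, r.squared = r2)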
# Getting the confidence intervals for each linear model
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -1.0575402 -0.9613061
## x 0.4462897 0.5531801
confint(lm.less.fit)
## 2.5 % 97.5 %
## (Intercept) -1.0074105 -0.9877445
## x.less 0.4957597 0.5148621
confint(lm.more.fit)
## 2.5 % 97.5 %
## (Intercept) -1.47895440 -0.7108553
## x.more 0.02027816 0.6799263
The larger the standard deviation of the error terms, the wider the confidence intervals for the coefficients: the slope interval spans only about 0.019 for the low-noise data, about 0.107 for the original data, and about 0.66 for the noisy data. More noise means less certainty about the true coefficients.
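Computing the interval widths directly makes the comparison easy to read (a small added computation):

# Width of each 95% interval for the slope; wider means less certainty
c(less.noise = unname(diff(confint(lm.less.fit)["x.less", ])),
  original   = unname(diff(confint(lm.fit)["x", ])),
  more.noise = unname(diff(confint(lm.more.fit)["x.more", ])))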