The goal of this post is to introduce linear regression in R by working through the Kaggle Ames Housing competition. Don't expect a great score; there is a lot more to learn, but this post will take you from zero to a first submission.
Simple linear regression uses a single variable (also called a predictor or feature) to make a prediction. In our case, the feature is the above-grade (ground) living area, GrLivArea, which we picked after reading the competition's data description document.
The R code can be found on GitHub.
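Before any code, it helps to write the model down. With a single feature, simple linear regression fits

SalePrice = β0 + β1 · GrLivArea + ε,

where lm() estimates the intercept β0 and the slope β1 by least squares, and ε is the error term. The rest of the post is just estimating those two numbers and using them to predict prices for the test set.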
library(tidyverse) # A lot of magic in here
library(GGally) # For ggpairs()
train <- read_csv("../input/train.csv") # Kaggle training data
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   Id = col_double(),
##   MSSubClass = col_double(),
##   LotFrontage = col_double(),
##   LotArea = col_double(),
##   OverallQual = col_double(),
##   OverallCond = col_double(),
##   YearBuilt = col_double(),
##   YearRemodAdd = col_double(),
##   MasVnrArea = col_double(),
##   BsmtFinSF1 = col_double(),
##   BsmtFinSF2 = col_double(),
##   BsmtUnfSF = col_double(),
##   TotalBsmtSF = col_double(),
##   `1stFlrSF` = col_double(),
##   `2ndFlrSF` = col_double(),
##   LowQualFinSF = col_double(),
##   GrLivArea = col_double(),
##   BsmtFullBath = col_double(),
##   BsmtHalfBath = col_double(),
##   FullBath = col_double()
##   # ... with 18 more columns
## )
## ℹ Use `spec()` for the full column specifications.
test <- read_csv("../input/test.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   Id = col_double(),
##   MSSubClass = col_double(),
##   LotFrontage = col_double(),
##   LotArea = col_double(),
##   OverallQual = col_double(),
##   OverallCond = col_double(),
##   YearBuilt = col_double(),
##   YearRemodAdd = col_double(),
##   MasVnrArea = col_double(),
##   BsmtFinSF1 = col_double(),
##   BsmtFinSF2 = col_double(),
##   BsmtUnfSF = col_double(),
##   TotalBsmtSF = col_double(),
##   `1stFlrSF` = col_double(),
##   `2ndFlrSF` = col_double(),
##   LowQualFinSF = col_double(),
##   GrLivArea = col_double(),
##   BsmtFullBath = col_double(),
##   BsmtHalfBath = col_double(),
##   FullBath = col_double()
##   # ... with 17 more columns
## )
## ℹ Use `spec()` for the full column specifications.
train <- select(train, c("Id", "GrLivArea", "LotArea", "TotalBsmtSF", "YearBuilt", "SalePrice")) # Keep the ID, a few candidate features, and the target
test <- select(test, c("Id", "GrLivArea", "LotArea", "TotalBsmtSF", "YearBuilt")) # Same columns, minus SalePrice (which we have to predict)
ggpairs(train) # Pairwise plots and correlations for the selected columns
In this chart, we can verify that SalePrice and GrLivArea are clearly positively correlated, which makes GrLivArea a reasonable single feature to start with.
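If you want the number rather than eyeballing the plot, a one-line check (my addition, not in the original script) computes the Pearson correlation directly; it should come out near 0.71, the square root of the R-squared reported by the model summary below.
cor(train$GrLivArea, train$SalePrice) # Pearson correlation between living area and price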
lm.fit <- lm(SalePrice ~ GrLivArea, data = train) # Fit a simple linear regression of price on living area
summary(lm.fit)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -462999  -29800   -1124   21957  339832
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18569.026   4480.755   4.144 3.61e-05 ***
## GrLivArea     107.130      2.794  38.348  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.5018
## F-statistic: 1471 on 1 and 1458 DF, p-value: < 2.2e-16
The coefficient for GrLivArea is 107.130 and its p-value is less than 2e-16, so the relationship is highly significant: each additional square foot of above-grade living area adds roughly $107 to the predicted sale price.
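As a quick sanity check (this snippet is my addition, not part of the original post), we can plug a hypothetical 1,500 sq ft house into the fitted model; with these coefficients the prediction is about 18,569 + 107.13 × 1500 ≈ $179,264.
predict(lm.fit, newdata = data.frame(GrLivArea = 1500)) # Predicted SalePrice for a 1,500 sq ft house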
{
  plot(train$GrLivArea, train$SalePrice, xlab = "GrLivArea", ylab = "SalePrice") # Plot the points
  abline(lm.fit) # Add the least squares regression line
}
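Since the tidyverse is already loaded, an equivalent ggplot2 version (my addition, shown only as an alternative) draws the same scatter plot with the fitted line overlaid:
ggplot(train, aes(x = GrLivArea, y = SalePrice)) +
  geom_point() +                         # One point per house
  geom_smooth(method = "lm", se = FALSE) # Same least squares fit as lm.fit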
predSalePrice <- predict(lm.fit, newdata = test) # Predict prices for the test houses
test$SalePrice <- predSalePrice # Attach the predictions as a SalePrice column
test %>%
select(Id, SalePrice) %>%
write.csv("intro_lr_submission.csv", quote = FALSE, row.names = FALSE)
The Kaggle score: 0.29117. The competition's metric is the RMSE between the logarithm of the predicted and the logarithm of the actual sale prices, so lower is better.
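For reference (my addition, assuming the log-based metric described above), the in-sample version of that score can be computed directly on the training data; expect it to be somewhat optimistic compared to the leaderboard number.
sqrt(mean((log(predict(lm.fit)) - log(train$SalePrice))^2)) # In-sample RMSE on log prices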