The goal is to introduce logistic regression in R by solving the Kaggle Titanic Survivor competition.
Don’t expect a great score; there’s a lot more to learn, but this blog will take you from zero to a first submission.
library(tidyverse) # A lot of magic in here
library(GGally) # for the ggpairs() correlation pairs plot
train <- read_csv("../input/train.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PassengerId = col_double(),
## Survived = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
test <- read_csv("../input/test.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PassengerId = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
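A side note: the column-specification printout is just readr reporting the types it guessed for each column. If you'd rather not see it, readr lets you pass an explicit (or empty) column spec; this is standard readr usage, not something the original code does:
test <- read_csv("../input/test.csv", col_types = cols()) # empty cols() = guess silently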
train <- select(train, -c("Name", "Ticket", "Cabin", "Embarked", "Fare")) # drop columns we won't use in this first model
test <- select(test, -c("Name", "Ticket", "Cabin", "Embarked", "Fare")) # keep test consistent with train
Categorical variables are represented as factors in R. In our data, Sex takes only one of two values (‘male’, ‘female’), so we treat it differently from a numeric column: it is a category. Survived is only 1 or 0, so we treat it the same way; it is also categorical. Behind the scenes, R encodes a factor as dummy (0/1) variables, which is what contrasts() shows.
Ignore the details for now. We’re only learning enough to get us to the first submission.
train$Survived <- as_factor(train$Survived) # Only provided in training data
table(train$Survived)
##
## 0 1
## 549 342
table(train$Sex)
##
## female male
## 314 577
train$Sex <- as_factor(train$Sex)
test$Sex <- as_factor(test$Sex)
contrasts(train$Sex)
## female
## male 0
## female 1
Note that as_factor() orders levels by first appearance rather than alphabetically, so male now comes first in the Sex table:
table(train$Sex)
##
## male female
## 577 314
Using the skim() function from the skimr package, we can see that the only column with missing data is Age: 177 ages are missing in the training data.
library(skimr)
skim(train)
| Name | train |
|---|---|
| Number of rows | 891 |
| Number of columns | 7 |
| Column type frequency: factor | 2 |
| Column type frequency: numeric | 5 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Survived | 0 | 1 | FALSE | 2 | 0: 549, 1: 342 |
| Sex | 0 | 1 | FALSE | 2 | mal: 577, fem: 314 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.0 | 446.00 | 257.35 | 1.00 | 223.50 | 446 | 668.5 | 891 | ▇▇▇▇▇ |
| Pclass | 0 | 1.0 | 2.31 | 0.84 | 1.00 | 2.00 | 3 | 3.0 | 3 | ▃▁▃▁▇ |
| Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28 | 38.0 | 80 | ▂▇▅▂▁ |
| SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0 | 1.0 | 8 | ▇▁▁▁▁ |
| Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0 | 0.0 | 6 | ▇▁▁▁▁ |
In the test data, we are missing 86 ages. We will also replace those with the average.
skim(test)
| Name | test |
|---|---|
| Number of rows | 418 |
| Number of columns | 6 |
| Column type frequency: factor | 1 |
| Column type frequency: numeric | 5 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Sex | 0 | 1 | FALSE | 2 | mal: 266, fem: 152 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.00 | 1100.50 | 120.81 | 892.00 | 996.25 | 1100.5 | 1204.75 | 1309 | ▇▇▇▇▇ |
| Pclass | 0 | 1.00 | 2.27 | 0.84 | 1.00 | 1.00 | 3.0 | 3.00 | 3 | ▃▁▃▁▇ |
| Age | 86 | 0.79 | 30.27 | 14.18 | 0.17 | 21.00 | 27.0 | 39.00 | 76 | ▂▇▃▂▁ |
| SibSp | 0 | 1.00 | 0.45 | 0.90 | 0.00 | 0.00 | 0.0 | 1.00 | 8 | ▇▁▁▁▁ |
| Parch | 0 | 1.00 | 0.39 | 0.98 | 0.00 | 0.00 | 0.0 | 0.00 | 9 | ▇▁▁▁▁ |
Combine the train and test data so the average age is computed over all 1,309 passengers:
df = bind_rows(train, test)
avg_age <- mean(df$Age, na.rm=TRUE)
avg_age
## [1] 29.88114
sum(is.na(train$Age))
## [1] 177
# A tidyverse alternative: train <- replace_na(train, list(Age = avg_age))
train[is.na(train$Age), "Age"] <- avg_age
test[is.na(test$Age), "Age"] <- avg_age
We should now have zero missing ages in the training data.
sum(is.na(train$Age))
## [1] 0
We won’t look at this closely now, but a correlation pairs plot can be useful for investigating how the variables relate to each other.
ggpairs(train)
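If the full pairs plot is more than you need yet, a quick numeric look at pairwise correlations is an alternative. A minimal sketch (my addition, not in the original post; assumes dplyr 1.0+ for where()), using cor() on just the numeric columns, since Survived and Sex are factors:
train %>%
  select(where(is.numeric)) %>% # keep PassengerId, Pclass, Age, SibSp, Parch
  cor() %>%
  round(2)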
In our training data, only 342 of the 891 passengers survived.
table(train$Survived)
##
## 0 1
## 549 342
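Keep that class balance in mind as a baseline: a model that predicts no one survives is already right 549/891 ≈ 62% of the time, so the logistic regression needs to beat that. A quick check, my aside rather than part of the original flow:
max(table(train$Survived)) / nrow(train) # accuracy of always predicting "died"
## [1] 0.6161616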
lr.fit = glm(Survived ~ ., data = train, family = binomial) # logistic regression on all remaining columns
summary(lr.fit)
##
## Call:
## glm(formula = Survived ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6487 -0.6091 -0.4194 0.6109 2.4383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.433e+00 4.557e-01 5.340 9.30e-08 ***
## PassengerId 8.673e-05 3.472e-04 0.250 0.8028
## Pclass -1.172e+00 1.197e-01 -9.789 < 2e-16 ***
## Sexfemale 2.772e+00 1.993e-01 13.908 < 2e-16 ***
## Age -4.013e-02 7.776e-03 -5.161 2.46e-07 ***
## SibSp -3.328e-01 1.087e-01 -3.061 0.0022 **
## Parch -8.437e-02 1.151e-01 -0.733 0.4636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 790.23 on 884 degrees of freedom
## AIC: 804.23
##
## Number of Fisher Scoring iterations: 5
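An aside on reading the coefficients (not needed for the submission): exponentiating a logistic regression coefficient turns log-odds into an odds ratio. For example, exp(2.772) ≈ 16, so female passengers had roughly 16 times the odds of surviving, holding the other predictors fixed:
exp(coef(lr.fit)) # odds ratios; values above 1 raise the odds of survival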
p = predict(lr.fit, type = "response") # predicted survival probabilities on the training data
table(train$Survived, p > 0.5) # rows: actual, columns: predicted survivor (TRUE/FALSE)
##
## FALSE TRUE
## 0 460 89
## 1 100 242
The rows are the actual outcomes (0, 1) and the columns are the predictions, so the cells are:
True Negative, False Positive
False Negative, True Positive
Abbreviated like this in matrix form:
TN FP
FN TP
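Rather than retyping the four counts in the calculations below, you can pull them out of the table by name and compute all three metrics at once. A small sketch assuming the confusion matrix built above (cm, TN, FP, FN, and TP are my own names, not from the original post):
cm <- table(train$Survived, p > 0.5)
TN <- cm["0", "FALSE"]; FP <- cm["0", "TRUE"] # actual died
FN <- cm["1", "FALSE"]; TP <- cm["1", "TRUE"] # actual survived
c(accuracy = (TN + TP) / sum(cm),
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP))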
Accuracy is the percentage of predictions the model got right: TN + TP divided by the total number of observations.
The formula: \(Accuracy = \frac{TN + TP}{n}\)
accuracy = (242 + 460) / nrow(train)
accuracy
## [1] 0.7878788
Sensitivity is the true positives divided by all actual positives.
The formula: \(Sensitivity = \frac{TP}{TP + FN}\)
sensitivity = 242 / (242 + 100)
sensitivity
## [1] 0.7076023
Specificity is the true negatives divided by all actual negatives.
The formula: \(Specificity = \frac{TN}{TN + FP}\)
specificity = 460 / (460 + 89)
specificity
## [1] 0.8378871
Now we will take the model and use it to predict who survives in the test data.
lr.fit.probs = predict(lr.fit, test, type = "response")
Add the column to the test data.
test$Survived = ifelse(lr.fit.probs > 0.5, 1, 0)
table(test$Survived)
##
## 0 1
## 255 163
We predict that only 163 of the 418 test passengers will survive.
test %>%
select(PassengerId, Survived) %>%
write.csv("intro_lr_submission.csv", quote = FALSE, row.names = FALSE)
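Before uploading, a quick sanity check that the file has the two columns Kaggle expects (PassengerId and Survived) and one row per test passenger can save a failed submission. A small sketch, assuming the CSV landed in the working directory:
submission <- read_csv("intro_lr_submission.csv", col_types = cols())
nrow(submission) # should be 418
head(submission)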
The intro submission gives a Kaggle score of 0.74880.