Logistic regression relies on a straight-line equation (p. 88)
Logistic Regression can handle continuous and categorical predictor variables
It builds on linear regression: the log odds of class membership are modeled as a linear function of the predictors
The dotted line at y = 0.5 is the classification threshold
The s-shaped (logistic) curve produces fewer misclassifications than a straight line
From p. 91:
“Because probabilities are bounded between 0 and 1, it’s difficult to combine the information from two predictors”
“Odds are a convenient way of representing the likelihood of something occurring. They tell us how much more likely an event is to occur, rather than how likely it is not to occur.”
“whereas probability is bounded between 0 and 1, odds can take any positive value.”
“C3PO got his odds the wrong way around (as many people do). He should have said the odds of successfully navigating an asteroid field are approximately 1 to 3,720!”
His original line, “the odds of successfully navigating an asteroid field are approximately 3,720 to 1!”, states it the wrong way around.
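As a quick check on that correction (my own arithmetic, not from the book), odds of 1 to 3,720 convert back to a probability like this:
oddsOfSuccess <- 1 / 3720
oddsOfSuccess / (1 + oddsOfSuccess)   # p = odds / (1 + odds), roughly 0.00027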
\[odds = \frac{p}{1-p}\]
\[log\;odds = ln \left(\frac{p}{1-p} \right)\;\;(4.3)\]
Equation 4.3 converts probabilities into log odds; it is also known as the logit function.
When interpreting log odds:
“The more influence a predictor variable has on the log odds, the steeper the slope will be, while variables that have no predictive value will have a slope that is nearly horizontal.”
\[p = \frac{1}{1+e^{-z} }\;\;(4.4)\]
z is the log odds
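A minimal R sketch of equations 4.3 and 4.4 (my own illustration):
logit <- function(p) log(p / (1 - p))       # equation 4.3: probability -> log odds
logistic <- function(z) 1 / (1 + exp(-z))   # equation 4.4: log odds -> probability
logistic(logit(0.8))                        # returns 0.8; the two functions are inverses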
\[log\;odds = ln \left(\frac{p}{1-p} \right) = \beta_0 + \beta_1 X\; (4.5)\]
\[log\;odds = ln \left(\frac{p}{1-p} \right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \; (4.7)\]
Here p (in the subscripts) is the number of predictor variables, not the probability on the left-hand side.
Softmax regression (multinomial logistic regression) handles more than two classes.
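A sketch of how that could look in mlr once the package is loaded (this assumes the classif.multinom learner, which wraps nnet::multinom, is available in your installation):
multiLearner <- makeLearner("classif.multinom", predict.type = "prob")   # multinomial logistic regression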
Logistic Regression Example: Titanic
library(mlr)
library(tidyverse)
# install.packages("titanic")
data(titanic_train, package = "titanic")   # load the Titanic training set from the titanic package
titanicTib <- as_tibble(titanic_train)     # convert the data.frame into a tibble
titanicTib
## # A tibble: 891 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <int> <int> <int> <chr> <chr> <dbl> <int> <int> <chr> <dbl> <chr>
## 1 1 0 3 Brau… male 22 1 0 A/5 2… 7.25 ""
## 2 2 1 1 Cumi… fema… 38 1 0 PC 17… 71.3 "C85"
## 3 3 1 3 Heik… fema… 26 0 0 STON/… 7.92 ""
## 4 4 1 1 Futr… fema… 35 1 0 113803 53.1 "C12…
## 5 5 0 3 Alle… male 35 0 0 373450 8.05 ""
## 6 6 0 3 Mora… male NA 0 0 330877 8.46 ""
## 7 7 0 1 McCa… male 54 0 0 17463 51.9 "E46"
## 8 8 0 3 Pals… male 2 3 1 349909 21.1 ""
## 9 9 1 3 John… fema… 27 0 2 347742 11.1 ""
## 10 10 1 2 Nass… fema… 14 1 0 237736 30.1 ""
## # … with 881 more rows, and 1 more variable: Embarked <chr>
We’ll perform three data-cleaning tasks: convert the Survived, Sex, and Pclass variables into factors; create a FamSize variable by adding SibSp and Parch; and select only the variables we’ll use as predictors.
Feature engineering comes in two flavors:
Feature extraction - Predictive information is held in a variable, but in a format that is not useful. For example, a variable might contain the year, month, day, and time of day of certain events. If only the time of day has predictive value, you would extract just the time-of-day information as a new variable (see the short sketch after this list).
Feature creation - Existing variables are combined to create new ones. Merging the SibSp and Parch variables to create FamSize is an example.
Feature selection - keeping variables that add predictive value, and removing those that don’t.
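As an illustration of feature extraction (my own example, not from the book), here is how you might pull just the time-of-day information out of a datetime variable using lubridate:
library(lubridate)
eventTib <- tibble(EventTime = ymd_hms(c("2020-01-15 08:30:00",
                                         "2020-06-02 22:10:00")))
mutate(eventTib, HourOfDay = hour(EventTime))   # keep only the predictive part of the datetime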
fctrs <- c("Survived", "Sex", "Pclass")
titanicClean <- titanicTib %>%
mutate_at(.vars = fctrs, .funs = factor) %>%
mutate(FamSize = SibSp + Parch) %>%
select(Survived, Pclass, Sex, Age, Fare, FamSize)
titanicUntidy <- gather(titanicClean, key = "Variable", value = "Value",
-Survived)
## Warning: attributes are not identical across measure variables;
## they will be dropped
titanicUntidy
## # A tibble: 4,455 x 3
## Survived Variable Value
## <fct> <chr> <chr>
## 1 0 Pclass 3
## 2 1 Pclass 1
## 3 1 Pclass 3
## 4 1 Pclass 1
## 5 0 Pclass 3
## 6 0 Pclass 3
## 7 0 Pclass 1
## 8 0 Pclass 3
## 9 1 Pclass 3
## 10 1 Pclass 2
## # … with 4,445 more rows
titanicUntidy %>%
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  theme_bw()
## Warning: Removed 177 rows containing non-finite values (stat_ydensity).
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
The violin geometric object (geom_violin()) is similar to a box plot, but it also shows the density of the data along the y-axis.
Redraw the plot in figure 4.10, but add a geom_point() layer, setting the alpha argument to 0.05 and the size argument to 3. Does this make the violin plot make more sense?
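One way to attempt this (a sketch, not the book's solution; the semi-transparent points show where the data actually sit):
titanicUntidy %>%
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_point(alpha = 0.05, size = 3) +   # overplotted points at low alpha reveal the underlying data
  theme_bw()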
facet_wrap() wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid() because most displays are roughly rectangular.
titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "fill") +
  theme_bw()
titanicTask <- makeClassifTask(data = titanicClean, target = "Survived")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
logReg <- makeLearner("classif.logreg", predict.type = "prob")
Redraw the plot in figure 4.11, but change the geom_bar() argument position equal to “dodge”. Do this again, but make the position argument equal to “stack”. Can you see the difference between the three methods?
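A sketch of one answer (swap "dodge" for "stack" to compare all three position adjustments):
titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "dodge") +   # side-by-side bars; "stack" stacks counts, "fill" shows proportions
  theme_bw()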
The Age variable has missing values, so we impute them first, replacing each missing age with the mean.
imp <- impute(titanicClean, cols = list(Age = imputeMean()))
sum(is.na(titanicClean$Age)) # 177
## [1] 177
sum(is.na(imp$data$Age)) # 0
## [1] 0
titanicTask <- makeClassifTask(data = imp$data, target = "Survived")
logRegModel <- train(logReg, titanicTask)
# Wrap imputation with the learner so it is applied inside each resampling fold
logRegWrapper <- makeImputeWrapper("classif.logreg",
                                   cols = list(Age = imputeMean()))
Now let’s apply stratified, 10-fold cross-validation, repeated 5 times (the book uses 50 repeats; 5 keeps the output shorter), to our wrapped learner.
kFold <- makeResampleDesc(method = "RepCV",   # Repeated cross-validation
                          folds = 10,         # Break data into 10 sections
                          reps = 5,           # Changed from 50 to 5 to create less output
                          stratify = TRUE)
logRegwithImpute <- resample(logRegWrapper, titanicTask,
                             resampling = kFold,
                             measures = list(acc, fpr, fnr))
## Resampling: repeated cross-validation
## Measures: acc fpr fnr
## [Resample] iter 1: 0.7272727 0.4117647 0.1851852
## [Resample] iter 2: 0.9101124 0.1764706 0.0363636
## [Resample] iter 3: 0.8651685 0.1176471 0.1454545
## [Resample] iter 4: 0.7666667 0.4285714 0.1090909
## [Resample] iter 5: 0.8426966 0.1764706 0.1454545
## [Resample] iter 6: 0.7528090 0.2647059 0.2363636
## [Resample] iter 7: 0.8426966 0.2352941 0.1090909
## [Resample] iter 8: 0.7191011 0.4411765 0.1818182
## [Resample] iter 9: 0.7555556 0.3714286 0.1636364
## [Resample] iter 10: 0.7865169 0.3529412 0.1272727
## [Resample] iter 11: 0.7640449 0.2647059 0.2181818
## [Resample] iter 12: 0.8111111 0.2571429 0.1454545
## [Resample] iter 13: 0.7865169 0.3235294 0.1454545
## [Resample] iter 14: 0.8068182 0.2941176 0.1296296
## [Resample] iter 15: 0.8089888 0.2352941 0.1636364
## [Resample] iter 16: 0.7640449 0.3235294 0.1818182
## [Resample] iter 17: 0.7865169 0.2647059 0.1818182
## [Resample] iter 18: 0.8111111 0.3142857 0.1090909
## [Resample] iter 19: 0.7977528 0.2941176 0.1454545
## [Resample] iter 20: 0.7752809 0.4411765 0.0909091
## [Resample] iter 21: 0.7777778 0.3142857 0.1636364
## [Resample] iter 22: 0.8651685 0.3235294 0.0181818
## [Resample] iter 23: 0.8202247 0.2352941 0.1454545
## [Resample] iter 24: 0.7303371 0.3823529 0.2000000
## [Resample] iter 25: 0.8202247 0.2352941 0.1454545
## [Resample] iter 26: 0.8426966 0.3235294 0.0545455
## [Resample] iter 27: 0.8089888 0.2857143 0.1296296
## [Resample] iter 28: 0.7528090 0.2647059 0.2363636
## [Resample] iter 29: 0.7640449 0.2941176 0.2000000
## [Resample] iter 30: 0.7865169 0.2647059 0.1818182
## [Resample] iter 31: 0.7777778 0.4857143 0.0545455
## [Resample] iter 32: 0.7303371 0.3823529 0.2000000
## [Resample] iter 33: 0.7752809 0.2941176 0.1818182
## [Resample] iter 34: 0.8314607 0.2647059 0.1090909
## [Resample] iter 35: 0.7528090 0.2647059 0.2363636
## [Resample] iter 36: 0.8426966 0.2647059 0.0909091
## [Resample] iter 37: 0.7865169 0.3529412 0.1272727
## [Resample] iter 38: 0.8295455 0.2058824 0.1481481
## [Resample] iter 39: 0.7528090 0.4117647 0.1454545
## [Resample] iter 40: 0.8555556 0.1142857 0.1636364
## [Resample] iter 41: 0.8089888 0.2352941 0.1636364
## [Resample] iter 42: 0.7977528 0.2352941 0.1818182
## [Resample] iter 43: 0.7977528 0.3529412 0.1090909
## [Resample] iter 44: 0.7865169 0.3529412 0.1272727
## [Resample] iter 45: 0.7752809 0.2941176 0.1818182
## [Resample] iter 46: 0.8426966 0.2000000 0.1296296
## [Resample] iter 47: 0.7415730 0.4411765 0.1454545
## [Resample] iter 48: 0.8426966 0.2647059 0.0909091
## [Resample] iter 49: 0.7865169 0.2352941 0.2000000
## [Resample] iter 50: 0.7777778 0.3428571 0.1454545
##
## Aggregated Result: acc.test.mean=0.7948383,fpr.test.mean=0.2981681,fnr.test.mean=0.1471717
##
logRegwithImpute
## Resample Result
## Task: imp$data
## Learner: classif.logreg.imputed
## Aggr perf: acc.test.mean=0.7948383,fpr.test.mean=0.2981681,fnr.test.mean=0.1471717
## Runtime: 0.906848
logRegModelData <- getLearnerModel(logRegModel)
coef(logRegModelData)
## (Intercept) Pclass2 Pclass3 Sexmale Age Fare
## 3.809661697 -1.000344806 -2.132428850 -2.775928255 -0.038822458 0.003218432
## FamSize
## -0.243029114
An odds ratio is, well, a ratio of odds. For example, if the odds of surviving the Titanic if you’re female are about 7 to 10, and the odds of surviving if you’re male are 2 to 10, then the odds ratio for surviving if you’re female is 3.5. In other words, if you were female, you would have been 3.5 times more likely to survive than if you were male.
(7/10)/(2/10)
## [1] 3.5
7/2
## [1] 3.5
How do we get from log odds to odds ratios? By taking their exponent (\(e^{log\,odds}\)). We can also calculate 95% confidence intervals using the confint() function, to help us decide how strong the evidence is that each variable has predictive value.
exp(cbind(Odds_Ratio = coef(logRegModelData), confint(logRegModelData)))
## Waiting for profiling to be done...
## Odds_Ratio 2.5 % 97.5 %
## (Intercept) 45.13516691 19.14718874 109.72483921
## Pclass2 0.36775262 0.20650392 0.65220841
## Pclass3 0.11854901 0.06700311 0.20885220
## Sexmale 0.06229163 0.04182164 0.09116657
## Age 0.96192148 0.94700049 0.97652950
## Fare 1.00322362 0.99872001 1.00863263
## FamSize 0.78424868 0.68315465 0.89110044
Most of these odds ratios are less than 1. An odds ratio less than 1 means an event is less likely to occur. It’s usually easier to interpret these if you divide 1 by them. For example, the odds ratio for surviving if you were male is 0.06, and 1 divided by 0.06 = 16.7. This means that, holding all other variables constant, men were 16.7 times less likely to survive than women.
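A quick way to compute these reciprocals from the fitted model (a small sketch using the coefficients shown above):
oddsRatios <- exp(coef(logRegModelData))
1 / oddsRatios["Sexmale"]   # about 16.05; the 16.7 above comes from rounding the odds ratio to 0.06 first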
data(titanic_test, package = "titanic")
titanicNew <- as_tibble(titanic_test)
titanicNewClean <- titanicNew %>%
  mutate_at(.vars = c("Sex", "Pclass"), .funs = factor) %>%
  mutate(FamSize = SibSp + Parch) %>%
  select(Pclass, Sex, Age, Fare, FamSize)
predict(logRegModel, newdata = titanicNewClean)
## Warning in predict.WrappedModel(logRegModel, newdata = titanicNewClean):
## Provided data for prediction is not a pure data.frame but from class tbl_df,
## hence it will be converted.
## Prediction: 418 observations
## predict.type: prob
## threshold: 0=0.50,1=0.50
## time: 0.00
## prob.0 prob.1 response
## 1 0.9178036 0.08219636 0
## 2 0.5909570 0.40904305 0
## 3 0.9123303 0.08766974 0
## 4 0.8927383 0.10726167 0
## 5 0.4069407 0.59305933 1
## 6 0.8337609 0.16623907 0
## ... (#rows: 418, #cols: 3)
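To work with these predictions downstream, store the Prediction object and extract the predicted classes; a minimal sketch using mlr's getPredictionResponse():
pred <- predict(logRegModel, newdata = titanicNewClean)
predictedClasses <- getPredictionResponse(pred)   # factor of predicted classes (0 = did not survive, 1 = survived)
head(predictedClasses)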