library(mlr)
library(tidyverse)
Dimension reduction is the process of learning how the information in a set of variables can be condensed into a smaller number of variables, with as little information loss as possible.
The phenomenon whereby the predictive power of a model first increases as the number of predictor variables increases, but then decreases again as we continue to add more predictors, is called the Hughes phenomenon, after the statistician G. Hughes.
LDA generalizes to quadratic discriminant analysis (QDA).
LDA aims to learn a new representation of the data that separates the centroids of the classes, while keeping the within-class variance as low as possible. A centroid is simply the point in the feature space that is the mean of all the predictors (a vector of means, one for each dimension). LDA then finds a line through the origin such that, when the data is projected onto it, the separation between the class centroids is maximized and the within-class variance is minimized, simultaneously.
This new axis is called a discriminant function (DF), and it is a linear combination of the original variables. For example, a discriminant function could be described by this equation:
DF = –0.5 × var1 + 1.2 × var2 + 0.85 × var3
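As a tiny worked example, using purely hypothetical predictor values, the score this discriminant function assigns to a case is just the weighted sum of that case's predictor values:
var1 <- 2
var2 <- 1.5
var3 <- 0.7
# The case's score on the hypothetical discriminant function above
DF <- -0.5 * var1 + 1.2 * var2 + 0.85 * var3
DF
## [1] 1.395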
LDA performs well if the data within each class is normally distributed across all the predictor variables, and the classes have similar covariances.
Covariance describes how much one variable increases or decreases when another variable increases or decreases. So LDA assumes that the predictor variables covary with each other in the same way for every class in the dataset.
This often isn’t the case, and classes have different covariances. In this situation, QDA tends to perform better than LDA because it doesn’t make this assumption (though it still assumes the data is normally distributed). Instead of learning straight lines that separate the classes, QDA learns curved lines. It is also well suited, therefore, to situations in which classes are best separated by a nonlinear decision boundary.
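One quick way to eyeball the equal-covariance assumption is to compute the covariance matrix of the predictors separately for each class and compare them. Here is a small sketch using R's built-in iris data (the same check could be run on the wine data once it is loaded below):
# Per-class covariance matrices of the four iris predictors
by(iris[, 1:4], iris$Species, cov)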
# install.packages("HDclassif")
data(wine, package = "HDclassif")
wineTib <- as_tibble(wine)
wineTib
## # A tibble: 178 x 14
## class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 14.2 1.71 2.43 15.6 127 2.8 3.06 0.28 2.29 5.64 1.04 3.92
## 2 1 13.2 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.4
## 3 1 13.2 2.36 2.67 18.6 101 2.8 3.24 0.3 2.81 5.68 1.03 3.17
## 4 1 14.4 1.95 2.5 16.8 113 3.85 3.49 0.24 2.18 7.8 0.86 3.45
## 5 1 13.2 2.59 2.87 21 118 2.8 2.69 0.39 1.82 4.32 1.04 2.93
## 6 1 14.2 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85
## 7 1 14.4 1.87 2.45 14.6 96 2.5 2.52 0.3 1.98 5.25 1.02 3.58
## 8 1 14.1 2.15 2.61 17.6 121 2.6 2.51 0.31 1.25 5.05 1.06 3.58
## 9 1 14.8 1.64 2.17 14 97 2.8 2.98 0.290 1.98 5.2 1.08 2.85
## 10 1 13.9 1.35 2.27 16 98 2.98 3.15 0.22 1.85 7.22 1.01 3.55
## # … with 168 more rows, and 1 more variable: V13 <int>
names(wineTib) <- c("Class", "Alco", "Malic", "Ash", "Alk", "Mag",
                    "Phe", "Flav", "Non_flav", "Proan", "Col", "Hue",
                    "OD", "Prol")
wineTib$Class <- as.factor(wineTib$Class)
wineTib
## # A tibble: 178 x 14
## Class Alco Malic Ash Alk Mag Phe Flav Non_flav Proan Col Hue
## <fct> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 14.2 1.71 2.43 15.6 127 2.8 3.06 0.28 2.29 5.64 1.04
## 2 1 13.2 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05
## 3 1 13.2 2.36 2.67 18.6 101 2.8 3.24 0.3 2.81 5.68 1.03
## 4 1 14.4 1.95 2.5 16.8 113 3.85 3.49 0.24 2.18 7.8 0.86
## 5 1 13.2 2.59 2.87 21 118 2.8 2.69 0.39 1.82 4.32 1.04
## 6 1 14.2 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05
## 7 1 14.4 1.87 2.45 14.6 96 2.5 2.52 0.3 1.98 5.25 1.02
## 8 1 14.1 2.15 2.61 17.6 121 2.6 2.51 0.31 1.25 5.05 1.06
## 9 1 14.8 1.64 2.17 14 97 2.8 2.98 0.290 1.98 5.2 1.08
## 10 1 13.9 1.35 2.27 16 98 2.98 3.15 0.22 1.85 7.22 1.01
## # … with 168 more rows, and 2 more variables: OD <dbl>, Prol <int>
wineUntidy <- gather(wineTib, "Variable", "Value", -Class)
ggplot(wineUntidy, aes(Class, Value)) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_boxplot() +
  theme_bw()
wineTask <- makeClassifTask(data = wineTib, target = "Class")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
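The warning is harmless: mlr simply converts the tibble into a plain data.frame before building the task. If you prefer to silence it, one optional (and equivalent) approach is to do the conversion yourself:
wineTask <- makeClassifTask(data = as.data.frame(wineTib), target = "Class")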
lda <- makeLearner("classif.lda")
ldaModel <- train(lda, wineTask)
ldaModelData <- getLearnerModel(ldaModel)
ldaPreds <- predict(ldaModelData)$x
head(ldaPreds)
## LD1 LD2
## 1 -4.700244 1.9791383
## 2 -4.301958 1.1704129
## 3 -3.420720 1.4291014
## 4 -4.205754 4.0028715
## 5 -1.509982 0.4512239
## 6 -4.518689 3.2131376
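The weights that define each discriminant function (the coefficients in an equation like the one shown earlier) are stored in the scaling component of the underlying MASS lda model object, with one column per discriminant function:
ldaModelData$scaling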
wineTib %>%
  mutate(LD1 = ldaPreds[, 1],
         LD2 = ldaPreds[, 2]) %>%
  ggplot(aes(LD1, LD2, col = Class)) +
  geom_point() +
  stat_ellipse() +
  theme_bw()
qda <- makeLearner("classif.qda")
qdaModel <- train(qda, wineTask)
Ideally we would run 50 repetitions of 10-fold cross-validation, but to keep the output manageable we use only 5 repetitions here.
kFold <- makeResampleDesc(method = "RepCV",
                          folds = 10,
                          reps = 5, # reduced from 50 to keep the report small
                          stratify = TRUE)
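Because the cross-validation folds are drawn at random, it can be useful to set a seed before resampling so that the results are reproducible between runs (the value 1234 here is arbitrary):
set.seed(1234)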
ldaCV <- resample(learner = lda, task = wineTask, resampling = kFold,
                  measures = list(mmce, acc))
## Resampling: repeated cross-validation
## Measures: mmce acc
## [Resample] iter 1: 0.0000000 1.0000000
## [Resample] iter 2: 0.0555556 0.9444444
## [Resample] iter 3: 0.0000000 1.0000000
## [Resample] iter 4: 0.0000000 1.0000000
## [Resample] iter 5: 0.0000000 1.0000000
## [Resample] iter 6: 0.0555556 0.9444444
## [Resample] iter 7: 0.0000000 1.0000000
## [Resample] iter 8: 0.0555556 0.9444444
## [Resample] iter 9: 0.0000000 1.0000000
## [Resample] iter 10: 0.0000000 1.0000000
## [Resample] iter 11: 0.0000000 1.0000000
## [Resample] iter 12: 0.0000000 1.0000000
## [Resample] iter 13: 0.0000000 1.0000000
## [Resample] iter 14: 0.0000000 1.0000000
## [Resample] iter 15: 0.0555556 0.9444444
## [Resample] iter 16: 0.0000000 1.0000000
## [Resample] iter 17: 0.0000000 1.0000000
## [Resample] iter 18: 0.0000000 1.0000000
## [Resample] iter 19: 0.0000000 1.0000000
## [Resample] iter 20: 0.0555556 0.9444444
## [Resample] iter 21: 0.0000000 1.0000000
## [Resample] iter 22: 0.0588235 0.9411765
## [Resample] iter 23: 0.1111111 0.8888889
## [Resample] iter 24: 0.0000000 1.0000000
## [Resample] iter 25: 0.0000000 1.0000000
## [Resample] iter 26: 0.0000000 1.0000000
## [Resample] iter 27: 0.0000000 1.0000000
## [Resample] iter 28: 0.0000000 1.0000000
## [Resample] iter 29: 0.0000000 1.0000000
## [Resample] iter 30: 0.0000000 1.0000000
## [Resample] iter 31: 0.0000000 1.0000000
## [Resample] iter 32: 0.0000000 1.0000000
## [Resample] iter 33: 0.0000000 1.0000000
## [Resample] iter 34: 0.0000000 1.0000000
## [Resample] iter 35: 0.0000000 1.0000000
## [Resample] iter 36: 0.0555556 0.9444444
## [Resample] iter 37: 0.0555556 0.9444444
## [Resample] iter 38: 0.0000000 1.0000000
## [Resample] iter 39: 0.0000000 1.0000000
## [Resample] iter 40: 0.0000000 1.0000000
## [Resample] iter 41: 0.0000000 1.0000000
## [Resample] iter 42: 0.0000000 1.0000000
## [Resample] iter 43: 0.1111111 0.8888889
## [Resample] iter 44: 0.0000000 1.0000000
## [Resample] iter 45: 0.0000000 1.0000000
## [Resample] iter 46: 0.0555556 0.9444444
## [Resample] iter 47: 0.0555556 0.9444444
## [Resample] iter 48: 0.0000000 1.0000000
## [Resample] iter 49: 0.0000000 1.0000000
## [Resample] iter 50: 0.0000000 1.0000000
##
## Aggregated Result: mmce.test.mean=0.0156209,acc.test.mean=0.9843791
##
qdaCV <- resample(learner = qda, task = wineTask, resampling = kFold,
                  measures = list(mmce, acc))
## Resampling: repeated cross-validation
## Measures: mmce acc
## [Resample] iter 1: 0.0000000 1.0000000
## [Resample] iter 2: 0.0555556 0.9444444
## [Resample] iter 3: 0.0000000 1.0000000
## [Resample] iter 4: 0.0588235 0.9411765
## [Resample] iter 5: 0.0000000 1.0000000
## [Resample] iter 6: 0.0000000 1.0000000
## [Resample] iter 7: 0.0000000 1.0000000
## [Resample] iter 8: 0.0000000 1.0000000
## [Resample] iter 9: 0.0000000 1.0000000
## [Resample] iter 10: 0.0000000 1.0000000
## [Resample] iter 11: 0.0555556 0.9444444
## [Resample] iter 12: 0.0000000 1.0000000
## [Resample] iter 13: 0.0000000 1.0000000
## [Resample] iter 14: 0.0000000 1.0000000
## [Resample] iter 15: 0.0000000 1.0000000
## [Resample] iter 16: 0.0000000 1.0000000
## [Resample] iter 17: 0.0000000 1.0000000
## [Resample] iter 18: 0.0000000 1.0000000
## [Resample] iter 19: 0.0000000 1.0000000
## [Resample] iter 20: 0.0000000 1.0000000
## [Resample] iter 21: 0.0000000 1.0000000
## [Resample] iter 22: 0.0000000 1.0000000
## [Resample] iter 23: 0.0555556 0.9444444
## [Resample] iter 24: 0.0000000 1.0000000
## [Resample] iter 25: 0.0000000 1.0000000
## [Resample] iter 26: 0.0000000 1.0000000
## [Resample] iter 27: 0.0000000 1.0000000
## [Resample] iter 28: 0.0000000 1.0000000
## [Resample] iter 29: 0.0000000 1.0000000
## [Resample] iter 30: 0.0000000 1.0000000
## [Resample] iter 31: 0.0000000 1.0000000
## [Resample] iter 32: 0.0555556 0.9444444
## [Resample] iter 33: 0.0000000 1.0000000
## [Resample] iter 34: 0.0000000 1.0000000
## [Resample] iter 35: 0.0000000 1.0000000
## [Resample] iter 36: 0.0000000 1.0000000
## [Resample] iter 37: 0.0555556 0.9444444
## [Resample] iter 38: 0.0000000 1.0000000
## [Resample] iter 39: 0.0000000 1.0000000
## [Resample] iter 40: 0.0000000 1.0000000
## [Resample] iter 41: 0.0000000 1.0000000
## [Resample] iter 42: 0.0000000 1.0000000
## [Resample] iter 43: 0.0000000 1.0000000
## [Resample] iter 44: 0.0000000 1.0000000
## [Resample] iter 45: 0.0555556 0.9444444
## [Resample] iter 46: 0.0000000 1.0000000
## [Resample] iter 47: 0.0000000 1.0000000
## [Resample] iter 48: 0.0588235 0.9411765
## [Resample] iter 49: 0.0555556 0.9444444
## [Resample] iter 50: 0.0000000 1.0000000
##
## Aggregated Result: mmce.test.mean=0.0101307,acc.test.mean=0.9898693
##
ldaCV$aggr
## mmce.test.mean acc.test.mean
## 0.01562092 0.98437908
qdaCV$aggr
## mmce.test.mean acc.test.mean
## 0.01013072 0.98986928
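The aggregated means are very close, so it can also be worth comparing the spread of accuracies across the individual cross-validation iterations. Here is a small sketch that does this, assuming the per-iteration values are stored in the measures.test component of each resample result (as they are in current versions of mlr):
cvAccuracies <- bind_rows(
  tibble(Model = "LDA", Acc = ldaCV$measures.test$acc),
  tibble(Model = "QDA", Acc = qdaCV$measures.test$acc)
)

ggplot(cvAccuracies, aes(Model, Acc)) +
  geom_boxplot() +
  theme_bw()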
calculateConfusionMatrix(ldaCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 1 2 3 -err.-
## 1 0.997/0.990 0.003/0.003 0.000/0.000 0.003
## 2 0.008/0.010 0.969/0.991 0.023/0.033 0.031
## 3 0.000/0.000 0.008/0.006 0.992/0.967 0.008
## -err.- 0.010 0.009 0.033 0.02
##
##
## Absolute confusion matrix:
## predicted
## true 1 2 3 -err.-
## 1 294 1 0 1
## 2 3 344 8 11
## 3 0 2 238 2
## -err.- 3 3 8 14
calculateConfusionMatrix(qdaCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 1 2 3 -err.-
## 1 0.993/0.983 0.007/0.006 0.000/0.000 0.007
## 2 0.014/0.017 0.986/0.989 0.000/0.000 0.014
## 3 0.000/0.000 0.008/0.006 0.992/1.000 0.008
## -err.- 0.02 0.01 0.00 0.01
##
##
## Absolute confusion matrix:
## predicted
## true 1 2 3 -err.-
## 1 293 2 0 2
## 2 5 350 0 5
## 3 0 2 238 2
## -err.- 5 4 0 9
Interpret the confusion matrices shown in the previous section.
Which model is better at identifying wines from vineyard 3? Does our LDA model misclassify more wines from vineyard 2 as being from vineyard 1 or vineyard 3?
Extract the discriminant scores from our LDA model, and use only these as the predictors for a kNN model (including tuning k). Experiment with your own cross-validation strategy.
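One possible sketch for this last exercise (not the only valid approach) is shown below. It reuses ldaPreds and the kFold resampling description from earlier, and tunes k over the arbitrary range 1 to 10 by grid search. Because the discriminant scores were learned from the whole dataset, the resulting cross-validated estimate will be somewhat optimistic; a stricter approach would recompute the LDA inside each fold.
# Keep only the class labels and the two discriminant scores
wineDiscr <- tibble(Class = wineTib$Class,
                    LD1 = ldaPreds[, 1],
                    LD2 = ldaPreds[, 2])

discrTask <- makeClassifTask(data = as.data.frame(wineDiscr), target = "Class")

# Candidate values of k, tuned by grid search with cross-validation
knn <- makeLearner("classif.knn")
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))

tunedK <- tuneParams(knn, task = discrTask,
                     resampling = kFold,
                     par.set = knnParamSpace,
                     control = makeTuneControlGrid())

# Cross-validate a kNN learner that uses the tuned value of k
tunedKnn <- setHyperPars(knn, par.vals = tunedK$x)

knnCV <- resample(tunedKnn, discrTask, resampling = kFold,
                  measures = list(mmce, acc))

knnCV$aggr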