leaf nodes or leaves
Tree-based models can be used for both classification and regression tasks, so you may see them described as classification and regression trees (CART).
rpart algorithm
“rpart offers two approaches: the difference in entropy (called the information gain) and the difference in Gini index (called the Gini gain). The two methods usually give very similar results; but the Gini index (named after the sociologist and statistician Corrado Gini) is slightly faster to compute, so we’ll focus on it.”
“The Gini index is the default method rpart uses to decide how to split the tree”
“Entropy and the Gini index are two ways of trying to measure the same thing: impurity. Impurity is a measure of how heterogeneous the classes are within a node.”
“By estimating the impurity (with whichever method you choose) that would result from using each predictor variable for the next split, the algorithm can choose the feature that will result in the smallest impurity. Put another way, the algorithm chooses the feature that will result in subsequent nodes that are as homogeneous as possible.”
“If a node contains only a single class (which would make it a leaf), it would be said to be pure.”
“We want to know the Gini gain of this split. The Gini gain is the difference between the Gini index of the parent node and the Gini index of the split. Looking at our example in figure 7.2, the Gini index for any node is calculated as”
\[\text{Gini index} = 1 - \left(p(A)^2 + p(B)^2\right)\]
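Both impurity measures are easy to compute by hand. Here is a minimal base-R sketch (the class counts are made up for illustration, not taken from figure 7.2) that computes the Gini index of a parent node and of a candidate split, and then the Gini gain; the Gini index of the split is the weighted average of the two child nodes' indices. An entropy function is included only for comparison, since information gain is computed the same way with entropy in place of the Gini index.

# Made-up counts: the parent node holds 20 As and 20 Bs; the candidate split
# sends 15 As + 5 Bs to the left child and 5 As + 15 Bs to the right child
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

entropy <- function(counts) {   # for comparison with the Gini index
  p <- counts / sum(counts)
  -sum(p * log2(p))
}

parent <- c(A = 20, B = 20)
left   <- c(A = 15, B = 5)
right  <- c(A = 5,  B = 15)

giniSplit <- (sum(left)  / sum(parent)) * gini(left) +
             (sum(right) / sum(parent)) * gini(right)   # weighted average of the children: 0.375

giniGain <- gini(parent) - giniSplit                     # 0.5 - 0.375 = 0.125
giniGain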
cp value =
7.2. Building a decision tree model
Listing 7.1. Loading and exploring the zoo dataset
library(tidyverse)   # for as_tibble() and, later, mutate_if()

data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)
zooTib
“mlr won’t let us create a task with logical predictors, so let’s convert them into factors instead”
Listing 7.2. Converting logical variables to factors
zooTib <- mutate_if(zooTib, is.logical, as.factor)
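A quick sanity check that the conversion worked (not from the book's listing; assumes the tidyverse is loaded):

glimpse(zooTib)   # the former logical columns should now show up as <fct>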
7.4. Training the decision tree model
Listing 7.3. Creating the task and learner
library(mlr)

zooTask <- makeClassifTask(data = zooTib, target = "type")
tree <- makeLearner("classif.rpart")
“The maxcompete hyperparameter controls how many candidate splits can be displayed for each node in the model summary”
“The maxsurrogate hyperparameter is similar to maxcompete but controls how many surrogate splits are shown”
“The usesurrogate hyperparameter controls how the algorithm uses surrogate splits. A value of zero means surrogates will not be used, and cases with missing data will not be classified”
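These settings can be passed to the learner when it is created. The values below are just rpart's defaults, shown for illustration rather than as recommendations from the text:

treeWithSurrogates <- makeLearner("classif.rpart",
                                  par.vals = list(maxcompete = 4,     # report up to 4 competing splits per node
                                                  maxsurrogate = 5,   # keep up to 5 surrogate splits per node
                                                  usesurrogate = 2))  # use surrogates to classify cases with missing data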
Recall from chapter 6 that we can quickly count the number of missing values per column of a data.frame or tibble by running map_dbl(zooTib, ~sum(is.na(.)))
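As a runnable block (assuming purrr is loaded, for example via library(tidyverse)):

library(purrr)

map_dbl(zooTib, ~ sum(is.na(.)))   # number of NAs per column; the zoo data should have none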
Listing 7.4. Printing available rpart hyperparameters
getParamSet(tree)
treeParamSpace <- makeParamSet(
  makeIntegerParam("minsplit", lower = 5, upper = 20),   # min cases in a node before a split is attempted
  makeIntegerParam("minbucket", lower = 3, upper = 10),  # min cases allowed in a leaf
  makeNumericParam("cp", lower = 0.01, upper = 0.1),     # complexity parameter: min improvement needed to keep a split
  makeIntegerParam("maxdepth", lower = 3, upper = 10))   # max depth of the tree
randSearch <- makeTuneControlRandom(maxit = 200)
cvForTuning <- makeResampleDesc("CV", iters = 5)
library(parallel)
library(parallelMap)
parallelStartSocket(cpus = detectCores())
tunedTreePars <- tuneParams(tree, task = zooTask,
                            resampling = cvForTuning,
                            par.set = treeParamSpace,
                            control = randSearch)
parallelStop()
tunedTreePars
tunedTree <- setHyperPars(tree, par.vals = tunedTreePars$x)
tunedTreeModel <- train(tunedTree, zooTask)
install.packages("rpart.plot")
library(rpart.plot)
treeModelData <- getLearnerModel(tunedTreeModel)
rpart.plot(treeModelData, roundint = FALSE, box.palette = "BuBn", type = 5)
printcp(treeModelData, digits = 3)
“For a detailed summary of the model, run summary(treeModelData).”
outer <- makeResampleDesc("CV", iters = 5)
treeWrapper <- makeTuneWrapper("classif.rpart", resampling = cvForTuning,
                               par.set = treeParamSpace,
                               control = randSearch)
parallelStartSocket(cpus = detectCores())
cvWithTuning <- resample(treeWrapper, zooTask, resampling = outer)
parallelStop()
Now let’s look at the cross-validation result and see how our model-building process performed.
cvWithTuning
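If you only want the aggregated performance estimate (the mean misclassification error, mlr's default classification measure), it can be pulled out of the resampling result; mlr stores the aggregated measures in the aggr element:

cvWithTuning$aggr   # cross-validated MMCE, averaged over the 5 outer folds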