leaf nodes or leaves
Tree-based models can be used for both classification and regression tasks, so you may see them described as classification and regression trees (CART).
rpart algorithm
“rpart offers two approaches: the difference in entropy (called the information gain) and the difference in Gini index (called the Gini gain). The two methods usually give very similar results; but the Gini index (named after the sociologist and statistician Corrado Gini) is slightly faster to compute, so we’ll focus on it.”
“The Gini index is the default method rpart uses to decide how to split the tree”
“Entropy and the Gini index are two ways of trying to measure the same thing: impurity. Impurity is a measure of how heterogeneous the classes are within a node.”
“By estimating the impurity (with whichever method you choose) that would result from using each predictor variable for the next split, the algorithm can choose the feature that will result in the smallest impurity. Put another way, the algorithm chooses the feature that will result in subsequent nodes that are as homogeneous as possible.”
“If a node contains only a single class (which would make it a leaf), it would be said to be pure.”
“We want to know the Gini gain of this split. The Gini gain is the difference between the Gini index of the parent node and the Gini index of the split. Looking at our example in figure 7.2, the Gini index for any node is calculated as”
\[\text{Gini index} = 1 - \left(p(A)^2 + p(B)^2\right)\]
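Both impurity measures are easy to compute by hand. Here is a minimal base-R sketch (the class counts are made up for illustration, not taken from figure 7.2) that computes the Gini index of a parent node and of a candidate split, and then the Gini gain; the Gini index of the split is the weighted average of the two child nodes' indices. An entropy function is included only for comparison, since information gain is computed the same way with entropy in place of the Gini index.

# Made-up counts: the parent node holds 20 As and 20 Bs; the candidate split
# sends 15 As + 5 Bs to the left child and 5 As + 15 Bs to the right child
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

entropy <- function(counts) {   # for comparison with the Gini index
  p <- counts / sum(counts)
  -sum(p * log2(p))
}

parent <- c(A = 20, B = 20)
left   <- c(A = 15, B = 5)
right  <- c(A = 5,  B = 15)

giniSplit <- (sum(left)  / sum(parent)) * gini(left) +
             (sum(right) / sum(parent)) * gini(right)   # weighted average of the children: 0.375

giniGain <- gini(parent) - giniSplit                     # 0.5 - 0.375 = 0.125
giniGain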
cp value =
7.2. Building a decision tree model
Listing 7.1. Loading and exploring the zoo dataset
library(tidyverse)   # for as_tibble() and, later, mutate_if()

data(Zoo, package = "mlbench")
zooTib <- as_tibble(Zoo)
zooTib
“mlr won’t let us create a task with logical predictors, so let’s convert them into factors instead”
Listing 7.2. Converting logical variables to factors
zooTib <- mutate_if(zooTib, is.logical, as.factor)
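A quick sanity check that the conversion worked (not from the book's listing; assumes the tidyverse is loaded):

glimpse(zooTib)   # the former logical columns should now show up as <fct>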
7.4. Training the decision tree model
Listing 7.3. Creating the task and learner
library(mlr)

zooTask <- makeClassifTask(data = zooTib, target = "type")
tree <- makeLearner("classif.rpart")
“The maxcompete hyperparameter controls how many candidate splits can be displayed for each node in the model summary”
“The maxsurrogate hyperparameter is similar to maxcompete but controls how many surrogate splits are shown”
“The usesurrogate hyperparameter controls how the algorithm uses surrogate splits. A value of zero means surrogates will not be used, and cases with missing data will not be classified”
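These settings can be passed to the learner when it is created. The values below are just rpart's defaults, shown for illustration rather than as recommendations from the text:

treeWithSurrogates <- makeLearner("classif.rpart",
                                  par.vals = list(maxcompete = 4,     # report up to 4 competing splits per node
                                                  maxsurrogate = 5,   # keep up to 5 surrogate splits per node
                                                  usesurrogate = 2))  # use surrogates to classify cases with missing data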
Recall from chapter 6 that we can quickly count the number of missing values per column of a data.frame or tibble by running map_dbl(zooTib, ~sum(is.na(.)))
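As a runnable block (assuming purrr is loaded, for example via library(tidyverse)):

library(purrr)

map_dbl(zooTib, ~ sum(is.na(.)))   # number of NAs per column; the zoo data should have none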
Listing 7.4. Printing available rpart hyperparameters
getParamSet(tree)
treeParamSpace <- makeParamSet(
  makeIntegerParam("minsplit", lower = 5, upper = 20),   # min cases in a node before a split is attempted
  makeIntegerParam("minbucket", lower = 3, upper = 10),  # min cases allowed in a leaf
  makeNumericParam("cp", lower = 0.01, upper = 0.1),     # complexity parameter: min improvement needed to keep a split
  makeIntegerParam("maxdepth", lower = 3, upper = 10))   # max depth of the tree
randSearch <- makeTuneControlRandom(maxit = 200)
cvForTuning <- makeResampleDesc("CV", iters = 5)
library(parallel)
library(parallelMap)
parallelStartSocket(cpus = detectCores())
tunedTreePars <- tuneParams(tree, task = zooTask,
                            resampling = cvForTuning,
                            par.set = treeParamSpace,
                            control = randSearch)
parallelStop()
tunedTreePars
tunedTree <- setHyperPars(tree, par.vals = tunedTreePars$x)
tunedTreeModel <- train(tunedTree, zooTask)
install.packages("rpart.plot")
library(rpart.plot)
treeModelData <- getLearnerModel(tunedTreeModel)
rpart.plot(treeModelData, roundint = FALSE, box.palette = "BuBn", type = 5)
printcp(treeModelData, digits = 3)
“For a detailed summary of the model, run summary(treeModelData).”
outer <- makeResampleDesc("CV", iters = 5)
treeWrapper <- makeTuneWrapper("classif.rpart", resampling = cvForTuning,
                               par.set = treeParamSpace,
                               control = randSearch)
parallelStartSocket(cpus = detectCores())
cvWithTuning <- resample(treeWrapper, zooTask, resampling = outer)
parallelStop()
Now let’s look at the cross-validation result and see how our model-building process performed.
cvWithTuning
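If you only want the aggregated performance estimate (the mean misclassification error, mlr's default classification measure), it can be pulled out of the resampling result; mlr stores the aggregated measures in the aggr element:

cvWithTuning$aggr   # cross-validated MMCE, averaged over the 5 outer folds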