In this lecture note we will introduce classification and regression trees (CART) and random forests, which incorporate trees into a bagging ensemble method. We first describe the dataset that will serve as a motivating example in the following sections.

APS Failure data set

To begin, we download the APS Failure at Scania Trucks Data Set from the UC Irvine Machine Learning Repository.

The dataset description is:

The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS) which generates pressurised air that are utilized in various functions in a truck, such as braking and gear changes. The datasets’ positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. The data consists of a subset of all available data, selected by experts.

The dataset has additional information associated with it. It is mentioned that a false positive (Cost 1) has a cost of 10, while a false negative (Cost 2) has a cost of 500. So false negatives are 50x more costly than false positives.

In this case Cost 1 refers to the cost of an unnecessary check performed by a mechanic at a workshop, while Cost 2 refers to the cost of missing a faulty truck, which may cause a breakdown.
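
As a small illustration of how the two costs combine (the counts below are made up, not from the data):

# illustrative counts only
fp <- 100   # unnecessary workshop checks
fn <- 10    # missed faulty trucks
10 * fp + 500 * fn   # 1000 + 5000 = 6000: even a few missed trucks dominate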

There is an imbalance in the dataset, such that there are 59x more negative observations than positive observations:

The training set contains 60000 examples in total in which 59000 belong to the negative class and 1000 positive class. The test set contains 16000 examples.

There are 170 predictors plus the outcome (class):

The attribute names of the data have been anonymized for proprietary reasons. It consists of both single numerical counters and histograms consisting of bins with different conditions. … The attributes are as follows: class, then anonymized operational data. The operational data have an identifier and a bin id, like ‘Identifier_Bin’. In total there are 171 attributes, of which 7 are histogram variabels. Missing values are denoted by ‘na’.

We begin by exploring the training data. Note that we have a variable amount of missing data across the columns:

library(readr)
col_types <- paste0(c("c",rep("n",170)),collapse="")
dat <- read_csv("aps_failure_training_set.csv", skip=20, na="na", col_types=col_types)
table(dat$class)
## 
##   neg   pos 
## 59000  1000
table(sapply(dat[,-1], class))
## 
## numeric 
##     170
summary(sapply(dat[,-1], function(z) sum(is.na(z))))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     671     688    5000    2727   49264

The very first column has decent discrimination of the outcome:

with(dat, boxplot(aa_000 ~ class))

with(dat, boxplot(ab_000 ~ class))

Imbalanced data and unequal costs

There is a technical report, Using Random Forest to Learn Imbalanced Data, with Leo Breiman, the creator of the random forest, as a co-author, which goes into detail on how to use random forests with imbalanced data. Here we have many more negative examples than positives, while we want very high sensitivity (false negatives are very costly relative to false positives). One of the strategies mentioned in the technical report is to down-sample the majority class. For convenience, we will take this approach in this lecture note, producing equal classes by down-sampling the negatives. However, given that we have explicit costs associated with the two types of error, down-sampling alone may not be sufficient, and a method with a customizable loss function may actually be preferred for this dataset.

We note that another approach, aside from down-sampling, is to use the classwt parameter in the randomForest function from the package of the same name. Comparing the class-weight and down-sampling approaches, the report states: “We show that both of our methods have favorable prediction performance.”
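
For reference, a sketch of how class weights would be supplied when calling randomForest directly (not run; the weights are illustrative rather than values from the report, and x and y denote the predictors and class factor defined later in the note):

# not run: weight the rare positive class instead of down-sampling
# library(randomForest)
# rf <- randomForest(x = x, y = y, classwt = c(neg = 1, pos = 50))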

set.seed(1)
idx.dat <- c(which(dat$class == "pos"),
             sample(which(dat$class == "neg"), 1000))
dat2 <- dat[idx.dat,]
table(dat2$class)
## 
##  neg  pos 
## 1000 1000
with(dat2, boxplot(aa_000 ~ class))

Some quick examination of the quantiles of the first variable for each class:

by(dat2$aa_000, dat2$class, quantile, .5)
## dat2$class: neg
## [1] 28564
## -------------------------------------------------------- 
## dat2$class: pos
## [1] 584994
by(dat2$aa_000, dat2$class, quantile, .9)
## dat2$class: neg
## [1] 86212.2
## -------------------------------------------------------- 
## dat2$class: pos
## [1] 1270409

Picking a cut point for a single variable

We can already start to motivate decision trees by looking at this first variable. As it seems to discriminate the groups pretty well, what if we put a cut point at an arbitrary value, say 75,000? How would our prediction on the training data turn out? We end up with about 92% prediction accuracy:

(tab <- table(obs=dat2$class, pred=dat2$aa_000 > 75000))
##      pred
## obs   FALSE TRUE
##   neg   882  118
##   pos    37  963
round(prop.table(tab), 2)
##      pred
## obs   FALSE TRUE
##   neg  0.44 0.06
##   pos  0.02 0.48
(cost <- tab["neg","TRUE"] * 10 + tab["pos","FALSE"] * 500)
## [1] 19680

We can also try a much higher value of the first variable to define a cut point. This gives us 93% accuracy, but actually the total cost nearly doubles, because we have more false negatives:

(tab <- table(obs=dat2$class, pred=dat2$aa_000 > 125000))
##      pred
## obs   FALSE TRUE
##   neg   941   59
##   pos    74  926
round(prop.table(tab), 2)
##      pred
## obs   FALSE TRUE
##   neg  0.47 0.03
##   pos  0.04 0.46
(cost <- tab["neg","TRUE"] * 10 + tab["pos","FALSE"] * 500)
## [1] 37590

We will come back to the unequal costs later in the note, by examining a continuous prediction score and finding a new cut point which minimizes the cost. For now, we will focus instead on prediction accuracy and Cohen’s kappa as the metrics. If we wanted to also incorporate unequal costs into our parameter tuning, we could define an alternate performance metric.
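
For example, a cost-based summary function could be supplied to trainControl. Below is a sketch: costSummary is a name we introduce here (not part of caret), and the commented train call is shown only as a pattern.

# a possible cost-based metric for tuning, using the stated costs of 10 and 500
costSummary <- function(data, lev = NULL, model = NULL) {
  fp <- sum(data$obs == "neg" & data$pred == "pos")   # false positives
  fn <- sum(data$obs == "pos" & data$pred == "neg")   # false negatives
  c(Cost = 10 * fp + 500 * fn)
}
# not run: tune by minimizing cost instead of maximizing accuracy
# trCtl <- trainControl(summaryFunction = costSummary, savePredictions = TRUE)
# fit <- train(x, y, method = "rf", trControl = trCtl, metric = "Cost", maximize = FALSE)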

Imputing missing values

Before we dive into decision trees and random forests, we need to clean up the predictors a bit. The caret package offers imputation of missing values using k-nearest neighbors, via the preProcess function. We train a preProcess fit, similarly to how we use train, and then apply it to the training data using predict.

Note: because the dataset has many predictors (170 in total), it would take a long time to run a random forest on the entire set of features (about half an hour on the 2000 observation training set). For demonstration, we subset to the first 20 predictors, which can then be fit in a few minutes. For a real application, one would instead use all of the predictors.

library(caret)
x <- as.data.frame(dat2[,2:21])
y <- factor(dat2$class)
summary(sapply(x, function(z) sum(is.na(z))))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    21.0    21.0   232.8   381.0  1519.0
ppfit <- preProcess(x, method=c("center","scale","knnImpute"))
x <- predict(ppfit, x)
summary(sapply(x, function(z) sum(is.na(z))))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0

Decision trees

Decision trees are a class of classification algorithms which subject the observations to a series of binary decisions based on the predictors, with the goal of separating the classes in the leaves of the tree. All of the samples begin at the root node (typically drawn at the top), and each internal node represents one of the binary decisions. The terminal nodes, or leaves, are drawn at the bottom, and the goal is to achieve high purity of the classes in the leaves. Purity can be measured by multiple metrics, typically the Gini impurity or the entropy. The CART (classification and regression trees) method that we will use here makes use of the Gini impurity, which is similar to the definition of Cohen’s kappa. The Gini impurity is the probability of misclassifying an observation if it were labeled at random according to the distribution of labels in the node. If we have K classes, with class i having probability \(p_i\), the Gini impurity can be calculated as:

\[ GI = \sum_{i=1}^K p_i (1 - p_i) = 1 - \sum_{i=1}^K p_i^2 \]

With two classes, this becomes:

\[ GI = 1 - p_1^2 - (1 - p_1)^2 \]

\[ = 2 p_1 - 2 p_1^2 \]

From this function we can see that an algorithm minimizing impurity will try to create leaves where the probability for a given class is close to 0 or 1. A decision tree is built as follows: each node is recursively partitioned using a splitting rule. To decide whether and how to split a node, one can consider the gain (the impurity of the parent minus the weighted impurity of the proposed child nodes):

\[ \textrm{gain} = I(\textrm{parent}) - \sum_{j=1}^2 \frac{N(j)}{N} I(j) \]

where \(N\) is the number of observations at the parent node, \(N(j)\) is the number of observations at the jth child node, and \(I(j)\) is the impurity of the jth child node. Here, as we are considering binary splits, we sum over the two proposed child nodes. Instead of attempting to find the globally optimal decision tree, a greedy algorithm is used: at each terminal node, the single best split over the covariates is chosen. There are various options for stopping criteria. Some of these can be found in ?rpart.control:

minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.

minbucket: the minimum number of observations in any terminal node. If only one of ‘minbucket’ or ‘minsplit’ is specified, the code either sets ‘minsplit’ to ‘minbucket*3’ or ‘minbucket’ to ‘minsplit/3’, as appropriate.
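
Before building a tree, here is a small numeric check of the Gini impurity and gain formulas above, using made-up counts rather than the APS data:

# Gini impurity from class counts at a node
gini <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }
parent <- c(pos = 50, neg = 50)   # impurity 0.5
left   <- c(pos = 45, neg = 15)   # proposed child with 60 observations
right  <- c(pos = 5,  neg = 35)   # proposed child with 40 observations
gini(parent) -
  (sum(left)/sum(parent) * gini(left) + sum(right)/sum(parent) * gini(right))
## [1] 0.1875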

Here we can demonstrate building a tree with our reduced 20 variable dataset. We use the rpart function from the rpart package, and then plot with a nice function from the rpart.plot package. We also print the “Complexity Parameter” table, which offers a tuning variable cp. The xerror column in the table gives the cross-validation error, where cross-validation is being performed within rpart.

library(rpart)
library(rpart.plot)
df <- data.frame(class=y,x)
rfit <- rpart(class ~ ., data=df)
printcp(rfit)
## 
## Classification tree:
## rpart(formula = class ~ ., data = df)
## 
## Variables actually used in tree construction:
## [1] aa_000 ag_003
## 
## Root node error: 1000/2000 = 0.5
## 
## n= 2000 
## 
##      CP nsplit rel error xerror     xstd
## 1 0.873      0     1.000  1.056 0.022326
## 2 0.013      1     0.127  0.129 0.010985
## 3 0.010      2     0.114  0.122 0.010703
rpart.plot(rfit)

Taking a look at the tree above, each node gives a label, the fraction of positive observations in the node, and the percent of observations at that node. Below each node is the decision rule that determines the split: the variable name and the critical value chosen by the greedy algorithm. A decision tree can use any variable at any step in the process, so it can capture, for example, interactions between variables or step-like functions of a single variable. We will see in a final example how decision trees can approximate any smooth function, as long as the tree is given enough splits and data.

We can also prune back the tree to one with a higher complexity parameter (think of CP as a penalty on complex trees).

rfit$cptable
##      CP nsplit rel error xerror       xstd
## 1 0.873      0     1.000  1.056 0.02232559
## 2 0.013      1     0.127  0.129 0.01098542
## 3 0.010      2     0.114  0.122 0.01070318
# try a higher complexity parameter
cp <- rfit$cptable[2,"CP"]
# this is the minimizing complexity parameter:
# cp <- rfit$cptable[which.min(rfit$cptable[,"xerror"]),"CP"]
pfit <- prune(rfit, cp=cp)
rpart.plot(pfit)

This results in a tree with a single split, but which nevertheless has leaves with high purity.

Ensemble methods

A quick motivation for why ensembles of simple learners like decision trees can do well is to consider the law of large numbers. Suppose we have a decision tree which will mis-classify a given observation with fixed probability 0.35, and suppose we build 25 such trees, with each tree independent of the others. Then, using the binomial distribution, we can see that a majority vote of the ensemble will do much better than any individual tree:

ntree <- 25
p <- 0.35
trees <- function() rbinom(ntree, 1, p)
# one ensemble vote:
sum(trees()) >= ceiling(ntree/2)
## [1] FALSE
# probability the ensemble will mis-classify:
pbinom(floor(ntree/2), ntree, p, lower.tail=FALSE)
## [1] 0.06044491

However, we will not necessarily be able to generate independent classifiers, as they use the same covariates and training data. Reducing the correlation between the trees is part of the motivation for random forests.
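
As a rough illustration (not taken from the technical report), suppose that with some probability each tree simply copies the errors of a single reference tree, so that the votes are correlated. The ensemble error then moves back toward the single-tree error:

set.seed(1)
rho <- 0.5   # assumed chance a tree copies the reference tree's error
ens.err <- replicate(10000, {
  ref <- rbinom(1, 1, p)                      # error of the reference tree
  copies <- rbinom(ntree, 1, rho)             # which trees copy it
  votes <- ifelse(copies == 1, ref, rbinom(ntree, 1, p))
  sum(votes) >= ceiling(ntree/2)              # does the majority vote mis-classify?
})
mean(ens.err)   # much larger than the 0.06 for independent trees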

Random forests

Random forests are ensembles of decision trees which are built by both bootstrapping the observations (bagging) and randomly subsetting the predictors considered at each split of each tree. The trees are grown without pruning, and the final prediction is made by a majority vote of the ensemble of trees. The combination of bootstrapping the samples and subsampling the predictors reduces the correlation between the trees, which helps, recalling the logic in the previous section about wholly independent classifiers.

Here we will use the randomForest package for constructing the forests. From ?randomForest we can see the rule for how many predictors are considered at each split:

mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in ‘x’) and regression (p/3)

We fit a random forest using this package by specifying method="rf" in the caret interface. We specify that we want to retain the predictions. Note that when we run train we are performing two levels of bootstrap resampling: there is the bootstrapping that train performs for all methods (unless cross-validation or another resampling method is specified in trainControl), as well as the bootstrapping of observations for the individual trees within each forest. The number of resamples is controlled by number in trainControl for the former (default is 25) and by ntree in the randomForest package for the latter (default is 500).

trCtl <- trainControl(savePredictions=TRUE)
fit <- train(x, y, method="rf", trControl=trCtl) # ~150 s
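
For reference, both resampling levels can be made explicit; caret passes extra arguments such as ntree through to randomForest (a sketch, not run here, using the defaults mentioned above):

# not run: 25 outer bootstrap resamples, 500 trees per forest (the defaults)
# trCtl2 <- trainControl(number=25, savePredictions=TRUE)
# fit2 <- train(x, y, method="rf", trControl=trCtl2, ntree=500)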

We can see that the forest achieves fairly high accuracy and kappa at various levels of mtry:

fit$results
##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.9518763 0.9037100 0.007279303 0.01455500
## 2   11 0.9525360 0.9050295 0.006408594 0.01281063
## 3   20 0.9496385 0.8992311 0.006850514 0.01369102

Random forests come with measures of variable importance (described in ?importance):

For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).

The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.

We can see that the random forests built by caret use the second measure of importance by default:

imp <- fit$finalModel$importance
head(imp)
##        MeanDecreaseGini
## aa_000       394.007028
## ab_000         6.699326
## ac_000        11.913528
## ad_000        10.704722
## ae_000         1.768177
## af_000         1.650557
dotplot(tail(sort(imp[,1]),10), xlab="Mean Decrease Gini")
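
If we also wanted the first, permutation-based measure, we could refit with importance=TRUE, which train passes through to randomForest (a sketch, not run here):

# not run: request the permutation-based importance in addition to the Gini measure
# fit2 <- train(x, y, method="rf", trControl=trCtl, importance=TRUE)
# head(importance(fit2$finalModel))   # adds MeanDecreaseAccuracy columns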

Returning to unequal costs

We previously used down-sampling as a method to address the fact that we had many more negative examples than positive examples, although the false negatives should be counted as 50x more costly than the false positives. We can now return to this question, to see if the default prediction by the trained random forest is optimal, or if we should re-calibrate to minimize total cost further (relative to a given distribution of positives and negatives).

Note that, if one were to use a test set as below to re-calibrate a predictor by minimizing cost on that dataset, one would want a final held-out test set to assess the cost on new data.

We read in a held-out test set:

col_types <- paste0(c("c",rep("n",170)),collapse="")
dat.test <- read_csv("aps_failure_test_set.csv", skip=20, na="na", col_types=col_types)

And we can again impute the missing values using the previously generated pre-processing rule:

summary(sapply(dat.test[,-1], function(z) sum(is.na(z))))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     189     193    1345     764   13129
x.test <- as.data.frame(dat.test[,2:21])
y.test <- factor(dat.test$class)
table(y.test)
## y.test
##   neg   pos 
## 15625   375
x.test <- predict(ppfit, x.test)

We then classify the test set and also generate class probabilities, by setting type="prob":

y.pred <- predict(fit, x.test)
y.prob <- predict(fit, x.test, type="prob")
confusionMatrix(data=y.pred, reference=y.test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   neg   pos
##        neg 14529     6
##        pos  1096   369
##                                          
##                Accuracy : 0.9311         
##                  95% CI : (0.9271, 0.935)
##     No Information Rate : 0.9766         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3779         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9299         
##             Specificity : 0.9840         
##          Pos Pred Value : 0.9996         
##          Neg Pred Value : 0.2519         
##              Prevalence : 0.9766         
##          Detection Rate : 0.9081         
##    Detection Prevalence : 0.9084         
##       Balanced Accuracy : 0.9569         
##                                          
##        'Positive' Class : neg            
## 

We can create a function which evaluates the total costs for a given cut point, and run this over the test set predictions:

test.pred <- data.frame(obs=y.test, pos=y.prob$pos)
costFn <- function(cut, df) {
  tab <- table(obs=df$obs, pred=factor(df$pos >= cut, c("FALSE","TRUE")))
  cost <- tab["neg","TRUE"] * 10 + tab["pos","FALSE"] * 500
  cost
}
costFn(cut=.5, test.pred)
## [1] 13970
s <- seq(from=0.1, to=.9, length=100)
costs <- sapply(s, costFn, df=test.pred)
plot(s, costs, type="l", lwd=3, col="dodgerblue")

Note that the default prediction (class probability > 0.5) is doing pretty well at minimizing costs, although it appears adopting a rule closer to 0.6 would do a bit better:

s <- seq(from=0.45, to=0.75, length=100)
costs <- sapply(s, costFn, df=test.pred)
plot(s, costs, type="l", lwd=3, col="dodgerblue")
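
The exact minimizing cut point on this grid can be read off directly (its value depends on the fitted forest, so we do not quote a number here):

s[which.min(costs)]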

Regression trees

Finally, we demonstrate with some simulated data that trees and forests can be applied to continuous outcomes as well. Instead of focusing on decreasing impurity in the leaves, the trees focus on reducing the variance of a continuous target in each leaf, again splitting on the predictors. If we try to learn a sine function, we can see that a tree with a very low complexity penalty can learn an arbitrary shape by splitting along x (here a one-dimensional function, but consider how a multi-dimensional surface could be approximated as well).

library(rpart)
df <- data.frame(x=runif(1000))
df$y <- sin(2 * pi * df$x)
with(df, plot(x,y))

rfit <- rpart(y ~ x, data=df, method="anova", cp=.001)
rpart.plot(rfit)

pred <- predict(rfit, df)
with(df, plot(x,y))
points(df$x, pred, col="red")

By bagging and subsetting predictors in a random forest, we get a more complex shape, much closer to the true functional form of the target:

df <- data.frame(x=runif(200))
df$y <- sin(2 * pi * df$x)
trCtl <- trainControl(method="cv", number=5, savePredictions=TRUE)
tg <- data.frame(mtry=1)
rf.fit <- train(df["x"], df$y, method="rf", trControl=trCtl, tuneGrid=tg)
with(df, plot(x,y))
pred <- rf.fit$pred[order(rf.fit$pred$rowIndex),]
points(df$x, pred$pred, col="red")

Finally, we note that the random forest approach to predicting this continuous function may look similar to a k-nearest-neighbors approach, although the random forest approach will scale to higher dimensions, while k-nearest-neighbors will begin to break down, due to difficulties posed by distance functions in high dimensions.

trCtl <- trainControl(method="cv", number=5, savePredictions=TRUE)
kfit <- train(df["x"], df$y, method="knn", trControl=trCtl)
kfit$results
##   k       RMSE  Rsquared        MAE      RMSESD   RsquaredSD       MAESD
## 1 5 0.02092212 0.9990616 0.01466379 0.005270114 0.0004789292 0.002877171
## 2 7 0.02613216 0.9985198 0.01763203 0.007867338 0.0008981398 0.003174035
## 3 9 0.03450230 0.9973170 0.02335639 0.012341082 0.0017926172 0.005905544
with(df, plot(x,y))
pred <- kfit$pred[kfit$pred$k == 5,]
pred <- pred[order(pred$rowIndex),]
points(df$x, pred$pred, col="red")