The last module of this course will be an overview of Machine Learning (ML), covering machine learning essentials and an introduction to support vector machines (SVM), random forests (RF), and neural networks (NN).

Here we begin with some conceptual and procedural frameworks that are essential for conducting analysis or research with machine learning tools. The most important considerations, which span a variety of tools, are how models are fit and how their predictions are evaluated.

We will spend most of this lecture note discussing various frameworks for how models are fit, but we first briefly cover some of the metrics used to evaluate predictive models.

Metrics for evaluation of models

After fitting a model, we can use it to predict the outcome for the same data used to fit the model (the training set) or for new data (the test set). As statisticians, you should now be familiar with the idea that the performance of the model on the training set will exaggerate the model's performance on new data, due to overfitting.

While some methods are more prone to overfitting than others, it is an unavoidable property of model training, which we will discuss more in this note. In either case we can evaluate the predictions with various metrics. For continuous data, three common evaluation metrics are the root mean squared error (RMSE), the mean absolute error (MAE), and the predictive \(R^2\).

Two common definitions for predictive \(R^2\) are the squared Pearson correlation of the true values with the predicted values (which cannot be negative) and \(1 - \text{SSR}/\text{SST}\), one minus the sum of squared residuals divided by the total sum of squares. Most of the time these will match, but they may not, depending on the form of linear model (regressing the true values on the predicted values) used to compute the sums of squares.
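As a small illustration, here is a sketch of the two definitions computed on made-up vectors (not data from this note); for poorly calibrated predictions the two can disagree, and the second can even be negative.

# two definitions of predictive R^2, illustrated on made-up vectors
obs  <- c(3.1, 2.5, 4.0, 5.2, 3.8)
pred <- c(2.9, 2.7, 3.5, 5.0, 4.1)

# definition 1: squared Pearson correlation (never negative)
cor(obs, pred)^2

# definition 2: 1 - SSR/SST (can be negative for poor predictions)
1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)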

For categorical data, there are also various evaluation metrics. If the outcome is binary, it is common to consider sensitivity, specificity, and precision. Two generic metrics, which apply beyond the binary case, are accuracy and Cohen's kappa.

Cohen’s kappa is 1 for perfect classification, 0 when the classifier is no better than random guessing, and can take negative values, meaning the prediction is somehow worse than random guessing.
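For reference, here is a minimal sketch of the kappa calculation from a confusion table with made-up counts (not from this note's data): kappa compares the observed agreement to the agreement expected by chance.

# Cohen's kappa from a 2x2 confusion table (rows = predicted, cols = observed)
tab <- matrix(c(40, 10, 5, 45), nrow=2, byrow=TRUE)
n <- sum(tab)
p.obs <- sum(diag(tab)) / n                      # observed agreement
p.exp <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
(kappa <- (p.obs - p.exp) / (1 - p.exp))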

Using caret for training and evaluating models

In the machine learning notes, we will leverage an R package called caret which provides a unified interface to many R implementations of machine learning algorithms. caret stands for Classification And REgression Training, meaning that the package can be used for predicting categorical (classification) or continuous data (regression).

While there may be reasons to use the individual packages themselves, caret gives us control of many functions in separate R packages (the degree of control varies across packages), while wrapping them up in a consistent way so that users are less likely to make mistakes, and code is easier to read.

Importantly: there are also a number of packages which are native to languages other than R, such as the Keras deep learning library in Python for construction of neural networks. While there is limited support for constructing Keras models via caret, for certain models it makes more sense to use Keras in Python, or the R keras package, rather than caret.

And as you go deeper into research in machine learning, and development of new methods, you will likely switch to C or C++ for the algorithm itself. In this last module of the course, we will not cover developing new machine learning methods, but instead show how the methods work (sometimes using R code to give a sense of the calculations being performed) and show how many algorithms can be fit within R using caret.

Within caret there are numerous functions for splitting data, and for evaluating and comparing models. The author of the caret package, Max Kuhn, has an online book for how to use caret, which is a useful reference for looking up, for example, a list of all the machine learning algorithms available.

Let’s start with a very simple example of using caret to run a linear model. Instead of using the lm function, we provide the predictor variables (sometimes called features) x and the outcome (or target) y to a function called train and specify method="lm". We can also pass additional arguments using the trainControl function which is provided to the trControl argument.

lm is a kind of funny example, because it doesn’t have any tuning parameters whereas most of the other methods will have some number of parameters that control behavior of the model, for example limits or penalizations for model complexity. caret actually lists a trivial parameter (intercept=TRUE or FALSE) for lm.
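If you want to check which tuning parameters caret exposes for a given method, the modelLookup function lists them; a quick sketch (output omitted here):

# list the tuning parameters caret exposes for a method
caret::modelLookup("lm")      # should show only the trivial 'intercept' parameter
caret::modelLookup("glmnet")  # by contrast, glmnet has alpha and lambda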

We will create some random Normally distributed data to demonstrate the basic concept of overfitting. We can see that despite there being no relationship between the outcome and the predictors (y consists simply of 1,000 random Normal data points), we can obtain a strong correlation with the predicted values.

This is one of the main concerns of machine learning methods: how to find a way to combine information in the predictors that will generalize to new data, while not getting “tricked” into finding spurious patterns or associations.

library(caret)
n <- 1000 # observations
p <- 500 # features

# simulate x
x <- matrix(rnorm(n*p),ncol=p,dimnames=list(seq_len(n),seq_len(p)))

# simulate y
y <- rnorm(n)

# fit model
fit <- train(x, y, method="lm", trControl=trainControl(method="none"))

# print results
fit$results
## [1] RMSE      Rsquared  MAE       intercept
## <0 rows> (or 0-length row.names)
# plot predicted values vs actual values
plot(y, predict(fit))

# print correlation
cor(y, predict(fit))
## [1] 0.7002372

Note the very high correlation, which comes from having (1) many features and (2) not many more observations than features.

Above, fit$results provides no output because we specified in trainControl that we will not use a resampling method to evaluate performance.

We can instead specify to use cross-validation, with 5 folds, and to save the predicted values. The default method for resampling in caret is bootstrap sampling, where, because sampling is with replacement, some samples appear more than once in each resample while others are left out, and these out-of-boot samples can be used for prediction.
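As an aside, a bootstrap-based setup would look like the following sketch (25 resamples mirrors what I believe is caret's default number of resamples); the rest of this note uses cross-validation instead.

# sketch: explicit bootstrap resampling setup
trCtlBoot <- trainControl(method="boot", number=25, savePredictions=TRUE)
fit.boot <- train(x, y, method="lm", trControl=trCtlBoot)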

# specify cross validation setup using trainControl
trCtl <- trainControl(method="cv", number=5, savePredictions=TRUE)

# fit model using trainControl specification
fit <- train(x, y, method="lm", trControl=trCtl)

In case you have not yet seen cross-validation, the procedure is to split the data into k roughly equally sized folds, and to then train k different models, each one trained on the data with one of the folds held out.

Each of the k models is different because the input data to each model leaves out one of the folds. Each model is therefore trained on roughly \(n - \frac{n}{k}\) data points. The k models can then be assessed by predicting the values of their held-out data points. We use cross-validation in this example, but note that other methods which involve repeated resampling help to produce accuracy metrics with smaller standard errors.
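To make the procedure concrete, here is a rough base-R sketch of 5-fold cross-validation for the linear model on the simulated data above; caret does all of this internally, and the fold assignment here is my own rather than caret's.

# a base-R sketch of k-fold cross-validation for a linear model
k <- 5
folds <- sample(rep(seq_len(k), length.out=n))  # random fold assignment
cv.pred <- numeric(n)
for (i in seq_len(k)) {
  hold <- folds == i
  df.train <- data.frame(y=y[!hold], x[!hold,])
  df.test  <- data.frame(x[hold,])
  fit.i <- lm(y ~ ., data=df.train)              # fit with fold i held out
  cv.pred[hold] <- predict(fit.i, newdata=df.test)
}
cor(y, cv.pred)^2  # predictive R^2 across folds (should be near 0 here)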

There are many other methods for evaluation via resampling, which can be found in the help page for ?trainControl under method.

While we had a very high correlation when we fit a linear model to the entire training set, here we see that the predictive \(R^2\) on the held out data in the cross-validation (CV) was nearly 0, indicating that the high correlation above was due to overfitting.

# print results
fit$results
##   intercept     RMSE    Rsquared      MAE     RMSESD  RsquaredSD
## 1      TRUE 1.601363 0.003842177 1.287237 0.06984746 0.003009917
##        MAESD
## 1 0.06565227
head(fit$pred)
##         pred         obs rowIndex intercept Resample
## 1  1.6390535 -1.18603497        5      TRUE    Fold1
## 2  1.2420609 -0.50253783        8      TRUE    Fold1
## 3 -1.1301362 -0.12019369       10      TRUE    Fold1
## 4 -0.6796758  0.14401127       13      TRUE    Fold1
## 5  0.9363983 -0.08409955       28      TRUE    Fold1
## 6 -1.5455422 -1.13141245       36      TRUE    Fold1

The fit object has a number of elements, which can be examined by looking at names(fit). The fit$pred element is a data.frame with a column for the observed value of the outcome and a column for the predicted value. Note that the observations are ordered by fold, not by the original order in the dataset.

What goes into the fit$pred table? If we had one or more tuning parameters, there would be multiple copies of predicted values, e.g. for each point in the tuning grid. We could pull out predictions in that case by specifying the value(s) of the tuning parameter(s).

Here, as we did not have a tuning parameter, we only have a single copy of each observation in the fit$pred table. Additionally, depending on the resampling method specified in trainControl, we could have multiple values per observation, per point in the tuning grid. For "boot" and "repeatedcv", for example, an observation can appear more than once in fit$pred.
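For tuned models (such as the glmnet fits later in this note), one convenient way to keep only the predictions made at the best tuning-parameter combination is to merge fit$pred with fit$bestTune on their shared tuning columns; this is a sketch, and is not needed for lm, which has only the trivial intercept parameter.

# sketch: keep only predictions made at the best tuning parameters
best.pred <- merge(fit$pred, fit$bestTune)
head(best.pred)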

By default, caret uses the squared-correlation definition of \(R^2\), which we confirm by manually calculating the value (there would be no reason to do this, except to convince oneself of what value is being presented in the table above).

# function calculates mean R^2 across folds
getR2 <- function(pred) {
  r2s <- sapply(1:5, function(i) {
    # rows of fit$pred belonging to fold i
    idx <- pred[,"Resample"] == paste0("Fold",i)
    # squared correlation of the observed (col 2) and predicted (col 1) values
    cor(pred[idx,2], pred[idx,1])^2
  })
  mean(r2s)
}
getR2(fit$pred)
## [1] 0.003842177

We can also plot the predicted values over the observed values, and color by the fold, showing that the linear model fit to the \(n - \frac{n}{k}\) data points had no predictive power for the held out data points.

library(ggplot2)
ggplot(fit$pred, aes(obs, pred, color=Resample)) + geom_point()

Feature selection using outcome must be in the loop

Many of the methods in caret have built-in feature selection properties, meaning that they pick out useful features and do not use features which are not helpful for predicting the outcome. This is in some sense wrapped up in the functions, although there are ways to control and extract the features that are selected.

Before we proceed, I want to show one example of how things can go very wrong in machine learning, if feature selection occurs before or outside of the cross-validation/resampling loop that is used for parameter tuning and evaluation. One can be misled into thinking that the predictions are much better than they actually are.

Below we assess the correlation of all of the features with the outcome y. As a rudimentary form of feature selection, we then compute the 80% quantile of the absolute values of the correlations, and keep the 20% of features with absolute correlation above this value to form a new set of predictors x.filt.

Because the feature selection occurred outside of the cross-validation loop, we can expect to find exaggerated performance, even though the performance is assessed in a “held-out” set (the held-out set was not held out for assessing the correlations).

One should not perform feature selection/screening on the entire dataset prior to or outside of cross validation!

cors <- cor(x, y)[,1]
q <- quantile(abs(cors), .8)
x.filt <- x[,abs(cors) > q]
fit <- train(x.filt, y, method="lm", trControl=trCtl)
fit$results
##   intercept      RMSE  Rsquared      MAE     RMSESD RsquaredSD      MAESD
## 1      TRUE 0.9615122 0.1162832 0.763827 0.06625501 0.03175793 0.05786917

Note that we get a small-to-moderate \(R^2\) (not so high, but clearly above 0), and the plot of predictions over the true, held-out data points looks decent. However, we know by construction that the predictors have no relationship to the outcome, and therefore offer no utility in prediction:

ggplot(fit$pred, aes(obs, pred, color=Resample)) + geom_point()

To repeat, many of the methods we will use will perform their own feature selection using the outcome variable, but it will occur within the loop.

In some cases, we will perform feature selection before running train; however, we will not use the outcome variable in these cases. We can, for example, remove features that have zero variance, or perform dimension reduction and retain the top N principal components, as sketched below.
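As a sketch of this kind of outcome-free screening, here is one way to use caret's nearZeroVar and preProcess helpers on the simulated x above; these steps are illustrative and are not part of the analysis in this note.

# outcome-free screening: drop (near) zero-variance features ...
nzv <- nearZeroVar(x)                       # indices of near-zero-variance columns
x.keep <- if (length(nzv) > 0) x[,-nzv] else x

# ... or reduce to the top principal components (no use of y)
pp <- preProcess(x.keep, method="pca", pcaComp=10)
x.pca <- predict(pp, x.keep)
dim(x.pca)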

Parameter tuning

A convenient feature of caret is that it helps us to tune parameters within the train function. For demonstration purposes, we will load the iris dataset and show how train performs tuning for a classification problem, here classifying the label of two species based on the four numeric features.

We add an additional set of 100 non-informative predictors to demonstrate the benefit of regularization using the elastic net implementation in the glmnet package.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
idx <- 51:150
x <- cbind(iris[idx,1:4], matrix(rnorm(100*100),ncol=100))
x <- scale(x)
y <- droplevels(iris$Species[idx])
table(y)
## y
## versicolor  virginica 
##         50         50

We build a grid of lambda (the regularization parameter) and alpha (the mixing parameter, where alpha=1 corresponds to lasso or L1 regularization, and alpha=0 corresponds to ridge or L2 regularization).

tg <- expand.grid(lambda=c(.01,.05,.1,.25,.5,.75), alpha=c(.1,.25,.5,.75,1))
head(tg,8)
##   lambda alpha
## 1   0.01  0.10
## 2   0.05  0.10
## 3   0.10  0.10
## 4   0.25  0.10
## 5   0.50  0.10
## 6   0.75  0.10
## 7   0.01  0.25
## 8   0.05  0.25

It is simple then to have train handle the running of models over a grid of parameters. Note that if we had not set tuneGrid, a default grid would be used.

fit <- train(x, y, method="glmnet", tuneGrid=tg)
ggplot(fit, metric="Kappa") + ylim(0,1)

We can ask for the best set of tuning parameters (simply the set that gave the best performance as defined by metric, where a custom performance metric could be provided).

fit$bestTune
##    alpha lambda
## 15   0.5    0.1
fit$results[rownames(fit$bestTune),]
##    alpha lambda  Accuracy     Kappa AccuracySD    KappaSD
## 15   0.5    0.1 0.9354176 0.8679864 0.04481797 0.09226396

Note that other schemes for choosing the best set of tuning parameters exist other than choosing the maximum performance. These are discussed in the choosing the final model section of the caret online book.

For example, Breiman et al. (1984) suggest picking the simplest model which is within one standard error of the optimal performance (where standard errors are estimated from the resampling results, depending on the resampling method we choose). The reason that train by default chooses the best performing model is likely explained by this quote from the book:

The main issue with [rules such as within-one-standard-error] is related to ordering the models from simplest to complex. In some cases, this is easy (e.g. simple trees, partial least squares), but in cases such as this model, the ordering of models is subjective.
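If you do want a rule like the one-standard-error rule, caret supports alternative selection rules through the selectionFunction argument of trainControl ("best", "oneSE", and "tolerance", to the best of my knowledge); a sketch using the glmnet grid above:

# sketch: select the simplest model within one SE of the best performance
trCtlOneSE <- trainControl(method="cv", number=5, selectionFunction="oneSE")
fit.oneSE <- train(x, y, method="glmnet", tuneGrid=tg, trControl=trCtlOneSE)
fit.oneSE$bestTune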

Training vs test set evaluation

So far we have discussed using either cross-validation or the out-of-boot samples for evaluation of model performance. This is useful, and helps as we saw earlier to avoid thinking we have a much higher predictive performance than we actually do.

However, because we look at many models, and then choose the best performing one, it is standard to use a wholly separate test set in order to evaluate the final model. The test set can be specified at the beginning, making sure to balance the labels equally across the split. More complex data splitting procedures are also available. Typically, between 20% and 30% of the dataset might be held out for testing. So, all model training, feature selection, tuning parameter selection, and cross validation will be performed on the training set samples only.

The full training and testing setup would then involve creating a training set index:

train.idx <- createDataPartition(y, p=.75, list=FALSE)
# just demonstrating that the labels are balanced:
table(seq_len(100) %in% train.idx, y)
##        y
##         versicolor virginica
##   FALSE         12        12
##   TRUE          38        38

Splitting the data:

train.x <- x[train.idx,]
train.y <- y[train.idx]
test.x <- x[-train.idx,]
test.y <- y[-train.idx]

Fitting the model and selecting the tuning parameters with the best out-of-boot performance:

fit <- train(train.x, train.y, method="glmnet", tuneGrid=tg)
fit$results[rownames(fit$bestTune),]
##    alpha lambda  Accuracy     Kappa AccuracySD    KappaSD
## 21  0.75    0.1 0.9426133 0.8841903 0.03902545 0.07853661

Assessing performance on the test set:

pred <- predict(fit, newdata=test.x)
table(pred, test.y)
##             test.y
## pred         versicolor virginica
##   versicolor         12         2
##   virginica           0        10

We can use a caret function to provide numerous metrics, including accuracy and Cohen’s kappa:

confusionMatrix(data=pred, reference=test.y)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   versicolor virginica
##   versicolor         12         2
##   virginica           0        10
##                                         
##                Accuracy : 0.9167        
##                  95% CI : (0.73, 0.9897)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : 1.794e-05     
##                                         
##                   Kappa : 0.8333        
##                                         
##  Mcnemar's Test P-Value : 0.4795        
##                                         
##             Sensitivity : 1.0000        
##             Specificity : 0.8333        
##          Pos Pred Value : 0.8571        
##          Neg Pred Value : 1.0000        
##              Prevalence : 0.5000        
##          Detection Rate : 0.5000        
##    Detection Prevalence : 0.5833        
##       Balanced Accuracy : 0.9167        
##                                         
##        'Positive' Class : versicolor    
## 

Or, for a more compact readout:

postResample(pred=pred, obs=test.y)
##  Accuracy     Kappa 
## 0.9166667 0.8333333

Bagging and boosting

Bagging and boosting are two important topics that come up across a number of machine learning methods, and so I want to introduce them during this introductory lecture note.

Bagging is short for bootstrap aggregating, wherein the potentially high variance of a simple learning algorithm (or “learner”) is mitigated by training it multiple times across bootstrap samples of the dataset. In addition, bagging may involve random sampling of the predictors as well, such that the different models are less correlated than they would be if they all used the same predictors. Bagging was proposed by Leo Breiman in a technical report in 1994, which was subsequently published in Machine Learning.

The final bagged model is an average over the simple models learned on the bootstrap samples (either outputting the average prediction for regression, or using a majority-rule vote for classification). Bagging is very often used with decision trees, which will be covered in more detail in a subsequent lecture note. It is a natural question why bagging is not used with other simple models, like linear regression. A 1998 paper by Skurichina and Duin, Bagging for linear classifiers, looks into this question, showing that linear classifiers are in the class of stable classifiers that don’t benefit from bagging. Breiman makes this point in his initial paper, that bagging is mostly useful for unstable classifiers.

Boosting is somewhat similar to bagging, in that a final meta-learner is constructed which should perform better than each individual learner. However, instead of the simple learners differing according to bootstrap samples of the observations, in boosting a series of learners is constructed. The series is constructed so that (after the first round) samples that were mis-classified in the previous round are up-weighted. At a given round, the previous round’s classifier and a new simple learner are added together, with a coefficient on the new simple learner chosen to minimize the error in the training set. We will show in practice how this procedure leads to improved predictions.

Contrived example of bagging

We can implement bagging in simple R code, and then show how it can be performed with the caret library. We will create a very unstable learner for our demonstration, to highlight the benefit of bagging. As we said above, typically bagging is used in combination with a decision tree as the learner. Suppose our data looks as follows:

makeData <- function(n) {
  sd <- .5
  
  # simulate first predictor, half negative half positive
  x1 <- c(rnorm(.5*n,-.5,sd),rnorm(.5*n,.5,sd))
  
  # simulate second predictor, half negative half positive
  x2 <- c(rnorm(.5*n,-.5,sd),rnorm(.5*n,.5,sd))
  
  # pick a small random subset of samples and make as outliers
  idx <- sample(n, .025*n)
  x1[idx] <- -10 * x1[idx]
  
  # create class labels (1 or 2) with equal prevalence
  # labels overlap with negative and positive mean values of x1 and x2
  y <- factor(rep(c(1,2),n*c(.5,.5)))
  data.frame(x1,x2,y)
}
set.seed(1)
dat <- makeData(100)
y <- dat$y
x <- dat[c("x1","x2")]
ggplot(dat, aes(x1,x2,col=y)) + geom_point()

We can see that we have a group of observations that are outlying to some degree. We will create a very unstable learner (just for demonstration) which puts a 10x weight on the left- and right-most observations, and then performs LDA.

library(MASS)
learner <- function(x,y) {
  # get min and max of 1st col of x
  idx <- c(which.max(x[,1]), which.min(x[,1]))
  wt <- 10
  # repeat these observations wt times in original data and add to x and y
  x <- rbind(x, do.call(rbind, lapply(seq_len(wt), function(i) x[idx,])))
  y <- c(y, rep(y[idx],wt))
  
  # run lda
  lda(x, y)
}

We can run it on our training data and plot the decision boundary. We can see that the up-weighting of the extreme points leads to a poor boundary for the majority of the points.

l <- learner(x,y)
s <- seq(from=-5,to=5,length.out=200)
s.wide <- seq(from=-7,to=7,length.out=200)
grid <- expand.grid(x1=s.wide,x2=s)
grid.dat <- data.frame(y=predict(l, grid)$class, grid)
ggplot(dat, aes(x1,x2,col=y)) + geom_point() + 
  geom_raster(data=grid.dat, aes(x1,x2,fill=y),alpha=.2,show.legend=FALSE)

Now we write a simple function to predict the classes of the training points:

learnerPred <- function(x,l) {
  predict(l, newdata=x)$class
}
table(pred=learnerPred(x,l), obs=y)
##     obs
## pred  1  2
##    1 44  8
##    2  6 42

A bagging version of the learner involves bootstrap resampling of the dataset. In some of the bootstrap samples, the most extreme points in the dataset will not be included, and so some of the learners will hopefully provide a better decision boundary for the majority of the samples. Again, remember, our learner is contrived in a way to make it very unstable, but you can imagine how individual data points might lead an unstable learner to overfit the training data.

B <- 400
reps <- lapply(seq_len(B), function(i) {
  boot <- sample(nrow(x), nrow(x), replace=TRUE)
  learner(x[boot,], y[boot])
})

Our “bagged” predictions simply involve running each of the B LDA models and then using a majority vote for the final class. Note that we do a bit better even on the training data.

bagged <- sapply(seq_len(B), function(i) {
  learnerPred(x, reps[[i]])
})
bag.pred <- ifelse(rowSums(bagged == "2") > B/2, "2", "1")
table(pred=bag.pred, obs=y)
##     obs
## pred  1  2
##    1 45  7
##    2  5 43

Now for some test data, we create 2,000 new observations. It’s very clear that the single learner is doing poorly here:

dat.test <- makeData(2000)
s.wide <- seq(from=-10,to=10,length.out=200)
grid <- expand.grid(x1=s.wide,x2=s)
grid.dat <- data.frame(y=predict(l, grid)$class, grid)
ggplot(dat.test, aes(x1,x2,col=y)) + geom_point() +
  geom_raster(data=grid.dat, aes(x1,x2,fill=y),alpha=.2,show.legend=FALSE)

We can sum up the errors:

y.test <- dat.test$y
x.test <- dat.test[,c("x1","x2")]
(tab <- table(pred=learnerPred(x.test,l), obs=y.test))
##     obs
## pred   1   2
##    1 818 186
##    2 182 814
tab[1,2] + tab[2,1]
## [1] 368

Now we can show that the bagging version of our unstable learner does in fact do better:

bagged <- sapply(seq_len(B), function(i) {
  learnerPred(x.test, reps[[i]])
})
bag.pred <- ifelse(rowSums(bagged == "2") > B/2, "2", "1")
(tab <- table(pred=bag.pred, obs=y.test))
##     obs
## pred   1   2
##    1 832 177
##    2 168 823
tab[1,2] + tab[2,1]
## [1] 345

Example of bagging trees

In our example above, we contrived a very unstable learner, which up-weighted outliers, and then wrote out simple R code to show what happens inside of a bagging method. However, we can easily access bagging models via caret, including Random Forests, which we will cover in a subsequent lecture note. This hides away the bootstrapping and majority voting code, and there are a number of bagging methods, as well as a general purpose bag interface within caret.

Below we use a bagging version of decision trees, which gets close to a Random Forest, but doesn’t involve sub-sampling of the features (here we only have two features anyway). If you think of individual decision trees as dividing the predictor space into blocks where a certain class (for classification) or constant value (for regression) is predicted, then by averaging over many trees, complex decision boundaries can be learned. Here we show how a bagging version of decision trees can be applied:

fit <- train(x, y, method="treebag")
fit$results
##   parameter  Accuracy     Kappa AccuracySD    KappaSD
## 1      none 0.9059868 0.8094862 0.04249848 0.08493453
pred <- predict(fit, x.test)
tab <- table(pred, obs=y.test)
tab[1,2] + tab[2,1]
## [1] 246

Note that our bagging method treebag outperforms simple LDA:

fit <- train(x, y, method="lda")
fit$results
##   parameter  Accuracy    Kappa AccuracySD    KappaSD
## 1      none 0.9062758 0.808836 0.03950847 0.07962864
pred <- predict(fit, x.test)
tab <- table(pred, obs=y.test)
tab[1,2] + tab[2,1]
## [1] 320

Example of boosting

As we have limited time in the course, we do not cover boosting in depth, but only provide a demonstration of its use. As described above, boosting involves iteratively re-fitting a learner, typically a tree, wherein the previous learners in the series are summed up and a new learner is added with additional focus on observations in the training set that were mis-classified in the previous round. As with bagging, this can lead to complex behavior and complex decision boundaries from a series of simple learners.
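To give a flavor of the re-weighting idea (not the exact algorithm used by the methods below), here is a rough AdaBoost-style sketch using decision stumps on x1, written for this note's two-class data with the labels recoded to -1/+1; edge cases (such as a stump with zero weighted error) are ignored.

# a rough AdaBoost-style sketch with decision stumps on x1 (illustration only)
yy <- ifelse(y == "2", 1, -1)          # recode the two classes as -1/+1
n.obs <- length(yy)
w <- rep(1/n.obs, n.obs)               # observation weights, initially uniform
rounds <- 20
score <- rep(0, n.obs)                 # running sum of weighted stump predictions
for (m in seq_len(rounds)) {
  # candidate thresholds on x1; pick the stump with smallest weighted error
  thr <- quantile(x[,1], probs=seq(0.05, 0.95, by=0.05))
  err.pos <- sapply(thr, function(t) sum(w * (ifelse(x[,1] > t, 1, -1) != yy)))
  err.all <- pmin(err.pos, 1 - err.pos)          # allow either direction
  best <- which.min(err.all)
  dir <- if (err.pos[best] <= 1 - err.pos[best]) 1 else -1
  pred <- dir * ifelse(x[,1] > thr[best], 1, -1)
  err <- err.all[best]
  alpha <- 0.5 * log((1 - err) / err)  # coefficient for this round's stump
  w <- w * exp(-alpha * yy * pred)     # up-weight mis-classified observations
  w <- w / sum(w)
  score <- score + alpha * pred
}
mean(sign(score) == yy)                # training accuracy of the boosted stumps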

A very powerful boosting method is GBM, for Gradient Boosting Machine (link to Wikipedia). It can be easily called within caret by specifying method="gbm". Another popular boosting method is AdaBoost (link to Wikipedia), which can be called with caret via method="adaboost".

Here we apply GBM to our simple dataset for demonstration of how the learned surface may look. Typically we would use GBM on more complex datasets with more than two dimensions, but also notice that it outperforms the other methods on the test data:

fit <- train(x, y, method="gbm", verbose=FALSE)
fit$results
##   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy     Kappa
## 1       0.1                 1             10      50 0.9129301 0.8247225
## 4       0.1                 2             10      50 0.9122605 0.8231197
## 7       0.1                 3             10      50 0.9112539 0.8209375
## 2       0.1                 1             10     100 0.9095699 0.8180027
## 5       0.1                 2             10     100 0.9054475 0.8093053
## 8       0.1                 3             10     100 0.9150598 0.8289267
## 3       0.1                 1             10     150 0.9084776 0.8157383
## 6       0.1                 2             10     150 0.9092055 0.8170606
## 9       0.1                 3             10     150 0.9152275 0.8292131
##   AccuracySD    KappaSD
## 1 0.05481155 0.10904552
## 4 0.05526216 0.11017979
## 7 0.05304382 0.10586669
## 2 0.05366997 0.10622395
## 5 0.05254884 0.10397842
## 8 0.04912892 0.09701099
## 3 0.05485924 0.10828011
## 6 0.05552140 0.10932977
## 9 0.05621780 0.11110432
pred <- predict(fit, x.test)
tab <- table(pred, obs=y.test)
tab[1,2] + tab[2,1]
## [1] 241

We can plot, with the final model, the training error over the rounds of boosting. At each round, the method is attempting to improve on certain mis-classified samples, while still retaining good classification on the “easy” samples:

plot(fit$finalModel$train.error, type="o",
     xlab="round", ylab="train error")

Before we plot the learned surface, let’s recall what the training data looked like:

ggplot(dat, aes(x1,x2,col=y)) + geom_point()

The plot function, called on the final model, can be used to make low-dimensional plots showing how the variables relate to the class predictions. From ?plot.gbm:

plot.gbm produces low dimensional projections of the gbm.object by integrating out the variables not included in the i.var argument. The function selects a grid of points and uses the weighted tree traversal method described in Friedman (2001) to do the integration.

Here we plot the surface for x1, for x2, and in the space of both variables. Note that, by adding over many simple learners, a complex shape can be created. Still, these boosted methods are quite good at not overfitting. GBM is one of the best methods for out-of-the-box performance, that is, performance without much “feature engineering”:

plot(fit$finalModel, i.var=1)

plot(fit$finalModel, i.var=2)

plot(fit$finalModel, i.var=1:2)

Complexity of a classifier

The final section of this lecture note will cover a formal definition of complexity that is used in machine learning, and a useful theorem giving a probabilistic upper bound on how far the test error will be from the training error of a classifier.

First, we will define the concept of an algorithm f with parameters \(\theta\) shattering a set of points. To shatter a set of points means that, for any labeling of the points, there exists a \(\theta\) such that f classifies all the points accurately. For the following examples, suppose we are working on binary classification, and for simplicity of notation, we will have the two class labels as \(\{-1,+1\}\).

Suppose we consider algorithms f which are of the form:

\[ \textrm{sign}(a x_1 + b x_2 + c) \]

with \(a,b > 0\) (and a 0 will be given a positive label). We can visualize a particular instance of a, b, and c:

Now let’s lay down a single point and ask if our algorithm f as defined above can shatter it. If we have a single positive point, we can simply move the line so that it is below the point:

Likewise, with a negative point, we can move the line above it:

Intuitively, f as defined above can shatter a single point.

Now, suppose we have two points, with one in the upper right and one in the bottom left quadrant. We can correctly classify them if the point in the upper right quadrant is positive:

…however there is no \(\theta\) that will be able to correctly classify both when the labels are reversed:

Let’s now consider another set of two points. If the point in the upper left quadrant is negative, we can tilt the slope very steeply:

And if the point in the upper left quadrant is positive, we can tilt the slope very shallowly:

So while f as defined above could not shatter the first set of points (the ones on the NE-SW axis), f could shatter the second set of points (the ones on the NW-SE axis).

If we lift the positive-valued restriction on a,b, you can work out with examples how a real-valued version of f can shatter a set of three points as long as they do not lie on a straight line.
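As a rough numerical check (a sketch, not a proof), one can enumerate all \(2^3\) labelings of three non-collinear points and search randomly over \(\theta = (a,b,c)\) for an instance of f that classifies each labeling correctly.

# numerically check that 3 non-collinear points can be shattered by
# sign(a*x1 + b*x2 + c) with real-valued a, b, c (a sketch, not a proof)
pts <- matrix(c(0,0, 1,0, 0,1), ncol=2, byrow=TRUE)
labelings <- expand.grid(p1=c(-1,1), p2=c(-1,1), p3=c(-1,1))
shattered <- apply(labelings, 1, function(lab) {
  any(replicate(2000, {
    theta <- rnorm(3)                       # random (a, b, c)
    pred <- sign(pts %*% theta[1:2] + theta[3])
    pred[pred == 0] <- 1                    # a 0 is given a positive label
    all(pred == lab)
  }))
})
all(shattered)   # expected TRUE (with high probability)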

Finally, the definition: The VC dimension of a classification algorithm is the size of the largest set of points that the algorithm can possibly shatter. There only has to be one such set of points, and so the positive-valued version of f has VC dimension 2 (no set of 3 points can be shattered) while the real-valued version of f has VC dimension 3 (no set of 4 points can be shattered). As is typical, finding a set of points which an algorithm can shatter is the easy part, while proving that no larger set can be shattered (that for any configuration of points, some labeling cannot be classified correctly by any \(\theta\)) is more difficult. VC stands for Vapnik–Chervonenkis, who outlined the definition in a 1971 paper, “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities”.

Our main interest is in knowing the degree to which our ability to predict the labels of the training set will generalize to our ability to predict new data that arises from the same distribution. Here, when we talk about training set error, we do not refer to the cross-validated or out-of-bag error, but the error on the same observations that are used for finding, e.g. \(\theta\) for f above.

My preferred walk-through of the VC dimension, and its relation to the bound on generalization error, is Andrew Ng’s notes on Learning Theory, and I recommend working through them. In particular, these notes show how the estimated error on the training set must be close to the generalization error (error on new data). For a finite set of fitted models (where a “fitted model” is a function that maps from the predictor space to the space of labels of the outcome), the generalization error can be bounded to within \(\gamma\) of the training error, for all of the fitted models, with probability:

\[ P \ge 1 - 2k\exp(-2 \gamma^2 n)\]

where k is the number of models and n is the number of samples. This bound uses the union bound and the Hoeffding inequality in its derivation. Then using this probability bound, one can also bound the generalization error for the best fitted model (in terms of minimizing error on the training samples) among the k options.
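To get a feel for the bound, here is a quick numeric illustration with made-up values of k, \(\gamma\), and n (not taken from anywhere in these notes).

# P >= 1 - 2*k*exp(-2*gamma^2*n): made-up values for illustration
k.models <- 10000; gamma <- 0.05
1 - 2 * k.models * exp(-2 * gamma^2 * 10000)  # n = 10,000: essentially 1
1 - 2 * k.models * exp(-2 * gamma^2 * 1000)   # n = 1,000: negative, i.e. vacuous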

However, we rarely work with finite sets of models, but instead work with infinite sets of models, such as the f we were describing above. Here, the VC dimension (if it is finite) becomes useful as a way of describing the complexity of the set. For a classification algorithm with VC dimension d, with probability at least \(1 - \delta\) the absolute difference between the generalization error and the training error will be:

\[ \le O\left( \sqrt{\frac{d}{n} \log \frac{n}{d} + \frac{1}{n} \log \frac{1}{\delta}} \right) \]

Thus for classification algorithms with finite VC dimension, a large sample size gives us a probabilistic guarantee that the difference between training and generalization error is small. Again, for a closer walk-through of these bounds, I recommend consulting Andrew Ng’s notes on Learning Theory.