This lecture note gives a brief introduction to (artificial) neural networks (we may include the word “artificial” to contrast with biological neural networks). We begin by reviewing logistic regression and the gradient descent/ascent method, and connect these to the perceptron. We then describe how layers of perceptrons can be combined into complex prediction algorithms. Finally, we construct various neural networks for classifying the MNIST handwritten digits dataset. The field of neural networks and deep learning is vast and quickly evolving; for those interested in exploring the topic further, I recommend searching for current methods implementing deep learning for a specific problem of interest.

This lecture note borrows from Andrew Ng’s course notes for supervised learning, Ng and Katanforoosh’s notes on deep learning, and the (offline) machine learning course notes of Guenther Walther.

Gradient descent for LS and logistic regression

Recall that if we want to minimize least squares in a simple linear model,

\[ \hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_p x_p + b, \]

we can use a gradient descent update rule for each predictor j, either a batch update:

\[ w_j^{t+1} = w_j^t + \lambda \textstyle \sum_{i=1}^N (y_i - \hat{y}_i) x_{ij}, \]

where we sum over all observations i, or stochastic gradient descent where we update the \(w_j\) by looking at one i at a time:

\[ w_j^{t+1} = w_j^t + \lambda (y_i - \hat{y}_i) x_{ij} \]

This is the gradient descent method: at each step we move each weight in the direction that reduces the sum of squared errors, and the parameter \(\lambda\) (the learning rate) controls the size of the step.
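As a quick illustration, the following is a minimal sketch of the batch update applied to simulated data; the variable names and simulation settings here are arbitrary choices for demonstration, not part of the notes above.

# a minimal sketch of batch gradient descent for least squares on simulated data
set.seed(5)
N <- 200
x <- matrix(rnorm(N * 2), ncol=2)
y <- drop(x %*% c(2, -1)) + 3 + rnorm(N, sd=0.5)
w <- c(0, 0)
b <- 0
lambda <- 0.001
for (t in 1:500) {
  y.hat <- drop(x %*% w + b)
  # each w_j moves by lambda * sum_i (y_i - yhat_i) * x_ij
  w <- w + lambda * drop(t(x) %*% (y - y.hat))
  b <- b + lambda * sum(y - y.hat)
}
round(c(w, b), 2) # should be close to the values used to simulate: (2, -1, 3)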

If we instead use the logistic function for a binary target,

\[ \hat{y} = f(w_1 x_1 + w_2 x_2 + \dots + w_p x_p + b), \]

where f is the logistic (sigmoid) function:

\[ f(z) = \frac{1}{1 + \exp(-z)}, \]

we have the same update rule for gradient descent of the negative log likelihood (here the stochastic version),

\[ w_j^{t+1} = w_j^t + \lambda (y_i - \hat{y}_i) x_{ij}, \]

where \(y_i \in \{ 0,1 \}\) and \(\hat{y}_i \in [0,1]\). We will see how we can build up from these relatively simple algorithms, using similar update rules, to learn more complex relationships between predictors and target.
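The same sketch as above can be adapted to the logistic case; again this is only an illustration with arbitrary simulation settings, cycling through the observations one at a time.

# a minimal sketch of the stochastic update for logistic regression on simulated binary data
set.seed(5)
N <- 500
x <- matrix(rnorm(N * 2), ncol=2)
p <- 1 / (1 + exp(-(drop(x %*% c(2, -1)) + 0.5)))
y <- rbinom(N, 1, p)
w <- c(0, 0)
b <- 0
lambda <- 0.1
for (epoch in 1:20) {
  for (i in sample(N)) { # visit the observations in random order
    y.hat <- 1 / (1 + exp(-(sum(w * x[i,]) + b)))
    # same form of update as before, but now y.hat = f(w'x + b)
    w <- w + lambda * (y[i] - y.hat) * x[i,]
    b <- b + lambda * (y[i] - y.hat)
  }
}
c(w, b) # compare to the coefficients used to simulate: (2, -1, 0.5)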

Perceptron

A limiting case of the logistic function is a step function that outputs only the values 0 or 1, switching at a critical value of \(z = 0\):
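Written explicitly, this step function is the indicator function that equals 1 when \(z \ge 0\) and 0 otherwise:

\[ f(z) = 1_{\{ z \ge 0 \}} \]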

If we use this step function instead of the logistic, the classification algorithm is called the perceptron. The perceptron was invented in 1957 at the Cornell Aeronautical Laboratory. From Wikipedia:

The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the “Mark 1 perceptron”. This machine was designed for image recognition: it had an array of 400 photocells, randomly connected to the “neurons”. Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.

For the perceptron, again we can use the update rule:

\[ w_j^{t+1} = w_j^t + \lambda (y_i - \hat{y}_i) x_{ij} \]

The course notes from Michael Collins on the perceptron contain a useful proof that if there exists a set of parameters such that all training samples can be classified correctly, then the above algorithm will converge.

Although the update rules look very similar, the following note from Andrew Ng’s notes on supervised learning is apt:

Note however that even though the perceptron may be cosmetically similar to the other algorithms we talked about, it is actually a very different type of algorithm than logistic regression and least squares linear regression; in particular, it is difficult to endow the perceptron’s predictions with meaningful probabilistic interpretations, or derive the perceptron as a maximum likelihood estimation algorithm.

Here we construct a simple case and show how the perceptron using the above update rule moves to reduce training errors.

# simulate two classes of points in two dimensions
set.seed(1)
n <- 100
x1 <- c(rnorm(n),rnorm(n,4))
x2 <- c(rnorm(n),rnorm(n,4))
# center and scale the predictors
x1 <- scale(x1)
x2 <- scale(x2)
x <- cbind(x1, x2)
# the first n points are class 0, the second n are class 1
y <- rep(0:1,each=n)
dat <- data.frame(y=factor(y),x1,x2)

Here we encode the update rule and run five updates, saving w and b before each update and after the last one (six sets of values in total):

# pick initial values for w and b
w <- c(-4,1)
b <- 3

# create objects to track w and b: the initial values plus 5 updates
w.mat <- matrix(nrow=6,ncol=2)
b.vec <- numeric(6)

# choose learning rate
lambda <- 0.01

# run the for loop: five updates of the perceptron rule
for (i in 1:5) {
  # save current w and b
  w.mat[i,] <- w
  b.vec[i] <- b
  
  # get yhat given current w and b via the perceptron (step function)
  y.hat <- ifelse(x %*% w + b >= 0, 1, 0)
  
  # update each w (two): w_j + lambda * sum_i (y_i - yhat_i) * x_ij
  for (j in 1:2) {
    w[j] <- w[j] + lambda * sum((y - y.hat) * x[,j])
  }
  
  # update b the same way
  b <- b + lambda * sum(y - y.hat)
}

# save final w and b (after the fifth update)
w.mat[6,] <- w
b.vec[6] <- b

# create data frame for plotting
dat2 <- data.frame(b=b.vec, w1=w.mat[,1], w2=w.mat[,2], iter=1:6)

We can plot the decision boundary at each iteration:

library(ggplot2)
ggplot(dat, aes(x1,x2,shape=y)) + geom_point() +
  scale_shape_discrete(solid=FALSE) +
  geom_abline(data=dat2, aes(intercept=-b/w2, slope=-w1/w2, color=iter), size=1) +
  scale_color_continuous(low="purple",high="orange")

It is clear that a single perceptron of this form will only generate linear decision boundaries. However, we can add multiple layers of perceptrons to form a more complex classifier.

First, let’s visualize the kind of contours we obtain when we go straight from the inputs to the target with a single linear combination (before applying any activation function), assuming \(w = (2, -1)\) and \(b = 1\):

s <- seq(from=-2,to=2,length=400)
# grid spanning all combinations of x1 and x2 values in s
grid <- expand.grid(x1=s,x2=s)
# compute the linear combination w'x + b across the grid region
grid$y <- with(grid, x1 * 2 + x2 * -1 + 1)
ggplot(grid, aes(x1,x2,z=y,fill=y)) +
  geom_raster(alpha=.5) +
  scale_fill_continuous(low="blue",high="green") +
  geom_contour()

Now, instead of a single set of weights from the input variables directly predicting y, we can add a hidden layer in between the inputs and the output. Suppose we add four nodes in between, giving us eight weights from the inputs to the hidden nodes (plus four bias/intercept terms), and four weights from the hidden nodes to the target (plus one bias term). What kind of shapes might be recoverable with such a network?

s <- seq(from=-2,to=2,length=400)
# creating grid again
grid <- expand.grid(x1=s,x2=s)
# this is the same as the previous example, except now we use the perceptron (step function)
# there are 4 nodes in the hidden layer, each with separate w's, between x and the predicted y
# node 1, w = (1,  1), b = 1
grid$h1 <- with(grid, ifelse(1*x1 + 1*x2 + 1 >= 0, 1, 0))
# node 2, w = (1,  -1), b = 1
grid$h2 <- with(grid, ifelse(1*x1 - 1*x2 + 1 >= 0, 1, 0))
# node 3, w = (-1,  1), b = 1
grid$h3 <- with(grid, ifelse(-1*x1 + 1*x2 + 1 >= 0, 1, 0))
# node 4, w = (-1,  -1), b = 1
grid$h4 <- with(grid, ifelse(-1*x1 - 1*x2 + 1 >= 0, 1, 0))
# final output node, taking input the 4 nodes from the hidden layer to predict y
# w = (1, 1, 1, 1) here, taking the output from each hidden layer node and b = -3.5
grid$y <- with(grid, 1*h1 + 1*h2 + 1*h3 + 1*h4 - 3.5)
# plot
ggplot(grid, aes(x1,x2,z=y,fill=y)) +
  geom_raster(alpha=.5) +
  scale_fill_continuous(low="blue",high="green") +
  geom_contour()

The hidden layer adds a round of intermediate transformations of the input variables. The outputs from these nodes are then passed on to an output layer for a final round of transformation. By composing these simple transformations (a linear combination followed by a nonlinear activation), one can achieve very complex decision boundaries.

In practice, the values of the weights and biases in each node are unknown and are therefore learned with gradient descent. More complex decision boundaries can be learned by adding additional nodes and hidden layers (the latter is where the “deep” in deep learning comes from).

There is a much more elegant visualization from TensorFlow of how multiple layers of nodes/neurons can be used to learn complex decision boundaries for various datasets (here with 2D inputs for ease of visualization):

Having gotten a sense for how multiple layers of nodes can be used to learn complex patterns, a natural question is how we can fit the weights for these multi-layer perceptrons (MLPs).

What if we attempt our approach from earlier: define a loss function, take the partial derivative with respect to each weight, and move some distance down the gradient at each iteration? There are a couple of problems with this naive gradient descent approach: (1) certain activation functions, such as the step function above and the rectified linear unit (ReLU), as well as the addition of hidden layers, make taking the derivative difficult, and (2) there will be many local minima where the descent method can get stuck.

Instead, a method called backpropagation can be used. This method starts with the layer closest to the target y and takes the partial derivative of the loss with respect to each weight in that layer. Each weight is then adjusted by adding to the old weight the gradient multiplied by the learning rate \(\lambda\). The weights for the nodes in the next layer back are then updated, which can be computed efficiently by using the chain rule to express the needed partial derivatives as a product of terms that have already been calculated. The name of the method stands for “backward propagation of error,” because the derivative of the error from the layers closer to the target is propagated backward through the network recursively. We do not derive backpropagation in this note (it is derived in the notes from Ng and Katanforoosh on deep learning and backpropagation), but will instead describe a number of options when building neural networks, before demonstrating the use of one for classifying handwritten digits.
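Although we do not derive it, the following is a minimal sketch of a single backpropagation update for a network with one hidden layer, sigmoid activations, and a squared-error loss; the function and variable names (backprop_step, W1, b1, W2, b2) are illustrative only, not from any package.

sigmoid <- function(z) 1 / (1 + exp(-z))

# one stochastic gradient descent step via backpropagation, for a single
# observation x (a numeric vector) and target y (0 or 1)
backprop_step <- function(x, y, W1, b1, W2, b2, lambda=0.01) {
  # forward pass
  z1 <- W1 %*% x + b1   # hidden layer pre-activations
  a1 <- sigmoid(z1)     # hidden layer activations
  z2 <- W2 %*% a1 + b2  # output pre-activation
  y.hat <- sigmoid(z2)  # predicted value
  # backward pass: start at the output and apply the chain rule
  delta2 <- (y.hat - y) * y.hat * (1 - y.hat)  # derivative of the squared-error loss wrt z2
  delta1 <- (t(W2) %*% delta2) * a1 * (1 - a1) # error propagated back to the hidden layer
  # gradient descent updates of the weights and biases
  list(W1 = W1 - lambda * delta1 %*% t(x),
       b1 = b1 - lambda * delta1,
       W2 = W2 - lambda * delta2 %*% t(a1),
       b2 = b2 - lambda * delta2)
}

# example with 2 inputs and 4 hidden nodes
set.seed(1)
pars <- backprop_step(x=c(0.5,-1), y=1,
                      W1=matrix(rnorm(8),nrow=4), b1=rnorm(4),
                      W2=matrix(rnorm(4),nrow=1), b2=rnorm(1))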

Activation functions

The term activation function describes the function \(f(z)\) that is applied to the weighted combination of the incoming nodes. Above we discussed the logistic, or sigmoid, activation function, as well as a step function \(f(z) = 1_{\{z \ge 0\}}\). There are numerous choices, which can be browsed on the Wikipedia page for activation functions. Properties to consider are whether the function has a continuous derivative, the steepness of the curve, whether it asymptotes for large absolute values of the input, and its range. Below we plot some of the more popular activation functions, including ones we will use in the example at the end of this note.
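For example, the sigmoid, hyperbolic tangent, and ReLU functions can be sketched with base R’s curve(); this is just a quick sketch of those three choices.

# sketch of three popular activation functions: sigmoid, tanh, and ReLU
curve(1 / (1 + exp(-x)), from=-4, to=4, ylim=c(-1.5, 2),
      xlab="z", ylab="f(z)", col="blue")
curve(tanh(x), add=TRUE, col="red")
curve(pmax(0, x), add=TRUE, col="darkgreen") # ReLU: f(z) = max(0, z)
legend("topleft", legend=c("sigmoid","tanh","ReLU"),
       col=c("blue","red","darkgreen"), lty=1)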

In addition, the softmax activation function is useful for mapping a vector in \(\mathbb{R}^n\) to a vector on the (n-1)-simplex, that is, a vector whose elements each lie between 0 and 1 and sum to 1:

\[ f_i(\vec{z}) = \frac{\exp(z_i)}{\sum_{j=1}^n \exp(z_j)} \]

This activation function will be useful when we are trying to predict multiple classes, such as the handwritten digits data (0-9) we will consider below.
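As a small sketch, the softmax can be written as a short R function; subtracting the maximum before exponentiating is a common, equivalent reformulation for numerical stability.

# softmax: maps a real vector to a probability vector
softmax <- function(z) {
  ez <- exp(z - max(z)) # subtracting max(z) does not change the result
  ez / sum(ez)
}
softmax(c(1, 2, 3)) # elements lie between 0 and 1 and sum to 1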

Regularization / weight decay

As with many classification or regression models with a large number of parameters to learn, these models can become unstable and are prone to overfitting. We therefore may benefit from regularization, that is, the penalization of large weights. As with penalized regression, we can consider an L1 or L2 penalty on the weights, with a tuning parameter that balances the loss function against the penalty term. Another implementation of regularization is to have the weights slowly decay during training, by simply multiplying the weights by a value slightly less than 1 at each iteration. It turns out that weight decay is equivalent to L2 regularization for certain training methods (plain stochastic gradient descent) but not for all algorithms, which should be kept in mind.
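A small numerical sketch of this equivalence for a single weight and a single gradient step (the variable names and values are arbitrary): an L2 penalty term \( (\alpha/2) w^2 \) added to the loss contributes \(\alpha w\) to the gradient, which gives the same update as first shrinking the weight by a factor \( (1 - \lambda \alpha) \).

# sketch: L2-penalized gradient step vs weight decay, for a single weight
w <- 0.5      # current weight
grad <- 0.2   # gradient of the (unpenalized) loss at w
lambda <- 0.1 # learning rate
alpha <- 0.01 # penalty / decay strength

# L2 regularization: the penalty (alpha/2) * w^2 adds alpha * w to the gradient
w_l2 <- w - lambda * (grad + alpha * w)

# weight decay: shrink the weight, then take the usual gradient step
w_decay <- (1 - lambda * alpha) * w - lambda * grad

c(w_l2, w_decay) # identical, which is the equivalence for plain SGD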

Convolutional layers

A final topic, which will be important for our application below, is the use of convolutional layers of nodes at the beginning of a neural network. These are an important component of networks designed to take images or 3D volumes as input, where the features \(x_j\) have a meaningful spatial relationship that can be captured by features learned in the hidden layers. A good extended reference for convolutional neural networks is the CNN lecture note by Andrej Karpathy.

One motivation of convolutional layers is that we may want hidden layers of the network to be able to detect common patterns across the image, regardless of where they occur positionally in the image. We can convolve a filter across the pixels of the image (or voxels of a volume), taking the sum of the element-wise product of the filter and each image patch as the output. As explained best by the lecture note linked above:

Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network.

Pooling layers are also typically added; these downsample the input and reduce the number of parameters to learn. One common approach (max pooling) is to take the maximum of every 2x2 sub-matrix and pass that value on to the next layer. A small sketch of both the convolution and pooling operations is given below.
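Here is a minimal sketch of both operations on a toy image, written with plain R loops for clarity; the functions conv2d and max_pool_2x2 are just illustrations, not from a package.

# convolve a small filter over an image: at each position, take the sum of the
# element-wise product of the filter and the image patch
conv2d <- function(img, filt) {
  k <- nrow(filt)
  out <- matrix(0, nrow(img) - k + 1, ncol(img) - k + 1)
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      out[i,j] <- sum(img[i:(i+k-1), j:(j+k-1)] * filt)
    }
  }
  out
}
# 2x2 max pooling: take the maximum of each non-overlapping 2x2 block
max_pool_2x2 <- function(img) {
  out <- matrix(0, nrow(img) %/% 2, ncol(img) %/% 2)
  for (i in 1:nrow(out)) {
    for (j in 1:ncol(out)) {
      out[i,j] <- max(img[(2*i-1):(2*i), (2*j-1):(2*j)])
    }
  }
  out
}
# toy 6x6 image: dark on the left half, bright on the right half
img <- matrix(rep(c(0,0,0,1,1,1), each=6), nrow=6)
# a filter that responds where pixel values increase from left to right (a vertical edge)
filt <- matrix(c(-1,-1,-1, 0,0,0, 1,1,1), nrow=3)
max_pool_2x2(conv2d(img, filt))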

MNIST handwritten digits

We will now explore a dataset where neural networks provide a clear advantage in prediction. The MNIST handwritten digits dataset is a famous one for machine learning algorithms. MNIST stands for “Modified National Institute of Standards and Technology”. From Wikipedia:

[MNIST] was created by “re-mixing” the samples from NIST’s original datasets. The creators felt that since NIST’s training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST’s training dataset, while the other half of the training set and the other half of the test set were taken from NIST’s testing dataset. There have been a number of scientific papers on attempts to achieve the lowest error rate; one paper, using a hierarchical system of convolutional neural networks, manages to get an error rate on the MNIST database of 0.23%.

We will begin by reading in the data and visualizing some of the digits. We use a CSV file created by Joseph Redmon, and some code from David Robinson’s blog post, both linked below:

The following R code, from David Robinson’s blog post, re-organizes the raw data so that we can visualize the digits in a 2D matrix. The predictors are the 28 x 28 = 784 pixels in the image, in grayscale, with values from 0 to 255. The target is the digit itself, which takes values from 0-9. In the CSV file, the first column is the target, and the remaining 784 columns are the grayscale values of the pixels. The predictors are ordered row-by-row, so the first 28 columns after the target column make up the first row of the image. The modulo arithmetic below gives us x- and y-coordinates for plotting the pixels.

library(readr)
library(tidyr)
library(dplyr)
memory.limit(size=56000)
## [1] 56000
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2303798 123.1    4276477 228.4  4086035 218.3
## Vcells 4975765  38.0   17935836 136.9 17925408 136.8
# read in data
mnist_raw <- read_csv("mnist_train.csv", col_names=FALSE)
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double()
## )
## i Use `spec()` for the full column specifications.
# check dimensions and a few observations
dim(mnist_raw)
## [1] 60000   785
mnist_raw[1:5,1:5]
## # A tibble: 5 x 5
##      X1    X2    X3    X4    X5
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     5     0     0     0     0
## 2     0     0     0     0     0
## 3     4     0     0     0     0
## 4     1     0     0     0     0
## 5     9     0     0     0     0
# reshape into table with columns as the following
  # label
  # obs ID (instance)
  # pixel number
  # value of pixel
  # x coordinate of pixel in image
  # y coordinate of pixel in image

pixels_gathered <- mnist_raw %>%
  head(10000) %>% # take 1st 10000 obs
  rename(label = X1) %>%
  mutate(instance = row_number()) %>%
  gather(pixel, value, -label, -instance) %>%
  tidyr::extract(pixel, "pixel", "(\\d+)", convert = TRUE) %>%
  mutate(pixel = pixel - 2,
         x = pixel %% 28,
         y = 28 - pixel %/% 28)
gc()
##             used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   2415566 129.1    4276477  228.4   4276477  228.4
## Vcells 100245548 764.9  163012760 1243.7 162966231 1243.4
library(ggplot2)
sub <- pixels_gathered %>% filter(instance <= 12)
ggplot(sub, aes(x, y, fill = value)) + geom_tile() +
  facet_wrap(~ instance + label, labeller=label_both) +
  scale_fill_gradient(low="white", high="black")

We will now forget for a while that the predictors make up pixels in a 2D matrix (until the end, when we use convolutional layers). For now, we will simply treat the predictors as exchangeable and not spatially oriented.

Before we try to predict all 10 digits, let’s first try to predict just the digits 3, 5, and 8. We will see that even these three digits are not trivial to classify:

# drop label and normalize values to between 0 and 1
dat <- as.data.frame(mnist_raw[1:10000,-1])/255

# convert label to factor 
y <- factor(mnist_raw$X1[1:10000])

# subset only to images with labels 3, 5, or 8
idx <- y %in% c("3","5","8")
dat <- dat[idx,]
y <- droplevels(y[idx])

# pca
pc <- prcomp(dat)

We can reduce the dimensionality of the data substantially by taking just the top PCs:

plot(pc$sdev[1:20]^2/sum(pc$sdev^2))

Just taking the top 7 PCs, we can see there is some separation along, e.g. PC1, or looking at PC1 vs PC2:

x <- pc$x[,1:7]
boxplot(x[,1] ~ y)

ggplot(data.frame(x,y), aes(PC1, PC2, col=y)) + geom_point()

table(y)
## y
##    3    5    8 
## 1032  863  944

A very simple method for classification is k-nearest neighbors (kNN), where we calculate the Euclidean distance of each sample in the test or out-of-bag set to every sample in the training set. For classification problems, a majority vote of the closest k training samples becomes the prediction. One difficulty for kNN is that distances in high-dimensional space become problematic, due to what is called the curse of dimensionality.

Briefly, it becomes difficult to get a good sense of which points are neighbors in high-dimensional space. Another problem is that the algorithm is not sparse: we have to save the entire training dataset in order to classify new points. However, it is very straightforward to implement, so it makes sense as a first try (or as a lower bound on accuracy that we will try to improve upon).
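Before turning to caret’s implementation, here is a minimal from-scratch sketch of what kNN does for a single new point; the function name knn_predict_one is just for illustration.

# classify one new point by majority vote among its k nearest training points
knn_predict_one <- function(x_new, x_train, y_train, k=5) {
  # Euclidean distance from the new point to every training point
  d <- sqrt(colSums((t(x_train) - x_new)^2))
  # majority vote among the labels of the k closest points
  votes <- table(y_train[order(d)[1:k]])
  names(votes)[which.max(votes)]
}
# e.g. classify the first sample using the remaining samples as training data
knn_predict_one(x[1,], x[-1,], y[-1], k=5)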

library(caret)
## Loading required package: lattice
tg <- data.frame(k=c(5,9,13,17,21,25))
trCtl <- trainControl(savePredictions=TRUE)
fit <- train(x, y, method="knn", tuneGrid=tg, trControl=trCtl)

We can plot the kappa statistic for various settings of k:

ggplot(fit, metric="Kappa")

We can also look at the results table and the confusion matrix, to see which digits are being misclassified:

fit$results
##    k  Accuracy     Kappa  AccuracySD     KappaSD
## 1  5 0.8897146 0.8340735 0.008558958 0.012861345
## 2  9 0.8999439 0.8494692 0.006909302 0.010257884
## 3 13 0.9042597 0.8559725 0.006897549 0.010311313
## 4 17 0.9040465 0.8556389 0.007048234 0.010546805
## 5 21 0.9051552 0.8573083 0.005691909 0.008524157
## 6 25 0.9046189 0.8565088 0.005769440 0.008646379
tab <- table(obs=fit$pred$obs, pred=fit$pred$pred)
prop <- tab/rowSums(tab)
round(prop, 3) * 100 # percent
##    pred
## obs    3    5    8
##   3 87.5  4.9  7.6
##   5  5.4 91.5  3.2
##   8  5.1  3.1 91.8

Rather than looking at a single kappa statistic, we will write a small function to plot the statistic along with a ribbon of 2 times its standard deviation over the bootstrap replications:

# plot the Kappa statistic across a tuning parameter, with a ribbon of
# plus/minus `a` standard deviations (over the resamples)
plotWithSD <- function(fit, param, a=2) {
  # y-axis limits, padded a bit beyond the ribbon
  min <- with(fit$results, min(Kappa - (a+2)*KappaSD))
  max <- with(fit$results, max(Kappa + (a+2)*KappaSD))
  # upper and lower bounds of the ribbon
  fit$results$ymax <- with(fit$results, Kappa + a*KappaSD)
  fit$results$ymin <- with(fit$results, Kappa - a*KappaSD)
  ggplot(fit$results, aes_string(param, "Kappa", ymax="ymax", ymin="ymin")) +
    geom_ribbon(fill="black", alpha=.1) +
    geom_point(color="blue") + geom_line(color="blue") +
    ylim(min,max)
}

We see that the kappa statistic plateaus around 0.85 even though we are only trying to classify the digits 3, 5, and 8.

plotWithSD(fit, "k")

Now let’s return to all the digits and see how we do with our kNN approach:

dat <- as.data.frame(mnist_raw[1:10000,-1])/255
y <- factor(mnist_raw$X1[1:10000])
pc <- prcomp(dat)

Again, looking at PC1 alone and at PC1 vs PC2, we see some separation of the digits already. We will work with the top 10 PCs (chosen arbitrarily, mostly for computational speed when training the methods below).

x <- pc$x[,1:10]
boxplot(x[,1] ~ y)

ggplot(data.frame(x,y), aes(PC1, PC2, col=y)) + geom_point()

We can again try kNN and we see that it’s doing fairly well across all 10 digits, with a kappa statistic that plateaus at around 0.885.

tg <- data.frame(k=c(5,9,13,17,21,25))
trCtl <- trainControl(savePredictions=TRUE)
fit.knn <- train(x, y, method="knn", tuneGrid=tg, trControl=trCtl)
fit.knn$results # 88% Kappa
##    k  Accuracy     Kappa  AccuracySD     KappaSD
## 1  5 0.8881591 0.8756581 0.004839265 0.005370614
## 2  9 0.8964846 0.8849138 0.005085023 0.005642203
## 3 13 0.8973797 0.8859076 0.004476736 0.004970826
## 4 17 0.8970967 0.8855936 0.005460835 0.006064259
## 5 21 0.8961087 0.8844949 0.004554603 0.005056181
## 6 25 0.8946448 0.8828658 0.004231196 0.004696317
plotWithSD(fit.knn, "k")

It’s useful to look at the confusion matrix:

tab <- table(obs=fit.knn$pred$obs, pred=fit.knn$pred$pred)
prop <- tab/rowSums(tab)
round(prop,3)*100 # percent
##    pred
## obs    0    1    2    3    4    5    6    7    8    9
##   0 95.8  0.0  0.2  0.6  0.1  1.5  1.3  0.1  0.3  0.1
##   1  0.0 97.9  0.6  0.2  0.1  0.0  0.1  0.1  0.3  0.4
##   2  0.8  1.0 92.3  1.1  0.5  0.4  0.9  0.8  1.6  0.4
##   3  0.9  1.2  1.3 81.2  0.6  2.9  0.5  1.2  8.3  2.0
##   4  0.0  0.3  0.5  0.1 83.5  0.7  1.7  0.7  0.2 12.3
##   5  2.5  0.5  0.2  2.8  0.6 87.8  2.5  0.2  1.8  1.3
##   6  1.5  0.4  0.1  0.0  0.2  1.2 96.2  0.0  0.2  0.1
##   7  0.1  1.4  0.6  0.1  1.2  0.3  0.0 90.7  0.4  5.3
##   8  0.7  1.2  1.2  4.5  1.2  3.2  1.5  0.4 84.1  2.1
##   9  0.6  0.8  0.2  1.6  7.9  0.7  0.8  3.0  0.6 83.9
diag(prop) <- NA

We can plot this, after removing the diagonal elements, and highlight the top three mistakes that are made by our kNN approach. We see that 4 and 9 are often misclassified as each other, as well as 3 and 8.

image(prop, xaxt="n", yaxt="n",
      xlab="obs", ylab="pred",
      col=colorRampPalette(c("white","blue"))(50))
for (i in 1:2) axis(i, 0:9/9, 0:9)
abline(0,1,col="red")
mxs <- head(sort(prop,decreasing=TRUE),3)
lbl <- 100*round(mxs,3)
arr.idx <- sapply(mxs, function(i) which(prop == i, arr.ind=TRUE)) - 1
text(arr.idx[1,]/9, arr.idx[2,]/9, labels=lbl, cex=.9, col="white")

We can also try a linear SVM (here using the top 20 PCs). A radial basis function SVM does outperform it but takes about 15 times longer to fit (we observed a kappa statistic up to 0.94). The linear SVM doesn’t do much better than kNN but is more efficient: it doesn’t store the whole dataset, only the support vectors (although here we have 2,000+ support vectors).

x <- pc$x[,1:20]
fit <- train(x, y, method="svmLinear", trControl=trCtl)
fit$results
##   C  Accuracy     Kappa  AccuracySD     KappaSD
## 1 1 0.8978941 0.8864713 0.003892379 0.004323728
fit$finalModel@nSV
## [1] 2442

Multi-layer perceptron

There are a few neural network implementations that can be accessed via caret, but we will show that these underperform compared to a fully specified neural network with convolutional layers. Here, we take the top 60 PCs and train a multi-layer perceptron (MLP) with a single hidden layer of 60 nodes.

x <- pc$x[,1:60]
tg <- data.frame(size=60)
trCtl <- trainControl(method="cv", number=5, savePredictions=TRUE, verboseIter=TRUE)
fit <- train(x, y, method="mlp", trControl=trCtl, tuneGrid=tg, maxit=50)
## + Fold1: size=60 
## - Fold1: size=60 
## + Fold2: size=60 
## - Fold2: size=60 
## + Fold3: size=60 
## - Fold3: size=60 
## + Fold4: size=60 
## - Fold4: size=60 
## + Fold5: size=60 
## - Fold5: size=60 
## Aggregating results
## Fitting final model on full training set

This model gives us a kappa statistic of 0.94, which is also close to what we observed with the radial basis function SVM.

fit$results
##   size  Accuracy     Kappa  AccuracySD     KappaSD
## 1   60 0.9453012 0.9391898 0.005491725 0.006103842

Let’s again see what type of digits are being misclassified:

tab <- table(obs=fit$pred$obs, pred=fit$pred$pred)
prop <- tab/rowSums(tab)
round(prop, 3) * 100 # percent
##    pred
## obs    0    1    2    3    4    5    6    7    8    9
##   0 98.6  0.0  0.2  0.1  0.3  0.3  0.1  0.1  0.2  0.1
##   1  0.0 97.4  0.5  0.4  0.3  0.3  0.1  0.4  0.4  0.2
##   2  0.5  0.7 93.0  1.1  0.6  0.3  0.7  1.1  1.5  0.4
##   3  0.2  0.5  1.0 91.6  0.1  1.9  0.4  1.4  2.1  0.9
##   4  0.2  0.2  0.5  0.1 94.6  0.1  1.1  0.4  0.3  2.4
##   5  0.5  0.1  0.9  2.2  0.8 91.4  1.9  0.2  1.4  0.6
##   6  0.6  0.4  0.2  0.1  0.6  0.6 96.7  0.1  0.7  0.0
##   7  0.1  0.7  1.3  0.6  0.7  0.2  0.0 95.5  0.3  0.7
##   8  0.5  1.2  1.5  1.1  0.3  1.0  1.3  0.5 91.8  0.8
##   9  0.5  0.3  0.1  0.9  2.6  0.2  0.1  1.2  0.5 93.6

Again, the 4/9 confusion dominates: predicting 4 instead of 9 is now the worst error (2.6%), with the reverse close behind (2.4%), but these rates have dropped considerably compared to kNN.

Convolutional neural network

Finally, we show an example of a convolutional neural network (CNN) using the keras R package, which provides an interface to the Keras Python deep learning library. The code below is copied from the following example. We do not motivate all of the choices, as they have been tuned specifically for this problem by researchers who have focused deeply on the MNIST data, handwriting, and image recognition in general.

We begin by converting the target y to a matrix with 10 binary indicator columns (one per digit):

library(keras)
y_train <- to_categorical(y)
head(y_train)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    1    0    0    0     0
## [2,]    1    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    1    0    0    0    0     0
## [4,]    0    1    0    0    0    0    0    0    0     0
## [5,]    0    0    0    0    0    0    0    0    0     1
## [6,]    0    0    1    0    0    0    0    0    0     0

The following parameters are chosen for the MNIST data in particular (the image size and number of classes are of course determined by the problem).

batch_size <- 128
num_classes <- 10
epochs <- 12
img_rows <- 28
img_cols <- 28

We also reshape x into a 4D array (observations, image rows, image columns, channels), which takes into account the fact that the predictors form a 2D array of pixels.

x <- as.matrix(dat)
x_train <- array_reshape(x, c(nrow(x), img_rows, img_cols, 1))
dim(x_train)
## [1] 10000    28    28     1
input_shape <- c(img_rows, img_cols, 1)

The following code chunk specifies the model: two 2D convolutional layers, a 2D max pooling layer, a dropout layer, and a flattening layer, followed by a dense layer with 128 nodes and a ReLU activation function, another dropout layer, and finally a dense output layer with a softmax activation function. Each epoch takes about 20 seconds, and we run 12 epochs in total.

set.seed(1)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                input_shape = input_shape) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = num_classes, activation = 'softmax')

# Compile model
model %>% compile(
  loss = loss_categorical_crossentropy,
  optimizer = optimizer_adadelta(),
  metrics = c('accuracy')
  )

# Train model
history <- model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = 0.2
  )

We can plot the accuracy of the model over time. Note that the validation accuracy approaches 98%, which beats our previous approaches (the MLP had about 94% accuracy).

plot(history)
## `geom_smooth()` using formula 'y ~ x'

We can evaluate the CNN on a test set (we only used the first 10,000 rows to train our model).

x2 <- as.matrix(mnist_raw[10001:20000,-1])/255
x_test <- array_reshape(x2, c(nrow(x2), img_rows, img_cols, 1))
y_test <- to_categorical(factor(mnist_raw$X1[10001:20000]))
scores <- model %>% evaluate(
  x_test, y_test, verbose = 0
  )
scores
##       loss   accuracy 
## 0.08599144 0.97530001

We can get the predicted classes:

pred <- model %>% predict_classes(x_test)

We can see we’ve further reduced the misclassification rates, with the worst now under 2%.

y.test <- factor(mnist_raw$X1[10001:20000])
tab <- table(obs=y.test, pred=pred)
prop <- tab/rowSums(tab)
round(prop, 3) * 100 # percent
##    pred
## obs    0    1    2    3    4    5    6    7    8    9
##   0 98.2  0.1  0.3  0.0  0.1  0.1  0.8  0.1  0.1  0.2
##   1  0.0 99.3  0.2  0.1  0.2  0.0  0.0  0.1  0.2  0.0
##   2  0.1  0.1 97.7  0.2  0.4  0.0  0.2  1.0  0.3  0.0
##   3  0.0  0.5  1.8 96.0  0.1  0.3  0.1  0.4  0.6  0.3
##   4  0.0  0.3  0.2  0.0 98.7  0.0  0.2  0.1  0.1  0.4
##   5  0.0  0.0  0.0  0.7  0.0 96.7  0.3  0.1  1.8  0.4
##   6  0.2  0.3  0.0  0.1  0.4  0.3 98.2  0.0  0.4  0.0
##   7  0.0  0.2  0.2  0.1  0.3  0.0  0.0 98.5  0.2  0.5
##   8  0.0  1.6  0.3  0.4  0.7  0.3  0.4  0.1 94.9  1.2
##   9  0.2  0.2  0.1  0.1  1.0  0.2  0.1  0.6  0.7 96.9