Support vector machines (SVM) are a very popular machine learning method for binary classification, and they can be fit efficiently even for large datasets. SVM have a number of desirable features: they can perform classification in a non-linear space using kernels, they tend to generalize well to new data, and they work well in very high dimensional spaces.

First look at SVM

Here we will begin by demonstrating how SVM differ from other linear classifiers, such as LDA (linear discriminant analysis), on simulated data (two dimensional, so we can easily visualize the hyperplane that defines the classification boundary). We will then show the objective function that the SVM minimizes (in the completely separable case) and how it can be solved with quadratic programming. Finally, we will discuss how various kernels can be used in combination with SVM.

First we construct some data that happens to be completely separable by a hyperplane in two dimensions:

set.seed(2)
n <- 100
x1 <- c(rnorm(n),rnorm(n,5))
x2 <- c(rnorm(n),rnorm(n,5))
x1 <- scale(x1)
x2 <- scale(x2)
y <- factor(rep(c(-1,1),each=n))
dat <- data.frame(y,x1,x2)
library(ggplot2)
ggplot(dat, aes(x1,x2,col=y)) + geom_point()

Using LDA gives us the following classification boundary. We use code below that generalizes to other cases (including multiple classes, or non-linear boundaries, as we will explore below). As you may recall, the LDA boundary is determined by the likelihood ratio of two Gaussian distributions fit to the data of the two classes. We draw the centers of the Gaussians with crosses:

library(caret)
x <- data.frame(x1,x2)
lfit <- train(x, y, method="lda")
coefs <- lfit$finalModel$scaling
means <- data.frame(lfit$finalModel$means, y=factor(c(-1,1)))
s <- seq(from=-2,to=2,length=400)
grid <- expand.grid(x1=s,x2=s)
grid$y <- as.numeric(predict(lfit, newdata=grid))
ggplot(dat, aes(x1,x2,col=y)) + geom_point() +
  geom_point(data=means,shape=10,size=10,stroke=2,alpha=.5,show.legend=FALSE) +
  geom_contour(data=grid, aes(x1,x2,z=y), breaks=1.5, col="black")

We will go ahead and fit a linear kernel SVM (the kernel terminology will be explained later) to the same data. We will see a boundary similar to the one above, but with a slightly different slope.

# for standard SVM usage, do not set this `C` parameter so high
# this will be discussed later when we talk about "soft margin" SVM
tg <- data.frame(C=100)
fit <- train(x, y, method="svmLinear", tuneGrid=tg)
alpha <- fit$finalModel@alpha[[1]]
sv <- as.data.frame(x[fit$finalModel@SVindex,]) # the "support vectors"
sv.y <- 2 * (as.numeric(y[fit$finalModel@SVindex]) - 1.5)
w <- colSums(alpha * sv.y * as.matrix(sv))
b <- fit$finalModel@b
grid <- expand.grid(x1=s,x2=s)
grid$y.cont <- (as.matrix(grid[,1:2]) %*% w - b)[,1]
ggplot(dat, aes(x1,x2,col=y)) + geom_point() + 
  geom_point(data=sv, col="black", size=5, shape=21) +
  geom_contour(data=grid, aes(x1,x2,z=y.cont), breaks=c(-1,0,1), col="black")

The key difference is that, instead of modeling the data as two Gaussians, the SVM has attempted to put the widest margin between the two groups of samples. This ends up being equivalent to finding a set of points which define the boundary between the two groups, and putting a wide band between those sets of points. The code may not make much sense now, but it is extracting the key parameters w and b which define the following rules:

\[ w^T x - b = 1 \]

\[ w^T x - b = -1 \]

Anything on or above the line defined by the first equation will be classified as +1, while anything on or below the line defined by the second equation will be classified as -1. We then draw the lines for 1, 0, and -1 to show the boundaries and the center of the margin dividing the two groups. The margin lines pass through a set of data points; these are called the support vectors. It is the nature of the constrained optimization of the SVM that a subset (sometimes small) of the training dataset ends up defining the decision boundary.
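As a quick illustration (a small sketch, not part of the caret workflow above), we can apply this rule directly with the w and b we just extracted:

pred <- sign(as.matrix(x) %*% w - b)[,1] # sign(w^T x - b) gives the side of the boundary
table(pred, y) # each class should fall entirely on one side for this separable data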

And just to show that SVM can be used to do more interesting things than finding a line between two sets of points, we show how, by simply swapping out the linear kernel (used whenever we compute the dot product between two observations),

\[ K\left(x,x'\right) = \left\langle x, x' \right\rangle, \]

for a radial basis function kernel, that is,

\[ K\left(x,x'\right) = \exp\left(-\gamma \left\|x-x'\right\|^2 \right), \]

we can use the same SVM routine to find a different set of support vectors (defining the boundary of points from all sides), and a very different classification boundary. Again, we will discuss how kernels are relevant to SVM in a section below.
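For concreteness, here is a minimal sketch of the radial basis function kernel formula above; gamma is a free bandwidth parameter (kernlab, which backs caret's svmRadial, calls this parameter sigma):

rbf.kernel <- function(x, xp, gamma=1) exp(-gamma * sum((x - xp)^2))
rbf.kernel(c(0,0), c(0.1,0.1)) # nearby points: kernel value close to 1
rbf.kernel(c(0,0), c(3,3))     # distant points: kernel value close to 0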

rfit <- train(x, y, method="svmRadial")
rsv <- as.data.frame(x[rfit$finalModel@SVindex,])
grid <- expand.grid(x1=s,x2=s)
grid$y <- predict(rfit, newdata=grid)
grid$yy <- 2*(as.numeric(grid$y) - 1.5)
ggplot(dat, aes(x1,x2,col=y)) + geom_point() + 
  geom_point(data=rsv, col="black", size=5, shape=21) +
  geom_contour(data=grid, aes(x1,x2,z=yy), breaks=0, col="black") +
  geom_raster(data=grid, aes(x1,x2,fill=y), alpha=.2)

Motivation behind the SVM solution

First we will give some motivation for how we solve for w and b above. The notes below closely follow Andrew Ng’s notes on SVM, which I recommend for a more in-depth derivation and for details on the algorithms that solve the SVM.

Again, supposing we have two linearly separable sets of points, we want to find w and b so that the data are correctly classified, that is, \(w^T x - b \ge 1\) for all the data with \(y=1\) and \(w^T x - b \le -1\) for all the data with \(y=-1\). The distance between these two hyperplanes is given by:

\[ \frac{(1 + b) - (-1 + b)}{\|w\|} = \frac{2}{\|w\|} \]

and so to make the margin as wide as possible corresponds to minimizing \(\|w\|\). The constrained optimization is then:

\[ \begin{aligned} & \underset{w,b}{\text{min}} & & \|w\| \\ & \text{s.t.} & & w^T x_i - b \ge 1 : y_i = 1 \\ & & & w^T x_i - b \le -1 : y_i = -1 \end{aligned} \]

Note that multiplying both of the constraints by \(y_i\) then gives a cleaner form:

\[ \begin{aligned} & \underset{w,b}{\text{min}} & & \|w\| \\ & \text{s.t.} & & y_i(w^T x_i - b) \ge 1,\quad i=1,\dots,n \\ \end{aligned} \]

And we can square the norm and multiply by one half to make the optimization even easier, because we will have a quadratic objective to minimize, and linear constraints.

\[ \begin{aligned} & \underset{w,b}{\text{min}} & & \tfrac{1}{2} w^T w \\ & \text{s.t.} & & y_i(w^T x_i - b) \ge 1,\quad i=1,\dots,n \\ \end{aligned} \]
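As a quick numeric tie-in (a sketch using the w extracted from the near-hard-margin C=100 fit above), the realized margin width is \(2/\|w\|\):

c(norm=sqrt(sum(w^2)), margin=2/sqrt(sum(w^2))) # ||w|| and the margin width 2/||w||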

SVM objective solved with quadratic programming

We can take the above constrained optimization formulation and directly plug it into a quadratic programming package to find the optimal margin for the training data. The quadprog package in R offers optimization for problems of the form:

\[ \begin{aligned} & \underset{b}{\text{min}} & & -d^T b + \tfrac{1}{2} b^T D b \\ & \text{s.t.} & & A^T b \ge b_0 \\ \end{aligned} \]

Unfortunately, the quadprog notation uses b for the vector of unknowns, in addition to the b that typically appears in the SVM problem. We will refer to quadprog's b as \(b'\). Nevertheless, we can map our problem into their notation by setting \(d=0\), \(b' = [w,b]\), \(D = \bigl(\begin{smallmatrix}I & 0 \\ 0 & 0\end{smallmatrix}\bigr)\), \(A^T = [y x^1, y x^2, \dots, y x^p, -y]\), and \(b_0 = [1,\dots,1]\). Here I have used \(y x^j\) to refer to a column vector where each \(y_i\) is multiplied by sample i’s value for the j-th predictor, \(x_i^j\).

Converting our SVM notation to the notation of quadprog gives:

library(quadprog)
# our problem:   min_{w,b} 1/2 wT w   s.t. y_i (wT x_i - b) >= 1
# quadprog form: min_b -dT b + 1/2 bT D b   s.t. AT b >= b0  (here d = 0)
yy <- 2 * (as.numeric(y) - 1.5) # {-1,1}
n <- length(y)
p <- ncol(x)
D <- matrix(0, nrow=p+1, ncol=p+1)
diag(D) <- 1
D[p+1,p+1] <- 1e-6 # ensure D is positive def
d <- numeric(p+1)
AT <- cbind(as.matrix(x), rep(-1, n))
A <- t(AT * yy)
b0 <- rep(1, n)
wb <- solve.QP(D,d,A,b0)

We can then pull out our fitted w and b, and plot them against the training data:

w <- wb$solution[1:p]
b <- wb$solution[p+1]
ggplot(dat, aes(x1,x2,col=y)) + geom_point() +
  geom_abline(intercept=(b+1)/w[2],slope=-w[1]/w[2],alpha=.2,linetype=2) +
  geom_abline(intercept=(b-1)/w[2],slope=-w[1]/w[2],alpha=.2,linetype=2) +
  geom_abline(intercept=b/w[2],slope=-w[1]/w[2],linetype=3)
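As a sanity check (a small sketch using the yy vector defined above), we can confirm that the quadprog solution satisfies the hard margin constraints, with a few points sitting right on the margin:

margins <- yy * (as.matrix(x) %*% w - b)[,1] # y_i (w^T x_i - b) for each training point
all(margins >= 1 - 1e-6)                     # every point respects the constraint
sum(abs(margins - 1) < 1e-4)                 # points (essentially) on the margin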

Non separable case and soft margins

The SVM would be a very brittle method if it required separability in all cases: it would fail for many datasets, and it would not be robust to outliers. Since the constraints are hard, a single data point could tilt the margin. So the SVM that is actually used in practice is not the “hard margin” SVM defined above, but one with a “soft margin”, where points in the training set can cross the margin, and even cross the classification boundary and therefore be misclassified. The advantage is that the method will now work on all datasets, will be more robust, and will tend to have lower generalization error.

The soft margin SVM is accomplished by softening the constraints, while adding a penalty for points which violate the margin. The tradeoff between the main objective of making a wide margin and the penalty for points crossing the margin is controlled with a tuning parameter C. The soft margin constrained optimization then looks like:

\[ \begin{aligned} & \underset{w,b}{\text{min}} & & \tfrac{1}{2} w^T w + C \sum_{i=1}^n \xi_i \\ & \text{s.t.} & & y_i(w^T x_i - b) \ge 1 - \xi_i,\quad i=1,\dots,n \\ & & & \xi_i \ge 0 \\ \end{aligned} \]

If all the points can be kept outside the margin and correctly classified, then all the \(\xi_i = 0\). This can also be re-formulated as a “hinge loss”: the above optimization is equivalent to minimizing:

\[ \tfrac{1}{2} w^T w + C \left( \sum_{i=1}^n \max(0, 1 - y_i (w^T x_i - b) ) \right) \]

where each term \(\max(0, 1 - y_i(w^T x_i - b))\) inside the large parentheses is equal to \(\xi_i\). This is called a hinge loss because, again, \(\xi_i = 0\) for those i that are outside the margin and correctly classified; the loss only engages when the margin is crossed. You may have noticed above that we set C very large when we ran the linear kernel SVM. This was so that we would obtain (essentially) the hard margin classifier. We can re-run with a more typical value of \(C=1\) and notice how the margin changes: now more points emerge as support vectors, and some of them are within the margin.

fit <- train(x, y, method="svmLinear")
fit$results
##   C Accuracy Kappa AccuracySD KappaSD
## 1 1        1     1          0       0
alpha <- fit$finalModel@alpha[[1]]
sv <- as.data.frame(x[fit$finalModel@SVindex,]) # the "support vectors"
sv.y <- 2 * (as.numeric(y[fit$finalModel@SVindex]) - 1.5)
w <- colSums(alpha * sv.y * as.matrix(sv))
b <- fit$finalModel@b
grid <- expand.grid(x1=s,x2=s)
grid$y.cont <- (as.matrix(grid[,1:2]) %*% w - b)[,1]
ggplot(dat, aes(x1,x2,col=y)) + geom_point() + 
  geom_point(data=sv, col="black", size=5, shape=21) +
  geom_contour(data=grid, aes(x1,x2,z=y.cont), breaks=c(-1,0,1), col="black")
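To connect this fit back to the hinge-loss form above, here is a small sketch that evaluates the soft margin objective for a given w, b, and C (for illustration only; this is not how kernlab performs the optimization, and it assumes the sign convention of the extracted w and b matches the -1/1 coding in yy defined earlier):

soft.margin.obj <- function(w, b, C, X, yy) {
  xi <- pmax(0, 1 - yy * (X %*% w - b)[,1]) # the slacks: zero outside the margin
  0.5 * sum(w^2) + C * sum(xi)              # 1/2 wT w + C sum(xi)
}
soft.margin.obj(w, b, C=1, as.matrix(x), yy)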

Kernel trick

Finally, we explain what we meant earlier by saying that various kernels can be used with SVM. So far we have mostly looked at the linear kernel SVM, except for a sneak peek of a radial basis function kernel, where we saw very different behavior of the support vectors and the classification boundary. To explain how kernels come into play, we need to show the Lagrangian dual form of the constrained optimization we have been working with. We will go back to the hard margin SVM, but the logic applies equally to the soft margin.

Our optimization can be written:

\[ \begin{aligned} & \underset{w,b}{\text{min}} & & \tfrac{1}{2} w^T w \\ & \text{s.t.} & & -y_i(w^T x_i - b) + 1 \le 0,\quad i=1,\dots,n \\ \end{aligned} \]

And the Lagrange function, with multipliers \(\alpha_i\), is:

\[ \mathcal{L}(w,b,\alpha) = \tfrac{1}{2} w^T w - \sum_{i=1}^n \alpha_i \left( y_i (w^T x_i - b) - 1 \right) \]

Taking the gradient with respect to w and setting equal to zero gives:

\[ w - \sum_{i=1}^n \alpha_i y_i x_i = 0\]

\[ w = \sum_{i=1}^n \alpha_i y_i x_i \]

Repeating the same for b gives:

\[ \sum_{i=1}^n \alpha_i y_i = 0 \]

Then we can rewrite the Lagrange function, using these two equations, as:

\[ \mathcal{L}(w,b,\alpha) = \tfrac{1}{2} A - A + b \sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i \]

where \(A = \sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j x_i^T x_j\). The term involving b vanishes because \(\sum_{i=1}^n \alpha_i y_i = 0\), and \(\tfrac{1}{2} A - A = -\tfrac{1}{2} A\), so re-arranging gives what is typically shown for the Lagrangian dual of the SVM:

\[ \begin{aligned} & \underset{\alpha}{\text{max}} & & \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \\ & \text{s.t.} & & \alpha_i \ge 0, \quad i=1,\dots,n \\ & & & \sum_{i=1}^n \alpha_i y_i = 0 \end{aligned} \]

One note while looking at the first line of the Lagrangian dual:

\[ \max_\alpha \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \]

We want to maximize this quantity, which has a \(-\tfrac{1}{2}\) and then a dot product between all \(x_i\) and \(x_j\). If samples i and j are in the same class, then \(y_i y_j\) will be positive. If they are near each other, then the dot product will be large. For this reason, for many of the samples in the interior of a group of samples, the maximization will “want” to set \(\alpha_i, \alpha_j = 0\). This is some loose motivation for why we end up with a sparse solution, where only a few of the \(\alpha_i\) are non-zero, and these will be the support vectors that define the hyperplane.
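To make the dual concrete, we can hand it to quadprog as well, just as we did for the primal. This is a sketch under a couple of shortcuts: a small ridge is added so quadprog sees a positive definite matrix (so the solution is approximate), and b is recovered from a support vector via \(y_i(w^T x_i - b) = 1\):

# dual: max 1T a - 1/2 aT Q a  s.t. a_i >= 0, sum a_i y_i = 0, with Q_ij = y_i y_j <x_i, x_j>
X <- as.matrix(x)
Q <- (yy * X) %*% t(yy * X)
Dmat <- Q + diag(1e-6, n)      # ridge so the matrix is positive definite
dvec <- rep(1, n)
Amat <- cbind(yy, diag(n))     # first column: the equality constraint sum a_i y_i = 0
bvec <- rep(0, n + 1)
dual <- solve.QP(Dmat, dvec, Amat, bvec, meq=1)
a <- dual$solution
sum(a > 1e-3)                  # only a few alphas are (numerically) non-zero: the support vectors
w.dual <- colSums(a * yy * X)  # w = sum_i a_i y_i x_i
sv1 <- which.max(a)            # a support vector, so y_i (wT x_i - b) = 1
b.dual <- sum(w.dual * X[sv1,]) - yy[sv1]
c(w.dual, b=b.dual)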

As we have defined \(w = \sum_{i=1}^n \alpha_i y_i x_i\), once we have fit a model to the training data, calculating the label for a new data point \(x'\) involves calculating:

\[ w^T x' - b = \sum_{i=1}^n \alpha_i y_i \langle x_i, x' \rangle - b, \]

but since most of the \(\alpha_i\) will be equal to zero, we only need to perform the dot product for the support vectors. This is one of the efficiencies that allows SVM to perform well on large, high dimensional datasets.
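We can see this formula in action with the linear kernel fit from above, scoring an arbitrary hypothetical point using only the support vectors (the point below is made up purely for illustration):

xnew <- c(x1=0.25, x2=-0.5)  # an arbitrary new point
k <- as.matrix(sv) %*% xnew  # linear kernel: <x_i, x'> for each support vector
sum(alpha * sv.y * k) - b    # sum_i alpha_i y_i <x_i, x'> - b
sum(w * xnew) - b            # agrees with w^T x' - b computed directly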

We showed that both finding the solution to the SVM on the training data and calculating the label for a new point involve a dot product \(\langle x_i, x_j \rangle\) between samples. This observation motivates the following trick: if we wanted to work in a transformed space of the original x, we could apply a transformation f to x throughout our equations. However, we just saw that the solution and the prediction of labels for new points require only \(\langle x_i, x_j \rangle\), and so working in the transformed space amounts to replacing this with \(\langle f(x_i), f(x_j) \rangle\). This is what is referred to as a kernel, and our transformation could be written:

\[ K(x_i, x_j) = \langle f(x_i), f(x_j) \rangle \]

This property of the SVM both allows arbitrary transformations and means that we can work in arbitrarily large spaces, as long as we can calculate \(K(x_i, x_j)\); we can avoid ever forming \(x_i\) or \(f(x_i)\) explicitly. To give an example, suppose we are interested in classifying documents that contain words. We could treat each word as a feature, giving a 1 if the document contains the word and a 0 if it does not. This is a very high dimensional space if we start to enumerate all the words in a corpus of documents. However, we only need the dot product between two documents, which amounts to counting the number of words they have in common, and we never need to enumerate the feature vectors \(x_i\) themselves.
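Here is a toy sketch of that document kernel: it counts the distinct words two documents share, which is exactly the dot product of their (implicit) 0/1 word-indicator vectors, without ever forming those vectors:

doc.kernel <- function(d1, d2) {
  words1 <- unique(strsplit(d1, " ")[[1]])
  words2 <- unique(strsplit(d2, " ")[[1]])
  length(intersect(words1, words2)) # shared words = dot product of indicator vectors
}
doc.kernel("the cat sat on the mat", "the dog sat on the log") # shares "the", "sat", "on"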

The radial basis function kernel we used above gives a high value for \(K(x_i, x_j)\) if the points are close together, and a value that quickly drops off to zero if the points are not close.

Extensions to regression

We won’t cover these, but I wanted to provide pointers to the literature that extends the SVM model to predict continuous outcomes. Two such references are:

These can be accessed within caret by selecting one of the SVM models with a regression tag.

Additional reference for these notes