There are multiple ways to classify data with machine learning. You could run a logistic regression, use decision trees, or build a neural network to accomplish the task. In 1963, Vladimir Vapnik and Alexey Chervonenkis developed another classification tool, the support vector machine. Vapnik refined this classification method in the 1990’s and extended uses for SVMs. Support vector machines have become a great tool for the data scientist.

In this blog post, I plan on offering a high-level overview of SVMs. I will talk about the underlying theory behind SVMs, why they are relevant, and the advantages and disadvantages of this classifier. I show you a quick example of the implementation of SVMs in Python as well. I have included a list of resources I found helpful in understanding support vector machines as well. I encourage you to check out more in depth guides to SVMs if you want to understand the mathematics. The majority of this information was condensed from Tibshirani’s An Introduction to Statistical Learning.

Theory:

Support vector machines attempt to pass a linearly separable hyperplane through a dataset in order to classify the data into two groups. This hyperplane is a linear separator for any dimension; it could be a line (2D), plane (3D), and hyperplane (4D+). Check out this graph from An Introduction to Statistical Learning:

We can separate the red and blue objects with an infinite number of hyperplanes. Which hyperplane is the best? Well, the best hyperplane is the one that maximizes the margin. The margin is the distance between the hyperplane and a few close points. These close points are the support vectors because they control the hyperplane. The graph below illustrates the best hyperplane for the red and blue objects.

This is the Maximum Margin Classifier. It maximizes the margin of the hyperplane. This is the best hyperplane because it reduces the generalization error the most. If we add new data, the Maximum Margin Classifier is the best hyperplane to correctly classify the new data. The Maximum Margin Classifier is our first SVM. But this SVM requires the two classes to be completely linearly separated. This isn’t always the case so in 1993 Vapnik developed another one of his machines.

The graph below shows data that is not perfectly separable.

In this case, the Maximum Margin Classifier would not work. Vapnik developed a soft margin that would allow for some misclassification of data. This is known as a Soft Margin Classifier or a Support Vector Classifier. It also attempts to maximize the margin separating the two classes. The graph below illustrates this SVM.

The support vector classifier contains a tuning parameter in order to control how much misclassification it will allow. This tuning parameter is important when looking to minimize error. As with all supervised learning, there is a bias-variance tradeoff. When the tuning parameter (often denoted as C) is small, the classifier allows only a small bit of misclassification. The support vector classifier will have low bias but may not generalize well and have high variance. We may overfit the training data if our tuning parameter is too small. If C is large, the number of misclassifications allowed has been increased. This classifier would generalize better but may have a high amount of bias. When the tuning parameter is zero, there can be no misclassification and we have the maximum margin classifier. The graphs below illustrate this point.

The support vector classifier can fail if the data is not linearly separable. In 1992, Vapnik developed a method of handling non-linearly separable classes. This method uses the kernel trick. We need “to enlarge the feature space in order to accommodate a non-linear boundary between the classes” (An Introduction to Statistical Learning). Kernels are functions that quantify similarities between observations. Common types of kernels used to separate non-linear data are polynomial kernels, radial basis kernels, and linear kernels (which are the same as support vector classifiers). Simply, these kernels transform our data in order to pass a linear hyperplane and thus classify our data. Below are a few visual guides for the various kernel types.

http://svmcompbio.tuebingen.mpg.de/img/poly.png

Extensions of support vector machines can be used to solve a variety of other problems. We can have multiple class SVMs using One-Versus-One Classification or One-Versus-All Classification. A brief description of these can be found in An Introduction to Statistical Learning. Additionally, support vector regressors exist for regression problems. You can also look into support vector clustering, ranking SVM, transductive SVM, and much more.

So, when should we use SVMs? SVMs do a very good job of classifying when the groups are clearly separated. They also do a great job when our data is non-linearly separated. You could transform data to linearly separate it or you can have SVMs transform the data and linearly separate the two classes right out of the box. This is one of the main reasons to use SVMs. You don’t have to transform non-linear data yourself. One downside to SVMs is the black box nature of these functions. The use of kernels to separate non-linear data makes them difficult (if not impossible) to interpret.

Below is an example of quick implementation of SVMs in Python with Scikit-Learn.

In the code above, I instantiate the SVM with a specific kernel function and check the average accuracy of the model using cross validation. As you can see, using an SVM can be very simple out of the box.

Overall, support vector machines are great classifiers for specific situations. Understanding them will give you an alternative to GLMs and decision trees for classification. Be sure to check out my citations below especially if you want a more in depth mathematical explanation of support vector machines. If you have a question that has not been answered in those resources, send Vapnik a message. He currently works at Facebook AI Research.