Just like other algorithms in machine learning that perform the task of classification(decision trees, random forest, K-NN) and regression, Support Vector Machine or SVM one such algorithm in the entire pool. It is a supervised (requires labeled data sets) machine learning algorithm that is used for problems related to either classification or regression.
However, it is frequently applied in classification problems. SVM algorithm entails plotting of each data item as a point. The plotting is done in an n-dimensional space where n is the number of features of a particular data. Then, classification is carried out by finding the most suitable hyperplane that separates the two(or more) classes effectively.
The term support vectors are just coordinates of an individual feature. Why generalize data points as vectors you may ask. In real-world problems, there exist data -sets of higher dimensions. In higher dimensions(n-dimension), it makes more sense to perform vector arithmetic and matrix manipulations rather than regarding them as points.
Types of SVM
Linear SVM : Linear SVM is used for data that are linearly separable i.e. for a dataset that can be categorized into two categories by utilizing a single straight line. Such data points are termed as linearly separable data, and the classifier is used described as a Linear SVM classifier.
Non-linear SVM: Non-Linear SVM is used for data that are non-linearly separable data i.e. a straight line cannot be used to classify the dataset. For this, we use something known as a kernel trick that sets data points in a higher dimension where they can be separated using planes or other mathematical functions. Such data points are termed as non-linear data, and the classifier used is termed as a Non-linear SVM classifier.
Algorithm for Linear SVM
Let’s talk about a binary classification problem. The task is to efficiently classify a test point in either of the classes as accurately as possible. Following are the steps involved in the SVM process.
Firstly, set of points belonging to the two classes are plotted and visualized as shown below. In a 2-d space by just applying a straight line, we can efficiently divide these two classes. But there can be many lines that can classify these classes. There are a set of lines or hyperplanes(green lines) to choose from. The obvious question will be, out of all these lines which line is suitable for classification?
set of hyper-planes, Image credit
Basically, Select the hyper-plane which separates the two classes better. We do this by maximizing the distance between the closest data point and the hyper-plane. The greater the distance, the better is the hyperplane and better classification results ensue. It can be seen in the figure below that the hyperplane selected has the maximum distance from the nearest point from each of those classes.
A reminder, the two dotted lines that go parallel to the hyperplane crossing the nearest points of each of the classes are referred to as the support vectors of the hyperplane. Now, the distance of separation between the supporting vectors and the hyperplane is called a margin. And the purpose of the SVM algorithm is to maximize this margin. The optimal hyperplane is the hyperplane with maximum margin.
Take for example classifying cells as good and bad. the cell xᵢ is defined as an n-dimensional feature vector that can be plotted on n-dimensional space. Each of these feature vectors are labeled with a class yᵢ. The class yᵢ can either be a +ve or -ve (eg. good=1, not good =-1). The equation of the hyperplane is y=w.x + b = 0. Where W and b are line parameters. The earlier equation returns a value ≥ 1 for examples for +ve class and ≤-1 for -ve class examples.
But, How does it find this hyperplane? The hyperplane is defined by finding the optimal values w or weights and b or intercept which. And these optimal values are found by minimizing the cost function. Once the algorithm collects these optimal values, the SVM model or the line function f(x) efficiently classifies the two classes.
In a nutshell, the optimal hyperplane has equation w.x+b = 0. The left support vector has equation w.x+b=-1 and the right support vector has w.x+b=1.
Thus the distance d between two parallel liens Ay = Bx + c1 and Ay = Bx + c2 is given by d = |C1–C2|/√A^2 + B^2. With this formula in place, we have the distance between the two support vectors as 2/||w||.
The cost function for SVM looks the like the equation below:
SVM loss function
In the cost function equation above, the λ parameter denotes that a larger λ provides a broader margin, and a smaller λ would yield a smaller margin. Furthermore, the gradient of the cost function is calculated and the weights are updated in the direction that lowers the lost function.
Algorithm for Non-linear SVM
In the SVM classifier, it is straight forward to have a linear hyper-plane between these two classes. But, an interesting question which arises is, what if the data is not linearly separable, what should be done? For this, the SVM algorithm has a method called the kernel trick.
The SVM kernel function takes in low dimensional input space and converts it to a higher-dimensional space. In simple words, it converts the not separable problem to a separable problem. It performs complex data transformations based on the labels or outputs that define them
Look at the diagram below to better understand data transformation. The set of data points on the left are clearly not linearly separable. But when we apply a function Φ to the set of data points, we get transformed data points in a higher dimension that is separable via a plane.
To separate non linearly separable data points, we have to add an extra dimension. For linear data, two dimensions have been used, that is, x and y. For these data points, we add a third dimension, say z. For the example below let z=x² +y².
This z function or the added dimensionality transforms the the sample space and the above image will become as the following:
On close analysis, it is evident that the above data points can be separated using a straight line function that is either parallel to the x axis or is inclined at an angle. Different types of kernel functions are present — linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
What RBF does in simple words is — if we pick some point, the result of an RBF will be the norm of the distance between that point and some fixed point. In other words, we can design a z dimension with the yields of this RBF, which typically gives ‘height’ depending on how far the point is from some point.
Which Kernel to choose?
A nice method to determine which kernel is the most suitable is to make various models with varying kernels, then estimate each of their performance, and ultimately compare the outcomes. Then you pick the kernel with the best results. Be particular to estimate the model’s performance on unlike observations by using K-Fold Cross-Validation and consider different metrics like Accuracy, F1 Score, etc.
SVM in Python and R
The fit method in python simply trains the SVM model on Xtrain and ytrain data that has been separated. More specifically, the fit method will assemble the data in Xtrain and ytrain, and from that, it will calculate the two support vectors.
Once these support vectors are estimated, the classifier model is completely set to produce new predictions with the predict function because it only needs the support vectors to separate the new data. Now you may get different results in Python and in R, so be sure to check the value of the seed parameter.
In this article, we looked at the Support Vector Machine algorithm in detail. Thanks for your time. Tune in for more such articles.
If you’re interested to learn more about machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.