Support Vector Machine
An SVM is a binary classifier; we will thus consider:
<math>Y \in \mathbb{R}^{MC}</math> taking values <math>\left\{ -1, 1 \right\}</math>
We want to be able to determine whether an example <math>X_m</math> belongs to a class <math>c</math> or not by looking at the sign of the prediction, <math>\operatorname{sign}(\widehat{Y}_m)</math>. Given a hyperplane in the N-dimensional feature space, a simple dot product tells us on which side of the hyperplane the example <math>m</math> lies.
In the example above, the green hyperplane H1 does not separate the two classes correctly; the blue hyperplane H2 separates them, but its distance to the classes is very small (it does not generalize well); and the red hyperplane H3 is the optimal linear classifier because it maximizes the distance to the two classes. An SVM will thus find the hyperplane that separates the two classes while maximizing the distance to the closest examples, called support vectors.
Let's define our hyperplane <math>i</math> by every <math>X_{i} \in \mathbb{R}^{1N}</math> that satisfies:
<math>X_{i} \cdot \beta - \alpha = 0</math>
where <math>\beta \in \mathbb{R}^{N1}</math> and <math>\alpha \in \mathbb{R}</math>.
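As a concrete illustration of this dot product side test, here is a minimal NumPy sketch; the values of `beta`, `alpha`, and the example `x_i` are made up for the example, not taken from the tutorial.

```python
import numpy as np

# Hypothetical hyperplane parameters: beta is an (N, 1) weight vector, alpha a scalar offset.
beta = np.array([[2.0], [-1.0]])   # N = 2
alpha = 0.5

# A single example X_i as a (1, N) row vector.
x_i = np.array([[1.5, 0.5]])

# The side of the hyperplane is given by the sign of X_i . beta - alpha.
score = (x_i @ beta).item() - alpha
predicted_class = 1 if score >= 0 else -1
print(score, predicted_class)   # 2.0 1
```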
Then we can also define the boundaries of the margin by the two parallel hyperplanes <math>j</math> and <math>k</math>:
- <math>X_{j} \cdot \beta - \alpha = 1</math>
- <math>X_{k} \cdot \beta - \alpha = -1</math>
Every training example must then lie on the correct side of its class's margin hyperplane:
- <math>X_{m} \cdot \beta - \alpha \geq 1</math> if <math>Y_{m} = 1</math>
- <math>X_{m} \cdot \beta - \alpha \leq -1</math> if <math>Y_{m} = -1</math>
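For instance, the minimal sketch below (the dataset and the candidate <math>\beta</math> and <math>\alpha</math> are made up for illustration) checks whether a candidate hyperplane satisfies these margin constraints:

```python
import numpy as np

# Toy dataset: X is (M, N), Y holds the labels -1 / 1.
X = np.array([[2.0, 2.0],
              [3.0, 1.0],
              [-2.0, -1.0],
              [-3.0, -2.0]])
Y = np.array([1, 1, -1, -1])

# Candidate hyperplane parameters (illustrative).
beta = np.array([1.0, 1.0])
alpha = 0.0

# Every example must satisfy Y_m * (X_m . beta - alpha) >= 1.
margins = Y * (X @ beta - alpha)
print(margins)                      # [4. 4. 3. 5.]
print(bool(np.all(margins >= 1)))   # True: (beta, alpha) respects the margin
```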
In particular, an SVM classifier will try to find a hyperplane that successfully separates the two classes while keeping the greatest possible distance to each of them.
Given a hyperplane, determining whether an example falls on one side or the other can be performed with a simple dot product.
The exact formulation of the optimization procedure goes beyond the scope of this tutorial. Check this paper for a detailed explanation.
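If you just want an off-the-shelf solver, a minimal sketch using scikit-learn (an assumption of this example; the tutorial itself does not depend on it) looks like this:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy separable data, made up for illustration.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-3.0, -2.0]])
Y = np.array([1, 1, -1, -1])

# LinearSVC solves the (soft-margin) linear SVM optimization problem for us.
clf = LinearSVC(C=1.0)
clf.fit(X, Y)

# coef_ and intercept_ play the roles of beta and -alpha in the notation above.
print(clf.coef_, clf.intercept_)
print(clf.predict([[1.0, 1.0], [-1.0, -1.0]]))   # expected: [ 1 -1]
```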
This algorithm is a refinement of the k-NN classifier above in the sense that it learns a linear classifier able to separate the two classes. Prediction thus runs in <math>O(N)</math> time, independent of the number of training examples, instead of the <math>O(MN)</math> needed to go through the whole training dataset.
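To make the complexity comparison concrete, here is a rough sketch (the function names and toy data are ours, not from the tutorial) contrasting the two prediction costs:

```python
import numpy as np

def svm_predict(x, beta, alpha):
    # One dot product over the N features: O(N), independent of the training set size M.
    return 1 if x @ beta - alpha >= 0 else -1

def knn_predict(x, X_train, Y_train):
    # Distances to all M training examples, N features each: O(MN). 1-NN for simplicity.
    distances = np.linalg.norm(X_train - x, axis=1)
    return int(Y_train[np.argmin(distances)])

X_train = np.array([[2.0, 2.0], [-2.0, -1.0]])
Y_train = np.array([1, -1])
print(svm_predict(np.array([1.0, 1.0]), np.array([1.0, 1.0]), 0.0))   # 1
print(knn_predict(np.array([1.0, 1.0]), X_train, Y_train))            # 1
```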
Since the algorithm does not converge when outliers make the classes non-separable, this very failure can be used to detect the presence of outliers.
If we interpret the presence of outliers as misclassified examples, we can allow the number of support vectors to grow above the strict minimum, in a similar way to how k-NN uses more than one neighbour.
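In practice this tolerance is usually exposed as a regularization parameter. For example, assuming scikit-learn (again not part of the tutorial), a small `C` allows more margin violations than a large one:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data with one mislabeled example sitting inside the positive cluster (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 2.5],
              [-2.0, -1.0], [-3.0, -2.0], [2.2, 1.8]])
Y = np.array([1, 1, 1, -1, -1, -1])

# A small C tolerates margin violations, so the outlier barely moves the hyperplane;
# a large C penalizes every violation heavily and lets the outlier distort the fit.
tolerant = LinearSVC(C=0.01).fit(X, Y)
strict = LinearSVC(C=100.0).fit(X, Y)
print(tolerant.coef_, strict.coef_)   # the two hyperplanes differ
```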
Since the separator is a hyperplane, it can happen that the data is not linearly separable, for instance because of a mislabeled example. In this case it is common to use the kernel trick to map the problem to a higher-dimensional space (hopefully separable there) by applying non-linearities. One would normally try the most common general-purpose kernels and pick the best-performing classifier.
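As an illustration of the kernel trick, again assuming scikit-learn, switching to a non-linear kernel is just a parameter of the classifier; the XOR-like data below is a classic example of a set that no single hyperplane can separate in the original space:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the original 2-D space.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
Y = np.array([1, 1, -1, -1])

# The RBF kernel implicitly maps the data to a higher-dimensional space
# where a separating hyperplane exists.
clf = SVC(kernel="rbf", gamma=1.0, C=10.0)
clf.fit(X, Y)
print(clf.predict(X))   # [ 1  1 -1 -1] on this toy example
```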
The time needed to train the algorithm is <math>O(MN)</math> in the linear case, and <math>O(M^2 N)</math> if we apply the kernel trick.