Assistant Professor, Principle Investigator Center for Bioinformatics, School of Life Science

Liping Wei 魏丽萍, Ph.D.

Professor, Director Center for Bioinformatics, School of Life Sciences

Hello everyone! I'm Meng Wang from Center for Bioinformatics, Peking University.

Now let me introduce the support vector machine (SVM).

First I will show you how machine learning is done. Here is how the supervised machine learning is carried out.

Supervised machine learning picks a data set as the training data set.

Then it uses machine learning methods to analyze and model the data,

Common machine learning methods include the support vector machine,

the Hidden Markov Model, the Bayesian methods, and artificial neural networks.

Using the training data set to train these methods

is just to get the exact values of parameters of these models. That's how to get the whole model.

Then we can give new data to this model to compute the predicted result.

Here, the classification problem is one of the important problems addressed by machine learning.

We take as example the binary classification here.

It classifies data into two classes, one of which labelled with 1 and the other with -1.

Each data point in the training set can be classified by various features.

For example, here the X1, X2, and X3 denote three features, and Y denotes the class for each data point.

In this way we can map the data to a 3D space. The three features correspond to

three axes of this 3D space. Y represents the color of these points.

As you can see, these points are classified into two clusters, the Class 1 of which colored in blue and Class -1 in red.

By learning these known points, we can determine which cluster (and thus the color) a new given point should be assigned to.

A straightforward idea is to plot a plane to completely separate these two clusters.

Then we can determine whether a new given point is above or below this plane,

and tell the class it belongs to.

For example, if it is above this plane, then it belongs to the first class; if it is below this plane, then it belongs to the second class.

What the support vector machine wants to do is to find such a plane to finish the classification.

The support vector machine can be used not only in classification, but also in regression analysis, pattern recognition, etc.

The goal of support vector machine is to choose in the space certain key points, which are the support vectors.

These points will be used to construct the plane for classification to separate the two data points as much as possible.

SVM can handle more than the linear classification problems.

What makes it strong is that it can handle non-linear cases.

Briefly, it uses kernel functions to map non-linear cases to a hyper-dimensional space, and use linear method to finish it.

For example, assume in a 2D space are two classes of points, one in red and the other in green.

We want to find in this plane a straight line to separate these two classes of points.

As you can see, there are infinite number of lines between these two clusters of points that can separate them completely, as shown here.

Which one is the best?

As you can see, for points that are near the center of the two clusters, such as this one, the line in the figure above will classify it as Class 2.

But it will be classified as Class 1 if the line in the figure below is adopted. So which class does this point belong to?

What SVM wants to do is to find the optimal line among all the possible lines. The optimality is defined as the fairest.

Which is the best among all these planes for classification?

Intuitively, for the sake of fairness,

the one that lies in the very center of the two clusters is the fairest and the most reasonable one.

What SVM wants to do is to find among all possible planes the fairest one that lies in the very center of the two clusters.

As shown on the right, for these two clusters, once the blue and the pink lines are given,

the final red line can be determined unambiguously.

The blue and the pink lines are determined

by these three key points.

In other words, we can unambiguously determine the plane for classification in the center once these three points are given.

The plane is independent with other points beyond these two lines.

In other words, the final plane for classification is not affected by any addition or deletion of points beyond this region.

Thus this classification method is stable; it is not affected by outliers.

The three key points are called as support vectors in support vector machine.

In other words, the plane for classification is unambiguously determined once the support vectors are determined.

Thus the support vector machine is just to find the support vectors to determine unambiguously the plane for classification.

Therefore, SVM is to find in the space the fairest plane for classification to separate the two clusters of points as fair as possible.

The fairest plane is the plane whose distances to the two clusters are equal.

Next I will explain briefly how the support vector machine is expressed mathematically.

If you are interested, you can deduce and learn them by the hints provided in the slides and related materials.

Here is an expression for a plane in a space.

Mathematically speaking, we can determine whether a point is above or below the plane by assigning to the point values of its coordinates.

The value of f(x) can be worked out. The sign of f(x) can thus be used to determine whether the point is above or below the plane.

If f(x)>0, then the point is above the plane and belongs to the first class; if f(x)<0, then it is below the plane and belongs to the other class.

Then how can we evaluate the reliability of the classification result?

A straightforward idea is to use the absolute value of f(x). A larger value implies that the point is relatively far away from the plane.

The classification result is thus more reliable.

Likewise, if the absolute value of f(x) is small and approximates 0, it will be closer to the plane.

The classification result is thus less reliable.

Therefore, we can evaluate the how reliable the classification result is by the absolute value of f(x).

As expressed above, f(x) can be formulated as yi*f(xi). This formula is called as the functional margin of this point.

Next we define the functional margin of a hyperplane as the minimum of the functional margins over all points in the space.

This is defined as the functional margin of a hyperplane.

The functional margin of a point (or a hyperplane) has a very important property.

That is, they can be scaled arbitrarily. The reason is that

the equation always hold no matter by what number the two sides of the equation are multiplied.

For example, we multiply the w and b in the equation by 2 to get 2w and 2b.

The equation still holds anyway.

The functional margins of the point and the corresponding hyperplane are both doubled.

Therefore, by multiplying the two sides of the equation by any same number

the functional margin of a point and the hyperplane margin can be scaled to any number.

Thus, the hyperplane margin is not a static value. It is a relative value.

To make it stable, it is divided by the length of vector w, wich is the 2-norm of w. So the result is a absolute value, which is represented by r, called geometrical margin.

Here the r means the minimum of all the distances from each point to the hyperplane.

As we mentioned, the hyperplane margin could be scaled to any value.

Thus, we can set it to be 1, by multiplying a particular number at the two sides of equation.

So the formula is simplified by setting the hyperplane margin to be 1.

Here we get the following results. Because the r represents the minimum distance of all the points to the hyperplane,

when we maximize it, as you could imagine, the hyperplane will be located in the middle of the two clusters, which is the fairest plane.

In SVM, the objective is to maximize this formula.

At the same time, the hyperplane margin is set to be 1, and it is the minimum distance, so all the other points' distance to the plane is greater than or equal to 1.

Now we have convert the problem of finding the fairest plane to a problem of finding the maximum value of a function under constraints.

To facilitate calculations, we transform the formula, putting the 2-norm of w in the numerator position and squired and divided by 2,

which is equivalent to the original formula.

Finally, the problem becomes to get the minimum value of a function under constraints.

This problem can be solved by quadratic programming.

Meanwhile, there is another general approach to solve this sort of question,

that is lagrangian multiplier method.

The lagrangian multiplier method is to construct a formula about L and lagrange multipliers alpha, to get the extremum of a function.

According the knowledge of differential calculus, the extremum value of a function will be found at the points which their derivative is 0.

So by calculating each partial derivative and set them to be 0, we get the values of parameters by solving the system of equations.

So we get the results about x, y.

We replace the w in the formula about f(x) with above results.

Now we get the finial result.

This is a mathematical representation of SVM.

It first calculates the inner product of two vectors and then calculates the linear combination of the results to get the result, which is used to classification.

All the situations mentioned just now are linear separable, which means we can separate all the points into two clusters by a hyperplane perfectly.

But in most situations, not all the points can be separated by a plane.

Take this for example.

We cannot separate blue points and red points into two clusters by a line. However, we can separate them by a curve.

It is more complicated to deal with non-linear problems in mathematics.

People thought out an idea,

to map all the points to a higher dimension space.

It is not linear separable in the original space, but it may be linear separable in the mapped higher dimension space.

This is the main idea of SVM kernel functions.

By finding a mapping function Phi, we can map the points to a higher dimension space, which is linear separable.

Thus, the mathematical representation of SVM changes to this.

Here, the points are firstly mapped, and then calculated as before.

The function that is the inner product of two mapped points is called kernel function.

So the kernel function is to map points to a higher dimension and do the inner product.

Could we find this sort of mapping function?

Take the picture is the right-bottom corner for example,

the blue points and red points cannot be separated into to clusters by a line.

But as we can see, they can be separated by a circle perfectly.

The is a formula that represents circle in mathematics.

By applying this mapping: c_1=x_1ï¼c_2=(x_1)^2ï¼c_3=x_2ï¼c_4=(x_2)^2ï¼c_5=x_1*x_2.

We can obtain a linear equation, which represents a plane in the 5-dimension space.

Now we have mapped points in 2D to points in 5D.

And these points can be linear separated in the 5D space by a hyperplane.

Here Z1 ... Z5 are the mapping functions.

There are 3 types of kernel functions used.

The first type is linear kernel function, which is used by SVM originally.

Second is the polynomial kernel, which is demonstrated by the last example.

Another is the gauss kernel, which is to map the point to a infinite dimension space.

Next I will illustrate how the SVM works.

I choose R to do the analysis.

R is powerful tool to do statistical analysis and visualization.

It is widely used in bioinformatics data analysis.

Firstly, load the data to be analysed into R.

The dataset has a total of 212 points and each is a 5 dimension vector.

The SVM is implemented by the package of 'e1071'

The data will be used to train a SVM model.

The parameters in the svm() function can be referred to help documents in R.

Now the model is built.

The result can be viewed by typing summary.

It tells us the cross validation accuracy is 81.6%

The support vector selected is stored in the SV attribute.

We can use the model to make a prediction on the training data, to see what the result wiil be.

The probability of each point belong to class 1 is stored in the variable prob.

Next we can draw a ROC curve to visualize the model's performance.

The 'pROC' package is needed to perform roc analysis in R.

Here is the ROC curve.

Now we will show the classification results.

Here we need the rgl package.

The original dataset is in a 5D space, which could not be shown. So we choose the first 3 dimension to illustrate.

We need to re-train the model.

Plot all the points in 3D coordinate system by plot3d().

The points which has a larger size is the support vectors.

Next we will show the classification plane.

On the classification plane, the decision value is 0.

So by enumerating all the points in the space and find all the points that has a decision value of 0 to form the classification plane.

The above is an example of linear classification system, which uses a linear kernel function.

Now let's see what's the result by using the gauss kernel.

We can draw the classification boundary using the same methods above.

The gauss kernel is to map all the points to a infinite dimension space,

The classification boundary is a hyperplane is the infinite dimension space, but in the original space, it is an irregular curved surface.

The SVM is a powerful method to perform classification. It is widely used in the real world to solve many problems.

Besides its widely used in bioinformatics, it is also used in text categorization, image classification and hand written recognition.

Finally, let's see pros and cons of SVM.

First, the SVM is a stable methods, because the hyperplane is only determined by the support vectors.

It can deal with non-linear problems, and often has very high accuracy.

Meanwhile, it needs to do a lot of inner product,

which makes it running slower than many other machine learning methods.