Support vector machines

Each vector in the gene expression matrix may be thought of as a point
in an *m*-dimensional space. A simple way to build a binary
classifier is to construct a hyperplane separating class members from
non-members in this space. This is the approach taken by perceptrons,
also known as single-layer neural networks.
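The hyperplane idea can be sketched with the classic perceptron update rule; the data below are hypothetical toy expression vectors, used only to illustrate the geometry:

```python
import numpy as np

# Hypothetical toy data: each row is a point in m-dimensional space,
# labels are +1 (class member) or -1 (non-member).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# Perceptron: learn a separating hyperplane w.x + b = 0 by repeatedly
# nudging the hyperplane toward any misclassified point.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(100):
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi
            b += yi

pred = np.sign(X @ w + b)            # classify by which side of the hyperplane
```

On separable data like this, the loop terminates with every training point on the correct side of the hyperplane.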

Unfortunately, most real-world problems involve non-separable data for
which there does not exist a hyperplane that successfully separates
the class members from non-class members in the training set. One
solution to the inseparability problem is to map the data into a
higher-dimensional space and define a separating hyperplane there.
This higher-dimensional space is called the *feature space*, as
opposed to the *input space* occupied by the training examples.
With an appropriately chosen feature space of sufficient
dimensionality, any consistent training set can be made separable.
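A minimal one-dimensional sketch of this mapping: the data below cannot be split by any threshold in the input space, but an explicit (hypothetical) feature map into two dimensions makes them separable:

```python
import numpy as np

# Non-separable 1-D data: class members at the extremes, non-member in the middle.
x = np.array([-1.0, 0.0, 1.0])
y = np.array([1, -1, 1])

# No single threshold on x separates the classes in the input space.
# Map into a 2-D feature space with phi(x) = (x, x^2):
phi = np.column_stack([x, x**2])

# In the feature space, the hyperplane x2 = 0.5 separates the classes:
pred = np.sign(phi[:, 1] - 0.5)
```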

However, translating the training set into a higher-dimensional space incurs both computational and learning-theoretic costs. Representing the feature vectors corresponding to the training set can be extremely expensive in terms of memory and time. Furthermore, artificially separating the data in this way exposes the learning system to the risk of finding trivial solutions that overfit the data.

Support vector machines elegantly sidestep both difficulties
[Vapnik, 1998]. SVMs avoid overfitting by
choosing a specific hyperplane among the many that can separate the
data in the feature space. SVMs find the *maximum margin
hyperplane*, the hyperplane that maximizes the minimum distance from
the hyperplane to the closest training point (see
Figure 2). The maximum margin hyperplane can be
represented as a linear combination of training points. Consequently,
the decision function for classifying points with respect to the
hyperplane only involves dot products between points. Furthermore,
the algorithm that finds a separating hyperplane in the feature space
can be stated entirely in terms of vectors in the input space and dot
products in the feature space. Thus, a support vector machine can
locate a separating hyperplane in the feature space and classify
points in that space without ever representing the space explicitly,
simply by defining a function, called a *kernel function*, that
plays the role of the dot product in the feature space. This
technique avoids the computational burden of explicitly representing
the feature vectors.
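The equivalence between a kernel evaluation and a dot product in the feature space can be verified directly for a small case. For the homogeneous quadratic kernel K(u, v) = (u . v)^2 on two-dimensional inputs, the induced feature map is phi(u) = (u1^2, u2^2, sqrt(2) u1 u2):

```python
import numpy as np

# The quadratic kernel computes a feature-space dot product
# without ever forming the feature vectors explicitly.
def kernel(u, v):
    return (u @ v) ** 2

def phi(u):
    return np.array([u[0]**2, u[1]**2, np.sqrt(2) * u[0] * u[1]])

u = np.array([1.0, 2.0])
v = np.array([3.0, 0.5])

implicit = kernel(u, v)      # one dot product in the input space
explicit = phi(u) @ phi(v)   # same number, computed in the feature space
```

Both routes give the same value, but the kernel evaluation never touches the higher-dimensional space; for kernels inducing very high-dimensional (even infinite-dimensional) feature spaces, only the implicit route is feasible.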

The selection of an appropriate kernel function is important, since the kernel function defines the feature space in which the training set examples will be classified. As long as the kernel function is legitimate, an SVM will operate correctly even if the designer does not know exactly what features of the training data are being used in the kernel-induced feature space. The definition of a legitimate kernel function is given by Mercer's theorem [Vapnik, 1998]: the function must be continuous, symmetric, and positive semi-definite. Human experts often find it easier to specify a kernel function than to specify explicitly the training set features that should be used by the classifier. The kernel expresses prior knowledge about the phenomenon being modeled, encoded as a similarity measure between two vectors.
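One practical consequence of Mercer's condition is easy to check empirically: for a legitimate kernel, the Gram matrix on any finite sample must be symmetric positive semi-definite. A minimal numpy check, using the common Gaussian (RBF) kernel on random data as an example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))     # 20 random sample points in 5 dimensions

def rbf(u, v, gamma=0.5):
    # Gaussian (RBF) kernel, a standard legitimate kernel
    return np.exp(-gamma * np.sum((u - v) ** 2))

# Gram matrix K_ij = K(x_i, x_j) over the sample
K = np.array([[rbf(a, b) for b in X] for a in X])

eigvals = np.linalg.eigvalsh(K)  # symmetric-matrix eigenvalues
ok = np.all(eigvals > -1e-10)    # PSD within numerical tolerance
```

A kernel that fails this test on some sample cannot correspond to a dot product in any feature space.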

In addition to counteracting overfitting, the SVM's use of the maximum margin hyperplane leads to a straightforward learning algorithm that can be reduced to a convex optimization problem. To train the system, the SVM must find the unique minimum of a convex function. Unlike the backpropagation learning algorithm for artificial neural networks, a given SVM will always deterministically converge to the same solution for a given data set, regardless of the initial conditions. For training sets containing fewer than approximately 5,000 points, gradient descent provides an efficient solution to this optimization problem [Campbell and Cristianini, 1999].
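A minimal sketch of this convex optimization, with simplifying assumptions: the bias term is omitted (so the hyperplane passes through the origin), a linear kernel is used, and the data are hypothetical. Projected gradient ascent on the dual objective, sum(a) - 0.5 * a' Q a subject to a >= 0, converges to the same unique solution from any starting point:

```python
import numpy as np

# Toy separable data (hyperplane through the origin suffices here)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Dual objective matrix: Q_ij = y_i y_j (x_i . x_j)
Q = (y[:, None] * X) @ (y[:, None] * X).T

# Projected gradient ascent: step along the gradient (1 - Q a),
# then project back onto the feasible set a >= 0.
alpha = np.zeros(len(X))
eta = 0.01
for _ in range(2000):
    alpha = np.maximum(0.0, alpha + eta * (1.0 - Q @ alpha))

w = (alpha * y) @ X          # hyperplane normal, a linear combination of points
pred = np.sign(X @ w)
```

Because the objective is concave with a single maximum over a convex set, restarting from any other initial `alpha` yields the same weights, in contrast to backpropagation's dependence on random initialization.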

Another appealing feature of SVM classification is the sparseness of
its representation of the decision boundary. The location of the
separating hyperplane in the feature space is specified via
real-valued weights on the training set examples. Those training
examples that lie far away from the hyperplane do not participate in
its specification and therefore receive weights of zero. Only the
training examples that lie close to the decision boundary between the
two classes receive non-zero weights. These training examples are
called the *support vectors*, since removing them would change the
location of the separating hyperplane. The support vectors in a
two-dimensional feature space are illustrated in
Figure 3.

The SVM learning algorithm is defined so that, in a typical case, the number of support vectors is small compared to the total number of training examples. This property allows the SVM to classify new examples efficiently, since the majority of the training examples can be safely ignored. In essence, the SVM focuses upon the small subset of examples that are critical to differentiating between class members and non-class members, throwing out the remaining examples. This is a crucial property when analyzing large data sets containing many uninformative patterns, as is the case in many data mining problems. SVMs effectively remove the uninformative patterns from the data set by assigning them weights of zero.
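The sparse decision function has this shape: only points with nonzero weight enter the sum, so classification cost scales with the number of support vectors rather than the training set size. The weights below are hypothetical, chosen only to illustrate the structure:

```python
import numpy as np

# Toy training set; the points far from the boundary get zero weight.
X = np.array([[2.0, 1.0], [1.0, 2.0], [5.0, 5.0],
              [-2.0, -1.0], [-1.0, -2.0], [-5.0, -5.0]])
y = np.array([1, 1, 1, -1, -1, -1])
alpha = np.array([0.2, 0.2, 0.0, 0.2, 0.2, 0.0])   # hypothetical trained weights

def kernel(u, v):
    return u @ v                                    # linear kernel for simplicity

sv = alpha > 0                                      # support vectors only
def decide(x):
    # f(x) = sign( sum over support vectors of a_i y_i K(x_i, x) )
    return np.sign(sum(a * yi * kernel(xi, x)
                       for a, yi, xi in zip(alpha[sv], y[sv], X[sv])))
```

The two zero-weight points can be discarded entirely without changing any classification.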

The maximum margin allows the SVM to select among multiple candidate
hyperplanes; however, for many data sets, the SVM may not be able to
find any separating hyperplane at all, either because the kernel
function is inappropriate for the training data or because the data
contains mislabelled examples. The latter problem can be addressed by
using a *soft margin* that accepts some misclassifications of the
training examples. A soft margin can be obtained in two different
ways. The first is to add a constant factor to the kernel function
output whenever the given input vectors are identical
[Shawe-Taylor and Cristianini,
1999]. The second is to
define *a priori* an upper bound on the size of the training set
weights [Cortes and Vapnik, 1995]. In either case, the magnitude of
this constant, whether added to the kernel output or used to bound the
size of the weights, controls the number of training points that the
system misclassifies. The setting of this parameter depends on the specific
data at hand. Completely specifying a support vector machine therefore
requires specifying two parameters: the kernel function and the
magnitude of the penalty for violating the soft margin.
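The two soft-margin constructions can be sketched directly on a Gram matrix; the data and parameter values below are hypothetical, used only to show where each constant enters:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
K = X @ X.T                           # linear-kernel Gram matrix

# Method 1: add a constant to the kernel output whenever the two
# input vectors are identical, i.e. add lam to the diagonal of K.
lam = 0.5
K_soft = K + lam * np.eye(len(X))

# Method 2: impose an a priori upper bound C on each training-set weight
# (the box constraint 0 <= alpha_i <= C).
C = 1.0
alpha = rng.uniform(0, 5, size=len(X))   # stand-in for weights during training
alpha_clipped = np.clip(alpha, 0.0, C)
```

In both methods, a larger penalty constant tolerates fewer margin violations; the right value must be chosen for the data at hand, typically by cross-validation.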

To summarize, a support vector machine finds a nonlinear decision function in the input space by mapping the data into a higher-dimensional feature space and separating it there by means of a maximum margin hyperplane. The computational complexity of the classification operation does not depend on the dimensionality of the feature space, which can even be infinite. Overfitting is avoided by controlling the margin. The separating hyperplane is represented sparsely as a linear combination of points. The system automatically identifies a subset of informative points and uses them to represent the solution. Finally, the training algorithm solves a simple convex optimization problem. All these features make SVMs an attractive classification system. A more mathematical description of SVMs can be found in Appendix A.