Next: Bibliography
Up: Support Vector Machine Classification
Previous: Conclusions and future work
Support vector machines
Support vector machines map a given set of binary-labeled training
data into a high-dimensional feature space and separate the two classes
of data with a maximum-margin hyperplane. In general, this hyperplane
corresponds to a nonlinear decision boundary in the input space.
Let $x_1, \ldots, x_n \in X$ be the input vectors,
$y_1, \ldots, y_n \in \{-1, +1\}$ be the labels, and
$\Phi : X \rightarrow F$ be the mapping
from input space to feature space. Then the SVM learning algorithm
finds a hyperplane $(\mathbf{w}, b)$ such that the quantity

$$\gamma = \min_i \; y_i \left( \langle \mathbf{w}, \Phi(x_i) \rangle - b \right) \eqno(10)$$

is maximized, where the vector $\mathbf{w}$ has the same dimensionality as $F$,
$b$ is a real number, and $\gamma$ is called the margin. The
corresponding decision function is then

$$f(x) = \operatorname{sign}\left( \langle \mathbf{w}, \Phi(x) \rangle - b \right) \eqno(11)$$
It is easy to prove [Vapnik, 1998] that this
optimum occurs when

$$\mathbf{w} = \sum_i \alpha_i y_i \Phi(x_i) \eqno(12)$$

where the $\alpha_i$ are positive real numbers that maximize

$$\sum_i \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \Phi(x_i), \Phi(x_j) \rangle \eqno(13)$$

subject to

$$\sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0. \eqno(14)$$

The decision function can equivalently be expressed as

$$f(x) = \operatorname{sign}\left( \sum_i \alpha_i y_i \langle \Phi(x_i), \Phi(x) \rangle - b \right) \eqno(15)$$
From this equation it is possible to see that the $\alpha_i$
associated with the training point $x_i$ expresses the
strength with which that point is embedded in the final decision
function. A remarkable property of this alternative representation is
that only a subset of the points will be associated with a nonzero
$\alpha_i$. These points are called support vectors and are
the points that lie closest to the separating hyperplane. The
sparseness of the vector $\alpha$ has several computational and
learning-theoretic consequences.
It is important to note that neither the learning algorithm nor the
decision function (Equation 15) needs to
represent explicitly the image of points in the feature space,
$\Phi(x)$, since both use only the dot products between such
images, $\langle \Phi(x), \Phi(z) \rangle$. Hence, if one were given a function
$K(x, z) = \langle \Phi(x), \Phi(z) \rangle$, one could learn and use the
maximum-margin hyperplane in the feature space without ever explicitly
performing the mapping. For each continuous positive definite
function $K(x, z)$ there exists a mapping $\Phi$ such that
$K(x, z) = \langle \Phi(x), \Phi(z) \rangle$ for
all $x, z \in X$ (Mercer's theorem). The function $K$
is called the kernel function.
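As a concrete illustration (the Gaussian kernel and the small data set below are chosen only for this sketch), the following Python fragment builds a kernel matrix without ever constructing the feature-space images, and checks the symmetry and positive semidefiniteness that Mercer's theorem guarantees:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2));
    # its implicit feature space F is infinite-dimensional.
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Kernel matrix K_ij = K(x_i, x_j), computed from dot products alone.
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# Mercer's condition in practice: K is symmetric positive semidefinite.
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) >= -1e-10))
```

The mapping $\Phi$ never appears in the computation; only $K$ is needed, exactly as in the learning algorithm and decision function above.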
The use of a kernel function allows the support vector machine to
operate efficiently in a nonlinear, high-dimensional feature space
without being adversely affected by the dimensionality of that space.
Indeed, it is possible to work with feature spaces of infinite
dimension. Moreover, Mercer's theorem makes it possible to learn in
the feature space without even knowing $\Phi$ and $F$. The matrix
$K_{ij} = K(x_i, x_j)$ is
called the kernel matrix and will be particularly important in
the extensions of the algorithm that will be discussed later.
Finally, note that the learning algorithm is a quadratic optimization
problem that has only a global optimum. The absence of local minima
is a significant difference from standard pattern recognition
techniques such as neural networks. For moderate sample sizes, the
optimization problem can be solved with simple gradient descent
techniques like the ones used in this paper (see also
[Campbell and Cristianini, 1999] and [Jaakkola and Haussler, 1998]).
For larger problems, more advanced techniques should be used
[Platt, 1999].
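As a sketch of such a simple gradient technique (not the exact procedure used in this paper), the following Python fragment runs projected gradient ascent on the dual objective of Equation 13 for a toy separable data set; for simplicity the bias term $b$ is dropped, which removes the equality constraint of Equation 14:

```python
import numpy as np

# Toy separable data: positives on the right, negatives on the left.
X = np.array([[2.0, 0.0], [3.0, 0.0], [4.0, 0.0],
              [-2.0, 0.0], [-3.0, 0.0], [-4.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

K = X @ X.T                  # linear kernel matrix K_ij = <x_i, x_j>
alpha = np.zeros(len(y))
eta = 0.01                   # learning rate (must be small enough for stability)

# Projected gradient ascent on
#   W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij,
# clipping after each step to keep alpha_i >= 0.
for _ in range(2000):
    grad = 1.0 - y * (K @ (alpha * y))
    alpha = np.maximum(0.0, alpha + eta * grad)

def predict(x):
    # f(x) = sign(sum_i alpha_i y_i <x_i, x>), with the bias b omitted.
    return np.sign(np.sum(alpha * y * (X @ x)))

# Indices of support vectors, and predictions on the training points.
print(np.flatnonzero(alpha > 1e-6), [predict(x) for x in X])
```

Note how the solution is sparse: only the two points nearest the separating hyperplane retain a nonzero $\alpha_i$, consistent with the support-vector property discussed above.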
In the presence of noise, the standard maximum-margin algorithm
described above can be subject to overfitting, and more sophisticated
techniques should be used. This problem arises because the maximum-margin
algorithm always finds a perfectly consistent hypothesis and
does not tolerate training error. Sometimes, however, it is necessary
to trade some training accuracy for better predictive power. The need
for tolerating training error has led to the development of the
soft-margin and the margin-distribution classifiers. One of these
techniques [Shawe-Taylor and Cristianini, 1999] replaces
the kernel matrix in the training phase as follows:

$$K' = K + \lambda I \eqno(16)$$

where $I$ is the identity matrix,
while still using the standard kernel function in the decision phase
(Equation 15). By tuning $\lambda$, one can
control the training error, and it is possible to prove that the risk
of misclassifying unseen points can be decreased with a suitable
choice of $\lambda$ [Shawe-Taylor and Cristianini, 1999].
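A minimal sketch of this diagonal shift (the data set and the value of $\lambda$ are arbitrary illustrations):

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.8, 0.2], [-1.0, 0.0], [-0.9, -0.1]])
lam = 0.5                           # hypothetical setting of lambda

K = X @ X.T                         # any Mercer kernel matrix works here
K_train = K + lam * np.eye(len(X))  # Equation 16: used during training only

# The shift touches only the diagonal and makes K_train strictly positive
# definite, which keeps the soft-margin dual problem well behaved.
print(np.allclose(K_train - K, lam * np.eye(len(X))),
      np.all(np.linalg.eigvalsh(K_train) > 0))
```

At prediction time the unshifted kernel function is used, so only the training-phase matrix changes.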
If instead of controlling the overall training error one wants to
control the tradeoff between false positives and false negatives, it
is possible to modify $K$ as follows:

$$K' = K + \lambda D \eqno(17)$$

where $D$ is a diagonal matrix whose entries are either $d^{+}$ or
$d^{-}$, in locations corresponding to positive and negative examples,
respectively. It is possible to prove that this technique is equivalent to
controlling the size of the $\alpha_i$ in a way that depends on the
size of the class, introducing a bias for larger $\alpha_i$ in the
class with smaller $d$. This in turn corresponds to an asymmetric
margin; i.e., the class with smaller $d$ will be kept further away
from the decision boundary [Veropoulos et al., 1999].
In this work, the extreme imbalance of the two classes, along with the
presence of noise, creates a situation in which points from the
minority class can easily be mistaken for mislabelled points.
Enforcing a strong bias against training errors in the minority class
provides protection against such errors and forces the SVM to make the
positive examples support vectors. Thus, choosing
$d^{+} = n^{+}/N$ and $d^{-} = n^{-}/N$, where $n^{+}$ and $n^{-}$ are
the numbers of positive and negative examples and $N = n^{+} + n^{-}$,
provides a
heuristic way to automatically adjust the relative importance of the
two classes, based on their respective cardinalities. This technique
effectively controls the tradeoff between sensitivity and specificity
[Veropoulos et al., 1999].
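The following Python fragment sketches this class-dependent diagonal shift under the assumed choice $d^{+} = n^{+}/N$ and $d^{-} = n^{-}/N$ ($n^{+}$, $n^{-}$ are the class sizes, $N$ the total; the data set is an arbitrary illustration):

```python
import numpy as np

# Imbalanced toy data: 2 positives, 6 negatives.
X = np.array([[2.0, 2.0], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.0], [-1.5, -2.0],
              [-2.5, -1.5], [-1.0, -2.5], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0])

N = len(y)
n_pos, n_neg = int((y > 0).sum()), int((y < 0).sum())

# Assumed heuristic: d+ = n+/N for positives, d- = n-/N for negatives,
# so the minority (positive) class receives the smaller diagonal entry
# and is therefore kept further from the decision boundary.
d = np.where(y > 0, n_pos / N, n_neg / N)
lam = 1.0

K = X @ X.T                       # linear kernel matrix
K_train = K + lam * np.diag(d)    # Equation 17: K' = K + lambda * D

print(d[0], d[2])
```

Here the positive examples receive the shift $\lambda \cdot 0.25$ and the negative examples $\lambda \cdot 0.75$, biasing the solution toward larger $\alpha_i$ on, and a wider margin around, the minority class.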
Michael Brown
19991105