Support vector machines

Support vector machines map a given set of binary-labeled training data to a high-dimensional feature space and separate the two classes of data with a maximum margin hyperplane. In general, this hyperplane corresponds to a nonlinear decision boundary in the input space. Let ${\bf X} \in R_0 \subseteq \Re ^{n}$ denote an input vector, $y \in \{-1,+1\}$ its label, and $\phi :R_0\rightarrow F$ the mapping from input space to feature space. Then the SVM learning algorithm finds a hyperplane (w,b) such that the quantity

\begin{displaymath}\gamma =\min_{i}y_{i}\{\langle w,\phi ({\bf X}_{i})\rangle -b\}
\end{displaymath} (10)

is maximized, where the vector w has the same dimensionality as F, b is a real number, and $\gamma $ is called the margin. The corresponding decision function is then

\begin{displaymath}f({\bf X})=\mathrm{sign}\left( \langle w,\phi ({\bf X})\rangle -b\right)
\end{displaymath} (11)
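
As a small illustration of these two definitions, the following Python/NumPy sketch assumes, purely for illustration, that the feature images $\phi ({\bf X}_{i})$ are available explicitly as the rows of a matrix Phi; all names in the code are assumptions of this sketch rather than part of the algorithm description (later in this section the mapping is kept implicit).

\begin{verbatim}
import numpy as np

def margin(w, b, Phi, y):
    """Eq. (10): gamma = min_i y_i (<w, phi(X_i)> - b)."""
    return np.min(y * (Phi @ w - b))

def predict(w, b, phi_x):
    """Eq. (11): f(X) = sign(<w, phi(X)> - b)."""
    return np.sign(np.dot(w, phi_x) - b)
\end{verbatim}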

It is easy to prove [Vapnik, 1998] that this maximum is attained when

\begin{displaymath}w=\sum_{i}\alpha_{i}y_{i}\phi ({\bf X}_{i})
\end{displaymath} (12)

where $\alpha_i$ are positive real numbers that maximize

\begin{displaymath}\sum_{i}\alpha_{i}-\sum_{ij}\alpha_{i}\alpha_{j}y_{i}y_{j}\langle \phi
({\bf X}_{i}),\phi ({\bf X}_{j})\rangle
\end{displaymath} (13)

subject to

\begin{displaymath}\sum_{i}\alpha_{i}y_{i}=0,\qquad \alpha_{i}\geq 0.
\end{displaymath} (14)
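
To make the optimization concrete, the following sketch evaluates the dual objective (13), checks the constraints (14), and runs a naive projected gradient ascent on the multipliers. It is only an illustration of the quadratic program above: the update rule, the omission of the equality constraint (which vanishes for a bias-free machine), and all names are assumptions of this sketch, not the optimizer used in this paper.

\begin{verbatim}
import numpy as np

def dual_objective(alpha, y, K):
    """Eq. (13): sum_i alpha_i - sum_ij alpha_i alpha_j y_i y_j K_ij,
    where K[i, j] = <phi(X_i), phi(X_j)>."""
    v = alpha * y
    return alpha.sum() - v @ K @ v

def is_feasible(alpha, y, tol=1e-8):
    """Eq. (14): sum_i alpha_i y_i = 0 and alpha_i >= 0 for all i."""
    return abs(np.dot(alpha, y)) <= tol and np.all(alpha >= -tol)

def train_dual(K, y, lr=1e-3, epochs=500):
    """Naive projected gradient ascent on Eq. (13): step along the
    gradient, then clip so that every alpha_i stays non-negative.
    The equality constraint of Eq. (14) is ignored for simplicity."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        grad = 1.0 - 2.0 * y * (K @ (alpha * y))    # gradient of Eq. (13)
        alpha = np.maximum(alpha + lr * grad, 0.0)  # keep alpha_i >= 0
    return alpha
\end{verbatim}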

The decision function can equivalently be expressed as

 \begin{displaymath}f({\bf X})=\mathrm{sign}\left( \sum_{i}\alpha_{i}y_{i}\langle \phi
({\bf X}_{i}),\phi \left({\bf X}\right) \rangle -b\right).
\end{displaymath} (15)
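
A minimal sketch of this expansion, assuming the inner products $\langle \phi ({\bf X}_{i}),\phi ({\bf X})\rangle$ are supplied by a callable (the helper and its argument names are illustrative assumptions, not part of the paper):

\begin{verbatim}
import numpy as np

def decision_function(x, X_train, y_train, alpha, b, dot):
    """Eq. (15): f(x) = sign(sum_i alpha_i y_i <phi(X_i), phi(x)> - b).
    dot(a, b) must return <phi(a), phi(b)>; points with alpha_i = 0
    contribute nothing, so only the support vectors need to be kept."""
    k = np.array([dot(x_i, x) for x_i in X_train])
    return np.sign(np.dot(alpha * y_train, k) - b)
\end{verbatim}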

From Equation 15 it is possible to see that the $\alpha_i$ associated with the training point ${\bf X}_{i}$ expresses the strength with which that point is embedded in the final decision function. A remarkable property of this alternative representation is that only a subset of the points is associated with a non-zero $\alpha_i$. These points, called support vectors, are the points that lie closest to the separating hyperplane. The sparseness of the $\alpha $ vector has several computational and learning-theoretic consequences.

It is important to note that neither the learning algorithm nor the decision function (Equation 15) needs to represent the images of the points in the feature space, $\phi ({\bf X}_{i})$, explicitly, since both use only the dot products between such images, $\langle \phi ({\bf X}_{i}),\phi ({\bf X}_{j})\rangle$. Hence, if one were given a function $K({\bf X},{\bf Y})=\langle \phi ({\bf X}),\phi ({\bf Y})\rangle$, one could learn and use the maximum margin hyperplane in the feature space without ever explicitly performing the mapping. For each continuous positive definite function $K({\bf X},{\bf Y})$ there exists a mapping $\phi$ such that $K({\bf X},{\bf Y})=\langle \phi ({\bf X}),\phi ({\bf Y})\rangle$ for all ${\bf X},{\bf Y}\in R_0$ (Mercer's theorem). The function $K({\bf X},{\bf Y})$ is called the kernel function. The use of a kernel function allows the support vector machine to operate efficiently in a nonlinear, high-dimensional feature space without being adversely affected by the dimensionality of that space. Indeed, it is possible to work with feature spaces of infinite dimension. Moreover, Mercer's theorem makes it possible to learn in the feature space without even knowing $\phi$ and F. The matrix $K_{ij}= \langle \phi ({\bf X}_{i}),\phi ({\bf X}_{j})\rangle$ is called the kernel matrix and will be particularly important in the extensions of the algorithm discussed later (a small illustrative sketch of its construction is given after Equation 16).

Finally, note that the learning algorithm is a quadratic optimization problem with a single, global optimum. The absence of local minima is a significant difference from standard pattern recognition techniques such as neural networks. For moderate sample sizes, the optimization problem can be solved with simple gradient descent techniques like the ones used in this paper (see also [Campbell and Cristianini, 1999] and [Jaakkola and Haussler, 1998]). For larger problems, more advanced techniques should be used [Platt, 1999].

In the presence of noise, the standard maximum margin algorithm described above can be subject to overfitting, and more sophisticated techniques should be used. This problem arises because the maximum margin algorithm always finds a perfectly consistent hypothesis and does not tolerate training error. Sometimes, however, it is necessary to trade some training accuracy for better predictive power. The need for tolerating training error has led to the development of the soft-margin and the margin-distribution classifiers. One of these techniques [Shawe-Taylor and Cristianini, 1999] replaces the kernel matrix in the training phase as follows:

\begin{displaymath}K \leftarrow K+ \lambda I,
\end{displaymath} (16)

while still using the standard kernel function in the decision phase (Equation 15). By tuning $\lambda$, one can control the training error, and it is possible to prove that the risk of misclassifying unseen points can be decreased with a suitable choice of $\lambda$ [Shawe-Taylor and Cristianini, 1999].
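
As a small illustrative sketch of this modification (and of the kernel matrix it acts on), the code below builds a kernel matrix with a Gaussian kernel, chosen here purely as an example of a Mercer kernel, and adds the diagonal term of Equation 16 for training; the unmodified kernel would still be used in the decision function. The kernel choice and all names are assumptions of the sketch.

\begin{verbatim}
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||X_i - X_j||^2 / (2 sigma^2)), an example
    Mercer kernel; X holds one input vector per row."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def soft_margin_kernel(K, lam):
    """Eq. (16): K <- K + lambda * I, applied to the training kernel
    matrix only; the original kernel is kept for Eq. (15)."""
    return K + lam * np.eye(K.shape[0])
\end{verbatim}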

If, instead of controlling the overall training error, one wants to control the trade-off between false positives and false negatives, it is possible to modify K as follows:

\begin{displaymath}K \leftarrow K + \lambda D,
\end{displaymath} (17)

where D is a diagonal matrix whose entries are either $d^{+}$ or $d^{-}$, in the locations corresponding to positive and negative examples, respectively. It is possible to prove that this technique is equivalent to controlling the size of the $\alpha_i$ in a way that depends on the size of the class, introducing a bias for larger $\alpha_i$ in the class with smaller d. This in turn corresponds to an asymmetric margin; i.e., the class with smaller d will be kept further away from the decision boundary [Veropulos et al., 1999].

In this work, the extreme imbalance of the two classes, along with the presence of noise, creates a situation in which points from the minority class can easily be mistaken for mislabelled points. Enforcing a strong bias against training errors in the minority class provides protection against such errors and forces the SVM to make the positive examples support vectors. Thus, choosing $d^{+}=\frac{1}{n^{+}}$ and $d^{-}=\frac{1}{n^{-}}$ provides a heuristic way to automatically adjust the relative importance of the two classes, based on their respective cardinalities. This technique effectively controls the trade-off between sensitivity and specificity [Veropulos et al., 1999].
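
The per-class modification can be sketched in the same way; $d^{+}$ and $d^{-}$ are kept as explicit parameters, with the cardinality-based heuristic described above as the default (the function and its names are illustrative assumptions).

\begin{verbatim}
import numpy as np

def asymmetric_margin_kernel(K, y, lam, d_pos=None, d_neg=None):
    """Eq. (17): K <- K + lambda * D, where D is diagonal with entries
    d+ for positive examples and d- for negative ones.  By default,
    d+ = 1/n+ and d- = 1/n-, the heuristic described in the text."""
    d_pos = 1.0 / np.sum(y == +1) if d_pos is None else d_pos
    d_neg = 1.0 / np.sum(y == -1) if d_neg is None else d_neg
    D = np.where(y == +1, d_pos, d_neg)
    return K + lam * np.diag(D)
\end{verbatim}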

