
Introduction

The advent of DNA microarray technology provides biologists with the ability to measure the expression levels of thousands of genes in a single experiment. Initial experiments [Eisen et al., 1998] suggest that genes of similar function yield similar expression patterns in microarray hybridization experiments. As data from such experiments accumulate, it will be essential to have accurate means for extracting their biological significance and for assigning functions to genes.

Currently, most approaches to the computational analysis of gene expression data attempt to learn functionally significant classifications of genes in an unsupervised fashion. A learning method is considered unsupervised if it learns in the absence of a teacher signal that provides prior knowledge of the correct answer. Existing gene expression analysis methods begin with a definition of similarity (or a measure of distance) between expression patterns, but with no prior knowledge of the true functional classes of the genes. Genes are then grouped using a clustering algorithm such as hierarchical clustering [Eisen et al., 1998, Spellman et al., 1998b] or self-organizing maps [Tamayo et al., 1999].
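As a minimal sketch of the unsupervised pipeline just described, the following Python fragment defines a similarity between expression patterns and then groups genes by agglomerative clustering. The specific choices here are assumptions made for illustration, not details taken from the cited work: distance is taken to be one minus the Pearson correlation, clusters are merged by average linkage, and the gene names and expression values in `toy_profiles` are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Distance between genes is 1 - Pearson correlation (an assumed choice);
    distance between clusters is the mean pairwise gene distance.
    """
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist[a, b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Invented toy profiles (one value per hypothetical experiment).
# No functional labels are supplied anywhere: the grouping is unsupervised.
toy_profiles = {
    "geneA": [0.1, 0.9, 1.8, 0.9, 0.2],
    "geneB": [0.0, 1.1, 2.1, 1.0, 0.1],
    "geneC": [1.9, 1.0, 0.1, 1.1, 2.0],
    "geneD": [2.2, 0.8, 0.0, 0.9, 1.8],
}
clusters = average_linkage(toy_profiles, n_clusters=2)
print(clusters)
```

Note that the procedure never sees a functional class: whether the resulting groups correspond to biological function has to be judged afterwards.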

Support vector machines (SVMs) [Vapnik, 1998, Burges, 1998, Schölkopf et al., 1999] and other supervised learning techniques adopt the opposite approach: they learn from examples whose correct classification is known in advance. SVMs have been successfully applied to a wide range of pattern recognition problems, including handwriting recognition, object recognition, speaker identification, face detection and text categorization [Burges, 1998]. SVMs are attractive because they rest on an extremely well-developed theory. A support vector machine finds an optimal separating hyperplane between members and non-members of a given class in an abstract space.

SVMs, as applied to gene expression data, begin with a collection of known classifications of genes. These collections, such as genes coding for ribosomal proteins or genes coding for components of the proteasome, contain genes known to encode proteins that function together and hence exhibit similar expression profiles. One could build a classifier capable of discriminating between members and non-members of a given class, such that, given expression data for a particular gene, one would be able to answer such questions as, "Does this gene code for a ribosomal protein?" Such a classifier would be useful in recognizing new members of the class among genes of unknown function. Furthermore, the classifier could be applied to the original set of training data to identify outliers that may have been previously unrecognized.

Whereas unsupervised methods determine how a set of genes clusters into functional groups, SVMs determine what expression characteristics of a given gene make it a part of a given functional group. Because the question asked by supervised methods is much more focused than the corresponding question asked by unsupervised methods, supervised methods can use complex models that exploit the specific characteristics of the given functional group.
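The member versus non-member question can be illustrated with a small Python sketch. It is not the method used in this work. It assumes a plain linear kernel in the original expression space, a hyperplane through the origin (no bias term), and a simple Pegasos-style subgradient solver for the soft-margin objective in place of a quadratic-programming formulation. The function names, the regularization constant `lam`, and all expression values are hypothetical.

```python
def train_linear_svm(profiles, labels, lam=0.1, epochs=50):
    """Fit a linear soft-margin SVM by Pegasos-style subgradient descent.

    labels are +1 (class member) or -1 (non-member). The hyperplane passes
    through the origin, a simplification assumed for this sketch.
    """
    w = [0.0] * len(profiles[0])
    t = 0
    for _ in range(epochs):
        for x, y in zip(profiles, labels):
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink w (the regularization part of the subgradient step).
            w = [(1.0 - 1.0 / t) * wi for wi in w]
            # If the example violated the margin, also step toward it.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def is_member(w, profile):
    """Answer the yes/no question: which side of the hyperplane?"""
    return sum(wi * xi for wi, xi in zip(w, profile)) > 0.0

# Invented training set: each row is one gene's expression profile over
# four hypothetical experiments; +1 marks known members of the class.
train_profiles = [
    [1.2, 0.9, -0.8, -1.0],
    [0.9, 1.1, -1.1, -0.7],
    [1.0, 0.7, -0.9, -1.2],
    [-0.9, -1.0, 1.1, 0.8],
    [-1.1, -0.8, 0.7, 1.0],
    [-0.7, -1.2, 1.0, 0.9],
]
train_labels = [1, 1, 1, -1, -1, -1]

w = train_linear_svm(train_profiles, train_labels)

# Query a gene of unknown function (values again invented).
print(is_member(w, [0.8, 1.0, -0.6, -0.9]))
```

Applying `is_member` back to `train_profiles` and comparing against `train_labels` corresponds to the outlier check mentioned above: a training gene that falls on the wrong side of the hyperplane is a candidate for closer inspection.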

We describe the first use of SVMs to classify genes based on gene expression. We analyze expression data from 2467 genes from the budding yeast S. cerevisiae measured in 79 different DNA microarray hybridization experiments [Eisen et al., 1998]. From these data, we learn five functional classifications from the MIPS Yeast Genome Database (MYGD) (http://www.mips.biochem.mpg.de/proj/yeast). In addition to SVM classification, we subject these data to analyses by four competing machine learning techniques: Fisher's linear discriminant [Duda and Hart, 1973], Parzen windows [Bishop, 1995], and two decision tree learners [Quinlan, 1997, Wu et al., 1999]. The SVM method significantly outperforms all other methods investigated here. Furthermore, investigation of genes for which the SVM classification differs from the MYGD classification reveals many interesting borderline cases and some plausible mis-annotations in the MYGD.


Michael Brown
1999-11-05