

Weighting Hidden Markov Models For Maximum Discrimination

Rachel Karchin and Richard Hughey[*]
Department of Computer Engineering
Jack Baskin School of Engineering
University of California
Santa Cruz, CA 95064

PREPRINT to appear in Bioinformatics, 1998

Abstract

Motivation

Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea.

Results

One good method scales sequence weights by an exponential that ranges between 0.1 for the best scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman, training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction).

Availability

A World-Wide Web server, as well as information on obtaining the Sequence Alignment and Modeling (SAM) software suite and additional data from this work, can be found at http://www.cse.ucsc.edu/research/compbio/sam.html.

Contact

Email: rph@cse.ucsc.edu.

Introduction

Linear hidden Markov models (HMMs) [Haussler et al., 1993,Krogh et al., 1994] or generalized profiles [Bucher & Bairoch, 1994] have been demonstrated to be very effective in detecting conserved patterns in multiple sequences [Hughey & Krogh, 1996,Baldi et al., 1994,Eddy et al., 1995,Eddy, 1995,Bucher et al., 1996,McClure et al., 1996,Karplus et al., 1997]. When provided with a set of unaligned or aligned ``training'' sequences containing members of a protein family, the resulting HMM can identify the positions of amino acids that describe the conserved first-order structure of the family.

The HMM can also be used, via dynamic programming, to generate a multiple alignment of the unaligned sequences from which the model was built. The new multiple alignment reveals the regions of the trained sequences that the process found to be conserved and can provide new information about the common structure underlying the sequences in the family. In addition, the HMM can discriminate between family and non-family members in a sequence database search.

This paper studies several sequence weighting methods that improve the ability of an HMM to discriminate between members and non-members of a protein family. Sequence weighting addresses a potential weakness in the model building process. If there is a high degree of redundancy in the set of sequences being used, the model will over-represent the similar sequences. By assigning larger weights to poorly scoring sequences during model training, we have been able to significantly reduce the number of errors made when the resulting model is used to recognize members of a particular family.

The methods we have considered are based entirely on the log-odds score [Altschul, 1991] of the sequence against the model being trained. This investigation was motivated by HMMer's maximum-discrimination training method [Eddy et al., 1995,Eddy, 1995]. HMMer uses a simulated annealing approach, in which the number of sequences actively in the training set is reduced over the course of training. When maximum discrimination training is used, sequence weights are set according to their log-odds scores, so that sequences that poorly match the model are given higher weight. As training iterations proceed, the model is modified with data from each new sequence as it is scored against the model.

SAM uses expectation-maximization in a batch mode of training [Krogh et al., 1994,Hughey & Krogh, 1996]. In this case, the entire training set is aligned to the model, and a completely new model is generated by combining the information from these alignments with a transition regularizer and a Dirichlet mixture prior [Sjölander et al., 1996]. (SAM uses the full forward-backward algorithm, so more precisely, the sum of probabilities across all possible alignments is calculated for each sequence.) SAM can produce a model from unaligned sequences, as was done in these experiments, or it can be given an initial seed alignment, which often improves results.

Since it has been known for some time that the lack of sequence weighting is a weakness in SAM, we added the ability to weight sequences to the new version of SAM. This came in two phases. The first is external weighting, in which a file of weights can be read in and assigned to sequences. Our group currently uses this with either position-specific weighting [Henikoff & Henikoff, 1994] or an entropy-based method [Karplus, 1997], using an unpublished method to assign the total weight (without regularization, only relative weight matters; with regularization, required for small sample sizes, the total weight influences how strongly the sequences can overpower the regularizer [Karplus, 1995]). Both of these methods are based on analyzing a multiple alignment of the training sequences. When one is not available, a rough alignment is created using unweighted HMM training, the sequences are weighted based on that alignment, and then a new HMM is trained using those weights.
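In outline, this two-pass file-based procedure can be sketched as follows. This is an illustration only: train_hmm, align_to_model, and alignment_weights are hypothetical stand-ins for SAM's buildmodel program, its alignment step, and an alignment-based weighting scheme, not SAM's actual interface.

\begin{verbatim}
def external_weighting(sequences):
    # Pass 1: train an unweighted HMM directly from the unaligned
    # sequences.
    rough_model = train_hmm(sequences, weights=None)      # hypothetical
    # Use the rough model to produce a multiple alignment.
    alignment = align_to_model(rough_model, sequences)    # hypothetical
    # Compute alignment-based weights (position-specific or
    # entropy-based); without a regularizer only relative weight
    # matters.
    weights = alignment_weights(alignment)                # hypothetical
    # Pass 2: retrain the HMM with the externally supplied weights.
    return train_hmm(sequences, weights=weights)
\end{verbatim}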

In an effort to add standalone weighting (i.e., without the extra steps for the file-based weighting), we examined a number of maximum-discrimination style weighting schemes, that is, schemes that depend only on the log-odds score of the sequence against the model and in which weights are continuously updated during HMM training. This analysis has led us to make two of the weighting methods, described below, permanent additions to the Sequence Alignment and Modeling software system.

On completing our evaluation of these different weighting schemes, we performed a more general analysis of our best weighting schemes based on the recent work of Agarwal and States [Agarwal & States, 1998]. Using 67 superfamilies from an annotated version of PIR [Pearson, 1995], they compared several pairwise sequence comparison methods, leading to the observation that Probabilistic Smith-Waterman (PSW) [Bucher & Hoffmann, 1996] was the best discrimination technique tested for whole proteins. To their PSW results, we add evaluations of SAM's new internal weighting, HMMer's maximum discrimination training, and Meta-MEME's combination of MEME motif location with HMMer hidden Markov model scoring [Bailey & Elkan, 1994,Grundy et al., 1997,Grundy, 1998].

Systems and Methods

All work described in this paper was performed using the SAM HMM program suite version 2.0 on DEC Alpha 255 workstations and a DEC AlphaServer. SAM is programmed in ANSI C for Unix workstations. The HMMer experiments were performed with HMMer version 1.8.1. The Meta-MEME experiments used MEME and MAST version 2.0 and Meta-MEME version 1.0.

A typical training time for 10 sequences, such as with the cytochrome P450 superfamily with average length 498, was 10.7 minutes for SAM, 2.5 minutes for HMMer, and 4.2 minutes for Meta-MEME. Search times against the reference database were 19.9 minutes for SAM, 12.0 minutes for HMMer, and 0.4 minutes for Meta-MEME.

Algorithm

Hidden Markov Models

Given a linear hidden Markov model M, a dynamic programming calculation can be used to compute, for a given sequence a, the probability that the sequence was generated by the model, P(a|M). The question of interest, however, is whether the model matches the sequence: is the sequence more likely to have been generated by the hidden Markov model than by some other, less structured null model $\phi$? Making the assumption that the models M and $\phi$ are a priori equally likely, this reduces to the posterior probability

\begin{displaymath}
P(M\vert a) = \frac{P(a\vert M)}{P(a\vert M)+P(a\vert\phi)}.\end{displaymath}

Typically, a log-odds score S is used instead [Altschul, 1991]:

\begin{displaymath}
S = \ln \frac{P(a\vert M)}{P(a\vert\phi)},\end{displaymath}

so that $P(M\vert a) = 1/(1+e^{-S})$. This score measures whether the probability that the sequence was generated by the model is greater than the probability that it was generated by a null model (in SAM's case, the geometric average of the modeled sequence positions [Barrett et al., 1997]). This log-odds score is used to calculate the weight of a sequence. (SAM actually uses negative log-odds but, for this discussion, we consistently use positive log-odds scores.)

HMMer's maximum discrimination method gives each sequence a weight of 1-P(M|a). Thus, if a sequence perfectly matches the model M, it is given no weight at all, while a sequence that poorly matches the model is given unity weight. In score terms, weights are set to $1/(1+e^{S})$. One potential problem with this score function is that a model could, for example, alternately score one group of sequences well and a second poorly and then, because of the weighting, score the second group well after retraining. The HMMer maximum discrimination algorithm deals with this in two ways. First, training is done serially; the model is updated after each sequence is scored against the model. Second, a mixing parameter is used to ensure that any given sequence does not change the model too much.
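As a concrete illustration, this score-to-weight mapping can be written in a few lines of Python; the function name and the optional offset argument (used by Method 1 below) are our own notation, not HMMer's code:

\begin{verbatim}
import math

def max_discrimination_weight(score, offset=0.0):
    # Weight a sequence by 1 - P(M|a): with log-odds score S,
    # P(M|a) = 1/(1 + exp(-S)), so 1 - P(M|a) = 1/(1 + exp(S)).
    # An offset o shifts the curve so that only sequences scoring
    # significantly better than o lose weight.
    return 1.0 / (1.0 + math.exp(score - offset))
\end{verbatim}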

Weighting Methods

For SAM, we considered several different weighting functions, all loosely based on the maximum discrimination equation above. We had four criteria in selecting candidate weighting functions:

1. In general, poorly scoring sequences should receive larger weights than well scoring sequences.
2. Weights should not get too large.
3. Weighting should not produce excessive iterations when a model is built.
4. The weighted model should have an improved ability to recognize sequences that belong to a particular protein family.


 
Table I: Notation.
W    weight
S    a sequence's score
K    weight of the best scoring sequence (user defined)
b    the best score in the training set
w    the worst score in the training set
m    the median score in the training set
o    user-defined offset value
W_s  weight of an individual sequence s
N    size of the training set
x    mix of old weight and new weight
To achieve these ends, we designed eight ``candidate'' functions described below. We also included weight normalization and weight ``mixing'' capabilities to optimize our results.

The methods are numbered in roughly chronological order, perhaps revealing some of our thought processes in creating the experiments.

The first weighting function is directly from HMMer's maximum discrimination method, but with additional parameters. Here, the weighting can be adjusted using an offset parameter o that is not necessarily zero. Any sequence with score better than o has reduced weight, while any sequence that scores worse than o has increased weight. We typically set o to the log of the database size, following the ideas of algorithmic significance [Milosavljevic & Jurka, 1993]. Thus, this offset encourages the model to score sequences not just better than the null model, but significantly better than the null model. The Method 1 formula has a second parameter, 0<K<1, that is used to temper the effects of weighting on the sequences:
\begin{displaymath}
W = 1 - \frac{K}{1 + e^{(o-S)}}.\end{displaymath} (1)

The second weighting function is the first of our relative weighting functions, which attempt to remove the offset parameter to reduce the number of free variables. The worst score w and the best score b are used to decide on two extreme weights. The worst scoring sequence will have a weight of 1, while the best scoring sequence will have a weight of K, typically in the 0.01 to 0.1 range:
\begin{displaymath}
W = e^{ (w-S) \cdot \frac{\ln K}{w-b}}.\end{displaymath} (2)

Method 3 is again similar to the maximum discrimination formula, but instead of using an arbitrary offset, the median of all scores in the training set is used. Thus, sequences that score worse than the median m receive close to a unity weight, while those that score better than the median receive a small weight:
\begin{displaymath}
W = 1 - \frac{K}{e^{(m-S)}}.\end{displaymath} (3)

Method 4 is basically Method 2 with an offset used to scale the scores, rather than the worst score. Here, sequences that score better than the offset receive weights less than one, while sequences that score worse than the offset can receive very high weights:

\begin{displaymath}
W = e^{ (S-o) \cdot \frac{\ln K}{b-o} }.\end{displaymath} (4)

Continuing with variations on a theme, Method 5 uses the median and the best score to determine the exponential weight curve:
\begin{displaymath}
W = e^{ (S-m) \cdot\frac{\ln K}{b-m} }.\end{displaymath} (5)

One problem with Methods 2, 4 and 5 is that the scaling depends very strongly on the worst score. This can be a particular problem if the training set is polluted, that is, if a non-family member is included in the training set. In an effort to deal with these possible outliers, Method 6 is a piecewise weighting function. The primary weighting function,
\begin{displaymath}
W = e^{ (S-m) \cdot \frac{\ln K}{b-m} },\end{displaymath} (6)
is applied to scores that are at most 3 sample deviations below the mean score, while the poorest scoring sequences receive a weight that is linear in their score, allowing for the fact that they may not really be family members:

\begin{displaymath}
W = 1 - \frac{S-w}{b-w}.\end{displaymath}

Method 7 is a similar function based on the best and worst scores, rather than the median and best scores. Here, the equation
\begin{displaymath}
W = e^{ (w-S) \cdot\frac{\ln K}{w-b}}\end{displaymath} (7)
is applied to scores no more than 3 deviations below the median, while

\begin{displaymath}
W = 1 - \frac{S-w}{b-w}\end{displaymath}

is applied to the remaining sequences.

Finally, Method 8 is a simple linear function,
\begin{displaymath}
W = K + \frac{b-S}{b-w}.\end{displaymath} (8)
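To make the candidates concrete, the following sketch implements equations (1) through (8) in Python, assuming the positive log-odds convention used in this discussion (higher scores are better). The function and argument names, and the use of a precomputed sample mean and deviation for the piecewise tests, are our illustration rather than SAM's source code.

\begin{verbatim}
import math

# S: a sequence's score; b, w, m: best, worst and median scores in
# the training set; o: user-defined offset; K (0 < K < 1): weight of
# the best scoring sequence; mean, sd: sample mean and deviation.

def method_1(S, o, K):
    # Eq. (1): maximum discrimination with an offset.
    return 1.0 - K / (1.0 + math.exp(o - S))

def method_2(S, b, w, K):
    # Eq. (2): exponential from weight 1 (worst score) down to K (best).
    return math.exp((w - S) * math.log(K) / (w - b))

def method_3(S, m, K):
    # Eq. (3): median-anchored; note the weight can go negative
    # when S is far above m.
    return 1.0 - K / math.exp(m - S)

def method_4(S, b, o, K):
    # Eq. (4): like method 2, but scaled by the offset rather than
    # the worst score.
    return math.exp((S - o) * math.log(K) / (b - o))

def method_5(S, b, m, K):
    # Eq. (5): exponential determined by the median and best scores.
    return math.exp((S - m) * math.log(K) / (b - m))

def method_6(S, b, w, m, K, mean, sd):
    # Eq. (6): piecewise; a linear fallback for outliers more than
    # 3 sample deviations below the mean.
    if S >= mean - 3.0 * sd:
        return math.exp((S - m) * math.log(K) / (b - m))
    return 1.0 - (S - w) / (b - w)

def method_7(S, b, w, m, K, sd):
    # Eq. (7): as method 2 for scores within 3 deviations of the
    # median; the same linear fallback for the poorest scores.
    if S >= m - 3.0 * sd:
        return math.exp((w - S) * math.log(K) / (w - b))
    return 1.0 - (S - w) / (b - w)

def method_8(S, b, w, K):
    # Eq. (8): a simple linear ramp from K (best) to 1 + K (worst).
    return K + (b - S) / (b - w)
\end{verbatim}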

Weight Normalization and Mixing

Whenever the total weight of all the sequences exceeds the number of sequences (it is unlikely that N sequences could ever contain more than N sequences' worth of information!), the weights are renormalized to sum to N. To deal with the opposite case, of the total sequence weight becoming too small (below $\ln N$), the weights are renormalized as well.


To account for weights varying too rapidly between training iterations, we experimented with an annealing schedule similar to that used by SAM for noise injection. This weight mixing function decides how much influence the calculated $W_{\mathrm{new}}$ can have on the previous iteration's $W_{\mathrm{old}}$:
\begin{displaymath}
W = (1 - x) W_{\mathrm{old}} + x \cdot W_{\mathrm{new}},\end{displaymath} (9)
where x is the ``weight mix'' parameter, decreased linearly from one to zero over the training iterations at a rate that is either a default or specified by the user.
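The normalization and mixing rules can be sketched as follows. Treating both renormalization triggers as rescaling the total weight to N is our reading of the description above, and the names are ours.

\begin{verbatim}
import math

def renormalize(weights):
    # Rescale so the weights sum to N whenever the total exceeds N
    # or falls below ln N (one reading of the rule above).
    n = len(weights)
    total = sum(weights)
    if total > n or total < math.log(n):
        return [wi * n / total for wi in weights]
    return list(weights)

def mix_weights(old, new, x):
    # Eq. (9): W = (1 - x) * W_old + x * W_new, with the mix
    # parameter x annealed from one toward zero over training.
    return [(1.0 - x) * wo + x * wn for wo, wn in zip(old, new)]
\end{verbatim}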

Method Evaluations

Following previous work [Barrett et al., 1997], we initially tested these eight methods on three protein families: globins, ferredoxins and EF-hands. Models were generated from large training sets of known family members using SAM's buildmodel program. We would expect such models to be quite good at discriminating between family and non-family members.

For each family, we built five models using each prospective weighting method and a variety of parameters. We wanted to compare their discrimination performance with unweighted SAM models.

We tested discrimination performance by using SAM's hmmscore program with each model to calculate the log-odds scores of sequences in two discrimination sets. The ``true'' set is composed of members of the given protein family; the ``false'' set is composed of non-family members.

The true sets were assembled by using a Prosite keyword search on the SwissProt 32 database (685 globins, 100 ferredoxins, 147 EF-hands). The false sets consisted of all remaining sequences in the SwissProt 32 database after extraction of the true set. Although our true and false sets undoubtedly had errors, the same errors were present for all methods.

We then plotted the resulting number of false positives versus the number of false negatives over a range of score thresholds. False positives are non-family sequences that the model identifies as family members; that is, their log-odds score is better than that of the worst scoring family member. False negatives are family members that the model misses; they score worse than the best scoring non-family members.

The point where the number of false positives equals the number of false negatives is called the equivalence number [Pearson, 1995,Agarwal & States, 1998]. We calculated the equivalence number (EN) for each of our test and control models. This number is a good standard of discrimination performance. It provides an upper bound on the minimum number of discrimination errors (MER)

\begin{displaymath}
\mathrm{MER} \leq 2 \cdot \mathrm{EN}\end{displaymath}

and it balances model sensitivity and selectivity; false negatives occur when a model is not sensitive enough and false positives occur when a model is not selective enough.
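A minimal sketch of computing the equivalence number from the two score lists, assuming higher scores are better; the threshold sweep is our illustration of the definition, not the evaluation code used here:

\begin{verbatim}
def equivalence_number(true_scores, false_scores):
    # Sweep the score threshold and report the error count at the
    # point where false positives and false negatives balance.
    thresholds = sorted(set(true_scores) | set(false_scores),
                        reverse=True)
    best_gap, en = None, None
    for t in thresholds:
        fp = sum(1 for s in false_scores if s >= t)  # non-members accepted
        fn = sum(1 for s in true_scores if s < t)    # members missed
        if best_gap is None or abs(fp - fn) < best_gap:
            best_gap, en = abs(fp - fn), max(fp, fn)
    return en  # so that MER <= 2 * EN
\end{verbatim}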

To evaluate computational performance, we kept track of the number of iterations required to build each model and the range of weights assigned to sequences in each experiment.

Four of our methods (1, 3, 5 and 6) had poor computational performance. We eliminated methods and parameter settings that made more discrimination errors than an unweighted model, added excessive iterations to the model building process, or were numerically unstable. The remaining four methods (2, 4, 7 and 8) performed very well with K (the weight of the best scoring sequence before normalization) set to 0.1, so we continued our study using these.

At this point, we also revised the training procedure with the use of a different Dirichlet mixture prior (mall-opt.9comp) and an improved local (Smith-Waterman style) scoring algorithm [Smith & Waterman, 1981,Tarnas & Hughey, 1998]. All other parameters, apart from weighting parameters, were defaults.

Further tests with size 15 training sets showed that for globins and ferredoxins, methods 2 and 4 had the best discrimination performance. For EF-hands, position-specific weighting and method 7 performed better than methods 2 and 4. Computationally, method 4 performed the worst, requiring two to three times as many model building iterations as the other internal methods. We restricted further experiments to method 2, method 7, unweighted training, external entropy weighting, and external position-specific weighting [Henikoff & Henikoff, 1994]. As mentioned, the two external methods consist of training an unweighted SAM model, using that model to form a multiple alignment, and then passing the multiple alignment to the weighting algorithm.

Training Set Size and Discrimination Errors

Our next concern was to see if training set size made a difference in a model's discrimination accuracy. To speed method evaluation, we created smaller discrimination sets by training an unweighted SAM model on each entire family (SwissProt keyword extraction) and then scoring this model against the entire SwissProt 32 database. The true set was created by selecting the 100 worst scoring true sequences from each family, and the false set by selecting the 2000 best scoring false sequences. Thus, these sets included the sequences that were difficult for the unweighted HMM to categorize.
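This selection step is simple to express over (identifier, score) pairs; a sketch with names of our choosing, again assuming higher scores are better:

\begin{verbatim}
def reduced_sets(scores, family_ids, n_true=100, n_false=2000):
    # scores: (sequence_id, score) pairs from scoring the unweighted
    # family model against the whole database.
    trues = [p for p in scores if p[0] in family_ids]
    falses = [p for p in scores if p[0] not in family_ids]
    # Keep the 100 worst scoring family members ...
    hard_trues = sorted(trues, key=lambda p: p[1])[:n_true]
    # ... and the 2000 best scoring non-members.
    hard_falses = sorted(falses, key=lambda p: p[1],
                         reverse=True)[:n_false]
    return hard_trues, hard_falses
\end{verbatim}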


  
Figure 1: Globins family: Comparison of weighted and unweighted models' average equivalence numbers by training set size. Weight mixing used.


  
Figure 2: Ferredoxins family: Comparison of weighted and unweighted models' average equivalence numbers by training set size. Weight mixing used.


  
Figure 3: EF-hands family: Comparison of weighted and unweighted models' average equivalence numbers by training set size. Weight mixing used.

We studied training set sizes of 2, 5, 10, 15, 30, 50, 65, 85 and 100. For each size, we randomly formed five different training sets and calculated the minimum, maximum and average equivalence numbers of the models created with each method.

While running the experiments on small training sets, we observed that sequence scores oscillated indefinitely, a known problem with maximum discrimination score functions.

When weight mixing was added, the oscillations were successfully damped. We tried several values of the x parameter in equation (9) and discovered that equivalence numbers were minimized when x=0.8; as a result, we have included weight mixing with parameter 0.8 as the default when SAM internal weighting is used.

Figures 1, 2 and 3 show the average equivalence number of each weighting method, as training set size increases from 2 to 100, by its difference from the average equivalence number of an unweighted SAM model calculated over the same five training sets.

This lets us see clearly how the weighting method improves on the discrimination performance of an unweighted model. The improvement can be measured by decreases in average equivalence numbers from unweighted to weighted models.

For the globins family, SAM methods 2 and 7 clearly have the best overall performance.

For ferredoxins, SAM methods 2 and 7 do best overall; however, entropy weighting does an outstanding job for size 50 training sets, and position-specific weighting is best by a small margin for size 100 training sets.

For EF-hands, position-specific weighting has the best overall performance, and its margin of superiority increases as training sets get larger. Method 2 also does well: for size 85 sets it does better than position-specific weighting, although at size 100 it does poorly. For very small training sets (fewer than 15 sequences), methods 2 and 7 do best. Entropy weighting has the weakest discrimination performance on this family.


  
Figure 4: Average and range of equivalence numbers for each size group of ferredoxin training sets for several weighting methods. The internal weighting methods 2 and 7 perform somewhat better than the unweighted training, especially on small training sets. The off-line position-specific and entropy-based weighting produce a wide range of results.

Range of Errors Made By Different Methods

In addition to the average equivalence numbers, we looked at the ranges between minimum and maximum equivalence numbers for training sets of the same size.

We observed that the position-specific and entropy weighting methods had more unstable results than the SAM internal methods. For training sets where these methods have excellent average equivalence numbers, the actual range of equivalence numbers is frequently quite large. This variation could be due to sensitivity to the single alignment (from an HMM trained with unweighted sequences) from which the external weights are calculated. Equivalence number ranges for the ferredoxins family are shown in Figure 4.

All methods tested were somewhat unstable on the EF-hands family; external position-specific weighting was the most unstable.


  
Figure 5: Weight versus score snapshots during the training process of 100 ferredoxins. The score range changes between plots as the model specializes. Training iterations 1, 2, 10 and 26 (the last) are shown.

Dynamic Behavior of Weights

One great advantage of internal weighting methods is that the weights change dynamically as the model is trained. When training 100 ferredoxins (Figure 5), two outliers receive particularly high weights when the sequences are first compared against the untrained model in iteration 1. In the second iteration, the sequences are much more clustered. Because of the weight mixing, however, the two outliers' weights are only partly reduced toward that of a typical sequence receiving the same score. After 10 iterations, the weights are nearly fixed for the remainder of the training process. At the last iteration, 26 in this case, the best scoring sequence is one that was hard to model during one of the initial iterations, and thus received a higher weight.

Additional Protein Families

To find out whether our methods had value beyond these three protein families, and to compare our discrimination results with other methods in general use, we ran another set of experiments using training and discrimination sets based on Pearson's annotated version of the PIR 39 database, which contains 12219 proteins [Pearson, 1995].

Agarwal and States analyzed five single-sequence search algorithms using a single sequence from each of the 67 superfamilies [Agarwal & States, 1998]: Smith and Waterman SSEARCH (SW) [Smith & Waterman, 1981,Pearson & Lipman, 1988], FASTA (FA) [Pearson & Lipman, 1988], Probabilistic Smith-Waterman (PS) [Bucher & Hoffmann, 1996], BLASTP 1.4 (BP) [Altschul et al., 1990], and WU-BLASTP2 (B2) [Altschul et al., 1990,Altschul & Gish, 1996]. The general result of this work was that MERs are an effective evaluation criterion, and that Probabilistic Smith-Waterman had the best overall performance on complete proteins.


 
Table II: Comparison of Agarwal and States MER of Best Method to SAM MER of Best Method. Includes family number, family name, length of single-sequence query, number of sequences in the family, and the training set size used to get SAM best results.
                              Agarwal/States            SAM
# Superfamily Length Size   Best method  Best MER    Best method  Best MER  Trainset size
1 hemoglobin alpha/beta 141 505 PS 14 m2 7 10
2 Ig kappa chain V-I region 108 280 PS 11 m2 3 10
3 G-prot.coupled receptors 348 165 SW 26 psp,ent   10
4 cytochrome C 105 142 PS 18 m7 16 10
5 snake neurotoxin 74 109 PS/SW/FA/B2   nw,m2,m7   2,5,10
6 calcium binding EF-hand 159 106 SW/B2 11 m7 5 10
7 glutathione transferase 222 106 PS 3 psp   10
8 protein kinase, cAMP-dep 351 97 SW 70 ent   10
9 ferredoxin 54 93 SW 57 m2,m7 9 5,10
10 ribulose-bisphosphate 139 77 all 2 nw,m2,m7   1,2,5,10
  carboxylase              
11 Ig kappa chain C region 106 74 PS/SW/B2 18 nw,m7 7 10
12 hemagglutinin 567 73 PS/SW/FA/B2   all   1,10
13 histocompatibility antigen 338 71 PS/SW   nw,m2,m7 1 1,10
14 insulin 110 69 all 3 nw,m2,m7 3 2,5,10
15 alpha-crystalline chain A 173 67 PS 2 m7 3 10
16 phospholipase A2 148 58 SW/B2   m7 1 5
17 glyceraldehyde-3-P DH 335 46 PS/SW/BP/B2   all   1,2,5,10
18 transforming prot.(N-ras) 189 45 all 1 nw,m2,m7 1 1,2,5,10
19 serine protease 246 45 FA 16 ent   10
20 glucagon precursor 180 44 B2 6 m2,m7 2 10
21 H+ transporting ATP 553 43 PS/SW/FA 1 all   5,10
  synthase alpha chain              
  precursor              
22 hemagglutinin-neuraminidase 576 42 all   all   1,2,5,10
23 ribonuclease 124 40 all   nw,m2,m7   1,2,5,10
24 interferon alpha-I-6 189 39 all   nw,m2,m7   1,2,5,10
25 glutamate-ammonia ligase 373 39 PS/SW/FA/B2   nw,m2,m7   1,2,5,10
26 azurin 129 38 SW 17 m2,m7 3 10
27 fusion protein-Sendai virus 565 36 all   all   1,2,5,10
28 cytochrome P450 497 35 PS/SW   nw,m2,m7,psp   10
29 outer capsid protein VP8 280 34 all   all   1,2,5,10
30 gag polyprotein 512 33 PS/SW/BP/B2 3 psp,ent   2,5,10
31 keratin 471 32 BP 4 ent   5
32 nucleoprotein-influenza A 498 31 all   nw,m2,m7   5,10
33 acidic ribosomal protein P2 115 29 PS 3 nw,m2,m7 5 10
34 E6 protein papillomavirus 158 29 PS/SW/BP/B2   nw,m2,m7,psp   9
35 lysozyme 130 28 all   nw,m2,m7   1,2,5,10
36 N-cadherin 906 27 all   all   1,2,5,10
37 exo-alpha-sialidase 454 27 all   nw,m2,m7,psp   1,2,5,10
38 L2 protein papillomavirus 507 27 all   all   1,2,5,10
39 scorpion neurotoxin 64 26 PS/SW/FA/B2 5 m2 2 10
40 E7 protein papillomavirus 98 26 PS/SW/FA   m2,m7   10
41 H+ transporting ATP synthase 75 26 B2   nw,m2,m7   2,5,10
  lipid-binding              
42 L-lactate dehydrogenase 333 26 PS/SW/B2   nw,m2,m7,psp   10
43 E2 protein papillomavirus 322 26 all   m2,m7   5,10
44 core antigen - hepatitis B 183 25 all   nw,m2,m7   10
45 antithrombin-III 464 25 PS/SW/BP/B2   psp,ent 1  
46 thymidine kinase 376 25 BP   m2   2
47 phycocyanin 162 25 all   nw,m2,m7   2,5,10
48 protamine Y2 34 24 SW/FA/B2 1 nw,m2,m7 1 5,10
49 transforming prot (myc) 439 24 all   nw,m2,m7   1,2,5,10
50 matrix protein 348 24 PS/SW/BP/B2   nw,m2,m7   10
51 H+ transporting ATP 226 23 PS/SW/B2 1 ent   10
  synthase P6              
52 alcohol dehydrogenase A 375 23 all   nw,m2,m7,psp   1,2,5,10
53 glycoprotein B 857 23 all   all   1,2,5,10
54 ionotropic acetylcholine 457 23 all   all   1,2,5,10
  receptor              
55 nonstructural protein NS2 121 22 PS   nw,m7   1,2
56 annexin I 346 22 all 4 psp,ent   10
57 histone H1b 218 22 SW/BP/B2 2 all   5,10
58 metallothionein 61 21 B2 3 m2 3 10
59 Beta-crystallin chain Bp 204 21 all   nw,m2,m7   1,2,5,10
60 proteinase inhibitor 71 21 PS 2 nw 1 10
61 hepatic lectin H1 291 20 SW/FA/B2 1 m2,m7   10
62 E2 glycoprotein precursor 1447 20 all   all   1,2,5,10
63 alpha-2u-globulin precursor 181 20 PS 6 psp   10
64 pepsin 388 20 all   all   1,2,5,10
65 DNA-directed DNA polymerase 1462 20 SW/FA/BP 1 psp   2
66 prolactin 227 20 all   nw,m2,m7   1,2,5,10
67 vitamin B12 tran. btuD 249 20 PS 3 psp   10

Table II summarizes the results of both our experiments and those of Agarwal and States. Each row lists one of the 67 superfamilies, the length of the sequence used in the single-sequence search, the size of the superfamily, the best MER obtained by Agarwal and States, and the method that produced that MER. To this we added the best SAM method for each family, the best MER, and the size of the training set that produced that MER.

We calculated the SAM MERs by building models trained on the single superfamily search sequence used by Agarwal and States and models trained on small training sets (2, 5 and 10) containing the search sequence and a few other randomly-selected superfamily members from PIR 39. As in the Agarwal and States result, the discrimination sets consisted of the superfamily members from PIR 39 and the remaining (non-family) sequences in the database.


 
Table III: Comparison of Probabilistic Smith-Waterman (from Agarwal and States) to unweighted, weighted method 7, and PSP SAM HMM training, HMMer, and Meta-MEME (using MEME, Mast, and HMMer) with various training set sizes. The training sets always include the single query sequence used for Probabilistic Smith-Waterman. SAM internal weighting with one sequence is algorithmically the same as unweighted SAM. The final rows indicate total number of errors, number of families better than and worse than PSW, net change in number of errors over PSW, and net number of improved families over PSW for each method.
                      SAM No Weights    SAM Internal       HMMer          Meta-MEME
#  Len  Size  PSW     1   2   5   10     2   5   10     1  2  5  10     1  2  5  10
1 141 505 14 39 48 27 25 37 13 8 113    73 27 19
2 108 280 11 56 106 25 5 100 17 5 130    133 18 4
3 348 165 29 52 55 35 19 25 29 18 125    105 56 28
4 105 142 18 30 28 22 29 29 18 16 41    37 36 34
5 74 109   1 2 1   9 2   85    40 1
6 159 106 15 9 8 6 8 9 6 5 59    56 28 13
7 222 106 3 13 30 5 15 16 2 1 80    69 38 9
8 351 97 74 71 58 59 37 71 62 39 78    73 63 56
9 54 93 58 58 15 14 22 26 16 9 74    44 40 28
10 139 77 2               55    5
11 106 74 18 19 61 19 7 37 18 7 60    67 48 14
12 567 73   1   1         4    2 1
13 338 71   1 6 6 1 3 9 1 19    6 4 1
14 110 69 3 4 3 3 3 3 3 3 12    8 6 3
15 173 67 2 5 6 7 9 6 4 3 11    10 7 7
16 148 58 1 2 2 2 2 2 1 2 14    5 2 2
17 335 46                 1    1
18 189 45 1 1 1 1 1 1 1 1 4    2 1 1
19 246 45 20 22 22 22 15 22 22 16 20    17 20 16
20 180 44 14 8 4 6 3 6 6 2 21    21 19 11
21 553 43 1 1 5   1 1     10    21
22 576 42                 3
23 124 40                 1    1
24 189 39
25 373 39                 12    1 21 1
26 129 38 19 17 25 5 4 19 4 3 27    27 8 5
27 565 36                      4
28 497 35   3 3     2 1   7    8 1 1
29 280 34                 1
30 512 33 3 4 7 3 3 4 3 3 12    10 8 4
31 471 32 5 5 1 1 2 2 4 3 2    7 5 2
32 498 31   1 1 1   1     1    1 1
33 115 29 3 10 9 9 5 9 6 5 11    10 7 5
34 158 29   2 2 2     2   3    1
35 130 28                 7
36 906 27
37 454 27                      3
38 507 27                 1
39 64 26 5 5 6 3 4 14 4 4 14    18 4 3
40 98 26   1 6 1 1 6     1
41 75 26 1 1             11    1
42 333 26   6 4     2     14    14 1
43 322 26   1 6 1 1 6
44 183 25     1           1    4
45 464 25   1 1     16 1   16    18 2
46 376 25 1 1 1 1 1 1 1 1 1    1 1 1
47 162 25   1             9
48 34 24 2 5 2 3 1 2 1 1 11    6 2 1
49 439 24         1       4    2 1 1
50 348 24   6             1    2
51 226 23 1 8 2 2 1 1 1 1 11    11 1
52 375 23                 7    2
53 857 23                      1
54 457 23     3 3     3   3    3
55 121 22     2 2 2   2 2 2    2 2 2
56 346 22 4 4 4 4 3 4 4 4 4    4 4 3
57 218 22 3 2 2     1     4    2 2 1
58 61 21 6 5 5 5 5 5 5 4 6    6 5 4
59 204 21
60 71 21 2 2 2 2 1 3 2 2 6    4 2 2
61 291 20 2 8 7   1 6     13    14 4 3
62 1447 20                 1      1
63 181 20 6 12 15 4 2 5 5 1 17    15 4 4
64 388 20                 1    1
65 1462 20 5 1 1     3     18    15 1
66 227 20                 9
67 249 20 3 6 13 5 4 5 5 3 18    17 9 7
Total MER 355 511 591 318 244 520 283 173 1307 1031 512    296 1206 879
#F Better PSW 0 8 10 15 21 12 18 23 1 4 11    21 0 4
#F Worse PSW 0 29 30 22 12 22 12 5 55 47 26    14 58 47
Improve (MER) 0 -156 -236 37 111 -165 72 182 -952 -676 -157    59 -851 -524
Improve (#F) 0 -21 -20 -7 9 -10 6 18 -54 -43 -15    7 -58 -43

 


To compare our results with other maximum discrimination methods, we repeated the experiments using HMMer and Meta-MEME (Table III). We built the HMMer models from the training sets used in the SAM experiments and scored the same discrimination sets with these models, using HMMer's hmmsw program. We used the default HMMer configuration. The impending rewrite of HMMer can be expected to have improved performance.

We also built HMMs using a combined MEME [Bailey & Elkan, 1994] and HMMer method called Meta-MEME [Grundy et al., 1997,Grundy, 1998]. In this procedure, MEME is used to locate common motifs in the training set, MAST finds the sequences in the training set with the strongest matches to these common motifs, and Meta-MEME builds an HMM in HMMer format based on the motifs and ``diagrams'' which describe the distribution of the motifs in the matching sequences. We again used the same training and discrimination sets. We ran the programs from the MEME suite using the settings described in the literature [Grundy, 1998]: a minimum motif width of 12, maximum motif width of 55, the ZOOPS motif model (zero or one motif occurrence per sequence), ten selected motifs from each training set, and use of only those motifs that appeared in more than half of the sequences in the training set to build the HMM.

The final rows of Table III give the total number of errors across the entire experiment and the number of families that have a better or worse MER than PSW. As can be seen, the HMM-based methods require 5 or 10 sequences to gain any improvement over the PSW results. For training sets of 1 or 2 sequences, PSW performs better than SAM (46%-66% higher MER), HMMer (190%-270% higher), or Meta-MEME (150%-240% higher), indicating further need for tuning the model building procedure for small data sets. For training sets of 5 sequences, unweighted SAM has a 10% better MER and weighted SAM a 20% better MER than PSW, while the HMMer and Meta-MEME MERs are 44% and 14% worse than PSW. At 10 training sequences, all HMM methods do better than PSW, with net MER improvements over PSW of 17%, 28%, 31%, and 51%, corresponding to MERs of 296, 255, 244, and 173, for HMMer, Meta-MEME, unweighted SAM, and weighted SAM. Method 2 performed slightly worse than method 7, with, for example, a total MER of 187 (47% better than PSW) for size 10 training sets.
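For example, the 51% figure for weighted SAM with 10 training sequences follows directly from the totals in Table III:
\begin{displaymath}
\frac{355 - 173}{355} \approx 0.51 .\end{displaymath}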

Based on these data, we conclude that SAM's new internal sequence weighting method is a success, significantly extending the discriminatory effectiveness of hidden Markov models.

Discussion

Our experiments lead us to believe that we can consistently improve the discrimination performance of a SAM HMM by using maximum discrimination weighting functions during model building. We have added Methods 2 and 7 to our current upgrade of SAM (2.0).

Our improvement in discrimination is not always dramatic, but it is very cheap. There is no noticeable increase in the time it takes to build the model when internal weighting is used.

We believe that a careful choice of Dirichlet mixture priors and transition regularizers could further improve results for a particular protein family or training set size. However, experiments to determine the best choice are beyond the scope of this paper.

The position-specific and entropy-based ``external'' SAM methods often performed very well but were not as stable as our standalone methods. It is possible that internally implemented versions of these methods, with continuous updating of weights during model training, would be more consistent and discriminate better than our current internal methods.

None of the HMM-based methods performs as well as Probabilistic Smith-Waterman with only one or two training sequences. Because HMMs are a generalization of the PSW method, there is no intrinsic reason why this should be the case. Instead, it indicates that our current library of Dirichlet mixture priors and transition regularizers is more finely tuned to training HMMs on groups of sequences. Further study of this problem could result in a range of different prior probabilities for different training set sizes and discrimination goals.

The comparison of single-sequence PSW with multiple-sequence HMMs is somewhat unfair. A more complete comparison would include a modified version of PSW that uses all sequences in the training set, for example by means of an iterative search technique [Pearson, 1997] or by averaging the scores [Grundy, 1998]. As our primary goal was to evaluate the effectiveness of internal weighting methods in SAM, this was considered beyond the scope of the paper.

Acknowledgments

We gratefully acknowledge Christian Barrett, David Haussler, and Kevin Karplus for their assistance and commentary on this work. This work was supported in part by NSF grant DBI-9408579 and its REU supplement, DOE grant DE-FG0395ER62112, and a grant from Digital Equipment Corporation.

References

Agarwal & States, 1998
Agarwal, P. & States, D. J. (1998).
Comparative accuracy of methods for protein-sequence similarity search.
Bioinformatics, 14 (1).

Altschul & Gish, 1996
Altschul, S. & Gish, W. (1996).
Local alignment statistics.
Methods Enzymol. 266, 460-480.

Altschul, 1991
Altschul, S. F. (1991).
Amino acid substitution matrices from an information theoretic perspective.
JMB, 219, 555-565.

Altschul et al., 1990
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990).
Basic local alignment search tool.
JMB, 215, 403-410.

Bailey & Elkan, 1994
Bailey, T. L. & Elkan, C. (1994).
Fitting a mixture model by expectation maximization to discover motifs in biopolymers.
In: ISMB-94 pp. 28-36, Menlo Park, CA: AAAI/MIT Press.

Baldi et al., 1994
Baldi, P., Chauvin, Y., Hunkapillar, T., & McClure, M. (1994).
Hidden Markov models of biological primary sequence information.
PNAS, 91, 1059-1063.

Barrett et al., 1997
Barrett, C., Hughey, R., & Karplus, K. (1997).
Scoring hidden Markov models.
CABIOS, 13 (2), 191-199.

Bucher & Bairoch, 1994
Bucher, P. & Bairoch, A. (1994).
A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation.
In: ISMB-94 pp. 53-61, Menlo Park, CA: AAAI/MIT Press.

Bucher & Hoffmann, 1996
Bucher, P. & Hoffmann, K. (1996).
A sequence similarity search algorithm based on probabilistic interpretation of an alignment scoring system.
In: ISMB-96 pp. 44-51, Menlo Park, CA: AAAI Press.

Bucher et al., 1996
Bucher, P., Karplus, K., Moeri, N., & Hoffman, K. (1996).
A flexible motif search technique based on generalized profiles.
Computers and Chemistry, 20 (1), 3-24.

Eddy, 1995
Eddy, S. (1995).
Multiple alignment using hidden Markov models.
In: ISMB-95, (Rawlings, C. et al., eds) pp. 114-120, Menlo Park, CA: AAAI/MIT Press.

Eddy et al., 1995
Eddy, S., Mitchison, G., & Durbin, R. (1995).
Maximum discrimination hidden Markov models of sequence consensus.
J. Comput. Biol. 2, 9-23.

Grundy, 1998
Grundy, W. N. (1998).
Family-based homology detection via pairwise sequence comparison.
In: Int. Conf. Computational Molecular Biology (RECOMB-98) , New York: ACM Press.

Grundy et al., 1997
Grundy, W. N., Bailey, T. L., Elkan, C. P., & Baker, M. E. (1997).
Meta-MEME: Motif-based hidden Markov models of protein families.
CABIOS, 13 (4), 397-406.

Haussler et al., 1993
Haussler, D., Krogh, A., Mian, I. S., & Sjölander, K. (1993).
Protein modeling using hidden Markov models: Analysis of globins.
In: Proceedings of the Hawaii International Conference on System Sciences volume 1 pp. 792-802, Los Alamitos, CA: IEEE Computer Society Press.

Henikoff & Henikoff, 1994
Henikoff, S. & Henikoff, J. G. (1994).
Position-based sequence weights.
JMB, 243 (4), 574-578.

Hughey & Krogh, 1996
Hughey, R. & Krogh, A. (1996).
Hidden Markov models for sequence analysis: Extension and analysis of the basic method.
CABIOS, 12 (2), 95-107.

Karplus, 1995
Karplus, K. (1995).
Regularizers for estimating distributions of amino acids from small samples.
In: ISMB-95 , Menlo Park, CA: AAAI/MIT Press.

Karplus, 1997
Karplus, K. (1997).
Personal communication.

Karplus et al., 1997
Karplus, K., Sjölander, K., Barrett, C., Cline, M., Haussler, D., Hughey, R., Holm, L., & Sander, C. (1997).
Predicting protein structure using hidden Markov models.
Proteins: Structure, Function, and Genetics, to appear.

Krogh et al., 1994
Krogh, A., Brown, M., Mian, I. S., Sjölander, K., & Haussler, D. (1994).
Hidden Markov models in computational biology: Applications to protein modeling.
JMB, 235, 1501-1531.

McClure et al., 1996
McClure, M., Smith, C., & Elton, P. (1996).
Parameterization studies for the SAM and HMMER methods of hidden Markov model generation.
In: ISMB-96 pp. 155-164, St. Louis: AAAI Press.

Milosavljevic & Jurka, 1993
Milosavljevic, A. & Jurka, J. (1993).
Discovering simple DNA sequences by the algorithmic significance method.
CABIOS, 9 (4), 407-411.

Pearson, 1995
Pearson, W. (1995).
Comparison of methods for searching protein sequence databases.
Protein Science, 4, 1145-1160.

Pearson & Lipman, 1988
Pearson, W. & Lipman, D. (1988).
Improved tools for biological sequence comparison.
Proc. Natl. Acad. Sci. USA, 85, 2444-2448.

Pearson, 1997
Pearson, W. R. (1997).
Identifying distantly related protein sequences.
CABIOS, 13 (4), 325-332.

Sjölander et al., 1996
Sjölander, K., Karplus, K., Brown, M. P., Hughey, R., Krogh, A., Mian, I. S., & Haussler, D. (1996).
Dirichlet mixtures: A method for improving detection of weak but significant protein sequence homology.
CABIOS, 12 (4), 327-345.

Smith & Waterman, 1981
Smith, T. F. & Waterman, M. S. (1981).
Identification of common molecular subsequences.
JMB, 147, 195-197.

Tarnas & Hughey, 1998
Tarnas, C. & Hughey, R. (1998).
Reduced space hidden Markov model training.
Bioinformatics, (to appear).


Footnotes

...Hughey
To whom correspondence should be addressed

