Rachel Karchin and Richard Hughey![]()
Department of Computer Engineering
Jack Baskin School of Engineering
University of California
Santa Cruz, CA 95064
PREPRINT to appear in Bioinformatics, 1998
Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea.
One good method scales sequence weights by an exponential that ranges between 0.1 for the best scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman, training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction).
A World-Wide Web server, as well as information on obtaining the Sequence Alignment and Modeling (SAM) software suite and additional data from this work, can be found at http://www.cse.ucsc.edu/research/compbio/sam.html.
Email: rph@cse.ucsc.edu.
Linear hidden Markov models (HMMs) [Haussler et al., 1993,Krogh et al., 1994] or generalized profiles [Bucher & Bairoch, 1994] have been demonstrated to be very effective in detecting conserved patterns in multiple sequences [Hughey & Krogh, 1996,Baldi et al., 1994,Eddy et al., 1995,Eddy, 1995,Bucher et al., 1996,McClure et al., 1996,Karplus et al., 1997]. When provided with a set of unaligned or aligned ``training'' sequences that contain members of a protein family, the resulting HMM can identify the positions of amino acids which describe conserved first order structure of the family.
The HMM can also be used to generate a multiple alignment of the unaligned sequences from which the model was built by using a dynamic programming method. The new multiple alignment reveals the regions in the trained sequences that the process found to be conserved and can provide new information about the common structure underlying the sequences in the family. In addition, the HMM can discriminate between family and non-family members in a sequence database search.
This paper studies several sequence weighting methods that improve the ability of an HMM to discriminate between members and non-members of a protein family. Sequence weighting addresses a potential weakness in the model building process. If there is a high degree of redundancy in the set of sequences being used, the model will over-represent the similar sequences. By assigning larger weights to poorly scoring sequences during model training, we have been able to significantly reduce the number of errors made when the resulting model is used to recognize members of a particular family.
The methods we have considered are based entirely on the log-odds score [Altschul, 1991] of the sequence against the model being trained. This investigation was motivated by HMMer's maximum-discrimination training method [Eddy et al., 1995,Eddy, 1995]. HMMer uses a simulated annealing approach, in which the number of sequences actively in the training set is reduced over the course of training. During the course of this training, when maximum discrimination training is used, sequence weights are set according to their log-odds score so that sequences that poorly match the model are given higher weight. As training iterations proceed, the model is modified with data from each new sequence as it is scored against the model.
SAM uses expectation-maximization and its batch mode of training [Krogh et al., 1994,Hughey & Krogh, 1996]. In this case, the entire training set is aligned to the model, and a completely new model is generated by combining the information from these alignments and the transition regularizer and a Dirichlet mixture prior [Sjölander et al., 1996]. (SAM uses the full forward-backward algorithm, so more precisely, the sum of probabilities across all possible alignments is calculated for each sequence.) SAM can produce a model from unaligned sequences, as was done in these experiments, or SAM can be given an initial seed alignment, which often improves results.
Since it has been known for some time that weighting is a problem in SAM, we added the ability to weight sequences to the new version of SAM. This came in two phases. The first is external weighting, in which a file of weights can be read in and assigned to sequences. Our group currently uses this and either position-specific weighting [Henikoff & Henikoff, 1994] or an entropy-based method [Karplus, 1997], using an unpublished method to assign the total weight (without regularization, only relative weight matters; with regularization, required for small sample sizes, total weight will influence how strongly the sequences can overpower the regularizer [Karplus, 1995]). Both of these methods are based on analyzing a multiple alignment of the training sequences. When one is not available, a rough alignment is created using unweighted HMM training, the sequences are weighted based on that alignment, and then a new HMM is trained using those weights.
In an effort to add standalone weighting (i.e., without the extra steps for the file-based weighting), we examined a number of maximum-discrimination style weighting schemes, that is schemes that depend only on the log-odds score of the sequence against the model and in which weights are continuously updated during HMM training. This analysis has led us to make two of the weighting methods, described below, a permanent additions to the Sequence Alignment and Modeling software system.
On completing our evaluation of these different weighting schemes, we performed a more general analysis of our best weighting schemes based on the recent work of Agarwal and States [Agarwal & States, 1998]. Using 67 superfamilies from an annotated version of PIR [Pearson, 1995], they compared several pairwise sequence comparison methods, leading to the observation that Probabilistic Smith Waterman (PSW) [Bucher & Hoffmann, 1996] was the best discrimination technique tested for whole proteins. To their PSW results, we add evaluations of SAM's new internal weighting, HMMer's maximum discrimination training, and Meta-MEME's combination of MEME motif location with HMMer hidden Markov model scoring [Bailey & Elkan, 1994,Grundy et al., 1997,Grundy, 1998].
All work described in this paper was performed using the SAM HMM program suite version 2.0 on DEC Alpha 255 workstations and a DEC AlphaServer. SAM is programmed in ANSI C for Unix workstations. The HMMer experiments were performed with HMMer version 1.8.1. The Meta-MEME experiments used Meme and Mast version 2.0 and Meta-MEME version 1.0.
A typical training time for 10 sequences, such as with the cytochrome P450 superfamily with average length 498, was 10.7 minutes for SAM, 2.5 minutes for HMMer, and 4.2 minutes for MetaMeme. Search times against the reference database were 19.9 minutes for SAM, 12.0 minutes for HMMer, and 0.4 minutes for MetaMeme.
Given a linear hidden Markov model M, a dynamic programming
calculation can be used to calculate, for a given sequence, the
probability that sequence a was generated by the model, P(a|M).
The question of interest is, however, does that model match the
sequence? That is, is the sequence more likely to be generated by the
hidden Markov model than some other, less structured null model,
. Making the assumption that the models M and
are
a priori equally likely, this reduces to the log-odds probability of
![]()
This score measures whether the probability that the sequence was generated by the model is greater than the probability it was generated by a null model (in SAM's case, the geometric average of the modeled sequence positions [Barrett et al., 1997]). This log-odds score is used to calculate the weight of a sequence. (SAM actually uses negative log-odds, but for this discussion, hopefully consistently, we use positive log-odds scores.)
HMMer's maximum discrimination method gives to each sequence a weight of 1-P(M|a). Thus, if a sequence perfectly matches the model M, it is given no weight at all, while a sequence that poorly matches the model is given unity weight. In score terms, weights are set to 1/(1+eS). One potential problem of this score function is that a model could, for example, alternately score one group of sequences well and a second poorly and then, because of the weighting, score the second group well after retraining. The HMMer maximum discrimination algorithm deals with this in two ways. First, training is done serially; the model is updated after each sequence is scored against the model. Second, a mixing parameter is used to ensure that any given sequence does not change the model too much.
For SAM, we considered several different weighting functions, all loosely based on the maximum discrimination equation above. We had four criteria in selecting candidate weighting functions:
| W | weight |
| S | a sequence's score |
| K | weight of the best scoring sequence (user defined) |
| b | the best (lowest) score in the training set |
| w | the worst (highest) score in the training set |
| m | the median score in the training set |
| o | user-defined offset value |
| Ws | weight of an individual sequence |
| N | size of the training set |
| x | mix of old weight and new weight |
To achieve these ends, we designed eight ``candidate'' functions described below. We also included weight normalization and weight ``mixing'' capabilities to optimize our results.
The methods are numbered in roughly chronological order, perhaps revealing some of our thought processes in creating the experiments.
The first weighting function is directly from HMMer's maximum discrimination method, but with additional parameters. Here, the weighting can be adjusted using an offset parameter o that is not necessarily zero. Any sequence with score better than o has reduced weight, while any sequence that scores worse than o has increased weight. We typically set o to the log of the database size, following the ideas of algorithmic significance [Milosavljevic & Jurka, 1993]. Thus, this offset encourages the model to score sequences not just better than the null model, but significantly better than the null model. The Method 1 formula has a second parameter, 0<K<1, that is used to temper the effects of weighting on the sequences:
| (1) |
The second weighting function is the first of our relative weighting functions, which attempt to remove the offset parameter to reduce the number of free variables. The worst score w and the best score b are used to decide on two extreme weights. The worst scoring sequence will have a weight of 1, while the best scoring sequence will have a weight of K, typically in the 0.01 to 0.1 range:
| (2) |
Method 3 is again similar to the maximum discrimination formula, but instead of using an arbitrary offset, the median of all scores in the training set is used. Thus, sequences that score worse than the median m receive close to a unity weight, while those that score better than the median receive a small weight:
| (3) |
Method 4 is basically Method 2 with an offset used to scale the scores, rather than the worst score. Here, sequences that score better than the offset receive weights less than one, while sequences that score worse than the offset can receive very high weights:
| (4) |
Continuing with variations on a theme, Method 5 uses the median and the best score to determine the exponential weight curve:
| (5) |
One problem with Methods 2, 4 and 5 is that the scaling depends very strongly on the worst score. This can be particularly a problem if the training set is polluted: if a non-family member is included in the training set. In an effort to deal with these possible outliers, Method 6 is a piecewise weighting function. The primary weighting function,
| (6) |
![]()
Method 7 is a similar function based on the best and worst score, rather than the median and worst score. Here, the equation
| (7) |
![]()
Finally, Method 8 is a simple linear function,
| (8) |
Whenever the total weight of all the sequences exceeds the number of
sequences (it is unlikely that N sequences could ever contain more
than N sequences worth of information!), the weights are renormalized
to sum to N. To deal with the opposite case, of the sequence weights
becoming too small (below
), the weights are renormalized as
well.
To potentially account for weights varying too rapidly between
training iterations, we experimented with an annealing schedule
similar to that used by SAM for noise injection. This weight mixing
function decides how much influence the calculated
can have on the previous iteration's
:
| |
(9) |
Following previous work [Barrett et al., 1997], we initially tested these eight methods on three protein families: globins, ferredoxins and EF-hands. Models were generated from large training sets of known family members using SAM's buildmodel program. We would expect such models to be quite good at discriminating between family and non-family members.
For each family, we built five models using each prospective weighting method and a variety of parameters. We wanted to compare their discrimination performance with unweighted SAM models.
We tested discrimination performance by using SAM's hmmscore program with each model to calculate the log-odds scores of sequences in two discrimination sets. The ``true'' set is composed of members of the given protein family; the ``false'' set is composed of non-family members.
The true sets were assembled by using a Prosite keyword search on the SwissProt 32 database (685 globins, 100 ferredoxins, 147 EF-hands). The false sets consisted of all remaining sequences in the SwissProt32 database after extraction of the true set. Although our true and false sets undoubtedly had errors, the same errors were present for all methods.
We then plotted the resulting number of false positives versus number of false negatives over a range of score thresholds. False positives are non-family sequences which the model identifies as family members; that is, their log-odds score is better than the worst scoring family member. False negatives are family members which the model misses. They score worse than the best-scoring non-family members.
The point where the number of false positives equals the number of false negatives is called the equivalence number [Pearson, 1995,Agarwal & States, 1998]. We calculated the equivalence number (EN) for each of our test and control models. This number is a good standard of discrimination performance. It provides an upper bound on the minimum number of discrimination errors (MER)
![]()
To evaluate computational performance, we kept track of the number of iterations required to build each model and the range of weights assigned to sequences in each experiment.
Four of our methods had poor computational performance (1, 3, 5, 6). We eliminated methods and parameter settings that made more discrimination errors than an unweighted model, added excessive iterations to the model building process, or were numerically unstable. However, four of the methods (2,4,7,8) performed very well with K (weight of the best scoring sequence before normalization) set at 0.1, so we continued our study using these.
At this point, we also revised the training procedure with the use of a different Dirichlet mixture prior (mall-opt.9comp) and an improved local (Smith-Waterman style) scoring algorithm [Smith & Waterman, 1981,Tarnas & Hughey, 1998]. All other parameters, apart from weighting parameters, were defaults.
Further tests with size 15 training sets showed that for globins and ferredoxins, methods 2 and 4 had the best discrimination performance. For EF-hands, position specific weighting and method 7 performed better than methods 2 and 4. Computationally method 4 performed the worst, requiring two and three times as many model building iterations as the other internal methods. We restricted further experiments to method 2, method 7, unweighted, external entropy, and external position-specific [Henikoff & Henikoff, 1994]. As mentioned, the two external methods consist of training an unweighted SAM model, using that model to form a multiple alignment, and then passing the multiple alignment to the weighting algorithm.
Our next concern was to see if training set size made a difference in a model's discrimination accuracy. To speed method evaluation, we created smaller discrimination sets by training an unweighted SAM model on each entire family (SwissProt keyword extraction) and then scoring this model on the entire SwissProt32 database. The true set was created by selecting the 100 worst scoring true sequences from each family and the false set was created by selecting the 2000 best scoring falses. Thus, these sets included the sequences that were difficult for the unweighted HMM to categorize.
![]() |
![]() |
![]() |
We studied training set sizes of 2, 5, 10, 15, 30, 50, 65, 85 and 100. For each size, we randomly formed five different training sets of each size and calculated the minimum, maximum and average equivalence numbers of the models created with each method.
While running the experiments on small training sets, we observed that sequence scores were oscillating indefinitely, a known problem with maximum discrimination score functions.
When weight mixing was added, the oscillations were successfully damped. We tried several x parameters in equation (10) and discovered that equivalence numbers were minimized when x=0.8; as a result, we have included weight mixing with parameter 0.8 as the default when SAM internal weighting is used.
Figures 1, 2 and 3 show the average equivalence number of each weighting method, as training set size increases from 2 to 100, by its difference from the average equivalence number of an unweighted SAM model calculated over the same five training sets.
This lets us see clearly how the weighting method improves on the discrimination performance of an unweighted model. The improvement can be measured by decreases in average equivalence numbers from unweighted to weighted models.
For the globins family, SAM methods 2 and 7 clearly have the best overall performance.
For ferredoxins, SAM methods 2 and 7 do best overall, however entropy weighting does an outstanding job for size 50 training sets and position specific weighting is best by a small margin for size 100 training sets.
For EF-hands, position specific weighting has the best overall performance and its margin of superiority increases as training sets get larger. Method 2 also does well. For size 85 sets it does better than position specific, however at size 100 it does poorly. For very small training sets (less than 15 sequences), methods 2 and 7 do best. Entropy weighting has the weakest discrimination performance on this family.
![]() |
In addition to the average equivalence numbers, we looked at the ranges between minimum and maximum equivalence numbers for training sets of the same size.
We observed that position specific and entropy weighting methods had more unstable results than the SAM internal methods. For training sets where these methods have excellent average equivalence numbers, the actual range of equivalence numbers is frequently quite large. This variation could be due to sensitivity to the single alignment (from an HMM trained with unweighted sequences) from which the external weights are calculated. Equivalence number ranges for the ferredoxins family are shown in Figure 4.
All methods tested were somewhat unstable on the EF-hands family; external position specific weighting was the most unstable.
![]() |
One great advantage of internal weighting methods is that the weights are dynamically changed as the model is trained. Training 100 ferredoxins (Figure 5), when the sequences are initially compared against the untrained model in iteration 1, there are two outliers that receive particularly high weights. In the second iteration, the sequences are much more clustered. Because of the weight mixing, however, the two outliers' weights are only partly reduced to that of the typical sequence receiving the same score. After 10 iterations, weights are nearly fixed for the remainder of the training process. At the last iteration, 26 in this case, the best scoring sequence is one that during one of the initial iterations was hard to model, and thus received a higher weight.
To find out whether or not our methods had value beyond these three protein families and to compare our discrimination results with other methods in general use, we ran another set of experiments using training and discrimination sets based Pearson's annotated version of the PIR 39 database with 12219 proteins total [Pearson, 1995].
Agarwal and States have analyzed five single-sequence search algorithms using a single sequence from each of the 67 superfamilies [Agarwal & States, 1998]. They analyzed Smith and Waterman SSEARCH [Smith & Waterman, 1981,Pearson & Lipman, 1988], FASTA (FA) [Pearson & Lipman, 1988], Probabilistic Smith-Waterman (PS) [Bucher & Hoffmann, 1996], BLASTP 1.4 (BP) [Altschul et al., 1990], and WU-BLASTP2 (WU) [Altschul et al., 1990,Althschul & Gish, 1996]. The general result of this work was that MERs are an effective evaluation criterion, and that Probabilistic Smith-Waterman had the best overall performance on complete proteins.
2|c|Agarwal/States
3|cSAM
#
Superfamily
Length
Size
Best method
Best MER
Best method
Best MER
Trainset size
1
hemoglobin alpha/beta
141
505
PS
14
m2
7
10
2
Ig kappa chain V-I region
108
280
PS
11
m2
3
10
3
G-prot.coupled receptors
348
165
SW
26
psp,ent
10
4
cytochrome C
105
142
PS
18
m7
16
10
5
snake neurotoxin
74
109
PS/SW/FA/B2
nw,m2,m7
2,5,10
6
calcium binding EF-hand
159
106
SW/B2
11
m7
5
10
7
glutathione transferase
222
106
PS
3
psp
10
8
protein kinase, cAMP-dep
351
97
SW
70
ent
10
9
ferredoxin
54
93
SW
57
m2,m7
9
5,10
10
ribulose-bisphosphate
139
77
all
2
nw,m2,m7
1,2,5,10
carboxylase
11
Ig kappa chain C region
106
74
PS/SW/B2
18
nw,m7
7
10
12
hemagglutinin
567
73
PS/SW/FA/B2
all
1,10
13
histocompatibility antigen
338
71
PS/SW
nw,m2,m7
1
1,10
14
insulin
110
69
all
3
nw,m2,m7
3
2,5,10
15
alpha-crystalline chain A
173
67
PS
2
m7
3
10
16
phospholipase A2
148
58
SW/B2
m7
1
5
17
glyceraldehyde-3-P DH
335
46
PS/SW/BP/B2
all
1,2,5,10
18
transforming prot.(N-ras)
189
45
all
1
nw,m2,m7
1
1,2,5,10
19
serine protease
246
45
FA
16
ent
10
20
glucagon precursor
180
44
B2
6
m2,m7
2
10
21
H+ transporting ATP
553
43
PS/SW/FA
1
all
5,10
synthase alpha chain
precursor
22
hemagglutinin-neruraminidase
576
42
all
all
1,2,5,10
23
ribonuclease
124
40
all
nw,m2,m7
1,2,5,10
24
interferon alpha-I-6
189
39
all
nw,m2,m7
1,2,5,10
25
glutamate-ammonia ligase
373
39
PS/SW/FA/B2
nw,m2,m7
1,2,5,10
26
azurin
129
38
SW
17
m2,m7
3
10
27
fusion protein-Sendai virus
565
36
all
all
1,2,5,10
28
cytochrome P450
497
35
PS/SW
nw,m2,m7,psp
10
29
outer capsid protein VP8
280
34
all
all
1,2,5,10
30
gag polyprotein
512
33
PS/SW/BP/B2
3
psp,ent
2,5,10
31
keratin
471
32
BP
4
ent
5
32
nucleoprotein-influenza A
498
31
all
nw,m2,m7
5,10
33
acidic ribosomal protein P2
115
29
PS
3
nw,m2,m7
5
10
34
E6 protein papillomarvirus
158
29
PS/SW/BP/B2
nw,m2,m7,psp
9
35
lysozyme
130
28
all
nw,m2,m7
1,2,5,10
36
N-cadherin
906
27
all
all
1,2,5,10
37
exo-alpha-sialidase
454
27
all
nw,m2,m7,psp
1,2,5,10
38
L2 protein papillomavirus
507
27
all
all
1,2,5,10
39
scorpion neurotoxin
64
26
PS/SW/FA/B2
5
m2
2
10
40
E7 protein papillomavirus
98
26
PS/SW/FA
m2,m7
10
41
H+ transporting ATP synthase
75
26
B2
nw,m2,m7
2,5,10
lipid-binding
42
L-lactate dehydrogenase
333
26
PS/SW/B2
nw,m2,m7,psp
10
43
E2 protein papillomavirus
322
26
all
m2,m7
5,10
44
core antigen - hepatitis B
183
25
all
nw,m2,m7
10
45
antithrombin-III
464
25
PS/SW/BP/B2
psp,ent
1
46
thymidine kinase
376
25
BP
m2
2
47
phycocyanin
162
25
all
nw,m2,m7
2,5,10
48
protamine Y2
34
24
SW/FA/B2
1
nw,m2,m7
1
5,10
49
transforming prot (myc)
439
24
all
nw,m2,m7
1,2,5,10
50
matrix protein
348
24
PS/SW/BP/B2
nw,m2,m7
10
51
H+ transporting ATP
226
23
PS/SW/B2
1
ent
10
synthase P6
52
alcohol dehydrogenase A
375
23
all
nw,m2,m7,psp
1,2,5,10
53
glycoprotein B
857
23
all
all
1,2,5,10
54
ionotropic acetylcholine
457
23
all
all
1,2,5,10
receptor
55
nonstructural protein NS2
121
22
PS
nw,m7
1,2
56
annexin I
346
22
all
4
psp,ent
10
57
histone H1b
218
22
SW/BP/B2
2
all
5,10
58
metallothionein
61
21
B2
3
m2
3
10
59
Beta-crystallin chain Bp
204
21
all
nw,m2,m7
1,2,5,10
60
proteinase inhibitor
71
21
PS
2
nw
1
10
61
hepatic lectin H1
291
20
SW/FA/B2
1
m2,m7
10
62
E2 glycoprotein precursor
1447
20
all
all
1,2,5,10
63
alpha-2u-globulin precursor
181
20
PS
6
psp
10
64
pepsin
388
20
all
all
1,2,5,10
65
DNA-directed DNA polymerase
1462
20
SW/FA/BP
1
psp
2
66
prolactin
227
20
all
nw,m2,m7
1,2,5,10
67
vitamin B12 tran. btuD
249
20
PS
3
psp
10
Table II summarizes the results from both our experiments and those of Agarwal and States. Each row lists one of the 67 superfamilies, length of the sequence used in the single-sequence search, size of the superfamily, the best MER obtained by Agarwal and States, and the method which produced that MER. We added to this data on the best SAM method for each family, the best MER, and the size of the training set that produced that MER.
We calculated the SAM MERs by building models trained on the single superfamily search sequence used by Agarwal and States and models trained on small training sets (2, 5 and 10) containing the search sequence and a few other randomly-selected superfamily members from PIR 39. As in the Agarwal and States result, the discrimination sets consisted of the superfamily members from PIR 39 and the remaining (non-family) sequences in the database.
| 4|c|SAM No Weights | 3|c|SAM Internal | 4|c|HMMer | 4|c Meta-MEME | r@r@r|r@ | ||||||||||||||||||
| # | Len | Size | PSW | 1 | 2 | 5 | 10 | 2 | 5 | 10 | 1 | r@r@r|r@ | 2 | 5 | 10 | |||||||
| 1 | 141 | 505 | 14 | 39 | 48 | 27 | 25 | 37 | 13 | 8 | 113 | r@r@r|r@ | 73 | 27 | 19 | |||||||
| 2 | 108 | 280 | 11 | 56 | 106 | 25 | 5 | 100 | 17 | 5 | 130 | r@r@r|r@ | 133 | 18 | 4 | |||||||
| 3 | 348 | 165 | 29 | 52 | 55 | 35 | 19 | 25 | 29 | 18 | 125 | r@r@r|r@ | 105 | 56 | 28 | |||||||
| 4 | 105 | 142 | 18 | 30 | 28 | 22 | 29 | 29 | 18 | 16 | 41 | r@r@r|r@ | 37 | 36 | 34 | |||||||
| 5 | 74 | 109 | 1 | 2 | 1 | 9 | 2 | 85 | r@r@r|r@ | 40 | 1 | |||||||||||
| 6 | 159 | 106 | 15 | 9 | 8 | 6 | 8 | 9 | 6 | 5 | 59 | r@r@r|r@ | 56 | 28 | 13 | |||||||
| 7 | 222 | 106 | 3 | 13 | 30 | 5 | 15 | 16 | 2 | 1 | 80 | r@r@r|r@ | 69 | 38 | 9 | |||||||
| 8 | 351 | 97 | 74 | 71 | 58 | 59 | 37 | 71 | 62 | 39 | 78 | r@r@r|r@ | 73 | 63 | 56 | |||||||
| 9 | 54 | 93 | 58 | 58 | 15 | 14 | 22 | 26 | 16 | 9 | 74 | r@r@r|r@ | 44 | 40 | 28 | |||||||
| 10 | 139 | 77 | 2 | 55 | r@r@r|r@ | 5 | ||||||||||||||||
| 11 | 106 | 74 | 18 | 19 | 61 | 19 | 7 | 37 | 18 | 7 | 60 | r@r@r|r@ | 67 | 48 | 14 | |||||||
| 12 | 567 | 73 | 1 | 1 | 4 | r@r@r|r@ | 2 | 1 | ||||||||||||||
| 13 | 338 | 71 | 1 | 6 | 6 | 1 | 3 | 9 | 1 | 19 | r@r@r|r@ | 6 | 4 | 1 | ||||||||
| 14 | 110 | 69 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 3 | 12 | r@r@r|r@ | 8 | 6 | 3 | |||||||
| 15 | 173 | 67 | 2 | 5 | 6 | 7 | 9 | 6 | 4 | 3 | 11 | r@r@r|r@ | 10 | 7 | 7 | |||||||
| 16 | 148 | 58 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 14 | r@r@r|r@ | 5 | 2 | 2 | |||||||
| 17 | 335 | 46 | 1 | r@r@r|r@ | 1 | |||||||||||||||||
| 18 | 189 | 45 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | r@r@r|r@ | 2 | 1 | 1 | |||||||
| 19 | 246 | 45 | 20 | 22 | 22 | 22 | 15 | 22 | 22 | 16 | 20 | r@r@r|r@ | 17 | 20 | 16 | |||||||
| 20 | 180 | 44 | 14 | 8 | 4 | 6 | 3 | 6 | 6 | 2 | 21 | r@r@r|r@ | 21 | 19 | 11 | |||||||
| 21 | 553 | 43 | 1 | 1 | 5 | 1 | 1 | 10 | r@r@r|r@ | 21 | ||||||||||||
| 22 | 576 | 42 | 3 | r@r@r|r@ | ||||||||||||||||||
| 23 | 124 | 40 | 1 | r@r@r|r@ | 1 | |||||||||||||||||
| 24 | 189 | 39 | r@r@r|r@ | |||||||||||||||||||
| 25 | 373 | 39 | 12 | r@r@r|r@ | 1 | 21 | 1 | |||||||||||||||
| 26 | 129 | 38 | 19 | 17 | 25 | 5 | 4 | 19 | 4 | 3 | 27 | r@r@r|r@ | 27 | 8 | 5 | |||||||
| 27 | 565 | 36 | r@r@r|r@ | 4 | ||||||||||||||||||
| 28 | 497 | 35 | 3 | 3 | 2 | 1 | 7 | r@r@r|r@ | 8 | 1 | 1 | |||||||||||
| 29 | 280 | 34 | 1 | r@r@r|r@ | ||||||||||||||||||
| 30 | 512 | 33 | 3 | 4 | 7 | 3 | 3 | 4 | 3 | 3 | 12 | r@r@r|r@ | 10 | 8 | 4 | |||||||
| 31 | 471 | 32 | 5 | 5 | 1 | 1 | 2 | 2 | 4 | 3 | 2 | r@r@r|r@ | 7 | 5 | 2 | |||||||
| 32 | 498 | 31 | 1 | 1 | 1 | 1 | 1 | r@r@r|r@ | 1 | 1 | ||||||||||||
| 33 | 115 | 29 | 3 | 10 | 9 | 9 | 5 | 9 | 6 | 5 | 11 | r@r@r|r@ | 10 | 7 | 5 | |||||||
| 34 | 158 | 29 | 2 | 2 | 2 | 2 | 3 | r@r@r|r@ | 1 | |||||||||||||
| 35 | 130 | 28 | 7 | r@r@r|r@ | ||||||||||||||||||
| 36 | 906 | 27 | r@r@r|r@ | |||||||||||||||||||
| 37 | 454 | 27 | r@r@r|r@ | 3 | ||||||||||||||||||
| 38 | 507 | 27 | 1 | r@r@r|r@ | ||||||||||||||||||
| 39 | 64 | 26 | 5 | 5 | 6 | 3 | 4 | 14 | 4 | 4 | 14 | r@r@r|r@ | 18 | 4 | 3 | |||||||
| 40 | 98 | 26 | 1 | 6 | 1 | 1 | 6 | 1 | r@r@r|r@ | |||||||||||||
| 41 | 75 | 26 | 1 | 1 | 11 | r@r@r|r@ | 1 | |||||||||||||||
| 42 | 333 | 26 | 6 | 4 | 2 | 14 | r@r@r|r@ | 14 | 1 | |||||||||||||
| 43 | 322 | 26 | 1 | 6 | 1 | 1 | 6 | r@r@r|r@ | ||||||||||||||
| 44 | 183 | 25 | 1 | 1 | r@r@r|r@ | 4 | ||||||||||||||||
| 45 | 464 | 25 | 1 | 1 | 16 | 1 | 16 | r@r@r|r@ | 18 | 2 | ||||||||||||
| 46 | 376 | 25 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | r@r@r|r@ | 1 | 1 | 1 | |||||||
| 47 | 162 | 25 | 1 | 9 | r@r@r|r@ | |||||||||||||||||
| 48 | 34 | 24 | 2 | 5 | 2 | 3 | 1 | 2 | 1 | 1 | 11 | r@r@r|r@ | 6 | 2 | 1 | |||||||
| 49 | 439 | 24 | 1 | 4 | r@r@r|r@ | 2 | 1 | 1 | ||||||||||||||
| 50 | 348 | 24 | 6 | 1 | r@r@r|r@ | 2 | ||||||||||||||||
| 51 | 226 | 23 | 1 | 8 | 2 | 2 | 1 | 1 | 1 | 1 | 11 | r@r@r|r@ | 11 | 1 | ||||||||
| 52 | 375 | 23 | 7 | r@r@r|r@ | 2 | |||||||||||||||||
| 53 | 857 | 23 | r@r@r|r@ | 1 | ||||||||||||||||||
| 54 | 457 | 23 | 3 | 3 | 3 | 3 | r@r@r|r@ | 3 | ||||||||||||||
| 55 | 121 | 22 | 2 | 2 | 2 | 2 | 2 | 2 | r@r@r|r@ | 2 | 2 | 2 | ||||||||||
| 56 | 346 | 22 | 4 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 4 | r@r@r|r@ | 4 | 4 | 3 | |||||||
| 57 | 218 | 22 | 3 | 2 | 2 | 1 | 4 | r@r@r|r@ | 2 | 2 | 1 | |||||||||||
| 58 | 61 | 21 | 6 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 6 | r@r@r|r@ | 6 | 5 | 4 | |||||||
| 59 | 204 | 21 | r@r@r|r@ | |||||||||||||||||||
| 60 | 71 | 21 | 2 | 2 | 2 | 2 | 1 | 3 | 2 | 2 | 6 | r@r@r|r@ | 4 | 2 | 2 | |||||||
| 61 | 291 | 20 | 2 | 8 | 7 | 1 | 6 | 13 | r@r@r|r@ | 14 | 4 | 3 | ||||||||||
| 62 | 1447 | 20 | 1 | r@r@r|r@ | 1 | |||||||||||||||||
| 63 | 181 | 20 | 6 | 12 | 15 | 4 | 2 | 5 | 5 | 1 | 17 | r@r@r|r@ | 15 | 4 | 4 | |||||||
| 64 | 388 | 20 | 1 | r@r@r|r@ | 1 | |||||||||||||||||
| 65 | 1462 | 20 | 5 | 1 | 1 | 3 | 18 | r@r@r|r@ | 15 | 1 | ||||||||||||
| 66 | 227 | 20 | 9 | r@r@r|r@ | ||||||||||||||||||
| 67 | 249 | 20 | 3 | 6 | 13 | 5 | 4 | 5 | 5 | 3 | 18 | r@r@r|r@ | 17 | 9 | 7 | |||||||
| 3l|Total MER | 355 | 511 | 591 | 318 | 244 | 520 | 283 | 173 | 1307 | 1031 | 512 | r@r@r|r@ | 296 | 1206 | 879 | |||||||
| 3l|#F Better PSW | 0 | 8 | 10 | 15 | 21 | 12 | 18 | 23 | 1 | 4 | 11 | r@r@r|r@ | 21 | 0 | 4 | |||||||
| 3l|#F Worse PSW | 0 | 29 | 30 | 22 | 12 | 22 | 12 | 5 | 55 | 47 | 26 | r@r@r|r@ | 14 | 58 | 47 | |||||||
| 3l|Improve (MER) | 0 | -156 | -236 | 37 | 111 | -165 | 72 | 182 | -952 | -676 | -157 | r@r@r|r@ | 59 | -851 | -524 | |||||||
| 3l|Improve (#F) | 0 | -21 | -20 | -7 | 9 | -10 | 6 | 18 | -54 | -43 | -15 | r@r@r|r@ | 7 | -58 | -43 |
To compare our results with other maximum discrimination methods, we repeated the experiments using HMMer and Meta-MEME (Table III). We built the HMMer models from the training sets used in the SAM experiments and scored the same discrimination sets with these models, using HMMer's hmmsw program. We used the default HMMer configuration. The impending rewrite of HMMer can be expected to have improved performance.
We also built HMMs using a combined MEME [Bailey & Elkan, 1994] and HMMer method called Meta-MEME [Grundy et al., 1997,Grundy, 1998]. In this procedure, MEME is used to locate common motifs in the training set, MAST finds the sequences in the training set with the strongest matches to these common motifs, and Meta-MEME builds an HMM in HMMer format based on the motifs and ``diagrams'' which describe the distribution of the motifs in the matching sequences. We again used the same training and discrimination sets. We ran the programs from the MEME suite using the settings described in the literature [Grundy, 1998]: a minimum motif width of 12, maximum motif width of 55, the ZOOPS motif model (zero or one motif occurrence per sequence), ten selected motifs from each training set, and use of only those motifs that appeared in more than half of the sequences in the training set to build the HMM.
The final rows of Table III include the minimum number of errors across the entire experiment and the number of families that have a better or worse MER than PSW. As can be seen, the HMM-based methods require 5 or 10 sequences to gain any improvement over the PSW results. For training sets of 1 or 2 sequences, PSW performs better than SAM (46%-66% higher MER), HMMer (190%-270% higher), or Meta-MEME (150%-240%), indicating further need for tuning the model building procedure for small data sets. For training sets of 5 sequences, unweighted SAM has a 10% better MER and weighted SAM has a 20% better MER compared to PSW. HMMer and Meta-MEME MERs are 44% and 14% worse than PSW. At 10 training sequences, all HMM methods do better than PSW, with net MER improvements over PSW of 17%, 28%, 31%, and 51%, corresponding to MERs of 296, 255, 244, and 173, for HMMer, Meta-MEME, unweighted SAM, and weighted SAM. Method 2 performed slightly worse than method 7 with, for example, a total MER of 187 (47% better than PSW) for a size 10 training set.
Based on these data, we conclude that SAM's new internal sequence weighting method is a success, significantly extending the discriminatory effectiveness of hidden Markov models.
Our experiments lead us to believe that we can consistently improve the discrimination performance of a SAM HMM by using maximum discrimination weighting functions during model building. We have added Methods 2 and 7 to our current upgrade of SAM (2.0).
Our improvement in discrimination is not always dramatic, but it is very cheap. There is no noticeable increase in the time it takes to build the model when internal weighting is used.
We believe that a careful choice of Dirichlet mixture priors and transition regularizers could further improve results for a particular protein family or training set size. However, experiments to determine the best choice are beyond the scope of this paper.
The position-specific and entropy-based ``external'' SAM methods often performed very well but were not as stable as our standalone methods. It is possible that internally implemented versions of these methods, with continuous updating of weights during model training, would be more consistent and discriminate better than our current internal methods.
None of the HMM-based methods performs as well as Probabilistic Smith Waterman with only one or two training sequences. Because HMMs are a generalization of the PSW method, there is no intrinsic reason why this should be the case. Instead, it indicates that our current library of Dirichlet mixture priors and transition regularizers is more finally tuned to training HMMs for groups of sequences. Further study of this problem could result in a range of different prior probabilities for different training set sizes and discrimination goals.
The comparison of single-sequence PSW with multiple-sequence HMMs is somewhat unfair. A more complete comparison would include a modified version of PSW that uses all sequences in the training set and, for example, using an iterative search technique [Pearson, 1997] or averaging the scores [Grundy, 1998]. As our primary goal was to evaluate the effectiveness of internal weighting methods in SAM, this work was considered beyond the scope of the paper.
We gratefully acknowledge Christian Barrett, David Haussler, and Kevin Karplus for their assistance and commentary on this work. This work was supported in part by NSF grant DBI-9408579 and its REU supplement, DOE grant DE-FG0395ER62112, and a grant from Digital Equipment Corporation.