Kevin Karplus, Christian Barrett, Richard Hughey
Department of Computer Engineering
Jack Baskin School of Engineering
University of California
Santa Cruz, CA 95064
PREPRINT to appear in Bioinformatics, 1999
A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (HMM) from the sequence and homologs found using the HMM for database search. SAM-T98 is also used to construct model libraries automatically from sequences in structural databases.
We evaluate the SAM-T98 method with four datasets. Three of the test sets are fold-recognition tests, where the correct answers are determined by structural similarity. The fourth uses a curated database. The method is compared against WU-BLASTP and against DOUBLE-BLAST, a two-step method similar to ISS, but using BLAST instead of FASTA.
SAM-T98 had the fewest errors in all tests--dramatically so for the fold-recognition tests. At the minimum-error point on the SCOP-domains test, SAM-T98 got 880 true positives and 68 false positives, DOUBLE-BLAST got 533 true positives with 71 false positives, and WU-BLASTP got 353 true positives with 24 false positives.
The method is optimized to recognize superfamilies, and would require parameter adjustment to be used to find family or fold relationships.
One key to the performance of the HMM method is a new score-normalization technique that compares the score to the score with a reversed model rather than to a uniform null model.
A World Wide Web server, as well as information on obtaining the Sequence Alignment and Modeling (SAM) software suite, can be found at http://www.cse.ucsc.edu/research/compbio/
The focus of this paper is to present a new hidden Markov model (HMM) method to detect remote homologies. The SAM-T98 method creates a hidden Markov model from a single target sequence by iteratively finding homologs in a protein database and refining the model. We compare our results to those using more established methods.
The results are presented in the context of four tests, three of which are fold-recognition tests. These three tests use a set of target sequences whose folds are to be determined, a fold database of sequences of known structure, and a definition of ``correct'' target-database sequence pairings. The fourth uses a curated database whose protein sequences were grouped according to family, primarily using sequence information. For all of the tests, we used only primary sequence information--the test was purely one of detecting remote homologs, not of protein structure prediction or threading.
For the fold-recognition tests, our HMM-based methods did extremely well at all levels of acceptable error, finding many more remote homologs than the more traditional sequence-based methods.
A companion paper [Park et al., 1998] compares SAM-T98 on the SCOP test sets with BLAST [Altschul et al., 1990] and FASTA [Pearson & Lipman, 1988] and with two state-of-the-art methods: PSI-BLAST [Altschul et al., 1997] and ISS [Park et al., 1997]. The results there show SAM-T98 to be superior to PSI-BLAST, which is superior to ISS, which is superior to BLAST and FASTA.
For this work we use a local alignment procedure that relates part of the sequence to one contiguous path through part of the HMM [Tarnas & Hughey, 1998]. If two sequences are aligned to the model, a multiple alignment between those sequences can be inferred from their alignments to the model, though it must be remembered that characters modeled by insert states are not aligned between sequences.
When an HMM is trained on sequences that are members of a protein family, the resulting HMM can identify the positions of amino acids which describe conserved primary structure of the family. This HMM can then be used to discriminate between family and non-family members in a search of a sequence database. A multiple alignment of sequences to the HMM will reveal the regions in the primary structure that are conserved and that are characteristic of the family.
The FSSP test set uses DALI structure comparison [Holm & Sander, 1993] to determine structural homology. A soft-threshold classification was made, in which DALI z-scores higher than 6 were considered to be homologs, z-scores lower than 2 were non-homologs, and z-scores between 2 and 6 were counted as partly homologous and partly non-homologous using a linear interpolation to get a homology score between 0 and 1. For the 174,134 non-self pairs, the sum of the homology scores was 3510.85 (so about 2% of the pairs represent homologies to be detected), though the best possible classifier still makes at least 1494.95 errors (Figure 2). At the minimum-error point for an optimal classifier, there are 2449.45 homolog pairs (1.4% of the possible pairs).
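The soft-threshold classification of DALI z-scores described above can be written as a small function. This is an illustrative sketch of the interpolation rule, not code from the evaluation scripts used in this work:

```python
def homology_score(z):
    """Soft-threshold homology score from a DALI z-score:
    1.0 at or above z=6, 0.0 at or below z=2, and a linear
    interpolation between the two."""
    if z >= 6.0:
        return 1.0
    if z <= 2.0:
        return 0.0
    return (z - 2.0) / 4.0
```

Summing `homology_score` over all non-self pairs yields the fractional homolog count (3510.85 for the FSSP set) used in the error tallies.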
The whole-chain test set was composed of 571 single-domain proteins. Of the 162,735 pairings, only 931 (0.6%) are considered correct homologies. The domain test set contained the whole-chain test set, plus another 364 domains that were only parts of chains (935 sequences in total). Of the 436,645 possible non-self pairs, only 2605 were considered homologs (0.6%).
The higher rate of homology for the FSSP dataset may be an artifact of our selecting target sequences to cover the major subtrees--sequences with few true positives were less likely to be picked as targets.
Since the PIR families are generally of fairly close homologs, the Pearson test set is a test of close-homolog classification, not remote homolog classification.
The search programs used were WU-BLASTP (http://blast.wustl.edu) and the SAM suite version 2.0 [Hughey & Krogh, 1996], although the latter was not used with default parameters. For example, different transition regularizers were specified, and noise and model surgery were not used. Additionally, local alignments, not global alignments, were used.
Two of the methods used here are based on the BLAST search program [Altschul et al., 1990], perhaps the most widely used bioinformatics tool today. This program is extremely fast and easy to use, so evaluating it is essential. Tools that fail to outperform BLAST are rarely worth their computational cost.
Instead of trying to find the homologs in the database directly from the target sequence, a two-step approach is used. First, a set of close homologs to the target sequence is found in a large database of sequences, then each homolog is used as a query to search the final database. The large database employed is the non-redundant protein database NRP [NRP, 1998]. WU-BLASTP is used both for finding the set of close homologs and for using each of these homologs to perform the second search. The first search is done with an E-value of 0.00005, and the second search with an E-value of 0.2. The score reported is the log of the maximum of the reported E-values for the hits. Each hit found in the first search is treated as a separate homolog, as attempts to combine the hits resulted in many more false positives. This was particularly evident for the SCOP whole-chain test set, since non-homologous domains may occur between two homologs in a database sequence.
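The two-step DOUBLE-BLAST procedure can be sketched as follows. Here `blastp` is a hypothetical wrapper around WU-BLASTP that returns hits keyed by E-value, and the score combination follows one plausible reading of the description above (each database hit scored by the log of the maximum E-value reported for it):

```python
import math

def double_blast(target, nrp_db, final_db, blastp,
                 e1=0.00005, e2=0.2):
    """Two-step DOUBLE-BLAST search (sketch).  `blastp(query, db,
    e_cutoff)` is a hypothetical wrapper returning {hit_id: e_value}."""
    # Step 1: find close homologs of the target in the large database (NRP).
    close = blastp(target, nrp_db, e_cutoff=e1)
    # Step 2: use each close homolog as a separate query against the
    # final database; hits are NOT combined across homologs, since
    # combining them produced many more false positives.
    evalues = {}   # final-db hit -> E-values over all homolog queries
    for homolog in close:
        for hit, e in blastp(homolog, final_db, e_cutoff=e2).items():
            evalues.setdefault(hit, []).append(e)
    # Reported score: log of the maximum of the reported E-values.
    return {hit: math.log(max(es)) for hit, es in evalues.items()}
```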
When the database is small, the SAM-T98 method can also be used to create an HMM for each sequence in the database. This database of models can then be searched with the target sequence, providing a two-pronged approach to the search problem. Because SAM-T98 iteratively creates a model from a single sequence, hand-tuned seed alignments, such as those used for PFAM [Sonnhammer et al., 1997], are not needed, though the method could be applied to such seed alignments.
For the fold-recognition tests, we created HMMs for all of the sequences in the fold database (1050 for FSSP and 931 for SCOP, 1677 in all, taking the overlap into account). For the Pearson test, since we were unwilling to build an HMM for each of the 12,216 sequences in the database, we used SAM-T98 to build HMMs only for the 67 target sequences, and scored with just the target HMMs. Based on the results for the other test sets, using only target HMMs reduces performance only slightly (see Summing Scores, below).
Since building HMMs from weighted multiple alignments is a critical aspect of the method, we specifically discuss sequence weighting next, followed by the SAM-T98 method itself and a discussion on how the HMMs were used to score sequences in the test sets.
The relative weights are set with the Henikoffs' position-based sequence weights [Henikoff & Henikoff, 1994], but the absolute weight is set to achieve a specific level of entropy averaged over all columns after a Dirichlet mixture regularizer [Sjölander et al., 1996] is applied to the weighted counts. The entropy is specified as the number of bits saved relative to the entropy of the background distribution. This relative-entropy measure has been used previously to characterize substitution matrices [Altschul, 1991]; the popular BLOSUM50 and BLOSUM62 matrices correspond to saving about 0.5 and 0.7 bits per column. The savings requested by our method decreases smoothly as a function of the number of match columns, from 2.5 bits per column for alignments with only 20 match columns down to about 0.36 bits per column for alignments with over 600 match columns.
The large savings requested for short alignments is generally not attainable with any weighting, and the relatively poor performance of the SAM-T98 method on short peptides, noticeable when analyzing the top false positives for the SCOP domain test set, may be due to this weighting problem.
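One way to realize this entropy-targeted absolute weighting is a bisection on the overall weight scale. The sketch below substitutes a simple pseudocount for SAM's Dirichlet-mixture regularizer, so the regularizer and all names here are illustrative assumptions, not SAM's implementation:

```python
import math

def column_relative_entropy(counts, background, pseudocount=1.0):
    """Bits saved relative to the background for one column, after
    adding a background-proportional pseudocount (a stand-in for the
    Dirichlet-mixture regularizer).  `counts` must cover the full
    alphabet, with zeros where a residue is absent."""
    total = sum(counts[a] + pseudocount * background[a] for a in counts)
    bits = 0.0
    for a, c in counts.items():
        p = (c + pseudocount * background[a]) / total
        if p > 0:
            bits += p * math.log2(p / background[a])
    return bits

def scale_weights_to_entropy(columns, background, target_bits,
                             lo=1e-3, hi=1e3, iters=60):
    """Bisect (in log space) on the absolute weight scale so that the
    average bits saved per column matches `target_bits`.  Larger total
    weight lets the counts dominate the regularizer, saving more bits."""
    def avg_bits(scale):
        return sum(
            column_relative_entropy(
                {a: scale * c for a, c in col.items()}, background)
            for col in columns) / len(columns)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if avg_bits(mid) < target_bits:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

As the text notes, for very short alignments the requested savings may exceed what any scale can achieve, in which case the bisection simply saturates at its upper bound.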
The SAM-T98 method then uses 4 iterations of a selection, training, and alignment procedure. For each iteration it needs an initial alignment, a set of sequences to search, a threshold value, and a transition regularizer. From the alignment and regularizer, an HMM is constructed and used to score the set of sequences. All sequences that score better than the threshold value are used to estimate a new HMM. Alignment of the training sequences to the HMM produces the alignment that is the input for the next iteration.
On the first iteration the single sequence passed to the method is used as the initial (trivial) alignment and the close homologs found by WU-BLASTP are used as the search set. The threshold is set strictly (-40 nats), so only strong matches to the sequence are considered. The transition regularizer approximates the gap costs used by WU-BLASTP. Requiring both WU-BLASTP and the initial HMM to score a sequence well ensures that only close homologs are included at this stage of the process.
On subsequent iterations the input alignment is the output from the previous iteration and the search set is the larger set of possible homologs found by WU-BLASTP. The thresholds are gradually loosened (-30 nats, -24 nats, and -16 nats).
For the second and third iteration, we use a regularizer that encourages long sequences of match states, and for the final iteration a transition regularizer trained on FSSP structural alignments is used.
The above selection, training, and alignment procedure consists of several calls to SAM programs. Models are created with SAM's modelfromalign program, which uses the alignment, sequence weighting, transition regularizer, and Dirichlet mixture to build an HMM. Scoring the sequence set with an HMM uses SAM's multiple-domain scoring procedure, now part of hmmscore, which selects only the portion of a sequence matching the HMM (local scoring [Smith & Waterman, 1981] as applied to SAM models [Tarnas & Hughey, 1998]). From the sequences selected using this procedure, a new model is estimated using SAM's buildmodel HMM training program. The alignment of the training sequences back to the resulting HMM is accomplished with SAM's align2model program. To ensure that the initial sequence to the whole process is not lost, it is added to the training set at this point, and any duplicate sequences in the training set are eliminated.
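The iteration schedule can be summarized in Python. The four callables stand in for the SAM programs named above; they are hypothetical wrappers for illustration, not part of the SAM distribution, and the threshold/regularizer schedule follows the text (scores in nats, lower meaning better):

```python
def sam_t98_iterate(seed_sequence, close_homologs, all_homologs,
                    modelfromalign, hmmscore, buildmodel, align2model):
    """Sketch of the SAM-T98 selection/training/alignment loop."""
    alignment = [seed_sequence]          # trivial initial alignment
    schedule = [
        (-40, close_homologs, "blast-like"),    # strict: close homologs only
        (-30, all_homologs, "long-match"),
        (-24, all_homologs, "long-match"),
        (-16, all_homologs, "fssp-trained"),    # loosest threshold
    ]
    hmm = None
    for threshold, search_set, regularizer in schedule:
        hmm = modelfromalign(alignment, regularizer)
        # keep sequences scoring better (more negative) than the threshold
        selected = [s for s in search_set if hmmscore(hmm, s) < threshold]
        hmm = buildmodel(selected)
        alignment = align2model(hmm, selected)
        # the seed sequence is re-added and duplicates eliminated
        if seed_sequence not in alignment:
            alignment.append(seed_sequence)
        alignment = list(dict.fromkeys(alignment))
    return alignment, hmm
```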
Since this process is involved and requires substantial computing time, it is only done once for any sequence and the final alignment is kept as an entry in a library. An HMM can be quickly constructed for the stored alignment using modelfromalign and sequence weighting.
The quality of the HMM resulting from this method is critically dependent on the sequences selected for training, and this sequence selection depends on the scoring implementation. During the method's development, we found that many protein families' multiple alignments show columns of strict conservation of what are usually the more rarely seen residues (cysteine, for example). When scoring databases with an HMM built for these families, sequences that are compositionally biased toward these residues tend to receive inflated scores and become false positives.
Before this observation, scoring involved comparing the log-probability of a sequence for an HMM with its log-probability for a null model [Barrett et al., 1997]. To address this problem, we looked at the difference of the log-probability of the sequence and the log-probability of the sequence with a reversed HMM (equivalently, the score of the reversed sequence with the HMM). Since the reversed sequence has the same length and composition as the sequence, these two sources of error are effectively eliminated. Figure 3 shows the effectiveness of this reversed model scoring on the SCOP whole chain test set. For the remainder of this experiment, we used reversed model scoring when scoring an HMM against the test sets.
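Reversed-model scoring amounts to a one-line score adjustment. In the sketch below, `log_prob` is a hypothetical scoring callable, and the toy position-specific scorer is illustrative only; scoring the reversed sequence with the HMM is equivalent to scoring the sequence with the reversed HMM:

```python
def reverse_null_score(log_prob, model, sequence):
    """Reversed-model score: the sequence's log-probability under the
    HMM minus that of the reversed sequence, which has identical
    length and composition, cancelling both sources of bias."""
    return log_prob(model, sequence) - log_prob(model, sequence[::-1])

# Toy position-specific scorer: +1 per position matching the model's
# consensus string, -1 otherwise (illustrative only).
def toy_log_prob(consensus, seq):
    return sum(1 if a == b else -1 for a, b in zip(consensus, seq))
```

Note how a palindromic, composition-biased decoy (like a cysteine-rich sequence) receives an adjusted score of zero, since its reversal scores identically.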
To gauge the effectiveness of score summing, in Figure 4 we plot the false positives as a function of true positives using both the SCOP whole-chain and FSSP test sets. As can be seen for the former, the added computational burden of building an HMM for all of the test set sequences so that score summing can be performed is not always justified. This changes when one considers the FSSP test set, as summing provides definite improvement beyond the 100 false positives level. This difference may be attributable to the fact that the FSSP test set contains sequences with no more than 25% sequence homology (as opposed to the SCOP whole-chain's 40%), and the summing is necessary to strengthen the weak scores between a truly homologous pair.
Another possible explanation is that the SCOP test consisted solely of single domains, while the FSSP test had to match domains from multiple-domain proteins. When the target and template have very different lengths, the scoring may well work better in one direction than the other.
For the structure-based fold-recognition tests, we performed both directions of scoring and summed the scores.
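Summed scoring for a target/template pair is then simply the following, where `score` is a hypothetical scoring callable (a minimal sketch, not the SAM interface):

```python
def summed_score(score, target_hmm, template_hmm, target_seq, template_seq):
    """Score summing: score the template sequence with the target's
    HMM and the target sequence with the template's HMM, then add."""
    return score(target_hmm, template_seq) + score(template_hmm, target_seq)
```

The cost of this symmetry is that an HMM must exist for both members of every pair, which is why summing was not used for the Pearson test.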
To evaluate the performance of the search methods for each test set, all pairs of target sequence and database sequence were sorted from best score to worst score. By sweeping through this sorted list, we compare the methods in three fashions. First, to make comparisons based on one number, in Table I we compare the number of errors at each method's minimum error point. Next, in Figures 5-8, discussed below, we plot the number of non-homolog pairs found versus the number of homolog pairs found (the false positives as a function of true positives). Since the number of false positives grows roughly exponentially with the number of true positives, setting an optimal threshold is difficult from the false-positive-versus-true-positive plot. Thus, we also plot the total number of errors as a function of true positives to provide a more detailed look at the tradeoff between precision (minimizing false positives) and recall (minimizing false negatives).
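The threshold sweep used to locate a minimum-error point can be sketched as follows. This is an illustrative reimplementation, not the evaluation scripts used in this work; soft homology labels, as in the FSSP test, are handled by fractional counts:

```python
def minimum_error_point(scored_pairs):
    """Sweep a single threshold over (score, homology) pairs sorted
    best-score-first (higher score = better).  `homology` may be
    fractional, as with the FSSP soft-threshold labels.  Errors at a
    cutoff are the false positives accepted above it plus the
    homologs (false negatives) rejected below it.  Returns
    (min_errors, true_positives_at_minimum)."""
    pairs = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    total_homologs = sum(h for _, h in pairs)
    tp = fp = 0.0
    best = (total_homologs, 0.0)   # accept nothing: all homologs missed
    for score, h in pairs:
        tp += h
        fp += 1 - h
        errors = fp + (total_homologs - tp)
        if errors < best[0]:
            best = (errors, tp)
    return best
```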
For the whole-chain SCOP dataset, Figure 6 shows that the HMM-based methods perform best for all levels of false positives. If no false positives are allowed, WU-BLASTP gets 148 true positives, DOUBLE-BLAST gets 233, and SAM-T98 gets 256. The minimum-error points are even more dramatically separated with 740 for WU-BLASTP, 665 for DOUBLE-BLAST, and only 555 errors for SAM-T98 (see Table I).
The closeness of the members of the families can be seen in the excellent performance of WU-BLASTP on this dataset. With no false positives, WU-BLASTP gets 547 true positives, DOUBLE-BLAST gets 603 true positives, and SAM-T98 gets only 350. At 200 false positives (near WU-BLASTP's minimum-error point), WU-BLASTP gets 2952 true positives, DOUBLE-BLAST gets 2760, and SAM-T98 gets 2584. At 400 false positives (near SAM-T98's minimum-error point), WU-BLASTP gets 3121 true positives, DOUBLE-BLAST gets 3099, and SAM-T98 gets 3287. Figure 8 shows this tradeoff in performance clearly. The SAM-T98 method was optimized for finding superfamilies, not families, and so it merges similar families together.
Note that we use a single threshold for each method for all of the targets in a test set, not a separate threshold for each target as was done previously for the Pearson test set [Agarwal & States, 1998; Karchin & Hughey, 1998]. Using separate thresholds would produce much more impressive numbers, but the single-threshold test is more valuable: we are not testing how well a particular library of models can be tuned, but how well a set of homologs can be found for a protein of unknown character. If we do not already know the classification, we cannot choose a classification-specific threshold, hence the insistence on a single threshold. If we had used an optimal threshold for each family, the SAM-T98 minimum-error point would have dropped from 584 to 285 errors; at that point, there were 3274 true positives and 148 false positives.
The SCOP database is a hierarchical classification of protein domain structures, with classification into class, fold, superfamily, family, and subfamily. We chose to consider pairs that were in the same superfamily to be correct matches, but we could have chosen any level of the hierarchy as our definition of correctness. Figure 9 shows how choosing different levels would affect our results for the SAM-T98 method. The almost identical false-positive rates for folds and superfamilies at low error rates means that overall error rate is much lower for superfamilies than for folds since there are many more false negatives at the fold level.
The SAM-T98 method also seems to do well on families in Figure 9, but a closer look at the calibration curve in Figure 10 shows that the homologs included in the SAM-T98 alignments are distant enough to contaminate the method as a family or subfamily recognizer (as was seen with the Pearson test set). We would have to use stricter thresholds in building the alignments to create a family, rather than a superfamily, recognizer.
The calibration curve in Figure 10 can be applied to a search with one target against a database of N sequences: the number of false positives read from the curve should be multiplied by N/436,645 to get the expected number of false positives.
If one ignores the ``fat tail'' (the excessive number of false positives for strong scores), the number of false positives can be reasonably approximated by a smooth function of the score. The fat tail probably results from two sources of error: small shared motifs (such as amphipathic helices) that are not long enough to justify classifying the proteins in the same superfamily, and contamination of the SAM-T98 alignment by non-homologous sequences.
[Figure legend: true-positive and false-positive curves for the optimal classifier, WU-BLASTP, DOUBLE-BLAST, target-only SAM-T98, and SAM-T98.]
SAM-T98 introduced reversed-model score adjustment. Not only does this scoring method correct for length and composition biases, but some other, subtler effects are also cancelled--for example the periodic hydrophobicity patterns of amphipathic helices or beta strands also appear in the reversed sequence, as does the lower frequency surface-core hydrophobicity pattern. Because of these subtle effects, the reversed sequence is a much more realistic decoy than a scrambled sequence.
These effects can affect scoring significantly. For example, in the scoring for the CASP-2 contest [Karplus et al., 1997], we had to eliminate by hand some coiled-coil models that scored any helical protein well; reversed-model scoring eliminates these problems. Also, metallothionein (4mt2), with 24 cysteines out of 61 residues, can align well to almost any sequence with conserved cysteines. Since many HMMs get a large part of their score from aligning highly conserved cysteines, 4mt2 often appeared as a false positive for these HMMs; but since the reversal of 4mt2 has the same number of cysteines with the same distribution of spacings, it also scores well for these HMMs, and the difference between the model and reversed-model scores is near zero.
The SAM-T98 method also introduced score summing. We performed score summing for CASP-2 as well; what is new here is the systematic evaluation of the approach. Summing entails the added computational expense of building additional HMMs, which is not always clearly justified. For the SCOP whole-chain and domain test sets, summing of scores provided negligible gain in performance. This was not the case for the FSSP test set, for which summing provided a marked improvement for homologs more remote than the minimum-error point.
If score summing is to be used, then models must be built for both the target sequences and the database sequences. If not, then only database HMMs or target sequence HMMs are built. Which HMMs to build depends on the number of sequences to classify and the number of families to classify into. If only a small number of sequences are to be identified, then it is probably better to build an HMM for each. If many are to be classified, then building a library of models is better.
The SAM-T98 method uses reversed-model scoring and score thresholds for selecting training sequences in its iterative procedure. We found that a predecessor method (SAM-T97), which did not include reversed-model scoring and used more liberal threshold values, gathered more remote homologs at the expense of including more spurious sequences in the alignments. This led to HMMs of a slightly different character: they were often more adept at finding the remotest homologies but less effective at filtering out false positives, as illustrated in Figure 11. While we believe that reversed-model scoring should be retained, we will investigate threshold settings that best balance sensitivity and accuracy when building HMMs.
The SAM-T98 method, and its use as exemplified in this work, is available on the World Wide Web at http://www.cse.ucsc.edu/research/compbio/. The first option allows one to search large sequence databases for homologs using a single query sequence; this uses the SAM-T98 method to build an HMM and search a database. Since it is not feasible to build HMMs for all database sequences, database scoring does not sum any scores. The second option allows one to search our model library with a sequence. This is similar to the first option, except that the database is composed of selected sequences from PDB; since we have constructed HMMs for each of these sequences, score summing is used. Other options allow access to separate components of the SAM-T98 method: one can build an alignment from a query sequence, generate sequence weights from an alignment, or build an HMM from a single query sequence or an alignment with weights.
Future work is needed in several directions: evaluating other fold-recognition methods, tuning the parameters (such as thresholds and number of iterations) of SAM-T98, and evaluating the quality of alignments produced as a by-product of the fold recognition. Other fold-recognition techniques that need to be evaluated include other sequence-based methods for finding relationships, such as MetaMEME [Grundy et al., 1997] and SearchWise [Birney et al., 1996], structure-structure comparison techniques, and methods such as threading that use structure information for the template sequence, but not the target sequence. Some of the more popular sequence-based methods, including PSI-BLAST [Altschul et al., 1997] and ISS [Park et al., 1997], have already been tested on the SCOP dataset [Park et al., 1998]. One attempt at testing structure-structure aligners has been made [Gerstein & Levitt, 1998], but that experiment looked only at pairs known to be in the same superfamily, so no false-positive rate can be determined.
We thank Nguyet Manh, who ran many of the early tests of the SAM-T97 method on the FSSP test set and did much of the work making SAM-T98 available on the web site, and Cyrus Chothia, who provided us with the selected sequences that make up the SCOP test sets. Special thanks to Philipp Bucher and Kay Hofmann, whose use of reversed sequence databases for normalizing HMMs inspired our somewhat different use of reversed model scoring.
This work was supported in part by NSF grant DBI-9408579, DOE grant DE-FG0395ER62112, and a grant from Digital Equipment Corporation.