Hidden Markov Models and Protein Sequence Analysis

by Rachel Karchin

Table of Contents


Protein Structure

Statistical Profiles

Hidden Markov Models

Scoring a Sequence with an HMM

What the Score Means

Local and Global Scoring

Building an HMM

Sequence Weighting

Overfitting and Regularization

Uses of HMMs

Classifying Sequences with an HMM

Creating multiple alignments


Appendix A



This paper provides an introduction to a sophisticated and flexible statistical model called the hidden Markov model (HMM) and its use as a tool in the study of protein molecules. Due to the breadth and complexity of the subject, my goal is to provide an overview of HMM-related protein research. The topics covered do not represent a comprehensive study of the field, and I have targeted my explanation of the technical material at an undergraduate computer science major who knows something about chemistry and electric circuits, but is unfamiliar with HMMs. At the risk of presenting an oversimplified version of the material, I have avoided mathematical equations as much as possible and tried to explain algorithms in plain English.

Using HMMs to analyze proteins is part of a new scientific field called bioinformatics, based on the relationship between computer science, statistics and molecular biology. Because of large government-sponsored projects like the Human Genome Project in the United States, there has been an exponential increase in the quantity of data available about proteins, DNA, and RNA. Traditional lab methods of studying the structure and function of these molecules are no longer able to keep up with the rate of new information. As a result, molecular biologists have turned to statistical methods capable of analyzing large amounts of data, and to computer programs which implement these methods.

Protein Structure

Proteins are large, organic molecules and are among the most important components in the cells of living organisms. They are more diverse in structure and function than any other kind of molecule. Enzymes, antibodies, hormones, transport molecules, hair, skin, muscle, tendons, cartilage, claws, nails, horns, hooves, and feathers are all made of proteins [1].

These remarkable molecules are built from an alphabet of twenty smaller molecules known as amino acids (see Appendix A). All twenty amino acids have a common chemical core of two functional groups, a carboxylic acid group and an amino group. A central carbon atom connects the two groups. In nature, this atom is always bonded to both a hydrogen atom and another functional group called a side chain, which is different for each of the twenty amino acids [1].

Amino acids within a protein are bonded by a condensation reaction in which the carboxylic acid group (COOH) of one amino acid is linked to the amino group (NH2) of a second, releasing one molecule of water. This combination of two amino acids is known as a peptide linkage. As shown in Figure 1, sequential condensations produce a chain of bonded amino acids known as a polypeptide chain [1].

Figure 1. Condensation reactions producing a polypeptide chain

The linear sequence of amino acids in a polypeptide chain is known as the primary structure of a protein. The enormous diversity of proteins is due to the many ways in which amino acids can combine in these chains. A set of n amino acids can form 20n polypeptides, so 20100 combinations are possible for a protein built from 100 amino acids. This number is larger than the number of known atoms in the universe [1].

Within long polypeptide chains, certain sections twist into coils and fold into sheets. These shapes are known as the secondary structure of a protein. Secondary structure is formed by hydrogen bonds between the carboxylic acid group and the amino group of two amino acids which are not adjacent in the polypeptide chain. If the two amino acids are part of a single chain, a twisted-helix shape is formed. When a protein is built from multiple chains of polypeptides, multiple hydrogen bonds can form across many chains. This creates pleated sheets. Examples of these sheets can be seen in the lower half of Figure 2. The tightly coiled structures in the upper half of Figure 2 are twisted helices, known as alpha helices.

Figure 2. Secondary structure of metalloprotease/inhibitor protein

Although the linear string of amino acids is a simplified view of a molecule's structure, relationships between proteins can frequently be seen by comparing only their primary structures. Figure 3 shows the primary structures of four related proteins. The amino acid molecules are represented by single letters. Only a small piece of each protein is shown. They are laid out in a multiple alignment, an ordering which highlights areas in which they are extremely similar.





Figure 3. Primary structure of four related proteins

By taking a close look at the alignment, we can see the history of evolution in this protein family. Possibly, the ancestor of the four proteins above looked like the protein in Figure 4.

Figure 4. A possible common ancestor

To understand why, consider what happens to a protein inside a cell when the cell reproduces. Through a process called mitosis, the cell makes a copy of itself and then splits into two daughter cells. Most of the time, a protein in the parent cell is exactly duplicated in the daughter cell. However, over long periods of time, errors occur in this copying process. When this happens, a protein in the daughter cell is slightly different from its counterpart in the parent. The three most common errors are substitution of an amino acid in a given position, insertion of one or more new amino acids, and deletion of one or more amino acids.

Figure 5. Protein from ancestor cell and a possible descendant

Figure 5 shows a possible worst case scenario after two kinds of errors have occurred. The daughter cell's protein is on the bottom row. A substitution error has occurred in position 13, where the parent has the amino acid F and the daughter has the amino acid V. The daughter also has six insertion errors (REDSSK) in positions 7 through 12.

Statistical Profiles

As a result of these errors, proteins which share a common ancestor are not exactly alike. However, they inherit many similarities in primary structure from their ancestor. This is known as conservation of primary structure in a protein family. An example is seen in Figure 3, where most of the positions (columns) in the multiple alignment contain only one or two distinct amino acids.

These structural similarities make it possible to create a statistical model of a protein family. The model shown in Figure 6 is a simplified statistical profile, a model which shows the amino acid probability distribution for each position in the family [2]. According to this profile, the probability of C in position 1 is 0.8, the probability of G in position 2 is 0.4, and so forth. The probabilities are calculated from the observed frequencies of amino acids in the family.

Figure 6. A statistical model of ten related proteins

Given a profile, the probability of a sequence is the product of the amino acid probabilities given by the profile. For example, the probability of CGGSV, given the profile in Figure 6, is

0.8 * 0.4 * 0.8 * 0.6 * 0.2 = .031.

Given a statistical model, the probability of a sequence is used to calculate a score for the sequence. Because multiplication of fractions is computationally expensive and prone to floating point errors such as underflow, a convenient transformation into the logarithmic world is used. The score of a sequence is calculated by taking the logs of all amino acid probabilities and adding them up. Using this method with base e logarithms, the score of CGGSV is

loge(0.8)+loge(0.4)+loge(0.8)+loge(0.6)+loge(0.2) = -3.48

In practice, profile models take other factors into account. For example, members of a protein family have varying lengths, so a score penalty is charged for insertions and deletions. The scores of individual amino acids in a profile are also position specific. In other words, more weight must be given to an unlikely amino acid which appears in a structurally important position in the protein than to one which appears in a structurally unimportant position.

Although these refinements are necessary to create good profile models, they introduce many additional free parameters which must be calculated when building a profile, and unfortunately, the calculations must be done by trial and error. These shortcomings set the stage for a new kind of profile, based on the Hidden Markov model.

Hidden Markov Models

Hidden Markov models (HMMs) offer a more systematic approach to estimating model parameters. The HMM is a dynamic kind of statistical profile. Like an ordinary profile, it is built by analyzing the distribution of amino acids in a training set of related proteins. However, an HMM has a more complex topology than a profile. It can be visualized as a finite state machine, familiar to students of computer science.

Finite state machines typically move through a series of states and produce some kind of output either when the machine has reached a particular state or when it is moving from state to state. The HMM generates a protein sequence by emitting amino acids as it progresses through a series of states. Each state has a table of amino acid emission probabilities similar to those described in a profile model. There are also transition probabilities for moving from state to state.

Figure 7. A possible hidden Markov model for the protein ACCY. The protein is represented as a sequence of probabilities. The numbers in the boxes show the probability that an amino acid occurs in a particular state, and the numbers next to the directed arcs show probabilities which connect the states. The probability of ACCY is shown as a highlighted path through the model.

Figure 7 shows one topology for a hidden Markov model. Although other topologies are used, the one shown is very popular in protein sequence analysis. Note that there are three kinds of states represented by three different shapes. The squares are called match states, and the amino acids emitted from them form the conserved primary structure of a protein. These amino acids are the same as those in the common ancestor or, if not, are the result of substitutions. The diamond shapes are insert states and emit amino acids which result from insertions. The circles are special, silent states known as delete states and model deletions.

Transitions from state to state progress from left to right through the model, with the exception of the self-loops on the diamond insertion states. The self-loops allow deletions of any length to fit the model, regardless of the length of other sequences in the family.

Scoring a Sequence with an HMM

Any sequence can be represented by a path through the model. The probability of any sequence, given the model, is computed by multiplying the emission and transition probabilities along the path.

In Figure 7, a path through the model represented by ACCY is highlighted. In the interest of saving space, the full tables of emission probabilities are not shown. Only the probability of the emitted amino acid is given. For example, the probability of A being emitted in position 1 is 0.3, and the probability of C being emitted in position 2 is 0.6. The probability of ACCY along this path is

.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 *.01 * 1 = 1.76x10-6.

As in the profile case described above, the calculation is simplified by transforming probabilities to logs so that addition can replace multiplication. The resulting number is the raw score of a sequence, given the HMM.

For example, the score of ACCY along the path shown in Figure 7 is

loge(.4) + loge(.3) + loge(.46) + loge(.6) + loge(.97) + loge(.5) + loge(.015) + loge(.73) +loge(.01) + loge(1) = -13.25

The calculation is easy if the exact state path is known, as in the toy example of Figure 7. In a real model, many different state paths through a model can generate the same sequence. Therefore, the correct probability of a sequence is the sum of probabilities over all of the possible state paths. Unfortunately, a brute force calculation of this problem is computationally unfeasible, except in the case of very short sequences. Two good alternatives are to calculate the sum over all paths inductively using the forward algorithm, or to calculate the most probable path through the model using the Viterbi algorithm. Both algorithms are described below.

Figure 8. HMM with multiple paths through the model for ACCY. The highlighted path is only one of several possibilities.

Consider the HMM shown in Figure 8. The Insert, Match, and Delete states are labeled with their position number in the model, M1, D1 etc. (States I1 and I2 are unlabelled to reduce clutter.) Because the number of insertion states is greater than the number of match or delete states, there is an extra insertion state at the beginning of the model, labeled I0. Unlike the HMM in Figure 7, where the state path for ACCY was known, several state paths through the model are possible for this sequence.

The most likely path through the model is computed with the Viterbi algorithm. The algorithm employs a matrix, shown in Figure 9. The columns of the matrix are indexed by the states in the model, and the rows are indexed by the sequence. Deletion states are not shown, since, by definition, they have a zero probability of emitting an amino acid. The elements of the matrix are initialized to zero and then computed with these steps:

1. The probability that the amino acid A was generated by state I0 is computed and entered as the first element of the matrix.

2. The probabilities that C is emitted in state M1 (multiplied by the probability of the most likely transition to state M1 from state I0) and in state I1 (multiplied by the most likely transition to state I1 from state I0) are entered into the matrix element indexed by C and I1/M1.

3. The maximum probability, max(I1, M1), is calculated.

4. A pointer is set from the winner back to state I0.

5. Steps 2-4 are repeated until the matrix is filled.

Prob(A in state I0) = 0.4*0.3=0.12

Prob(C in state I1) = 0.05*0.06*0.5 = .015

Prob(C in state M1) = 0.46*0.01 = 0.005

Prob(C in state M2) = 0.46*0.5 = 0.23

Prob(Y in state I3) = 0.015*0.73*0.01 = .0001

Prob(Y in state M3) = 0.97*0.23 = 0.22

The most likely path through the model can now be found by following the back-pointers.

Figure 9. Matrix for the Viterbi algorithm

Once the most probable path through the model is known, the probability of a sequence given the model can be computed by multiplying all probabilities along the path.

The forward algorithm is similar to Viterbi. However in step 3, a sum rather than a maximum is computed, and no back pointers are necessary. The probability of the sequence is found by summing the probabilities in the last column. The resulting matrix is shown in Figure 10.

Prob(A in state I0) = 0.4*0.3=0.12

Prob(C in state I1) = 0.05*0.06*0.5 = 0.015

Prob(C in state M1) = 0.46*0.01= 0.005

Prob(C in state M2) = (0.005*0.97) +(0.015*0.46)= .012

Prob(Y in state I3) = .012*0.015*0.73*0.01 = 1.31x10-7

Prob(Y in state M3) = .012*0.97*0.2 = 0.002

Figure 10. Matrix for the Forward algorithm

What the Score Means

Once the probability of a sequence has been determined, its score can be computed. Because the model is a generalization of how amino acids are distributed in a related group (or class) of sequences, a score measures the probability that a sequence belongs to the class. A high score implies that the sequence of interest is probably a member of the class, and a low score implies it is probably not a member.

Local and Global Scoring

In the examples above, global scoring was used. This means simply that computation of the score begins at the first amino acid in the sequence and ends at the last one. Even though this may seem like the most natural way to compute a score, the results are often misleading. Because of the evolutionary variety found in related protein sequences, a family member may be composed of both highly conserved areas which score well against the model and divergent areas which score poorly. If both kinds of areas are given equal importance, the overall score of family member may be poor.

The solution to this problem is to use local scoring, where the score of a sequence is set to the score of its highest scoring subsequence. The principle can be illustrated with a very simple example. Consider again the sequence ACCY shown in Figure 5. Converting probabilities to the log world, the global score for the sequence is the sum of all four scores: -13.25. The computation is shown in Figure 11.

Figure 11. The log score of ACCY

Clearly, the score has been significantly lowered by A and Y. The score is low enough ACCY that may not appear to be a member of the family being modeled.

Figure 12. The family of sequence ACCY

Let's assume ACCY is a member of the family shown in Figure 12. In this case, the global score proves to be a poor measure of family membership. However, if local scoring is used to evaluate the sequence, the final score is much higher. The highest scoring subsequence is found to be CC, with a score of -2.01. Unlike the global score, the local score is high enough to classify ACCY in this family. In situations like this, classifications based on local scoring are more accurate than those based on global scoring.

Building an HMM

Another tricky problem is how to create an HMM in the first place, given a particular set of related training sequences. It is necessary to estimate the amino acid emission distributions in each state and all state-to-state transition probabilities from a set of related training sequences.

If the state paths for all the training sequences are known, the emission and transition probabilities in the model can be calculated by computing their expected value: observing the number of times each transmission or emission occurs in the training set and dividing by the sum of all the transmission probabilities or all the emission probabilities.

If the state paths are unknown, finding the best model given the training set is an optimization problem which has no closed form solution. It must be solved by iterative methods.

The algorithms used to do this are closely related to the scoring algorithms described previously. The goal is to find model parameters which maximize the probability of all sequences in the training set. In other words, the desired model is a model against which all the sequences in the training set will have the best possible scores. The parameters are re-estimated after every iteration by computing a score for each training sequence against the previous set of model parameters.

The Baum-Welch algorithm is a variation of the forward algorithm described earlier. It begins with a reasonable guess for an initial model and then calculates a score for each sequence in the training set over all possible paths through this model . During the next iteration, a new set of expected emission and transition probabilities is calculated, as described above for the case when state paths are known. The updated parameters replace those in the initial model, and the training sequences are scored against the new model. The process is repeated until model convergence, meaning there is very little change in parameters between iterations.

The Viterbi algorithm is less computationally expensive than Baum-Welch. As described in the previous section, it computes the sequence scores over the most likely path rather than over the sum of all paths.

However, there is no guarantee that a model built with either algorithm has parameters which maximize the probability of the training set. As in many iterative methods, convergence indicates only that a local maximum has been found. Several heuristic methods have been developed to deal with this problem. One approach is to start with several initial models and proceed to build several models in parallel. When the models converge at several different local optimums, the probability of each model given the training set is computed, and the model with the highest probability wins. Another approach is to add noise, or random data, into the mix at each iteration of the model building process. Typically, an annealing schedule is used. The schedule controls the amount of noise added during each iteration. Less and less noise is added as iterations proceed. The decrease is either linear or exponential. The effect is to delay the convergence of the model. When the model finally does converge, it is more likely to have found a good approximation to the global maximum [3].

Sequence Weighting

Another problem with HMMs is that if there is a small group of sequences in the training set which are highly similar, the model will overspecialize to the small group. To prevent this, several methods of sequence weighting have been developed. These methods give the outlier sequences, those that do not belong to the highly similar group, additional importance in the calculation of model parameters.

The simplest weighting methods are based on tree structures. Sequences in a family are placed on branches of a tree, representing their divergence from a common ancestor. One interesting approach is to visualize a tree made of conducting wire with a voltage V applied to the root. The leaves are set to V=0, and the currents flowing through each branch are calculated using Kirchoff's laws. The currents are then used as the sequence weights. Intuitively, the currents will be smaller in a more highly divided area of the tree (the highly similar sub-group), and larger in less divided areas of the tree (the outliers) [4].

Figure 13. Calculating tree-based weights

Figure 13 demonstrates this approach. According to Kirchoff's laws, the following equations can be derived:

I0=I1+I2 I1=I2

I2=I3+I4 I3=I4

I3=I5+I6 I5=I6

I4=I7+I8 I7=I8

With some algebraic manipulation, it can be shown that:

I1=I2= .5 * I0

I3=I4= .25*I1

I5=I6=I7=I8= .125 * I1

The corresponding sequence weights are shown in Figure 14.

Figure 14. Tree-based sequence weights

Another approach is known as maximum discrimination weighting [5]. In this method, weights are estimated iteratively while the model is being built. After each iteration in the model building process, the sequences in the training set are scored against the current model. Weights are assigned to each sequence, with poorly scoring sequences (outliers) receiving the highest valued weights. During the next iteration, these sequences get more importance than sequences which had good scores during the prior round. The process repeats until the model converges.

A third kind of weighting method is based on trying to make the statistical spread in the model as uniform as possible [6]. The position-specific weighting method developed by Henikoff & Henikoff falls into this category [7] . In this method, weights are based on the diversity observed in the columns of a multiple alignment of sequences.

A weight is computed for each position in a sequence inversely proportional to m, the number of different amino acids in the column and k, the number of times the amino acid of interest appears in the column.

For example, looking at the first sequence in the multiple alignment shown in Figure 3, the weight of amino acid C in position 1 is


and the weight of amino acid N in position 7 is


The weight of a sequence is the average of the weights in all positions, normalized to sum to 1.

It is not possible to identify a single, best weighting method. Choice of a weighting method depends both on what the resulting model will be used for and the particular protein group being modeled.

Overfitting and Regularization

Clearly, accurate methods of estimating amino acid distributions are necessary to build good HMMs. A potential pitfall is illustrated by the example shown in Figure 3. Consider these four sequences to be members of a training set. The first column shown contains one distinct amino acid: C. Using the methods described so far, the probability of any of the other 19 amino acids appearing in this position is 0. However, if we set all these probabilities to 0 in the model, we end up with a model unable to recognize family members which do not begin with a C. Remember that members of the training set are only a small percentage of real members of the protein family. At least some of the other members are likely to begin with different amino acids.

This problem is known as overfitting. A variety of approaches known as regularization have been developed to deal with it. The simplest is to use pseudocounts: this means that, even if a given amino acid does not appear in a column of an aligned training set, it is given a fake count. Fake counts are also added for the amino acids which appear in the column. When probabilities are calculated, the fake counts are treated exactly like real observed counts. Take a look at the first column of Figure 3, which has 4 counts of the amino acid C. If a pseudocount of 1 is used for each of the 20 amino acids, in effect, the column contains 24 entries, of which 4 are real and 20 are fake. The probability of A, which does not appear in the column, is calculated as shown in Figure 15.

Figure 15. The probability of A in column 1 of Figure 3 using pseudocounts

The probability of C in position 1 is shown in Figure 16.

Figure 16. The probability of C in column 1 of Figure 3 using pseudocounts

A sophisticated application of this method is known as Dirichlet mixtures [8]. The mixtures are created by statistical analysis of the distribution of amino acids at particular positions in a large number of proteins. The mixtures are built from smaller components known as Dirichlet densities.

A Dirichlet density is a probability density over all possible combinations of amino acids appearing in a given position. It gives high probability to certain distributions and low probability to others. For example, a particular Dirichlet density may give high probability to conserved distributions where a single amino acid predominates over all others. Another possibility is a density where high probability is given to amino acids with a common identifying feature, such as the subgroup of hydrophobic amino acids.

When an HMM is built using a Dirichlet mixture, a wealth of information about protein structure is factored into the parameter estimation process. The pseudocounts for each amino acid are calculated from a weighted sum of Dirichlet densities and added to the observed amino acid counts from the training set. The parameters of the model are calculated as described above for simple pseudocounts.

Uses of HMMs

Classifying Sequences with an HMM

One common application of HMMs is classifying sequences in a database. The method is to build an HMM with a training set of known members of the class, and then to compute the score of all sequences in the database. The resulting scores are ranked and a suitable threshold is selected to separate class members from the other sequences in the database.

Unfortunately, because these raw scores are length-dependent, they must be modified before the ranking can occur. Various techniques have been developed to work around this shortcoming [9]. One approach is to calculate the difference in standard deviations between the raw score of a sequence and the mean score of sequences of similar length in the database. A more sophisticated method scores a sequence by a log-odds ratio, shown in Figure 17.

Figure 17. The log-odds ratio

This number is the log of the ratio between two probabilities the probability that the sequence was generated by the HMM and the probability that the sequence was generated by a null model, whose parameters reflect the general amino acid distribution in the training sequences. Barrett, Hughey, and Karplus have shown that a good choice of null model is the geometric mean of amino acids in the match states [9].

Another important issue is the selection of a suitable threshold separating family and non-family members. One good method uses Milosavljevic's algorithmic significance test [10], setting the threshold to


where N is the number of sequences in the database being searched, and z is the logarithm base, usually set to the values 2 or e. The variable sigma is a significance level which varies from .01 to 10. Choice of a large sigma increases the chance of false positives. These are non-family members which exceed the threshold because they score well against the HMM out of pure chance. A small sigma sets the threshold at a level where fewer sequences will be classified as family members. If sigma is too small, the chance of false negatives is increased. In this case, sequences which are members of the family may not exceed the threshold and may be misclassified.

Creating multiple alignments

Another important use of HMMs is to automatically create a multiple alignment from a group of unaligned sequences. Multiple alignment is the process of taking a group of sequences and identifying amino acids which are homologous, structurally or functionally similar. The homologous amino acids are aligned in columns (see Figure 3). Until recently, this tedious task was done by hand and required a biology expert with extensive knowledge of protein structure and evolution.

A multiple alignment can be generated by using the Viterbi algorithm to find the most likely path through the HMM for each sequence. Each match state in the HMM corresponds to a column in the multiple alignment. A delete state is represented by a dash. Amino acids from insert states are either not shown or are displayed in lower case letters.

An alternative to Viterbi is a method known as posterior decoding. By considering all positions in the model and all twenty amino acids, one can calculate the probability that each amino acid occurs at that each position in the model. This produces a table of probabilities for all possible pairs of amino acids and positions in the model. From the table, it is possible to find the highest probability path through the model for any protein, based on all possible places that any amino acid can occur.


This paper has explored many exciting features of HMMs. However, these models do have a few limitations.

The HMM is a linear model and is unable to capture higher order correlations among amino acids in a protein molecule. These correlations include hydrogen bonds between non-adjacent amino acids in a polypeptide chain, hydrogen bonds created between amino acids in multiple chains, and disulfide bridges, chemical bonds between C (cysteine) amino acids which are distant from each other within the molecule.

In reality, amino acids which are far apart in the linear chain may be physically close to each other when a protein folds. Chemical and electrical interactions between them cannot be predicted with a linear model.

Another flaw of HMMs lies at the very heart of the mathematical theory behind these models. As anyone who has taken a stochastics course knows, the probability of a sequence of events is the multiplicative product of the probabilities of individual events only when the events are independent. Throughout this paper, I have claimed that the probability of a protein sequence can be found by multiplying the probabilities of the amino acids in the sequence. This claim is only valid if the probability of any amino acid in the sequence is independent of the probabilities of its neighbors.

In biology, this is not the case. There are, in fact, strong dependencies between these probabilities. For example, hydrophobic amino acids are highly likely to appear in proximity to each other. Because such molecules fear water, they cluster at the inside of a protein, rather than at the surface where they would be forced to encounter water molecules.

These biological realities have motivated research into new kinds of statistical models. Hybrids of HMMs and neural nets, dynamic Bayesian nets, factorial HMMs, Boltzmann trees and hidden Markov random fields are among the areas being
explored [6].

A good way to learn more about HMMs is to try using them. To start, check out the URLs listed in Figure 18.

Figure 18. URLs for HMM web servers

You can find more information about HMMs and protein research at the URLS shown in Figure 19.

Figure 19. URLs for HMM and protein information

I also highly recommend the newly released book by Eddy, Mitchison, and Durbin, Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids [6], which you can order online at http://www.amazon.com.


[1] J. Olmsted and G. Williams. Chemistry the Molecular Science, pp. 549-557. William C. Brown. 1997.

[2] M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences of
the USA,
84:4355-4358, 1987.

[3] R. Hughey and A. Krogh. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Computer Applications in the Biosciences, 12:95-107. 1996.

[4] J.D. Thompson, D.G. Higgins, and T.J. Gibson. Improved sensitivity of profile searches through the use of sequence weights and gap excision. Computer Applications in the Biosciences, 10:19-29. 1994b.

[5] S.R. Eddy, G. Mitchison, and R. Durbin. Maximum discrimination hidden Markov models of sequence consensus. Journal of Computational Biology, 2:9-23. 1995.

[6] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. pp. 51-68. Cambridge University Press. 1998.

[7] S. Henikoff and J.G. Henikoff. Position-based sequence weights. Journal of Molecular Biology, 243:574-578. 1994.

[8] K. Sjolander, K. Karplus, M. Brown, R. Hughey, A. Krogh, S.I. Mian, and D. Haussler. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Computer Applications in the Biosciences, 12:327-345. 1996.

[9] C. Barrett, R. Hughey, and K. Karplus. Scoring hidden Markov models. CABIOS, 13(2):191-199. 1997.

[10] A. Milosavljevic and J. Jurka. Discovering simple DNA sequences by the algorithmic similarity method. CABIOS, 9(4):407-411. 1993.

Appendix A

Twenty Amino Acids Commonly Found in Proteins

A ALA Alanine
V VAL Valine
L LEU Leucine
I ILE Isoleucine
F PHE Phenylalanine
P PRO Proline
M MET Methionine
D ASP Aspartic Acid
E GLU Glutamic Acid
K LYS Lysine
R ARG Arginine
S SER Serine
T THR Threonine
C CYS Cysteine
N ASN Asparagine
Q GLU Glutamine
H HIS Histidine
Y TYR Tyrosine
W TRP Tryptophan
G GLY Glycine