
This paper provides an introduction to a sophisticated and
flexible statistical model called the *hidden Markov model* (HMM)
and its use as a tool in the study of protein molecules. Due to the
breadth and complexity of the subject, my goal is to provide an
overview of HMM-related protein research. The topics covered do not
represent a comprehensive study of the field, and I have targeted my
explanation of the technical material at an undergraduate computer
science major who knows something about chemistry and electric
circuits, but is unfamiliar with HMMs. At the risk of presenting an
oversimplified version of the material, I have avoided mathematical
equations as much as possible and tried to explain algorithms in plain
English.

Using HMMs to analyze proteins is part of a new scientific
field called *bioinformatics*, based on the relationship between
computer science, statistics and molecular biology. Because of large
government-sponsored projects like the *Human Genome Project* in
the United States, there has been an exponential increase in the
quantity of data available about proteins, DNA, and RNA. Traditional
lab methods of studying the structure and function of these molecules
are no longer able to keep up with the rate of new information. As a
result, molecular biologists have turned to statistical methods
capable of analyzing large amounts of data, and to computer programs
which implement these methods.

Proteins are large, organic molecules and are among the most important components in the cells of living organisms. They are more diverse in structure and function than any other kind of molecule. Enzymes, antibodies, hormones, transport molecules, hair, skin, muscle, tendons, cartilage, claws, nails, horns, hooves, and feathers are all made of proteins [1].

These remarkable molecules are built from an alphabet of twenty
smaller molecules known as *amino acids* (see Appendix A). All twenty amino acids have a common chemical core of two functional groups, a *carboxylic acid* group
and an *amino* group. A central carbon atom connects the two
groups. In nature, this atom is always bonded to both a hydrogen atom
and another functional group called a *side chain*, which is
different for each of the twenty amino acids [1].

Amino acids within a protein are bonded by a condensation
reaction in which the carboxylic acid group (COOH) of one amino acid
is linked to the amino group (NH_{2}) of a second, releasing
one molecule of water. This combination of two amino acids is known
as a *peptide linkage*. As shown in Figure
1, sequential condensations produce a chain of bonded amino acids
known as a *polypeptide chain* [1].

Figure 1. Condensation reactions producing a polypeptide chain

The linear sequence of amino acids in a polypeptide chain is
known as the *primary structure* of a protein. The enormous
diversity of proteins is due to the many ways in which amino acids can
combine in these chains. A chain of **n** positions, each filled by one of the twenty amino acids, can form 20^{n} distinct polypeptides, so 20^{100} combinations are possible for a protein built from 100 amino acids. This number is larger than the number of known atoms in the universe [1].

Within long polypeptide chains, certain sections twist into
coils and fold into sheets. These shapes are known as the
*secondary structure* of a protein. Secondary structure is
formed by hydrogen bonds between the carboxylic acid group and the
amino group of two amino acids which are not adjacent in the
polypeptide chain. If the two amino acids are part of a single chain,
a twisted-helix shape is formed. When a protein is built from
multiple chains of polypeptides, multiple hydrogen bonds can form
across many chains. This creates *pleated sheets*. Examples of
these sheets can be seen in the lower half of Figure
2. The tightly coiled structures in the upper half of Figure 2 are twisted helices, known as *alpha
helices*.

Figure 2. Secondary structure of metalloprotease/inhibitor protein

Although the linear string of amino acids is a simplified view
of a molecule's structure, relationships between proteins can
frequently be seen by comparing only their primary structures. Figure 3 shows the primary structures of four related
proteins. The amino acid molecules are represented by single letters.
Only a small piece of each protein is shown. They are laid out in a
*multiple alignment*, an ordering which highlights areas in which
they are extremely similar.

CGGSLLNAN--TVLTAAHC

CGGSLIDNK-GWILTAAHC

CGGSLIRQG--WVMTAAHC

CGGSLIREDSSFVLTAAHC

Figure 3. Partial primary structures of four related proteins, laid out as a multiple alignment

By taking a close look at the alignment, we can see the history of evolution in this protein family. Possibly, the ancestor of the four proteins above looked like the protein in Figure 4.

Figure 4. A possible common ancestor

To understand why, consider what happens to a protein inside a cell when the cell reproduces. Through a process called *DNA replication*, the DNA that encodes the protein is copied and passed on. The copy is not always perfect: errors occasionally substitute one amino acid for another, insert extra amino acids into the chain, or delete amino acids from it.

As a result of these errors, proteins which share a common
ancestor are not exactly alike. However, they inherit many
similarities in primary structure from their ancestor. This is known
as *conservation* of primary structure in a protein family. An
example is seen in Figure 3, where most of the
positions (columns) in the multiple alignment contain only one or two
distinct amino acids.

These structural similarities make it possible to create a
statistical model of a protein family. The model shown in Figure 6 is a simplified statistical *profile*,
a model which shows the amino acid probability distribution for each
position in the family [2]. According to this
profile, the probability of C in position 1 is 0.8, the probability of
G in position 2 is 0.4, and so forth. The probabilities are
calculated from the observed frequencies of amino acids in the
family.

Figure 6. A statistical model of ten related proteins

Given a profile, the probability of a sequence is the product of the amino acid probabilities given by the profile. For example, the probability of CGGSV, given the profile in Figure 6, is

0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031.

Given a statistical model, the probability of a sequence is used to
calculate a *score* for the sequence. Because multiplication of
fractions is computationally expensive and prone to floating point
errors such as underflow, a convenient transformation into the
logarithmic world is used. The score of a sequence is calculated by
taking the logs of all amino acid probabilities and adding them up.
Using this method with base **e** logarithms, the score of CGGSV
is

log_{e}(0.8)+log_{e}(0.4)+log_{e}(0.8)+log_{e}(0.6)+log_{e}(0.2) = -3.48
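To make the bookkeeping concrete, here is a minimal Python sketch of both calculations. The profile dictionary is a stand-in for Figure 6 and lists only the probabilities used in the CGGSV example; everything else about it is assumed.

```python
import math

# Hypothetical stand-in for the profile of Figure 6: one probability table
# per position, listing only the entries needed for the CGGSV example.
profile = [
    {"C": 0.8},  # position 1
    {"G": 0.4},  # position 2
    {"G": 0.8},  # position 3
    {"S": 0.6},  # position 4
    {"V": 0.2},  # position 5
]

def profile_probability(sequence, profile):
    """Product of the per-position amino acid probabilities."""
    p = 1.0
    for position, amino_acid in enumerate(sequence):
        p *= profile[position][amino_acid]
    return p

def profile_score(sequence, profile):
    """Log-space score: the sum of the logs of the same probabilities."""
    return sum(math.log(profile[position][amino_acid])
               for position, amino_acid in enumerate(sequence))

print(profile_probability("CGGSV", profile))  # ~0.031
print(profile_score("CGGSV", profile))        # ~-3.48
```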

In practice, profile models take other factors into account.
For example, members of a protein family have varying lengths, so a
score *penalty* is charged for insertions and deletions. The
scores of individual amino acids in a profile are also *position
specific*. In other words, more weight must be given to an
unlikely amino acid which appears in a structurally important position
in the protein than to one which appears in a structurally unimportant
position.

Although these refinements are necessary to create good profile models, they introduce many additional free parameters which must be calculated when building a profile, and unfortunately, the calculations must be done by trial and error. These shortcomings set the stage for a new kind of profile, based on the Hidden Markov model.

Hidden Markov models (HMMs) offer a more systematic approach to
estimating model parameters. The HMM is a dynamic kind of statistical
profile. Like an ordinary profile, it is built by analyzing the
distribution of amino acids in a training set of related proteins.
However, an HMM has a more complex topology than a profile. It can be
visualized as a *finite state machine*, familiar to students of
computer science.

Finite state machines typically move through a series of states
and produce some kind of output either when the machine has reached a
particular state or when it is moving from state to state. The HMM
generates a protein sequence by *emitting* amino acids as it
progresses through a series of states. Each state has a table of
amino acid *emission probabilities* similar to those described in
a profile model. There are also *transition probabilities* for
moving from state to state.

Figure 7. A possible hidden Markov model for the protein ACCY. The numbers in the boxes show the probability that an amino acid is emitted in a particular state, and the numbers next to the directed arcs show the transition probabilities between states. The path generating ACCY is shown highlighted through the model.

Figure 7 shows one topology for a
hidden Markov model. Although other topologies are used, the one
shown is very popular in protein sequence analysis. Note that there
are three kinds of states represented by three different shapes. The
squares are called *match states*, and the amino acids emitted
from them form the conserved primary structure of a protein. These
amino acids are the same as those in the common ancestor or, if not,
are the result of substitutions. The diamond shapes are *insert states*, which emit amino acids resulting from insertions. The circles are special, silent states known as *delete states*, which model deletions.

Transitions from state to state progress from left to right through the model, with the exception of the self-loops on the diamond insertion states. The self-loops allow insertions of any length to fit the model, regardless of the length of other sequences in the family.

Any sequence can be represented by a path through the model. The probability of any sequence, given the model, is computed by multiplying the emission and transition probabilities along the path.

In Figure 7, a path through the model represented by ACCY is highlighted. In the interest of saving space, the full tables of emission probabilities are not shown. Only the probability of the emitted amino acid is given. For example, the probability of A being emitted in position 1 is 0.3, and the probability of C being emitted in position 2 is 0.6. The probability of ACCY along this path is

0.4 * 0.3 * 0.46 * 0.6 * 0.97 * 0.5 * 0.015 * 0.73 * 0.01 * 1 = 1.76x10^{-6}.

As in the profile case described above, the calculation is
simplified by transforming probabilities to logs so that addition can
replace multiplication. The resulting number is the *raw*
*score* of a sequence, given the HMM.

For example, the score of ACCY along the path shown in Figure 7 is

log_{e}(0.4) + log_{e}(0.3) + log_{e}(0.46) + log_{e}(0.6) + log_{e}(0.97) + log_{e}(0.5) + log_{e}(0.015) + log_{e}(0.73) + log_{e}(0.01) + log_{e}(1) = -13.25
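The same arithmetic can be sketched in code. The list below simply replays the ten probabilities from the highlighted ACCY path in Figure 7; how those numbers are looked up in the model is not shown.

```python
import math

# The ten numbers along the highlighted ACCY path in Figure 7, in the
# order they are encountered (alternating transitions and emissions).
path_probabilities = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

raw_probability = math.prod(path_probabilities)            # ~1.76e-06
raw_score = sum(math.log(p) for p in path_probabilities)   # ~-13.25
print(raw_probability, raw_score)
```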

The calculation is easy if the exact state path is known, as in
the toy example of Figure 7. In a real model,
many different state paths through a model can generate the same
sequence. Therefore, the correct probability of a sequence is the sum
of probabilities over all of the possible state paths. Unfortunately,
a brute force calculation of this problem is computationally
infeasible, except in the case of very short sequences. Two good
alternatives are to calculate the sum over all paths inductively using
the *forward algorithm*, or to calculate the most probable path
through the model using the *Viterbi algorithm*. Both algorithms
are described below.

Figure 8. HMM with multiple paths through the model for ACCY. The highlighted path is only one of several possibilities.

Consider the HMM shown in Figure 8. The Insert, Match, and Delete states are labeled with their position number in the model, M1, D1 etc. (States I1 and I2 are unlabelled to reduce clutter.) Because the number of insertion states is greater than the number of match or delete states, there is an extra insertion state at the beginning of the model, labeled I0. Unlike the HMM in Figure 7, where the state path for ACCY was known, several state paths through the model are possible for this sequence.

The most likely path through the model is computed with the Viterbi algorithm. The algorithm employs a matrix, shown in Figure 9. The columns of the matrix are indexed by the states in the model, and the rows are indexed by the sequence. Deletion states are not shown, since, by definition, they have a zero probability of emitting an amino acid. The elements of the matrix are initialized to zero and then computed with these steps:

1. The probability that the amino acid *A* was generated by state I0 is computed and entered as the first element of the matrix.

2. The probabilities that *C* is emitted in state M1 (multiplied by the probability of the most likely transition to state M1 from state I0) and in state I1 (multiplied by the most likely transition to state I1 from state I0) are entered into the matrix elements indexed by C and I1/M1.

3. The maximum probability, max(I1, M1), is calculated.

4. A pointer is set from the winner back to state I0.

5. Steps 2-4 are repeated until the matrix is filled.

Prob(A in state I0) = 0.4*0.3=0.12

Prob(C in state I1) = 0.05*0.6*0.5 = 0.015

Prob(C in state M1) = 0.46*0.01 = 0.005

Prob(C in state M2) = 0.46*0.5 = 0.23

Prob(Y in state I3) = 0.015*0.73*0.01 = 0.0001

Prob(Y in state M3) = 0.97*0.23 = 0.22

The most likely path through the model can now be found by following the back-pointers.

Figure 9. Matrix for the Viterbi algorithm
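For readers who prefer code to prose, here is a generic sketch of the Viterbi recursion, written for an HMM described by plain dictionaries. It is a simplification: the silent delete states of a profile HMM need extra bookkeeping that is omitted here, and the dictionary format is an assumption made for the example, not the layout of any particular HMM package.

```python
def viterbi(sequence, states, start_p, trans_p, emit_p):
    """Most probable state path for `sequence`.

    start_p[s]    : probability of starting in state s
    trans_p[s][t] : probability of moving from state s to state t
    emit_p[s][a]  : probability that state s emits amino acid a
    Silent (delete) states are ignored in this simplified sketch.
    """
    # One dictionary of best path probabilities per sequence position.
    v = [{s: start_p[s] * emit_p[s][sequence[0]] for s in states}]
    back = [{}]
    for i in range(1, len(sequence)):
        v.append({})
        back.append({})
        for t in states:
            # Best predecessor state for reaching t at position i.
            best_prev = max(states, key=lambda s: v[i - 1][s] * trans_p[s][t])
            v[i][t] = v[i - 1][best_prev] * trans_p[best_prev][t] * emit_p[t][sequence[i]]
            back[i][t] = best_prev
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for i in range(len(sequence) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path, v[-1][last]
```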

Once the most probable path through the model is known, the probability of a sequence given the model can be computed by multiplying all probabilities along the path.

The forward algorithm is similar to Viterbi. However, in step 3, a sum rather than a maximum is computed, and no back-pointers are necessary. The probability of the sequence is found by summing the probabilities computed for the final amino acid in the sequence. The resulting matrix is shown in Figure 10.

Prob(A in state I0) = 0.4*0.3=0.12

Prob(C in state I1) = 0.05*0.6*0.5 = 0.015

Prob(C in state M1) = 0.46*0.01= 0.005

Prob(C in state M2) = (0.005*0.97) + (0.015*0.46) = 0.012

Prob(Y in state I3) = 0.012*0.015*0.73*0.01 = 1.31x10^{-6}

Prob(Y in state M3) = 0.012*0.97*0.2 = 0.002

Figure 10. Matrix for the Forward algorithm
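A matching sketch of the forward recursion, under the same assumptions as the Viterbi sketch above (plain dictionaries for the parameters, silent delete states ignored):

```python
def forward(sequence, states, start_p, trans_p, emit_p):
    """Probability of `sequence` summed over all state paths.

    Uses the same hypothetical dictionary layout as the Viterbi sketch;
    silent delete states are again ignored for brevity.
    """
    f = [{s: start_p[s] * emit_p[s][sequence[0]] for s in states}]
    for i in range(1, len(sequence)):
        f.append({})
        for t in states:
            # Sum over all ways of arriving in state t at position i.
            f[i][t] = sum(f[i - 1][s] * trans_p[s][t] for s in states) * emit_p[t][sequence[i]]
    # Total probability: sum over all states for the final amino acid.
    return sum(f[-1].values())
```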

Once the probability of a sequence has been determined, its score can be computed. Because the model is a generalization of how amino acids are distributed in a related group (or class) of sequences, a score measures the probability that a sequence belongs to the class. A high score implies that the sequence of interest is probably a member of the class, and a low score implies it is probably not a member.

In the examples above, *global scoring* was used. This means simply that computation of the score begins at the first amino acid in the sequence and ends at the last one. Even though this may seem like the most natural way to compute a score, the results are often misleading. Because of the evolutionary variety found in related protein sequences, a family member may be composed of both highly conserved areas which score well against the model and divergent areas which score poorly. If both kinds of areas are given equal importance, the overall score of a family member may be poor.

The solution to this problem is to use *local scoring*, where the score of a sequence is set to the score of its highest scoring subsequence. The principle can be illustrated with a very simple example. Consider again the sequence ACCY shown in Figure 7. Converting probabilities to the log world, the global score for the sequence is the sum of all four scores: -13.25. The computation is shown in Figure 11.

Figure 11. The log score of ACCY

Clearly, the score has been significantly lowered by A and Y. The score is low enough that ACCY may not appear to be a member of the family being modeled.

Figure 12. The family of sequence ACCY

Let's assume ACCY is a member of the family shown in Figure 12. In this case, the global score proves to be a poor measure of family membership. However, if local scoring is used to evaluate the sequence, the final score is much higher. The highest scoring subsequence is found to be CC, with a score of -2.01. Unlike the global score, the local score is high enough to classify ACCY in this family. In situations like this, classifications based on local scoring are more accurate than those based on global scoring.
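As a rough illustration of the idea, the sketch below scores every contiguous subsequence and keeps the best one. This brute-force loop is only meant to show the definition; practical HMM software builds local scoring directly into the model topology and the dynamic programming rather than enumerating subsequences. The `score_fn` argument is assumed to return a log score for a subsequence, for example via the Viterbi sketch shown earlier.

```python
def local_score(sequence, score_fn):
    """Score of the highest-scoring contiguous subsequence (definitional sketch)."""
    best = float("-inf")
    for start in range(len(sequence)):
        for end in range(start + 1, len(sequence) + 1):
            # Score each contiguous subsequence against the model.
            best = max(best, score_fn(sequence[start:end]))
    return best
```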

Another tricky problem is how to create an HMM in the first place, given a particular set of related training sequences. The amino acid emission distribution in each state and all of the state-to-state transition probabilities must be estimated from the training set.

If the state paths for all the training sequences are known, the emission and transition probabilities in the model can be calculated by computing their *expected values*: counting the number of times each transition or emission occurs in the training set and dividing by the total number of transitions or emissions observed in that state.
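When the paths are known, the estimate is simple counting. A minimal sketch, assuming each training sequence is supplied as a hypothetical list of (state, amino acid) pairs:

```python
from collections import Counter, defaultdict

def estimate_parameters(annotated_paths):
    """Estimate transition and emission probabilities from known state paths.

    `annotated_paths` is a hypothetical list of paths, one per training
    sequence; each path is a list of (state, amino_acid) pairs.
    """
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    for path in annotated_paths:
        for state, amino_acid in path:
            emission_counts[state][amino_acid] += 1
        for (s, _), (t, _) in zip(path, path[1:]):
            transition_counts[s][t] += 1
    # Divide each count by the total number of events observed in that state.
    transitions = {s: {t: n / sum(c.values()) for t, n in c.items()}
                   for s, c in transition_counts.items()}
    emissions = {s: {a: n / sum(c.values()) for a, n in c.items()}
                 for s, c in emission_counts.items()}
    return transitions, emissions
```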

If the state paths are unknown, finding the best model given the training set is an optimization problem which has no closed form solution. It must be solved by iterative methods.

The algorithms used to do this are closely related to the scoring algorithms described previously. The goal is to find model parameters which maximize the probability of all sequences in the training set. In other words, the desired model is a model against which all the sequences in the training set will have the best possible scores. The parameters are re-estimated after every iteration by computing a score for each training sequence against the previous set of model parameters.

The *Baum-Welch* algorithm is a variation of the forward algorithm described earlier. It begins with a reasonable guess for an initial model and then calculates a score for each sequence in the training set over all possible paths through this model. During the next iteration, a new set of expected emission and transition probabilities is calculated, as described above for the case when state paths are known. The updated parameters replace those in the initial model, and the training sequences are scored against the new model. The process is repeated until *model convergence*, meaning there is very little change in parameters between iterations.

The *Viterbi *algorithm is less computationally expensive
than Baum-Welch. As described in the previous section, it computes
the sequence scores over the most likely path rather than over the sum
of all paths.

However, there is no guarantee that a model built with either
algorithm has parameters which maximize the probability of the
training set. As in many iterative methods, convergence indicates
only that a local maximum has been found. Several heuristic methods
have been developed to deal with this problem. One approach is to
start with several initial models and proceed to build several models
in parallel. When the models converge at several different local optima, the probability of each model given the training set is
computed, and the model with the highest probability wins. Another
approach is to add *noise*, or random data, into the mix at each
iteration of the model building process. Typically, an *annealing
schedule* is used. The schedule controls the amount of noise added
during each iteration. Less and less noise is added as iterations
proceed. The decrease is either linear or exponential. The effect is
to delay the convergence of the model. When the model finally does
converge, it is more likely to have found a good approximation to the
global maximum [3].
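For illustration, here are two toy noise schedules of the kind described above; the starting level and decay rate are arbitrary choices, not values taken from the literature:

```python
def linear_noise(iteration, start=1.0, total_iterations=100):
    """Noise level that decreases linearly to zero over the run."""
    return max(0.0, start * (1.0 - iteration / total_iterations))

def exponential_noise(iteration, start=1.0, decay=0.95):
    """Noise level that decays exponentially with each iteration."""
    return start * decay ** iteration
```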

Another problem with HMMs is that if there is a small group of
sequences in the training set which are highly similar, the model will
overspecialize to the small group. To prevent this, several methods
of *sequence weighting* have been developed. These methods give
the *outlier* sequences, those that do not belong to the highly
similar group, additional importance in the calculation of model
parameters.

The simplest weighting methods are based on tree structures. Sequences in a family are placed on branches of a tree, representing their divergence from a common ancestor. One interesting approach is to visualize a tree made of conducting wire with a voltage V applied to the root. The leaves are set to V=0, and the currents flowing through each branch are calculated using Kirchhoff's laws. The currents are then used as the sequence weights. Intuitively, the currents will be smaller in a more highly divided area of the tree (the highly similar sub-group), and larger in less divided areas of the tree (the outliers) [4].

Figure 13. Calculating tree-based weights

Figure 13 demonstrates this approach. According to Kirchhoff's laws, the following equations can be derived:

I0 = I1 + I2,  I1 = I2

I2 = I3 + I4,  I3 = I4

I3 = I5 + I6,  I5 = I6

I4 = I7 + I8,  I7 = I8

With some algebraic manipulation, it can be shown that:

I1 = I2 = 0.5 * I0

I3 = I4 = 0.25 * I0

I5 = I6 = I7 = I8 = 0.125 * I0

The corresponding sequence weights are shown in Figure 14.

Figure 14. Tree-based sequence weights
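A small sketch of the current-division idea, under the simplifying assumption used in this example that the current entering a node splits equally among its children (a real implementation would divide the current according to the branch resistances):

```python
def tree_weights(tree, current=1.0):
    """Leaf weights from recursive current division.

    `tree` is a hypothetical nested structure: a string names a leaf
    (a sequence), a list groups subtrees under one internal node. The
    current entering a node is assumed to split equally among its
    children, as in the example above.
    """
    if isinstance(tree, str):
        return {tree: current}
    weights = {}
    for subtree in tree:
        weights.update(tree_weights(subtree, current / len(tree)))
    return weights

# A tree shaped like Figure 13: one outlier and a highly similar subgroup.
example = ["outlier", [["seq1", "seq2"], ["seq3", "seq4"]]]
print(tree_weights(example))
# {'outlier': 0.5, 'seq1': 0.125, 'seq2': 0.125, 'seq3': 0.125, 'seq4': 0.125}
```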

Another approach is known as *maximum discrimination* weighting [5]. In this method, weights are
estimated iteratively while the model is being built. After each
iteration in the model building process, the sequences in the training
set are scored against the current model. Weights are assigned to
each sequence, with poorly scoring sequences (outliers) receiving the
highest valued weights. During the next iteration, these sequences
get more importance than sequences which had good scores during the
prior round. The process repeats until the model converges.

A third kind of weighting method is based on trying to make the statistical spread in the model as uniform as possible [6]. The position-specific weighting method developed by Henikoff & Henikoff falls into this category [7]. In this method, weights are based on the diversity observed in the columns of a multiple alignment of sequences.

A weight is computed for each position in a sequence that is inversely proportional to *m*, the number of different amino acids in the column, and *k*, the number of times the amino acid of interest appears in the column; that is, the position weight is 1/(m*k).

For example, looking at the first sequence in the multiple alignment shown in Figure 3, the weight of amino acid C in position 1 is

1/(1*4) = 0.25,

and the weight of amino acid N in position 7 is

1/(3*1) = 0.33.

The weight of a sequence is the average of the weights in all positions, normalized to sum to 1.
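A minimal sketch of this position-based scheme applied to the alignment of Figure 3. The 1/(m*k) rule and the final averaging follow the description above; skipping gap characters is an assumption, one of several possible conventions:

```python
def position_based_weights(alignment):
    """Henikoff-style position-based sequence weights (simplified sketch).

    Each position gets weight 1/(m*k): m is the number of distinct amino
    acids in the column, k the number of occurrences of this sequence's
    amino acid. A sequence's weight is the average of its position
    weights, normalized so the weights sum to 1. Gaps ('-') are skipped.
    """
    n = len(alignment)
    sums = [0.0] * n
    counts = [0] * n
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        residues = [r for r in column if r != "-"]
        m = len(set(residues))
        for i, r in enumerate(column):
            if r == "-":
                continue
            sums[i] += 1.0 / (m * residues.count(r))
            counts[i] += 1
    averages = [s / c for s, c in zip(sums, counts)]
    total = sum(averages)
    return [a / total for a in averages]

alignment = [
    "CGGSLLNAN--TVLTAAHC",
    "CGGSLIDNK-GWILTAAHC",
    "CGGSLIRQG--WVMTAAHC",
    "CGGSLIREDSSFVLTAAHC",
]
print(position_based_weights(alignment))
```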

It is not possible to identify a single, best weighting method. Choice of a weighting method depends both on what the resulting model will be used for and the particular protein group being modeled.

Overfitting and Regularization

Clearly, accurate methods of estimating amino acid distributions are necessary to build good HMMs. A potential pitfall is illustrated by the example shown in Figure 3. Consider these four sequences to be members of a training set. The first column shown contains one distinct amino acid: C. Using the methods described so far, the probability of any of the other 19 amino acids appearing in this position is 0. However, if we set all these probabilities to 0 in the model, we end up with a model unable to recognize family members which do not begin with a C. Remember that members of the training set are only a small percentage of real members of the protein family. At least some of the other members are likely to begin with different amino acids.

This problem is known as *overfitting*. A variety of approaches known as *regularization* have been developed to deal with it. The simplest is to use *pseudocounts*: this means that,
even if a given amino acid does not appear in a column of an aligned
training set, it is given a fake count. Fake counts are also added
for the amino acids which appear in the column. When probabilities
are calculated, the fake counts are treated exactly like real observed
counts. Take a look at the first column of Figure
3, which has 4 counts of the amino acid C. If a pseudocount of 1
is used for each of the 20 amino acids, in effect, the column contains
24 entries, of which 4 are real and 20 are fake. The probability of
A, which does not appear in the column, is calculated as shown in Figure 15.

Figure 15. The probability of A in column 1 of Figure 3 using pseudocounts

The probability of C in position 1 is shown in Figure 16.

Figure 16. The probability of C in column 1 of Figure 3 using pseudocounts
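The pseudocount arithmetic for a single column can be sketched in a few lines, here with a pseudocount of 1 per amino acid as in the example above:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(column, pseudocount=1):
    """Per-amino-acid probabilities for one alignment column with pseudocounts."""
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {a: (column.count(a) + pseudocount) / total for a in AMINO_ACIDS}

probs = column_probabilities("CCCC")   # column 1 of Figure 3
print(probs["A"])  # (0 + 1) / 24 ~ 0.04
print(probs["C"])  # (4 + 1) / 24 ~ 0.21
```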

A sophisticated application of this method is known as *Dirichlet mixtures* [8]. The mixtures are created by statistical analysis of the distribution of amino acids at particular positions in a large number of proteins. The mixtures are built from smaller components known as *Dirichlet densities*.

A Dirichlet density is a probability density over all possible distributions of the twenty amino acids at a given position.
It gives high probability to certain distributions and low probability
to others. For example, a particular Dirichlet density may give high probability to *conserved* distributions, in which a single amino acid predominates over all others. Another possibility is a density which gives high probability to amino acids that share a common identifying feature, such as the subgroup of *hydrophobic* amino acids.

When an HMM is built using a Dirichlet mixture, a wealth of information about protein structure is factored into the parameter estimation process. The pseudocounts for each amino acid are calculated from a weighted sum of Dirichlet densities and added to the observed amino acid counts from the training set. The parameters of the model are calculated as described above for simple pseudocounts.
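To show the flavor of the calculation without the full machinery, here is a sketch for the simplest case of a single Dirichlet density: its parameters act as position-specific pseudocounts added to the observed counts. A true Dirichlet mixture would first combine several densities, weighted by how well each explains the observed column; that step is omitted here.

```python
def dirichlet_probabilities(counts, alpha):
    """Amino acid probabilities from observed counts plus Dirichlet pseudocounts.

    `counts` maps amino acids to observed counts in a column; `alpha` maps
    amino acids to the parameters of a single Dirichlet density. With a
    full Dirichlet mixture the alphas would be a weighted combination of
    several densities, which is omitted in this sketch.
    """
    amino_acids = set(counts) | set(alpha)
    total = sum(counts.values()) + sum(alpha.values())
    return {a: (counts.get(a, 0) + alpha.get(a, 0)) / total for a in amino_acids}
```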

Classifying Sequences with an HMM

One common application of HMMs is classifying sequences in a database. The method is to build an HMM with a training set of known members of the class, and then to compute the score of all sequences in the database. The resulting scores are ranked and a suitable threshold is selected to separate class members from the other sequences in the database.

Unfortunately, because these raw scores are length-dependent,
they must be modified before the ranking can occur. Various
techniques have been developed to work around this shortcoming [9]. One approach is to calculate the difference in
standard deviations between the raw score of a sequence and the mean
score of sequences of similar length in the database. A more
sophisticated method scores a sequence by a *log-odds ratio*,
shown in Figure 17.

Figure 17. The log-odds ratio

This number is the log of the ratio between two probabilities: the probability that the sequence was generated by the HMM and the probability that the sequence was generated by a *null model*, whose parameters reflect the general amino acid distribution in the training sequences. Barrett, Hughey, and Karplus have shown that a good choice of null model is the geometric mean of the amino acid distributions in the match states [9].
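A minimal sketch of the log-odds calculation, assuming the model probability of the sequence has already been computed (for example with the forward sketch shown earlier). The background-frequency null model below is only an illustrative stand-in, not the null model of [9]:

```python
import math

def background_probability(sequence, background):
    """Probability of a sequence under a simple background-frequency null model."""
    p = 1.0
    for amino_acid in sequence:
        p *= background[amino_acid]
    return p

def log_odds_score(p_model, p_null):
    """Log of the ratio P(sequence | HMM) / P(sequence | null model)."""
    return math.log(p_model / p_null)
```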

Another important issue is the selection of a suitable threshold separating family and non-family members. One good method uses Milosavljevic's algorithmic significance test [10], setting the threshold to

log_{z}(N/sigma),

where N is the number of sequences in the database being searched,
and z is the logarithm base, usually set to the values **2** or
**e**. The variable *sigma* is a *significance* level
which varies from .01 to 10. Choice of a large sigma increases the
chance of *false positives*. These are non-family members which
exceed the threshold because they score well against the HMM out of
pure chance. A small sigma sets the threshold at a level where fewer
sequences will be classified as family members. If sigma is too
small, the chance of *false negatives* is increased. In this
case, sequences which are members of the family may not exceed the
threshold and may be misclassified.

Another important use of HMMs is to automatically create a
multiple alignment from a group of unaligned sequences. Multiple
alignment is the process of taking a group of sequences and
identifying amino acids which are *homologous*, structurally or
functionally similar. The homologous amino acids are aligned in
columns (see Figure 3). Until recently, this
tedious task was done by hand and required a biology expert with
extensive knowledge of protein structure and evolution.

A multiple alignment can be generated by using the Viterbi algorithm to find the most likely path through the HMM for each sequence. Each match state in the HMM corresponds to a column in the multiple alignment. A delete state is represented by a dash. Amino acids from insert states are either not shown or are displayed in lower case letters.
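A small sketch of this bookkeeping. The path format (tuples of state type, match-state index, and amino acid) is an assumption made for the example, and insert-state amino acids are simply dropped rather than shown in lower case:

```python
def paths_to_alignment(paths, num_match_states):
    """Build a crude multiple alignment from per-sequence state paths.

    `paths` is a hypothetical list of paths, one per sequence; each path
    is a list of (state_type, match_index, amino_acid) tuples, where
    state_type is "M", "I", or "D". Match states become columns, delete
    states remain dashes, and insert-state amino acids are dropped here.
    """
    rows = []
    for path in paths:
        row = ["-"] * num_match_states         # delete states stay as dashes
        for state_type, match_index, amino_acid in path:
            if state_type == "M":
                row[match_index] = amino_acid  # match states fill columns
        rows.append("".join(row))
    return rows
```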

An alternative to Viterbi is a method known as

This paper has explored many exciting features of HMMs. However, these models do have a few limitations.

The HMM is a linear model and is unable to capture higher order
correlations among amino acids in a protein molecule. These
correlations include hydrogen bonds between non-adjacent amino acids
in a polypeptide chain, hydrogen bonds created between amino acids in
multiple chains, and *disulfide bridges*, chemical bonds between
C (cysteine) amino acids which are distant from each other within the
molecule.

In reality, amino acids which are far apart in the linear chain may be physically close to each other when a protein folds. Chemical and electrical interactions between them cannot be predicted with a linear model.

Another flaw of HMMs lies at the very heart of the mathematical theory behind these models. As anyone who has taken a course in stochastic processes knows, the probability of a sequence of events is the product of the probabilities of the individual events only when the events are independent. Throughout this paper, I have claimed that the probability of a protein sequence can be found by multiplying the probabilities of the amino acids in the sequence. This claim is only valid if the probability of any amino acid in the sequence is independent of the probabilities of its neighbors.

In biology, this is not the case. There are, in fact, strong dependencies between these probabilities. For example, hydrophobic amino acids are highly likely to appear in proximity to each other. Because such molecules fear water, they cluster at the inside of a protein, rather than at the surface where they would be forced to encounter water molecules.

These biological realities have motivated research into new kinds of statistical models. Hybrids of HMMs and neural nets, dynamic Bayesian nets, factorial HMMs, Boltzmann trees, and hidden Markov random fields are among the areas being explored [6].

A good way to learn more about HMMs is to try using them. To start, check out the URLs listed in Figure 18.

Figure 18. URLs for HMM web servers

You can find more information about HMMs and protein research at the URLs shown in Figure 19.

Figure 19. URLs for HMM and protein information

I also highly recommend the newly released book by Durbin, Eddy, Krogh, and Mitchison, *Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids* [6], which you can order online at http://www.amazon.com.

References

[1] J. Olmsted and G. Williams. *Chemistry: The Molecular Science,* pp. 549-557. William C. Brown. 1997.

[2] M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. *Proceedings of the National Academy of Sciences of
the USA,* 84:4355-4358, 1987.

[3] R. Hughey and A. Krogh. Hidden Markov models for sequence analysis: extension and analysis of the basic method. *Computer Applications in the Biosciences,* 12:95-107. 1996.

[4] J.D. Thompson, D.G. Higgins, and T.J. Gibson. Improved sensitivity of profile searches through the use of sequence weights and gap excision. *Computer Applications in the Biosciences,* 10:19-29. 1994.

[5] S.R. Eddy, G. Mitchison, and R. Durbin. Maximum discrimination hidden Markov models of sequence consensus. *Journal of Computational Biology, *2:9-23. 1995.

[6] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. *Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids*. pp. 51-68. Cambridge University Press. 1998.

[7] S. Henikoff and J.G. Henikoff. Position-based sequence weights. *Journal of Molecular Biology, *243:574-578. 1994.

[8] K. Sjolander, K. Karplus, M. Brown, R. Hughey, A. Krogh, S.I. Mian, and D. Haussler. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. *Computer Applications in the Biosciences, *12:327-345. 1996.

[9] C. Barrett, R. Hughey, and K. Karplus. Scoring hidden Markov models. *Computer Applications in the Biosciences,* 13(2):191-199. 1997.

[10] A. Milosavljevic and J. Jurka. Discovering simple DNA sequences by the algorithmic similarity method. *Computer Applications in the Biosciences,* 9(4):407-411. 1993.

Appendix A. Twenty Amino Acids Commonly Found in Proteins

A | ALA | Alanine |

V | VAL | Valine |

L | LEU | Leucine |

I | ILE | Isoleucine |

F | PHE | Phenylalanine |

P | PRO | Proline |

M | MET | Methionine |

D | ASP | Aspartic Acid |

E | GLU | Glutamic Acid |

K | LYS | Lysine |

R | ARG | Arginine |

S | SER | Serine |

T | THR | Threonine |

C | CYS | Cysteine |

N | ASN | Asparagine |

Q | GLN | Glutamine |

H | HIS | Histidine |

Y | TYR | Tyrosine |

W | TRP | Tryptophan |

G | GLY | Glycine |