This section briefly describes the structure and basic theory of
the type of hidden Markov model used in this work. For a general
introduction to hidden Markov models, refer to [Rabiner, 1989]. For
additional details on our hidden Markov models, refer to
[Krogh *et al.*, 1994a].

The basic model topology is shown in Figure 1. Each position, or module, in the model has three states. A state shown as a rectangular box is a match state that models the distribution of letters in the corresponding column of an alignment. A state shown as a diamond models insertions of random letters between two alignment positions, and a state shown as a circle models a deletion, corresponding to a gap in an alignment. States of neighboring positions are connected, as shown by lines. For each of these lines there is an associated 'transition probability', which is the probability of going from one state to the other.

This model topology was chosen so as to feature insertions and
deletions similar to those of biological sequence alignment techniques
based on affine gap penalties. Other model topologies can certainly be
considered (see, for example, [Baldi *et al.*, 1994, Krogh *et al.*, 1994b]), but for
reasons of efficiency the software described here is limited to the
topology shown in Figure 1. Gibbs sampling is an
alternative approach that does not allow arbitrary gaps within aligned
blocks [Lawrence *et al.*, 1993].

Alignment of a sequence to a model means that each letter in the
sequence is associated with a match or insert state in the model. Two
5-character sequences, *A* and *B*, are shown in a 4-state model in
Figure 2, along with the corresponding alignment
between the sequences.

One can specify such an alignment by giving the corresponding sequence of states, with the restriction that the transition lines in the figure must be followed. For example, matching a letter to the first match state ($m_1$) and the next letter to the third match state ($m_3$) can only be done by using the intermediate delete state ($d_2$), so that part of the alignment can be written as $m_1 d_2 m_3$. In HMM terminology, such an alignment of a sequence to a model is called a path through the model. A sequence can be aligned to a model in many different ways, just as in sequence-to-sequence or sequence-to-profile alignments. Each alignment of a given sequence to the model can be scored by using the probability parameters of the model.
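The convention of filling in delete states between non-adjacent path states can be sketched mechanically; the helper below is a hypothetical illustration (state names `m`/`i`/`d` plus module index follow the text, but the function is not part of the described software):

```python
# Expand a path given only by its match/insert states into the full
# state sequence, filling in the delete states between states whose
# module positions are not adjacent.

def fill_deletes(path):
    """Insert d-states between consecutive path states that skip modules."""
    full = [path[0]]
    for prev, nxt in zip(path, path[1:]):
        p, n = int(prev[1:]), int(nxt[1:])
        # An insert state i_n is entered from d_n, so the delete chain
        # runs one module further than for a match-state target m_n.
        stop = n + 1 if nxt[0] == "i" else n
        full.extend(f"d{j}" for j in range(p + 1, stop))
        full.append(nxt)
    return full

print(fill_deletes(["m1", "m3"]))  # ['m1', 'd2', 'm3']
print(fill_deletes(["m1", "i4"]))  # ['m1', 'd2', 'd3', 'd4', 'i4']
```

The first call reproduces the $m_1 d_2 m_3$ fragment discussed above.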

In the following, a sequence of *L* letters is denoted $x = x_1 x_2 \ldots x_L$. The path specifying the particular alignment is denoted
$\pi = \pi_1 \pi_2 \ldots \pi_L$, where $\pi_k$ is the *k*th state in the path. To
simplify notation, as compared to [Krogh *et al.*, 1994a], only the match
and insert states are included in the path; for two states not
directly connected, delete states are filled in as needed. Because the
first state in a path is always the begin state $m_0$, and the last
one is always the end state $m_{M+1}$, these two trivial states are
also not included in the state sequence. The number of match
positions in the model is called *M*, and is referred to as the length
of the model. The probability of an alignment of a sequence to a model
is the product of all the probabilities met in the path, written as

$$\mathrm{Prob}(x, \pi) = T(\pi_1|\mathrm{begin}) \left[ \prod_{k=1}^{L-1} P(x_k|\pi_k)\, T(\pi_{k+1}|\pi_k) \right] P(x_L|\pi_L)\, T(\mathrm{end}|\pi_L), \qquad (1)$$
where $T(q'|q)$ is the probability associated with the
transition from state $q$ to state $q'$. The probability of
matching letter *x* to state *q* is called $P(x|q)$. If two
states in the path are not directly connected by a transition, the
transition probability has to be calculated by filling in the
appropriate delete states. For example, the probability of a transition
from the match state at position 1 ($m_1$) to the insert state at
position 4 ($i_4$), passing through delete states $d_2$, $d_3$, and
$d_4$, is

$$T(i_4|m_1) = T(d_2|m_1)\, T(d_3|d_2)\, T(d_4|d_3)\, T(i_4|d_4). \qquad (2)$$
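This chaining of transition probabilities through the intermediate delete states can be sketched directly; the state names follow the text, while the probability values below are hypothetical:

```python
# Effective transition probability between two states that are not
# directly connected, computed by multiplying the probabilities along
# the delete-state chain m1 -> d2 -> d3 -> d4 -> i4.
# All numeric values are hypothetical placeholders.
T = {
    ("m1", "d2"): 0.05,
    ("d2", "d3"): 0.10,
    ("d3", "d4"): 0.10,
    ("d4", "i4"): 0.05,
}

def chained_transition(states):
    """Product of the transition probabilities along a chain of states."""
    prob = 1.0
    for q, q_next in zip(states, states[1:]):
        prob *= T[(q, q_next)]
    return prob

p = chained_transition(["m1", "d2", "d3", "d4", "i4"])
# p is the product 0.05 * 0.10 * 0.10 * 0.05, i.e. about 2.5e-05
```

Because each factor is at most 1, long delete chains quickly become very improbable, which is the model's analogue of a growing gap penalty.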
By taking the negative logarithm of equation (1), it can be turned into a more familiar type of alignment 'cost':

$$-\log \mathrm{Prob}(x, \pi) = -\log T(\pi_1|\mathrm{begin}) - \sum_{k=1}^{L-1} \left[ \log P(x_k|\pi_k) + \log T(\pi_{k+1}|\pi_k) \right] - \log P(x_L|\pi_L) - \log T(\mathrm{end}|\pi_L). \qquad (3)$$

Now the numbers $-\log T$ and $-\log P$ can be interpreted as 'penalties'. A term like
$-\log T(d_{i+1}|m_i)$ is a positive penalty for initiating a
deletion after position *i*, and $-\log T(d_{i+1}|d_i)$ is the
penalty for continuing a deletion at position *i*. Some of the other
terms have similar interpretations.
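The correspondence between multiplied probabilities and added penalties can be sketched numerically; the probability values below are hypothetical:

```python
import math

# Turning hypothetical transition probabilities into additive penalties
# via the negative logarithm.
T = {
    ("m1", "d2"): 0.05,  # initiating a deletion after position 1
    ("d2", "d3"): 0.10,  # continuing the deletion at position 2
}

penalty = {pair: -math.log(p) for pair, p in T.items()}

# Multiplying probabilities along a path is equivalent to
# summing the corresponding penalties:
prob_product = T[("m1", "d2")] * T[("d2", "d3")]
penalty_sum = penalty[("m1", "d2")] + penalty[("d2", "d3")]
assert math.isclose(-math.log(prob_product), penalty_sum)
```

Since every probability is at most 1, each penalty is non-negative, and the best-scoring path is the one with the smallest total penalty.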
