This section briefly describes the structure and basic theory of the type of hidden Markov model used in this work. For a general introduction to hidden Markov models, refer to [Rabiner, 1989]. For additional details on our hidden Markov models, refer to [Krogh et al., 1994a].
The basic model topology is shown in Figure 1. Each position, or module, in the model has three states. A state shown as a rectangular box is a match state that models the distribution of letters in the corresponding column of an alignment. A state shown by a diamond-shaped box models insertions of random letters between two alignment positions, and a state shown by a circle models a deletion, corresponding to a gap in an alignment. States of neighboring positions are connected, as shown by lines. For each of these lines there is an associated `transition probability', which is the probability of going from one state to the other.
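As a concrete illustration, one module of such a model can be represented as a small data structure holding the two emission distributions and the outgoing transition probabilities. This is a minimal sketch with made-up numbers; the class and field names are ours, not part of the software described here:

```python
# One position (module) of the profile-style HMM of Figure 1.
# Each module has a match state (emits a letter from a position-specific
# distribution), an insert state (emits letters between alignment columns),
# and a silent delete state (emits nothing, modelling a gap).
from dataclasses import dataclass


@dataclass
class Module:
    match_emissions: dict   # letter -> probability for the match state
    insert_emissions: dict  # letter -> probability for the insert state
    # Transition probabilities out of each state of this module, keyed as
    # (from_state, to_state); e.g. ("m", "d") is the probability of going
    # from this match state to the delete state of the next module.
    transitions: dict


mod = Module(
    match_emissions={"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    insert_emissions={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    transitions={("m", "m"): 0.9, ("m", "i"): 0.05, ("m", "d"): 0.05,
                 ("i", "m"): 0.8, ("i", "i"): 0.15, ("i", "d"): 0.05,
                 ("d", "m"): 0.8, ("d", "i"): 0.05, ("d", "d"): 0.15},
)

# The transition probabilities out of each state must sum to one.
for s in ("m", "i", "d"):
    total = sum(p for (a, _), p in mod.transitions.items() if a == s)
    print(s, round(total, 10))
```

The self-transition ("i", "i") is what allows an insert state to emit a run of letters between two alignment columns.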
This model topology was chosen to provide insertions and deletions similar to those of biological sequence alignment methods based on affine gap penalties. Other model topologies can certainly be considered, see for example [Baldi et al., 1994, Krogh et al., 1994b], but for reasons of efficiency the software described here is limited to the topology shown in Figure 1. Gibbs sampling is an alternative approach that does not allow arbitrary gaps within aligned blocks [Lawrence et al., 1993].
Aligning a sequence to a model means associating each letter in the sequence with a match or insert state in the model. Figure 2 shows two 5-character sequences, A and B, aligned to a 4-state model, along with the corresponding alignment between the sequences.
One can specify such an alignment by giving the corresponding sequence of states, with the restriction that the transition lines in the figure must be followed. For example, matching a letter to the first match state ($m_1$) and the next letter to the third match state ($m_3$) can only be done by using the intermediate delete state ($d_2$), so that part of the alignment can be written as $m_1\, d_2\, m_3$. In HMM terminology such an alignment of a sequence to a model is called a path through the model. A sequence can be aligned to a model in many different ways, just as in sequence-to-sequence or sequence-to-profile alignments. Each alignment of a given sequence to the model can be scored using the probability parameters of the model.
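The restriction that a path must follow the transition lines can be checked mechanically. The sketch below encodes the Figure 1 topology as an adjacency rule (state names and the tuple representation are our own) and verifies that the example path fragment through the intermediate delete state is legal while a direct jump is not:

```python
# States are (kind, position) pairs with kinds "m" (match), "i" (insert),
# "d" (delete), following the topology of Figure 1.
def allowed(a, b):
    """Is there a direct transition line from state a to state b?"""
    (_, ja), (kb, jb) = a, b
    if kb == "i":
        # The insert state at position j is reachable from any state at j,
        # including itself (insertions of several letters in a row).
        return jb == ja
    # Match and delete states are reachable only from the previous position.
    return jb == ja + 1


def valid_path(path):
    """Check every consecutive pair of states against the topology."""
    return all(allowed(a, b) for a, b in zip(path, path[1:]))


# The example from the text: m1 -> d2 -> m3 is a legal path fragment...
print(valid_path([("m", 1), ("d", 2), ("m", 3)]))   # True
# ...whereas jumping straight from m1 to m3 is not.
print(valid_path([("m", 1), ("m", 3)]))             # False
```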
In the following, a sequence of L letters is denoted $x_1 x_2 \ldots x_L$. The path specifying a particular alignment is denoted $\pi = \pi_1 \pi_2 \ldots \pi_L$, where $\pi_k$ is the $k$th state in the path. To simplify notation, as compared to [Krogh et al., 1994a], only the match and insert states are included in the path; for two states not directly connected, delete states are filled in as needed. Because the first state in a path is always the begin state, B, and the last one is always the end state, E, these two trivial states are also not included in the state sequence. The number of match positions in the model is called M, and is referred to as the length of the model. The probability of an alignment of a sequence to a model is the product of all the probabilities met in the path, written as

$$\mathrm{Prob}(x_1 \ldots x_L, \pi) = \mathcal{T}(\pi_1 \mid B)\, \prod_{k=1}^{L} \mathcal{P}(x_k \mid \pi_k)\, \mathcal{T}(\pi_{k+1} \mid \pi_k), \qquad (1)$$
where $\mathcal{T}(\pi_{k+1} \mid \pi_k)$ is the probability associated with the transition between states $\pi_k$ and $\pi_{k+1}$, with $\pi_{L+1}$ taken to be the end state E. The probability of matching letter $x$ to state $q$ is called $\mathcal{P}(x \mid q)$. If two states in the path are not directly connected by a transition, the transition probability has to be calculated by filling in the appropriate delete states. For example, the probability of a transition from the match state at position 1 ($m_1$) to the insert state at position 4 ($i_4$), passing through delete states $d_2$, $d_3$, and $d_4$, is

$$\mathcal{T}(i_4 \mid m_1) = \mathcal{T}(d_2 \mid m_1)\, \mathcal{T}(d_3 \mid d_2)\, \mathcal{T}(d_4 \mid d_3)\, \mathcal{T}(i_4 \mid d_4).$$
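The delete-state fill-in is just a product of the transition probabilities along the chain of silent states. A short sketch, using a toy transition table with made-up probabilities (the function name and table layout are ours):

```python
# Toy transition table T[(from_state, to_state)]; states are (kind, position)
# pairs, "B" is the begin state.  The probabilities are made up.
T = {
    ("B", ("m", 1)): 0.9, ("B", ("d", 1)): 0.1,
    (("m", 1), ("d", 2)): 0.1, (("d", 2), ("d", 3)): 0.5,
    (("d", 3), ("d", 4)): 0.5, (("d", 4), ("i", 4)): 0.2,
}


def transition(a, b, deletes):
    """Transition probability from state a to state b, multiplying through
    the intermediate delete states in `deletes` (possibly empty)."""
    prob, prev = 1.0, a
    for step in deletes + [b]:
        prob *= T[(prev, step)]
        prev = step
    return prob


# The example from the text: the transition from m1 to i4 passes
# through the delete states d2, d3 and d4.
t = transition(("m", 1), ("i", 4), [("d", 2), ("d", 3), ("d", 4)])
print(t)  # 0.1 * 0.5 * 0.5 * 0.2 = 0.005
```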
By taking the negative logarithm of equation (1), it can be turned into a more familiar type of alignment `cost':

$$-\log \mathrm{Prob}(x_1 \ldots x_L, \pi) = -\log \mathcal{T}(\pi_1 \mid B) - \sum_{k=1}^{L} \left[ \log \mathcal{P}(x_k \mid \pi_k) + \log \mathcal{T}(\pi_{k+1} \mid \pi_k) \right]. \qquad (2)$$
Now the numbers $-\log \mathcal{T}$ and $-\log \mathcal{P}$ can be interpreted as `penalties'. A term like $-\log \mathcal{T}(d_{i+1} \mid m_i)$ is a positive penalty for initiating a deletion after position $i$, and $-\log \mathcal{T}(d_{i+1} \mid d_i)$ is the penalty for continuing a deletion at position $i$. Some of the other terms have similar interpretations.
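The correspondence between probabilities and additive penalties is just a change of scale: multiplying probabilities along a path is the same as adding their negative logarithms. With illustrative numbers (not taken from any real model):

```python
import math

# Suppose initiating a deletion has probability 0.05 and continuing it
# has probability 0.5.  Their negative logs are the familiar gap-open
# and gap-extension penalties.
t_init, t_extend = 0.05, 0.5
open_penalty = -math.log(t_init)     # large penalty to start a gap
extend_penalty = -math.log(t_extend) # smaller penalty per extra position
print(round(open_penalty, 3), round(extend_penalty, 3))

# The product of probabilities along a path corresponds to the sum of
# penalties, which is what turns equation (1) into a cost.
probs = [t_init, t_extend, t_extend]
assert math.isclose(-math.log(math.prod(probs)),
                    sum(-math.log(p) for p in probs))
```

A large opening penalty followed by smaller extension penalties is exactly the affine-gap behaviour that motivated the choice of topology.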