next up previous
Next: Estimation of the model Up: Algorithm Previous: Algorithm

The hidden Markov model


This section briefly describes the structure and basic theory of the type of hidden Markov model used in this work. For a general introduction to hidden Markov models, refer to [Rabiner, 1989]. For additional details on our hidden Markov models, refer to [Krogh et al., 1994a].


Figure 1: A linear hidden Markov model.  

The basic model topology is shown in Figure 1. Each position, or module, in the model has three states. A state shown as a rectangular box is a match state that models the distribution of letters in the corresponding column of an alignment. A state shown by a diamond-shaped box models insertions of random letters between two alignment positions, and a state shown by a circle models a deletion, corresponding to a gap in an alignment. States of neighboring positions are connected, as shown by lines. For each of these lines there is an associated `transition probability', which is the probability of going from one state to the other.

This model topology was chosen so as to feature insertions and deletions similar to biological sequence alignment techniques based on affine gap penalties. Other model topologies can certainly be considered, see for example [Baldi et al., 1994, Krogh et al., 1994b], but for reasons of efficiency the software described here is limited to the topology shown in Figure 1. Gibbs sampling is an alternate approach that does not allow arbitrary gaps within aligned blocks [Lawrence et al., 1993].


Figure 2: An example of two sequences whose characters are matches to states in an HMM, and the corresponding alignment.

Alignment of a sequence to a model means that each letter in the sequence is associated with a match or insert state in the model. Two 5-character sequences, A and B, are shown in a 4-state model in Figure 2, along with the corresponding alignment between the sequences.

One can specify such an alignment by giving the corresponding sequence of states with the restriction that the transition lines in the figure must be followed. For example, to match a letter to the first match state ( tex2html_wrap_inline1590 ) and the next letter to the third match state ( tex2html_wrap_inline617 ) can only be done by using the intermediate delete state ( tex2html_wrap_inline1594 ), so that part of the alignment can be written as tex2html_wrap_inline1596 . In HMM terminology such an alignment of a sequence to a model is called a path through the model. A sequence can be aligned to a model in many different ways, just as in sequence-to-sequence alignments, or sequence-to-profile alignments. Each alignment of a given sequence to the model can be scored by using the probability parameters of the model.

In the following, a sequence of L letters is denoted tex2html_wrap_inline1600 . The path specifying the particular alignment is denoted tex2html_wrap_inline1602 , where tex2html_wrap_inline1604 is the kth state in the path. To simplify notation, as compared to [Krogh et al., 1994a], only the match and insert states are included in the path; for two states not directly connected, delete states are filled in as needed. Because the first state in a path is always the begin state tex2html_wrap_inline1608 , and the last one is always the end state tex2html_wrap_inline1610 , these two trivial states are also not included in the state sequence. The number of match positions in the model is called M, and is referred to as the length of the model. The probability of an alignment of a sequence to a model is the product of all the probabilities met in the path, written as


where tex2html_wrap_inline1614 is the probability associated with the transition between states tex2html_wrap_inline1616 and tex2html_wrap_inline1618 . The probability of matching letter x to state q is called tex2html_wrap_inline1624 . If two states in the path are not directly connected by a transition, the transition probability has to be calculated by filling in the appropriate delete states. For example, the probability of a transition from the match state at position 1 ( tex2html_wrap_inline1590 ) to the insert state at position 4 ( tex2html_wrap_inline1628 ), passing through delete states tex2html_wrap_inline1594 , tex2html_wrap_inline1632 , and tex2html_wrap_inline659 , is


By taking the negative logarithm of equation (1) it can be turned into a more familiar type of alignment `cost':


Now the numbers can be interpreted as `penalties'. A term like tex2html_wrap_inline1636 is a positive penalty for initiating a deletion after position i, and tex2html_wrap_inline1640 is the penalty for continuing a deletion at position i. Some of the other terms have similar interpretations.

next up previous
Next: Estimation of the model Up: Algorithm Previous: Algorithm

Rey Rivera
Thu Aug 29 15:28:54 PDT 1996