In principle the hidden Markov model method is a simple and elegant way of modeling sequence families. For practical purposes, however, various problems arise, of which the most important ones were discussed in this paper. A variety of heuristics and extensions to overcome the problems which are all part of the SAM software package were presented. Most of these techniques were mentioned in our original paper [Krogh et al., 1994a], but here we have discussed them in more detail, and treated it from a more practical perspective, whereas we have not focussed on the biological results at all.
The algorithm for estimating HMMs is a simple hill-climbing optimization technique which will very often find suboptimal solutions resulting in inferior models. Three methods were introduced to deal with the problem. Firstly, several different models can be trained and the best one selected, based on the over-all NLL score. Secondly, noise of slowly decreasing size can be added during the estimation process in a fashion similar to simulated annealing, and thirdly the surgery method for adding and deleting states from the model was shown to also help to obtain models with better NLL scores.
The general theory for HMMs does not tell how to choose the topology of the model. In our work we have chosen a model structure that we believe fits the biological sequences particularly well, and most of the probability parameters can be interpreted as penalties familiar from other alignment methods, such as gap penalties. For choosing the length of the model the above mentioned surgery heuristics was introduced, which deletes parts of the model not used very much and inserts more states where the existing states are `over-loaded.'
In any parameter estimation process over-fitting is a danger, in particular when there are many parameters and little data. This can be overcome using Dirichlet and Dirichlet mixture prior distributions to regularize the models. They are particularly useful because they incorporate prior biological information and insight into the model but can be overcome by sufficient numbers of sequences.
For the problem of modeling domains and other subsequences we introduced the free insertion modules (FIMs). These modules treat the part of the sequences outside the domain as completely random sequences, and by an example it was shown that an HMM can locate domains automatically. By using several FIMs one can model proteins with many subdomains (always appearing in the same order). By using long backward transition, one can in principle model domains occurring several times in different order, but it is not currently implemented in the software package.
SAM is an evolving system. Future additions will include repeated motif location, alternate scoring methods using null models, subsequence to submodel training, sequence weighting, an improved user interface, and use of our new algorithm for parallel sequence alignment in limited space to greatly extend the capabilities of the MasPar code [Grice et al., 1995].