Dirichlet mixture regularizers are clearly the best choice for applications that can afford their computing cost. In fact, they are so close to the theoretical optimum for regularizers that there seems to be little point in looking for better ones. Evaluations of regularizers for searches in biological contexts have also found Dirichlet mixtures to be superior [TAK94, HH95], validating the more information-theoretic approach taken here.
Although most applications (such as training hidden Markov models or building profiles from multiple alignments) do not require frequent evaluation of regularizers, there are some applications (such as Gibbs sampling) that require recomputing the regularizers inside an inner loop. For these applications, the substitution matrix plus pseudocounts plus scaled counts is probably the best choice, as it has only about 0.03 bits more excess entropy than the Dirichlet mixtures, but does not require evaluating Gamma functions.
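To make the cost difference concrete, here is a minimal Python sketch, not the implementation used in this report. It assumes the usual posterior-mean form of a Dirichlet mixture estimate, in which each component is weighted by its posterior probability given the observed counts. The function names, the three-letter toy alphabet, and all parameter values are hypothetical. The substitution-matrix variant is not shown.

```python
from math import exp, lgamma, log

def pseudocount_estimate(counts, pseudocounts):
    """Add a fixed pseudocount to each count and normalize (no Gamma functions)."""
    total = sum(counts) + sum(pseudocounts)
    return [(s + z) / total for s, z in zip(counts, pseudocounts)]

def log_beta(x):
    """Log of the multivariate Beta function: sum(lgamma(x_i)) - lgamma(sum(x))."""
    return sum(lgamma(v) for v in x) - lgamma(sum(x))

def dirichlet_mixture_estimate(counts, weights, alphas):
    """Posterior-mean estimate under a mixture of Dirichlet densities.

    weights[c] is the prior weight of component c and alphas[c] its
    parameter vector (all entries must be positive).
    """
    # Log posterior weight of each component, up to a shared constant.
    # This step is where the Gamma-function evaluations are needed.
    log_post = []
    for q, alpha in zip(weights, alphas):
        shifted = [s + a for s, a in zip(counts, alpha)]
        log_post.append(log(q) + log_beta(shifted) - log_beta(alpha))

    # Normalize in a numerically stable way.
    top = max(log_post)
    post = [exp(lp - top) for lp in log_post]
    norm = sum(post)
    post = [p / norm for p in post]

    # Each component contributes a pseudocount-style estimate.
    estimate = [0.0] * len(counts)
    for p_c, alpha in zip(post, alphas):
        for i, value in enumerate(pseudocount_estimate(counts, alpha)):
            estimate[i] += p_c * value
    return estimate

if __name__ == "__main__":
    counts = [3, 0, 1]  # toy sample over a three-letter alphabet
    print(pseudocount_estimate(counts, [1.0, 1.0, 1.0]))
    print(dirichlet_mixture_estimate(
        counts, [0.5, 0.5], [[5.0, 0.5, 0.5], [0.5, 0.5, 5.0]]))
```

In this sketch the pseudocount estimate needs only additions and one division per letter, while the mixture needs several log-Gamma evaluations per component on every call, which is what becomes expensive inside an inner loop.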
For applications in which there is little data to train a regularizer, pseudocounts are probably the best choice, as they perform reasonably well with few parameters. If you have enough data to train a substitution matrix technique, then you should have enough data to train a Dirichlet mixture, since the two methods have comparable numbers of parameters.
One weakness of the empirical analysis done in this report is that all the data was taken from the BLOCKS database, which contains only highly conserved blocks. While this leads us to have high confidence in the alignment, it also means that the regularizers do not have to do much work. The appropriate regularizers for more variable columns may look somewhat different, though one would expect the pseudocount and substitution matrix methods to degrade more than the Dirichlet mixtures, which naturally handle high variability. I plan to build regularizers for the HSSP structural alignments [SS91] to check that Dirichlet mixtures are the most effective in that application as well.
To get significantly better performance than a Dirichlet mixture regularizer, we have to step away from using a pure regularizer that only knows about the sample of amino acids seen in the context. There are at least two ways to do this. One uses additional information about the column (such as solvent accessibility or secondary structure); the other uses additional information about the sequence (such as a phylogenetic tree relating it to other sequences).
Using extra information about a column could improve the performance of a regularizer up to the ``full'' row shown in Table 2.1, but no more, since that entropy reflects the best we could do if the extra information uniquely identified the column. About 0.6 bits could be gained by using such information (relative to a sample size of 5), far more than the difference between the best regularizer and a crude zero-offset regularizer.
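The reason the full-column entropy acts as a floor can be illustrated with a simplified single-column calculation. The Python sketch below is an assumption-laden illustration, not the evaluation procedure used for Table 2.1: it measures the average cost in bits of encoding a column's residues with an estimated distribution, and compares that with the cost when the column's own frequencies are used. The function names and the four-letter toy alphabet are hypothetical.

```python
from math import log2

def encoding_cost_bits(column_counts, estimate):
    """Average bits per residue to encode the column using `estimate`.

    `estimate` must give nonzero probability to every letter that
    actually occurs in the column.
    """
    total = sum(column_counts)
    return -sum(c / total * log2(p)
                for c, p in zip(column_counts, estimate) if c > 0)

def gap_to_full_bits(column_counts, estimate):
    """Extra bits paid relative to knowing the column's own distribution."""
    total = sum(column_counts)
    own = [c / total for c in column_counts]
    return (encoding_cost_bits(column_counts, estimate)
            - encoding_cost_bits(column_counts, own))

if __name__ == "__main__":
    column = [2, 2, 0, 0]               # toy column over four letters
    uniform = [0.25, 0.25, 0.25, 0.25]  # an uninformed estimate
    print(gap_to_full_bits(column, uniform))
```

No estimate can produce a negative gap for a given column, so information that pins down the column exactly marks the limit of what any column-based regularizer can achieve.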
One possible way to use such column information would be to classify each column with one of a small number of labels, and to tune a different regularizer for each label. For this application, pseudocount regularizers are probably most appropriate, both because the labeling will reduce the size of the training set, and because a good labeling should provide fairly pure distributions that shouldn't need the ability of Dirichlet mixtures to match a variety of different distributions. I plan to pursue creating such a collection of regularizers in spring and summer 1995.
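A collection of label-specific regularizers could be organized roughly as in the following Python sketch. The labels ("buried", "exposed"), the three-letter toy alphabet, and the pseudocount values are all hypothetical placeholders chosen for illustration; real values would have to be trained on labeled columns.

```python
def pseudocount_estimate(counts, pseudocounts):
    """Add a fixed pseudocount to each count and normalize."""
    total = sum(counts) + sum(pseudocounts)
    return [(s + z) / total for s, z in zip(counts, pseudocounts)]

# Hypothetical pseudocount vectors over a toy alphabet, e.g. (L, K, G).
# One vector per column label; the values here are made up.
REGULARIZERS = {
    "buried": [2.0, 0.2, 0.8],
    "exposed": [0.5, 2.0, 1.0],
}

def estimate_for_column(counts, label):
    """Pick the regularizer tuned for this column's label and apply it."""
    return pseudocount_estimate(counts, REGULARIZERS[label])

if __name__ == "__main__":
    sample = [1, 0, 0]  # a single observed residue of the first letter
    for label in REGULARIZERS:
        print(label, estimate_for_column(sample, label))
```

With so few parameters per label, each regularizer can be tuned on the smaller training set that the labeling leaves available.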
Using sequence-specific information may yield even larger gains than using column-specific information. Preliminary investigations at UCSC indicate that there may be a full bit per column to be gained by taking into account phylogenetic tree relationships among sequences in a multiple alignment. Even if phylogenetic tree data is not available, sequence distance information may be useful.
Another way to use sequence-specific information is to use modified regularizers for residues that are in contact, adjusting the probabilities for one amino acid based on what is present in the contacting position. I hope to work on this approach in summer 1995 as well.