
Pseudocounts

 

Pseudocount methods are a slight variant of the zero-offset method, intended to produce more reasonable distributions when |s|=0. Instead of adding the same constant zero-offset to every count, a different positive constant is added for each amino acid:

\[ X_s(i) \leftarrow s(i) + z(i) \]

These zero-offsets are referred to as pseudocounts, since they are used in a way equivalent to having counted amino acids.
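The update above can be sketched in a few lines of Python. This is a minimal illustration rather than the report's implementation: the function name `pseudocount_estimate`, the dictionary representation, and the three-letter toy alphabet with its pseudocount values are all assumptions made for the example.

```python
def pseudocount_estimate(counts, pseudocounts):
    """Estimate probabilities by adding a per-letter pseudocount to each count.

    counts: dict mapping letter -> observed count; missing letters count as 0.
    pseudocounts: dict mapping letter -> positive pseudocount, one per letter
        of the alphabet. Letters absent from this dict are ignored.
    """
    if any(z <= 0 for z in pseudocounts.values()):
        raise ValueError("every pseudocount must be positive")
    # Posterior counts: observed count plus that letter's pseudocount.
    posterior = {aa: counts.get(aa, 0) + z for aa, z in pseudocounts.items()}
    total = sum(posterior.values())
    # Normalize the posterior counts so that they sum to 1.
    return {aa: x / total for aa, x in posterior.items()}


# Hypothetical pseudocounts over a toy alphabet (illustration only).
toy_pseudocounts = {"A": 0.5, "G": 0.25, "W": 0.25}

# A sample containing three A's and nothing else.
estimate = pseudocount_estimate({"A": 3}, toy_pseudocounts)

# An empty sample: the estimate is just the normalized pseudocounts.
empty_estimate = pseudocount_estimate({}, toy_pseudocounts)
```

Letters that were never observed still receive non-zero probability, which is the point of adding the pseudocounts before normalizing.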

Again, as $|s| \to \infty$ the pseudocounts have diminishing influence on the probability estimate and $\hat{P}_s(i) \to s(i)/|s|$. For |s|=0, we can get $\hat{P}_s(i) = P_0(i)$, where $P_0$ is the background distribution, by setting $z(i) = a P_0(i)$ for any positive constant a. This setting of the pseudocounts has been referred to as background pseudocounts [LAB+93] or the Bayesian prediction method [TAK94] (for the Bayesian interpretation of pseudocounts, see Appendix B). For the Blocks database and |s|>0, the optimal value of a is near 1.0.
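Both limiting cases can be checked in one line each. The worked step below assumes the following notation, introduced here for illustration: s(i) is the observed count of amino acid i, z(i) its pseudocount, P_0(i) its background probability, and \hat{P}_s(i) the resulting estimate.

```latex
\begin{align*}
\hat{P}_s(i)
  &= \frac{s(i) + z(i)}{\sum_j \bigl(s(j) + z(j)\bigr)}
   = \frac{s(i) + a\,P_0(i)}{|s| + a}
  && \text{using } z(i) = a\,P_0(i),\ \textstyle\sum_j P_0(j) = 1, \\[4pt]
|s| = 0:\quad \hat{P}_s(i)
  &= \frac{a\,P_0(i)}{a} = P_0(i)
  && \text{for any } a > 0, \\[4pt]
|s| \to \infty:\quad \hat{P}_s(i) - \frac{s(i)}{|s|}
  &= \frac{a\,\bigl(P_0(i) - s(i)/|s|\bigr)}{|s| + a} \;\longrightarrow\; 0 .
\end{align*}
```

The constant a cancels only in the empty-sample case. For |s|>0 it controls how strongly the estimate is pulled toward the background distribution.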

For non-empty samples, the pseudocounts that minimize the encoding cost of Section 2.1 are not necessarily multiples of $P_0(i)$ (Section 4 describes how the pseudocounts are optimized). For example, Table 3.1 shows the probability density implied by the optimal pseudocounts for different values of |s|. To get the actual pseudocounts, multiply the densities by the weight at the top of each column.

For |s|=0, the weight is arbitrary, since no real counts are added to the pseudocounts, and the normalization of the posterior counts to probabilities will eliminate the overall weight. Since the weight is arbitrary, the reported weight for |s|=0 is chosen to get the best performance for |s|=1, holding the probabilities fixed so that optimality is not lost for |s|=0.
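The role of the weight can be illustrated with a short Python sketch. The helper names `scale_density` and `estimate`, the toy three-letter alphabet, and the density and weight values are all hypothetical. They are not the optimized values from the table.

```python
def scale_density(density, weight):
    """Turn a pseudocount density (summing to 1) into pseudocounts."""
    return {aa: weight * d for aa, d in density.items()}


def estimate(counts, pseudocounts):
    """Add pseudocounts to observed counts, then normalize to probabilities."""
    posterior = {aa: counts.get(aa, 0) + z for aa, z in pseudocounts.items()}
    total = sum(posterior.values())
    return {aa: x / total for aa, x in posterior.items()}


# Made-up density over a toy alphabet (illustration only).
density = {"A": 0.5, "G": 0.25, "W": 0.25}

# Empty sample: the weight cancels in the normalization,
# so both weights give back the density itself.
empty_light = estimate({}, scale_density(density, 0.5))
empty_heavy = estimate({}, scale_density(density, 8.0))

# One observed W: the weight now decides how far the estimate moves toward W.
one_light = estimate({"W": 1}, scale_density(density, 0.5))
one_heavy = estimate({"W": 1}, scale_density(density, 8.0))
```

With the small weight, the single observation dominates the estimate for W. With the large weight, the estimate stays close to the density. This is why a weight that is arbitrary for |s|=0 can still be tuned for performance at |s|=1.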

Note that four amino acids (G=glycine, P=proline, W=tryptophan, C=cysteine) consistently have much smaller pseudocounts than would be expected from the background distribution, while three (M=methionine, Q=glutamine, and S=serine) have consistently higher pseudocounts than expected.

The pseudocounts roughly reflect the chances of seeing the amino acid in a context in which we have not previously seen it. A low pseudocount for an amino acid means that the amino acid is not often seen in a context in which some other amino acid has already been observed. If the pseudocount is lower than we would expect from the background probabilities, then the amino acid must be more highly conserved than other amino acids. Using this reasoning, we expect that G, P, W, and C are often highly conserved. Using symmetric reasoning for pseudocounts that are higher than expected from the background probabilities, we also expect that M, Q, and S are less conserved than other amino acids.

Table 3.1: Density functions corresponding to optimal pseudocounts for different sample sizes |s|. The pseudocounts were optimized for the entire Blocks database, with weighted sequences. To get the actual pseudocounts, multiply the density by the weight for the pseudocounts given in the first row. Note that G, P, C, and W have smaller optimal pseudocounts than would be expected from scaling the background distribution (|s|=0).



Rey Rivera
Thu Aug 1 17:59:45 PDT 1996