We can use Bayesian probability techniques to interpret the pseudocount regularizers. To apply these methods we have to view amino acids as being generated by a two-stage random process: first, a 20-dimensional density vector $\vec\rho$ over the amino acids is chosen randomly, then amino acids are chosen randomly with probabilities $\rho(i)$. The probability of amino acid $i$ given a sample $s$ is the integral over all possible vectors $\vec\rho$ of the probability of choosing that vector times the probability of choosing $i$ given that vector:
$$P(i \mid s) = \int_{\vec\rho} P(\vec\rho \mid s)\, P(i \mid \vec\rho)\, d\vec\rho = \int_{\vec\rho} P(\vec\rho \mid s)\, \rho(i)\, d\vec\rho \;.$$
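The two-stage process can be sketched in a few lines of Python. This is an illustration only: the prior used here is a symmetric Dirichlet (the Dirichlet model is introduced below), and the function names are ours.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def sample_density(z):
    """First stage: draw a random density vector rho from a Dirichlet(z)
    prior, using the standard normalized-Gamma-variate construction."""
    g = [random.gammavariate(zj, 1.0) for zj in z]
    total = sum(g)
    return [x / total for x in g]

def sample_amino_acids(rho, n):
    """Second stage: draw n amino acids i.i.d. with probabilities rho(i)."""
    return random.choices(AMINO_ACIDS, weights=rho, k=n)

random.seed(0)
rho = sample_density([1.0] * 20)   # symmetric prior, for illustration only
sample = sample_amino_acids(rho, 10)
```

The sample counts $s(j)$ of the text are then just the number of times each amino acid appears in `sample`.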
Computing the probability $P(\vec\rho \mid s)$ requires applying Bayes' rule:
$$P(\vec\rho \mid s) = \frac{P(s \mid \vec\rho)\, P(\vec\rho)}{P(s)} \;,$$
giving us a new formula for the probability of amino acid $i$:
$$P(i \mid s) = \frac{1}{P(s)} \int_{\vec\rho} P(s \mid \vec\rho)\, P(\vec\rho)\, \rho(i)\, d\vec\rho \;.$$
The probability $P(s \mid \vec\rho)$ is easily computed for any density vector $\vec\rho$, but we need to know the prior distribution $P(\vec\rho)$ in order to compute the integral. The computation of $P(s \mid \vec\rho)$ is the same as in Section 2.1:
$$P(s \mid \vec\rho) = \frac{|s|!}{\prod_j s(j)!} \prod_j \rho(j)^{s(j)} \;.$$
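This multinomial likelihood is easy to check numerically: summed over every possible count vector with a fixed sample size, the probabilities must total one. A small sketch (the function name is ours):

```python
from math import factorial, prod
from itertools import product as cartesian

def sample_probability(s, rho):
    """P(s | rho): multinomial probability of the count vector s
    given a fixed density vector rho."""
    coeff = factorial(sum(s))
    for sj in s:
        coeff //= factorial(sj)
    return coeff * prod(r ** sj for r, sj in zip(rho, s))

# Sanity check: over all count vectors with |s| = 2 and three
# amino-acid types, the probabilities sum to 1.
rho = [0.5, 0.3, 0.2]
counts = [s for s in cartesian(range(3), repeat=3) if sum(s) == 2]
total = sum(sample_probability(s, rho) for s in counts)
```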
There is an obvious generalization to non-integer $s(j)$ values, replacing the factorial function with the equivalent expression using the Gamma function:
$$P(s \mid \vec\rho) = \frac{\Gamma(|s|+1)}{\prod_j \Gamma(s(j)+1)} \prod_j \rho(j)^{s(j)} \;.$$
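The substitution is licensed by the identity $\Gamma(n+1) = n!$ at the integers, with the Gamma function interpolating smoothly in between. A quick numerical check:

```python
from math import gamma, factorial, isclose

# Gamma(n + 1) reproduces n! at every integer argument...
for n in range(10):
    assert isclose(gamma(n + 1), factorial(n))

# ...and is defined between them, so fractional counts s(j)
# pose no problem: e.g. s(j) = 1.5 uses Gamma(1.5 + 1).
half_count = gamma(2.5)   # = 0.75 * sqrt(pi)
```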
In order to compute the integral, we must choose a model for the prior distribution of $\vec\rho$. One choice that allows us to compute the integral is to model the prior as a Dirichlet distribution, that is,
$$P(\vec\rho) = C \prod_j \rho(j)^{z(j)-1}$$
for some parameter vector $\vec z$, where $C$ is a constant chosen so that $\int_{\vec\rho} P(\vec\rho)\, d\vec\rho = 1$.
Showing in detail how to compute the integral is beyond the scope of this paper, but the answer can be derived from the standard definition of the Beta function [GR65, p. 948]
$$B(x, y) = \int_0^1 t^{x-1} (1-t)^{y-1}\, dt = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)}$$
and the combining formula [GR65, p. 285]:
$$\int_0^u t^{x-1} (u-t)^{y-1}\, dt = u^{x+y-1}\, B(x, y) \;.$$
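Both identities are easy to confirm numerically. The sketch below assumes the combining formula is the rescaled Beta integral $\int_0^u t^{x-1}(u-t)^{y-1}\,dt = u^{x+y-1}B(x,y)$, and uses a simple midpoint rule:

```python
from math import gamma

def midpoint_integral(f, a, b, n=200_000):
    """Midpoint-rule quadrature of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

x, y = 2.5, 3.5

# Beta function: integral form vs. Gamma-function form.
beta_int = midpoint_integral(lambda t: t**(x - 1) * (1 - t)**(y - 1), 0.0, 1.0)
beta_gamma = gamma(x) * gamma(y) / gamma(x + y)

# Combining formula: integrating over [0, u] rescales by u**(x + y - 1).
u = 2.0
comb_int = midpoint_integral(lambda t: t**(x - 1) * (u - t)**(y - 1), 0.0, u)
comb_closed = u**(x + y - 1) * beta_gamma
```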
By writing the integral over all vectors $\vec\rho$ as a multiple integral over the 20 dimensions of the vector and doing some rearrangement, we can get the solution
$$\int_{\vec\rho} \prod_j \rho(j)^{z(j)-1}\, d\vec\rho = \frac{\prod_j \Gamma(z(j))}{\Gamma(|z|)} = B(\vec z) \;,$$
where we have introduced the notation $B(\vec z)$ as a simple generalization of $B(x, y)$ to the vector argument $\vec z$.
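The vector Beta function $B(\vec z) = \prod_j \Gamma(z(j)) / \Gamma(|z|)$ is straightforward to implement, and for a two-component argument it reduces to the classical $B(x, y)$. A sketch (`vbeta` is our name for it):

```python
from math import gamma

def vbeta(z):
    """Beta(z) = prod_j Gamma(z(j)) / Gamma(|z|), where |z| = sum_j z(j)."""
    result = 1.0
    for zj in z:
        result *= gamma(zj)
    return result / gamma(sum(z))

# For a 2-vector this is the classical Beta function B(x, y).
two_dim = vbeta([2.5, 3.5])
classical = gamma(2.5) * gamma(3.5) / gamma(6.0)
```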
With this choice of prior distribution for $\vec\rho$, we can compute the normalizing constant
$$C = \frac{1}{B(\vec z)} = \frac{\Gamma(|z|)}{\prod_j \Gamma(z(j))} \;.$$
We can now compute the estimated probability of the sample:
$$P(s) = \int_{\vec\rho} P(s \mid \vec\rho)\, P(\vec\rho)\, d\vec\rho = \frac{\Gamma(|s|+1)}{\prod_j \Gamma(s(j)+1)} \cdot \frac{B(\vec s + \vec z)}{B(\vec z)} \;.$$
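A sketch of this marginal probability, assuming the standard Dirichlet-multinomial form $P(s) = \bigl(\Gamma(|s|+1)/\prod_j \Gamma(s(j)+1)\bigr)\, B(\vec s + \vec z)/B(\vec z)$ (function names are ours):

```python
from math import gamma

def vbeta(z):
    """Beta(z) = prod_j Gamma(z(j)) / Gamma(sum_j z(j))."""
    result = 1.0
    for zj in z:
        result *= gamma(zj)
    return result / gamma(sum(z))

def sample_marginal(s, z):
    """P(s): marginal probability of the count vector s under a
    Dirichlet(z) prior on the density vector rho."""
    coeff = gamma(sum(s) + 1)
    for sj in s:
        coeff /= gamma(sj + 1)
    return coeff * vbeta([sj + zj for sj, zj in zip(s, z)]) / vbeta(z)

# Under a uniform prior (z = all ones) over two amino-acid types,
# a single observation of either type should have probability 1/2.
p = sample_marginal([1, 0], [1.0, 1.0])
```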
The integral for estimating the conditional probability of amino acid $i$ given sample $s$ is then
$$P(i \mid s) = \frac{1}{P(s)} \int_{\vec\rho} P(s \mid \vec\rho)\, P(\vec\rho)\, \rho(i)\, d\vec\rho = \frac{B(\vec s + \vec z + \vec e_i)}{B(\vec s + \vec z)} = \frac{s(i) + z(i)}{|s| + |z|} \;.$$
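The final simplification $B(\vec s + \vec z + \vec e_i)/B(\vec s + \vec z) = (s(i)+z(i))/(|s|+|z|)$ follows from the recurrence $\Gamma(x+1) = x\,\Gamma(x)$, and can be confirmed numerically. A sketch, with arbitrary hypothetical counts and prior parameters:

```python
from math import gamma, isclose

def vbeta(z):
    """Beta(z) = prod_j Gamma(z(j)) / Gamma(sum_j z(j))."""
    result = 1.0
    for zj in z:
        result *= gamma(zj)
    return result / gamma(sum(z))

s = [5.0, 1.0, 2.0]            # observed counts (hypothetical)
z = [0.5, 1.2, 2.3]            # Dirichlet prior parameters (hypothetical)
sz = [sj + zj for sj, zj in zip(s, z)]

for i in range(len(s)):
    e_i = [1.0 if j == i else 0.0 for j in range(len(s))]
    beta_ratio = vbeta([a + b for a, b in zip(sz, e_i)]) / vbeta(sz)
    pseudocount = (s[i] + z[i]) / (sum(s) + sum(z))
    assert isclose(beta_ratio, pseudocount)
```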
Notation: $\vec e_i$ is used above to mean the vector consisting of a one in the $i$th position and a zero elsewhere; that is, $e_i(j)$ is one if $i = j$ and zero otherwise.
This rather involved computation finally ends up with the pseudocount method for estimating the probability of an amino acid given a sample of amino acids: $\hat p_i = (s(i)+z(i))/(|s|+|z|)$. The regularizer parameters $\vec z$ can be interpreted as assuming a Dirichlet distribution for the prior probabilities $P(\vec\rho)$. Previous work with pseudocounts has relied heavily on this Bayesian interpretation of the parameters, going so far as to choose $\vec z$ so that $z(i)/|z|$ matches the assumed prior probabilities, which does indeed provide the optimal estimates for $|s| = 0$, but which we have seen in Section 3.2 is not the best setting of the parameters for $|s| > 0$.
The posterior distribution of $\vec\rho$ after seeing a sample $s$ is $P(\vec\rho \mid s) = P(s \mid \vec\rho)\, P(\vec\rho) / P(s)$. As we can see from the above computations, this posterior distribution is again a Dirichlet distribution, with parameters $s(j)+z(j)$ instead of the prior distribution's parameters $z(j)$. This interpretation of $\vec s + \vec z$ as the parameters of the posterior distribution is what inspired naming them the posterior counts. The scaling of $\vec s + \vec z$ does matter for this interpretation, and so not all the posterior counts produced by regularizers can be automatically interpreted as Dirichlet posterior distributions on $\vec\rho$.
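That the posterior is Dirichlet with parameters $\vec s + \vec z$ can be checked by importance sampling: draw $\vec\rho$ from the prior, weight each draw by the likelihood $\prod_j \rho(j)^{s(j)}$, and compare the weighted mean of $\rho(i)$ with the pseudocount estimate $(s(i)+z(i))/(|s|+|z|)$. A Monte Carlo sketch with hypothetical parameters (so the tolerance is deliberately loose):

```python
import random

def sample_dirichlet(z):
    """Draw rho ~ Dirichlet(z) via normalized Gamma variates."""
    g = [random.gammavariate(zj, 1.0) for zj in z]
    total = sum(g)
    return [x / total for x in g]

random.seed(0)
z = [2.0, 3.0, 4.0]            # prior parameters (hypothetical)
s = [5, 1, 2]                  # observed counts (hypothetical)

num = 0.0                      # likelihood-weighted sum of rho(0)
den = 0.0                      # sum of likelihood weights
for _ in range(100_000):
    rho = sample_dirichlet(z)
    w = 1.0
    for rj, sj in zip(rho, s):
        w *= rj ** sj          # likelihood weight: prod_j rho(j)**s(j)
    num += w * rho[0]
    den += w

mc_estimate = num / den
exact = (s[0] + z[0]) / (sum(s) + sum(z))   # (5 + 2) / (8 + 9)
```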
We can extend the Bayesian analysis to compute the posterior distribution of $\vec\rho$ given that we have seen several independent samples: $P(\vec\rho \mid s_1, \ldots, s_n)$. The computation is fairly straightforward. First we apply Bayes' rule:
$$P(\vec\rho \mid s_1, \ldots, s_n) = \frac{P(s_1, \ldots, s_n \mid \vec\rho)\, P(\vec\rho)}{P(s_1, \ldots, s_n)} \;.$$
Repeating the mathematics for a single sample would be tedious, but we can take a shortcut. Since the posterior distribution after seeing a sample is again a Dirichlet distribution, we can treat it as the prior distribution for adding the next sample. Using this trick, we can see that the final posterior distribution after seeing all $n$ samples is a Dirichlet distribution with parameters $z(j) + \sum_{t=1}^{n} s_t(j)$. In other words, we get the same result from observing $n$ independent samples as we would get from adding all the samples together and using the resulting counts as a single sample.
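The shortcut can be seen directly in code: updating the Dirichlet parameters one sample at a time gives the same posterior as pooling all the counts first. A sketch with hypothetical prior parameters and samples:

```python
def dirichlet_update(params, counts):
    """One Bayesian update: a Dirichlet(params) prior plus an observed
    count vector yields a Dirichlet(params + counts) posterior."""
    return [p + c for p, c in zip(params, counts)]

z = [0.5, 1.2, 2.3]                          # prior parameters (hypothetical)
samples = [[3, 0, 1], [0, 2, 2], [1, 1, 0]]  # n = 3 independent samples

# Sequential updating: each posterior becomes the next sample's prior.
sequential = z
for s in samples:
    sequential = dirichlet_update(sequential, s)

# Pooled updating: add all the samples together, then update once.
pooled_counts = [sum(col) for col in zip(*samples)]
pooled = dirichlet_update(z, pooled_counts)
```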