We can use Bayesian probability techniques to interpret the
pseudocount regularizers.
To apply these methods we have to view amino acids as being generated
by a two-stage random process.
First, a 20-dimensional density vector $\rho$
over the amino acids is
chosen randomly, then amino acids are chosen randomly with
probabilities $\rho(i)$.
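The probability of amino acid $i$ given a sample $s$ is the integral
over all possible vectors $\rho$ of the probability of choosing that
vector times the probability of choosing $i$ given that vector:
\[ P(i \mid s) = \int_{\rho} P(\rho \mid s)\, \rho(i)\, d\rho . \]
Computing the probability $P(\rho \mid s)$ requires applying Bayes' rule:
\[ P(\rho \mid s) = \frac{P(s \mid \rho)\, P(\rho)}{P(s)} , \]
giving us a new formula for the probability of amino acid $i$:
\[ P(i \mid s) = \frac{1}{P(s)} \int_{\rho} P(s \mid \rho)\, P(\rho)\, \rho(i)\, d\rho . \]
The probability $P(s \mid \rho)$ is easily computed for any
density vector $\rho$, but we need to know the prior distribution of
$\rho$ in order to compute the integral.
The computation for $P(s \mid \rho)$ is the same as in Section 2.1:
\[ P(s \mid \rho) = |s|! \prod_j \frac{\rho(j)^{s(j)}}{s(j)!} . \]
There is an obvious generalization to non-integer $s(j)$ values by replacing the factorial function with the equivalent expression using the Gamma function:
\[ P(s \mid \rho) = \Gamma(|s|+1) \prod_j \frac{\rho(j)^{s(j)}}{\Gamma(s(j)+1)} . \]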
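As a minimal sketch of this computation, the Gamma-function form of the probability of a sample given a density vector can be evaluated in log space. The function name `log_sample_probability` and the three-letter alphabet are our own illustrative choices, not part of the original method description; a real application would use all 20 amino acids.

```python
import math

def log_sample_probability(s, rho):
    """Return log P(s | rho) for counts s (possibly non-integer) and density vector rho.

    Uses Gamma(|s|+1) * prod_j rho(j)^s(j) / Gamma(s(j)+1), evaluated with
    log-Gamma to avoid overflow for large counts.
    """
    total = sum(s)
    log_p = math.lgamma(total + 1.0)
    for count, p in zip(s, rho):
        if count == 0:
            continue  # rho(j)^0 / Gamma(1) contributes a factor of 1
        if p == 0.0:
            return float("-inf")  # a positive count of an impossible letter
        log_p += count * math.log(p) - math.lgamma(count + 1.0)
    return log_p

# Hypothetical three-letter alphabet, chosen only to keep the example short.
s = [2, 1, 0]
rho = [0.5, 0.3, 0.2]
p = math.exp(log_sample_probability(s, rho))
```

For integer counts this agrees with the factorial form: here $3!/(2!\,1!\,0!) \cdot 0.5^2 \cdot 0.3 = 0.225$.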
In order to compute the integral, we must choose a model for the
prior distribution of $\rho$.
One choice that allows us to compute the integral is to model the prior
as a Dirichlet distribution, that is
\[ P(\rho) = C \prod_j \rho(j)^{z(j)-1} \]
for some parameter vector $z$, where $C$ is a constant chosen so that
$\int_{\rho} P(\rho)\, d\rho = 1$.
Showing in detail how to compute the integral is beyond the scope of this paper, but the answer can be derived from the standard definition of the Beta function [GR65, p. 948,]
\[ B(x,y) = \int_0^1 t^{x-1} (1-t)^{y-1}\, dt = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)} \]
and the combining formula [GR65, p. 285,]:
\[ \int_0^b t^{x-1} (b-t)^{y-1}\, dt = b^{x+y-1} B(x,y) . \]
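The two integral identities the derivation relies on, the Beta function $B(x,y) = \Gamma(x)\Gamma(y)/\Gamma(x+y)$ and the rescaled form $\int_0^b t^{x-1}(b-t)^{y-1}\,dt = b^{x+y-1}B(x,y)$, can be checked numerically. The sketch below uses a simple midpoint rule; the helper names and the test values of $x$, $y$, and $b$ are arbitrary choices made for illustration.

```python
import math

def beta(x, y):
    """Beta function expressed through the Gamma function."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def midpoint_integral(f, lo, hi, n=20000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

# Arbitrary test values (assumptions for this check only).
x, y, b = 2.5, 3.5, 2.0

# Definition of the Beta function as an integral over [0, 1].
beta_numeric = midpoint_integral(
    lambda t: t ** (x - 1) * (1 - t) ** (y - 1), 0.0, 1.0)

# Combining formula: the same integrand rescaled to [0, b].
combined_numeric = midpoint_integral(
    lambda t: t ** (x - 1) * (b - t) ** (y - 1), 0.0, b)
```

Both numerical integrals should agree with the closed forms to within the accuracy of the quadrature.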
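By writing the integral over all $\rho$ vectors as a multiple integral
over the 20 dimensions of the vector and doing some rearrangement, we
can get the solution
\[ \int_{\rho} \prod_j \rho(j)^{z(j)-1}\, d\rho = \frac{\prod_j \Gamma(z(j))}{\Gamma(|z|)} = B(z) , \]
where $|z| = \sum_j z(j)$ and we have introduced the $B(z)$
notation as a simple
generalization of $B(x,y)$
to the vector argument $z$.
With this choice of prior distribution for
$\rho$, we can compute
\[ C = \frac{1}{B(z)} . \]
We can now compute the estimated probability of the sample
\[ P(s) = \int_{\rho} P(s \mid \rho)\, P(\rho)\, d\rho
        = \frac{\Gamma(|s|+1)}{\prod_j \Gamma(s(j)+1)} \cdot \frac{B(s+z)}{B(z)} . \]
The integral for estimating the conditional probability of amino acid $i$ given sample $s$ is then
\[ P(i \mid s)
   = \frac{\Gamma(|s|+1)}{P(s)\, B(z) \prod_j \Gamma(s(j)+1)}
     \int_{\rho} \prod_j \rho(j)^{s(j)+z(j)+\delta_{ij}-1}\, d\rho
   = \frac{B(s+z+\langle i \rangle)}{B(s+z)}
   = \frac{s(i)+z(i)}{|s|+|z|} . \]
Notation:
$\langle i \rangle$ is used above to mean the vector consisting of
a one in the $i$th position and a zero elsewhere.
$\delta_{ij}$ is
one if $i=j$ and zero otherwise.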
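The end result of this derivation can be checked numerically: the ratio of generalized Beta functions $B(s+z+\langle i \rangle)/B(s+z)$, where $B(x) = \prod_j \Gamma(x(j)) / \Gamma(\sum_j x(j))$, should equal the pseudocount estimate $(s(i)+z(i))/(|s|+|z|)$. The following is a sketch under our own naming; the four-letter alphabet and the particular values of `s` and `z` are made up for illustration.

```python
import math

def log_beta_vec(x):
    """log of the vector-argument Beta function: sum_j lgamma(x_j) - lgamma(sum_j x_j)."""
    return sum(math.lgamma(v) for v in x) - math.lgamma(sum(x))

def pseudocount_estimate(s, z):
    """Closed-form estimate (s(i) + z(i)) / (|s| + |z|) for every letter i."""
    total = sum(s) + sum(z)
    return [(sj + zj) / total for sj, zj in zip(s, z)]

def beta_ratio_estimate(s, z, i):
    """The same estimate computed as B(s + z + <i>) / B(s + z)."""
    post = [sj + zj for sj, zj in zip(s, z)]
    bumped = list(post)
    bumped[i] += 1.0  # add the unit vector <i>
    return math.exp(log_beta_vec(bumped) - log_beta_vec(post))

# Hypothetical four-letter alphabet with illustrative counts and parameters.
s = [3, 0, 1, 0]
z = [0.5, 1.0, 0.25, 0.75]
estimates = pseudocount_estimate(s, z)
ratios = [beta_ratio_estimate(s, z, i) for i in range(len(s))]
```

The two lists agree up to rounding, and the estimates sum to one, as a probability distribution over letters must.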
This rather involved computation finally ends up with the pseudocount
method for estimating the probability of an amino acid given a sample
of amino acids.
The regularizer parameters $z$ can be interpreted as assuming a
Dirichlet distribution for the prior probabilities
$\rho$.
Previous work with pseudocounts has relied heavily on this Bayesian
interpretation of the parameters, going so far as to assign
$z(i)$ the background probability of amino acid $i$, which does indeed
provide the optimal estimates for
$|s|=0$, but which we have seen in Section 3.2 is
not the best setting of the parameters for $|s|>0$.
The posterior distribution of $\rho$
after seeing a sample $s$ is
$P(\rho \mid s) = P(s \mid \rho)\, P(\rho) / P(s)$.
As we can see from the
above computations, this posterior distribution is again a Dirichlet
distribution, with parameters $s(j)+z(j)$, instead of the prior
distribution's parameters $z(j)$. This interpretation of
$s(j)+z(j)$ as the parameters of the posterior distribution is what
inspired naming them the posterior counts. The scaling of the
posterior counts
does matter for this interpretation, and so not all the posterior
counts produced by regularizers can be automatically interpreted as
Dirichlet posterior distributions on
$\rho$.
We can extend the Bayesian analysis to compute the posterior
distribution of $\rho$
given that we have seen several independent samples:
$P(\rho \mid s_1, \ldots, s_n)$.
The computation is fairly straightforward. First we apply Bayes' rule:
\[ P(\rho \mid s_1, \ldots, s_n)
   = \frac{P(\rho) \prod_{k=1}^{n} P(s_k \mid \rho)}{P(s_1, \ldots, s_n)} . \]
Repeating the mathematics for a single sample would be tedious, but we
can take a shortcut. Since the posterior distribution after seeing
a sample is again a Dirichlet distribution, we can treat it as the
prior distribution for adding the next sample. Using this trick, we
can see that the final posterior distribution after seeing all $n$
samples is a Dirichlet distribution with parameters
$z(j) + \sum_{k=1}^{n} s_k(j)$. In other words, we get the same result from
observing $n$ independent samples as we would get from adding all the
samples together and using the resulting counts as a single sample.
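The equivalence between sequential updating and pooling can be illustrated with a few lines of code. This is only a sketch: `dirichlet_update` is a hypothetical helper name, and the prior parameters and the three samples over a four-letter alphabet are invented for the example.

```python
def dirichlet_update(params, sample):
    """Posterior Dirichlet parameters after one sample: add the counts to the parameters."""
    return [p + c for p, c in zip(params, sample)]

# Illustrative prior parameters z and three independent samples (assumed values).
z = [0.5, 1.0, 0.25, 0.75]
samples = [[1, 0, 2, 0], [0, 3, 0, 1], [2, 2, 0, 0]]

# Treat each posterior as the prior for the next sample.
sequential = z
for sample in samples:
    sequential = dirichlet_update(sequential, sample)

# Alternatively, add all the samples together and update once.
pooled_counts = [sum(column) for column in zip(*samples)]
pooled = dirichlet_update(z, pooled_counts)
```

Both routes give the same parameter vector, because each update only adds counts and addition does not depend on how the counts are grouped.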