For many search and comparison algorithms involving protein sequences, we need to estimate the probability of seeing each of the twenty amino acids in a given context. This probability is often expressed indirectly as a score for each of the amino acids, with positive scores for expected amino acids and negative scores for unexpected ones.
For example, in sequence-sequence alignment, the traditional scoring matrices assign a positive score for each amino acid that would be a good match to the one in the reference sequence, and a negative score to each that would be a poor match.
As Altschul pointed out [Alt91], any alignment scoring system is really making an assertion about the probability of the test sequences given the reference sequence. The score for an alignment is the sum of the scores for individual matched positions, plus the costs for insertions and deletions. We'll only look at the match positions here, although one could make similar arguments for the amino acids in insert positions. For sequence-sequence alignment, the only information about a match position that we can use for alignment is what amino acid was seen in that position in the reference sequence.
For each match position, there are twenty scores--one for each of the possible amino acids in the test sequence. Each match score can be interpreted as the logarithm of the ratio of two estimated probabilities: the probability of the test amino acid given the amino acid in the reference sequence and the probability of the the test amino acid in the background distribution.
Let's define
as the estimated probability that amino acid
i will be seen in the test sequence aligned with amino acid j in
the reference sequence and
as the estimated probability
that an amino acid i will be seen in any position of the test sequence.
Then the score
for matching test amino acid i to reference amino acid j is
for some arbitrary logarithmic base
b. [Alt91]
Any method for estimating the probabilities
and
defines a match scoring system for sequence-sequence
alignment. Rather than looking at the final scoring system, this paper
will concentrate on the methods that can be used for estimating the
probabilities themselves.
In more sophisticated models than single sequence alignments, such as
multiple alignments, profiles [GME87], and hidden Markov
models [KBM
94, BCHM94], we may have more than one reference
sequence in our training set.
Each position of such a model will define a context for which we
to want to estimate the probabilities of the twenty amino acids.
The only information we will use about the context is
the sampling of the amino acids we have seen in that position in the
reference sequences.
In this paper, I'll use s to refer to a sample of amino acids and
s(i) to the number of times that amino acid i appears in that sample.
Our problem, then is to compute the estimated probabilities
for the context from which sample s was taken, given
only the twenty numbers s(i).
Note that aligning a test sequence to a single reference sequence is a
special case of this problem, in which the sample consists of just a
single amino acid. Similarly, estimating the background
is a special case in which the sample is empty
(
).
For alignment and search problems, we usually add scores from many positions, and so fairly small improvements in computing the individual match scores can add up to significant overall differences. For example, the small differences between the PAM scoring matrices and the BLOSUM scoring matrices have been shown to make a significant difference in the quality of search results [HH92].
The differences between regularizers is often fairly small. In this paper we attempt to quantify these small differences for several different methods for estimating the distributions. Section 2 explains the measure used to quantify the tests, Section 3 lists the different methods tested, Section 4 explains how the parameters of the different methods are set, Section 5 describes the data used for training and testing, and Section 6 gives the quantitative comparisons of the different methods.