- dist.20comp
- This mixture was added to our suite of mixtures in November 2001.
It was trained on 314,585 columns from 1216 PDB chains.
The 1216 ids in this set were chosen from one of Dunbrack's culled PDB
sets (20% 3.0 Ang) with fragments removed (no piece >=20 residues) and
further restricting the set to those that had >=10 chains in
t2k-thin62 (t2k alignments thinned to 62% residue identity).
The t2k-thin62 alignments were used to get the column counts, with no weighting.
Although not yet tested, it is hoped that this mixture will do a
better job of generalizing small samples to distributions typical of
the superfamily.
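As a rough illustration of what "column counts, with no weighting" means, the sketch below tallies residues per alignment column, with every sequence counting as 1. The function name `column_counts`, the toy alignment, and the decision to skip gaps and non-standard letters are assumptions made for illustration; they are not details of the t2k-thin62 pipeline.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_counts(alignment):
    """Return one Counter per column; every sequence counts as 1 (no weighting)."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    counts = []
    for column in zip(*alignment):
        # Assumption: gaps and non-standard letters are simply skipped.
        counts.append(Counter(res for res in column if res in AMINO_ACIDS))
    return counts


toy = ["ACD-", "ACE-", "GCDK"]  # hypothetical three-sequence alignment
cols = column_counts(toy)
```

Each entry of `cols` is the count vector for one column, which is the kind of input a Dirichlet mixture is trained on.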
- recode5.20comp
- There was some hope that this mixture would supersede
recode4.20comp and recode3.20comp, but so far
recode3.20comp still seems to be better.
recode5.20comp was reoptimized from recode3.20comp on a selection
of 198,567 columns taken from sam-t99 alignments. The columns were
those in the "overrep" training set (essentially FSSP plus a set of
high-resolution structures with up to 50% identity) weighted with
the Henikoffs' weighting scheme, scaled to gain
an average of 0.5 bits/column relative to the background.
Only columns from alignments with a reasonably diverse multiple
alignment (weight of 5.0 or more) were used
(198,567 columns out of 497,210).
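The sequence weighting mentioned above can be sketched as follows, assuming that "the Henikoffs' weighting scheme" refers to the position-based weights of Henikoff and Henikoff (1994). The function name, the gap handling, and the toy alignment are illustrative assumptions. The later rescaling to a target number of bits per column is not shown, so the weights here are simply normalized to sum to 1.

```python
from collections import Counter


def henikoff_weights(alignment, gap="-"):
    """Position-based sequence weights, normalized to sum to 1.

    In each column, a sequence receives 1 / (k * n), where k is the number
    of distinct residue types in the column and n is the number of
    sequences sharing this sequence's residue in that column.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        # Assumption: gap positions neither give nor receive weight.
        counts = Counter(res for res in column if res != gap)
        k = len(counts)
        for i, res in enumerate(column):
            if res != gap:
                weights[i] += 1.0 / (k * counts[res])
    total = sum(weights)
    if total == 0:
        raise ValueError("alignment contains no residues")
    return [w / total for w in weights]


toy = ["AC", "AC", "GC"]  # hypothetical alignment
w = henikoff_weights(toy)
```

In the toy alignment the third sequence gets the largest weight, because it is the only one carrying G in the first column.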
- recode4.20comp
- This was Karplus's favorite mixture for HMMs intended
for finding distantly related proteins, superseding
recode1.20comp, recode2.20comp, and recode3.20comp, but subsequent testing
showed recode3.20comp to be preferable.
The recode4.20comp mixture was re-optimized from
fournier-fssp.20comp on the fssp-3-5-98-select-0.8-3.cols data set,
to minimize the errors in estimating distributions from samples of 1,
2, or 3 amino acids. It differs from earlier mixtures in the "recode"
series in predicting a broader distribution for any given set of counts.
It does a better job of matching the distributions in structural alignments than
earlier mixtures in the series.
The recode3.20comp mixture may still be more appropriate for modeling
close homologs, as it does slightly better on the target98
alignments. (Note: this may be an artifact, as the target98 alignments
were built using either the recode3.20comp mixture or the similar
recode2.20comp mixture.)
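To make "estimating distributions from samples of 1, 2, or 3 amino acids" concrete, here is a minimal sketch of the standard posterior-mean estimate under a Dirichlet mixture. The function names and the two-component mixture over a three-letter alphabet are hypothetical toy values chosen for readability; the real mixtures have 20-letter components, and this is not the code used to build or apply them.

```python
import math


def log_marginal(counts, alpha):
    """log P(counts | alpha) for one Dirichlet component.

    The multinomial coefficient is omitted because it is identical for
    every component and cancels when the posteriors are normalized.
    """
    n, a = sum(counts), sum(alpha)
    result = math.lgamma(a) - math.lgamma(a + n)
    for c, al in zip(counts, alpha):
        result += math.lgamma(al + c) - math.lgamma(al)
    return result


def component_posteriors(counts, mixture):
    """Posterior probability of each component given the observed counts."""
    logs = [math.log(q) + log_marginal(counts, alpha) for q, alpha in mixture]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(x - top) for x in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def posterior_mean(counts, mixture):
    """Estimated distribution: posterior-weighted average over components."""
    post = component_posteriors(counts, mixture)
    n = sum(counts)
    estimate = [0.0] * len(counts)
    for weight, (_, alpha) in zip(post, mixture):
        a = sum(alpha)
        for i, (c, al) in enumerate(zip(counts, alpha)):
            estimate[i] += weight * (c + al) / (n + a)
    return estimate


# Hypothetical mixture: (mixture coefficient, alpha vector) per component.
toy_mixture = [(0.5, [5.0, 0.1, 0.1]), (0.5, [1.0, 1.0, 1.0])]
estimate = posterior_mean([1, 0, 0], toy_mixture)  # a sample of one residue
```

A mixture that "predicts a broader distribution" leaves more probability on the unobserved letters in `estimate` for the same small sample.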
- fournier-fssp.20comp
-
The fournier-fssp.20comp mixture was created by
Dave Fournier. He started from the
recode3.20comp mixture, then optimized to maximize the likelihood of
the fssp-3-5-98-select-0.8-3.cols data set, using the
AD Model Builder program. This mixture is very good at computing
the likelihood of a given amino-acid distribution, but is not as good
at estimating the distribution given a sample of one or two amino
acids from the distribution.
- recode3.20comp
-
Optimized for the target98 alignments built for all the leaves of
the FSSP tree (version from 3-5-98). The sequences were weighted to
obtain an average information content of 1.0 bits/column (relative to
using background frequencies).
This regularizer was trained on a dataset that included columns
with few counts, so it probably overestimates the probability of
residues being conserved.
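One common reading of "information content in bits/column relative to background frequencies" is the relative entropy between a column's distribution and the background, averaged over columns. The sketch below computes that quantity under this assumption. The function names and the four-letter toy alphabet are illustrative. The search that adjusts sequence weights until the average reaches a target such as 1.0 bits/column is not shown.

```python
import math


def bits_saved(column_probs, background):
    """Relative entropy, in bits, of one column against the background."""
    return sum(
        p * math.log2(p / b)
        for p, b in zip(column_probs, background)
        if p > 0  # terms with p == 0 contribute nothing
    )


def average_bits_per_column(columns, background):
    """Mean relative entropy over a list of column distributions."""
    return sum(bits_saved(col, background) for col in columns) / len(columns)


# Hypothetical four-letter alphabet with a uniform background.
background = [0.25, 0.25, 0.25, 0.25]
columns = [
    [1.0, 0.0, 0.0, 0.0],      # fully conserved column
    [0.25, 0.25, 0.25, 0.25],  # column identical to the background
]
average = average_bits_per_column(columns, background)
```

Raising the total sequence weight makes the estimated column distributions sharper, which increases this average; lowering it pulls the estimates toward the regularizer and decreases the average.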
- hydro-cons.3comp
-
This is a three-component mixture (hydrophobic/hydrophilic/highly
conserved) that does quite well at encoding a database based on DALI
structural alignments from FSSP. Despite having only 62 degrees of
freedom, it does better than more complicated mixtures that were
trained on different datasets. The database it was trained on
consisted of FSSP alignments (with Z-score >= 7.0), in which there
were at least 22 sequences. The sequences were weighted with the
Henikoffs' weighting scheme with total weight=num_seq ^ 0.25.
This mixture is a good one for people who want a minimal number of
parameters and an easily explained mixture.
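The figure of 62 degrees of freedom is consistent with the usual parameter count for a Dirichlet mixture over 20 amino acids: 20 alpha parameters per component, plus the mixture coefficients, one of which is fixed because they sum to 1. The helper below is a hypothetical illustration of that arithmetic, not part of any distributed tool.

```python
def mixture_degrees_of_freedom(n_components, alphabet_size=20):
    """Free parameters in a Dirichlet mixture.

    Each component has alphabet_size alpha parameters; the mixture
    coefficients sum to 1, so only n_components - 1 of them are free.
    """
    return n_components * alphabet_size + (n_components - 1)


three_comp = mixture_degrees_of_freedom(3)    # 3 * 20 + 2
nine_comp = mixture_degrees_of_freedom(9)     # 9 * 20 + 8
twenty_comp = mixture_degrees_of_freedom(20)  # 20 * 20 + 19
```

By the same count, a 9-component mixture has 188 free parameters and a 20-component mixture has 419, which is the sense in which the three-component mixture is minimal.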
- recode2.20comp
Previously Karplus's favorite mixture, superseding recode1.20comp.
- Optimized for a subset of realigned HSSP files. The sequences
were weighted to obtain an average information content of 1.4
bits/column (relative to using background frequencies), then only
those columns that had a total weight of at least 4.5 were used, to
ensure that only alignments representing a moderately or very diverse
family were used. The mixture was repeatedly tweaked by hand and
re-optimized to make the individual components as meaningful as
possible (physicochemically).
This mixture does a good job of regularizing, and should also be
useful for recoding inputs to neural networks (using the component
probabilities instead of amino acid frequencies).
(The recode1.20comp regularizer did not do as good a job of recoding
highly conserved C, G, H, P, and W as recode2.20comp does.)
- byst-4.5-0-3.9comp
Current best 9-component Dirichlet mixture.
- Optimized for the same subset of realigned HSSP files as
recode1.20comp, but starting from the 9-component mixtures we have
used previously. This mixture (and recode1.20comp) should be better
for remote homology search than our previous ones, but may not be
quite as good for aligning very close homologs.
- rev4-opt3-weight.9comp
Old best 9-component Dirichlet mixture.
- Optimized for the revision 4 realigned HSSP whole chains, with
entropy weighting.
- uprior.9comp
Published 9-component Dirichlet mixture.
- Optimized for the unweighted BLOCKS database.
This was our first really good Dirichlet mixture. Many subsequent
9-component mixtures were created by retraining this one for different
data sets. The optimization problem for Dirichlet mixtures is quite
difficult, since there are many local minima of similar quality, and
it takes a lot of optimization to get the value of the local minimum
determined well enough to distinguish it from the others. This
mixture provides a particularly good starting point for optimization,
since it has components that are fairly easily explained.
This is the mixture described in the tech report and the CABIOS article.
- CGP-opt6.12comp
A 12-component Dirichlet mixture.
- This mixture provides separate components for
highly conserved C, P, and G residues, which might be useful when the
probabilities of the individual components are used to predict
properties of the position, but as a regularizer, this mixture is
probably worse than the 9-component mixtures above, since it was
optimized on a smaller data set with unweighted sequences.
Eventually, we'll get around to creating a mixture which separates
the highly-conserved residues into separate components, and which
works well as a regularizer, but we haven't needed this feature much yet.
- merge-opt.13comp
-
A 13-component mixture derived from merging some of the best
9-component mixtures and reoptimizing. One notable feature of this
mixture is that there are components centered around ND, ED, and EK,
while the 9-component mixtures tend to clump these into only one or
two components. The component that 9-component mixtures have for
"highly conserved residues" is also split into two components (one
mainly for C and W, the other mainly for P and G---these should
probably both be split further, to get down to components dominated by
a single residue).
-