The SAM system runs on a variety of Unix workstations (we have checked
installation on workstations including DEC DECstation and Alpha, HP
715, IBM RS6000, SGI Onyx Reality Engine, Sun Sparc, Intel Pentium
with the Linux operating system, and the UCSC Kestrel parallel processor).
The distribution includes an INSTALL file that discusses
installation procedures.
The gnuplot, gunzip, and uncompress programs should
be in the user's path, and other programs should be available as
required by SAM-T2K. See Section 4.11.
11.1 Environment variables
The SAM system has several environment variables, typically set in
csh using a command of the form:
setenv PRIOR_PATH /projects/compbio/lib
- PRIOR_PATH: Directory with prior libraries, regularizers,
and makelogo color definitions. See Section 8.1.1
and Section 10.10.4.
- BLASTMAT: Directory for BLAST scoring matrices.
See Section 10.2.8.
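The following is a minimal sketch, not part of the SAM distribution, of a pre-flight check for the requirements described above: the gnuplot, gunzip, and uncompress programs in the user's path, and the PRIOR_PATH and BLASTMAT directories. The function name check_sam_environment is hypothetical, and treating an unset variable as a problem is an assumption, since SAM may fall back to built-in defaults.

```python
import os
import shutil

# Programs the manual says should be in the user's path.
REQUIRED_PROGRAMS = ("gnuplot", "gunzip", "uncompress")
# Environment variables the manual describes as directories.
ENV_DIRECTORIES = ("PRIOR_PATH", "BLASTMAT")


def check_sam_environment(environ=None, which=shutil.which, isdir=os.path.isdir):
    """Return a list of problem descriptions; an empty list means none were found.

    The lookup functions are parameters so the check can be exercised
    without touching the real system (an illustrative design choice).
    """
    environ = os.environ if environ is None else environ
    problems = []
    for program in REQUIRED_PROGRAMS:
        if which(program) is None:
            problems.append(f"{program} not found in PATH")
    for name in ENV_DIRECTORIES:
        value = environ.get(name)
        if not value:
            # Assumption: an unset variable is worth reporting.
            problems.append(f"{name} is not set")
        elif not isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    for problem in check_sam_environment():
        print("warning:", problem)
```

Run as a script, this prints one warning per missing program or directory and prints nothing when everything is found.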
11.2 Runtime statistics
At the end of each run of buildmodel, a line of statistics is
printed out, such as the line
-218.36 -217.00 -217.68 0.96 22 0 149
mentioned in Section 3. These numbers are quite useful for
quick comparison of results when, for example, running the program
many times using a shell script. The numbers are: minimum NLL-NULL score,
maximum score, average score, sample deviation of
scores, number of re-estimates, number of surgeries, and the length of
the final model. In the above case, the scores are for the
training set: if a test set were specified (Section 7.4), the
minimum, maximum, average, and sample deviation for the test set would be
reported after the model length, followed by the ratio of the average
test set score to the average training set score (ideally, this value
should be close to unity -- larger values may indicate overfitting of
the model to the training set).
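When buildmodel is run many times from a script, the statistics line can be parsed mechanically. The sketch below is an illustration only and assumes that the fields are whitespace-separated, that a training-only run prints exactly the seven fields listed above, and that a run with a test set prints twelve (the seven fields, then the four test-set values, then the ratio). The names ScoreSummary, RunStatistics, and parse_statistics are hypothetical and are not part of SAM.

```python
from typing import NamedTuple, Optional


class ScoreSummary(NamedTuple):
    minimum: float
    maximum: float
    average: float
    deviation: float


class RunStatistics(NamedTuple):
    training: ScoreSummary
    reestimates: int
    surgeries: int
    model_length: int
    test: Optional[ScoreSummary] = None
    test_to_training_ratio: Optional[float] = None


def parse_statistics(line: str) -> RunStatistics:
    """Parse one buildmodel statistics line (field layout is an assumption)."""
    fields = line.split()
    if len(fields) not in (7, 12):
        raise ValueError(f"expected 7 or 12 fields, got {len(fields)}")
    training = ScoreSummary(*(float(f) for f in fields[0:4]))
    reestimates, surgeries, model_length = (int(f) for f in fields[4:7])
    if len(fields) == 7:
        return RunStatistics(training, reestimates, surgeries, model_length)
    # Test-set values follow the model length, then the test/training ratio.
    test = ScoreSummary(*(float(f) for f in fields[7:11]))
    return RunStatistics(training, reestimates, surgeries, model_length,
                         test, float(fields[11]))


stats = parse_statistics("-218.36 -217.00 -217.68 0.96 22 0 149")
print(stats.training.average, stats.reestimates, stats.model_length)
```

With a test set, the test_to_training_ratio field can be compared against unity to flag runs that may be overfitting.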
The underlying SAM programs were originally written to work with
single models and single databases of sequences. Many current users,
including UCSC, use SAM to create vast libraries of models and to
analyze many independent, small files of sequences.
If this is also appropriate for your application, you will undoubtedly
find yourself designing scripts, Makefiles, and databases in a manner
similar to any other large project.
Several features in SAM may help you with this design:
- If you have many models, you can place them in a model library,
and score the group of them against one or more databases all at the
same time. See Section 10.2.10.
- If you have large database files and would like faster results,
split the database into smaller segments, and run multiple hmmscore jobs at the same time. See Section 10.2.7.
- If you have a large number of database files, and get tired of
endlessly long db commands (or if you exceed system limits on
commandline length), you can put all those db commands into a
single file, and then use the insertion i command to include
this database list, and score all the databases.
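As an illustration of splitting a database into smaller segments, the sketch below distributes the records of a FASTA-format file round-robin over several output files, each of which could then be scored by a separate hmmscore job. It is not part of SAM: hmmscore has its own partitioning support (Section 10.2.7), and this version assumes a plain, uncompressed FASTA file in which every record starts with a '>' line. The function name split_fasta and the output naming scheme are hypothetical.

```python
from pathlib import Path


def split_fasta(source, n_segments, out_dir):
    """Write the FASTA records of `source` round-robin into n_segments files.

    Returns the list of segment paths. Round-robin assignment keeps the
    number of records per segment within one of each other.
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{source.stem}.part{i + 1}{source.suffix}"
             for i in range(n_segments)]
    handles = [path.open("w") for path in paths]
    try:
        record = -1  # index of the current record; -1 means no header seen yet
        with source.open() as lines:
            for line in lines:
                if line.startswith(">"):
                    record += 1
                if record >= 0:  # ignore anything before the first header
                    handles[record % n_segments].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Round-robin assignment balances the record count but not the total sequence length; if a few sequences are much longer than the rest, balancing by residue count may give more even job runtimes.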
There are many future features we would like to include in SAM. The
following list will also point out some of the things you currently
cannot do using the system. The items are of varying difficulty.
- Position-specific regularizer strengths to extend the special
node concepts between entirely fixed and entirely free.
- Model learning and combining using genetic algorithms.
- A coarse-grain (MPI) parallel implementation.
- A version that can run on the Kestrel programmable parallel
processor:
11.5 Prior versions
- Corrections to MSF and FSSP file reading.
- Addition of SW 3 option for modeling a domain within a longer
protein. See Section 8.5.
July, 1998.
- Release of 2.1.1 and 2.1.2.
- More efficient implementation of Dirichlet mixture priors.
June, 1998.
- Implementation of posterior-decoded alignments and output of
posterior-decoded values of the dynamic programming operation. The
viterbi variable has been renamed dpstyle. The old name viterbi is
currently retained as an alias.
See Section 9.5,
Section 10.1, and Section 10.5.
- Implementation of local and semi-local
training. See Section 9.5.
- Change in the definition of FIM probabilities: all outgoing
arcs from a FIM node now have (unnormalized) unity probability (zero
cost), as the internal delete to insert and insert to insert
transitions have always had. FIMs continue to not have match states.
Optionally, the FIM insert to insert transition can be set with fimtrans. See Section 8.5.
- Change in the implementation of jumps for local and semi-local
scoring. Jumps are now from the delete node after the initial FIM to
an internal match state and from an internal match state to the delete
node of the final FIM. Previously, internal delete states were used.
See Section 10.1.2.
- The extra delete state in alignments induced by
automatically-added FIMs is now automatically removed for both
alignments and multiple-domain alignments. If FIMs are present in the
model and auto_fim is set, the delete state is also removed.
- The prettyalign FASTA-like format no longer includes
semicolon-delimited comments.
April, 1998.
- User-defined alphabets for sequences. See Section 7.1.1.
- Negative fimstrength values will adjust both insert and
FIM states. See Section 8.5.
February, 1998.
- The multdomain program has been removed and its function
has been merged into hmmscore. The mdNLLminusNULL
parameter has been renamed mdNLLnull. The multdomainshort parameter has been renamed alignshort.
The old names are currently aliased to the new names for these two parameters.
See Section 10.2.5.
- The hmmscore program can now print selected sequence
alignments and selected sequence multiple domain alignments during
scoring. See Section 10.2.3
and Section 10.2.5.
- The interactive mode of hmmscore has been removed.
See Section 10.2.
- The scored sequence letter counts null model has been removed.
Null model scores can now be calculated based on the reverse
sequences. The simple_threshold variable determines when
complex null model calculations, such as the reverse null model or the
user's null model, should be performed in terms of the
simple null model score. See Section 10.2.1.
- The content of score files has changed, as has the use of select_seq, select_score, sort, and subtract_null. See Section 10.2.
- The uniqueseq program has been updated. See Section 10.12.9.
- The checkseq program has been updated. See Section 10.12.2.
- Scoring examples in this manual have been changed to use
fully-local scoring and hmmscore now prints a warning whenever
fully-local scoring is not used. See Section 10.2.4.
- The protein_prior and nucleotide_prior variables
can be used to specify default prior
libraries. The Dirichlet mixture recode1.20comp is now used by default
with protein sequences. See Section 8.1.
- Internal weighting in buildmodel is now by default
turned on with internal_weight set to 1. If an external weight file is specified
and internal_weight is not explicitly set on the command line,
internal weighting will be turned off. See Section 9.4.4.
- The sequence_models variable, when set, causes buildmodel
to create initial models from random sequences in the training set
which are then regularized. Each single sequence is given a
weight equal to the value of sequence_models. This option is
recommended and is expected to become default behavior in a
future release. It can both reduce runtime by providing an initial
starting point when an alignment is not available and increase
modeling performance.
See Section 8.3.
- The seed_runs parameter has been removed from buildmodel.
November, 1997.
- A complete rewrite of the inner dynamic programming loop to save
memory (see the Grice, Hughey, and Speck, and the Tarnas and Hughey
papers mentioned in the introduction) and allow local and semi-local
scoring and alignment, as well as Viterbi-based training. Memory use
is now proportional to the product of the model length and the square root
of the sequence length, rather than to the product of the model length and
the sequence length. See Section 10.1.2
and Section 10.2.4.
- HSSP-based structural transition regularizer. See Section 8.1.2.
- The multdomain program now performs scoring adjustments
identical to those of hmmscore when SW is
set. See Section 10.2.4.
- Internal sequence weighting inspired by HMMer's Maximum
Discrimination method has been implemented. It significantly
increases discrimination performance in the presence of
biased training sets. See Section 9.4.4.
- The a2mdots variable can be cleared to avoid printing
dots in a2m files, leading to an at times considerable space
reduction. See Section 10.2.
- The hmmscore program can partition a database to aid in
distributed scoring. See Section 10.2.7.
- SAM will now read its input from compressed (.gz or .Z) files.
See Section 5.
- The alphabet code has been rewritten to make it simpler for
those with source licenses to modify the code.
- Weight files for buildmodel initial alignments and modelfromalign alignments can be specified with the alignment_weights parameter. See Section 9.4.
- A broader collection of Dirichlet mixture and transition
regularizers is included with this version. See Section 8.1.
- The MasPar implementation is no longer supported.
August, 1996.
- The geometric average of the match state probabilities is now
available for use (and the default) with simple and complex null
models. Complex null models are now built from the transition and
insert probabilities of the model, and the geometric average of all
the model's match tables in the match table.
See Section 10.2.1.
- The train_reset_inserts variable causes buildmodel to, at the completion of each re-estimation cycle, reset all the
insertion and FIM tables to (by default) the geometric
average of the match states. Set to 0 to turn off. See Section 8.6.
- If no IDs are specified, the hmmscore and multdomain programs will read in sequences a few at a time, instead
of all at once, saving a tremendous amount of memory. If this is
done, sequence output by hmmscore is no longer sorted by score,
though the score file can still be sorted. Given a sorted score file
and unsorted sequence file, the new sortseq program will sort
the sequences according to the score file.
See Section 10.2,
Section 10.2.5, and Section 10.12.8.
- The uniqueseq program will eliminate sequences with
duplicate IDs from a file. The checkseq program will read a
sequence file and print information about it. See Section 10.12.9.
- Training noise is reduced by retrain_noise_scale
(default 0.1) whenever an initial model or alignment is provided.
Noise is also reduced between the first and successive surgery
iterations by surgery_noise_scale (default 0.1).
See Section 9.1.
- Models can be edited using the new utility program,
modifymodel. See Section 10.10.5.
- The program makehist will turn one or two .dist score files
into a histogram, makeroc will turn two .dist score
files into a false positive/false negative plot showing score vs.
counts (number of sequences with the score) and makeroc2 will
turn two .dist score files into a plot of false positive vs. false
negative as a function of threshold score. All three programs require gnuplot. See Section 10.11.
- Weight file reading is more robust. We plan to implement a
WWW weighting server which, given a multiple alignment, will return
sequence weights under a variety of weighting schemes.
- The a2mallcaps variable has been removed: modelfromalign, buildmodel, and other alignment reading
routines will first check to see if the file is an HSSP file; if not,
the a2m format will be checked, and if that does not result in every
sequence having the same number of columns, all characters will be
treated as uppercase. See Section 8.3.
- Binary model output. Models can be printed in human unreadable
binary form. This reduces file size to about one quarter, and greatly
increases model reading speed. See Section 8.4.5.
May, 1996.
- Weighted training. See Section 9.4.
- Sequence weight annealing. See Section 9.4.2.
- The ability to use files as model type specifiers rather than
keywords such as REGULARIZER. See Section 5.
- The ability to print scores of only those sequences doing better
than some threshold. See Section 10.2.
- When multiple models are trained, training is stopped for each
model individually according to the stopcriterion. Previously,
training was stopped when the average score difference reached the
stopcriterion. See Section 9.
- Modelfromalign can be told to treat all letters as match
columns, and turns the letter `O' (capital `o') into a FIM. See Section 10.7.
- Efficiency improvements to model reading.
- SAM alignments have undergone significant changes. Align2model output is now in a normal sequence format, though still
with uppercase, lowercase, `.' and `-' meanings. Prettyalign
can read any readseq format, with lower-case letters indicating
insertions. Prettyalign can no longer be used as a pipe. Modelfromalign can read any readseq format. The buildmodel program can be given an initial
alignment. See Section 10.7.
- The method for specifying multiple database files
or multiple sequence IDs has changed. Multiple db or id
declarations on the command line or in a parameter file will add to a
list of database files or id files.
- The command lines for many programs have changed. Except for
prettyalign, all programs now take arguments in the form of a
run name followed by variable name and value pairs. See Section 6.
- The modelfromalign program now uses prior libraries.
See Section 10.7.
- FIM normalization has been moved to another place in the code,
and can be avoided if desired. See Section 9.6.
March, 1996.
- The ability to globally apply various FIM and insertion table
settings to the regularizer during training and to the model during
scoring. This reflects a general cleaning up of the log-odds scoring
introduced in 1.1. The defaults are to use training set letter counts
in both FIMs and insert states for training, and match state frequency
averages for FIMs during scoring (with no change to the insert states).
See Section 8.6 and Section 10.2.1.
- The ability to score sequences according to the difference
between two models, such as models trained on positive and
negative family examples. See Section 10.2.1.
- By default, hmmscore will add FIMs to
a model before scoring it. See Section 10.2.1.
- By default, SAM will start with three models of random length,
and then pick the best model for surgery and further re-estimation.
This will increase runtime from Version 1.1, but improves model generation.
- Several new parameters exist. See Section 12.
- An interface is provided between HMMer and SAM.
See Section 10.10.7.
- Non-default initial models are no longer printed in model files.
This led to too much confusion about which model was the real one, as
well as those pesky ``non-default model being replaced'' messages.
- The use of Dirichlet mixture priors has been updated to reflect
our most recent work (http://www.cse.ucsc.edu/research/compbio/dirichlets/index.html). The
format of prior libraries has changed slightly, and only one prior
library (uprior9.plib) is included in the distribution. Mixtures in the earlier format may crash the program.
Mixture priors are particularly useful in database search from a small set of
training examples. See Section 8.1.
November, 1995.
- The default protein regularizer has
been changed from having the uniform distribution in the insert states
to having the background distribution. This generally helps
discrimination experiments, though it may hurt sequence alignment.
See Section 8.1.
- The default scoring has been changed from calculating Z-scores
using length bins to NULL model subtraction, which accounts for both
sequence length and wildcards. For this to produce valid results, FIMs
must be added to a model before it is scored.
This corresponds to the log-odds scoring used in Sean Eddy's HMMER.
See Section 10.2.1.
- Scoring and training with wildcards has been modified so that
sequences with many wildcards can be properly scored with null models.
- An iterative program for finding multiple motifs in a single
sequence is part of SAM. See Section 10.2.5.
January, 1995.
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group