The Sequence Alignment and Modeling (SAM) suite of programs includes several tools for modeling, aligning, and discriminating related DNA, RNA, and protein sequences. Given a set of related sequences, the system can automatically train and use a linear HMM representing the family. The SAM-T2K method, available in SAM 3.0 and higher, is a more automated method for building and using HMMs. Readers may wish to read the T2K example before the current section. See Section 4.
SAM uses a linear hidden Markov model (Figure 1) to represent biological sequences. The model is a linear sequence of nodes, each of which includes match (square), insert (diamond), and delete (circle) states. Each match state has a distribution over the appropriate alphabet indicating which characters are most likely. The chain of match states forms a model of the family, or of columns of a multiple alignment. Some sequences may not have characters in specific positions -- delete states enable them to skip through a node without `using up' any characters. Other sequences may have extra characters, which are modeled with the insert states. Insertions are thus used when a small number of sequences have positions not found in most other sequences, while delete states are used when a small number of sequences do not have a character in a position found in most other sequences. As with the match states, all transition probabilities (the chance of having a delete or moving to an insertion state) are local, enabling, for example, the system to strongly penalize sequences that delete conserved regions.
The primary programs include:
prettyalign program will make
align2model output more readable.
buildmodel.
A basic flowchart for using SAM is shown in Figure 2.
As a simple example, consider the task of modeling the 10 tRNAs included
in the file trna10.seq of the distribution. For this
experiment, default program settings will be used: the many
adjustable parameters are described Sections 6
and 12.
buildmodel. This program always requires a name for the run:
if the name is test, the system will create the model output
file test.mod, which will include parameter settings, iteration
statistics, and CPU usage, as well as the initial and final model.
Parameters to buildmodel are specified with hyphens. For this experiment, first we try the command:
buildmodel test -train trna10.seq -randseed 0This specifies the run name and where the sequences for training can be found. Here, to hopefully make this example reproducible, a seed for the random number generator has also been specified.
Buildmodel then prints out a line on standard output such as:
-39.71 -36.21 -37.54 1.02 36 2 78This is a brief summary of the statistics provided in the output model file, and is discussed in more detail in Section 11.2. If no random seed had been specified,
buildmodel would use the
process number, and the statistics line would be different for multiple
runs.
The run has generated a file called test.mod. This file
contains various statistics, described later, as well as the final
model. Statistics are printed to the file after each re-estimation
so that the progress of a run can be readily checked.
To generate a multiple alignment, a command such as the following is used:
align2model trna10 -i test.mod -db trna10.seq prettyalign trna10.a2m -l90 > trna10.prettyThis aligns each sequence to the model, places the alignment in the file trna10.a2m, and then cleans up the output and places it in the file
trna10.pretty, with 90 characters per
line. The alignment will look something like:
; SAM: prettyalign v3.5 (July 15, 2005) compiled 07/15/05_11:25:33 ; (c) 1992-2001 Regents of the University of California, Santa Cruz ; ; Sequence Alignment and Modeling Software System ; http://www.cse.ucsc.edu/research/compbio/sam.html ; ; --------- Citations (SAM, SAM-T2K, HMMs) ---------- ; R. Hughey, A. Krogh, Hidden Markov models for sequence analysis: ; Extension and analysis of the basic method, CABIOS 12:95-107, 1996. ; K. Karplus, et al., What is the value added by human intervention in protein ; structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001. ; A. Krogh et al., Hidden Markov models in computational biology: ; Applications to protein modeling, JMB 235:1501-1531, Feb 1994. ; ----------------------------------- 10 20 30 40 50 60 | | | | | | TRNA1 gg........GGAUGUAGCUCAG-.UGG...U.AGAGCGCAUGCUUCG.CAUGUAU.G.A.GGcCCCGGGUUCGAUCCCCGGC TRNA2 gc........GGCCGUCGUCUAGU.CUGgauU.AGGACGCUGGCCUCC.CAAGCCA.GcA.AU.CCCGGGUUCGAAUCCCGGC TRNA3 ggcccugugg-----CUAGC.UGG...UcAAAGCGCCUGUCUAG.UAAACAG.G.AgAU.CCUGGGUUCGAAUCCCAGC TRNA4 gggcga....--AUAGUGUCAG.CGG...G.AGCACACCAGACUUG.CAAUCUG.G.U.AG.GGAGGGUUCGAGUCCCUCU TRNA5 gccggg....--AUAGCUCAGU.UGG...U.AGAGCAGAGGACUGA.AAAUCCUcG.U.GU.CACCAGUUCAAAUCUGGUU TRNA6 gg........GGCCUUAGCUCAGC.UGG...G.AGAGCGCCUGCUUUG.CACGCAG.G.A.GG.UCAGCGGUCGA-CCCGCUA TRNA7 gg........GCACAUGGCGCAGU.UGG...U.AGCGCGCUUCCCUUG.CAAGGAA.G.AgGU.CAUCGGUUCGAUUCCGGUU TRNA8 gg........GCCCGUGGCCUAGU.CUGga.U.ACGGCACCGGCCUUC.UAAGCCG.GgG.AU.CGGGGGUUCAAAUCCCUCC TRNA9 cg........GCACGUAGCGCAGCcUGG...U.AGCGCACCGUCCUGGgGUUGCGG.G.G.GU.CGGAGGUUCAAAUCCUCUC TRNA10 uccgu.....--CGUAGUCUAGG.UGGu..U.AGGAUACUCGGCUUU.CACCCGA.G.A.GA.CCCGGGUUCAAGUCCCGGC70 | TRNA1 AUCUCCA----....... TRNA2 GGCCGCA----....... TRNA3 GGGGCCUCCA--....... TRNA4 UUGUCCACCA--....... TRNA5 -------ccuggca TRNA6 GGCUCCACCA--....... TRNA7 GCGUCCA----....... TRNA8 GGGUCC----g...... TRNA9 GUGCCGACCA--....... TRNA10 GACGGAACCA--.......
Here, hyphens indicate deletes while lower-case letters, and the corresponding periods (`.') indicate inserts. The column numbers refer to match states in the model, not to column numbers, so that insertions are disregarded in calculating these index points. It is important to remember that insertions are not aligned among them selves: the fact that two insertion characters are in the same printed column only means that they were generated by the same insertion state, not that they should be aligned.
![]() |
Model structure can be quite interesting. SAM includes two programs for examing models: makelogo and drawmodel
The first, makelogo, shows the negative entropy in bits of each match state and the relative frequencies of the different letters in a sequence logo format, invented by Tom Schneider and Mike Stephens (NAR 18:6097-6100, 1990) http://www-lmmb.ncifcrf.gov/ toms/index.html. The command:
makelogo test -modelfile test.modcreates the test.eps file of Figure 3. The program has many options. See Section 10.10.4.
The second, drawmodel,
program generates postscript drawings of models that include
match-state histograms and transition line styles that correspond to
their frequency of use. These drawings are most useful when derived
from frequency counts, values that can be optionally included in the
output file:
buildmodel test -train trna10.seq -randseed 0 -print_frequencies 1The command
drawmodel test.mod test.ps will run the drawmodel
program, which scans the file finding a model and frequency count
data. By selecting the frequency count data in `overall' mode, the
postscript drawing in
Figure 4 is generated.1 The histograms in the match states
correspond to the columns of the multiple alignment above, the numbers
in the diamond insert states correspond to the average number of
insertions for each sequence that uses that state, and the node
numbers are given in the circular delete states. Transitions that are
not used by a significant number of the sequences are not drawn.
See Section 10.10.2.
The scoring program, hmmscore, generates a file of the
negative log-likelihood minus NULL model (NLL-NULL, or log-odds)
scores for each sequence given the model. Let's see
how the 10 tRNA sequences fit the model (test.mod):
hmmscore test -i test.mod -db trna10.seq -sw 2The arguments are the name of the run, the model file, the database sequence file, and a specifier to use fully-local scoring similar to that of the Smith & Waterman algorithm (this is suggested for all scoring runs unless semi-local, domain, or global scoring is specifically required).
hmmscore produces the test.dist file of Figure 5.
The score file contains six columns. The first is the sequence identifier, followed by sequence length, the `NLL-NULL' score using a simple null model, the reverse sequence `NLL-NULL' and an E-value based on the size of the database and the reverse sequence score. The simple null model is just the product of the character probabilities in each sequence multiplied together. Thus, the NLL-NULL score show how much improvement modeling with an HMM gained versus modeling with just a single insertion state from the HMM. The reverse sequence null model is more expensive to calculate, but a much better indicator of sequence similarity to the model. The reverse sequence NLL-NULL score is the difference between the raw negative log-likelihood (NLL) sequence score from that of the raw NLL score of the reversed sequence. Thus, there for sequence null model matches exactly the length and composition of the sequence, but does not match the ordering of the residues.
Other options of hmmscore enable the selective output of sequences according to their NLL scores, NLL per base scores, or E-value. The hmmscore program enters an interactive mode when called with no command line arguments. See Section 10.2.
As you will see, the SAM system, being an active research tool, has a vast number of parameters. This section briefly mentions a few of the parameters and ways we have found to make SAM work reasonably well.
At UCSC, our primary use of SAM is for modelling proteins. In this case, the SAM-T2K method (described next) is the method to use for creating a multiple alignment of database sequences, which can then be turned into an HMM using the, for example, w0.5 script.
We have had less experience with DNA and RNA uses of SAM, though we have had many reports of excellent results in these fields. The SAM DNA alignment page uses the following parameters to create an HMM from a file of aligned sequences:
buildmodel run -train train.a2m -modellength 0 -alphabet DNA -alignfile train.a2m -aweight_method 2 -aweight_bits 0.3 -aweight_exponent 0.5
For the final alignment the command is:
hmmscore run -i run.mod -dpstyle 0 -adpstyle 5 -sw 2 -select_align 8 -db train.a2m -db database.seq -alphabet DNA
A few of the guiding principles in these two commands are: