Sequence Alignment and Modeling System
SAM-T02 HMM WWW Servers
SAM 3.5 (July 2005) is available!
The SAM documentation (the 175-page
manual is also available in PDF and PS)
discusses the changes from previous versions.
If you are a college, university, U.S. government lab, or nonprofit,
you can download the software from the
SAM distribution page.
If you are interested in SAM for commercial use, please request more
information from saminfo@cse.ucsc.edu
Martin Madera and Julian Gough have written a Perl converter between
SAM and HMMer 2.0 formats. You can
get it from them (be sure to read their excellent documentation!)
or download a 10/24/2000 copy.
Please read the ISMB99
tutorial on using HMMs
A linear hidden Markov model is a sequence of nodes, each
corresponding to a column in a multiple alignment. In our HMMs, each
node has a match state (square), insert state (diamond) and delete
state (circle). Each sequence uses a series of these states to
traverse the model from start to end. Using a match state indicates
that the sequence has a character in that column, while using a delete
state indicates that the sequence does not. Insert states allow
sequences to have additional characters between columns. In
many ways, these models correspond to profiles.
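As a minimal illustration of how a sequence traverses such a model, the Python sketch below places one sequence's characters into alignment columns given a state path. The `place_sequence` function, the `(state, node)` path encoding, and the toy data are assumptions made for this example; they are not SAM's internal data structures or file formats.

```python
def place_sequence(sequence, path, num_nodes):
    """Place the characters of `sequence` according to a state path.

    path is a list of (state, node) pairs, where state is "M" (match),
    "I" (insert), or "D" (delete) and node is a 0-based column index.
    Returns (columns, inserts):
      columns[k] is the character in column k, or "-" if it was deleted;
      inserts[k] holds extra characters emitted after column k.
    """
    columns = ["-"] * num_nodes
    inserts = [""] * num_nodes
    pos = 0  # next unread character of the sequence
    for state, node in path:
        if state == "M":
            # Match state: the sequence has a character in this column.
            columns[node] = sequence[pos]
            pos += 1
        elif state == "I":
            # Insert state: an additional character between columns.
            inserts[node] += sequence[pos]
            pos += 1
        elif state == "D":
            # Delete state: no character in this column.
            columns[node] = "-"
        else:
            raise ValueError(f"unknown state: {state!r}")
    if pos != len(sequence):
        raise ValueError("path does not consume the whole sequence")
    return columns, inserts


# Toy example: a 4-node model and a sequence that inserts one character
# after column 0 and skips column 1.
example_path = [("M", 0), ("I", 0), ("D", 1), ("M", 2), ("M", 3)]
example_columns, example_inserts = place_sequence("ACGT", example_path, 4)
```

Here the match states of nodes 0, 2, and 3 consume A, G, and T, the insert state of node 0 consumes C, and the delete state of node 1 leaves that column empty.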
The primary advantage of these models over
standard methods of sequence search
is their ability to characterize an entire family of sequences. Thus,
each position has a distribution of bases, as do transitions
between states. That is, these linear HMMs have position-dependent
character distributions and position-dependent insertion and deletion
gap penalties. Aligning each member of a family to a trained model
automatically yields a multiple alignment among those sequences.
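To make the idea of position-dependent distributions and gap penalties concrete, the following Python sketch scores a state path through a toy three-node model. Everything in it is an assumption for illustration: the `path_log_prob` function, the dictionary layout, the two-letter alphabet, and the probabilities are invented, and start and end transitions as well as insert-to-delete and delete-to-insert transitions are left out for brevity. It is not how SAM stores or scores models.

```python
import math


def path_log_prob(sequence, path, match_emit, insert_emit, trans):
    """Log-probability of `sequence` following `path` through a toy linear HMM.

    path        : list of (state, node) pairs, state in {"M", "I", "D"}
    match_emit  : match_emit[k][c]  = P(character c | match state of node k)
    insert_emit : insert_emit[k][c] = P(character c | insert state of node k)
    trans       : trans[k][(s, t)]  = P(next state is t | in state s of node k)
    """
    total = 0.0
    pos = 0
    prev = None
    for state, node in path:
        if prev is not None:
            prev_state, prev_node = prev
            # Transition cost depends on the node being left.
            total += math.log(trans[prev_node][(prev_state, state)])
        if state == "M":
            total += math.log(match_emit[node][sequence[pos]])
            pos += 1
        elif state == "I":
            total += math.log(insert_emit[node][sequence[pos]])
            pos += 1
        # Delete states emit nothing.
        prev = (state, node)
    return total


# Hypothetical model: node 0 prefers A, node 1 prefers C, node 2 is neutral.
match_emit = [{"A": 0.9, "C": 0.1}, {"A": 0.2, "C": 0.8}, {"A": 0.5, "C": 0.5}]
insert_emit = [{"A": 0.5, "C": 0.5}] * 3
trans = [
    # Leaving node 0: skipping the next column is fairly cheap.
    {("M", "M"): 0.7, ("M", "I"): 0.1, ("M", "D"): 0.2,
     ("I", "M"): 0.6, ("I", "I"): 0.4},
    # Leaving node 1: skipping the next column is heavily penalized.
    {("M", "M"): 0.98, ("M", "I"): 0.01, ("M", "D"): 0.01,
     ("I", "M"): 0.6, ("I", "I"): 0.4,
     ("D", "M"): 0.9, ("D", "D"): 0.1},
    {},  # last node: end transitions omitted in this sketch
]

full_match = path_log_prob("ACA", [("M", 0), ("M", 1), ("M", 2)],
                           match_emit, insert_emit, trans)
skip_node1 = path_log_prob("AA", [("M", 0), ("D", 1), ("M", 2)],
                           match_emit, insert_emit, trans)
skip_node2 = path_log_prob("AC", [("M", 0), ("M", 1), ("D", 2)],
                           match_emit, insert_emit, trans)
```

In this toy model a deletion at node 2 costs far more than a deletion at node 1, which is the kind of position-specific behavior a single global gap penalty cannot express.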
The SAM software system is a collection of tools for creating
and using these models.
The algorithms and methods used by SAM and other HMM systems
were initially described
in several papers from the University of California, Santa Cruz.
These papers, several of which are described below, are available
in the
UCSC Computational Biology
group's
Protein FTP
directory.
The complete SAM documentation
is available in
compressed (.gz) postscript and as a series of
WWW pages.
We also have a
2-page overview of SAM in
postscript.
SAM runs on Unix workstations.
Building a model using SAM can require minutes to several hours
on a workstation depending on the length of the model, the number of
sequences, and other factors.
SAM makes use of UCSC's Dirichlet mixture
regularizer research.
The creation and distribution of SAM has been supported in
part by NSF grants
CDA-9115268, IRI-9123692, DBI-9408579 and
DBI-9808007; ONR grant N00014-91-J-1162; NIH
grants GM17129 and 1 R01 GM068570-01; DOE
grant DE-FG03-95ER62112; a grant
from the Danish Natural Science Research Council; and the UCSC Center for Biomolecular
Science and Engineering.
Sean Eddy has written another program suite based on these methods
called HMMER,
which may also be of interest. SAM includes conversion programs
between the two systems' formats.
Hidden Markov models are used extensively in
speech recognition.

Hidden Markov models in computational biology: Applications to
protein modeling.
A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler.
Journal of Molecular Biology, 235:1501-1531, February
1994.
The original journal article.

Hidden Markov models for sequence analysis: Extension and
analysis of the basic method.
R. Hughey and A. Krogh,
CABIOS 12(2):95-107, 1996.
(HTML version)
or
(POSTSCRIPT version)
Experimental evaluation of noise methods and regularizers, with
discussions of surgery, the parallel SAM code, and finding motifs.

Hidden Markov Models for Detecting Remote Protein Homologies
K. Karplus, C. Barrett, and R. Hughey, Bioinformatics 14(10):846-856, 1998.
(HTML version) or (postscript).
Detailed discussion of the SAM-T98 method we applied to CASP3 to
predict protein structure.

Predicting protein structure using hidden Markov models
K. Karplus, K. Sjolander, C. Barrett, M. Cline, D. Haussler,
R. Hughey, L. Holm, C. Sander, Proteins: Structure, Function, and
Genetics, pp. 134-139, Supplement 1, 1997.
(HTML version)
Discussion of our CASP2 methods for using hidden Markov models to
predict protein structure.

Weighting Hidden Markov Models for Maximum Discrimination.
R. Karchin and R. Hughey,
Bioinformatics, 14(9):772-782, 1998.
(HTML version with mangled table headings)
and postscript.
Adding internal weighting to SAM to create SAM Version 2.0. Includes
a comparison of SAM to HMMer, Meta-MEME, and Probabilistic
Smith-Waterman (from the Agarwal and States paper) based on 67 discrimination
tests from Pearson.

C. Tarnas and R. Hughey
Reduced space hidden Markov model training
Bioinformatics 14(5):401-406, 1998.
Also available in
postscript
and pdf.
Discussion and analysis of the implementation of the checkpoint method
(see Grice, below) in SAM.

Transparencies from our
CASP2 talk, at
which UCSC's hidden Markov model methods earned some of the
very top overall scores among threading-based predictions of
protein structure.

Scoring Hidden Markov Models
C. Barrett, R. Hughey, and K. Karplus
CABIOS 13(2):191-199, 1997.
Available in
postscript and
compressed (.gz) postscript
as well.
Experimental evaluation of several different scoring methods using
both SAM and HMMer.

Tutorial: Stochastic Modeling Techniques: Understanding and using hidden
Markov models.
L. Grate, R. Hughey, K. Karplus, K. Sjölander. University of
California, Santa Cruz, CA, June 1996. SAM and HMMER tutorial used at
ISMB in June 1996. (compressed
postscript (.ps.Z))

"A Flexible Motif Search Technique based on Generalized Profiles"
(compressed postscript)
Philipp Bucher, Kevin Karplus, Nicolas Moeri, and Kay Hofmann,
Computers and Chemistry 20(1):3-24, January 1996. (postscript)
An evaluation of search techniques for linear hidden Markov
models and generalized profiles.

J. Alicia Grice, Richard
Hughey, and Don Speck
Reduced Space Sequence Alignment
CABIOS 13(1):45-53, 1997.
To be part of SAM2.0, this checkpoint method has many advantages over
the divide-and-conquer method.

SAM: Sequence alignment and modeling software system.
R. Hughey and A. Krogh,
Technical Report UCSC-CRL-95-7, University of California,
Santa Cruz, CA, January 1995. (Regularly updated.)
The SAM documentation.

Dirichlet Mixtures: A Method for Improving
Detection of Weak but Significant Protein Sequence Homology.
Sjolander, K.,
Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D.
The most up-to-date discussion of Dirichlet Mixtures. The method is
an option in SAM.

Using Dirichlet mixture priors to derive hidden Markov models for
protein families.
M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, K. Sjolander, and
D. Haussler.
In L. Hunter, D. Searls, and J. Shavlik, editors, Proc. of First
Int. Conf. on Intelligent Systems for Molecular Biology,
pages 47-55, Menlo Park, CA, July 1993. AAAI/MIT Press.
The original Dirichlet paper.

Massively parallel biosequence analysis.
R. Hughey.
Technical Report UCSC-CRL-93-14, University of California, Santa
Cruz, CA, April 1993.
(HTML version)
or
(POSTSCRIPT version)
Parallel sequence analysis on specialized hardware, and the parallel
SAM code.
Other papers and pointers of interest (please email new pointers!)

"Profile Hidden Markov Models" Sean R. Eddy (1998)
Bioinformatics 14(9), review of HMMs.

"Maximum Discrimination Hidden Markov Models of Sequence Consensus"
Sean R. Eddy, Graeme Mitchison, and Richard Durbin (1995).
J. Computational Biology 2:9-23. PostScript; 30 pages.
Describes an alternative to maximum likelihood parameter optimization
for HMMs which compensates for the biased sequence representation
caused by phylogenetic relationships.

"Multiple Alignment Using Hidden Markov Models"
Sean R. Eddy (1995).
Proc. Third Int. Conf. Intelligent Systems for Molecular Biology,
C. Rawlings et al., eds. AAAI Press, Menlo Park. pp. 114-120.
PostScript; 7 pages.
Describes a simulated annealing algorithm for HMM training and
a probabilistic suboptimal alignment algorithm. Compares HMM-based
multiple alignment to CLUSTALW.

Parameterization studies for the SAM and HMMER methods of hidden Markov
model generation. Marcella A. McClure, Chris Smith, and Pete Elton.
Proc. Fourth Int. Conf. Intelligent Systems for Molecular Biology,
D. States et al., eds. AAAI Press, Menlo Park. pp. 155-164. A
detailed comparison of HMM training methods for constructing
multiple alignments.

"Fitting a mixture model by expectation maximization to discover motifs in biopolymers" ,
Timothy L. Bailey and Charles Elkan,
Proceedings of the Second International Conference on Intelligent
Systems for Molecular Biology, pp. 28-36, AAAI Press, 1994, and
an associated MEME server.

"Meta-MEME: Motif-based Hidden Markov Models of Protein Families".
Grundy, William N., Timothy L. Bailey, Charles P. Elkan and Michael E. Baker.
Computer Applications in the Biosciences, 13(4):397-406, 1997, and
an associated Meta-MEME server.

Searching for statistically significant regulatory modules.
Timothy L. Bailey and William Stafford Noble
Bioinformatics (Proceedings of the European Conference on Computational Biology),
19(Suppl. 2):ii16-ii25, 2003, and
an associated MCAST server.