The SAM system understands several alphabets and many sequence formats.
The SAM system currently supports two nucleotide alphabets (`DNA' and `RNA'), one amino acid alphabet (`protein'), and several secondary structure alphabets (including `DSSP', `EHL', `EHL2', `EHTL', 'TCO', 'ALPHA', 'PB', 'STR', and 'ANG'), as well as user-defined alphabets of up to 25 letters. The predefined alphabets can be specified by setting the alphabet variable. If no alphabet is chosen, the first sequence in a specified file will be examined using readseq (discussed below) to determine if a nucleotide or protein alphabet should be used. If this method does not work, the protein alphabet is the default. The SAM software includes several warning messages if it appears that an incorrect alphabet has been chosen.
The alphabets use standard characters. DNA sequences are composed of the characters ``AGCTRYN'' and RNA of ``AGCURYN,'' where `R' is a purine (`G' or `A'), `Y' is a pyrimidine (`C,' or `T' or `U,' as appropriate), and `N' is a wildcard character that could be any of the four normal characters. SAM's sequence I/O routines can convert between DNA and RNA alphabets if the alphabet is specified incorrectly.
The protein alphabet is ``ACDEFGHIKLMNPQRSTVWYBZX.'' In addition to the twenty amino acids, `X' is the general wildcard character, `B' matches `N' or `D', and `Z' matches `Q' or `E.' Protein alignments (specified with alignfile to buildmodel or modelfromalign) converted to models can additionally use the letter `O' to indicate insertion of a free-insertion module. Except for these two instances, the `O' character is converted to the all-matching `X' wildcard.
The DSSP alphabet is ``EHTSGBICX'', including `E' (beta strand), `H' (alpha helix), `T' (turn), `S' (bend), `G' (3-10 helix), `B' (short beta bridge), `I' (pi helix), and `C' (random coil). The character `L' (loop) is an alias for `C'. The character `C' is used to indicate coils rather than the space character used by DSSP. The following secondary structure alphabets are subsets of the DSSP alphabets with various groups merged.
The TCO alphabet is ``EFGH''. `TCO' is the cosine of the dihedral
angle defined by the carbonyl double bonds of residue
and residue
-
(near +1 for helices and near -1 for sheet). The cosine
values are assigned to four classes `E' (extended), `F'
(transition-1), `G' (transition-2) and `H' (helical), determined by
the k-means algorithm.
The ALPHA alphabet is ``GHISTAEBCDF''. `ALPHA' is the virtual
dihedral angle between the
atoms of residues
-
,
,
+
and
+
. The angles are assigned to eleven classes:
,
,
,
,
,
,
,
,
,
,
,
according to peaks in the
distributions of ALPHA angles for the twenty possible side-chains.
Assignment was done manually, by Kevin Karplus.
The PB (`Protein Blocks') alphabet is ``ABCDEFGHIJKLMNOP''. PB was designed by de Brevern et. al. They used overlapping residue fragments of length five (chosen empirically), extracted from a non-redundant set of structures, and encoded them as sequence ``windows'' of (PHI, PSI) pairs called dihedral vectors. An unsupervised Kohonen self-organizing map network was trained on the dihedral vectors with an RMSDA (root mean square deviations on angular values) distance metric, to produce a set of 16 clusters, each with a representative Protein Block or PB. The clusters are `A' (N-cap beta), `B' (N-cap beta), `C' (N-cap beta), `D' (beta), `E' (C-cap beta), `F' (C-cap beta), `G' (mainly coil), `H' (mainly coil), `I' (mainly coil), `J' (mainly coil), `K' (N-cap alpha), `L' (N-cap alpha), `M' (alpha), `N' (C-cap alpha), `O' (C-cap alpha) and `P' (C-cap alpha to N-cap beta).
The STR (strand) alphabet is ``BGHSTLPAMZQE''an enhanced version of DSSP, conceived by Yael Mandel-Gutfreund. It subdivides DSSP letter E (beta strand) into 6 letters, according to properties of a residue's relationship to its strand partners. `P' is surrounded by two parallel partners, `A' by two anti-parallel partners, `M' by one anti-parallel and one parallel partner. Edge strands may be `Q' (parallel parnter) or `Z' (anti-parallel partner). A strand with no partners is assigned the letter `E'.
The ANG (angle) alphabet is based on a design by Bystoff et. al.
Bystroff partitioned the (PHI, PSI) plane into eleven regions, using
the k-means algorithm on all trans (PHI, PSI) pairs in PDB with
=
. Cluster boundaries were calculated with a Voronoi
method. All cis (PHI, PSI) pairs were assigned to an eleventh
cluster, which ANG does not use. In the ANG assignments, a (PHI,PSI)
pair is associated with a cluster representative from which it has the
minimum Euclidean distance. Representatives are as follows:
,
,
,
,
,
,
,
,
,
.
The BYS (Bystroff) alphabet is the same as the ANG alphabet with the addition of the letter `C' for cis peptides.
The RELSA2 alphabet (Rost and Sander, 1994) is a two-letter alphabet based on a residue's relative solvent-accessibile surface area. Residues with relative solvent accessibility less than 16% are assigned to the buried state and those above 16% are assigned to the exposed state.
The RELSA3 alphabet (Rost and Sander, 1994) is a three-letter alphabet based on a residue's relative solvent-accessibile surface area. Residues with relative solvent accessibility less than 9% are assigned to the buried state, those between 9% and 36% to the intermediate state, and those above 36% to the exposed state.
The RELSA10 alphabet (Rost and Sander, 1994) is a ten-letter alphabet based
on a residue's relative solvent-accessibile surface area. A residue is
assigned to one of the states according to the formula
.
The ABSSA7 alphabet is a seven-letter alphabet (all seven states are equiprobable) based on a residue's absolute (unnormalized) solvent-accessible surface area.
The RELSA7 alphabet is an equiprobable, seven-letter alphabet based on a residue's relative solvent-accessible surface area. This alphabet was selected to provide a direct comparison of relative and absolute solvent-accessibility measures, by measuring its performance against that of ABS-SA-7.
Neighborhood count alpahbets with seven equiprobable classes:
The BURIAL_GEN_8_7, BURIAL_GEN_9_7, and BURIAL_GEN_10_7 alphabets are based on neighborhood counts within a spherical cut-off of a residue of interest. These alphabets use maximally informative spots and count all atoms within the spherical-cutoff region. The radii of the spheres are 8, 9, and 10Å, respectively.
The CA_BURIAL_12_7 and CA_BURIAL_14_7 alphabets use the
atom as the sphere center and 12 and 14Å sphere radii. Residues are
represented by their
atoms, and only
atoms inside
the sphere are counted. The
of the residue of interest is not
included in the count.
The CB_BURIAL_12_7, CB_BURIAL_14_7, and CB_BURIAL_16_7 alphabets use the
atom as the sphere center (
for glycine), with 12, 14,
and 16Å sphere radii. Residues are represented by their
atoms
and only
atoms (except
for glycines) inside the sphere are
counted. The
of the residue of interest is not included in the
count.
In all alphabets, unknown characters are converted to wildcards and a warning message is printed.
When a model is created, a wildcard character's probability is the sum of the probabilities of the component letters. Thus, the `X' character will have unity probability, giving it no preference to one state over another. During the training process, wildcard character frequency counts are proportioned among the appropriate true characters according to the relative probabilities of those characters.
All alphabets have internal regularizers (single-component background distributions), and the protein alphabets includes several in Dirichlet mixture prior libraries in the SAM library directory. The listalphabets program should be relied on (rather than this documentation) for determining what alphabets are available, what character aliases are used, and what the default character and transition regularizers are. Alphabet background distributions can be changed with an alphbackfile. See Section 8.1, Section 8.1.1, and Section 10.12.1.
SAM also supports user-defined alphabets of 2 to 25 user-selected letters (`A'-`Z') and one (required) wildcard letter. The restriction to alphabetic characters is a result of the need for both uppercase and lowercase letters in the sequence alignment format. As the system always requires an all-matching wildcard, only 25 letters are allowed.
User-defined alphabets are specified with the alphabet_def variable. As with the standard alphabets, the definition will be included in all resulting models, so future specification of the alphabet on the command line is not required.
For example, performing the commands
buildmodel texttest -train text.seq -alphabetdef "text QWERTYUIOPASDFGHJKLZCVBNMX" align2model texttest -i texttest.mod -db text.seqresults in the alignment file:
>sentence1 thequ..ICKBROWNFOXJUMPEDOVERTHES.LOWLAZ.YDOG--.. ...................... >sentence1 thequ..ICKBROWNFOXJUMPEDOVERTHES.LOWLAZ.YDOG--.. ...................... >sentence2 thequicKERGREENFOXHOPPEDOVERTHES.LOWLUCkYPIG--.. ...................... >sentence3 .......-----------THES.LOWLAZ.YPIGWADDle dintothequickpurplefox >sentence4 thef...ASTBROWNFOXHOPPEDINTOTHEQuICKLAZ.YDOG--.. ......................
Note that the above example does not model the letter `X' because it is a wildcard: the `X' character was not trained and does not have a preference for any state over any other state.
A minimum of three characters, 2 normal and one wildcard, is required to define an alphabet. Default flat regularizers are created automatically, but users may wish to create their own alphabet-specific regularizers with regularizer_file.
As with alphabets, models are tagged with the alphabet_def line, for example
MODEL - Final model for run text alphabet_def text QWERTYUIOPASDFGHJKLZCVBNMx GENERIC 1.886984 0.254944 0.376488 .....See Section 8.4.
Alphabet definition statement set include a standard alphabet name and a hyphen for the character list referred to the standard alphabet and use its regularizers. For example, both of the following refer to the predefined protein alphabet:
listalphabets foo -alphabetdef "protein -" listalphabets foo -a protein
Data sets alphabets larger than 25 characters can also be modeled using SAM and user-defined alphabets.
In the numeric alphabet, characters are numbers and sequences use whitespace to separate the individual characters. For example, the trna10n.seq file is a translation of trna10.seq into a numeric alphabet. One sequence is:
>TRNAn1 1 1 1 1 0 3 1 3 0 1 2 3 2 0 1 3 1 1 3 0 1 0 1 2 1 2 0 3 1 2 3 3 2 1 2 0 3 1 3 0 3 1 0 1 1 2 2 2 2 1 1 1 3 3 2 1 0 3 2 2 2 2 1 1 2 0 3 2 3 2 2 0
All the normal commands can be used, as in:
listalphabets testn -alphabetdef "RNAnumbers 5" buildmodel testn -train trna10num.seq -randseed 0 -alphabetdef "RNAnumbers 5" hmmscore testn -modelfile testn.mod -db trna10n.seq -alphabetdef "RNAnumbers 5" align2model testn -modelfile testn.mod -db trna10n.seq -id TRNAn1 -id TRNAn2 -id TRNAn3 -alphabetdef "RNAnumbers 5"In this alphabet, the five legal characters are the number is zero through three as regular characters, in the number four as the required any-character wildcard. When such an alphabet is defined, SAM creates a flat regularizer as with other defined alphabets. Especially for large alphabets, the user is strongly recommended to provide SAM with a different regularizer based on the background probabilities of the alphabet. (For numeric alphabets, the Order line is not used.) See Section 7.1.3..
The alphabet information from the above commands is:
########### Track 0 alphabet ############## #InternalID: 35 Name: RNAnumbers Comment: <none> Length: 4 WildCards: 1 FLength: 4 FWildCards: 1 Letters: <numeric alphabet> Convert: <none> MatchRegularizer: 0 0.250000 1 0.250000 2 0.250000 3 0.250000 # Total match regularizer weight is 1.000000 InsertRegularizer: 0 0.250000 1 0.250000 2 0.250000 3 0.250000 # Total insert regularizer weight is 1.000000 TransitionRegularlizer: IntoDeleteFrom: D 1.886984 M 0.254944 I 0.376488 IntoMatchFrom: D 1.819972 M 15.521340 I 3.764209 IntoInsertFrom: D 0.225758 M 0.265967 I 4.006562 ########### Track 1 alphabet ############## ########### Track 2 alphabet ############## ########### Track 3 alphabet ############## ########### Track 4 alphabet ############## ########### Track 5 alphabet ##############
Alignment (.a2m) files use the symbols `*', '-', and '.' in a column proceeding each numeric character to indicate whether or not that character was matched,is a blank (delete) character, or is an insert character. Presently, SAM and the prettyalign program are not able to read these alignments.
>TRNAn1 - *1*1*1 *1*0*3*1*3*0*1*2*3*2 *0*1- - *3*1*1 *3*0*1 *0*1*2*1*2*0*3*1*2*3 *3*2*1*2*0*3*1*3*0*3 *1*0*1*1*2*2*2*2*1*1 *1*3*3*2*1*0*3*2*2*2 *2*1*1*2*0*3*2*3*2*2 *0 >TRNAn2 - *1*2*1 *1*2*2*1*3*2*1*3*2*3 *0*1*3*2*3*1*1.0.3*3 *0*1*1*0*2*1*2*3*1*1 *2*2*3*2*2*2*0*0*1*2 *2*0*1*2*0*0*3*2*2*2 *1*1*1*3*3*2*1*0*0*3 *2*2*2*1*1*2*1*1*2*2 *1*2*0 >TRNAn3 .1.1.2.2.2.3.1.3.1.1 - - - - - - - - - - - - *2*3 *0*1*2*3*1*1*3 *2*0 *0*0*1*2*1*2*2*3*1*3 *2*3*0*1*3*0*0*0*2*0 *1*1*0*1*0*3*2*2*3*1 *1*1*3*3*2*1*0*0*3*2 *2*2*0*1*2*1*1*1*1*2 *2*3*2*2.0
The distribution of SAM limits the use of these alphabets to 255 characters and a wildcard for efficiency concerns with respect to the standard alphabets. A version capable of handling larger alphabets can be obtained by sending electronic mail to sam-info@soe.ucsc.edu.
Background frequencies for internal alphabets may be modified, and background frequencies for user-defined alphabets may be set by specifying one or more background files with the alphbackfile keyword.
If one or more alphabets have been set, then only the set alphabets are modified, otherwise all internal alphabets are modified (useful when using listalphabets to verify the changes). When reading an alphbackfile, SAM proceeds as follows:
The file format reads files such as the following, but only pays attention to the Alphabet, Order, and Probs lines.
# Alphabet background fileClassName = BackgroundProbs Alphabet = dssp_ehl2 Order = E H L Probs = 0.2277 0.3668 0.4055 EndClassName = BackgroundProbs
ClassName = BackgroundProbs Alphabet = STRIDE Order = B C E G H I T Probs = 0.0114 0.1810 0.2221 0.0384 0.3420 0.0001 0.2049 EndClassName = BackgroundProbs
For the Codon alphabet, the order statement only includes the four nucleotides. For example, an order statement that listed ``T G C A'' would be followed by a line of 64 values all on a single line in the order of TTT, TTG, TTC, TTA, TGT, ...AAAA.
For numeric alphabets, the order statement, if included, is ignored. Numeric alphabets must always be presented in numeric order.
SAM has three ways of reading sequences. SAM's FASTA (and a2m-format alignment) format reader is by far the quickest. SAM also includes an HSSP alignment file reader and, for many other formats, D. G. Gilbert's readseq package. Because SAM's FASTA reader is tuned both to SAM and to a single format, it can be up to 10 times faster than readseq. Users are advised to convert large databases into FASTA format using readseq or some other program.
When working with formats other than FASTA (including FASTA variants), it is always a good idea to use the checkseq program to ensure that SAM is reading your sequence format correctly. The checkseq program will let you know the number of sequence in a file, the format type, and the total length. See Section 10.12.2.
SAM's modified version of the readseq package by D. G. Gilbert of the Indiana University. The code is based on the February 1, 1993 release, and is included as a subdirectory of the SAM source directory. We are grateful that Gilbert has provided this useful package that may be used by anyone.
The readseq package can read most common formats: examples of all these formats are included in the readseq directory. The formats include
We usually use FASTA format, which looks like this:
; Comments are ignored. >IDENTIFIER LMLDQQTINI IKATVPVLKE HGVTITTTFY KNLFAKHPEV RPLFDMGRQE SLEQPKALAM T >SEQ2 Annotations after identifier are preserved AKHPEVRPLFDMGRQESLEQPKALAMT
For information on other formats, please look through the test files and the Formats file in the readseq directory.
Sequence output will be in FASTA format regardless of the input file format.
Alignment output by align2model and hmmscore is in a FASTA-compatible format in which uppercase letters indicate match states and lowercase letters indicate insertion states and hyphens indicate deletion states (model positions for which the given sequence has no corresponding character). The prettyalign program can be used to line up the match columns of an a2m-format alignment file.
Additionally, align2model can include periods so that its sequence outputs can be visually aligned without the use of prettyalign. If, for example, the longest sequence in a collection is 2000 characters long, all sequences will be filled (using periods) to that longest sequence's alignment length, which will be more than 2000 if any deletion states are used. Thus, allowing the periods to be printed can greatly expand the size of the alignment file. If periods are not desired, the paramater a2mdots can be set to 0. The prettyalign program will work whether or not the a2m format alignment has periods.
SAM can also read HSSP files.
As a relatively untested feature, probabilistic sequences may be read into SAM. A probabilistic sequence is one that, instead of being a sequence of letters from an alphabet, has in sequence a probability distribution across all letters in the alphabet. The input format is a SAM model. When a SAM model is used as a sequence, the probabilities from the match state are used to define each position in the sequence. When comparing a probabilistic sequence to an HMM, the dot product of the HMM state and the probabilistic character is used to decide the probability of the character being generated by the HMM state.
A probabilistic sequence can be specified by provided a model file as a database, for example, -db test.mod.
When alignments are performed between a probabilistic sequence and an HMM, the a2m file uses the most likely character of each state in the probabilistic sequence as a letter. If different letters are desired, the dbguide parameter may be used multiple times to specify files that contain sequences that exactly match the model length of each sequence specified in the list of db databases.
This feature is in the midst of development, and is subject to change.
The buildmodel program uses two sets of sequences: the training
and the test set. Training is performed exclusively on the training
set, and at the end of the model creation, all sequences in the test
set are checked against the model, and the average NLL distance is
reported for both the training and the test set.
Training and test sets can be specified in up to two files each: train, train2, test, and test2. At most Nseq sequences will be read from any one file, so that at most 4Nseq sequences will be read in if four files are specified. The buildmodel program ignores zero-length sequences in the training set file(s) after printing a warning.
The system can also randomly partition sequences into the training and the test set. If Ntrain is set, the system will randomly pick Ntrain sequences from all files specified (training and testing) using the random seed trainseed, and reserve the rest for the test set. By default, the seed is set to the process ID number, which is printed on the output file so that the partition can be reproduced. Sequence partitioning and model training use different random seeds, though both default to the process ID.
Several other programs, such as hmmscore and align2model, take an arbitrary number of sequence database files specified as db. Unlike most variables, repeating the db declaration adds a new file to the list, rather than replacing the previous database file. Zero-length sequences are processed the same way as all other sequences. The related variables, id and not_id, restrict sequences to those with specific identifiers, or exclude those with specific identifiers. If both id and not_id are used on the same command line, sequences that are selected by id but are not selected by not_id will be used.