next up previous contents
: 11 System installation : SAM (Sequence Alignment and : 9 The buildmodel estimation   ܼ


10 Related programs


10.1 align2model and prettyalign

After you have obtained a model of your sequence family, align2model can be used to give a multiple alignment of sequences. Often one is just interested in aligning the sequences that were also used to train the model, but in principle any sequence can be included in the alignment.

The multiple alignment is obtained by aligning each of the sequences to the model by the Viterbi algorithm. This has the advantage that it can be done for each sequence independently, and therefore it is very simple to add new sequences to the multiple alignment. Also, once the model is found, the multiple alignment is very fast and easy to produce.

The program to produce the basic alignment is called align2model. Calling it with no arguments gives a brief explanation. To align all the sequences in trna10.seq use the command:


align2model trna10 -i test.mod -db trna10.seq
This will put an intermediate form of the alignment in the file trna10.a2m. In this FASTA-compatible intermediate format deletions are shown as dashes (`-') and insertions (produced in the insert states of the model) are shown as lower case characters, while periods (`.') are used to fill in the sequences that did not have any insertions if a2mdots is set to 1, the default.

In case you only want to align a few sequences in a large file, you can specify the identifiers of these sequences on the command line. For instance


align2model trna2 -i test.mod -db trna10.seq -id TRNA1 -id TRNA9
will only align the two specified sequences. The output (in trna2.a2m) would look like this:

>TRNA1
ggGGAUGUAGCUCAG-.UGGUAGAGCGCAUGCUUCG.CAUGUAUGAGGcC
CCGGGUUCGAUCCCCGGCAUCUCCA----
>TRNA9
cgGCACGUAGCGCAGCcUGGUAGCGCACCGUCCUGGgGUUGCGGGGGU.C
GGAGGUUCAAAUCCUCUCGUGCCGACCA--

The not_id option can be used to exclude specific sequences from being aligned, and has precedence over the id option.

The Viterbi algorithm finds the single best path for the alignment -- that which has the highest probability. The SAM system can also find the single most likely path. Internally, this corresponds to using the EM algorithm to evaluate all possible paths, calculating the probability of each residue appear in each model state and the probability of use for each transition (the posterior probabilities). A second (Viterbi-style) dynamic programming is then performed on these probabilities to find the most likely path (see, for example, Holmes and Durbin in Recomb98).

SAM includes two flavors of posterior-decoded alignment. When adpstyle is 4, posteriors are calculated using a global (SW is 0) forward calculation. Then, the minimizing path through the transition and character emission posteriors is calculated using the specified SW mode.

When adpstyle is 5, posteriors for character emissions are calculated using the specified global or local SW mode. Then, the minimizing path for character emission posteriors only is calculated using a global dynamic programming mode.

These alignment methods are much slower than standard Viterbi alignment, and have not yet been implemented for reduced memory use. The maxposdecodemem specifies the numbers of bytes that may be allocated in to this calculation. If the sequence is too long to perform this alignment calculation in the space specified by maxposdecodemem, a warning messages printed and the Viterbi algorithms used.

10.1.1 prettyalign

To get a nice display of the alignment produced by align2model, you can use the prettyalign program, which has several display options. The program reads from a file like the one made in the example above:


prettyalign trna10.a2m > trna10.pretty
which would give you an alignment similar to the one shown in the Section 3

Prettyalign does not follow SAM's normal commandline format. To see an explanation of the various options, run the program with some invalid option (like prettyalign -h). Some of the most useful options are

-f
Print in a FASTA-like format.
-i
Do not include sequence identifiers in front of each line.
-l num
Set the output line length equal to num.
-n
Toggle indexing the sequences, as well as labeling them.
-c
Toggle column numbering.
-m
Set maximum insertion length (longer insertions are printed as their length).
-I
I-G style alignment. Also sets maximum insertion length very high.
-b
Generate BLAST-style alignments.
The commands

align2model  trna3 -i test.mod -db trna10.seq -id TRNA1 -id TRNA2 -id TRNA9
prettyalign trna3.a2m -l 50 > trna3.pretty
gives the following output


                10            20        30        
                 |             |         |        
TRNA1 ggGGAUGUAGCUCAG-.UGG...UAGAGCGCAUGCUUCG.CAUG
TRNA2 gcGGCCGUCGUCUAGU.CUGgauUAGGACGCUGGCCUCC.CAAG
TRNA9 cgGCACGUAGCGCAGCcUGG...UAGCGCACCGUCCUGGgGUUG

       40          50        60        70                |           |         |         |        TRNA1 UAUG.AGGcCCCGGGUUCGAUCCCCGGCAUCUCCA---- TRNA2 CCAGcAAU.CCCGGGUUCGAAUCCCGGCGGCCGCA---- TRNA9 CGGG.GGU.CGGAGGUUCAAAUCCUCUCGUGCCGACCA--

The prettyalign program can compress long insertions to only the initial segment of bases in the insertion plus digits representing the total length of the insertion. For example, the sequence GacguacguG could be printed out as Ga8guG if 4 was the largest number of insertions that was to be allowed (note that the character 8 is using up one of the positions). By default, insertions of up to length ten thousand are fully printed. This can be changed with the -m flag to prettyalign, which sets the maximum number of insertions that are printed. If set to zero, no insertions are printed, and no indication of the lack is given. If less than zero, insertion characters are not printed, and that number of digits is used to indicate the length of each insertion.

The -I switch will create a compatible IG-style alignment file which may be converted to other formats using the readseq package included as a subdirectory of SAM. The -I option automatically sets a high value for the insertion length parameter.

The -b switch generate a BLAST-style output, comparing the first sequence in the alignment (by default) to each subsequent sequence, and generating pairwise alignments with a middle line highlighting identical residues and conservative substitutions. These pairwise alignments do not include terminal inserts: These pairwise alignments run from the first to last residue aligned by either sequence. They do not include terminal inserts; they include insernal inserts only if some sequence in the pairwise alignment is inserting residues. The -R option can be used to specify a different sequence to use as this reference sequence.


10.1.2 Aligning fragments and SW

Consider a model of 100 nodes and a fragment of 25 that very closely matches some contiguous section of the model. Even though that section would align very well, the overall alignment of the fragment could be quite poor because of its need to use 75 delete states in the model. The problem in this example of global alignment is that in addition to modeling conserved regions, the model also models the length of the conserved region.

SAM's SW variable can be used to control the alignment type. Global alignments can be forced by setting SW to 0. The default (2) is local alignment similar to Smith and Waterman method of sequence comparison, which will find the best alignment for any pair of subsequences within two sequences. The same can be done with models, allowing a submodel to match a subsequence. This type of dynamic programming specified with SW 2, the default. In this case, sequences can jump from the initial module (presumably a FIM, automatically added when auto_fim is set) into the match state of any module in the model, and can also jump out of the match state of any module within the model to the delete state of the next-to-last node. The first and next-to-last module are assumed to be FIMs, hence the rational is that a sequence will use the FIM for some period of time to consume characters that do not match the model, then the sequence will jump to the model node corresponding to the start of the fragment, use several model nodes, and then jump to the ending FIM to consume the rest of the sequence.

The probability of these jumps is set by the variables jump_in_prob and jump_out_prob, both of which have a default value of unity. That is, as in the sequence-to-sequence Smith and Waterman, there is no cost associated with jumping in and out of the model.

Setting SW 1 specifies a semi-local alignment. This option allows a sequence to start matching the model at any location (rather than only the begin node) and end at any location (rather than only the end node). This will improve alignment for short sequences that match a segment of the model. (Internally, no FIMs are added to the model and jumps are allowed).

For domains, SW can be set to 3 to match the full model to part of the sequence. (Internally, FIMs are added to the model but jumps are not allowed).

The file trna1frag.seq contains several sequences that contain part or all of TRNA1. The sequence include TRNA1 (72 bases), TRNA1Long (the complete tRNA with additional characters), Long (58 base segment of TRNA1), Medium (34 base segment), Short (6 base segment TRNA1), and AAMediumA, an embedding of Medium within several segments of As to bring it to 176 characters. Additionally, the file contains several (obviously) non-tRNAs of various lengths, all of whose IDs begin with the word `Not'.

When this file is aligned to the model test.mod, created above, the alignment of the sequence and fragments is reasonable, but the non-tRNAs still align the entire model and may even use internal insertion states. (As shall be seen in Section 10.2.4, the scoring of these fragments with the SW option off is not nearly so good as their alignments).


                              10        20        30        40         50  
                               |         |         |         |          |  
TRNA1         gg......GGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
TRNA1Long     aaa13aggGGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
Short         ........-----------------CAUGC---.----
ShortReverse  u.......--CGUA-------------------.----
Medium        ........----------GAGCGCAUGCUUCGCAUGUAUGAGG.CCCCGGGU
MediumReverse uug20uac--------------GCUUCGUACGCGAG--.----
AAMediumAA    aaa70aaa---------AGAGCGCAUGCUUCGCAUGUAUGAGG.CCCCGGGU
Long          ........----GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
NotShort      a.......----------------------.----
Not           a.......----------------------.----
NotLong       a.......----------------------.----
NotExtraLong  a.......----------------------.----

                    60        70                                     |         |                TRNA1         CGAUCCCCGGCAUCUCCA----........ TRNA1Long     CGAUCCCCGGCAUCUCCA----aaa11aaa Short         -------------u....... ShortReverse  -------------c....... Medium        -------------u....... MediumReverse -------------........ AAMediumAA    UAAA-----------aaa68aaa Long          CGAUCCCCGGCAU------........ NotShort      -AAA-----------aaa9aaaa Not           -AAA-----------aaa69aaa NotLong       -AAA-----------aa105aaa NotExtraLong  -AAA-----------aa434aaa

Alignment with the SW option set to 1 is much the same (again, scoring will be improved), though the alignment procedure has managed to better isolate the AAMediumAA sequence's tRNA core, modeled by match states, from its prefix and postfix, modeling by internal insertion nodes.


                                10              20        30        40     
                                 |               |         |         |     
TRNA1         ggG........GAUGUAGCUCAG-UGG......UAGAGCGCAUGCUUCGCAUGUAUGAGGc
TRNA1Long     ..-aaa14gggGAUGUAGCUCAG-UGG......UAGAGCGCAUGCUUCGCAUGUAUGAGGc
Short         ..-........--------......----CAUGCU-------.
ShortReverse  ..-........--------......-------UCGUAC----.
Medium        ..-........--------......-GAGCGCAUGCUUCGCAUGUAUGAGGc
MediumReverse ..-........------UUGGgccccgGAGUAUGUACGCUUCGUACGCGAG--.
AAMediumAA    ..-........--------......--------------.
Long          ..-........---GCUCAG-UGG......UAGAGCGCAUGCUUCGCAUGUAUGAGGc
NotShort      ..-........--------......-------AAAAAAAAAAAAA.
Not           ..-........--------......--------------.
NotLong       ..-........--------......--------------.
NotExtraLong  ..-........--------......--------------.

                  50        60        70                                   |         |         |                TRNA1         CCCGGGUUCGAUCCCCGGCAUCUCCA----........ TRNA1Long     CCCGGGUUCGAUCCCCGGCAUCUCCAAAAAAAAaaaa.... Short         -----------------........ ShortReverse  -----------------........ Medium        CCCGGGUU-------------........ MediumReverse -----------------........ AAMediumAA    --------------AAAAAaa171aaa Long          CCCGGGUUCGAUCCCCGGCAU------........ NotShort      -----------------........ Not           --------------AAAAAaaa68aaa NotLong       --------------AAAAAaa104aaa NotExtraLong  --------------AAAAAaa433aaa

When alignment is performed using the SW option set to 2, only the core tRNA segments are aligned: the non-tRNAs, as well as the prefix and postfix of AAMediumAA are aligned to the FIMs that have been automatically added to the model. The one problem is that the Short sequence has made some use of the end FIM because it is not long enough to make a really significant hit to the model's internal nodes.


                              10        20        30        40         50  
                               |         |         |         |          |  
TRNA1         gg......GGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
TRNA1Long     aaa13aggGGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
Short         ........-----------------CAUGC---.----
ShortReverse  u.......--CGUA-------------------.----
Medium        ........----------GAGCGCAUGCUUCGCAUGUAUGAGG.CCCCGGGU
MediumReverse uug20uac--------------GCUUCGUACGCGAG--.----
AAMediumAA    aaa70aaa---------AGAGCGCAUGCUUCGCAUGUAUGAGG.CCCCGGGU
Long          ........----GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGcCCCGGGUU
NotShort      a.......----------------------.----
Not           a.......----------------------.----
NotLong       a.......----------------------.----
NotExtraLong  a.......----------------------.----

                    60        70                                     |         |                TRNA1         CGAUCCCCGGCAUCUCCA----........ TRNA1Long     CGAUCCCCGGCAUCUCCA----aaa11aaa Short         -------------u....... ShortReverse  -------------c....... Medium        -------------u....... MediumReverse -------------........ AAMediumAA    UAAA-----------aaa68aaa Long          CGAUCCCCGGCAU------........ NotShort      -AAA-----------aaa9aaaa Not           -AAA-----------aaa69aaa NotLong       -AAA-----------aa105aaa NotExtraLong  -AAA-----------aa434aaa

A different alignment is produced when SW is set to 2 (fully local) and adpstyle is set to 4 (posterior-decoded alignment with transitions). As can be seen in the example below, this option does not neatly cut the sequences to their matching positions, but may produce a better core alignment.


                              10        20        30        40          50 
                               |         |         |         |           | 
TRNA1         gg......GGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
TRNA1Long     aaa13aggGGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
Short         c.......-----------------------.-.--
ShortReverse  u.......--CGUA-------------------.-.--
Medium        ........----------GAGCGCAUGCUUCGCAUGUAUGAGGC.CC.CGGG
MediumReverse uu......----------GGGCCCCGGA------GUAUG.UA.CGCU
AAMediumAA    aaa63aaa------AAAAAAAAGAGCGCAUGCUUCGCAUGUAUGAGGC.CCcGGGU
Long          ........----GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
NotShort      aaaa....-----------------------.-.--
Not           aaa18aaa-----------AAAAAAAAAAAAAAAAAAAAAAAA.AA.AAAA
NotLong       aaa52aaa-------------------AAAAA-.-.--
NotExtraLong  aa218aaa-----------------A-----.-.--

                     60        70                                      |         |                TRNA1         UCGAUCCCCGGCAUCUCCA----........ TRNA1Long     UCGAUCCCCGGCAUCUCCAAAAAAAAaaaa.... Short         -----------AUGCU........ ShortReverse  -------------c....... Medium        U-------------u....... MediumReverse UCGUACGC---------gag..... AAMediumAA    UAAAAAAAAAAAAA------aaa58aaa Long          UCGAUCCCCGGCA-------u....... NotShort      ---------AAAAAAAAA........ Not           AAAAA-----------aaa20aaa NotLong       -------------aaa52aaa NotExtraLong  -------------aa219aaa

As well as when SW is set to 2 (fully local) and adpstyle is set to 5 (posterior-decoded alignment solely on character emission).


                              10        20        30        40          50 
                               |         |         |         |           | 
TRNA1         ggg.....-GAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
TRNA1Long     aaa14ggg-GAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
Short         cau.....-----------------------.-.--
ShortReverse  ucg.....-----------------------.-.--
Medium        g.......----------AGCGCAUGCUUCGCAUGUAUGAGGC.CC.CGG-
MediumReverse uug18ugu-----------------------.-.--
AAMediumAA    aaa70aaa---------AGAGCGCAUGCUUCGCAUGUAUGAGGC.CCcGGGU
Long          gc......-----UCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCC.GGGU
NotShort      aaaaaaa.-----------------------.-.--
Not           aaa36aaa-----------------------.-.--
NotLong       aaa54aaa-----------------------.-.--
NotExtraLong  aa219aaa-----------------------.-.--

                     60        70                                      |         |                TRNA1         UCGAUCCCCGGCAUCUCCA----........ TRNA1Long     UCGAUCCCCGGCAUCUCCAAAAA--aaaaaaa. Short         -------------gcu..... ShortReverse  -------------uac..... Medium        -------------guu..... MediumReverse -------------acg16gag AAMediumAA    UAAA-----------aaa68aaa Long          UCGAUCCCCGGC-------au...... NotShort      -------------aaaaaa.. Not           -------------aaa37aaa NotLong       -------------aaa55aaa NotExtraLong  -------------aa219aaa

Here is the posterior-decoded (adpstyle 4) global alignment (SW is 0).


                                10        20        30        40         50
                                 |         |         |         |          |
TRNA1         ggG........GAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
TRNA1Long     aa-aaa12gggGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
Short         ..-........----------------------.---
ShortReverse  ..-........---------------UCGUAC-----.---
Medium        ..-........---------GAGCGCAUGCUUCGCAUGUAUGAGGC.CCCGG
MediumReverse ..-........--------UUGGGCCCCGGA------GUAUG.UACGC
AAMediumAA    ..A........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
Long          ..-........---GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
NotShort      ..-........----------------AAAAA----.---
Not           ..A........AAAAAAAAAAAA--AAAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
NotLong       ..A........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
NotExtraLong  ..A........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA

                      60        70                                       |         |                TRNA1         UUCGAUCCCCGGCAUCUCCA----........ TRNA1Long     UUCGAUCCCCGGCAUCUCCAAAAAAAAaaaa.... Short         -----------CAUGCU........ ShortReverse  --------------........ Medium        GUU------------........ MediumReverse UUCGUA--------CGCGAG........ AAMediumAA    AAAAAA------AAAAAAAAAAaa115aaa Long          UUCGAUCCCCGGCAU------........ NotShort      --AAA--------AAAAA........ Not           AAAAAAAAAAAAAAAAAAAAAAAAAAA........ NotLong       AAAAAA------AAAAAAAAAAaaa48aaa NotExtraLong  AAAAAA------AAAAAAAAAAaa377aaa

Here is the posterior-decoded (adpstyle 5) global alignment (SW is 0).


                                10        20        30        40         50
                                 |         |         |         |          |
TRNA1         ggG........GAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
TRNA1Long     aa-aaa12gggGAUGUAGCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
Short         ..-........----------------------.---
ShortReverse  u.-........-CG--------------------.---
Medium        ..-........---------GAGCGCAUGCUUCGCAUGUAUGAGGC.CCCGG
MediumReverse ..-........------UU--GGGCCCCGGAGU------AUG.UACGC
AAMediumAA    a.-........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
Long          ..-........---GCUCAG-UGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCcCCGGG
NotShort      a.-........---------AA-------AA-----.---
Not           a.-........AAAAAAAAAAAAA--AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
NotLong       a.-........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA
NotExtraLong  a.-........AAAAAAAAAAAA---AAAAAAAAAAAAAAAAAAAAAAAAAAA.AAAAA

                      60        70                                       |         |                TRNA1         UUCGAUCCCCGGCAUCUCCA----........ TRNA1Long     UUCGAUCCCCGGCAUCUCCAAAAAAAAaaaa.... Short         -----------CAUGCU........ ShortReverse  ------------UAC........ Medium        G------------UU........ MediumReverse UUCGUA--------CGCGAG........ AAMediumAA    AAAAAA-----------aa125aaa Long          UUCGAUCCCCGGCA------U........ NotShort      ----------AAAAAAAA........ Not           AAAAAAA----------aaa20aaa NotLong       AAAAAA-----------aaa58aaa NotExtraLong  AAAAAA-----------aa387aaa


10.2 hmmscore

Any sequence can be compared to a model by calculating the probability that the sequence was generated by that model. Taking the negative (natural) logarithm of this probability gives the NLL score. For sequences of equal length the NLL scores measures how `far' they are from the model, and it can be used to select sequences that are from the same family. However, the NLL score has a strong dependence on sequence length and model length. Hmmscore provides several less biased means of scoring by reporting NLL scores as the difference between a null model and trained model NLL score (a log-odds score, as used in HMMER).

Null model scoring is discussed in more detail in the Barrett, Karplus, and Hughey paper mentioned in the introduction and available from the SAM WWW page.
(http://www.cse.ucsc.edu/research/compbio/papers/nullmod/nullmod.html).

The program hmmscore can find NLL and NLL$-$NULL (log-odds) scores. E-values can be calculated for the reverse sequence null model. The most common operation is to calculate NLL$-$NULL scores for a large number of sequences. This can be done by supplying the name of the model file and one or more sequence database files on the command line, optionally followed by hmmscore parameter specifications. For instance, for the example files described earlier the NLL scores are found the following way


hmmscore test -insert test.mod -db trna10.seq -sw 2

Produces the file test.dist already displayed:


%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  test   Host: peep    Fri Jul 15 13:45:00 2005
%  rph    Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files:  test.mod
% Database Files:  /projects/kestrel/rph/sam32/demos/trna10.seq
%
% Subsequence-submodel (local) (SW = 2)
% Simple scores adjusted by +1.5*ln(seq len) (adjust_score = 2)
% Track 0 FIMs added (geometric mean of match probabilities (6))
% Single Track Model:  test.mod
% Score DP Method: forward all-paths (dpstyle = 0)
% Align DP Method: viterbi (adpstyle = 1)
% 10 sequences, 747 residues, 78 nodes, 0.01 seconds
%
% Sequence scores selected:  All  (select_score=8)
%
% Simple: NLL-NULL using FIM probabilities
% Reverse: NLL-NULL for the reverse sequence NULL model
%    Calculated when Simple < simple_threshold (10000.00)
% E-value on N (=10) sequences:
%    N / (1 + exp(-(lambda(=1.0000) * Reverse)**tau(=1.0000)))
%    Calculated when Simple < simple_threshold (10000.00)
%    Rescale E-values or use -dbsize for multiple scoring runs.
%    WARNING:  E-VALUES ARE NOT CALIBRATED!
% Scores sorted by E-value, best first
%
% Sequence ID   Length      Simple     Reverse     E-value      X count
TRNA7              73       -36.19      -28.50     4.18e-12
TRNA2              76       -35.48      -28.13     6.08e-12
TRNA4              75       -35.66      -27.90     7.65e-12
TRNA9              77       -35.54      -27.82     8.29e-12
TRNA10             76       -35.21      -27.69     9.39e-12
TRNA1              72       -34.85      -27.32     1.37e-11
TRNA8              75       -36.32      -27.26     1.44e-11
TRNA5              73       -35.10      -26.36     3.58e-11
TRNA3              76       -34.43      -26.26     3.93e-11
TRNA6              74       -34.59      -23.81     4.56e-10

As discussed in Section 3.4, the score file contains six columns. The first is the sequence identifier, followed by sequence length, the `NLL-NULL' score using a simple null model. The next score column is either the raw NLL score if only the simple null model is calculated (the default), or the complex or reverse sequence null model's `NLL-NULL' if one of the more time-consuming null models is used, as discussed below. If E-values are calculated, they are listed after the two score columns. Last, the number of all-character wildcards in each sequence is listed for those that have any wildcards.

By default, hmmscore uses the EM scoring method, just as is used to train a model. If desired, scores can be based on exact alignment to the model, multiplying the probabilities along the best path rather than all paths. This method, which corresponds to the forward half of align2model, can be turned on by setting viterbi to 1. Viterbi scoring is appropriate for finding out how good a sequence's best alignment to a model is.

The E-value computation is based on a simple, but somewhat unrealistic, assumption that the scores for the sequence and the reversed-sequence are independent draws from an extreme-value (Gumbel) distribution:

\begin{displaymath}P( \mbox{score} < a) \approx 1- e^(-k e^{\lambda a}) \ .
\end{displaymath}

Subtracting the two scores (derivation not published yet) gives

\begin{displaymath}P( \mbox{diff} < a) \approx {1 \over 1+ e^{-\lambda a}} \ .
\end{displaymath}

The E-value is the expected number of sequences with that good a score, so is simply the probability of seeing that negative a difference, multiplied by the number of sequences scored (or dbsize, if that is specified).

The only parameter that needs to be estimated is the natural scaling $\lambda$. Since we use natural logs in computing our probabilities, we set $\lambda=1$, which seems to be correct experimentally. Thus we compute our E-values with no parameter fitting at all--it is based purely on theoretical considerations.

The calibrate option of hmmscore can be used to calibrate the lambda values for the E-value calculation. Once a model library has been created, it can be specified to hmmscore using the model_library directive (or modlib alias). See Section 10.2.10.

The hmmscore program can also be used to select sequences according to various criteria.

Plots of the NLL-NULL scores can be used to visually look for a break between significant and insignificant matches. See Section 10.11.


10.2.1 NLL-NULL scoring

SAM includes several possibilities for NULL model scoring. The null model can be a simple probability distribution, effectively a model with a single FIM. The null model can be any model specified in SAM format, with the key word `NULLMODEL' (rather than, for example, `MODEL' or `REGULARIZER'), or the first model in a file specified with the nullmodel_file parameter. The null model can be the reversed HMM, or equivalently, the score of the reversed sequence through the original HMM.

To report differences between the model NLL score and the simple null model score (possibly modified by FIM_method_score, see below), set the subtract_null variable to 1. To report differences between two models (for example, one trained on positive family examples and one trained on negative examples of a family), set subtract_null to 3. To report differences between the score of the sequence and the score of the reversed sequence, which provides an automatic adjustment for compositional bias and allows the simple calculation of E-values, set subtract_null to 4 (the default). To report scaled reverse null model scores, set subtract_null to 5. The complex null model (previously subtract_null 2) is no longer supported, and reverts to the simple null model.

Because the difference between the sequence and reverse sequence scores automatically adjusts for model and sequence length, SAM is able to add E-values to the score file in this case. The command


hmmscore testrev -i test.mod -db trna10.seq -subtract_null 4 -sw 2
produces a score file that includes both simple and reverse-sequence null model scores:

%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  testrev   Host: peep    Fri Jul 15 13:45:01 2005
%  rph       Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files:  test.mod
% Database Files:  /projects/kestrel/rph/sam32/demos/trna10.seq
%
% Subsequence-submodel (local) (SW = 2)
% Simple scores adjusted by +1.5*ln(seq len) (adjust_score = 2)
% Track 0 FIMs added (geometric mean of match probabilities (6))
% Single Track Model:  test.mod
% Score DP Method: forward all-paths (dpstyle = 0)
% Align DP Method: viterbi (adpstyle = 1)
% 10 sequences, 747 residues, 78 nodes, 0.01 seconds
%
% Sequence scores selected:  All  (select_score=8)
%
% Simple: NLL-NULL using FIM probabilities
% Reverse: NLL-NULL for the reverse sequence NULL model
%    Calculated when Simple < simple_threshold (10000.00)
% E-value on N (=10) sequences:
%    N / (1 + exp(-(lambda(=1.0000) * Reverse)**tau(=1.0000)))
%    Calculated when Simple < simple_threshold (10000.00)
%    Rescale E-values or use -dbsize for multiple scoring runs.
%    WARNING:  E-VALUES ARE NOT CALIBRATED!
% Scores sorted by E-value, best first
%
% Sequence ID   Length      Simple     Reverse     E-value      X count
TRNA7              73       -36.19      -28.50     4.18e-12
TRNA2              76       -35.48      -28.13     6.08e-12
TRNA4              75       -35.66      -27.90     7.65e-12
TRNA9              77       -35.54      -27.82     8.29e-12
TRNA10             76       -35.21      -27.69     9.39e-12
TRNA1              72       -34.85      -27.32     1.37e-11
TRNA8              75       -36.32      -27.26     1.44e-11
TRNA5              73       -35.10      -26.36     3.58e-11
TRNA3              76       -34.43      -26.26     3.93e-11
TRNA6              74       -34.59      -23.81     4.56e-10

We are presently experimenting with a scaled reverse null model score (subtract_null set to 5). This is computation is inspired by the PSI-BLAST approach. The scaled score is the reverse null model score divided by the average simple null model score per character.

The NLL-NULL scores, especially for the simple null model, are most useful when the model has had free insertion modules (Section 8.5) added to it. Then, the null model and the FIMs will cancel out, and the score will be based primarily on the section of the sequence that matches the region that has been modeled. By default, hmmscore automatically adds FIMs to any model that does not already contain them when SW scoring is used. To change this, if for example you want to ensure that entire sequences are modeled, rather than simply subregions, change auto_fim from its default value of 1 to 0. The auto_fim variable has no effect when a user-specified null model is used.

Since NLL-NULL scores are negative logs, the lower the better. In the case above, all of the tRNAs have been positively identified as tRNAs. (Not surprising as they were all in the training set!)

New to Version 1.2 is the ability to adjust the null model scoring. Since this determines the probability that a sequence was randomly generated according to the residue insertion probabilities, these values should reflect knowledge of the problem domain. Five possibilities are offered. The flat distribution or the background distribution of amino acids over all proteins can be used. Both of these distributions are invariant over all families, and are thus a simplistic assumption. The distribution can also be the distribution of the residues in the training set or the average residue distribution over all columns (match states) modeled by the HMM. The advantage of these two, especially the latter, is the ability to partially correct for compositional bias in the sequences. Lastly, the insertion probabilities can reflect the residue distribution of the sequence currently being scored. This is the most pessimistic null model, as it demands not that the HMM model the sequence better than fixed background frequencies, but that is model the sequence significantly better than frequencies exactly matching the sequence's composition.

So the options available for FIM_method_score are

0
Use the tables present in the model.
1
The relative frequencies of residues in the training sequences (from the Lettercounts node or the training sequences).
2
The relative frequencies of residues in model match states (from the Frequency node).
3
Uniform (flat) probability over all residues.
5
Amino acid background frequencies over all proteins (from the Generic node).
6
Geometric average of the match state probabilities in the model.

The default setting, for experimental and statistical reasons, is the geometric average of the model match states (6). The insertion tables can be similarly modified with Insert_method_score, the default of which is no change (0).

As with the training methods, if the method value is negative instead, the FIMs and insert tables will only be modified if there is no initial model read in (an unlikely occurrence for hmmscore).

The FIM scoring methods are only applied to the primary track of multitrack models. Other tracks never have their insert or FIM tables changed, and if FIMs are added to the model either the character table of the generic node is used (when present) or the geometric match state average (when there is no generic node).

The FIM scoring methods are not applied to the user's null model.

A more detailed discussion of these issues can be found in (Barrett, Hughey & Karplus 1996), mentioned in the Introduction.


10.2.2 Reducing hmmscore runtime

Full evaluation of a hidden Markov model takes time. The dynamic programming algorithm underlying hmmscore and buildmodel requirest time proportional to the product of the length of the model and the total length of the sequences.

The choice of dynamic programming style (dpstyle) also has a dramatic effect on execution time. The default all-paths (EM) method (dpstyle 1) multiplies and adds probabilities represented as log-probabilities in each of the $O(n^2)$ dynamic programming cell calculations. Adding probabilities represented as log-probabilities is internally sped with table-lookup, but still takes time. The single-best-path (Viterbi) method (dpstyle 0) multiplies and maximizes probabilities represented as log-probabilities, where the operations become addition and minimization, and thus is faster. The posterior-decoded (PD) alignment and scoring methods (dpstyle 5) perform an EM pass followed by a Viterbi pass, and use far more memory, and are the slowest. EM scoring is approximately twice as slow as Viterbi scoring, and PD scoring is three times slower than EM because a full forward-backward algorithm (as used in training) is performed followed by the Viterbi algorithm. For training models with buildmodel, Viterbi is about X times faster than EM. The SAM-T04 iterative model building procedure uses Viterbi training and full EM multiple domain scoring with the reverse null model.

The default reverse null model means that two dynamic programming calculations are done for each sequence. The simple_threshold variable can be used to control when the second calculation is performed based on the simple null model. When set to less than 10000, the reverse (or user) null model will only be evaluated when the simple null model score is less than simple_threshold. Sequences for which the more time consuming null model was not calculated will have the score 10000 in the second score column of the distance file. The default value of simple_threshold is 10000 (off); a setting of $0$ is a reasonable setting to halve the running time of database searches, though occasionally some positive E-values may be lost. It is suggested that users experiment on their own data to ensure results are not lost. Runtime for a large database search will be sped by approximately one half.

Independtly, when EM or PD scoring is used, the viterbi_threshold variable may be used to first calculate a single-best-path (Viterbi) simple null model score, and then decide whether or not to perform the EM or PD scoring based on that score. As with simple_threshold, the default value is 10000, and lower values will activate this option. A setting of 0 is also reasonable. This can reduce PD execution time by about ZZ, and EM execution time by about YY.


10.2.3 Selecting sequences, scores, and alignments

Sequences can be selected by hmmscore and placed in a .sel file.

A selection mode is chosen by setting select_seq. If 0, no sequences are selected; if 1, sequences are selected according to their simple null model scores and NLLNull; if 2, sequences are selected according to their column 2 score (user or reverse sequence null, or raw NLL score if subtract_null is 0 or 1) and NLLcomplex; if 4, sequences are selected according to their E-value and Emax; if 8, all sequences are selected. Selection criteria can be combined: 3 requires sequences to score better than NLLnull with the simple null model and NLLcomplex with the complex (user or reverse sequence) null model. Negative numbers indicate that sequences that do not pass the corresponding positive test should be selected.

The following will place labeled copies of all sequences scoring lower than -35 with the reverse sequence null model into test.sel.


hmmscore tests -i test.mod -db trna10.seq -select_seq 2 -NLLcomplex -35 -sw 2

Selected sequences are written out in the same order they are encountered in the database files, which may be different from the order they are listed in the score file if scores are sorted. The sortseq program can be used to write out the sequences in the same order they are listed in the score file. See Section 10.12.8. The sort variable controls whether sequence scores are unsorted (0); sorted by E-value (4), column 2 of the distance file (2), or column 1 of the distance file (1). Sequences in the select file will be selected by other than the sorting criteria unless sort and select_seq are set to corresponding values.

The select_score variable can be set in the same manner as select_seq, in which case only scores of those sequences that match the specified criteria will be recorded in the distance file. This is particularly useful for database searches in which only sequence IDs are of interest.

The select_align variable can be used in a similar way to cause selected sequence alignments to be placed in the runname.a2m file. Note that all selection variables use the same NLLnull, NLLcomplex, and Emax thresholds, though different combinations of the thresholds can be used by the different parameters.

If the binary expansion of the many_files variable includes a '2' (e.g., 2, 3, 6, 7), the score information is sent to standard output. If the binary expansion of the many_files variable includes a '4' (e.g., 4, 5, 6, 7), the multiple domain score information is sent to standard output. (See Section 10.2.5.) If the binary expansion of the many_files variable includes a '1' (e.g., 1, 3, 5, 7), buildmodel will create multiple files. See first two options enable UNIX pipe processing, such as:


hmmscore tests -i test.mod -db trna10.seq | awk  '$1 !~ % { print $5 "\t" $1}'
to print a file with an evalue and an identifier on each line after removing the comment lines. If both options are in force (e.g., 6, 7), both the multiple domain and the simple scoring values will be sent to standard output in an undefined order. In a terminal window, you will also see hmmscore's standard error output of various diagnostic messages.


10.2.4 Scoring Fragments

As with alignments, the SW parameter can be used to set global (0), semi-local (1), local (2), and domain (3) scoring styles. See Section 10.1.2.

The default local scoring is is similar to Smith and Waterman method of sequence comparison, which will find the best score for any pair of subsequences within two sequences. The same can be done with models, allowing a submodel to match a subsequence. With local scoring, sequences can jump from the initial module (presumably a FIM) into the delete state of any module in the model, and can also jump out of the delete state of any module within the model to the delete state of the next-to-last node. The first and next-to-last module are assumed to be FIMs, hence the rational is that a sequence will use the FIM for some period of time to consume characters that do not match the model, then the sequence will jump to the model node corresponding to the start of the fragment, use several model nodes, and then jump to the ending FIM to consume the rest of the sequence.

The probability of these jumps is set by the variables jump_in_prob and jump_out_prob, both of which have a default value of unity. That is, as in the sequence-to-sequence Smith and Waterman, there is no cost associated with jumping in and out of the model.

The file trna1frag.seq contains several sequences that contain part or all of TRNA1. The sequence include TRNA1 (72 bases, TRNA1Long (the complete tRNA with additional characters), Long (58 base segment) TRNA), Medium (34 base segment), Short (6 base segment TRNA1), and AAMediumA, an embedding of Medium within several segments of As to bring it to 176 characters. Additionally, the file contains several (obviously) non-tRNAs of various lengths, all of whose IDs begin with the word `Not'.

When this file is scored using hmmscore with SW set to 0 for global alignment, the scores of the full sequence and the long fragment place them clearly as tRNAs. However, the score of the short and medium fragments are greatly penalized by the large number of delete states they must use.


TRNA1Long           94       -36.25      -28.73     4.01e-12	:TRNA1 with flanking As                            
TRNA1               72       -34.85      -27.32     1.64e-11	:the original sequence                             
Long                58       -27.64      -20.06     2.33e-08	:A long partial segment from TRNA1                 
AAMediumAA         176       -16.81       -9.61     8.04e-04	:The medium segment with long flanking A regions   
Medium              34       -14.62       -7.63     5.84e-03	:A medium-length segment from TRNA1                
ShortReverse         6        -5.77       -0.11     5.68e+00	:A short reverse segment from TRNA1                
NotShort            13        -6.42        0.00     6.00e+00	:A short segment that is not TRNA1                 
Not                 73        -6.84        0.00     6.00e+00	:A string of As of equal length to TRNA1           
NotExtraLong       438        -6.90        0.00     6.00e+00	:A string of As much longer than TRNA1             
NotLong            109        -6.86        0.00     6.00e+00	:A string of As longer than TRNA1                  
Short                6        -5.66        0.11     6.32e+00	:A short segment from TRNA1                        
MediumReverse       34        -7.00        7.63     1.20e+01	:The reversal of medium                            
(Several of the sequences had simple null model scores worse than simple_threshold, so their reverse null model scores were not calculated.)

When scoring is performed using the SW option set to 1, the following score file is generated, which places even the short fragment in the possible tRNA range.


TRNA1               72       -31.14      -37.43     6.69e-16	:the original sequence                             
TRNA1Long           94       -19.56      -28.35     5.88e-12	:TRNA1 with flanking As                            
Long                58       -24.76      -27.13     1.99e-11	:A long partial segment from TRNA1                 
Medium              34       -12.10      -13.03     2.64e-05	:A medium-length segment from TRNA1                
Short                6        -1.47       -0.04     5.87e+00	:A short segment from TRNA1                        
NotExtraLong       438        26.74       -0.00     6.00e+00	:A string of As much longer than TRNA1             
NotLong            109        10.22       -0.00     6.00e+00	:A string of As longer than TRNA1                  
NotShort            13         0.58        0.00     6.00e+00	:A short segment that is not TRNA1                 
Not                 73         8.16        0.00     6.00e+00	:A string of As of equal length to TRNA1           
AAMediumAA         176        13.75        0.00     6.00e+00	:The medium segment with long flanking A regions   
ShortReverse         6        -1.43        0.04     6.13e+00	:A short reverse segment from TRNA1                
MediumReverse       34         0.93       13.03     1.20e+01	:The reversal of medium                            

When scoring is performed using the SW option set to 2, the following score file is generated, which also picks up the sequence AAMediumAA, which is a segment of a tRNA embedded within a longer sequence.


TRNA1Long           94       -36.25      -28.73     4.01e-12	:TRNA1 with flanking As                            
TRNA1               72       -34.85      -27.32     1.64e-11	:the original sequence                             
Long                58       -27.64      -20.06     2.33e-08	:A long partial segment from TRNA1                 
AAMediumAA         176       -16.81       -9.61     8.04e-04	:The medium segment with long flanking A regions   
Medium              34       -14.62       -7.63     5.84e-03	:A medium-length segment from TRNA1                
ShortReverse         6        -5.77       -0.11     5.68e+00	:A short reverse segment from TRNA1                
NotShort            13        -6.42        0.00     6.00e+00	:A short segment that is not TRNA1                 
Not                 73        -6.84        0.00     6.00e+00	:A string of As of equal length to TRNA1           
NotExtraLong       438        -6.90        0.00     6.00e+00	:A string of As much longer than TRNA1             
NotLong            109        -6.86        0.00     6.00e+00	:A string of As longer than TRNA1                  
Short                6        -5.66        0.11     6.32e+00	:A short segment from TRNA1                        
MediumReverse       34        -7.00        7.63     1.20e+01	:The reversal of medium                            

The unclear separation in this file between the best non-tRNA and the worst tRNA segment is because local alignment allows sequences to use only the parts of the model that improve its score. Thus, a long random sequence will have a better score than a shorter random sequence. Better scores are gained by using the more expensive (to calculate) reverse sequence null model. Setting subtract_null to 4 will differentiate the tRNAs, excepting the 6-nucleotide sequence, from the non-tRNAs. (In this contrived example, the long sequences are composed of the single letter A, and thus the sequence and the reverse sequence score the same going through the HMM.)


TRNA1Long           94       -36.25      -28.73     4.01e-12	:TRNA1 with flanking As                            
TRNA1               72       -34.85      -27.32     1.64e-11	:the original sequence                             
Long                58       -27.64      -20.06     2.33e-08	:A long partial segment from TRNA1                 
AAMediumAA         176       -16.81       -9.61     8.04e-04	:The medium segment with long flanking A regions   
Medium              34       -14.62       -7.63     5.84e-03	:A medium-length segment from TRNA1                
ShortReverse         6        -5.77       -0.11     5.68e+00	:A short reverse segment from TRNA1                
NotShort            13        -6.42        0.00     6.00e+00	:A short segment that is not TRNA1                 
Not                 73        -6.84        0.00     6.00e+00	:A string of As of equal length to TRNA1           
NotExtraLong       438        -6.90        0.00     6.00e+00	:A string of As much longer than TRNA1             
NotLong            109        -6.86        0.00     6.00e+00	:A string of As longer than TRNA1                  
Short                6        -5.66        0.11     6.32e+00	:A short segment from TRNA1                        
MediumReverse       34        -7.00        7.63     1.20e+01	:The reversal of medium                            

Significance levels for the simple null model, especially for the second option, change greatly. In the first option, because sequences can start at any location (e.g., the initial FIM) and jump out of any location (e.g., also the initial FIM), no sequence will have scores worse than zero -- even non-family members will have a negative NLL$-$NULL score. However, the significance level will be similar to that of standard scoring.

In the second option, the number of placements of a sequence to the model is essentially the number of starting points of the sequence plus the number of exit points once the sequence has started using the core of the model. That is, sequences start in the initial FIM, may at any time jump anywhere into the model, and then jump out again latter. The effect seen in the significance level depends on both the sequence length and the model length, and for comparison between different models, must be done to the scores themselves as they are being generated. If the adjust_score variable is set to 1 and SW=2, all simple null model scores will have added to them the log of the sum of sequence and model length. If adjust_score is set to 2 (the default) and SW=1 or 2, then all scores will have added to them 1.5 times the log of the sequence length. In the future, the adjust_score parameter may be refined as we further explore the length dependence of scores. The adjustment is not sufficient for models that have internal FIMs. See Section 10.2.1.


10.2.5 Selecting multiple domain alignments

The hmmscore program also can create multiple-domain alignments and score files from selected sequences. Prior to Version 2.1, this feature was called the multdomain program. To enable this option, the select_mdalign parameter is set in a manner similar to other selection parameters. See Section 10.2.3.

For each selected sequence, the multiple domain search procedure will locate copies of a single motif within each selected sequence. A user specified select_md selection variable, along with thresholds mdNLLnull, mdNLLcomplex, and mdEmax is the criterion by which a subsequence is judged to be a match to the model. If select_md is 1, whenever an mdNLLnull simple null model score or better is achieved, a match to the model has been found, and so forth. The default select_md value of 4 uses mdEmax (default of 0.01) to select subsequence matches. The adpstyle paramater determines whether Viterbi (1, default), posterior-decoded with transitions and emissions (4), or posterior-decoded with only character emissions (5) alignment is used.

Once this match is found, it is cut from the sequence and another match is looked for. The process terminates when no matches scoring better than the selection criteria are found. Note that the multiple domain scoring procedure always uses Viterbi scoring. Thus, it is theoretically possible for a sequence to be selected by hmmscore for multiple domain search but for no domain to be found even if the selection criteria are the same.

The output is similar to that of align2model, except that for each match the sequence ID is modified to indicate where in the sequence the match occurred. Additionally, all letters in the sequence that are part of the match are capitalized. Unlike align2model, multiple domain search sequence output does not include periods (`.') as spacers: prettyalign must be used to correctly space the multiple alignment.

As an example, the file multtrna.seq contains two sequences, each of which contains two tRNA motifs. The multdomain feature is used by selecting selecting all sequences for a multiple domain


hmmscore multtrna -i testf.mod -db multtrna.seq -select_mdalign 8 -sw 2
prettyalign multtrna.mult -l90 > multtrna.pretty

the file multtrna.pretty generated is


;  SAM: prettyalign v3.5 (July 15, 2005) compiled 07/15/05_11:25:33
;  (c) 1992-2001 Regents of the University of California, Santa Cruz
;
;          Sequence Alignment and Modeling Software System
;         http://www.cse.ucsc.edu/research/compbio/sam.html
;
; --------- Citations (SAM, SAM-T2K, HMMs) ----------
; R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
;  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
; K. Karplus, et al., What is the value added by human intervention in protein
;  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
; A. Krogh et al., Hidden Markov models in computational biology:
;  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
; -----------------------------------
                                                                      
                                                                      
TRNA12_90:163  ugcuaggggauguagcucagugguagagcgcaugcuucgcauguaugaggccccg
TRNA12_8:77    ugcuagg................................................
TRNA34_104:172 gcuagcguaggcccuguggcuagcuggucaaagcgccugucuaguaaacaggaga
TRNA34_20:87   gcuagcguaggcccugugg....................................

                                                                                                                                              TRNA12_90:163  gguucgauccccggcaucuccaguacugcguugc..............GGCCGUC TRNA12_8:77    ................................................GGAUGUA TRNA34_104:172 uccuggguucgaaucccagcggggccuccagcauaguuugacgggcga--AUA TRNA34_20:87   ................................................----

                10            20        30        40           50                       |             |         |         |            |      TRNA12_90:163  GUCUAGUCUGgauU.AGGACGCUGGCCUCCCAAGCCAGcA.AU.CCCGGGUUCGA TRNA12_8:77    GCUCAG-UGG...U.AGAGCGCAUGCUUCGCAUGUAUG.A.GGcCCCGGGUUCGA TRNA34_104:172 GUGUCAGCGG...G.AGCACACCAGACUUGCAAUCUGG.U.AG.GGAGGGUUCGA TRNA34_20:87   -CUAGCUGG...UcAAAGCGCCUGUCUAGUAAACAGG.AgAU.CCUGGGUUCGA

                  60        70                                                            |         |                                         TRNA12_90:163  AUCCCGGCGGCCGCA----ucgcuu.......................... TRNA12_8:77    UCCCCGGCAUCUCCA----guacugcguugcggccgucgucuagucuggau TRNA34_104:172 GUCCCUCUUUGUCCACCA---guacguagauccgcggc............... TRNA34_20:87   AUCCCAGCGGGGCCUCCAGC--auaguuugacgggcgaauagugucagcgggag

                                                                                                                                              TRNA12_90:163  ....................................................... TRNA12_8:77    uaggacgcuggccucccaagccagcaaucccggguucgaaucccggcggccgcau TRNA34_104:172 ....................................................... TRNA34_20:87   cacaccagacuugcaaucugguagggaggguucgagucccucuuuguccaccagu

                                                              TRNA12_90:163  ............... TRNA12_8:77    cgcuu.......... TRNA34_104:172 ............... TRNA34_20:87   acguagauccgcggc

This file shows the matching area of the sequence within several copies of the sequence. Even though there is an extra FIM state in the model, the required ending delete state is automatically removed from the alignment because auto_fim is at its default value of 1.

If the variable alignshort is set to zero or higher, matching segments of the sequence are clipped, with alignshort positions shown on either side. The IDs are the same as if complete sequences are printed, corresponding to the starting and ending points of the motif within the original sequence. Depending on how large alignshort is, the subsequences may overlap. For example, the commands


hmmscore multtrnas -i testf.mod -db multtrna.seq -alignshort 3
        -select_mdalign 8 -sw 2
prettyalign multtrnas.mult -l90 > multtrnas.pretty
produce the alignment

;  SAM: prettyalign v3.5 (July 15, 2005) compiled 07/15/05_11:25:33
;  (c) 1992-2001 Regents of the University of California, Santa Cruz
;
;          Sequence Alignment and Modeling Software System
;         http://www.cse.ucsc.edu/research/compbio/sam.html
;
; --------- Citations (SAM, SAM-T2K, HMMs) ----------
; R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
;  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
; K. Karplus, et al., What is the value added by human intervention in protein
;  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
; A. Krogh et al., Hidden Markov models in computational biology:
;  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
; -----------------------------------
                          10            20        30        40        
                           |             |         |         |        
TRNA12_90:163  ugcGGCCGUCGUCUAGUCUGgauU.AGGACGCUGGCCUCCCAAGCCAGcA.AU.C
TRNA12_8:77    aggGGAUGUAGCUCAG-UGG...U.AGAGCGCAUGCUUCGCAUGUAUG.A.GGcC
TRNA34_104:172 cga--AUAGUGUCAGCGG...G.AGCACACCAGACUUGCAAUCUGG.U.AG.G
TRNA34_20:87   ugg-----CUAGCUGG...UcAAAGCGCCUGUCUAGUAAACAGG.AgAU.C

                  50        60        70                               |         |         |            TRNA12_90:163  CCGGGUUCGAAUCCCGGCGGCCGCA----ucg TRNA12_8:77    CCGGGUUCGAUCCCCGGCAUCUCCA----gua TRNA34_104:172 GAGGGUUCGAGUCCCUCUUUGUCCACCA---gua TRNA34_20:87   CUGGGUUCGAAUCCCAGCGGGGCCUCCAGC--aua

In addition to multtrna.mult, the file multtrna.mstat is produced. It reports the NLL-NULL score for each of the motifs listed in multtrna.mult. In this case, multtrna.mstat (or multtrnas.mstat) looks like


%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  multtrna   Host: peep    Fri Jul 15 13:45:02 2005
%  rph        Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files:  testf.mod
% Database Files:  /projects/kestrel/rph/sam32/demos/multtrna.seq
%
% Motifcutoff 0.500000
% Motifs selected from sequences into file multtrna.mult if better than:
%	 E-value (mdEmax 1.0e-02) (select_md=4)
% See related information in multtrna.dist
%
% Simple: NLL-NULL using FIM probabilities
% Reverse: NLL-NULL for the reverse sequence NULL model
%    Calculated when Simple < simple_threshold (10000.00)
% E-value on N (=2) sequences:
%    N / (1 + exp(-(lambda(=1.0000) * Reverse)**tau(=1.0000)))
%    Calculated when Simple < simple_threshold (10000.00)
%    Rescale E-values or use -dbsize for multiple scoring runs.
%    WARNING:  E-VALUES ARE NOT CALIBRATED!
% Scores sorted by E-value, best first
%
% Sequence ID     Length      Simple     Reverse     E-value      X count
TRNA34_104:172       69       -32.06      -30.14     1.63e-13
TRNA12_90:163        74       -31.41      -29.78     2.34e-13
TRNA12_8:77          70       -30.25      -27.98     1.42e-12
TRNA34_20:87         68       -29.98      -27.46     2.37e-12

Because reliable results are only obtained if FIMs are added to the model, multiple domain searchers are best performed when auto_fim is set to 1 (the default).

If the binary expansion of the many_files variable includes a '4' (e.g., 4, 5, 6, 7), the multiple domain score information (.mstat) is sent to standard output instead. See Section 10.2.3.


10.2.6 Multi-track HMM scoring

Suppose one wanted to model a group of protein sequences with associated secondary structure information. The secondary structure labels are important information that could lead to better modelling. To enable such analysis (without the need, for example, of a 60 character alphabet of each amino acid with each secondary structure type), SAM can process multi-track sequences and multi-track HMMs.

A multi-track sequence is specified by a pair or set of sequence files. The first sequence file contains the first track of data (amino acid sequences in this case), while the second sequence file contains the second track of data (secondary structure) with each sequence corresponding exactly in length, identifier, and position to a sequence in the first file. The sequences are specified as a comma-separated list with the db paramater.

Multi-track HMMs are similarly paired or sets of SAM HMM files. The models are specified as a comma-separated list for the -trackmod filename list variable. During the dynamic programming calculation, all transition probabilities are taken from the first or primary track, while character emission probabilities are taken from the joint (product) probability of the first track character being emitted from the given first track HMM state, and the second track character being emitted from the given second track HMM state, and so on. All track models must, of course, be of the same length.

By default, the tracks are all given unity weight, so that, ignoring transition probabilities, a 2-track model will produce scores of about twice the magnitude of a 1-track model. To compensate for this, each track can also be given a character emission coefficient with trackcoeff. The scores (log-probabilities) of the match and insert states are scaled by this ($\geq 0.0$) value, so that a 2-track model with identical tracks and coefficients of 0.5 will produce the same scores (approximately, due to rounding) as a one-track model, such as with the following two commands:


hmmscore t2  -trackmod test.mod,test.mod -db trna10.seq,trna10.seq
             -a RNA,RNA -trackcoeff 0.5,0.5
hmmscore t1 -i test.mod -db trna10.seq

if a mixture of user-defined and standard alphabets are used, the hyphen (`-') can be used in place of the normal character list in an alphabetdef statement to indicate a standard alphabet. For example:


hmmscore t2  -trackmod test.mod,test.mod -db trna10.seq,trna10.seq
             -alphabetdef "RNA -","RNA -" -trackcoeff 0.5,0.5

Multi-track models are easy to create with external software scripts.

The additional tracks have the same SW mode preprocessing (i.e., possibly adding FIMs), but never have their insert or FIM character tables changed (i.e., Insert_method_score and FIM_method_score are ignored for tracks other than the first). If FIMs are added to the model either the character table of the generic node is used (when present) or the geometric match state average (when there is no generic node).


10.2.7 Distributed scoring

Scoring a large database with hmmscore can take several hours on even the fastest workstation. The scoring program includes primitive support for distributed scoring. If the segments variable is set to an integer larger than 1, hmmscore assumes that that many runs of hmmscore are being used to score a complete database. The segment_number variable is used to label each segment.

These parameters might be used as follows:


hmmscore  test1 -i test.mod -db bigdatabase -segments 2 -segment_number 1 -sw 2&
rsh othermachine hmmscore test2 -i test.mod -db bigdatabase -segments 2 -segment_number 2 -sw 2
wait cat test1.dist test2.dist > test.dist

The associated parameter, segment_size, specifies that number of sequences that are read in at a time. At a value of 100, two segments would produce the effect that the first 100 sequences are processed by segment 1, the next 100 be segment 2, the next 100 by segment 1, and so on. Note that workload is partitioned according to the number of sequences rather than the number of residues, some segments may take longer to complete than other segments.

The segment_size parameter is important even if distributed scoring is not being performed to reduce memory consumption by hmmscore. Unfortunately, selection based on E-values requires knowledge of the database size. For this reason, if selection is being performed, segmented file reading will be performed twice; the first time to calculate the size of the database, in the second time to perform the scoring. The dbsize variable can be set to externally indicate the size of the database for calculating E-values. When set, the first reading of the database is not performed.


10.2.8 Single Sequence Smith & Waterman Scoring and Alignment

The hmmscore program can be used to perform Smith and Waterman scoring, alignment, and multiple domain alignment. This option is achieved by internally creating a SAM model based on a single query sequence, gap and continue penalties, and a scoring matrix. These models cannot be stored, and can only be used with Viterbi scoring. Local or global alignment can be performed on this model by appropriately setting sw.

The internal model zeros the appropriate transitions and uses an apporpriately simplified dynamic programming calculation. Its dynamic programming speed is about the same as a dedicated Smith and Waterman program, though the default reverse-sequence e-value calculation doubles execution time.

Smith and Waterman mode is specified by including a query file on the command line. The first sequence in this file is taken to be the query sequence. Additional parameters include gap and continue, which are by default set to 12 and 1, respectively, and matrix, set to ``blosum62.'' Matrices are read in BLAST format. They are first looked for in the current working directory, then in the directory pointed to by BLASTMAT, if set. If not, the subdirectories aa and nt, depending on alphabet, of the environment variable PRIOR_PATH are checked, and finally, the same subdirectories of the default prior path compiled into the code. See Section 8.1.1.

As with other forms of scoring, SAM is able to calculate e-values for Smith and Waterman queries based on the reverse sequence in all model. For this to be effective, the system must know the scaling factor for comping E-values based on how the scoring matrix has been formed. The value swlambda should be set to $\ln(b)/u$, where $b$ is the base of the matrix and $1/u$ is the unit weight. For example, the Hennikoff & Hennifkoff ``blosum62'' matrix distributed with BLAST and included with SAM is in units of 1/2 bit, so $b=2$ (bits are base 2) and $u=2$, since a `2' in the matrix corresponds one unit (a bit, in this case). The default value of lambda is thus $\ln(2)/2=0.34657$.


hmmscore sw -query globins50 -db globins50              -matrix blosum62 -gap 12 -continue 1 
will produce a score file using the first sequence of globins50 as a query and the first 10 sequence of globins50 as a database.

%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  sw   Host: peep    Fri Jul 15 13:45:02 2005
%  rph   Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files: 
% Database Files:  /projects/kestrel/rph/sam32/demos/globins50.seq
%
% Smith & Waterman Query Sequence:  BAHG$VITSP
% Cost matrix:  /projects/kestrel/rph/sam32/lib/matrix/aa/blosum62
% Gap cost 12    Continue cost 1
% Score DP Method: viterbi single-path (dpstyle = 1)
% Align DP Method: viterbi (adpstyle = 1)
% 50 sequences, 7369 residues, 0 nodes, 0.08 seconds
%
% Sequence scores selected:  All  (select_score=8)
%
% Simple: NLL-NULL using FIM probabilities
% Reverse: NLL-NULL for the reverse sequence NULL model
%    Calculated when Simple < simple_threshold (10000.00)
% E-value on N (=50) sequences:
%    N / (1 + exp(-(swlambda(=0.3466) * Reverse)**tau(=1.0000)))
%    Rescale E-values or use -dbsize for multiple scoring runs.
%    WARNING:  E-VALUES ARE NOT CALIBRATED!
% Scores sorted by E-value, best first
%
% Sequence ID   Length      Simple     Reverse     E-value      X count
The various select options (e.g., selectalign, selectmdalign, etc.) can also be used, according to Smith and Waterman score (NLLnull and mdNLLnull), reverse sequence (NLLcomplex and mdNLLcomplex, or e-value (Evalue and mdEvalue) score.


10.2.9 Scoring using the Kestrel parallel processor

The hmmscore program can use the UCSC Kestrel parallel processor to perform high-speed scoring of sequence databases.

Currently hmmscore only supports global or local EM scoring on Kestrel and models that have a final length, with added FIMs, of no more than 512 nodes. If any of these restrictions are not met, hmmscore reverts to the sequential algorithm, unless kestrel_fallback is set to 0.

Kestrel scoring is enabled using the use_kestrel parameter. Since a request queue mechanism is not currently available as part of the Kestrel runtime environment, a sleep and retry mechanism is implemented as part of hmmscore. Retrying is controlled using the kestrel_retry_cnt and kestrel_retry_time parameters. The kestrel runtime environment program is executed to access the Kestrel server and is found using the PATH.

Small models will generally score more quickly using the sequential algorithm than with Kestrel. A minimum model size for using Kestrel may be specified with the kestrel_min_model_len parameter. Models smaller than this will use sequential scoring.

Kestrel-format databases, created using _kestrel_db, are used by hmmscore. The original sequence databases are specified as the value of the -db arguments to hmmscore. The file names of the Kestrel-specific database will be derived from the db file names by appending the appropriate suffix. If subtract_null is 4, then .krseq is used, otherwise .kseq. Additionally, the .kids file must exist in the same directory as the -db file. The .kseq or .krseq are found either in the -db file directory or in the directory on the Kestrel server specified by kestrel_remote_db_dir. See Section 10.12.10.


10.2.10 Calibration and Model libraries

A SAM model library is a SAM parameter file that has been divided into several sections, one for each item in the model library. For a model, each section includes either a modelfile or trackmod, though it is acceptable to actually include the model definition in the library file. The typical parameters completely specify the method of scoring, including dynamic programming mode (dpstyle, sw, autoaddfims) and scoring parameters (subtractnull, and swlambda or emlambda). Also, each entry includes a name and a comment.

Single-entry model libraries are created when the calibrate option of hmmscore is used on a modelfile or a trackmod. If a modellibrary is passed to hmmscore, the resulting runname.mlib file has the same number of models as the original model library. To create a multi-model library, individual libraries can be concatenated, though you may wish to remove the comments (lines beginning with `%') for a more compact form.

When a model library is scored with rdb output, a column is included for the model name. Distance .dist files, however, do not have a means of indicating model names. Therefor, the standard .dist output will create one distance file for each model in the model library. The files will have a three-part name, being the run name, a sequence number (starting at 1 for the first model in the model library), and the name of the model as indicated in the model library. For example, test.1.firstmodel.dist is the first score file created by hmmscore run test that used a model library in which the first model was named ``firstmodel''. This format is used so that a model library can be scored multiple times (for example, with different databases) in the same directory. If the binary representation of manyfiles includes a 1 in the 8s place (for example, by setting manyfiles to any number between 8 and 15, inclusive), the distance file names will only use the model names specified in the model library, such as ``firstmodel.dist''. This will create a problem, of course, if multiple models in the model library have the same name.

SAM model library output will use absolute path names if the modlib_absolute variable is set to 1. This should be done if model libraries created in different directories are to be combined, and absolute path names were not used on the command line.

All models in a model library must have the same number of sequence tracks to facilitate comparison against a sequence database.

Suppose that model library test.mlib contains:


MODLIBMOD first Test.mod with sw 1
modelfile test.mod
dpstyle 0
sw 1
ENDMODLIBMOD

MODLIBMOD second Test.mod with sw 2 modelfile test.mod dpstyle 0 sw 2 ENDMODLIBMOD

This test.mlib file can then be calibrated using hmmscore:


hmmscore caltest -modellibrary test.mlib -calibrate 1

To produce the file


%  SAM: hmmscore v3.5 (July 15, 2005) compiled 07/15/05_11:25:31
%  (c) 1992-2001 Regents of the University of California, Santa Cruz
%
%          Sequence Alignment and Modeling Software System
%         http://www.cse.ucsc.edu/research/compbio/sam.html
%
% --------- Citations (SAM, SAM-T2K, HMMs) ----------
% R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
%  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
% K. Karplus, et al., What is the value added by human intervention in protein
%  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
% A. Krogh et al., Hidden Markov models in computational biology:
%  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
% -----------------------------------
%  caltest   Host: peep    Fri Jul 15 13:45:27 2005
%  rph       Dir:  /projects/kestrel/rph/sam32/SAMBUILD/peep/demos
% -----------------------------------
% Inserted Files: 
% Database Files: 
%

MODLIBMOD first Test.mod with sw 1 % Calibration data for 1000 random sequences % 1000 of 1000 scores > calibrate_threshold (-1e+20) used (default). model_file test.mod lambda 0.863821 tau 0.911306 sw 1 dpstyle 0 jumpinprob 1.000000 jumpoutprob 1.000000 subtractnull 4 FIMmethodscore -6 insertmethodscore 0 adjust_score 2 ENDMODLIBMOD

MODLIBMOD second Test.mod with sw 2 % Calibration data for 1000 random sequences % 1000 of 1000 scores > calibrate_threshold (-1e+20) used (default). model_file test.mod lambda 2.319682 tau 0.792675 sw 2 dpstyle 0 jumpinprob 1.000000 jumpoutprob 1.000000 subtractnull 4 FIMmethodscore -6 insertmethodscore 0 adjust_score 2 ENDMODLIBMOD

As can be seen, model calibration has added additional information to the model library. The most important is the setting of the lambda parameter for E-value calculation.

If any of the parameters shown in the resulting model library are changed, callibration must be performed again. It is important to note that models must be calibrated separately for all differences in scoring method.

The process of model calibration involves scoring thousands of sequences from either a database or a an internal random sequence generator (for protein sequences, a dirichlet mixture distribution is used). This can be a time-consuming process, but once calculated hmmscore will produce much more accurate E-values.

At present, we prefer scoring against a database of sequences. In this process, we assume a symmetric distribution, and remove all reverse null model scores less than (better than) zero to ensure that calibration is based on non-matching sequences. To perform database calibration, set calibrate to 1 to use the entire database, or a number larger than one, such as 1000, to use the first 1000 sequences. The evalues reported as a result of a calibration run will be the newly calibrated evalues. If you generally score small sets of sequence, you should calibrate your model against a large, non-redundent database, and then score the calibrated model against the smaller database.

For random sequence calibration, the only required parameter is calibrate, which specifies the number of random sequences to score (if calibrate is set to `1', an internal default is used). Optionally, trackprior specifies a list (one per track) of Dirichlet mixtures over sequence composition, genprot_prior indicates the default Dirichlet mixture prior for protein sequences, genehl2_prior indicates the default Dirichlet mixture prior for the EHL2 alphabet, gs_mean_log_len is the natural logarithm of the mean sequence length to generate, and gs_sd_log_len is the natural logarithm of the standard deviation of the synthetic sequence length distribution. If a Dirichlet mixture is not available for one or more of the sequence tracks, the characters are drawn according to SAM's internal default background frequencies with no variation in distribution between sequences.

We are reasonably happy with our single-track model calibration method. Our random sequence generators for multi-track model calibration are not appropriately linked, and thus do not effectively calibrate multi-track models, such as those with amino acid and secondary structure sequences. This is currently an active research area, and we hope to improve the calibration method for these models in the next release.


10.3 addfims

The addfims program is obsolete and may be removed from the SAM suite at a future date. With the addition of full-local and semi-local scoring, training, and alignment using the SW and SWtrain options, FIMs are automatically added to the beginning and end of the model internally (and latter removed) See Section 9.6.

The addfims program can be used to add Free Insertion Modules to the beginning and end of a model. The program modifymodel can also be used for this purpose, and `O' characters in modelfromalign also produce FIMs. See Section 8.5. This is particularly useful if a model has been trained on a clipped sequence motif, and is to be used in analyzing full sequences.

For example, the file trna10f.seq is the same as trna10.seq, except that sequences 2-5 have had extraneous characters appended to one or both ends. Using the test.mod file previously generated, an alignment of the first 5 sequences looks like this:


                                    10            20        30        
                                     |             |         |        
TRNA1  gg...................GGAUGUAGCUCAG-UGG...U.AGAGCGCAUGCUUCGCAUGU
TRNA2X gcggccgucgucuagugc...GGCCGUCGUCUAGUCUGgauU.AGGACGCUGGCCUCCCAAGC
TRNA3X cccggcccugugg........-----CUAGCUGG...UcAAAGCGCCUGUCUAGUAAAC
TRNA4X gggcgaauagugucagggcga--AUAGUGUCAGCGG...G.AGCACACCAGACUUGCAAUC
TRNA5X gccggg...............--AUAGCUCAGUUGG...U.AGAGCAGAGGACUGAAAAUC

       40            50        60        70                                   |             |         |         |                           TRNA1  AU.G.A.GGcCCCGGGUUCGAUCCCCGGCAUCUCCA----................... TRNA2X CA.GcA.AU.CCCGGGUUCGAAUCCCGGCGGCCGCAC---ggcggccgca......... TRNA3X AG.G.AgAU.CCUGGGUUCGAAUCCCAGCGGGGCCUCCA--gggg............... TRNA4X UG.G.U.AG.GGAGGGUUCGAGUCCCUCUUUGUCCACCA--................... TRNA5X CUcG.U.GU.CACCAGUUCAAAUCUGGUU-------ccuggcaugguuccuggca

This alignment is incorrect: the extra end characters do not all use end insert states. Rather, internal insert states are found to minimize the alignment cost for sequences 2X, 4X, and 5X.

The addfims program has an interface identical to that of buildmodel, though only the runname, model file (and its model, or if not present, its regularizer), and the alphabet are used:


addfims testf -insert test.mod
align2model testf -i testf.mod -db trna10f.seq
prettyalign testf.a2m -l90 > testf.align

                                    10            20        30        
                                     |             |         |        
TRNA1  gg...................GGAUGUAGCUCAG-UGG...U.AGAGCGCAUGCUUCGCAUGU
TRNA2X gcggccgucgucuagugc...GGCCGUCGUCUAGUCUGgauU.AGGACGCUGGCCUCCCAAGC
TRNA3X cccggcccugugg........-----CUAGCUGG...UcAAAGCGCCUGUCUAGUAAAC
TRNA4X gggcgaauagugucagggcga--AUAGUGUCAGCGG...G.AGCACACCAGACUUGCAAUC
TRNA5X gccggg...............--AUAGCUCAGUUGG...U.AGAGCAGAGGACUGAAAAUC

       40            50        60        70                                    |             |         |         |                            TRNA1  AU.G.A.GGcCCCGGGUUCGAUCCCCGGCAUCUCCA----................... TRNA2X CA.GcA.AU.CCCGGGUUCGAAUCCCGGCGGCCGCAC----ggcggccgca......... TRNA3X AG.G.AgAU.CCUGGGUUCGAAUCCCAGCGGGGCCUCCA---gggg............... TRNA4X UG.G.U.AG.GGAGGGUUCGAGUCCCUCUUUGUCCACCA---................... TRNA5X CUcG.U.GU.CACCAGUUCAAAUCUGGUU--------ccuggcaugguuccuggca


10.4 fragfinder

The fragfinder program will, given a database (db) and a model (modelfile, insert, or trackmod and trackcoeff), find a specified number of gapless sequence fragments of a specified length commencing at each model state (excepting the ones at the end where fragments of the given length would not fit).

The program calculates the posterior matrix for each sequence -- the values ``what is the probability that this specific character is in this specific model state'' using the forward-backward algorithm. Then, for each position in the model, the top numpermatch (default 5) scoring fragments of length fraglen (default 10) are saved. Fragment scores are the sum of the posterior scores for the fragment with a null model subtracted off. In this case, we use the posteriors of the reversed fragment as the null model. This provides some compenstation for the strong length-dependent signal present in the posteriors, even with fully-local alignment (characters near beginning the sequence are far more likley to be at the start of the model).

The program produces two output files: runname.frag and runname.fstat. The first is an a2m multiple alignment format of the fragments. If desired, a single sequence can be specified to be aligned against the model, and then be used as a guide, specified with firstsequence. The second file is a standard SAM score file with mangled names, sequence lengths, and fragment scores. The names are mangled by appending ``.x.y'' to each sequence ID, where `x' is the starting position in the model and `y' is the starting position in the sequences. Fragments will be listed in order of starting model node number, with numpermatch fragments for each model node.


10.5 grabdp

The grabdp program can be used to examine the dynamic programming table of the forward-backward dynamic programming calculation.

If used with no arguments, or a dynamic programming style dpstyle other than 2 (forward-backward with posterior saving), it will produce a runname.seqid.freq model file with a frequency model for every sequence within the db presented to grabdp. The frequency files are similar to those produced with buildmodel, and indicate the fractional number of sequences that have used each transition and each character generating state in the model.

When grabdp is run with dpstyle 2, it produces a full dump of the dynamic programming posterior matrix. The resulting runname.pdoc file is very large, as it includes the posteriors for the three transitions into match and the character posteriors for each for the index points in the dynamic programming matrix (i.e., approximately sequence length times model length records). The file first includes a copy of the sequence and of the model, followed by the posteriors. The following posteriors are for a sequence `AT' against a 2-state model trained on the several copies of the sequence `AT'.


POSTERIORS
(Residue 2 DM+MM+IM) 46.0517 19.3068 18.9014 -7.92682e-05 
(Char T, index 3, Match) 19.9366 13.342 0.00100001 19.5945 
(Char T, index 3, Insert) 11.603 7.322 7.806 19.5945 
(Residue 1 DM+MM+IM) 46.0517 13.3407 0.00114948 18.9014 
(Char A, index 0, Match) 19.9366 0.00699997 9.183 19.5945 
(Char A, index 0, Insert) 9.616 4.984 8.509 19.5945 
(Residue 0 DM+MM+IM) 46.0517 0.00699997 9.18296 18.9014 
Posteriors are output in reverse order. Each ``Residue'' line shows the log-probability that residue being in a given matchstate. Since there is no residue 2, the first line should be ignored. The second and third lines indicate negative log probaiblities of the character T (which has character code 3 in the DNA alphabet) being in the match or insert states of each model node. As can be seen, the character is almost certainly in model node 2, as indicated by the `0.00100001'. The `Residue 1' line indicates that the match state is again most likely; it shows the sum of all the transitions going into match states for that node. Residue 0, `A' is similarly placed in model node 1.

Because of the size of these files, the grabdp program will only output the values for a single sequence and model, choosing the first sequence in a database. The program requires a model file, a database file, and optionally a single sequence identifier.

The view_pdoc program is a viewer for the posterior decoding files. It enables zooming and coordinate location in regions of interest in the posterior decoding matrix. The program requires the wish simple windowing shell, part of the Tk suite. It may be necessary to edit the view_pdoc script to include path names for xwd, the X window dump program, and ppmtogif, the image conversion program. An example visualization is shown in Figure 16.

16: The view_pdoc visualization of posteriors calculated for a globin against the 1babA model.
\begin{figure}\centerline{\psfig{figure=1babA-pdoc.eps,width=0.7\textwidth}}\end{figure}

For diagnostic purposes, it can sometimes be useful to view the amino acid priors and posteriors. When the dump_match_probs flag is set, grabdp will generate a RDB file listing the posteriors for each node, and listing the priors under the label ``FREQAVE''.


10.6 get_fisher_scores

The get_fisher_scores program scores sequences with a model and outputs Fisher score vectors for input an external discriminative learning program as described in:

T. Jaakkola and M. Diekhans and D. Haussler A discriminative framework for detecting remote protein homologies, Journal of Computational Biology 7(1):2000

The hmmscore arguments specifying the model, sequences, and scoring parameters (e.g. -dpstyle, -sw) are valid. The -fisher_feature parameter specifies which features are used to calculate the Fisher score vectors. Use the -null_score_weight_scale parameter to weight sequences by their NLL-NULL score.

As the Fisher score vectors can be large, they are written as a stream to stdout, allowing piping directly to another program without creating an intermediate file. The output stream maybe in ASCII or binary format, selected by the -binary_output8.4.5 option. The first line of each record indicates the record format and sequence id. ASCII records start with a line in the form


>A seqid1
and binary start with

>B seqid1

In both formats, the record header is terminated by a newline.

The first number in a record indicates the total number of features in the record. It is followed by that many floating point numbers. In an ASCII stream, the numbers area whitespace separated (both spaces and newlines are used) designed to be read by fscanf. In the binary format, the feature count is an int and the features are floats.


>A seqid1
3
0.1 0.2
>A seqid2
3
1.2 -0.3 1.0

The features may also be written to a file in RDB format, specified with the -rdb, creating a file names runname.fisher-rdb. The -write_dist parameter can be used to generate a runname.dist-rdb score file, which maybe useful in analyzing the results.

For example, the following command will derive Fisher scores vectors from the match states and Dirichlet mixture used to train the model:


get_fisher_scores unused -fisher_feature match_prior -i test.mod -db prots.seq -sw 2


10.7 modelfromalign

Modelfromalign takes a multiple alignment and converts it to a model. As a base model, the program starts with the default regularizer (or a specified regularizer, as for buildmodel), and then calculates node frequencies according to the given multiple alignment. Sequences in the alignment can be weighted according to the alignment_weights file, discussed in Section 9.4.

If a trustworthy hand alignment is available, this is often the best way to build a model: create one from an alignment, and then refine it using buildmodel. If some sections of the alignment are particularly important, it may be desirable to make them fixed nodes, as described in Section 8.4.2.

The alignfile parameter can also be used with buildmodel to specify a seed alignment. See Section 8.3.

The modelfromalign program will read any readseq format, but has a few special interpretations. It follows the align2model convention that lowercase letters are insertions, hyphens are deletions, and dots are simply filler for insertions in other sequences. Additionally, the letter `O' is converted into a FIM, following a convention used in some multiple alignment formats (the other SAM programs will convert `O' to the `X' wildcard). If all sequences do not have the same number of uppercase letters and hyphens, then modelfromalign will try treating all characters as uppercase and all periods as hyphens (i.e., it will try modeling each character as a match column).

The program has one required parameter that must be set on the command line or in an inserted or .samrc file: the alignfile is the name of the file with the alignment. For example,


modelfromalign trna2 -alignfile trna2.align 

will produce a model from the trna2.align alignment. All buildmodel parameters dealing with regularization and prior libraries are used in the conversion from column frequencies to a model. Prior libraries are particularly helpful when converting a small protein alignment to a model.

If the alignment is of a motif, setting the align_fim variable to 1 will cause FIMs to be added to the model before printing it out.


10.8 pathprobs

The pathprobs program can be used to generate the posterior probabilities of each of the characters in an alignment. This is a small subset of the information generated by the grabdp program. In effect, pathprobs uses the alignment to trace through the posterior matrix and decide which character emission probabilities are desired. Optionally, pathprobs can convert all match states with a worse posterior than some threshold to insert states, which can improve alignments.

The program can be run using, for example,


pathprobs 1babA-50p -modelfile 1babA-t2k-w0.5.mod -alignfile 1babA-50.a2m
which will produce tab-separated RDB output alignment probability file 1babA-50p.apr along the lines of:

SequenceID	SeqIndex	ModIndex	ModState	NegLnPosterior	T0:protein
10s	8n	8n	8s	9n	10s
1babA	0	0	I	0.347000	X
1babA	1	2	M	0.461000	M
1babA	2	3	M	0.185000	E
1babA	3	4	M	0.027000	L
1babA	4	5	M	0.007000	S
1babA	5	6	M	0.003000	P
1babA	6	7	M	0.001000	A
Here, the first line identifies the columns. In this case, a single-track model and sequences were used. The second line is field width and type specifiers. The remaining lines show, for each character in the labeled sequence, the state that it was generated by according to the alignment (a model index and an `M' for match or `I' for insert; `D' never occurs because it does not generate any characters), the negative natural log posterior probability according to the forward-backward algorithm of the character appearing in that state, and the (possibly multi-track) character itself.

The program will also generate the 1babA-50p.pa2m file that interleaves sequences and digits corresponding to negative log posterior probabilities multiplied by ppscale (default = 1.0). The values are rounded to a single digit, so 9.0 and higher is reduced to the digit 9. With the default ppscale of $1.0/\ln(e)=1.0$, 0 indicates a completely certain position ($p=1.0$), 1 indicates 1 NAT of probability ($p=0.368$), 2 indicates 2 NATS ($p=0.136$), and so on. For example, here is a prettyaligned excerpt of a .pa2m file for several globins.


;  SAM: prettyalign v3.5 (July 15, 2005) compiled 07/15/05_11:25:33
;  (c) 1992-2001 Regents of the University of California, Santa Cruz
;
;          Sequence Alignment and Modeling Software System
;         http://www.cse.ucsc.edu/research/compbio/sam.html
;
; --------- Citations (SAM, SAM-T2K, HMMs) ----------
; R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
;  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
; K. Karplus, et al., What is the value added by human intervention in protein
;  structure prediction, Proteins: Stucture, Function, Genetics 45(S5):86-91, 2001.
; A. Krogh et al., Hidden Markov models in computational biology:
;  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
; -----------------------------------
                           10             20                  30        40         50     
                            |              |                   |         |          |     
1babA      x.......-MELSPADKTNVKAAWGKV.....GA.HAG......EY...GAEALERMFLSFPTTKTYFPHF.DLS....
1babA      x.......-000000000000000000.....00.000......00...0000000000000000000000.000....
BAHG$VITSP m.......--LDQQTINIIKATVPVL.....KE.HGV......TI...TTTFYKNLFAKHPEVRPLFD-.--....
BAHG$VITSP m.......--0000000000000000.....00.000......00...00000000000000000000-.--....
GLB$APLJU  a.......--LSAADAGLLAQSWAPV.....FA.NSD......AN...GASFLVALFTQFPESANFFNDF.KGKslad
GLB$APLJU  a.......--0000000000000000.....00.000......00...0000000000000000000000.000slad
GLB$APLKU  ........-SLSAAEADLVGKSWAPV.....YA.NKD......AD...GANFLLSLFEKFPNNANYFADF.KGKsiad
GLB$APLKU  ........-00000000000000000.....00.000......00...0000000000000000000000.000siad

The pathprobs program can also trim alignments (Cline, Karplus, & Hughey, Bioinformatics 18(2):306-314, 2002) according to a specific posterior threshold indicated with pptrim. All match states with negative log probabilities equal to or above this threshold will be turned into insertions.

It is of course critical that the alignment and the model are of the same length. A typical use of the program would be to generate a Viterbi or Posterior-Decoded alignment and then use pathprobs to determine how strongly that alignment is reflected in the posterior matrix.


10.9 predict_track

The predict track program can be used to predict the secondary structure of a sequence based on a protein model and a set of sequences of known secondary structure. The program can either be provided with models for all tracks or (in the 2-track case) create a model for the second track based on a database of 2-track sequences. The program thus requires a base model or set of base models, a prediction sequence, a multi-track alphabet definition, and (when it generates its own second-track model) a 2-track database. See Section 10.2.6.

For output, the program will produce a .ptrack file of per-seuqence structure predictions or if rdb is set, a .rdb file. If the program created a model for the second track, it will be written to a .mod file, and can be used, for example, with multi-track HMM scoring and alignment.

For example, the command:


predict_track db1.test1 -sw 2 -subtractnull 4  -i db1.test1.mod
         -a protein,ELH -db predict.db1.seq,predict.db1.2d 
         -predictseq db1.test1.seq
will produce as output db1.test1.ptrack, a predicted secondary structure sequence for the first sequence in the db1.test1.seq file based on the protein model and the primary and secondary structure sequences provided to predict_track, and db1.test1.mod, a secondary structure model. The model is created by aligning the predict.db1.seq sequences to db1.test1.mod, weighting the sequences, and then tabulating the frequencies of each letter in each model node according to the generated sequence weights. The frequencies are regularized with the ELH default regularizer, and transitions are copied from db1.test1.mod. See Section 9.4.3.

17: A model and its predicted secondary struction (makelogo db1.test1.eps -modelfile db1.test1.mod -logo_captionf db1.test1.ptrack -logo_sections 1-32).
\begin{figure}\begin{center}
\psfig{figure=db1.test1.eps,width=0.8\textwidth}
\end{center}\end{figure}

The model and the prediction can be visualized with makelogo using the logo_captionf option (Figure 17).

In creating a secondary track model, FIM_method_train and Insert_method_train are used to set the FIM and Insert tables. It is highly recommended to set these to 0 to use the regularizer (equivalent to the generic node) rather than the buildmodel letter-count defaults.

10.10 Model manipulation


10.10.1 checkmodel

The checkmodel program reports various information about a SAM HMM, including alphabet, name, and length, as well as the number of each type of special node the model includes. Also, the total number of labels are reports (See Section 9.7.) and the number of environment codes (See Section 11.1.). Its output is to stdout, but a run name must be provided anyway:


checkmodel test -modelfile test.mod


10.10.2 drawmodel

The drawmodel program is a means of generating a postscript drawing from a model, regularizer, or frequency count data (created with the print_frequencies option). The program is run with two arguments, the first being a model file, and the second the output file.


drawmodel model.mod drawing.ps

The program will scan the file, looking for models, regularizers, and frequency counts, and query whether or not each one should be printed, after presenting a line from the file.

There are two drawing options: overall and local. Overall is the correct option for frequency counts -- the outgoing transitions of each state are drawn in different styles depending on the fraction of all sequences that use that transition. The circular delete states show the node number, while the diamond insert states show average number of characters, rounded up, inserted by each sequence that used the insert state.

In the local option, suitable for models and regularizers, transitions from a given state are drawn according to what percentage of sequences in that given state take each transition. Also, the diamond insert states have the percent of sequences which, once within the insert state, remain in the insert state. (See Figure 4 on page [*].)

The drawmodel program has several command-line options: -landscape to draw models in landscape format, and -scale num, to change the scale of the drawing to an arbitrary floating-point number. The default is portrait mode with a scale 0.235, which fits six row of 19 protein model nodes (plus one ghost node, the first model of the next row) on each page. Larger scale settings increase the size of the model nodes, and cause fewer to be placed on each line. For additional customization, the postscript file, which is readable, can be modified (e.g., to change print on a different size of paper).

If -mod n, -freq n, or -reg n is specified on the command line, the $n$th model, frequency count, or regularizer will be selected for printing, and the interactive queries on which model to print will not occur.

Europeans and other people who like the A4 paper size can change `/A4 false def' to `/A4 true def' near the bottom of the header commands in the output of drawmodel.


10.10.3 hmmconvert

The hmmconvert program takes a model file as input and outputs a new model file in the `opposite' format (binary or text) from the original.

The command syntax is as follows:


hmmconvert runname -model_file test.mod

If test.mod is in text format, runname.mod will be in binary format and vice versa. If you wish to destroy your original model file, use a runname of the file's name without the .mod extension.


hmmconvert test -model_file test.mod

If test.mod is in text format, this command will create a new file called test.mod in binary format. If test.mod is binary, the new test.mod will be text.

To preserve your original modelfile, enter a new runname.


hmmconvert new -model_file test.mod

This will create a file called new.mod in the opposite format of test.mod. test.mod will be left alone.


10.10.4 makelogo

Sequence logos (Schneider and Stephens, NAR 18:6097-6199, 1990) are an excellent means of viewing model match state distributions (Figures 3 and 17). Each HMM match state is represented by a bar. The bar height corresponds to the negative entropy in bits of that match state distribution. Each bar is composed of letters in the target alphabet and the letter `X'. The size of the letters correpond to the relative freqeuncies of the letters; letters of low frequency are lumped together into the `X' character. FIMs are represented with a large letter `O'. The makelogo program has default internal colors for protein, nucleotide, and secondary structure alphabets.

The simplest way to run the program is with just a model file to create the postscript file test.eps, as:


makelogo test -modelfile test.mod

18: Primary structure model of 3chy with sequence and secondary structure
\begin{figure}\centerline{\psfig{figure=advancedlogo1.eps,width=\textwidth}}\end{figure}

19: Predicted secondary structure model of 3chy with sequence and true secondary structure
\begin{figure}\centerline{\psfig{figure=advancedlogo.eps,width=\textwidth}}
\end{figure}

The logos are colored according to the alphabet of the model; new color schemes can be specified with the logo_color file name. The internal defaults are in the example files protein.colors, nucleotide.colors, and stride.colors included in the SAM distribution, and can be modified to create new color schemes. Each line of the file is either a comment (preceeded by an asterix) or includes a single letter and a triple of RGB values between 0.0 and 1.0. The makelogo file will look for color files in the current directory as well as the PRIOR_PATH. See Section 11.1.

The logo diagram can be annotated with a title (logo_title), and also with sequences in three ways. First, a single sequence (with dots, but not lower-case letters, ignored), such as a protein that the model is based on, can displayed under the graph using logo_under_file (and the color file logo_under_color if black is not desired). If a sequence (or sequences, when logo_captionf_numseq is set to a number larger than 1) of structure is specified with logo_captionf (and the corresponding logo_captionf_color), repeated occurrences of the secondary structure character will be indicated as a range of positions with that the character for each sequence in the file. Finally, a file that alternates lines of start and end positions, and annotations, can be specified with logo_caption. All three of these options can be used simultaneously.

For example, Figure 18 shows a SAM-T2K (the pre-release sucessor to SAM-T2K) model for the PDB sequect 2chy. The logo diagram includes the 2chy sequence under the sequence, as well as the secondary structure, both input from separate FASTA-format files. The following Figure 19 shows a SAM-T2K secondary structure prediction HMM calculated without the aid of the secondary structure sequence for 2chy. The logo again includes the 2chy sequence as a logo_under_file, and the true secondary structure as a logo_captionf.

The program has a number of other options (all beginning with logo_). See Section 12.


10.10.5 modifymodel

modifymodel is a utility program for modifying SAM models. It has an interactive type in interface as well as command line arguments. This program will grow over time to meet the needs of users. It contains extensive built in help with examples and a question mark (?) will list the commands:


AddNodes nodeNumber count  #adds generic nodes after specified
Change subCommand...       #has sub commands, change ? to see
DeleteNodes nodeSpec       #deletes the specified nodes
DuplicateNode nodeNumber count  #copies the specified node
NewModel modelLength       #make new model from default generic
PrependToCurrentModel fileName  #reads model from file, pastes on front
PostpendToCurrentModel fileName #reads model from file, puts on end
ReadModel fileName
Show subCommand...
WriteModel [fileName]   #defaults to name of file last read in
Quit
Help

Most (at the moment all) operations act on the current model: which is the last read in model. The basic usage is to read a model in, do what you want to it (add nodes, change types, paste models together, etc.) and write it back out.

The currently supported functions are

To print all the types of the nodes (use this to see how long the model is):


show type all

To change all nodes that do not have a type into KEEP nodes,


change type _ K ALL

To change node 9 into a FIM:


change type + F 9

To shorten a model by deleting nodes 4 through the end:


delete 4,end

Below is an example of reading a model, deleting the first 20 nodes, adding a new node to the front and changing it to a FIM, then pasting it on to the end of a second model, and finally writing the modified model out.


read one.mod
delete 1,20
add 0 1
change type + F 1
prepend two.mod
writemodel both.mod
quit


10.10.6 sam2psi and psi2sam

The sam2psi program converts the character probabilities in the match states to a PSI-BLAST (Altschul et al., NAR 25:3389-3402, 1997) profile. This file can then be loaded into the PSI-BLAST program for searching.

The program creates two files: a PSI-BLAST checkpoint profile, based on the format used by version 6.60 (5/30/2003) of posit.c, the PSI-BLAST module of blastpgp. The checkpoint is called runname.ckp. PSI-BLAST also requires a query sequence that exactly matches the sequence specified in the checkpint. SAM HMMs do not have a single sequence that can describe the generation of the model. For these profiles, sam2psi creates a runname.cks checkpoint sequence file for a synthetic sequence based on the most likley amino acid to be generated by each match state. This sequence is also present in the checkpoint file.

The commands are thus:


sam2psi test -modelfile test.mod
blastpgp -i test.cks -d nrp -R test.ckp

The psi2sam program will read a PSI-BLAST checkpoint file and produce a SAM model. It will use the default or specified transition regularizer, and the current train_reset_inserts to set the insertion state probabilities. See Section 8.6. `X' states begin as the background distribution of the regularizer (the PSI-BLAST checkpoint files do not include probability tables for `X' positions), and are similarly changed according to train_reset_insert. The program will output (to stdout) the PSI-BLAST query sequence for verification.


psi2sam testp -modelfile test.ckp


10.10.7 Conversion between SAM and HMMer

SAM and HMMer are the two most widely used HMM sequences alignment and modeling systems. Even though they both employ hidden Markov models, their file formats are vastly different. Provided with this release are the conversion programs sam2hmmer and hmmer2sam that make almost complete conversion between SAM and HMMer v1.7 possible.

Martin Madera and Julian Gough have written a perl converter between SAM and HMMer 2.0 formats, available from http://www.mrc-lmb.cam.ac.uk/genomes/julian/convert/convert.html and the SAM WWW site. Use of this program is strongly suggested.

The conversion is ``almost'' because there is some information loss, and this information loss is due to the structural differences between the two systems. For those users who wish to use the conversion utilities, we discuss these issues below.

10.10.7.1 Internal HMM Structure

Both SAM and HMMer use the same linear model of nodes that contain 3 states - match, delete, insert. The number of transitions per node is the same, 9. The internal difference is in how transitions between nodes are viewed. SAM views transitions INTO a node, thus each node contains the transitions FROM the previous node. HMMer is the opposite, transitions are TO the next node. HMMer's convention makes more sense because each triple of transitions stored in a node sums to one. SAM's convention, slightly helpful in the internal dynamic programming loop, is maintained for historical reasons. This becomes a problem at the beginning and end of the model.

The begin node of HMMer has a valid transition probability distribution out of the MATCH state (in that the transition probabilities sum to one), but has the null distribution for matching characters. This means HMMer always starts in the insert state, and the begin node match state is not used. The begin node of SAM can start in either non-character-generating match or an insert. So the SAM to HMMer conversion is fine, but HMMer to SAM conversion leaves SAM's begin node match state character table zero. The net result is that SAM is forced to start in the insert state.

The end node of HMMer also has a zero match state character table.

The two programs are used as follows:


hmmer2sam hmmerfile.hmmer  newsamfile.mod
sam2hmmer newhmmerfile -i samfile.mod
Above, the output of sam2hmmer is stored in the file newhmmerfile.hmmer.

10.10.7.2 Extra Information in HMMer

HMMer allows other information on a per node basis. This information seems to be solvent access and consensus information. This information cannot be used in SAM at this time, and is silently dropped. A note of the global presence of this extra information is noted as a comment in the SAM model, solely for documentation purposes.

10.10.7.3 Extra Information in SAM

SAM has constructs in the model that are not supported in HMMer, and these constitute the greatest difference in the two systems. SAM has per node ``types'' that control learning and evaluation parameters. The problematic type is the Free Insertion Module (or FIM). This special type allows SAM to have a single node give equal cost for all characters by effectively matching one or more characters. These FIMs can be positioned anywhere in a model, but are most commonly positioned at the ends (the begin node and the node before the end node). HMMer has no such concept, and handling FIMs in a conversion is a problem. FIMs at the begin and end are are not too critical, because HMMer's system always treats the begin and end just like they are free inserts (also, HMMer re-normalizes all nodes so any strange distributions are ``fixed''). FIMs in the middle of a model can NOT be converted in an exact manner. Due to the structure in SAM, some probabilities in a FIM node do not sum to one and the match state transition probabilities are all zero. During conversion, these nodes have to be ``fixed'' so that valid (sum to one) probability distributions are given to HMMer.

One way to fix up a FIM is to remove it: it maps to nothing in a HMMer model. This method is not too bad if the FIM does not usually match many characters, but is very bad if the FIM is used a lot. A second way is fabricate some normal cheap insert states and hope that they match the average length of the FIM. This method is hard, since there is not enough information to guess the number of insert nodes to add. (If the frequencies are present in the SAM model, then this can be done. However this information is not always present.). sam2hmmer utilizes the former option. The transition probabilities for the match state are computed from the exit probabilities of the FIM (a chain of FIMs in a row are removed as they were one). A SAM FIM has exits from the insert and delete states, and the previous non-FIM node has a transition from match to insert. The new transitions are computed from these numbers by breaking the match to FIM transition into two parts that have the same ratio as the exit probabilities of the FIM. This same computation is done to compute the new transitions from delete and insert.

It is not possible to record any extra information from SAM into a HMMer model as HMMer does not support comments in the file.


10.11 Plotting Programs

SAM has several plotting programs to assist in data analysis. The programs require gnuplot http://www.cs.dartmouth.edu/gnuplot_info.html to be in the user's path. In general, the programs create one or more .data files, a .plt file of gnuplot commands, and a postscript file.

Options common to the programs include

plotcolumn -- Column of score file to use in calculating plots. Length (0), simple null model (1), complex or reverse null model (2), or Evalue (3, the default). If Evalues are requested but not present, then column 1 is used.

plotps -- Creates a postscript file runname.ps using gnuplot if set to 1, 2, or 3. When set to 0, only a .plt file and one or two .data files are generated. A setting of 1 generates a plot in gnuplot's default rectangular shape, while a setting of 2 generates a square plot. For options 1 and 2, the .data and .plt files used to create the postscript file are deleted. When set to 3, the postscript file is generated and the .data and .plt files are retained. The default setting is 1.

plotleft -- Lowest X axis value on a graph generated by gnuplot. The X axis is calculated internally if plotleft=plotright. The default setting is 0.0.

plotright -- Highest X axis value on a graph generated by gnuplot. The X axis is calculated internally if plotleft=plotright. The default setting is 0.0.

plotmax -- Highest Y axis value on a graph generated by gnuplot. The Y axis is calculated internally if plotmax=plotmin. The default setting is 0.0

plotmin -- Lowest Y axis value on a graph generated by gnuplot. The Y axis is calculated internally if plotmax=plotmin. The default setting is 0.0.

plotline -- Creates a vertical line at this value in a graph generated by gnuplot if plotline is nonzero. The default setting is 0.0.

plotnegate -- Negates the scores on a graph generated by gnuplot if set to 1. The default setting is 0 (off).


10.11.1 makehist

The makehist program can be used to make postscript histograms from one or two .dist or .mstat distance files created by hmmscore. Histograms are most useful to discover and examine the separation between family members and non-family members during a database search. The makehist program requires a distfile as an argument, in addition to the runname. For example, to make a histogram from test.dist, the following command would be used:


makehist test -distfile test.dist
Then, to view the histogram, use ghostview or a similar postscript viewer to view test.ps. If you would like to use gnuplot directly, set plotps to 3 to keep all the intermediate files. Then, run the gnuplot program and enter the command load "test.plt" to view the results. The test.plt file may be modified to change the graph. If the two marked lines in the file are uncommented, a postscript file will be generated.

20: A histogram generated by makehist and gnuplot.
\begin{figure}\centerline{\psfig{figure=testhist.ps}}
\par\end{figure}

If a second file is specified, using the distfile2 option, a second histogram is placed above the first one. The makehist program has an optional argument, histbins, which can be used to set the number of bins between which scores should be divided. The histogram in Figure 20 was generated using five bins.


10.11.2 makeroc

The makeroc program generates a postscript graph file using the gnuplot plotting package. It takes two .dist distance files created by hmmscore as input and plots false negatives/ false positives on the axis Score vs. Counts where Score is the NLL-NULL score of a sequence and Counts is the number of sequences in the files with a particular score.

To plot false negatives vs. false positives in two distance files globins.dist and nonglobins.dist, the following command would be used:


makeroc test -distfile globins.dist -distfile2 nonglobins.dist

The program will output a file test.ps which you can view with ghostview or other postscript viewer.

21: A plot generated by makeroc and gnuplot.
\begin{figure}\centerline{\psfig{figure=makeroc.ps}}
\par\end{figure}


10.11.3 makeroc2

The makeroc2 program generates a postscript graph file using the gnuplot plotting package. It takes two .dist distance files created by hmmscore as input and plots false negative vs. false positive scores, each of which are a function of the setting of threshold score. At each integral point on, for example, the false positive axis, the value of the false negative coordinate is a linear interpolation between the two nearest integral false negative values based on the scores of the two false negative points and the false positive point in question. This means, for example, that if two negative examples score 10 and 20, and two positive examples score 19 and 21, the point (1,1.1) will be plotted for the score of 19, indicating that if the threshold is set at 19, there will be one false negative (the sequence that scores 21) and a little over one false positive (the sequence scoring 19 is 10two negative examples).

To plot false negatives vs. false positives in two distance files globins.dist and nonglobins.dist, the following command would be used:


makeroc2 test -distfile globins.dist -distfile2 nonglobins.dist

The program will output a file test.ps which you can view with ghostview or other postscript viewer.

To plot multiple pairs of distance files on the same axis, use the perl script multi_roc2.pl.


multi_roc2.pl true1.dist false1.dist title1 true2.dist false2.dist title2
true3.dist false3.dist title3 . . .

The titles appear in a legend in the upper right hand corner of the plot and identify the individual curves.

22: A plot generated by multi_roc2.pl, makeroc2 and gnuplot.
\begin{figure}\centerline{\psfig{figure=makeroc2.ps}}
\par\end{figure}

10.11.4 makeroc3

The third of the series is makeroc3. This program is given two data files, as with makeroc2 and prints numerical information about a positive distance file and a negative distance file. With the command


makeroc2 test -distfile true.dist -distfile2 false.dist
the program will output a test.dat file of the form

flips between x:2.000000, y:2.280622 score:-36.376999 and x:3.000000, y:2.263981 score:-36.743000
3 sequences at FP=FN
7 FN at FP=0
2 FP at FN=0

10.12 Sequence manipulation

The programs described in this section can be used to perform many helpful tasks.

23: Output of listalphabets test -a protein,EHL.
\begin{figure}\begin{showfile}
\char93 \char93 \char93 \char93 \char93 \char93 \...
...ar93 \char93 \char93 \char93 \char93 \char93 \char93
\end{showfile}\end{figure}


10.12.1 listalphabets

The listalphabets program reports default information for the internal alphabets. If run with only an (ignored) runname, the program will list all alphabets, valid characters, and default regularizers. If the program is provided an alphabet argument, such as


listalphabets test -a protein,EHL
it will provide information about the multi-track alphabets listed, as shown in Figure 23. The program provides entries for all possible tracks, showing the maximum number of sequence tracks that can be used at one time. The entries listed under ``Convert'' are pairs of letters, where the first one is automatically converted to the second, such as the merging of various types of coils, including `L', into the single `C' class of the (slightly) misnamed EHL alphabet. See Section 7.1 and Section 8.1.


10.12.2 checkseq

The checkseq program will read a sequence file and list various pieces of information about it. For example,


checkseq unused_runname -db trna10.seq
will produce the output

Alphabet:	RNA
Input Format:	pearson
Alignment:	no
AlignType:	unaligned
AlignColumns:	0
Num Sequences:	10
Average Length:	74.70
Max Length:	77
Min Length:	72
Total Length:	747

SAM internally recognizes several types of alignments. The check proceeds as follows: if the file is an HSSP file, it is considered an alignment. Otherwise, if the file is an a2m format file with the same number of upper case (match) and hyphen (delete) characters in each line, it is an a2m format alignment. Otherwise, if all characters have the same total number of upper case, lower case, hyphen, and period characters, it is an ``all positions alignment'' -- if all positions are regarded as match columns, the file can be viewed as an alignment.

If the sort variable is set on the command line, checkseq will indicate whether or not all IDs are unique and whether or not all sequences are unique. This can be time consuming for large data files, which is why checkseq will ignore the default setting of sort if it is non-zero.


10.12.3 genseq

The genseq program generates random sequences from either a standard 1-component regularizer, such as SAM's default per-alphabet regularizers, or a Dirichlet mixture regularizer. In this case, the Dirichlet mixture must be a mixture of sequence compositions, not a mixture of alignment column distributions, as is used in buildmodel.


genseq rseqs -nseq 20 -a protein

The SAM system includes default sequence generation multi-component regularizers for the protein alphabet (rsdb-comp2.32comp.gen) the EHL2 secondary structure alphabet (t99-2d-comp.9comp.gen), and the EBGHTL secondary structure alphabet (t99-ebghtl-comp.6comp.gen), which are the default settings of the genprot_prior, genehl2_prior, and genebghtl variables. If a track_prior is not specified as a single prior file (genseq cannot generate multi-track sequences) and alphabet is protein or EHL2, the genprot_prior or genrhl2_prior is used. Otherwise, the regularizer_file is used, or the default internal regularizer.

The length distribution of the generated sequences is based on the normal distribution. The mean and standard deviation are presented in (natural) log format as gs_mean_log_len (default 5.4151) and gs_sd_log_len (default 1.03632564). The default values are intended for protein sequences.

The sophisticated distribution sampling source code (designed and written by Kevin Karplus) at the core of these routines is availble at:
http://www.cse.ucsc.edu/research/compbio/dirichlets/index.html


10.12.4 permuteseq

The permuteseq program requires a run name and a database file. Its output will be a file of sequences that are permuted copies of the sequences in the database file. If Nseq is specified, the sequence file will be looped through multiple times until the requested total number of sequences are created. That is, in a file of 10 sequences, if Nseq is 15, the output .seq file will contain 10 random sequences, one for each original sequence, followed by 5 random sequences, one for each of the first 5 sequences in the file.

The permuteseq program can be used to verify scoring results. For example, the command


permuteseq permuted -db trna10.seq -Nseq 8 -id TRNA1 -id TRNA2 -id TRNA3
hmmscore permuted -i test.mod -db permuted.seq -sw 2

produces the following distance file, which shows all the permuted tRNA sequences as having poor scores.


Rand5-TRNA2        76        -7.31       -0.34     3.32e+00
Rand7-TRNA1        72        -7.94       -0.28     3.44e+00
Rand1-TRNA1        72        -7.94       -0.27     3.47e+00
Rand2-TRNA2        76        -7.78       -0.24     3.51e+00
Rand4-TRNA1        72        -7.03        0.21     4.41e+00
Rand3-TRNA3        76        -7.47        0.31     4.61e+00
Rand6-TRNA3        76        -7.48        0.67     5.29e+00
Rand8-TRNA2        76        -7.22        0.83     5.57e+00


10.12.5 randseq

The randseq program can be used to randomly sample a database file. It requires a database file (db), and a number of sequences (Nseq), and can optionally be provided a random number seed (trainseed). For example,


randseq rseqs -db nrp -nseq 1000
would place 1000 sequences randomly selected from nrp into the file rseqs.seq.

The randseq program performs the random selection in memory.


10.12.6 splitseq

The splitseq program can be used to split a database according to a sequence length threshold. It requires a database file (db), and a maximum sequence length (max_seq_length). For example,


splitseq split -db nrp -max_seq_length 1000
would place sequences of length less than or equal to 1000 in splitA.seq and sequences of length greater than 1000 in splitB.seq.

The splitseq program performs the selection in memory. It is particularly useful with posterior-decoded alignment styles because they are not yet implemented with a reduced space algorithm.


10.12.7 sampleseqs

The sampleseqs program will, given a model, randomly generate sequences according to the model's probabilities. For example,


sampleseqs sample -i test.mod -Nseq 5
will produce the following sample.seq file of synthetic tRNAs:

>SampleSeq0
AUUAGCCCAUUUGCGACCGCACCGGACUCGCGAGCCUGUAACGCAGGUUCGAAUCCCCCCCGGGCCACCAAUGACCGGUGUG
>SampleSeq1
GUCAGUAGCUUACGUCGUACCAUACUGCUGCGUGAACCCUCGUGUCGAGUCCCUUCGACUCGCGCAAAUCUGCGGCGGCU
>SampleSeq2
GUCCCCCGUGGCGCAGCCGGUAAGACACUCGCGUUGCAUUCGGGAAGUACGUGGGGUUUACAUCGCCGCAGGGCCACUUCGGCCCGGUUGUCCGUGUUGACCUGUCGAUGUUCCCUAG
>SampleSeq3
GCGGUGUGGCGACGGUGUUACAGAACAUGUCCCGCAAUCUAGUAUCCACGGUUCAAUUCUCGGUAGCUGAAUCCACGG
>SampleSeq4
CCGUAGCCUAACUGAGCAGGCUUUAGCUCGCGUAUUGGGCAGGCAGGACUCGCGCGUUCAACCCCAAUCGGGGGCGCGAUC
>SampleSeq5
UCGUGGGUUAGGGUCGAGGGCUGGCUUGUCAGGCUGCAGUCGCAAGGAGACAGGCUCCAGCCGGGGCCUCCAAUCG
>SampleSeq6
CACGUCGUUCAGUGUCAACACCUGGACUUGCAAUCGGGAACCACCGGUUCAAAGCCCUCCGAGUCCAGGACGCU
>SampleSeq7
GUCCGAGCCUAGGUCGUACGGGUAGUCUCGCUCUUUCGGAAGCACGGUUUCGAUUCCCUGUGGCCCGGCGCGCCGG
>SampleSeq8
CGGGGAUAUAGGAGGGAGCGGCGCAAGCUUAGCAAGGGGGGCUGCCGGGGUCAAAUCCUGCCGUGUCCCCCAUAUCC
>SampleSeq9
CACAGAGCGCCGCGGGUAUAGCAUCGGUUUUUCAACCACGGGCGCUCCGGUCGAACCCGGCUCCCGCCCCUUCCAGCAAGCG

Run with no arguments for a usage message.


10.12.8 sortseq

The sortseq program takes as input a database file(s) of sequences (*.sel or *.seq) or an input -alignfile (*.a2m or *.mult) and a distance file (*.dist) of scores generated by hmmscore. It outputs a file of sequences or alignment (depending on input) in the same order as they occur in the distance file. Sortseq provides an enhancement to hmmscore. It outputs a .seq or .a2m file that contains the actual sequences in sorted order while hmmscore outputs a file only containing sorted sequence id's and scores.

For example,


sortseq sort -db trna10.seq -distfile test.dist
will produce the following sort.seq file of sequences:

>TRNA7
GGGCACAUGGCGCAGUUGGUAGCGCGCUUCCCUUGCAAGGAAGAGGUCAUCGGUUCGAUUCCGGUUGCGUCCA
>TRNA2
GCGGCCGUCGUCUAGUCUGGAUUAGGACGCUGGCCUCCCAAGCCAGCAAUCCCGGGUUCGAAUCCCGGCGGCCGCA
>TRNA4
GGGCGAAUAGUGUCAGCGGGAGCACACCAGACUUGCAAUCUGGUAGGGAGGGUUCGAGUCCCUCUUUGUCCACCA
>TRNA9
CGGCACGUAGCGCAGCCUGGUAGCGCACCGUCCUGGGGUUGCGGGGGUCGGAGGUUCAAAUCCUCUCGUGCCGACCA
>TRNA10
UCCGUCGUAGUCUAGGUGGUUAGGAUACUCGGCUUUCACCCGAGAGACCCGGGUUCAAGUCCCGGCGACGGAACCA
>TRNA1
GGGGAUGUAGCUCAGUGGUAGAGCGCAUGCUUCGCAUGUAUGAGGCCCCGGGUUCGAUCCCCGGCAUCUCCA
>TRNA8
GGGCCCGUGGCCUAGUCUGGAUACGGCACCGGCCUUCUAAGCCGGGGAUCGGGGGUUCAAAUCCCUCCGGGUCCG
>TRNA5
GCCGGGAUAGCUCAGUUGGUAGAGCAGAGGACUGAAAAUCCUCGUGUCACCAGUUCAAAUCUGGUUCCUGGCA
>TRNA3
GGCCCUGUGGCUAGCUGGUCAAAGCGCCUGUCUAGUAAACAGGAGAUCCUGGGUUCGAAUCCCAGCGGGGCCUCCA
>TRNA6
GGGGCCUUAGCUCAGCUGGGAGAGCGCCUGCUUUGCACGCAGGAGGUCAGCGGUCGACCCGCUAGGCUCCACCA


10.12.9 uniqueseq

The uniqueseq program takes as input a database file(s) of sequences (*.sel or *.seq) or a single alignment file. It sorts the sequences and outputs a file containing every sequence with a unique ID (i.e. no dupes are copied to the output file). It also outputs messages to the user informing of the following conditions in the database file: same id for different sequences; same sequence for different ids; and duplicate id found.

If an alignment file is specified the fractional percent_id of the sequence's match characters as aligned are present in another sequence. If two sequences mutually satisfy this threshold, the one that is more identical to the other is removed, or the one that has the fewest total number of match columns if their relative percentages are the same. The order in which this is performed is undefined, so that when percent_id is less than the default 1.0, it is possible to get to different results with different orderings of an alignment due to transitivity considerations. If percent_id is negative, a message is printed about each sequence that is dropped.

If an alignment file is specified and aligncheckonly is cleared to 0 (from the default 1), then duplicate sequences are removed from the alignment based only on the sequence data (not the alignment) prior to thinning the alignment.

If the optional -train mytrainfile.seq is used, the program checks the sequences in the trainfile and writes any sequence not in the database file to the output file. In this case, the following user information messages are output: training sequence (ID number) is in database with different ID (ID number); training sequence (ID number) is in database with same ID; training sequence (ID number) not in database (note: the program currently checks only one training file).

For example,


uniqueseq unique -db testme.seq -train trna10.seq
uniqueseq unique1 -alignfile testme.a2m
will produce unique.seq and unique1.a2m, respectively.

Messages about which sequences are dropped are printed to standard error.


10.12.10 mk_kestrel_db

The mk_kestrel_db command creates databases for use with Kestrel hmmscore. These databases are in a format optimized for scoring with the Kestrel processor and parallel the standard sequence databases. For each db parameter specified, mk_kestrel_db will create a .kids file and either a .kseq file or a .krseq if subtract_null is set to 4. These files will be in the same directory as the db file and will have the same name, with the extension appended. The runname argument is ignored.


next up previous contents
: 11 System installation : SAM (Sequence Alignment and : 9 The buildmodel estimation   ܼ
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group