next up previous contents
: ʸ : SAM (Sequence Alignment and : 3 Quick overview   ܼ



4 SAM-T2K

The SAM-T99 method is an iterative HMM method. It is a refinement of the methods used at UCSC for the 1996 CASP2 protein structure prediction contest:

K. Karplus, K. Sjölander, C. Barrett, M. Cline, D. Haussler, R. Hughey, L. Holm, and C. Sander, ``Predicting Protein Structure Using Hidden Markov Models,'' in Proteins: Structure, Function, and Genetics, Supplement 1:134-139, 1997.

as well as the 2000 CASP4 and 1998 CASP3 contests:

K. Karplus, R. Karchin, C. Barrett, S. Tu, M. Cline, M. Diekhans, L. Grate, J. Casper, and R. Hughey, ``What is the value added by human intervention in protein structure prediction?,'' invited for Proteins: Structure, Function, and Genetics, 2001.

K. Karplus, C. Barrett, M. Cline, M. Diekhans, L. Grate, R. Hughey, ``Predicting Protein Structure using Only Sequence Information'' Proteins: Structure, Function, and Genetics, Supplement 3:121-125, 1999.

The CASP3 SAM-T98 method is described in detail in the following article, which should be cited along with the SAM paper whenever the SAM-T2K is used:

K. Karplus, C. Barrett, and R. Hughey, ``Hidden Markov Models for Detecting Remote Protein Homologies,'' Bioinformatics, 14(10):846-856, 1998.

SAM-T98 is shown to be better for remote homology detection than BLAST, FASTA, ISS, and PSI-BLAST in

J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia, ``Sequence Comparisons Using Multiple Sequences Detect Twice as many Remote Homologues as Pairwise Methods,'' Journal of Molecular Biology, 284(4):1201-1210, 1998.

6: Performance of SAM-T2K and other methods on SCOP super-familily discrimination.
\begin{figure}\centerline{\psfig{figure=sam-t98-t99.eps,width=0.6\textwidth}}
\end{figure}

The performance of SAM-T2K, SAM-T98, Double-BLAST (finding a set of close BLAST hits to a sequence and useing them for a second BLAST search), and Smith and Waterman with a reverse-sequence null model are shown in Figure 6.

This section illustrates the use of the SAM 3.0 programs and scripts to perform a remote homology search.

The method has several options for handling somewhat different tasks. The next sections will describe a few such tasks:


4.1 SAM-T2K for superfamily modeling

The SAM-T2K method has been encapsulated in a single perl script: target2k. This script accepts a single protein sequence as input (the target sequence), does a search of a non-redundant protein database, and returns a multiple alignment of sequences similar to the target. The default parameters have been adjusted to give good performance at recognizing sequences related at the superfamily level of the SCOP database [7].

For example, let's start with a single hemoglobin (say PDB sequence 1babA), and try to find as many other globins as we can. The initial protein sequence should be in FASTA format, and all uppercase, since it is used by modelfromalign to build an HMM. The 1babA sequence is in the demos subdirectory as 1babA.seq.

The command


target2k -seed 1babA.seq -out 1babA-t2k
will search the non-redundant protein database (see Section 4.11) and return a multiple alignment of protein sequences similar to 1babA. The seed sequence is provided first, followed by the similar sequences sorted in order of how well they fit the HMM implied by the multiple alignment. The best-scoring sequence is quite frequently not the sequence that was used as the seed.

Since target2k produces voluminous output to the standard error output (so you can see the progress of the run), you may want to redirect the standard-error output to a file (the syntax for this depends on which UNIX shell you run). Note that the multiple alignment, with over 1250 sequences, contains not only hemoglobins, but also myoglobins and leghemoglobins.

You may want to create an HMM from the multiple alignment, in order to search another database (for example, the PDB database), or to score sequences that were selected by some other method. Although you can use modelfromalign directly to do the HMM construction, we have provided some simple scripts that set the parameters appropriately. The build-weighted-model script is a simple wrapper for uniqueseq and modelfromalign, and scripts like w0.5 and w0.5 set the parameters of build-weighted-model.

7: Globin score histogram for SAM-T2K model trained on 1babA.
\begin{figure}\centerline{\psfig{figure=1babA-globins.ps,width=0.45\textwidth}}
\end{figure}

To build a model for searching for globins, try the following command:


w0.5 1babA-t2k.a2m 1babA-t2k-w0.5.mod
This model can then be used for scoring a set of sequences:

hmmscore 1babA-globins -i 1babA-t2k-w0.5.mod -db globins50.seq -sw 2
producing 1babA-globins.dist. Note that all the globins in this small database score extremely well (Figure 7) with this model (E-values 1.6e-11 or smaller, NLL$-$reverse $\leq
-28.79$)--compare with the results presented in Section 8.1.1 on page [*] for models built from a few globin sequences.

In a 1995 test set of 12,219 sequences based on the PIR database [10], the 1babA-t2k-w0.5 model scored all 505 globins with E-values $\leq 0.0024$ (and $\leq$9.2e-12, if you exclude the 45-residue fragment GGNW3B) and all non-globins with E-values $\geq 25$ providing a very clean separation of globins from non-globins. Note that this HMM does much better than prior reports of HMMs constructed automatically for globin recognition, which did not achieve perfect separation even when given 400 or 10 known globins [3,6]. Warning: globins are a fairly easy protein domain to recognize, and this clean a separation cannot be expected for more difficult domains.

On the SCOP pdb90d domain database (version 1.37 with 2466 sequences), the model scored all globins with E-values $\leq$ 4.0e-07, and non-globins with E-values $\geq$ 0.82. The phycocyanins, which are in the same superfamily as globins, but a different family, had E-values ranging from 630 to 2000, and were emphatically not recognized by the HMM. Thus the model generated from 1babA should be regarded as a family model, and not as a superfamily model.

One other important warning: the E-value reported is an estimate of how many sequences would score that well with the final model in a random database. A very low E-value means that the sequence or a very similar one was included in the training set from which the model was built--it does not necessarily mean that the sequence is very similar to the seed sequence. With any iterated search method (SAM-T98, SAM-T2K, PSIBLAST [1], ISS [9], ...) including a false positive on any iteration results in much too strong a score for that sequence and similar sequences on subsequent iterations.

Since the default thresholds for SAM-T2K include sequences with E-values as large as 0.01 (measured in the non-redundant database) for the last iteration, E-values less than

\begin{displaymath}
0.01 \frac{\mbox{size}(\mbox{searched database})}
{\mbox{size}(\mbox{nonredundant protein database})}
\end{displaymath}

are roughly equivalent. For the PIR test set, this threshold is about $0.01\ 12,216/415,133 = 0.0004$ and for the pdb90d, the threshold is about $0.01\ 2466/415,133 = 0.00006$. Since all the globins score much better than that, about all we can say is that they fit the model--there are no sequences being found that are very different from the ones used in creating the HMM.

On a test of 935 SCOP domains (the PDB40D Jong Families [8]), the E-values can be seen to be somewhat conservative for values larger than 0.1, but the false-positive rate remains above 0.01 even for very small E-values (see Figure 8). This ``fat-tail'' phenomenon occurs not only for SAM-T2K, but for almost all homology-detection methods--there are some statistically significant sequence similarities that do not cause proteins to fold the same way. The strongest ``false positive'' in this test set is actually a mislabeling of one of the domains, fixed in more recent releases of the SCOP domains.

8: This graph compares the theoretically expected number of false positives per query (the E-value), with the observed number of false positives per query in an all-vs-all test of 935 SCOP domains. The different curves reflect the different levels of similarity one can get from the SCOP hierarchy. The multiple alignments were built with the default values of the target99 script, and the HMMs were built using fw0.7, and scored with hmmscore -sw 2 -fimtrans 1.
\begin{figure}\begin{center}
\parbox{0pt}{\psfig{figure=t99-fw0.7-calibrate.eps,width=0.9\textwidth}}
\end{center}
\end{figure}

4.2 Improved verification of homology

If we have a conjectured homology between two sequences, perhaps as a result of a search using a SAM-T2K model, we can improve our confidence in the result by combining the results from two different .dist files. For each sequence, we build a multiple alignment using target2k, then create an HMM from the alignment and use it to score the other sequence. The two ``Reverse'' scores can be averaged and converted to an E-value (See Section 10.2.). Figures 9 and 10 show the improvement one gets from averaging scores in this way.

9: This graph compares the theoretically expected number of false positives per query (the E-value), with the observed number of false positives per query in an all-vs-all test of 935 SCOP domains. False positives are sequences in different folds (in the SCOP hierarchy), true positives are sequences in the same superfamily. Note that the averaging technique makes the E-value estimates considerably more conservative. The multiple alignments were built with the default values of the target2k script, and the HMMs were built using w0.5, and scored with hmmscore -sw 2 -fimtrans 1.
\begin{figure}\begin{center}
\parbox{0pt}{\psfig{figure=t99-fw0.7-avg-vs-rev.eps,width=0.9\textwidth}}
\end{center}
\end{figure}

10: This graph the number of false positives (different folds in the SCOP hierarchy) versus the number of true positives (same superfamily) in an all-vs-all test of 935 SCOP domains. The multiple alignments were built with the default values of the target99 script, and the HMMs were built using fw0.7, and scored with hmmscore -sw 2 -fimtrans 1. The averaging method finds considerably more superfamily relationships at every level of false positives accepted.
\begin{figure}\begin{center}
\parbox{0pt}{\psfig{figure=t99-fw0.7-fp-vs-tp.eps,width=0.9\textwidth}}
\end{center}
\par
\end{figure}


4.3 Family-level multiple alignments

The default parameters for the target2k script have been set for fairly good performance on recognizing superfamilies. If you want to separate one family from another (as Pfam does [12]), this level of generalization is usually too extreme.

By specifying the option -family, you can request a different set of parameters intended for building family-level multiple alignments from a seed sequence. These parameters have not been tested yet, and are offered only as a starting point for further tweaking.

The Pfam database creates family-level HMMs from hand-curated seed alignments [12]. The SAM-T2K method can also be started with a seed alignment, rather than a single sequence--simply provide the alignment as an a2m file with target2k's -seed option.

The -seed option is also useful for superfamily recognition--especially if a structural alignment is available to use as a seed.


4.4 Modeling non-contiguous domains

Some protein domains are formed from non-contiguous pieces of the backbone. For example, the SCOP database [7] has a GroES-like fold for alcohol dehydrogenase (d2ohxa1) consisting of residues 1-174 and 325-374, with a Rossmann-fold domain in the middle.

To model a domain like this, the HMM should have a FIM (See Section 8.5.) for the long insert. The target2k script can handle this--simply replace the amino acids of the inserted domain by a single O, which will create a FIM (See Section 10.7.).

The demo file d2ohxa1.seq has the appropriate seed alignment for the multi-helical domain,


>d2ohxa1 2.22.1.2.1 (1-174,325-374) Alcohol dehydrogenase [horse (Equus caballus)]
STAGKVIKCKAAVLWEEKKPFSIEEVEVAPPKAHEVRIKMVATGICRSDDHVVSGTLVTP
LPVIAGHEAAGIVESIGEGVTTVRPGDKVIPLFTPQCGKCRVCKHPEGNFCLKNDLSMPR
GTMQDGTSRFTCRGKPIHHFLGTSTFSQYTVVDEISVAKIDAASPLEKVCLIGC
O
KDSVPKLVADFMAKKFALDPLITHVLPFEKINEGFDLLRSGESIRTILTF
and the command

target2k -seed d2ohxa1.seq -out d2ohxa1-t2k
will build the multiple alignment.

The FIM will be created in the HMMs that are used for aligning the sequences found by the SAM-T2K method. Furthermore, since the first sequence of the resulting multiple alignment will have an O, the FIM will be created in any HMMs built from the output alignment.

The HMM d2ohxa1-t2k-w0.5.mod scores the alcohol dehydrogenase family members (in pdb90d version 1.37) with E-values $\leq$ 6.0e-31. The best-scoring sequence outside the superfamily is the central domain of the alcohol dehydrogenase with E-value 2.0e-08, which creeps into the model due to misalignments despite the FIM. The best-scoring completely unrelated domain is d2tmda3 (Trimethylamine dehydrogenase, middle, ADP-binding domain) with E-value 0.027

The chaperonin-10 sequences in the GroES-like superfamily but not the alcohol dehydrogenase family have E-values 16 and 270, so this HMM is best viewed as a family-level model, not a superfamily model. The E-values are reasonable estimates of the number of false positives (46 and 255) in the pdb90d database for this search.

The central domain of 2ohxA is an NAD(P)-binding Rossmann-fold domain, a large superfamily of alpha/beta/alpha units with a parallel beta-sheet of 6 strands (order 321456). This superfamily is fairly diverse, and difficult to capture in a single HMM. Let's start with d2ohxa2.seq:


>d2ohxa2 3.19.1.1.1 (175-324) Alcohol dehydrogenase [horse (Equus caballus)]
GFSTGYGSAVKVAKVTQGSTCAVFGLGGVGLSVIMGCKAAGAARIIGVDINKDKFAKAKE
VGATECVNPQDYKKPIQEVLTEMSNGGVDFSFEVIGRLDTMVTALSCCQEAYGVSVIVGV
PPDSQNLSMNPMLLLSGRTWKGAIFGGFKS
and build the HMM as before

target2k -seed d2ohxa2.seq -out d2ohxa2-t2k 
w0.5 d2ohxa2-t2k.a2m d2ohxa2-t2k-w0.5.mod

Scoring the SCOP pdb90d (version 1.37) database gives us the first false positive at E-value 0.022 (Dihydrolipoamide dehydrogenase) and at the minimum-error point (E-value 1.1), there are 35 true positives, 4 false positives, and 19 true negatives. All seven members of the same family as d2ohxa2 score extremely well (E-values $\leq$9.0e-25), but there are 20 other members of the superfamily before the first false positive, so this can genuinely be viewed as a superfamily model, and not just a family model.


4.5 Building an HMM from a structural alignment

We can try to make a better model for the Rossmann-fold superfamily containing d2ohxa2, by starting from a structural alignment of two of its members. The model built using just d2ohxa2 had particular difficulty recognizing the lactate dehydrogenases and the malate dehydrogenases, so perhaps we could improve the model by including one of them in the alignment. In the FSSP [5] alignment for 2ohxA, the highest Z-score of these is for 2cmd.

From the FSSP alignment, we can extract the aligned section that seems to match the domain we are interested in:


>2ohxA-domain Alcohol dehydrogenase (holo form) complex with nadh and
STCAVFGLGGVGLSVIMG
CKAAGAARIIGVDINKDKFAKAKEVGATECVNPQDYKKPIQEVLTEMSNGGVDFSFEVIGRLDTMVTALS
CCQEAYGVSVIVGVPPDSQNLSMNPMLLLSGRTWKGAIFGGFKSKDS
>2cmd-domain Malate dehydrogenase
MKVAVLGaAGGIGQALAL
LLKTQLpsg-SELSLYDIAPVTPGVAVDLshiptavk-IKGFSGE---DATPAL--EGADVVLIS
AGvrrkpgmdrsdlfnvNAGIVKNLVQQVakTCPKACIGIITNPvnttvaiaa---EVLKkaGVYDk
n--KLFGVT-TLDIiRSNT

The a2m file can be a bit unreadable, but we can use prettyalign to make sure that the alignment is the one we intend:


                      10        20           30        40             
                       |         |            |         |             
2ohxA-domain STCAVFG.LGGVGLSVIMGCKAAG...AARIIGVDINKDKFAKAKEV........GA
2cmd-domain  MKVAVLGaAGGIGQALALLLKTQLpsg-SELSLYDIAPVTPGVAVDLshiptavk-

                50        60        70                       80                         |         |         |                        |        2ohxA-domain TECVNPQDYKKPIQEVLTEMSNGGVDFSFEVIG...............RLDTMVTAL 2cmd-domain  IKGFSGE---DATPAL--EGADVVLISAGvrrkpgmdrsdlfnvNAGIVKNLV

              90         100                110           120       13                |           |                  |             |          2ohxA-domain SCC..QEAYGVSVIVGVP.........PDSQNLSMNP..MLLL..SGRTWKGAIFGG 2cmd-domain  QQVakTCPKACIGIITNPvnttvaiaa---EVLKkaGVYDkn--KLFGVT-TL

             0                    |       2ohxA-domain FK.SKDS 2cmd-domain  DIiRSNT

If we build the HMM as before


target2k -seed 2ohxA-2cmd.a2m -out 2ohxA-2cmd-t2k
w0.5 2ohxA-2cmd-t2k.a2m 2ohxA-2cmd-t2k-w0.5.mod
and score the SCOP pdb90d (version 1.37) database, we get the first false positive at E-value at 3.2e-04 and at the minimum-error point (E-value 9e-04), there are 31 true positives, 1 false positive, and 23 false negatives.

Although this model scores the superfamily well (40 of the 54 members have E-values less than 1.0), it also scores 19 incorrect domains as well--14 of them dihyrodolipoamide dehydrogenase and related folds. If we selected a threshold for the d2ohxa2-t2k-w0.5.mod HMM that accepted 40 of the superfamily members, we would have had 54 false positives, so adding a structurally-aligned homolog has clearly let us create a better model of the superfamily.

If we add one more poorly-found sequence to the structural alignment (say 2pgd), we can improve the model further. Starting from the alignment


               10        20            30        40            50     
                |         |             |         |             |     
2pgd  DIALIG.LAVMGQNLILNMNDHG...F.VVCAFNRTVSKVDDFLANEAKGT....KVLGAH..S
2cmd  KVAVLGaAGGIGQALALLLKTQLpsgS.ELSLYDIA-PVTPGVAVD-LSHIptavKIKGFSgeD
2ohxA TCAVFG.LGGVGLSVIMGCKAAG...AaRIIGVDINKDKFAKAKEV---ga..-TECVN..P

                60         70             80            90       100                    |          |              |             |         |   2pgd  .....LEEMVSKLKKPR.RIILLVK.....AGQAVDNFIEKL....VPLLDIGDIIIDGGNSEY 2cmd  .....ATPALEGA--D.VVLISAGv15nvNAGIVKNLVQQV....AKTCP-KACIGIITNPVN 2ohxA qd7piQEVLTEMSNGGVdFSFEVIG.....--RLDTMVTAlscc--QEAYGVSVIVGVPPD

                110           120                        |             |      2pgd  RD.....TMRRCRDLK..DKGI..LFVGSGVS 2cmd  TT.....VAIAAEVLKkaGVYDknKLFGVTTL 2ohxA SQnlsmn---PMLL..LSGR..TWKGAIFG

and building a SAM-T2K alignment as before, the 2pgd-2cmd-2ohxA-t2k-w0.5.mod HMM finds 24 true positives before the first false positive, and 19 false positives at 40 true positives. There are still some bad true negatives--the worst is d1scua2 (Succinyl-CoA synthetase, alpha-chain, N-terminal (CoA-binding) domain) at E-value 570, with 346 incorrect sequences scoring better.

If we have a structural alignment of dissimilar homologs, it may not seem necessary to run the target2k method. If we build a model from the structural alignment with the command


w0.5 2pgd-2cmd-2ohxA.a2m 2pgd-2cmd-2ohxA-w0.5.mod
we can get 23 true positives before any false positives, 26 true positives with only 4 false positives, and 40 true positives with 129 false positives.

For very low false-positive rates, our best results on this domain were with the SAM-T2K method applied to a single sequence, with 27 true positives before a false positive. If we want to find the remote homologs, accepting a few false positives, we get better results by using a structural alignment as the seed (40 true positives with only 19 false positives). The SAM-T2K method does get better results on this example than using just a structural alignment without searching for more homologs.


4.6 Improving existing multiple alignments

You can clean up an existing multiple alignment by using it as a seed and specifying the -tuneup option. This option turns off the search for similar sequences and turns off the -force_seed 1 option that normally copies the seed alignment without modification.

The seed alignment will be used to create an HMM, then the sequences in the alignment will be used as the set of potential homologs to search and align. The output alignment may contain only a subset of the original sequences, if some of the sequences score too poorly to meet the thresholds. If you want to include all the sequences, set the -thresholds variable to have a very large final threshold.

The -tuneup option can also be used to add unaligned sequences to an existing multiple alignment with the -homologs option. For example,


target2k -seed 1babA.seq -homologs globins50.seq -tuneup 

         -db_size 400000 -out 1babA-50
adds 50 globin sequences to the single 1babA globin alignment, creating a multiple alignment of 51 sequences. The -db_size option says what size database should be assumed for computing E-values. If it is omitted, the size of the non-redundant database that would normally be searched is used.


4.7 Creating a multiple alignment from unaligned sequences

It isn't really necessary to specify a seed for the SAM-T2K method--if the -close or -homologs option is used, then an initial multiple alignment can be created by buildmodel from the specified unaligned sequences. If both -close and -homologs are given, then only the close set are used for the initial alignment, but the full set of homologs are used for subsequent iterations.

For example,


target2k -homologs globins50.seq -tuneup -db_size 400000 -out globins50-tuned
will align the 50 globin sequences without using a seed sequence, leaving the alignment in globins50-tuned.a2m.

4.8 Parameters for target2k perl script

The parameters of the target2k perl script are not described in the parameters section (See Section 6.), since they are not parameters to the SAM programs and are not parsed by the same routines. One important difference is that the target2k parameters are interpreted in the order they are seen on the command line--if two options conflict, the one that is set later overrides the earlier one.

The target2k parameters can be roughly grouped into four classes:

required parameters
The -out parameter is always required, to specify where the results are to be put, and at least one of -seed, -close, and -homologs (usually -seed) must be specified to tell the script what to build a model for. The -seed, -close, and -homologs files may be gzipped.

modes
The -superfamily, -family, and -tuneup parameters set many internal variables for three of the most common tasks expected for the script. The superfamily settings are the default values if they are not overridden by explicit parameter settings.

major parameters
There are many parameters that can be used to override the settings given by the major modes. The most commonly used parameters are probably -iter, -thresholds, -db_size, -homologs, -close, and -db.

minor parameters
Many of the other parameters are provided mainly for developers who want to tweak various internal parameters to get better performance.

Here are the individual parameters to target2k in alphabetical order:

-all
Requests that the intermediate multiple alignments created on each iteration be kept as root_1.a2m, root_2.a2m, ... . This option is very useful for determining when contaminating sequences were introduced into a multiple alignment, so that thresholds can be appropriately modified to exclude them.

-aweight_bits 0.5
How many bits/column the HMM should save relative tot he background distribution after regularizing a multiple alignment with the Dirichlet mixture specified by -mixture.

We have gotten the best results with about 0.8 bits/column for the mixture recode4.20comp, and around 0.5 bits/column for recode3.20comp. You may want to increase the bits/column for short models or for doing family-level models.

-aweight_exponent 0.5
This exponent is interpreted in two different ways, depending on the setting of -aweight_method. (See Section 9.4.2.) For -aweight_method 1 (the EntropyWeight method), the exponent should be large (around 10 works well).
For -aweight_method 2 (the Henikoffs' method[4]), the exponent only affects the initial values, and a value around 0.5 works well.

-aweight_method 2
The -aweight_method is used with -aweight_exponent and -aweight_bits to specify how sequence weighting is done: 1=EntropyWeight, 2=HenikoffWeight, 3=FlatWeight. See Section 9.4.3..

-blast_max_report 20000
Maximum number sequences reported by wu-blastp or blast2, passed to the wu-blast-prefilter or ncbi-blast-prefilter script. You can set this parameter smaller to speed up the script, at the cost of losing a fair amount of generality on some models (such as immunoglobulins and zinc fingers), where there are many very similar sequences.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-close filename
File containing close homologs of the seed sequence (may be compressed with gzip). These close homologs are used to train the model on the first training iteration. If this parameter is not specified, but -homologs is specified, then the file from the -homologs option is used for the close homologs. If neither -close nor -homologs are specified, then the close homologs and potential homologs are searched for in the non-redundant protein database using blastp, but with a much tighter threshold for the close set than the potential-homolog set.

It is a good idea to set -db_size explicitly when no search is being done.

-constraints 1

Should constraints be used? Zero means no constraints; one means use constraints generated from the seed alignment. See Section 9.7. Even if the -force_seed option is set, and the seed alignment will be copied directly to the multiple alignment on each iteration, the constraints are useful to keep buildmodel from modifying the seed alignment while it works.

Not only does -constraints 1 keep the seed alignment intact, it also propagates .cst files through the entire process, so that they can be used with the final multiple alignment.

-cut_match 0.2
This parameter is passed to buildmodel to control how surgery is done. Since -nsurgery is normally set to 0 to turn surgery off, this parameter is usually irrelevant.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-cut_insert 0.25
This parameter is passed to buildmodel to control how surgery is done. Since -nsurgery is normally set to 0 to turn surgery off, this parameter is usually irrelevant.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-db nr

Specifies the filename for the non-redundant database to search and to use for determining -db_size for E-value calculations. The database must be in FASTA format and have a wu-blastp index. Because it is searched by wu-blastp, it cannot be compressed with gzip.

The script provided has ``nr'' as a legal value for -db, which expands to the filename of the non-redundant protein database specified when the script was installed. It is straightforward to add more such shorthand names for databases (in the process_command_line subroutine).

If -db is not specified, then the filename that ``nr'' expands to is used.

-db_size <int>
Specifies number of sequences to assume are being searched for E-value calculations. If not specified, then size of database specified in -db is assumed.

Note: when using -tuneup, -no_search, -homologs, or -close, no search of the database is done, but the count of the default database (or database specified with -db) is still done to get the appropriate size for E-value calculations unless -db_size is explicitly set.

-family
Sets options appropriately for modeling a family. These options have not been carefully tuned, and are just a guess at initial parameters. Individual parameters can be overridden by specifying them later in the command.

-fimstrength 1.0
Probability multiplier for letters in FIMs. This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-fimtrans -1.0
FIM and standard insert-to-insert transitions. Absolute value is the cost (negative-log probability) of FIM insert-to-insert transition divided by average match-to-match cost. When negative, non-FIM insert-to-insert transitions are set to $p-(1-f)p^2$, where $p$ is the regularized and normalized frequency counts for the transition and $f$ is the FIM insert-to-insert transition. This parameter controls how much the model prefers to absorb letters into FIMs rather than accepting poor matches at the ends of the matched region. Values around 1.0 and $-1.0$ generally work best.

tt -final_adpstyle 5
Alignment style to use for making final alignment from trained HMM. 1 indicates Viterbi, and 5 indicates the much slower but more accurate posterior-decoded style. Alignment style is controled by init_adpstyle, search_adpstyle, realign_adpstyle, and final_adpstyle.

-final_trans_reg fssp-trained.regularizer
Transition regularizer used for the last iteration. See Section 8.1.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-force_seed 1
Set to 1 (default) to force all of the seed alignment into the training set and the multiple alignment at each iteration.

The -tuneup option implies -force_seed 0, with the expectation that the seed alignment is to be modified. It is possible to override -tuneup's setting, which might be desirable when adding a new homologs to a seed alignment that isn't to be changed.

-frac_insert 0.3
This parameter is passed to buildmodel to control how surgery is done. Since -nsurgery is normally set to 0 to turn surgery off, this parameter is usually irrelevant.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-full_seq_align 0
Set to 1 to use full-sequence alignment, not multiple-domain alignment for final alignment. If you want to preserve the original sequence names, this option is essential, since the normal action of multiple-domain alignment modifies the names to include information about which domain was found.

-homologs filenname
specifies a file containing potential homologs of the seed (may be compressed with gzip). When -homologs is set, no search of the database is done, as it is assumed that all potential homologs have been identified. If this parameter is not specified, but -close is specified, then the file from the -close option is used for the potential homologs as well as for the close homologs. If neither -close nor -homologs are specified, then the close homologs and potential homologs are searched for in the non-redundant protein database using blastp, but with a much tighter threshold for the close set than the potential-homolog set.

If the option is not used, the database specified in -db (or the default non-redundant database) is searched with blastp to find potential homologs.

It is a good idea to set -db_size explicitly when no search is being done.

include_all 0
In each call to select_seq, include all sequences from the homologs set, no matter how bad the scores. Used for the -tuneup option.

-init_trans_reg gap1.5.regularizer
Transition regularizer used for the first iteration. See Section 8.1.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

tt -init_adpstyle 1
Initial alignment style. 1 indicates Viterbi, and 5 indicates the much slower but more accurate posterior-decoded style. Alignment style is controled by init_adpstyle, search_adpstyle, realign_adpstyle, and final_adpstyle.

-iter 4
The number of iterations of hmmscore and buildmodel to perform. Normally, -iter is set to match the number of E-value thresholds set--if -iter is smaller, the extra thresholds are ignored, and if -iter is larger the final threshold is repeated for subsequent iterations.

-jump_in 0.2
Probability associated with jump-in on local alignment.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-jump_out 1.0
Probability associated with jump-out on local alignment.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-keep_temporary
Don't delete temporary directories. The temporary directories can get fairly large, so they are deleted when the target2k script finished normally. When a part of the script fails badly, the temporaries are usually kept to help track down the problem.

When the script runs to completion, but produces a poor result, finding the problem can be difficult, so the -keep_temporary option is available to aid debugging. This option is not only useful to script developers, but also to users who want to look ``under-the-hood'' to see how the script is producing the results it gets.

-mid_trans_reg stiff-gap5.regularizer
Transition regularizer for all iterations except the first (-init_trans_reg) and last (-final_trans_reg) iteration. See Section 8.1.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

-mixture recode3.20comp
Dirichlet mixture to use for regularizing amino acid distributions. We have gotten the best results on aligning remote homologs by using recode3.20comp. See Section 8.1.1.

-ncbiblast
Use NCBI BLAST2 for initial prefilter via the ncbi-blast-prefilter script. The -wublast option specifies use of WU-BLAST, when available. See Section 4.11.

-no_search
Use only the sequences from -seed, -homologs, or -close as potential homologs-don't look for more.

The -no_search option is implied by -homologs -close, and -tuneup. There is currently no option for turning the search back on.

It is a good idea to set -db_size explicitly when no search is being done.

-no_sort
Turn of sorting of sequences by score.

-nsurgery 0
Specifies the number of surgery steps to allow on each iteration (default 0). Surgery is always allowed when creating an initial alignment (-seed not specified).

Setting -nsurgery positive is not compatible with -force_seed 1--try -force_seed 0 and -constraints 1 instead.

-out root
creates root.a2m for output (also root.cst if constraints are turned on). This is the only absolutely essential parameter.

prefilter_thresholds 0.01, 1.0, 10.0, 400.0
E-value thresholds used with each iteration with the blast prefilter.

tt -realign_adpstyle 1
Alignment style to use for making alignment from trained HMM on intermediate iterations. 1 indicates Viterbi, and 5 indicates the much slower but more accurate posterior-decoded style. Alignment style is controled by init_adpstyle, search_adpstyle, realign_adpstyle, and final_adpstyle.

-reverse_diff 4
How much tighter is the simple threshold than the reversed-sequence threshold, in nats The selection of sequences to include in the multiple alignment at each iteration requires passing four tests:
  1. NLL$-$null score for all-paths local scoring better than some threshold.
  2. NLL$-$reversed E-value for all-paths local scoring better than some threshold.
  3. NLL$-$null score for Viterbi local scoring of found domain better than some threshold.
  4. NLL$-$reversed E-value for Viterbi local scoring of found domain better than some threshold.
The simple tests (steps 1 and 3) require a threshold in nats. This threshold is computed from the E-value threshold used for steps 2 and 4, then tightened by -reverse_diff.

Making -reverse_diff very small (say $-10$) essentially turns off steps 1 and 3, slowing down the script but possibly increasing the number of things found (both true and false positives).

Making -reverse_diff very large requires very good matches to the model before more detailed scoring is done, speeding the script up, but possibly losing some true positives.

This is a minor parameter, intended only to allow developers to tweak parameters without modifying the installed script.

tt -search_adpstyle 1
Alignment style to use for selecting training set for retraining HMM. 1 indicates Viterbi, and 5 indicates the much slower but more accurate posterior-decoded style. Alignment style is controled by init_adpstyle, search_adpstyle, realign_adpstyle, and final_adpstyle.

-seed file.a2m
a guide sequence (or multiple alignment) in a2m format (may be compressed with gzip). See Section 10.1. The seed is usually expected, but if none is provided and either -homologs or -close is specified, then an initial alignment will be built from the homologs (the close ones being preferred if both are specified).

-superfamily
Sets many options appropriately for modeling a superfamily given a single sequence. These are the default options. Individual parameters can be overridden by specifying them later in the command.

-thresholds 0.00001, 0.0002, 0.001, 0.005
specifies the E-value thresholds to use for each iteration (E-value is expected number of false positives). The values should be given as a comma-separated list with no spaces. If the number of thresholds is more than the -iter parameter, then the extra ones are ignored. If there are fewer thresholds than iterations, the last threshold is repeated as needed. The E-values are computed for a database the size of the non-redundant database searched (unless -db_size is specified).

tmp_dir
Directory in which to place the temporary directory used to hold intermediate files. The default is set by the configuration program sam-t2k.conf.

-tuneup
Sets many options, under the assumption that what is wanted is an improvement of an exiting alignment, or the alignment of an already selected set of homologs. Some of the important parameters it sets are -no_search, -force_seed 0, -full_seq_align, and the thresholds. Eliminates search, assumes that all Individual parameters can be overridden by specifying them later in the command, except that there is no way to turn searching back on.

-wublast
Use WU-BLAST for initial prefilter vie the wu-blast-prefilter script. The -ncbiblast option specifies use of NCBI BLAST2, when available. See Section 4.11.


4.9 The model building scripts

The model building scripts primarily differ on the alignment weighting methods that is used. The primary parameter is the number of bits to save in each column of the alignment in comparison to the background. Saving more bits will result in a more specialized model, while saving fewer bits will result in a more general model. The various scripts default to certain transition regularizers that also affect the generality of the model. Readers may wish to read the perl source code to find out more about these routine, for example to change the parameter settings passed to build-weighted-model to only drop of sequences with 100% identity. See Section 9.4.3 and Section 8.1.

11: Sequence logo of the 2ohxA model built using the w0.5 model building script.
\begin{figure}\begin{center}
\psfig{width=0.8\textwidth,figure=d2ohxa1-t2k-w0.5.eps}
\end{center}\end{figure}

12: Sequence logo of the 2ohxA model build using the w0.7 model building script. The specificity of the model is increased, as indicated by the higher bits per column.
\begin{figure}\begin{center}
\psfig{width=0.8\textwidth,figure=d2ohxa1-t2k-w0.7.eps}
\end{center}\end{figure}

: Sequence logo of the 2ohxA model build using the w1.0 model building script. The specificity of the model is greatly increased. The large `O' indicates that one of the sequences in the alignment had been cut (such as in FSSP), resulting in an internal FIM position. (See Section 10.7.) This sequence is trimmed by the w0.5 script as having greater than 80% identity to another sequence.
\begin{figure}\begin{center}
\psfig{width=0.8\textwidth,figure=d2ohxa1-t2k-w1.0.eps}
\end{center}
\end{figure}

Our favored all-around model building script is w0.5, which builds a SAM HMM with a target entropy of 0.5 bits per column. The w1.0 script produces an HMM that is more general than w0.5, and thus will have more positives at a given scoring threshold. Figure 12 shows the 2ohxA model built using w0.5, while Figure 13 shows the w1.0 model. Note the shorter columns in the second case, indicating less negative entropy in the distribution. Searchers with this more general model will find more false postives, and may also find more true positives.

The model building scripts include:

build-weighted-model
A general script for building models using most of the features of SAM. The default Dirichlet mixture prior recode2.20comp. The default transition regularizers is fssp-trained.regularizer, uniform alignment waiting, 0.5 bits per column is the target. This script is called by the various scripts discussed next, each of which will build a model from a multiple alignment (possibly with constraints).
w0.5
Build a model with target entropy weighting of 0.5 bits per column using recode3.20comp and fssp-trained.regularizer. The alignment is trimmed to sequences with less than 80% identity. This script is our standard for building HMMs to find remote potential homologs.
w0.7
Build a model with target entropy weighting of 0.7 bits per column using recode3.20comp and fssp-trained.regularizer. The alignment is trimmed to sequences with less than 85% identity. This script will create a model suitable for finding close homologs.
w1.0
Build a model with target entropy weighting of 1.0 bits per column using recode3.20comp and fssp-trained.regularizer. The alignment is trimmed to sequences with less than 90% identity. This script makes a strongly specialized model; it primarily appropriate to use with makelogo as an aid to visualizing models.


4.10 Support scripts

Two support scripts are also included with SAM-T2K.

The select-by-key-residues and select-by-gapless-regions scripts can be used to modify a2m alignments. They both take an a2m file as input, and produce an a2m file as output, with the output, with the output being a subset of the lines of the input. That is, they both delete lines in the alignment that do not meet certain criteria.

The select-by-key-residues script takes a description of the key residues that must be present, and removes all sequences from the alignment that do not include those residues. For example,


select-by-key-residues -residues a.res < b.a2m > c.a2m
will create the file c.a2m based on the alignment b.a2m and the specification on a.res. The residue file is a white-space separated list of residues that must be present. For example,

N150 AS180 P-225
# Comments can be prefixed with #
would mean that alignment column 150 (with 1 as the first column) must have Asparagine, 180 must have Alanine or Serine, and 225 must have Proline or a gap. The script has two optional parameters, used as -mismatch m.a2m, -err m.err, or both, to create an a2m file containing all rejected sequences and an error file that explains why each sequence was rejected.

The select-by-gapless-regions script takes a description of the regions that must have no indels, and removes all sequences from the alignment that have insertions or deletions in those positions. For example,


select-by-gapless-regions  -nogaps a.nogap < b.a2m > c.a2m
will create the file c.a2m based on the alignment b.a2m and the specification on a.nogop. The nogap file is a white-space separated list of pairs of residues between which there should be no insertions For example,

40 60
74 95   # Comment
would mean that alignment columns 40...60 and 74...95 must have no deletions, and that there cannot be insertions after columns 40...59 and 74...94.


4.11 Installing SAM-T2K

To install SAM-T2K, you must have version 5 of PERL installed, as target2k is a PERL5 script. The sam-t2k-configure script can take care of this when run in the directory in what the scripts are installed.

The other changes needed are in the in the sam-t2k.conf perl module.

The scripts build-weighted-model, w1.0, w0.7, and w0.5 use the Setpaths.pm perl module, and so should not need any changes except for the location of perl.

4.12 SAM-T2K references


next up previous contents
: ʸ : SAM (Sequence Alignment and : 3 Quick overview   ܼ
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group