SAM-T02 Frequently Asked Questions

(Last Update: 08:06 PST 29 January 2006 )

This page answers common questions concerning SAM and SAM-T02. If you do not find a solution to your problem here, please inform us at sam-info@cse.ucsc.edu.

  1. How do I cite SAM-T99 or SAM-T2K?

    Here are the main paper citations (in BibTeX format):

    
        @string{prosfg= "Proteins: Structure, Function, and Genetics"}
        @string{jmb= "Journal of Molecular Biology"}
        @string{bioinf="Bioinformatics"}
    
    @article{SAMT98,
    	author="Kevin Karplus and Christian Barrett and Richard Hughey",
    	title="Hidden {Markov} Models for detecting Remote Protein Homologies",
    	journal=bioinf,
    	year="1998",
    	volume=14, number=10,
    	pages="846-856",
    	annotate="This paper provides a fairly detailed presentation
    	of the SAM-T98 method for finding remote homologs, including
    	both the method and the results on FSSP, SCOP, and PIR test sets."
    	}
    	
    @article{Parketal98,
    	author="J. Park and K. Karplus and C. Barrett and R. Hughey and D. Haussler and T. Hubbard and C. Chothia",
    	title="Sequence Comparisons Using Multiple Sequences Detect
    	Three Times as Many Remote Homologues As Pairwise Methods",
    	year="1998",
    	journal=jmb,
    	volume=284, number=4, pages="1201-1210",
    	note="Paper available at {\def\xx{\discretionary{}{}{}}
    		{\tt http://www.mrc-lmb.cam.ac.uk/{\xx}genomes/{\xx}jong/{\xx}assess\_paper/{\xx}assess\_paperNov.html}}"
    }
    @comment{The xx definition is an attempt to keep BibTex from inserting
    	%, which adds an extra space---it is not entirely
    	successful as BibTeX STILL inserts one of the extraneous spaces.
    	}
    
    @article{SAMT2K-CASP4-proteins,
    	author="Kevin Karplus and Rachel Karchin and 
    		Christian Barrett and Spencer Tu and Melissa Cline and 
    		Mark Diekhans and Leslie Grate and Jonathan Casper and
    		Richard Hughey",
    	title="What is the value added by human intervention
    		in protein structure prediction?",
    	journal=prosfg,
    	year=2001,
    	volume=45,
    	number="S5",
    	pages="86--91"
    }
    
    @article{SAM-T02,
    	author="Kevin Karplus and
    		Rachel Karchin and
    		Jenny Draper and
    		Jonathan Casper and
    		Yael Mandel-Gutfreund and
    		Mark Diekhans and
    		Richard Hughey",
    	title="Combining local-structure, fold-recognition, and new-fold
    		methods for protein structure prediction",
    	journal=prosfg,
    	year="2003",
    	note="in press, special CASP5 edition"
    	}
    

  2. The results page I got back doesn't even have a link to the sequence I provided. What happened?

    There are several possible causes. To find out more, go up one level in the URL (omitting the "summary.html"). This gives you the full directory of result files, including the error messages. Sometimes this will help you find the problem, sometimes not. The most common problem is one that is hard to figure out even with the error message files. It is one we should be detecting automatically, but currently are not---having an all-lower-case sequence. The SAM-T02 method uses a script that assumes that lowercase characters are insertions (not to be included in the HMM). If all the letters are lowercase, the HMM has 0 length, and nothing is being modeled.

    The fix is to convert your sequence to all upper-case and resubmit.
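
    The conversion can be done in any editor, but a small script is convenient for long sequences. The following is a minimal sketch, not part of SAM; the function name uppercase_fasta is our own. It upper-cases the residue lines and leaves the ">" name line untouched.

```python
def uppercase_fasta(text: str) -> str:
    """Return FASTA text with residue lines upper-cased.

    Name lines (those starting with '>') are kept unchanged.
    uppercase_fasta is a hypothetical helper, not a SAM tool.
    """
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(line)          # name/comment line: keep as-is
        else:
            out.append(line.upper())  # residues: lowercase would be read as insertions
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(uppercase_fasta(">myseq example target\nmktayiakqr\nqisfvkshfs\n"), end="")
```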

  3. What do all the files coming out of SAM-T02 mean?

    There are a lot of outputs from our SAM-T02 web server, and we haven't had time to write an interpretation guide for all of them. (Maybe when we get some funding, and don't have to rely on volunteer labor for everything ...)

    The ones to concentrate on are

    multiple alignment
    What homologs are there for the target? What organisms are they from? What annotation exists for homologs? Do any of the homologs have experimental structures in PDB? The pretty html format provides long names and links to other databases.
    Sequence Logos
    The logos give a quick graphical view of what the target hidden Markov models were looking for. The first logo is based on the amino acids of the multiple alignment. The height of each bar shows the conservation at that position in the alignment (expressed as a relative entropy in bits).

    The secondary structure logos show how confident the prediction is for each position. The height is the information gain (the relative entropy of the predicted probabilities and the background probabilities). Where the confidence is high, the predictions tend to be much more accurate. We have found the structure logos to be the most informative way to view the predictions.

    Top Hits
    What are the most likely templates and how good are they? We express the goodness of fit as an E-value (see below for an explanation).
    Top Models
    What are our most-favored alignments to the templates? We have generated many alignments for each target-template pair, and report in this table the ones that seem most likely, based on the alignment scores and our alignment tests of different alignment methods. You can get more alignments by going up one level in the URL (dropping the "summary.html") to get to the directory of all the results, then going to the subdirectory for the template you are interested in. The a2m files in that directory are other alignments.
    PDB files
    We don't currently provide a 3D model to look at for sequences submitted by a web user, but we do provide some for pre-computed web pages. These models are created from the top alignment (or alignments) by doing sidechain substitution on the template. There is no optimization done, no loop modeling, no gap closing, so these should really only be considered as crude backbone models, not full 3D models.

    There may be multiple models in a single PDB file, corresponding to different alignments or different templates. Usually the first model is the one most likely to be right.

    Don't blindly trust these 3D models---at the very least look at the E-value for the template in the best-scores files. If the E-value is poor (greater than 1.0e-02, for example), then the model should be regarded as speculative.

    For sequences submitted to the web server, or for alignments other than the ones we provide models for, you can create a crude model from an alignment using the server at http://predictioncenter.org/local/al2ts/al2ts.html (For more details, see below.)
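
    The bar heights described under Sequence Logos above are relative entropies in bits. As a rough illustration of that quantity only, the sketch below computes the relative entropy of one column's distribution against a background distribution. How the server estimates those distributions (sequence weighting, regularization) is not covered here, and the function name and the example numbers are our own assumptions.

```python
import math


def relative_entropy_bits(p, background):
    """Relative entropy sum_k p[k] * log2(p[k] / background[k]), in bits.

    p and background are dicts mapping letters to probabilities.
    Letters with p[k] == 0 contribute nothing to the sum.
    """
    return sum(pk * math.log2(pk / background[k]) for k, pk in p.items() if pk > 0.0)


if __name__ == "__main__":
    # Hypothetical three-state prediction against a uniform background.
    predicted = {"H": 0.8, "E": 0.1, "L": 0.1}
    uniform = {"H": 1 / 3, "E": 1 / 3, "L": 1 / 3}
    print(round(relative_entropy_bits(predicted, uniform), 3))
```

    A confident prediction (one letter close to probability 1) gives a tall bar, and a prediction equal to the background gives a bar of height zero.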

  4. How do I interpret the E-value?

    The E-value is an estimate of approximately how many sequences would score this well by chance in the database searched. For SAM-T02, E-values less than about 1.0E-5 are very good hits and are very likely to have a domain of the same fold as the target. E-values larger than about 0.1 are very speculative---if your best hit is in this range then the correct fold is likely to be one of the top ten or twenty hits (unless the target is a new fold), but it is difficult to tell which of the top hits is the right one.

    Between 1.0E-5 and 0.1 the goodness of the match will vary somewhat from target to target, but will often be a good match.

    When you get an extremely small E-value (say 1.e-10 or smaller), then the alignments you get from SAM-T02 may not be any better than alignments that you get from sequence-sequence aligners like Smith-Waterman, FASTA, or BLAST. SAM-T02 is designed to do good fold recognition and alignment in the difficult cases, and it may give up some performance on the "easy" ones.
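
    The ranges above can be turned into a simple triage of hits. This is only a sketch: the cutoffs are the approximate ones quoted in this answer ("about" 1.0E-5 and "about" 0.1), the boundaries are not sharp, and the function name and labels are our own.

```python
def describe_evalue(evalue: float) -> str:
    """Return a rough label for a SAM-T02 E-value.

    Uses the approximate cutoffs from the FAQ text; treat the
    boundaries as soft, not as exact decision thresholds.
    """
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue < 1.0e-5:
        return "very good hit"
    if evalue <= 0.1:
        return "often a good match"
    return "very speculative"


if __name__ == "__main__":
    for e in (3.2e-12, 4.0e-3, 2.5):
        print(e, describe_evalue(e))
```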

  5. What do the secondary structure predictions mean?

    We report predictions for three alphabets (DSSP, STRIDE, STR [or STR2]) and one reduced alphabet (DSSP_EHL). For some of our predictions, we also predict residue burial. The DSSP alphabet is defined by the DSSP program (except that we combine the rare Pi-helix letter "I" in with the alpha helices "H"). The STRIDE alphabet is defined by the STRIDE program [Frishman D, Argos P. Knowledge-based protein secondary structure assignment. Proteins. 1995 Dec;23(4):566-79.] (again, we combine I with H).

    The STR (Strand) alphabet is an enhanced version of DSSP currently being developed at University of California, Santa Cruz. The concept was originated by post-doc Yael Mandel-Gutfreund. We have found that two-track hidden Markov models built with a STR secondary track are particularly good at fold recognition and target-template alignments.

    The original DSSP alphabet uses the letters "H" (alpha helix), "B" (isolated beta-bridge), "E" (extended strand in beta ladder), "G" (3/10 helix), "I" (pi helix), "T" (H-bonded turn) and "S" (bend).

    STR subdivides DSSP letter "E" into 6 letters, according to properties of a residue's relationship to its strand partners. (We also group the rare pi helix class "I" with the alpha helix class "H".)

    In the diagram, dots indicate the strand of the residue being assigned. In a beta sheet, this strand is either surrounded by two parallel partners "P", two anti-parallel partners "A" or one anti-parallel and one parallel partner "M". Edge strands (that have only one beta strand partner) have either a parallel partner "Q" or an anti-parallel partner "Z". Finally, we retain the "E" label for strand residues to which DSSP assigns no partners (generally beta bulges).

    We have also defined STR2 and STR3 alphabets:

    The STR2 alphabet further divides the anti-parallel edge strand class (Z) into two classes---those residues that are hydrogen-bonded to the sheet (Y) and those that are not (Z).
    The STR3 alphabet has Y and Z as in STR2, but also splits the parallel edge strand class into Q and R. (Question: which is the hydrogen-bonded one?)

    The ALPHA prediction is not currently provided by the web server, but is provided on some of our pre-computed web pages. The Alpha angle is the torsion angle of C_alpha(i-1), C_alpha(i), C_alpha(i+1), C_alpha(i+2). We have divided the range up into 11 classes (not mnemonically named):

    name  range
    A     165 <= alpha < -170
    B     -136 <= alpha < -103
    C     -103 <= alpha < -68
    D     -68 <= alpha < -17
    E     -170 <= alpha < -136
    F     -17 <= alpha < 8
    G     8 <= alpha < 31
    H     31 <= alpha < 58
    I     58 <= alpha < 85
    S     85 <= alpha < 140
    T     140 <= alpha < 165
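
    The table above can be applied mechanically. The sketch below maps an angle to its class letter. It assumes the angle is in degrees and reads class A as wrapping around +/-180 (that is, alpha >= 165 or alpha < -170); both are our reading of the table, and the function name is our own.

```python
import bisect

# Lower bounds of each class, sorted ascending, with the matching class letters.
# Class A starts at 165 and is assumed to wrap around through 180/-180 up to -170.
_LOWER = [-170, -136, -103, -68, -17, 8, 31, 58, 85, 140, 165]
_NAMES = ["E", "B", "C", "D", "F", "G", "H", "I", "S", "T", "A"]


def alpha_class(alpha: float) -> str:
    """Map an alpha torsion angle (assumed degrees) to its class letter."""
    a = (alpha + 180.0) % 360.0 - 180.0      # normalize to [-180, 180)
    i = bisect.bisect_right(_LOWER, a) - 1   # index of the last lower bound <= a
    if i < 0:
        return "A"                           # below -170: wrap-around part of class A
    return _NAMES[i]


if __name__ == "__main__":
    for angle in (-175.0, -100.0, 50.0, 170.0):
        print(angle, alpha_class(angle))
```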

    The DSSP_EHL alphabet is used in CASP and EVA evaluations of secondary-structure prediction. It combines all helix types (G, H, I) into one class (H), and both beta bridges and beta strands into one class (E), with everything else in an "other" class (variously called either C or L). Currently, we do not predict DSSP_EHL directly but combine our predictions for the more detailed alphabets to get a DSSP_EHL prediction. (We have not yet done extensive tests to see if this is better or worse than predicting DSSP_EHL directly.)
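
    The letter grouping itself is easy to state in code. The sketch below reduces a string of DSSP letters to DSSP_EHL using the grouping just described. It only illustrates the alphabet reduction on assigned letters; it is not the procedure the server uses to combine predicted probabilities, and the function name is our own.

```python
HELIX = frozenset("GHI")   # all helix types map to H
STRAND = frozenset("BE")   # beta bridges and beta strands map to E


def dssp_to_ehl(dssp: str, other: str = "L") -> str:
    """Reduce a DSSP string to the three-letter EHL alphabet.

    `other` is the letter for the remaining class (C or L, depending on convention).
    """
    return "".join(
        "H" if c in HELIX else "E" if c in STRAND else other
        for c in dssp
    )


if __name__ == "__main__":
    print(dssp_to_ehl("HHGGTTEEBSI"))
```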

    Our burial predictions use various alphabets using letters A-G or A-K (for 7 or 11 levels of burial). In all the burial alphabets, A is the most exposed and burial gradually increases to G or K, which are fully buried. Currently, the web servers do not provide burial predictions, but we have included them on the SARS (and soon the yeast) pre-computed predictions. The alphabet we are currently using counts the number of C-beta atoms in a sphere of radius 14 around the C-beta atom of the residue (excluding itself), as Rachel Karchin found this alphabet to have good conservation and predictability. [Rachel Karchin and Melissa Cline and Kevin Karplus, "Evaluation of local structure alphabets based on residue burial", Proteins: Structure, Function, and Genetics, in press. ]

    name  range
    A     count < 27
    B     27 <= count < 34
    C     34 <= count < 40
    D     40 <= count < 47
    E     47 <= count < 55
    F     55 <= count < 66
    G     66 <= count
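
    As an illustration of the counting scheme and the table above, the sketch below counts neighboring C-beta atoms for each residue and maps the count to a burial letter. Several details are assumptions on our part: the coordinates are taken to be in the same units as the radius of 14, atoms exactly on the sphere surface are counted as inside, and the function names are our own. The brute-force loop is quadratic in the number of residues, which is fine for a single chain.

```python
import bisect
import math

_UPPER = [27, 34, 40, 47, 55, 66]   # exclusive upper bounds for classes A..F
_LETTERS = "ABCDEFG"


def burial_letter(count: int) -> str:
    """Map a C-beta neighbor count to a burial letter using the 7-level table."""
    return _LETTERS[bisect.bisect_right(_UPPER, count)]


def cbeta_counts(cb_coords, radius=14.0):
    """Count, for each residue, the other C-beta atoms within `radius`.

    cb_coords is a list of (x, y, z) tuples, one per residue.
    The residue's own atom is excluded from its count.
    """
    counts = []
    for i, p in enumerate(cb_coords):
        n = sum(
            1
            for j, q in enumerate(cb_coords)
            if j != i and math.dist(p, q) <= radius
        )
        counts.append(n)
    return counts


if __name__ == "__main__":
    toy = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
    print([burial_letter(c) for c in cbeta_counts(toy)])
```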

  6. What sequence formats are accepted as input?

    Currently, the SAM-T02 query page only accepts a single sequence in FASTA format. In FASTA format, a sequence must have a unique name identifying the sequence in addition to the sequence residues. The name starts with a > (greater-than) character at the beginning of the line and continues to the first white space on the line. The rest of the name line is a comment, which is ignored. The sequence itself starts on the next line following the name line. The FASTA file should have the sequence itself in uppercase, though all-lowercase sequences will be accepted. Eventually we'll accept a2m alignments as input, in which case upper and lowercase distinctions will matter.
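
    To check an input file against the rules above before submitting, something like the following sketch may help. It splits the name line into name and comment at the first white space and joins the residue lines. The function name is our own, and the checks are only the ones stated in this answer (a ">" name line and a single sequence).

```python
def parse_fasta_single(text: str):
    """Parse single-sequence FASTA text into (name, comment, sequence).

    Raises ValueError if the text does not start with a '>' name line
    or if it contains more than one sequence.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' name line")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("only a single sequence is accepted")
    parts = lines[0][1:].split(None, 1)   # name ends at the first white space
    name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    sequence = "".join(lines[1:])
    return name, comment, sequence


if __name__ == "__main__":
    print(parse_fasta_single(">T0129 hypothetical protein\nMKTAYIAKQR\nQISFVKSHFS\n"))
```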

  7. What sequence formats are returned as output?

    The SAM-T02 query page returns multiple alignments in FASTA, pretty-printed, and HTML formats, and pairwise alignments of the target to the best-scoring template candidates in FASTA and .al (CASP) format. The FASTA format is really our a2m format, which the SAM tools understand but some conversion tools misinterpret. The SAM package includes the "prettyalign" program, which can be used to add extra dots to an alignment, making it easier for tools that don't understand the a2m format to convert it to other formats.
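
    If you process a2m rows yourself, the key point is the case distinction. The sketch below assumes the usual a2m convention: uppercase letters and "-" occupy HMM-aligned columns, lowercase letters are insertions, and "." is padding inside insertion columns. It is not a replacement for prettyalign, and the function names are our own.

```python
def a2m_match_columns(row: str) -> str:
    """Keep only the HMM-aligned columns of one a2m row.

    Uppercase residues and '-' (deletions) are kept; lowercase
    insertions and '.' padding are dropped.
    """
    return "".join(c for c in row if c.isupper() or c == "-")


def a2m_unaligned(row: str) -> str:
    """Recover the plain residue sequence from one a2m row, in uppercase."""
    return "".join(c for c in row if c.isalpha()).upper()


if __name__ == "__main__":
    row = "MK..t-AYi.K"
    print(a2m_match_columns(row))
    print(a2m_unaligned(row))
```

    After this reduction every row of an alignment has the same length, which is what most downstream tools expect.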

  8. How can I view SAM-T02 results in graphical format?

    The only graphical results from SAM-T02 are the sequence logos, which are all in EPS (Encapsulated Postscript) format. There are many programs available for viewing EPS files---which one you choose is largely a question of what platform you run on. The most popular viewer for Unix machines is the free "ghostview" program (also available for MS-Windows and Macs): http://www.cs.wisc.edu/~ghost/gsview/index.htm

    If you wish to see a 3D model of the predicted protein, you have to convert the alignment to a 3D structure. Although we are working on tools to do this, they are not yet ready for release. In the meantime, your best bet is to take the .al files and submit them to the AL2TS server: http://predictioncenter.org/local/al2ts/al2ts.html That server only accepts their own ".al" format, which we provide for just the alignments based on the 2-track-protein-STR hidden Markov models (these are usually the best alignments).

  9. What do the sequence IDs in a SAM-T02 a2m file mean?

    Most of the sequence IDs in a SAM-T02 a2m file come from the IDs in the NR database. The sequence IDs may be modified by SAM to indicate the first and last sequence positions that matched the SAM-T02 HMM.

    For example in the following sequence ID taken from a SAM-T02 alignment,

    >gi|16080670|ref|NP_391498.1|_1:234 (NC_000964) similar to hypothetical
    proteins [Bacillus subtilis] gi|7450240|pir||G70067 conserved hypothetical
    protein ywqL - Bacillus subtilis gi|1894750|emb|CAB07450.1| (Z92952)
    product similar to E.coli YjaF protein [Bacillus subtilis]
    gi|2636142|emb|CAB15634.1| (Z99122) similar to hypothetical proteins
    [Bacillus subtilis]
            
    the original sequence name gi|16080670|ref|NP_391498.1| has had _1:234 appended to indicate that the SAM-T02 HMM for the alignment matched the sequence starting at sequence position 1 and ending at sequence position 234.
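
    If you need the original database ID back, the appended range can be split off with a regular expression. This is a sketch based only on the example above (a trailing _start:end on the first white-space-delimited token of the name line); the function name split_sam_id is our own.

```python
import re

# A trailing "_<start>:<end>" at the very end of the ID token.
_RANGE_SUFFIX = re.compile(r"^(?P<name>.+)_(?P<start>\d+):(?P<end>\d+)$")


def split_sam_id(seq_id: str):
    """Split an ID into (original_name, start, end).

    Returns (seq_id, None, None) when no range suffix is present.
    """
    m = _RANGE_SUFFIX.match(seq_id)
    if m is None:
        return seq_id, None, None
    return m.group("name"), int(m.group("start")), int(m.group("end"))


if __name__ == "__main__":
    print(split_sam_id("gi|16080670|ref|NP_391498.1|_1:234"))
```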

  10. I found homologs with BLAST (or PSI-BLAST or FASTA) that are not reported by a SAM-T02 database search. Are BLAST, PSI-BLAST, and FASTA more sensitive than SAM-T02?

    You mentioned that FASTA, BLAST, and PSI-BLAST found a high-scoring similar sequence that SAM-T02 did not find. This happens fairly often---the most common causes are composition bias and large helices (particularly coiled-coils). The programs FASTA, BLAST, and PSI-BLAST can all be fooled into reporting very strong scores for sequences whose only similarity is that they both have long amphipathic helices. SAM-T02's reverse-sequence-null model cancels this signal (as well as composition bias and length signals), resulting in a method with many fewer false positives. A few true positives are lost, but not too many.

    As an example, the leucine zipper 1ce0A gets only 25 sequences in the 1ce0A.t02.a2m alignment. The 19 PDB sequences in the alignment are all homologs (at least, similar structure and somewhat similar sequence). Other methods are likely to get almost any coiled-coil as a strong hit. This is an example of the reverse-sequence-null model removing a lot of trash (and possibly some good stuff) due to helicity signals.

    Another common problem is with the cysteine-rich metallothionein appearing in searches for proteins that had highly conserved cysteines---even ones with very different structure and function. SAM-T02 only includes metallothionein when almost all the cysteines line up.

    Note: the compositional corrections to PSI-BLAST in August 2002 made the PSI-BLAST multiple alignments almost as good as the SAM-T02 alignments---the contamination by unrelated sequences was greatly reduced.

  11. What do I do if I have a large protein?

    If your protein is a large, multi-domain protein, your best bet is to break it up into pieces (near domain boundaries is best, if you can guess where those are). Protein structure prediction generally works better on single domains in any case.

  12. When I wanted to use your program for detection of homology in a multi-domain protein, the alignments I got with database-sequences seemed to depend very much upon the length of the input sequence (depends upon the guess of domain boundaries), and it was difficult to guess the correct input to give. I thus begin with a long sequence as input and always shorten it to see what happens with the alignment. Is there a better way to handle this?

    The SAM-T02 method builds models the size of the input sequence. Finding domain boundaries when no structure is known is an art that we have not attempted to automate (though other researchers have).

    We have generally found it best to do a search first with the full-length protein, then remove any domains that are strongly predicted, and do the prediction again on what is left. A weaker prediction for a second domain may be masked by strong predictions for the more easily found domain in the full-length protein.
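
    The bookkeeping for the second pass can be sketched as follows: given the residue ranges of the domains that were strongly predicted, cut them out and keep the leftover pieces for resubmission. The function name, the 1-based inclusive ranges, and the min_length cutoff for discarding short leftovers are all assumptions made for this illustration; choosing the ranges themselves remains a manual judgment.

```python
def remaining_segments(sequence: str, domains, min_length: int = 30):
    """Return leftover pieces of `sequence` after removing predicted domains.

    domains: list of (start, end) residue ranges, 1-based and inclusive.
    Returns a list of (start, end, subsequence) for each leftover piece
    that is at least min_length residues long (a hypothetical cutoff).
    """
    segments = []
    pos = 1  # first residue not yet assigned to a domain
    for start, end in sorted(domains):
        if start - pos >= min_length:
            segments.append((pos, start - 1, sequence[pos - 1:start - 1]))
        pos = max(pos, end + 1)
    if len(sequence) - pos + 1 >= min_length:
        segments.append((pos, len(sequence), sequence[pos - 1:]))
    return segments


if __name__ == "__main__":
    seq = "A" * 120
    for start, end, piece in remaining_segments(seq, [(41, 90)], min_length=30):
        print(start, end, len(piece))
```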

  13. The active site of my query protein is not conserved in the alignments returned by the SAM-T02 server. Are alignments reliable, or can they be improved by other alignment programs, e.g. CLUSTAL?

    Failure to conserve an active-site residue could mean several things:

    1. the residue is not in an otherwise conserved region, and the sequence-based aligners could not figure out that the residue alignment was important.
    2. the residues really are not in the same place relative to the rest of the fold.
    3. the function is different.

    We have done some tests of SAM-T99 (which is very similar to SAM-T02 for constructing multiple alignments) as a multiple aligner (using the BAliBase test suite), and found that the alignments produced by SAM-T99 are about as good as those produced by CLUSTAL. You can try realigning them with other multiple aligners (such as CLUSTAL, PRRP, or DIALIGN), but it is probably a good idea to thin the alignment to a few diverse sequences first, since those aligners get very slow when given many sequences. If the alignment changes dramatically, then there is good reason to suspect the alignment.

    Currently, the best multiple alignment method we know of is T-Coffee, but its run time is proportional to the cube of the number of sequences being run, so most SAM-T02 alignments would need to be drastically pruned before being realigned with T-Coffee. Another one that we have heard is quite good and fast (supposedly better than T-Coffee, fast enough to run on alignments of 1000s of sequences) is Muscle at http://www.drive5.com/muscle.

  14. Is there a way to search a database of SAM HMMs with another HMM?

    We have not yet had much success with our attempts to score HMMs against HMMs, though we have not, of course, tried all possible algorithms.

    Our best method so far is to combine the results of scoring all template sequences against a target HMM and the target sequence against all template HMMs. There is probably a better method, and some people have had success with profile-profile alignment, but (so far) it has not worked well in our hands.

  15. Can I obtain a copy of the SAM software for local use on my computer?

    Yes, we encourage you to download a copy of the SAM software. It's available free for academic use at the SAM web site. You may find additional functionality you need with the full SAM package that is not currently accessible through our web site.

  16. Is SAM-T02 suitable for prediction of transmembrane domains?

    SAM-T99 and SAM-T02 have not been optimized for transmembrane predictions. They are "ok" on transmembrane predictions, but not nearly as good as tools optimized for that task. We've been told that the TMHMM server is currently the best predictor for transmembrane helices, but we've not done any tests ourselves.

  17. SAM-T02 returns what appear to be probabilities that a residue belongs to one of seven DSSP-defined secondary structure states (in that all scores sum to 1). Are these numbers really probabilities that the given residue truly is the given secondary structure type (as verified by testing on a validation set), or are the numbers something more akin to probabilities that sequence homologs will have that secondary structure type at that particular location?

    The probabilities returned by the SAM-T02 server are from neural nets. The neural nets were trained to maximize
    sum_examples log Phat(correct letter | example)
    where Phat is the neural net output of predicted probability for a letter.

    The calibration has been checked and is pretty good. That is, in the cases where the neural net has said that the probability of helix is 0.80, about 80% of the time there really is a helix there. The cost function used in training makes the calibration very tight on the training set.

    Of course, the neural net is using a multiple alignment as an input, so if the target sequence is misaligned or has a different structure from the sequences it is aligned to, the neural net can produce a confident, but incorrect, secondary structure prediction.
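
    For readers who want to run a similar calibration check on their own test set, here is a minimal sketch. It computes the training objective quoted above (the sum over examples of log Phat of the correct letter) and a binned comparison of predicted probability with observed frequency for one letter. The data layout (one dict of letter probabilities per residue), the function names, and the number of bins are our own assumptions, not the procedure we used.

```python
import math


def log_likelihood(predictions, correct):
    """Sum over examples of log Phat(correct letter | example)."""
    return sum(math.log(p[c]) for p, c in zip(predictions, correct))


def calibration_table(predictions, correct, letter, n_bins=10):
    """Compare predicted probability with observed frequency for one letter.

    Returns a list of (mean_predicted, observed_fraction, count),
    one entry per non-empty probability bin.
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    prob_sum = [0.0] * n_bins
    for p, c in zip(predictions, correct):
        prob = p.get(letter, 0.0)
        b = min(int(prob * n_bins), n_bins - 1)   # bin index; 1.0 goes in the top bin
        totals[b] += 1
        prob_sum[b] += prob
        hits[b] += int(c == letter)
    return [
        (prob_sum[b] / totals[b], hits[b] / totals[b], totals[b])
        for b in range(n_bins)
        if totals[b]
    ]


if __name__ == "__main__":
    # Toy data: five residues predicted as helix with probability 0.8, four truly helix.
    preds = [{"H": 0.8, "E": 0.2}] * 5
    truth = ["H", "H", "H", "H", "E"]
    print(calibration_table(preds, truth, "H"))
    print(log_likelihood(preds, truth))
```

    In a well-calibrated predictor the first two numbers of each row are close to each other, as in the 0.80 versus 80% example above.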