SAM-T06 Frequently Asked Questions

Last updated on April 21, 2006.

This page answers many common questions concerning SAM (Sequence Alignment and Modeling System) and SAM-T06. If you do not find a solution to your problem here, please inform us at sam-info@cse.ucsc.edu.

  1. How do I cite SAM?

    Here are the main paper citations (in BibTeX format):

    @string{prosfg= "Proteins: Structure, Function, and Genetics"}
    @string{prosfb= "Proteins: Structure, Function, and Bioinformatics"}
    @string{jmb="Journal of Molecular Biology"}
    @string{bioinf="Bioinformatics"}
    
    @article{SAMT98,
    	author="Kevin Karplus and Christian Barrett and Richard Hughey",
    	title="Hidden {Markov} Models for detecting Remote Protein Homologies",
    	journal=bioinf,
    	year="1998",
    	volume=14, number=10,
    	pages="846-856",
    	annotate="This paper provides a fairly detailed presentation
    	of the SAM-T98 method for finding remote homologs, including
    	both the method and the results on FSSP, SCOP, and PIR test sets."
    }
    	
    @article{SAMT2K-CASP4-proteins,
    	author="Kevin Karplus and Rachel Karchin and 
    		Christian Barrett and Spencer Tu and Melissa Cline and 
    		Mark Diekhans and Leslie Grate and Jonathan Casper and
    		Richard Hughey",
    	title="What is the value added by human intervention
    		in protein structure prediction?",
    	journal=prosfg,
    	year=2001,
    	volume=45,
    	number="S5",
    	pages="86--91"
    }
    
    @article{SAMT02-CASP5,
    	author="Kevin Karplus and
    		Rachel Karchin and
    		Jenny Draper and
    		Jonathan Casper and
    		Yael Mandel-Gutfreund and
    		Mark Diekhans and
    		Richard Hughey",
    	title="Combining local-structure, fold-recognition, and new-fold
    		methods for protein structure prediction",
    	journal=prosfg,
    	year="2003", month="15~"#oct,
    	volume="53",
    	number="Suppl.~6",
    	pages="491-496"
    	}
             
    @article{SAMT04-CASP6,
        author="Kevin Karplus and Sol Katzman and George Shackleford and
        Martina Koeva and Jenny Draper and Bret Barnes and Marcia Soriano
        and Richard Hughey",
        title="{SAM-T04}: what's new in protein-structure prediction for {CASP6}",
        journal=prosfb,
        year=2005,
        volume="61",
        number="S7",
        pages="135-142"
    }
    
    

  2. The results page has loaded, but it only has one link (or only a few links). Where are the results?

    The output page is printed as tests are completed, so you are viewing an incomplete page. The page always appears complete, even though more data is still being appended to the file. You will need to hit the RELOAD button on your browser to refresh the file. Keep in mind that it can take several minutes for all the tests to complete (longer if the server is exceptionally busy).

    A complete file will have "this page has finished loading" written at the end. If you have waited a long time, check for such a line. If that line is not present, then either your query is still not complete (the server may be under high load) or something has crashed. The recommended solution is to simply wait longer, or, after a very long time, to resubmit the query. You may also manually inspect the error logs (delete "summary.html" from the URL in order to get a list of files in your query directory—see the question below for more information on this procedure).
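
    If you save the page text to check it yourself, the test for completeness can be sketched as below. This is only an illustration: the function name is hypothetical, and matching the quoted line case-insensitively is an assumption about how the text appears on the page.

```python
def summary_is_complete(page_text: str) -> bool:
    """Return True if the page text contains the completion line quoted above."""
    marker = "this page has finished loading"
    # Case-insensitive match: the exact capitalization on the page is an assumption.
    return marker in page_text.lower()
```

    If this returns False after a long wait, reload the page or inspect the error logs as described above.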

  3. The results page I got back doesn't even have a link to the sequence I provided. What happened?

    There are several possible causes. To find out more, go up one level in the URL (that is, delete the "summary.html" text from your browser's address bar, and load that directory). This gives you the full directory of result files, including the error messages. Sometimes you will be able to easily diagnose the problem from the error messages in the error logs.

    The most common problem is a sequence containing lower-case letters. We currently detect and capitalize an entirely lower-case sequence, but if there is even one upper-case letter in the sequence, the lower-case residues are treated as insertions and left out, so your sequence will be shorter than expected. This happens because SAM-T06 uses a script that assumes that lower-case characters are insertions (not to be included in the HMM). The solution is to convert your sequence to all upper-case and resubmit.
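
    A minimal sketch of that conversion, assuming the query is plain FASTA text (the function name is hypothetical). Header lines starting with ">" are left untouched; every other line is upper-cased.

```python
def uppercase_fasta(fasta_text: str) -> str:
    """Upper-case the residue lines of FASTA text, leaving '>' header lines as they are."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)          # name/comment line: keep verbatim
        else:
            out.append(line.upper())  # residue line: force upper-case
    return "\n".join(out) + "\n"
```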

  4. What do all the files coming out of SAM-T06 mean?

    There are a lot of outputs from our SAM-T06 web server, and we haven't had time to write an interpretation guide for all of them.

    The ones to concentrate on are:

  5. How do I interpret the E-value?

    The E-value is an estimate of approximately how many sequences would score this well by chance in the database searched. For SAM-T06, E-values less than about 1.0E-5 are very good hits and are very likely to have a domain of the same fold as the target. E-values larger than about 0.1 are very speculative—if your best hit is in this range then the correct fold is likely to be one of the top ten or twenty hits (unless the target is a new fold), but it is difficult to tell which of the top hits is the right one.

    Between 1.0E-5 and 0.1 the goodness of the match will vary somewhat from target to target, but will often be a good match.

    When you get an extremely small E-value (say 1.0E-10 or smaller), then the alignments you get from SAM-T06 may not be any better than alignments that you get from sequence-sequence aligners like Smith-Waterman, FASTA, or BLAST. SAM-T06 is designed to do good fold recognition and alignment in the difficult cases, and it may give up some performance on the "easy" ones.
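
    The rough ranges above can be written down as a small helper. This is a sketch only: the function name and the labels are hypothetical, and the cut-offs are the approximate values quoted in this answer ("about 1.0E-5" and "about 0.1"), not sharp boundaries.

```python
def describe_evalue(evalue: float) -> str:
    """Map an E-value onto the three rough ranges described in this answer."""
    if evalue < 1.0e-5:
        return "very good hit"   # very likely to share a fold with the target
    if evalue > 0.1:
        return "speculative"     # correct fold may be among the top ten or twenty hits
    return "intermediate"        # quality varies from target to target
```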

  6. What do the secondary structure predictions mean?

    We report predictions for three alphabets (DSSP, STRIDE, STR [or STR2]) and one reduced alphabet (DSSP_EHL). For some of our predictions, we also predict residue burial. The DSSP alphabet is defined by the DSSP program (except that we combine the rare Pi-helix letter "I" in with the alpha helices "H"). The STRIDE alphabet is defined by the STRIDE program [Frishman D, Argos P. Knowledge-based protein secondary structure assignment. Proteins. 1995 Dec;23(4):566-79.] (again, we combine I with H).

    The STR (Strand) alphabet is an enhanced version of DSSP currently being developed at University of California, Santa Cruz. The concept was originated by post-doc Yael Mandel-Gutfreund. We have found that two-track hidden Markov models built with a STR secondary track are particularly good at fold recognition and target-template alignments.

    The original DSSP alphabet uses the letters "H" (alpha helix), "B" (isolated beta-bridge), "E" (extended strand in beta ladder), "G" (3/10 helix), "I" (pi helix), "T" (H-bonded turn) and "S" (bend).

    STR subdivides DSSP letter "E" into 6 letters, according to properties of a residue's relationship to its strand partners. (We also group the rare pi helix class "I" with the alpha helix class "H".)

    In the diagram, dots indicate the strand of the residue being assigned. In a beta sheet, this strand is either surrounded by two parallel partners "P", two anti-parallel partners "A" or one anti-parallel and one parallel partner "M". Edge strands (that have only one beta strand partner) have either a parallel partner "Q" or an anti-parallel partner "Z". Finally, we retain the "E" label for strand residues to which DSSP assigns no partners (generally beta bulges).

    We have also defined STR2 and STR3 alphabets:

    The STR2 alphabet further divides the anti-parallel edge strand class (Z) into two classes—those residues that are hydrogen-bonded to the sheet (Y) and those that are not (Z).
    The STR3 alphabet has Y and Z as in STR2, but also splits the parallel edge strand class into Q and R. (Question: which is the hydrogen-bonded one?)

    The ALPHA prediction is not currently provided by the web server, but is provided on some of our pre-computed web pages. The Alpha angle is the torsion angle of C_alpha(i-1), C_alpha(i), C_alpha(i+1), C_alpha(i+2). We have divided the range up into 11 classes (not mnemonically named):

    name  range
    A     165<=alpha<-170
    B     -136<=alpha<-103
    C     -103<=alpha<-68
    D     -68<=alpha<-17
    E     -170<=alpha<-136
    F     -17<=alpha<8
    G     8<=alpha<31
    H     31<=alpha<58
    I     58<=alpha<85
    S     85<=alpha<140
    T     140<=alpha<165
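
    As an illustration, the table can be turned into a lookup. This sketch assumes angles in degrees and reads class A as wrapping around through 180 (that is, alpha >= 165 or alpha < -170); the function and variable names are hypothetical.

```python
import bisect

# Lower bounds (degrees) of each class in ascending order, with the matching class names.
_LOWER_BOUNDS = [-170, -136, -103, -68, -17, 8, 31, 58, 85, 140, 165]
_NAMES = ["E", "B", "C", "D", "F", "G", "H", "I", "S", "T", "A"]


def alpha_class(alpha_degrees: float) -> str:
    """Return the ALPHA class letter for a torsion angle given in degrees."""
    # Normalize to the half-open interval [-180, 180).
    alpha = (alpha_degrees + 180.0) % 360.0 - 180.0
    idx = bisect.bisect_right(_LOWER_BOUNDS, alpha) - 1
    if idx < 0:
        # Below -170: this is the wrapped-around part of class A.
        return "A"
    return _NAMES[idx]
```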

    The DSSP_EHL alphabet is used in CASP and EVA evaluations of secondary-structure prediction. It combines all helix types (G, H, I) into one class (H), and both beta bridges and beta strands into one class (E), with everything else in an "other" class (variously called either C or L). Currently, we do not predict DSSP_EHL directly but combine our predictions for the more detailed alphabets to get a DSSP_EHL prediction. (We have not yet done extensive tests to see if this is better or worse than predicting DSSP_EHL directly.)
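
    The reduction itself is a simple letter mapping, sketched below for a string of DSSP letters. Note that this only illustrates the alphabet reduction; it is not how the server combines its predictions for the more detailed alphabets. The function name and the choice of "L" as the default "other" letter are assumptions.

```python
def dssp_to_ehl(dssp: str, other: str = "L") -> str:
    """Reduce a string of DSSP letters to the three-letter EHL alphabet."""
    helix = set("GHI")   # all helix types become H
    strand = set("BE")   # beta bridges and beta strands become E
    return "".join(
        "H" if c in helix else "E" if c in strand else other
        for c in dssp
    )
```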

    Our burial predictions use various alphabets using letters A-G or A-K (for 7 or 11 levels of burial). In all the burial alphabets, A is the most exposed and burial gradually increases to G or K, which are fully buried. Currently, the web servers do not provide burial predictions, but we have included them on the SARS (and soon the yeast) pre-computed predictions. The alphabet we are currently using counts the number of C-beta atoms in a sphere of radius 14 around the C-beta atom of the residue (excluding itself), as Rachel Karchin found this alphabet to have good conservation and predictability. [Rachel Karchin and Melissa Cline and Kevin Karplus, "Evaluation of local structure alphabets based on residue burial", Proteins: Structure, Function, and Genetics, in press.]

    name  range
    A     count<27
    B     27<=count<34
    C     34<=count<40
    D     40<=count<47
    E     47<=count<55
    F     55<=count<66
    G     66<=count
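
    The table above maps a neighbor count to a burial letter. A minimal sketch of that lookup (hypothetical names; it assumes the C-beta count has already been computed elsewhere):

```python
import bisect

# Lower bounds of classes B..G; counts below 27 fall into class A.
_BURIAL_BOUNDS = [27, 34, 40, 47, 55, 66]


def burial_class(count: int) -> str:
    """Return the 7-level burial letter (A = most exposed, G = fully buried)."""
    return "ABCDEFG"[bisect.bisect_right(_BURIAL_BOUNDS, count)]
```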

  7. What sequence formats are accepted as input?

    Currently, the SAM-T06 query page only accepts a single sequence in FASTA format. In FASTA format, a sequence must have a unique name identifying the sequence in addition to the sequence residues. The name starts with a > (greater-than) character at the beginning of the line and continues to the first white space on the line. The rest of the name line is a comment, which is ignored. The sequence itself starts on the line following the name line. The FASTA file should have the sequence itself in uppercase, though all-lowercase sequences will be accepted. Eventually we'll accept a2m alignments as input, in which case upper and lowercase distinctions will matter.
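
    To check a query before submitting it, the format described above can be parsed as follows. This is a sketch under the stated rules (name up to the first white space, rest of the line is a comment, exactly one sequence); the function name is hypothetical and the server's own parser may differ.

```python
def parse_single_fasta(text: str):
    """Split single-sequence FASTA text into (name, comment, sequence)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' name line")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("only a single sequence is accepted")
    # The name runs to the first white space; the rest of the line is a comment.
    parts = lines[0][1:].split(None, 1)
    name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    sequence = "".join(lines[1:])
    return name, comment, sequence
```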

  8. What sequence formats are returned as output?

    The SAM-T06 query page returns multiple alignments in FASTA, pretty-printed, and HTML formats, and pairwise alignments of the target to the best-scoring template candidates in FASTA and .al (CASP) format. The FASTA format is really our a2m format, which the SAM tools understand but some conversion tools misinterpret. The SAM package includes the "prettyalign" program, which can be used to add extra dots to an alignment, making it easier for tools that don't understand the a2m format to convert the alignment to other formats.
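
    If a downstream tool only needs the aligned (match) columns, they can be pulled out of an a2m row with a few lines of code. The sketch below relies on the usual a2m convention, which is an assumption here rather than something spelled out in this FAQ: upper-case letters and "-" are match columns, while lower-case letters and "." belong to insertions. The function name is hypothetical.

```python
def a2m_match_columns(aligned_row: str) -> str:
    """Keep only match-column characters (upper-case residues and '-') of an a2m row."""
    return "".join(c for c in aligned_row if c.isupper() or c == "-")
```

    Applied to every row of an alignment, this should give rows of equal length that most FASTA-based tools can read.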

  9. How can I view SAM-T06 models interactively?

    The graphical results from SAM-T06 are the sequence logos, which are in EPS (Encapsulated Postscript) format and PDF (Portable Document Format), and the jpeg images of the incomplete model built from the first alignment and of the automatically built complete model.

    If you wish to interact with a model of the predicted protein, you have to download the PDB files. Currently these are kept in gzip format (to save space and transmission time). Some tools (such as RasMol) will handle gzipped PDB files directly; others will need to have the files uncompressed (with tools such as gunzip or StuffIt Expander).
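
    If neither tool is at hand, a gzipped file can also be uncompressed with a short script. A minimal sketch using only the Python standard library (the function name and the file names in the usage note are hypothetical):

```python
import gzip
import shutil


def gunzip_file(src_path: str, dst_path: str) -> None:
    """Uncompress a gzip file, such as a downloaded .pdb.gz, to dst_path."""
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream the data, so large files are fine
```

    For example, gunzip_file("model.pdb.gz", "model.pdb") writes an uncompressed copy next to the download.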

  10. What do the sequence IDs in a SAM-T06 a2m file mean?

    Most of the sequence IDs in a SAM-T06 a2m file come from the IDs in the NR database. The sequence IDs may be modified by SAM to indicate the first and last sequence positions that matched the SAM-T06 HMM.

    For example in the following sequence ID taken from a SAM-T06 alignment,

    >gi|16080670|ref|NP_391498.1|_1:234 (NC_000964) similar to hypothetical
    proteins [Bacillus subtilis] gi|7450240|pir||G70067 conserved hypothetical
    protein ywqL - Bacillus subtilis gi|1894750|emb|CAB07450.1| (Z92952)
    product similar to E.coli YjaF protein [Bacillus subtilis]
    gi|2636142|emb|CAB15634.1| (Z99122) similar to hypothetical proteins
    [Bacillus subtilis]
            
    the original sequence name gi|16080670|ref|NP_391498.1| has had _1:234 appended to indicate that the SAM-T06 HMM for the alignment matched the sequence starting at sequence position 1 and ending at sequence position 234.
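
    The appended range can be split off again with a regular expression. This sketch assumes the suffix always has the form _start:end at the very end of the ID (the first white-space-delimited token of the name line); the function name is hypothetical.

```python
import re

# Original name, then "_", start position, ":", end position at the end of the ID.
_RANGE_SUFFIX = re.compile(r"^(?P<name>.+)_(?P<start>\d+):(?P<end>\d+)$")


def split_range_suffix(seq_id: str):
    """Return (original_name, start, end); start and end are None if no suffix is found."""
    m = _RANGE_SUFFIX.match(seq_id)
    if m is None:
        return seq_id, None, None
    return m.group("name"), int(m.group("start")), int(m.group("end"))
```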

  11. I found homologs with BLAST (or PSI-BLAST or FASTA) that are not reported by a SAM-T06 database search. Are BLAST (or PSI-BLAST, FASTA) more sensitive than SAM-T06?

    This happens fairly often—the most common causes are composition bias and large helices (particularly coiled-coils). The programs FASTA, BLAST, and PSI-BLAST can all be fooled into reporting very strong scores for sequences whose only similarity is that they both have long amphipathic helices. SAM-T06's reverse-sequence-null model cancels this signal (as well as composition bias and length signals), resulting in a method with many fewer false positives. A few true positives are lost, but not too many.

    As an example, the leucine zipper 1ce0A gets only 25 sequences in the 1ce0A.t02.a2m alignment. The 19 PDB sequences in the alignment are all homologs (at least, similar structure and somewhat similar sequence). Other methods are likely to get almost any coiled-coil as a strong hit. This is an example of the reverse-sequence-null model removing a lot of trash (and possibly some good stuff) due to helicity signals.

    Another common problem is with the cysteine-rich metallothionein appearing in searches for proteins that had highly conserved cysteines—even ones with very different structure and function. SAM-T06 only includes metallothionein when almost all the cysteines line up.

    Note: the compositional corrections to PsiBlast in August 2002 made the PsiBlast multiple alignments almost as good as the SAM-T02 alignments—the contamination by unrelated sequences was greatly reduced.

  12. What do I do if I have a large protein?

    If your protein is a large, multi-domain protein, your best bet is to break it up into pieces (near domain boundaries is best, if you can guess where those are). Protein structure prediction generally works better on single domains in any case.

  13. When I wanted to use your program for detection of homology in a multi-domain protein, the alignments I got with database-sequences seemed to depend very much upon the length of the input sequence (depends upon the guess of domain boundaries), and it was difficult to guess the correct input to give. I thus begin with a long sequence as input and then always shorten it to see what happens with the alignment. Is there a better way to handle this?

    The SAM-T06 method builds models the size of the input sequence. Finding domain boundaries when no structure is known is an art that we have not attempted to automate (though other researchers have).

    We have generally found it best to do a search first with the full-length protein, then remove any domains that are strongly predicted, and do the prediction again on what is left. A weaker prediction for a second domain may be masked by strong predictions for the more easily found domain in the full-length protein.
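
    The "remove the predicted domain and search again" step can be sketched as below. This is an illustration only: the function name is hypothetical, positions are assumed to be 1-based and inclusive, and the min_length cut-off for discarding short leftovers is an arbitrary assumption, not a recommendation from the SAM group.

```python
def leftover_segments(sequence: str, start: int, end: int, min_length: int = 30):
    """Cut residues start..end (1-based, inclusive) out of sequence and return
    the remaining flanking pieces that are at least min_length residues long."""
    left = sequence[: start - 1]   # residues before the predicted domain
    right = sequence[end:]         # residues after the predicted domain
    return [seg for seg in (left, right) if len(seg) >= min_length]
```

    Each returned piece can then be submitted as its own query.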

  14. The active site of my query protein is not conserved in the alignments returned by the SAM-T06 server. Are alignments reliable, or can they be improved by other alignment programs, e.g. CLUSTAL?

    Failure to conserve an active-site residue could mean several things:

    1. the residue is not in an otherwise conserved region, and the sequence-based aligners could not figure out that the residue alignment was important.
    2. the residues really are not in the same place relative to the rest of the fold.
    3. the function is different.

    We have done some tests of SAM-T99 (which is very similar to SAM-T02 for constructing multiple alignments) as a multiple aligner (using the BAliBASE test suite), and found that the alignments produced by SAM-T99 are about as good as those produced by CLUSTAL. You can try realigning them with other multiple aligners (such as CLUSTAL, PRRP, or DIALIGN), but it is probably a good idea to thin the alignment to a few diverse sequences first, since those aligners get very slow when given many sequences. If the alignment changes dramatically, then there is good reason to suspect the alignment.

    Currently, the best multiple alignment method we know of is Muscle at http://www.drive5.com/muscle. It does global alignments, however, which can cause severe over-alignment, especially with multidomain proteins.

  15. Is there a way to search a database of SAM HMMs with another HMM?

    We have not yet had much success with our attempts to score HMMs against HMMs, though we have not, of course, tried all possible algorithms.

    Our best method so far is to combine the results of scoring all template sequences against a target HMM and the target sequence against all template HMMs. There is probably a better method, and some people have had success with profile-profile alignment, but (so far) it has not worked well in our hands.

  16. Can I obtain a copy of the SAM software for local use on my computer?

    Yes, we encourage you to download a copy of the SAM software. It's available free for academic use at the SAM web site. You may find additional functionality you need with the full SAM package that is not currently accessible through our web site.

  17. Is SAM-T06 suitable for prediction of transmembrane domains?

    SAM-T99 and SAM-T02 have not been optimized for transmembrane predictions. They are "ok" on transmembrane predictions, but not nearly as good as tools optimized for that task. We've been told that the TMHMM server is currently the best predictor for transmembrane helices, but we've not done any tests ourselves.

  18. SAM-T06 returns what appears to be probabilities—a residue belongs to one of seven DSSP defined secondary structure states (the scores sum to 1). Are these numbers really probabilities that the given residue truly is the given secondary structure type (as verified by testing on a validation set), or are the numbers something more akin to probabilities that sequence homologs will have that secondary structure type at that particular location?

    The probabilities returned by the SAM-T06 server are from neural nets. The neural nets were trained to maximize
    sum_examples log Phat(correct letter | example)
    where Phat is the neural net output of predicted probability for a letter.

    The calibration has been checked and is pretty good. That is, in the cases where the neural net has said that the probability of helix is 0.80 about 80% of the time there really is a helix there. The cost function used in training makes the calibration very tight on the training set.
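
    A calibration check of this kind can be sketched as follows: group predictions into probability bins and compare the mean predicted probability in each bin with the fraction of cases where the predicted state was really present. This is a generic illustration, not the procedure actually used for SAM-T06; the function name and the choice of ten equal-width bins are assumptions.

```python
def calibration_table(predicted, observed, n_bins=10):
    """Return (mean predicted probability, observed frequency, count) per non-empty bin.

    predicted: probabilities in [0, 1] for one state (for example, helix)
    observed:  booleans, True where that state is really present
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    prob_sums = [0.0] * n_bins
    for p, is_present in zip(predicted, observed):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        totals[b] += 1
        hits[b] += 1 if is_present else 0
        prob_sums[b] += p
    return [
        (prob_sums[b] / totals[b], hits[b] / totals[b], totals[b])
        for b in range(n_bins)
        if totals[b]
    ]
```

    For a well-calibrated predictor, the first two numbers in each row should be close to each other.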

    Of course, the neural net is using a multiple alignment as an input, so if the target sequence is misaligned or has a different structure from the sequences it is aligned to, the neural net can produce a confident, but incorrect, secondary structure prediction.