This page answers common questions concerning SAM, SAM-T99 query and SAM-T99 alignment tuneup. If you do not find a solution to your problem here, please inform us at email@example.com.
The E-value is an estimate of approximately how many sequences would score this well by chance in the database searched. For SAM-T99, E-values less than about 0.01 are roughly equivalent, as sequences that score that well have most likely already been included in the SAM-T99 multiple alignment.
If you have SAM installed locally, you can check out very good scores by re-running SAM-T99 with tighter thresholds. E-values better than the loosest threshold in the set are generally not much more meaningful than the threshold. Our web service does not currently support this sort of experimentation.
At a minimum, your sequence should be in a readseq-compatible format, preferably FASTA format. In FASTA format, each sequence must have a unique name identifying the sequence in addition to the sequence residues. Each name must start with a > (greater than) character at the beginning of the line, and continue to the end of the line. Note that only the first word of the name line (up to whitespace or punctuation such as a comma) is used as the name of the sequence, so the first word of each name line must be unique when an alignment is submitted for a query. The actual sequence residues corresponding to the name should start on the line following the name line.
For the SAM-T99 query page, merely satisfying FASTA format is acceptable if only a single sequence is submitted as a query sequence. However, if you submit a multiple alignment as a query, it must be in the a2m format described here. In addition, names for submitted sequences should not be names occurring in NR.
For the SAM-T99 alignment tuneup page, the seed multiple alignment must also be in a2m format, but any homolog sequences submitted need only be in FASTA format since they are not assumed to be aligned.
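The naming rule above can be checked before you submit. The following is a minimal sketch in Python, not part of SAM: the function names are ours, and it assumes that only whitespace and commas end the first word (the exact set of punctuation the server uses is not listed here).

```python
import re
from collections import Counter

def fasta_names(text):
    """Return the first word of every '>' name line, in file order."""
    names = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Assumption: the name ends at the first whitespace or comma.
            parts = re.split(r"[\s,]", line[1:].strip(), maxsplit=1)
            names.append(parts[0])
    return names

def duplicate_names(text):
    """Return the sorted names that appear on more than one name line."""
    counts = Counter(fasta_names(text))
    return sorted(name for name, n in counts.items() if n > 1)

example = """>seq1 first copy
MKVLAAGIVG
>seq2,variant
MKVLSAGIVG
>seq1 second copy
MKILAAGIVG
"""
print(fasta_names(example))      # names as the first words only
print(duplicate_names(example))  # names that would clash in a query
```

An empty list from duplicate_names means every first word is unique under the assumed rule. It does not check whether a name already occurs in NR.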
The SAM-T99 query page can return alignments in FASTA, pretty-printed, and HTML formats. The FASTA format is really our a2m format, which the SAM tools understand but some conversion tools misinterpret. The SAM package includes the "prettyalign" program, which can be used to add extra dots to the alignment, making it easier for tools that don't understand the a2m format to convert it for use with other programs.
The prettyalign alignment formatting program compresses long insertions by showing just the initial segment of each insertion along with the length of the entire insertion segment (the digits). For more details see the SAM documentation for prettyalign.
Most of the sequence IDs in a SAM-T99 a2m file come from the IDs in the NR database. The sequence IDs may be modified by SAM to indicate the first and last sequence positions that matched the SAM-T99 HMM. For example, in the following sequence ID taken from a SAM-T99 alignment,
>gi|16080670|ref|NP_391498.1|_1:234 (NC_000964) similar to hypothetical proteins [Bacillus subtilis] gi|7450240|pir||G70067 conserved hypothetical protein ywqL - Bacillus subtilis gi|1894750|emb|CAB07450.1| (Z92952) product similar to E.coli YjaF protein [Bacillus subtilis] gi|2636142|emb|CAB15634.1| (Z99122) similar to hypothetical proteins [Bacillus subtilis]
the original sequence name
gi|16080670|ref|NP_391498.1|
has been appended with
_1:234
to indicate that the SAM-T99 HMM for the alignment matched the sequence starting at sequence position 1 and ending at sequence position 234.
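If you post-process SAM-T99 alignments with your own scripts, the position suffix can be separated from the original name. This is a small Python sketch under the assumption that the suffix always has the form underscore, start, colon, end at the very end of the first word of the ID, as in the example above; the function name is ours, not SAM's.

```python
import re

# Assumed suffix shape: "_<start>:<end>" at the end of the ID's first word.
_RANGE = re.compile(r"^(?P<name>.+)_(?P<start>\d+):(?P<end>\d+)$")

def split_match_range(seq_id):
    """Split 'name_start:end' into (name, start, end).

    Returns (seq_id, None, None) when no such suffix is present.
    """
    m = _RANGE.match(seq_id)
    if m is None:
        return seq_id, None, None
    return m.group("name"), int(m.group("start")), int(m.group("end"))

print(split_match_range("gi|16080670|ref|NP_391498.1|_1:234"))
# -> ('gi|16080670|ref|NP_391498.1|', 1, 234)
```

Pass only the first word of the name line (the text before the first space), since the rest of the line is free-text description.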
Searching large databases with HMMs is moderately expensive---if you want to do it on your computers, feel free to get the SAM programs and run them. We'll continue to run our server only on fairly small databases. The SAM-T99 method does do a preliminary search of all of NR for possible homologs, so the T99 alignment usually contains all the OBVIOUS homologs, though it won't necessarily contain more remote ones that could be found with the HMM.
You mentioned that FASTA, BLAST, and PSI-BLAST found a high-scoring similar sequence that SAM-T99 did not find. This happens fairly often---the most common causes are composition bias and large helices (particularly coiled-coils). The programs FASTA, BLAST, and PSI-BLAST can all be fooled into reporting very strong scores for sequences whose only similarity is that they both have long amphipathic helices. SAM-T99's reverse-sequence-null model cancels this signal (as well as composition bias and length signals), resulting in a method with many fewer false positives. A few true positives are lost, but not too many.
As an example, the leucine zipper 1ce0A gets only 25 sequences in the 1ce0A.t99.a2m alignment. The 19 PDB sequences in the alignment are all homologs (at least, similar structure and somewhat similar sequence). Other methods are likely to get almost any coiled-coil as a strong hit. This is an example of the reverse-sequence-null model removing a lot of trash (and possibly some good stuff) due to helicity signals.
Another common problem is with metallothionein appearing in searches for other cysteine-rich proteins. SAM-T99 only includes metallothionein when almost all the cysteines line up---we are much more selective on cys-rich proteins than other methods. (Try a scorpion toxin like PDB structure 1aho.)
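To see why scoring against the reversed sequence cancels composition signals, here is a toy Python illustration. It is not SAM's implementation: an ungapped list of per-position scores stands in for the HMM, and the profiles, scores, and function names are all invented for the example. The idea it sketches is only that the final score is the score of the sequence minus the score of the same residues in reverse order.

```python
def raw_score(profile, seq, default=-1.0):
    """Best ungapped window score of seq against per-position score dicts."""
    width = len(profile)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        sum(col.get(res, default)
            for col, res in zip(profile, seq[off:off + width]))
        for off in range(len(seq) - width + 1)
    )

def reverse_null_score(profile, seq):
    """Score of seq minus the score of the same residues reversed."""
    return raw_score(profile, seq) - raw_score(profile, seq[::-1])

# Position-specific toy profile for the motif L-K-E-L.
motif = [{"L": 2.0}, {"K": 2.0}, {"E": 2.0}, {"L": 2.0}]
# Composition-only toy profile: every column rewards the same residues.
biased = [{"L": 1.0, "E": 0.5}] * 4

print(reverse_null_score(motif, "AALKELAA"))   # order matters: signal survives
print(reverse_null_score(biased, "LELELLEE"))  # composition only: cancels to 0.0
```

A sequence that scores well only because of what residues it contains scores equally well backwards, so the difference vanishes. A sequence that matches position by position keeps most of its score.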
The response time for the server varies enormously, depending on the load and on the complexity of the request. We have a 4-processor DEC Alpha (4100 5/466) dedicated to running the SAM web services, so there are generally 4 or 5 jobs running at once. We have queueing of the requests (not quite first-come-first-serve, due to bugs in the UNIX "batch" command). Small proteins with few homologs may take only 10-20 minutes to run once they get to the front of the queue. Typical single-domain proteins may take half an hour to an hour. Large multiple-domain proteins with many homologs can take days---if we see a job that has been running for a very long time we may kill it, to give some other user a chance at the web service.
If your protein is a large, multi-domain protein, your best bet is to break it up into pieces (near domain boundaries is best, if you can guess where those are). Protein structure prediction generally works better on single domains in any case.
The SAM-T98 and SAM-T99 methods both build models the size of the input sequence. Finding domain boundaries when no structure is known is an art that we have not attempted to automate (though other researchers have).
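Once you have guessed the domain boundaries yourself, cutting the sequence into separately submittable FASTA records is mechanical. A minimal Python sketch follows; the function, the "_partN" naming scheme, and the example sequence and boundaries are all hypothetical, and the boundaries are taken as 1-based positions after which to cut.

```python
def split_at_boundaries(name, seq, boundaries):
    """Cut seq after each 1-based position in boundaries.

    Returns one FASTA record (as text) per piece. Piece names are
    '<name>_part<N>' so that the first word of each name stays unique.
    """
    cuts = [0] + sorted(boundaries) + [len(seq)]
    records = []
    for i, (start, end) in enumerate(zip(cuts, cuts[1:]), start=1):
        if not 0 <= start < end <= len(seq):
            raise ValueError(f"bad boundary between {start} and {end}")
        # The description records 1-based inclusive coordinates.
        header = f">{name}_part{i} residues {start + 1} to {end}"
        records.append(f"{header}\n{seq[start:end]}\n")
    return records

# Hypothetical 20-residue sequence with guessed boundaries after 8 and 14.
pieces = split_at_boundaries("myprot", "MKVLAAGIVGLEELLKKAAE", [8, 14])
print("".join(pieces), end="")
```

Each record can then be submitted as its own single-sequence query.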
Failure to conserve an active-site residue could mean several things:
We have done some tests of SAM-T99 as a multiple aligner (using the BAliBase test suite), and found that the alignments produced by SAM-T99 are about as good as those produced by CLUSTAL. You can try realigning them with other multiple aligners (such as CLUSTAL, PRRP, or DIALIGN), but it is probably a good idea to thin the alignment to a few diverse sequences first, since those aligners get very slow when given many sequences. If the alignment changes dramatically, then there is good reason to suspect the alignment.
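One simple way to thin an alignment is to keep a sequence only if it is not too similar to anything already kept. The Python sketch below is an illustration, not a SAM tool: it assumes the rows have already been padded to equal length (for example, by prettyalign-style dots), treats "-" and "." as gaps, and uses an arbitrary 80% identity cutoff.

```python
def percent_identity(a, b, gaps="-."):
    """Percent identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("rows must be padded to the same length")
    pairs = [(x.upper(), y.upper())
             for x, y in zip(a, b) if x not in gaps and y not in gaps]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def thin(rows, max_identity=80.0):
    """Greedily keep rows at most max_identity percent identical to all kept rows."""
    kept = []
    for name, row in rows:
        if all(percent_identity(row, k) <= max_identity for _, k in kept):
            kept.append((name, row))
    return kept

rows = [
    ("seed", "MKVLAAGIVG"),
    ("near", "MKVLAAGIVA"),  # 90% identical to seed, so it is dropped
    ("far",  "MRIL-SGLVG"),
]
print([name for name, _ in thin(rows)])  # -> ['seed', 'far']
```

Because the procedure is greedy, the result depends on row order. Putting your query or seed sequence first guarantees that it is kept.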
We have not yet had much success with our attempts to score HMMs against HMMs, though we have not, of course, tried all possible algorithms. Our best method so far is to combine the results of scoring all template sequences against a target HMM and the target sequence against all template HMMs. There is probably a better method, and some people have had success with profile-profile alignment, but (so far) it has not worked well in our hands.
Here's the buildmodel command used by the DNA tuneup page:
buildmodel $build_runname -train $seed_file -sequence_models 1.0 -Nmodels 10 -alphabet DNA -aweight_method 2 -aweight_bits 0.2 -aweight_exponent 0.5
For the final alignment the command is:
hmmscore score.dna.run -i $build_runname.mod -dpstyle 0 -adpstyle 5 -sw 2 -select_align 8 -db $seed_file -alphabet DNA -db $homologs_file
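If you run these two steps from a script on a local SAM installation, the shell variables can be filled in programmatically. The Python sketch below only assembles the argument lists using the options quoted above; the function name and the example file names are placeholders of ours, and nothing is executed.

```python
import shlex

def dna_tuneup_commands(build_runname, seed_file, homologs_file):
    """Return (buildmodel_args, hmmscore_args) with the variables filled in."""
    build = [
        "buildmodel", build_runname,
        "-train", seed_file,
        "-sequence_models", "1.0",
        "-Nmodels", "10",
        "-alphabet", "DNA",
        "-aweight_method", "2",
        "-aweight_bits", "0.2",
        "-aweight_exponent", "0.5",
    ]
    score = [
        "hmmscore", "score.dna.run",
        "-i", f"{build_runname}.mod",  # model file written by buildmodel
        "-dpstyle", "0",
        "-adpstyle", "5",
        "-sw", "2",
        "-select_align", "8",
        "-db", seed_file,
        "-alphabet", "DNA",
        "-db", homologs_file,
    ]
    return build, score

# Placeholder run name and file names, for illustration only.
build, score = dna_tuneup_commands("tuneup", "seed.a2m", "homologs.fa")
print(shlex.join(build))
print(shlex.join(score))
```

With SAM on your PATH, each list could be passed to subprocess.run(args, check=True), running buildmodel first so that the .mod file exists before hmmscore reads it.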
Yes, we encourage you to download a copy of the SAM software. It's available free for academic use at the SAM web site. You may find additional functionality you need with the full SAM package that is not currently accessible through our web site.
Last modified: March 31, 2000