Description of A2M alignment format

The A2M format is used as the primary format for multiple alignments of protein or nucleic-acid sequences in the SAM suite of tools. It is a small modification of FASTA format for sequences and is compatible with most tools that read FASTA.

The main advantages of A2M format over other multiple-alignment formats are

A file consists of any number of sequences, each of which starts with an identifying line. The identifying line must have a ">" character in the first column, followed immediately by an identifier for the sequence. The identifier is terminated by white space or a comma, and should be unique for each sequence. The rest of the line is treated as a comment, but is preserved by many of the tools in the SAM suite.
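The identifying-line rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not code from the SAM suite; the function name parse_id_line is hypothetical.

```python
import re


def parse_id_line(line):
    """Split an A2M identifying line into (identifier, comment).

    Hypothetical helper: it follows the rule that the identifier starts
    right after '>' and ends at the first white space or comma.
    """
    if not line.startswith(">"):
        raise ValueError("identifying line must have '>' in the first column")
    body = line[1:].rstrip("\r\n")
    # Split once at the first white-space character or comma.
    parts = re.split(r"[\s,]", body, maxsplit=1)
    identifier = parts[0]
    if not identifier:
        raise ValueError("identifier must follow '>' immediately")
    comment = parts[1].strip() if len(parts) > 1 else ""
    return identifier, comment
```

For example, parse_id_line(">2crd") gives the identifier "2crd" with an empty comment, and everything after the first space or comma in a longer line is returned as the comment.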

After the identifying line, the sequence is given. Although the A2M format may be used with any alphabet (at least, any that has both upper-case and lower-case letters for every letter of the alphabet), SAM uses two special alphabets.

For proteins, the legal alphabet is

For nucleic acids, the legal alphabet in SAM is

Unknown letters (including the other nucleic acid wild cards) are handled like the general wildcards X and N. (Note: this handling of wildcards is specific to SAM's limited nucleic-acid alphabet, not a property of the A2M format.) White space (including line breaks) and periods are ignored.

The alignment information is encoded using uppercase and lowercase characters, and the special gap character "-". Uppercase characters and "-" represent alignment columns, and there must be exactly the same number of alignment columns in each sequence. Lowercase characters (and spaces or ".") represent insertion positions between alignment columns or at the ends of the sequence. The spaces or periods in the multiple alignments are only for human readability, and may be omitted.
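As a minimal sketch of the column rule, the Python below counts alignment columns (uppercase letters and "-") in each sequence and checks that all sequences agree. The function names are hypothetical and are not part of SAM.

```python
def count_columns(seq):
    """Count alignment columns: uppercase letters and '-' characters."""
    return sum(1 for c in seq if c.isupper() or c == "-")


def unaligned_residues(seq):
    """Recover the raw sequence: keep letters only (dropping gaps, periods
    and white space) and turn lowercase insertions into uppercase."""
    return "".join(c.upper() for c in seq if c.isalpha())


def check_alignment(seqs):
    """Return the shared column count, or raise if the sequences disagree."""
    counts = {count_columns(s) for s in seqs}
    if len(counts) > 1:
        raise ValueError("sequences differ in alignment columns: %s"
                         % sorted(counts))
    return counts.pop() if counts else 0
```

Because periods and spaces are skipped by both tests, the functions give the same answer for dotted and dotless versions of the same alignment.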

The multiple-alignment output from our web servers usually omits the dots from the alignments, since they carry no information and can increase the size of the output many-fold. (Also, some e-mail software has trouble dealing with lines that start with a dot.) Some conversion programs misinterpret dotless a2m files, so conversion to other multiple-alignment formats can be difficult. The SAM tool suite includes the prettyalign program, which can add the dots to a dotless a2m file:

    prettyalign foo.a2m -f > foo.a2m_with_dots

Most conversion programs have no trouble with the a2m_with_dots format.
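To make the dotted form concrete, here is a rough Python sketch of how dots could be added to a dotless alignment. It is not prettyalign's implementation: the function names are hypothetical, and placing the dots after the lowercase letters within each insertion position is an assumption made for the sketch.

```python
def split_slots(seq):
    """Split a sequence into insertion strings and alignment columns.

    For n columns there are n + 1 insertion slots: one before the first
    column, one between each pair of columns, and one after the last.
    """
    slots, cols = [""], []
    for c in seq:
        if c.isupper() or c == "-":
            cols.append(c)
            slots.append("")
        elif c.islower():
            slots[-1] += c
        # White space and '.' carry no information and are skipped.
    return slots, cols


def add_dots(seqs):
    """Pad every insertion slot with '.' so that all rows have equal length.

    Assumption: insertions are left-justified and dots go on the right.
    """
    if not seqs:
        return []
    parsed = [split_slots(s) for s in seqs]
    if len({len(cols) for _, cols in parsed}) > 1:
        raise ValueError("sequences differ in number of alignment columns")
    n = len(parsed[0][1])
    widths = [max(len(slots[i]) for slots, _ in parsed) for i in range(n + 1)]
    rows = []
    for slots, cols in parsed:
        pieces = []
        for i in range(n + 1):
            pieces.append(slots[i].ljust(widths[i], "."))
            if i < n:
                pieces.append(cols[i])
        rows.append("".join(pieces))
    return rows
```

After padding, every row has the same length, so each character position can be read as a column by tools that do not understand the upper/lower-case convention.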

Here is an example of a small multiple alignment:

>2crd
.XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS.
>gi|786430|bbs|159192 potassium channel blocking toxin 15-1 [Leiurus quinquestriatus=scorpions, ssp. hebreus, venom, Peptide Partial, 32 aa]
.-----SCTASNQCWSICKRLHNTNRGKCMNKKCRCYS.
>gi|2500706|sp|P55928|SCKB_PANIM POTASSIUM CHANNEL BLOCKING TOXIN PITX-K-BETA
t----ISCTNEKQCYPHCKKETGYPNAKCMNRKCKCFGr
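
An alignment like the one above could be read with a short Python routine along the following lines. This is a sketch under the rules described in this document (lines starting with ">" begin a new sequence; white space and periods inside sequences are ignored), and read_a2m is a hypothetical name rather than a SAM function.

```python
def read_a2m(text):
    """Parse A2M text into a list of (identifying line, sequence) pairs.

    The identifying line is returned without its leading '>'. Sequence
    lines are joined, and white space and periods are removed.
    """
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append((line[1:], []))
        elif line.strip():
            if not records:
                raise ValueError("sequence data before first '>' line")
            records[-1][1].append(line)
    result = []
    for header, chunks in records:
        joined = "".join(chunks)
        seq = "".join(c for c in joined if not c.isspace() and c != ".")
        result.append((header, seq))
    return result
```

The returned sequences are in dotless form, so a sequence may be wrapped over several lines or padded with periods in the input without changing the result.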