The inner loop of the algorithm is an
dynamic programming
algorithm that calculates the forward and backward values for each of
the three states at a given pair of model and sequence indices. The
serial implementation of this algorithm is straightforward. For Cray
vectorization,
the two inner loops were modified to calculate counter-diagonals in
order (i.e.,
) to make
effective use of the vector pipelines.
The MasPar parallel processor features a 2-dimensional mesh of processing elements (PEs) and a global router [Nickolls, 1990]. For large numbers of sequences, the best algorithm mapping can be to perform a complete model-sequence dynamic programming in each PE; unfortunately, the 64 KBytes of local memory per PE is too small for such coarse-grain parallelism. Instead, the array is treated as collection of linear arrays of PEs, where each linear array contains one model. Thus, if the models are of length 100, forty sequence-against-model calculations are performed at a time using 4000 of the 4096 processing elements.
The linear arrays are used as follows. During the forward part of the
dynamic programming, the (i,j) value is computed during step i+j
in i-th PE of one of the linear arrays. This calculation depends on
values from (i-1,j), (i,j-1), and (i-1,j-1), all of which have
already been calculated in either PE
or PE
. The
characters of the sequence are also shifted through the linear array,
one PE at a time, to ensure that character j is in PE
at time
step i+j. During the backward calculation, this process is
reversed, calculating each diagonal at time
. After all
sequences have been compared to the model, the frequencies are
combined using log-time reduction, and uploaded to the host computer
for high-level processing such as surgery and noise injection.
Performance can be rated in terms of dynamic programming cell updates per second, or CUPS. Performing both the forward and backward calculations at a single dynamic programming cell (one character from one sequence against one model node) is a single cell update. The total number of cell updates, which depends greatly on parameter settings, is the sum of model lengths over all reestimation cycles multiplied by the total number of characters in all the sequences. For the experiments described in the next section, a typical training run on 50 globins of average length 146 and a model length of 145 with 40 reestimatation cycles, required 50 seconds on our 4096-PE MasPar MP-2 or about 7 minutes on a DEC Alpha 3000/400 (longer runs runs heighten the difference between the two machines to factors of 10-15). Experiments based on 400 training sequences and an earlier version of the code are summarized in Table 1.
Table 1: HMM training on various machines.