next up previous
Next: When is a subsequence Up: Using compression models to Previous: Using compression models to

Efficient search using a cost map

 

The simplest algorithm for finding sequences using a model would be to enumerate all possible sequences and compute the cost of each. Unfortunately, this crude algorithm is too expensive. In a database of d characters, up to tex2html_wrap_inline679 different sequences may be found. If computing the cost of a sequence takes time proportional to the length of the sequence, then the entire search algorithm would be order tex2html_wrap_inline681. Furthermore, if several overlapping sequences have a low cost, we would like the algorithm to pick out the best of them--deferring for the moment exactly what we mean by ``best''.

Therefore, we need a search technique whose execution time is linear in d and which returns only disjoint sequences. First, let's assume that the model used to compute the probabilities provides not just an overall cost for a sequence, but assigns a cost to each position of the sequence. Second, let's assume that the cost of any given sequence is the sum of the costs for each of the characters of the sequence. Section 2.4  gif discusses the ways in which the models actually used violate these assumptions and why the violations are not serious in this context.

With the above assumptions, we can compute the costs for each character of the database once and save the costs in an array, called a cost map. From the cost map we can compute the cost for any subsequence of the database by taking the sum of costs in the array starting at the beginning of the subsequence and stopping at the end. Hence, given a cost map, finding interesting sequences becomes independent of the model used to create the cost map. This separation of the model from the search technique is one of the main advantages of the techniques described in this paper.

Our problem then is to find non-overlapping subsequences of a cost map whose cost is significantly lower than what we would expect from random strings of characters. Furthermore, we would like to do this in time proportional to the number of positions in the cost map.


next up previous
Next: When is a subsequence Up: Using compression models to Previous: Using compression models to

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996