next up previous
Next: Efficient search using a Up: Using Markov Models and Previous: Using Markov Models and

Using compression models to find significant sequences

 

The problem of finding interesting sequences can be broken into several subproblems: defining what sequences are interesting, devising an algorithm to find them efficiently, and determining whether the sequences found are statistically significant or just chance variations.

The approach used here is to define the interesting sequences by using a model m that assigns probabilities to sequences (tex2html_wrap_inline669). The probabilities are assigned so that all the probabilities for sequences of a given length sum to 1--the length of the sequence is not predicted by the model. A sequence is interesting if the probability assigned by the model is significantly higher than the probability of the sequence using a null model. Normally we use the negative log probability of the sequence as a cost measure, since the probabilities become extremely small for longer sequences. Using base-2 logarithms gives us the encoding cost in bits of sequence s using model m:

displaymath675

Section 2  gif will talk about two ways that models can be constructed automatically, but first let's discuss how they are used for searching efficiently.




next up previous
Next: Efficient search using a Up: Using Markov Models and Previous: Using Markov Models and

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996