
Conclusions and future work

I have shown how cost maps can be used effectively to search for interesting DNA sequences using two different types of models: simple Markov models and hidden Markov models. The HMMs provide a more sensitive search technique, but both types of model are quite effective at finding REPs in the E. coli genome--as effective as the best previously known techniques [2, Table I].

Several improvements are planned, including better techniques for building HMMs (perhaps using simple Markov models to define states, but actual sequences to get the connectivity between states), better handling of ambiguity characters in the database and in the seed sequences, interfacing the HMMs to a multiple-alignment program, using a non-constant null model, including null states in the HMMs, allowing the user to specify that only sequences that use particular states of the HMM are interesting, changing the definition of blurring to blur only in the ``context'' positions and not the predicted position, and modifying the code to handle protein sequences.

There are interesting repeated sequences that are not found by concentrating EcoSeq6. For example, the sequences found by Kunisawa and Nakamura [6] are not in the concentrated file nor in the sets of sequences found by the REP models. Growing their set of five examples finds a total of eight examples in EcoSeq6. Perhaps the threshold for significance could be changed to find repeated elements that are either not quite as common or not as long as the REP and IS sequences.

I'm also interested in studying the HMMs that are produced and using them to characterize and classify the REP sequences. One previous study identified some interesting REP clusters as containing binding sites for integration host factor (IHF), calling them repetitive IHF-binding palindromic elements (RIPs) [9]. As a preliminary step, I examined the HMMs to see if they modeled the IHF binding site of the RIPs. (The site has also been referred to as sequence L of a BIME [3].)

  

Figure 5: Automatically produced drawing of the HMM REP99-gxn-hmm400m, which is the most easily understood HMM of the ones listed in Table 4. The thickness of the edges is proportional to the square root of the number of times the edge was used. All edges that seem to connect to or from blank space are actually connections to the junk loop on the middle of the left side of the picture. The two main REP sequences and the REPv variant can be seen in both the forward direction (on the right) and the reverse direction (on the left).

In REP99-gxn.hmm400m, there is a sequence of states matching CAATATATTG (Figure 5, upper-left side), which matches 48 times, while Oppenheim et al. reported only 33 possible binding sites in EcoSeq5, 28 as part of ``RIP'' elements and 5 as parts of ``near-RIPs'' [9]. The HMM missed one of the RIP sites (in REP102--only part of REP102 was found by the HMM) and one of the near-RIPs (REP95 was not found at all). The seventeen locations for possible IHF binding sites newly found by the HMM are in REPs 18 (twice), 36, 44, 45, 46, 64 (twice), 89, 107, 112, 113 (twice), 121, 126, 127. Also, there are two locations in REP34, only one of which is listed as a RIP by Oppenheim et al. [9, Fig. 4].
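The counts in the preceding paragraph can be cross-checked with a few lines of arithmetic. The sketch below is only a bookkeeping aid. It assumes that each of the 48 matches is a distinct site and that each previously reported site the HMM recovered accounts for exactly one match. The variable names are illustrative and do not come from the original software.

```python
# Counts taken from the text; the variable names are hypothetical.
hmm_matches = 48          # matches of the CAATATATTG state sequence
reported_sites = 28 + 5   # RIP sites plus near-RIP sites reported in [9]
missed_by_hmm = 2         # one RIP site (REP102) and one near-RIP (REP95)

# Assumption: every reported site that was found contributes one match.
reported_and_found = reported_sites - missed_by_hmm
newly_found = hmm_matches - reported_and_found

# REP number -> number of new locations, as enumerated in the text.
listed_new = {18: 2, 36: 1, 44: 1, 45: 1, 46: 1, 64: 2, 89: 1,
              107: 1, 112: 1, 113: 2, 121: 1, 126: 1, 127: 1}
listed_total = sum(listed_new.values())

print("newly found by subtraction:", newly_found)
print("locations enumerated by REP number:", listed_total)
```

Under these assumptions the subtraction gives seventeen new locations, while the enumerated REP numbers account for sixteen. One reading, which the text does not state explicitly, is that the second location in REP34 is the seventeenth.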

Since the IHF binding sequence is palindromic for the middle 10 bases, the 9-words of the simple Markov model can't determine the direction in the middle, and so the paths for the two directions share the middle two states when built directly from the simple Markov models. State merging results in blurring the two directions still more.
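The palindrome argument can be checked directly on the matched sequence CAATATATTG. The following is a minimal sketch, not the code used for this work. The helper names `reverse_complement` and `kmers` are hypothetical, and treating each 9-word as one state is a simplifying assumption about how states are built from the simple Markov model.

```python
# Complement table for unambiguous DNA bases only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq, k):
    """Return all overlapping k-words of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

site = "CAATATATTG"  # the 10-base sequence matched by the HMM states

# A reverse-complement palindrome reads the same on both strands.
is_palindromic = site == reverse_complement(site)

# 9-words seen when reading the site in each direction.
forward_words = kmers(site, 9)
reverse_words = kmers(reverse_complement(site), 9)
shared_words = sorted(set(forward_words) & set(reverse_words))

print(is_palindromic)
print(shared_words)
```

Because the 10-base site is its own reverse complement, both of its 9-words occur in either reading direction, which is consistent with the two directions sharing two states in the middle.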



Rey Rivera
Thu Aug 22 14:04:06 PDT 1996