ISMB-97: Intelligent Systems for Molecular Biology 1997 Gelfand Tutorial

The Fifth International Conference on Intelligent Systems for Molecular Biology (ISMB-97)

Prediction of function in DNA sequence analysis
a tutorial
Misha Gelfand

Striking advances in large-scale DNA sequencing have resulted in the complete sequencing of several bacterial genomes and the yeast genome, as well as megabases of sequence from higher eukaryotes, in particular cosmid-size fragments of human DNA. Thus one of the most important problems of computational molecular biology is now the interpretation and functional mapping of the resulting sequence data.

The tutorial will cover the problems of computer-assisted functional analysis of nucleotide sequences from both the developer's and the user's points of view. It will start with an overview of DNA statistics. Then algorithms for the recognition of functional sites, protein-coding regions and other DNA features will be considered, as well as algorithms for database search. Finally, a user's strategy for analyzing a newly sequenced DNA fragment will be discussed.

The supplementary material will include a list of existing algorithms for functional interpretation of DNA sequences, as well as the locations of e-mail and WWW servers.

  1. DNA statistics.
    1. Oligonucleotide counts.
    2. "Linguistics of DNA": preferred and avoided oligonucleotides.
    3. Information theory analysis (Shannon entropy, Kolmogorov and Lempel-Ziv complexity).
    4. Analysis of periodicity (Fourier transform, cross-correlation functions etc.).
    5. Fractal analysis.
    6. Zipf's law in linguistics and bioinformatics.
  2. Statistical analysis and recognition of protein-coding regions.
    1. Codon usage.
    2. Coding potentials.
    3. Statistical regularities of the exon-intron structure.
  3. Statistical analysis and recognition of functional sites.
    1. Signal detection and algorithms of multiple local alignment.
    2. Consensus. Compilations of transcription factors.
    3. Weight matrices. Probabilistic and statistical-mechanical interpretation.
    4. Neural network recognizers.
  4. Other functional regions.
    1. tRNA genes and self-splicing introns.
    2. CpG islands.
    3. Matrix attachment regions.
    4. Nucleosome positioning.
  5. Database similarity search.
  6. Gene recognition.
    1. Statistical algorithms.
    2. Gene detection by similarity search.
    3. Prediction of gene structure by spliced alignment.
  7. Analysis of complete genomes.
    1. Prediction of protein function.
    2. Comparison of genomes. The minimal gene set.
    3. Combining diverse evidence. Case study: recognition of restriction-modification system proteins and prediction of their specificity by protein homology and DNA statistics.
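
As a rough illustration of outline items 1.1 and 1.3 (overlapping oligonucleotide counts and Shannon entropy), the following minimal Python sketch may be useful. It is not taken from the tutorial materials; the function names `oligo_counts` and `shannon_entropy` and the toy sequence are assumptions made purely for the example.

```python
from collections import Counter
from math import log2


def oligo_counts(seq, k):
    """Count overlapping oligonucleotides (k-mers) in a DNA string."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_entropy(counts):
    """Shannon entropy, in bits, of the observed k-mer frequencies."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Toy sequence chosen for illustration only (hypothetical data).
seq = "ACGTACGTACGT"
mono = oligo_counts(seq, 1)   # mononucleotide counts
di = oligo_counts(seq, 2)     # overlapping dinucleotide counts

print(di)
print(shannon_entropy(mono))
```

For a sequence with equal frequencies of the four bases, the mononucleotide entropy reaches its maximum of 2 bits; lower values indicate a biased composition. Counts for longer oligonucleotides need correspondingly longer sequences to be statistically meaningful.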

References:

  1. M.S.Gelfand. Prediction of function in DNA sequence analysis. J. Computational Biology 2, 87-115 (1995).
  2. M.S.Gelfand. Computer functional analysis of nucleotide sequences: problems and approaches. DIMACS Series in Discrete and Applied Mathematics, vol. 8 "Mathematical Analysis of Biopolymer Sequences", pp. 19-61 (1992).
  3. J.W.Fickett, C.-S.Tung. Assessment of protein coding measures. Nucleic Acids Res. 20, 6441-6450 (1992).
  4. J.W.Fickett. The gene identification problem: an overview for developers. Comput. Chem.
  5. J.W.Fickett. Finding genes by computer: the state of the art. Trends Genet. 12, 316-320 (1996).
  6. P.Bork. Go hunting in sequence databases but watch out for traps. Trends Genet. 12, 425-427 (1996).