Understanding Long-range Correlations in DNA Sequences

doi:10.1016/0167-2789(94)90294-1

Journal Article•DOI•

Wentian Li¹, Thomas G. Marr¹, Kunihiko Kaneko²•Institutions (2)

Cold Spring Harbor Laboratory¹, University of Tokyo²

22 Mar 1994-arXiv: Chaotic Dynamics-

TL;DR: This paper reviews the literature on statistical long-range correlation in DNA sequences and concludes that a mixture of many length scales (including some relatively long ones) is responsible for the observed 1/f-like spectral component.

Abstract: In this paper, we review the literature on statistical long-range correlation in DNA sequences. We examine the current evidence for these correlations, and conclude that a mixture of many length scales (including some relatively long ones) in DNA sequences is responsible for the observed 1/f-like spectral component. We note the complexity of the correlation structure in DNA sequences. The observed complexity often makes it hard, or impossible, to decompose the sequence into a few statistically stationary regions. We suggest that, based on the complexity of DNA sequences, a fruitful approach to understand long-range correlation is to model duplication, and other rearrangement processes, in DNA sequences. One model, called "expansion-modification system", contains only point duplication and point mutation. Though simplistic, this model is able to generate sequences with 1/f spectra. We emphasize the importance of DNA duplication in its contribution to the observed long-range correlation in DNA sequences.
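
The expansion-modification idea described in the abstract can be illustrated with a short simulation. The abstract names only the two ingredients (point duplication and point mutation); the specific binary rule used here (each symbol is flipped with probability p_mutate and duplicated otherwise), along with all function and parameter names, is an assumption made for illustration and not necessarily the exact model of the paper.

```python
import random


def expansion_modification(seed_seq, p_mutate, n_steps, rng):
    """Iterate a toy binary expansion-modification rule (illustrative assumption).

    At every step each symbol is rewritten independently:
      - with probability p_mutate it is flipped (point mutation, no growth);
      - otherwise it is copied twice (point duplication, sequence grows).
    """
    seq = list(seed_seq)
    for _ in range(n_steps):
        nxt = []
        for s in seq:
            if rng.random() < p_mutate:
                nxt.append(1 - s)       # point mutation
            else:
                nxt.extend((s, s))      # point duplication
        seq = nxt
    return seq


if __name__ == "__main__":
    rng = random.Random(0)              # fixed seed so runs are repeatable
    seq = expansion_modification([0], p_mutate=0.1, n_steps=12, rng=rng)
    print(len(seq), "symbols; fraction of ones:", sum(seq) / len(seq))
```

Because duplicated blocks are duplicated again at later steps, old mutations leave domains of many different sizes, which is the qualitative mechanism the abstract credits for the mixture of length scales. Checking the spectrum of the output would require a Fourier transform of the generated sequence, which this sketch leaves out.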

Citations

Journal Article•DOI•

Prediction of probable genes by Fourier analysis of genomic sequences

Shrish Tiwari¹, Srinivasan Ramachandran¹, Alok Bhattacharya¹, Sudha Bhattacharya¹, Ramakrishna Ramaswamy¹•Institutions (1)

Jawaharlal Nehru University¹

01 Jun 1997-Bioinformatics

TL;DR: Fourier techniques are used to analyse the three-base periodicity of coding regions and to develop a tool for recognizing coding regions in genomic DNA; the relative height of the peak at f = 1/3 in the Fourier spectrum is found to be a good discriminator of coding potential.

Abstract: Motivation: The major signal in coding regions of genomic sequences is a three-base periodicity. Our aim is to use Fourier techniques to analyse this periodicity, and thereby to develop a tool to recognize coding regions in genomic DNA. Result: The three-base periodicity in the nucleotide arrangement is evidenced as a sharp peak at frequency f = 1/3 in the Fourier (or power) spectrum. From extensive spectral analysis of DNA sequences of total length over 5.5 million base pairs from a wide variety of organisms (including the human genome), and by separately examining coding and non-coding sequences, we find that the relative height of the peak at f = 1/3 in the Fourier spectrum is a good discriminator of coding potential. This feature is utilized by us to detect probable coding regions in DNA sequences, by examining the local signal-to-noise ratio of the peak within a sliding window. While the overall accuracy is comparable to that of other techniques currently in use, the measure that is presently proposed is independent of training sets or existing database information, and can thus find general application. Availability: A computer program GeneScan which locates coding open reading frames and exonic regions in genomic sequences has been developed, and is available on request. Contact: E-mail: rama@jnuniv.ernet.in.
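
The "relative height of the peak at f = 1/3" can be sketched as a peak-to-mean ratio of the power spectrum. This is not the GeneScan program: the choice of 0/1 indicator sequences per base, the averaging range, and the function names are assumptions for illustration, and the sliding-window step mentioned in the abstract is left out.

```python
import cmath

BASES = "ACGT"


def total_power(seq, k):
    """Sum over the four bases of |DFT of the base's 0/1 indicator|^2 at index k."""
    n = len(seq)
    power = 0.0
    for base in BASES:
        u = sum(cmath.exp(-2j * cmath.pi * k * pos / n)
                for pos, b in enumerate(seq) if b == base)
        power += abs(u) ** 2
    return power


def peak_to_mean_ratio(seq):
    """Power at f = 1/3 divided by the mean power over k = 1 .. n//2."""
    n = len(seq)
    if n == 0 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    peak = total_power(seq, n // 3)
    ks = range(1, n // 2 + 1)
    mean = sum(total_power(seq, k) for k in ks) / len(ks)
    return peak / mean if mean > 0 else 0.0


if __name__ == "__main__":
    print(peak_to_mean_ratio("ATG" * 20))    # strongly period-3 toy sequence
    print(peak_to_mean_ratio("ACGT" * 15))   # period-4 toy sequence
```

The direct sum costs O(n) per frequency, which is fine for a window of a few hundred bases; a fast Fourier transform would be the natural replacement for longer windows.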

469 citations

Journal Article•DOI•

Genomic signal processing

D. Anastassiou

01 Jul 2001-IEEE Signal Processing Magazine

TL;DR: Digital signal processing provides a set of novel and useful tools for solving highly relevant problems in genomic information science and technology. For example, color spectrograms visually provide, in the form of local texture, significant information about biomolecular sequences, which facilitates understanding of local nature, structure, and function.

Abstract: Genomics is a highly cross-disciplinary field that creates paradigm shifts in such diverse areas as medicine and agriculture. It is believed that many significant scientific and technological endeavors in the 21st century will be related to the processing and interpretation of the vast information that is currently revealed from sequencing the genomes of many living organisms, including humans. Genomic information is digital in a very real sense; it is represented in the form of sequences of which each element can be one out of a finite number of entities. Such sequences, like DNA and proteins, have been mathematically represented by character strings, in which each character is a letter of an alphabet. In the case of DNA, the alphabet is size 4 and consists of the letters A, T, C and G; in the case of proteins, the size of the corresponding alphabet is 20. As the list of references shows, biomolecular sequence analysis has already been a major research topic among computer scientists, physicists, and mathematicians. The main reason that the field of signal processing does not yet have significant impact in the field is that it deals with numerical sequences rather than character strings. However, if we properly map a character string into one or more numerical sequences, then digital signal processing (DSP) provides a set of novel and useful tools for solving highly relevant problems. For example, in the form of local texture, color spectrograms visually provide significant information about biomolecular sequences which facilitates understanding of local nature, structure, and function. Furthermore, both the magnitude and the phase of properly defined Fourier transforms can be used to predict important features like the location and certain properties of protein coding regions in DNA. Even the process of mapping DNA into proteins and the interdependence of the two kinds of sequences can be analyzed using simulations based on digital filtering. These and other DSP-based approaches result in alternative mathematical formulations and may provide improved computational techniques for the solution of useful problems in genomic information science and technology.
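
One simple way to map a character string into numerical sequences is one binary indicator sequence per letter. The abstract does not fix a particular mapping, so the scheme and the function name below are assumptions for illustration only.

```python
def indicator_sequences(seq, alphabet="ACGT"):
    """Map a character string to one 0/1 sequence per letter of the alphabet.

    Position i of the sequence for letter x is 1 if seq[i] == x, else 0.
    """
    return {sym: [1 if c == sym else 0 for c in seq] for sym in alphabet}


if __name__ == "__main__":
    u = indicator_sequences("GATTACA")
    print(u["A"])   # [0, 1, 0, 0, 1, 0, 1]
    print(u["T"])   # [0, 0, 1, 1, 0, 0, 0]
```

Each resulting list is an ordinary numerical signal, so standard DSP tools (filters, Fourier transforms, spectrograms) can be applied to it directly. The same function works for proteins by passing a 20-letter alphabet.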

453 citations

Journal Article•DOI•

Frequency-domain analysis of biomolecular sequences

Dimitris Anastassiou¹•Institutions (1)

Columbia University¹

01 Dec 2000-Bioinformatics

TL;DR: An optimization procedure improving upon traditional Fourier analysis performance in distinguishing coding from noncoding regions in DNA sequences is provided and it is demonstrated that color spectrograms can visually provide significant information about biomolecular sequences, thus facilitating understanding of local nature, structure and function.

Abstract: Motivation: Frequency-domain analysis of biomolecular sequences is hindered by their representation as strings of characters. If numerical values are assigned to each of these characters, then the resulting numerical sequences are readily amenable to digital signal processing. Results: We introduce new computational and visual tools for biomolecular sequence analysis. In particular, we provide an optimization procedure improving upon traditional Fourier analysis performance in distinguishing coding from noncoding regions in DNA sequences. We also show that the phase of a properly defined Fourier transform is a powerful predictor of the reading frame of protein coding regions. Resulting color maps help in visually identifying not only the existence of protein coding areas for both DNA strands, but also the coding direction and the reading frame for each of the exons. Furthermore, we demonstrate that color spectrograms can visually provide, in the form of local 'texture', significant information about biomolecular sequences, thus facilitating understanding of local nature, structure and function. Availability: All software for techniques described in this paper is available from the author upon request.

261 citations

Journal Article•DOI•

Positional Dependence, Cliques, and Predictive Motifs in the bHLH Protein Domain

William R. Atchley¹, W. Terhalle², Andreas W. M. Dress²•Institutions (2)

North Carolina State University¹, Bielefeld University²

01 May 1999-Journal of Molecular Evolution

TL;DR: In this paper, a large number of proteins that contain the highly conserved basic helix-loop-helix domain (bHLH) were analyzed and a predictive motif was constructed that accurately identifies bHLH domain-containing proteins that belong to groups A and B.

Abstract: Quantitative analyses were carried out on a large number of proteins that contain the highly conserved basic helix–loop–helix domain. Measures derived from information theory were used to examine the extent of conservation at amino acid sites within the bHLH domain as well as the extent of mutual information among sites within the domain. Using the Boltzmann entropy measure, we described the extent of amino acid conservation throughout the bHLH domain. We used position association (pa) statistics that reflect the joint probability of occurrence of events to estimate the "mutual information content" among distinct amino acid sites. Further, we used pa statistics to estimate the extent of association in amino acid composition at each site in the domain and between amino acid composition and variables reflecting clade and group membership, loop length, and the presence of a leucine zipper. The pa values were also used to describe groups of amino acid sites called "cliques" that were highly associated with each other. Finally, a predictive motif was constructed that accurately identifies bHLH domain-containing proteins that belong to Groups A and B.
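
The two kinds of quantity discussed in the abstract, per-site conservation and association between sites, can be illustrated with the textbook Shannon entropy and mutual information of alignment columns. This swaps in the standard definitions; it is not the paper's pa statistic, and the toy alignment and function names are hypothetical.

```python
import math
from collections import Counter


def column(alignment, i):
    """Return the residues found at site i across all aligned sequences."""
    return [s[i] for s in alignment]


def entropy(symbols):
    """Shannon entropy (bits) of the empirical distribution of symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


if __name__ == "__main__":
    toy_alignment = ["AKL", "AKL", "GRL", "GRL"]   # hypothetical aligned fragments
    print(entropy(column(toy_alignment, 2)))       # fully conserved site
    print(mutual_information(column(toy_alignment, 0),
                             column(toy_alignment, 1)))
```

Low entropy marks a conserved site, and high mutual information marks a pair of sites that vary together. With few sequences these plug-in estimates are biased, so real analyses usually add a small-sample correction or a significance test.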

259 citations

Journal Article•DOI•

The study of correlation structures of DNA sequences: a critical review.

Wentian Li¹•Institutions (1)

Rockefeller University¹

01 Jan 1997-Computational Biology and Chemistry

TL;DR: The study of correlation structure in the primary sequences of DNA is reviewed; the body of work on this topic constitutes a good starting point for future studies.

234 citations

References

Journal Article•DOI•

A mathematical theory of communication

Claude E. Shannon

01 Jul 1948-Bell System Technical Journal

TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.

Abstract: In this final installment of the paper we consider the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now. To a considerable extent the continuous case can be obtained through a limiting process from the discrete case by dividing the continuum of messages and signals into a large but finite number of small regions and calculating the various parameters involved on a discrete basis. As the size of the regions is decreased these parameters in general approach as limits the proper values for the continuous case. There are, however, a few new effects that appear and also a general change of emphasis in the direction of specialization of the general results to particular cases.

65,425 citations

Journal Article•

The mathematical theory of communication

Claude E. Shannon, Warren Weaver

01 Jan 1949-IEEE Transactions on Instrumentation and Measurement

TL;DR: The Mathematical Theory of Communication was originally published as a paper on communication theory more than fifty years ago and has since gone through four hardcover and sixteen paperback printings.

Abstract: Scientific knowledge grows at a phenomenal pace--but few books have had as lasting an impact or played as important a role in our modern world as The Mathematical Theory of Communication, published originally as a paper on communication theory more than fifty years ago. Republished in book form shortly thereafter, it has since gone through four hardcover and sixteen paperback printings. It is a revolutionary work, astounding in its foresight and contemporaneity. The University of Illinois Press is pleased and honored to issue this commemorative reprinting of a classic.

15,525 citations

Journal Article•DOI•

Numerical Recipes in C: The Art of Scientific Computing

Mary C. Seiler, Fritz A. Seiler

01 Sep 1989-Risk Analysis

11,285 citations

Journal Article•

A Mathematical Theory of Communication

Claude E. Shannon