Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence.

doi:10.1016/J.JTBI.2007.03.038

Journal Article•DOI•

Changchuan Yin¹, Stephen S.-T. Yau¹•Institutions (1)

University of Illinois at Chicago¹

21 Aug 2007-Journal of Theoretical Biology (J Theor Biol)-Vol. 247, Iss: 4, pp 687-694

TL;DR: A new method to predict protein coding regions is developed based on the fact that most exon sequences have a 3-base periodicity, while intron sequences do not have this feature.

About: This article was published in the Journal of Theoretical Biology on 2007-08-21. It has received 169 citations to date. The article focuses on the topic of sequence analysis.

Citations

Journal Article•DOI•

Gene prediction based on DNA spectral analysis: a literature review.

Sajid A. Marhon¹, Stefan C. Kremer¹•Institutions (1)

University of Guelph¹

08 Apr 2011-Journal of Computational Biology

TL;DR: This work provides an accessible introduction and comparative review of DSP methods for the identification of protein-coding regions by breaking down the approaches into four steps, and suggests new combinations that may be worthy of future study.

Abstract: The identification of regions of DNA sequences that code for proteins is one of the most fundamental applications in bioinformatics. These protein-coding regions are in contrast to other DNA regions that encode functional RNA molecules, provide structural stability of chromosomes, serve as genetic raw materials, represent molecular fossils, or have no known purpose (sometimes called “junk DNA”). A number of approaches have been suggested for differentiating between the protein-coding and non-protein-coding regions of DNA. A selection of these approaches is based on digital signal processing (DSP) techniques. These DSP techniques rely on the phenomenon that protein-coding regions have a prominent power spectrum peak at frequency f = ⅓ arising from the length of codons (three nucleic acids). This article partitions the identification of protein-coding regions into four discrete steps. Based on this partitioning, DSP techniques can be easily described and compared based on their unique implementatio...

75 citations

Cites background or methods from "Prediction of protein coding region..."

...Yin and Yau (2007) also used the SNR....
[...]
...Other techniques used other tools, but the goal is the same which is analyzing the 3-base periodicity of DNA sequences to differentiate between coding and non-coding regions (Yin and Yau, 2007; Mena-Chalco et al., 2008; Ma and Zhu, 2007; Kahumani et al., 2008)....
[...]
...In general, compared with EPND (Yin and Yau, 2007), the threshold in this technique is more accurate since it is calculated based on the sequence to be predicted....
[...]
...Yin and Yau (2007) used the nucleotide distributions to compute PS(N/3) of a DNA sequence accumulatively....
[...]
...Other DSP-based methods that measure the 3-base periodicity without computing the DFT sometimes do not use a sliding window in the analysis of DNA sequences (Yin and Yau, 2007; Mena-Chalco et al., 2008)....
[...]

Journal Article•DOI•

A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering

Changchuan Yin¹, Ying Chen², Stephen S.-T. Yau²•Institutions (2)

University of Phoenix¹, Tsinghua University²

21 Oct 2014-Journal of Theoretical Biology

TL;DR: This work proposes a new alignment-free similarity measure of DNA sequences using the Discrete Fourier Transform (DFT), and assesses the accuracy of the similarity metric in hierarchical clustering using simulated DNA and virus sequences.

72 citations

Journal Article•DOI•

A new method to cluster DNA sequences using Fourier power spectrum.

Tung Hoang, Changchuan Yin, Hui Zheng, Chenglong Yu¹, Rong Lucy He², Stephen S.-T. Yau³•Institutions (3)

Flinders University¹, Chicago State University², Tsinghua University³

07 May 2015-Journal of Theoretical Biology

TL;DR: Experimental results on various datasets show that the proposed clustering method provides an efficient tool to classify genes and genomes and is remarkably faster than other multiple sequence alignment and alignment-free methods.

71 citations

Cites methods from "Prediction of protein coding region..."

...patterns of that sequence, and it has been applied to identify protein coding regions in genomic sequences (Fukushima et al., 2002; Yin and Yau, 2005, 2007)....
[...]

Journal Article•DOI•

A DSP Approach for Finding the Codon Bias in DNA Sequences

J. Tuqan¹, Ahmad A. Rushdi¹•Institutions (1)

University of California, Davis¹

24 Jun 2008-IEEE Journal of Selected Topics in Signal Processing

TL;DR: A new DSP based model is derived that directly relates the identification of the period-3 component to the detection of nucleotide bias in the codon structure, and completely characterizes the DNA spectrum by a set of numerical sequences termed the filtered polyphase sequences.

Abstract: The detection of different forms of periodicities in DNA sequences has been an active area of research in recent years. Most of the signal processing based methods have primarily focussed on assigning numerical values to the symbolic DNA sequence and then applying spectral analysis tools such as the short-time discrete Fourier transform (ST-DFT) to locate these repeats. A key application of DNA periodicity finding has been in the identification of the protein coding regions in DNA sequences by tracking the so-called period-3 component using the DNA spectrum. The main problem with this gene detection approach is that it is successful for certain genes but does not work for others. An interesting open research problem is to therefore determine the underlying reasons behind this disparity in performance. This requires, in turn, a solid understanding of the working principles of the period-3 component and the DNA spectrum. In this paper, we present a DSP-based approach that provides a complete analysis of this phenomenon. Specifically, we derive a new DSP based model that 1) clearly explains the underlying mechanism of the period-3 component, 2) directly relates the identification of the period-3 component to the detection of nucleotide bias in the codon structure, and 3) completely characterizes the DNA spectrum by a set of numerical sequences termed the filtered polyphase sequences. Furthermore, by adhering to the specific structure of the derived model, we can show that standard signal processing tools such as digital filtering can substantially enhance the detection of the codon bias. Several performance measures of DNA periodicity detection are also proposed and experimental results are provided to illustrate the key findings of our work.
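
The abstract above describes assigning numerical values to the symbolic sequence and tracking the period-3 component with a short-time DFT. A minimal sketch of that general idea follows, assuming binary indicator sequences, a rectangular window, and evaluation at the single frequency 1/3. The function names, window length, step size, and toy sequence are illustrative assumptions; this is not the filtered polyphase model derived by Tuqan and Rushdi.

```python
import cmath

def period3_power(window):
    """Sum over the four bases of |sum_n u_x(n) * exp(-2*pi*i*n/3)|^2 for one window."""
    w = cmath.exp(-2j * cmath.pi / 3)  # primitive cube root of unity
    total = 0.0
    for base in "ATCG":
        # The indicator u_x(n) is 1 where the window holds `base`; the phase depends only on n mod 3.
        coeff = sum(w ** (n % 3) for n, s in enumerate(window) if s == base)
        total += abs(coeff) ** 2
    return total

def sliding_period3(seq, window=30, step=3):
    """Return (start, period-3 power) pairs for windows slid along the sequence."""
    seq = seq.upper()
    return [(start, period3_power(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]

# Hypothetical toy input: a strictly period-3 stretch followed by a period-2 stretch.
toy = "ATG" * 20 + "AC" * 30
profile = sliding_period3(toy, window=30, step=3)
for start, power in profile[::10]:
    print(start, round(power, 6))
```

On this toy input the power is high over the period-3 stretch and falls to roughly zero over the period-2 stretch. Real genomic sequences are far noisier, which is consistent with the abstract's remark that the approach works for some genes and not for others.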

65 citations

Cites background from "Prediction of protein coding region..."

...[39] where the authors also study the relationship between the relative abundance of the nucleotides and the period-3 property....
[...]

Journal Article•DOI•

Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison.

Tung Hoang, Changchuan Yin, Stephen S.-T. Yau¹•Institutions (1)

Tsinghua University¹

01 Oct 2016-Genomics

TL;DR: This research proposes to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship and gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega.

61 citations

Cites methods from "Prediction of protein coding region..."

...With this characteristic, DFT has been used in numerous DNA researches, such as gene prediction [22], protein coding region [23], and periodicity analysis [24]....
[...]

References

Journal Article•DOI•

Prediction of Complete Gene Structures in Human Genomic DNA

Christopher B. Burge¹, Samuel Karlin¹•Institutions (1)

Stanford University¹

25 Apr 1997-Journal of Molecular Biology

TL;DR: A general probabilistic model of the gene structure of human genomic sequences which incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions is introduced.

3,709 citations

"Prediction of protein coding region..." refers methods in this paper

...The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences....
[...]
...As examples, GenScan algorithm (Burge and Karlin, 1997) measured distinct statistics features of exons and introns within genomes and employed them in prediction via hidden Markov model (HMM); MZFF method (Zhang, 1997) was developed for predicting protein coding regions using quadratic discriminant…...
[...]

Journal Article•DOI•

Recognition of protein coding regions in DNA sequences

James W. Fickett¹•Institutions (1)

Los Alamos National Laboratory¹

11 Sep 1982-Nucleic Acids Research

TL;DR: The test has been thoroughly proven on 400,000 bases of sequence data: it misclassifies 5% of the regions tested and gives an answer of "No Opinion" one fifth of the time.

Abstract: We give a test for protein coding regions which is based on simple and universal differences between protein-coding and noncoding DNA. The test is simple enough to use without a computer and is completely objective. The test has been thoroughly proven on 400,000 bases of sequence data: it misclassifies 5% of the regions tested and gives an answer of "No Opinion" one fifth of the time. We predict some new coding and noncoding regions in published sequences.

875 citations

"Prediction of protein coding region..." refers background in this paper

...Keywords: Exon; Intron; 3-Base periodicity; Fourier transform...
[...]
...N/3 at coding regions were addressed by Fickett (Fickett, 1982; Fickett and Tung, 1992)....
[...]
...It was demonstrated that the 3-base periodicity in a DNA sequence is partly caused by the unbalanced nucleotide distributions in the three coding positions in the sequence (Fickett, 1982; Fickett and Tung, 1992; Tiwari et al., 1997; Yin and Yau, 2005)....
[...]
...The 3-base periodicity magnitude and background noise can be directly computed from the nucleotide distributions (Fickett and Tung, 1992; Yin and Yau, 2005)....
[...]
...During the last two decades, a variety of computational algorithms have been developed to predict exons (for reviews, Fickett and Tung, 1992; Fickett, 1996; Zhang, 2002; Mathé et al., 2002)....
[...]
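
The excerpts above state that the 3-base periodicity magnitude can be computed directly from the nucleotide distributions in the three codon positions. A minimal sketch of why this works, under stated assumptions: if a base occurs F1, F2 and F3 times at positions with n mod 3 equal to 0, 1 and 2, its Fourier coefficient at frequency 1/3 is F1 + F2·w + F3·w² with w = exp(-2πi/3), and the squared magnitude expands to F1² + F2² + F3² − F1·F2 − F1·F3 − F2·F3. The function names are hypothetical, and the sketch does not reproduce the background-noise or threshold computation of the cited papers.

```python
def codon_position_counts(seq):
    """Count each nucleotide at the three codon positions (index n mod 3)."""
    counts = {base: [0, 0, 0] for base in "ATCG"}
    for n, s in enumerate(seq.upper()):
        if s in counts:  # characters other than A, T, C, G are ignored
            counts[s][n % 3] += 1
    return counts

def period3_power_from_counts(seq):
    """Power at frequency 1/3 computed from position counts, with no DFT."""
    total = 0
    for f1, f2, f3 in codon_position_counts(seq).values():
        # |F1 + F2*w + F3*w^2|^2 with w = exp(-2*pi*i/3)
        total += f1 * f1 + f2 * f2 + f3 * f3 - f1 * f2 - f1 * f3 - f2 * f3
    return total

print(period3_power_from_counts("ATG" * 20))  # 1200: each base is locked to one codon position
print(period3_power_from_counts("AAAAAA"))    # 0: the base is spread evenly over the positions
```

The two toy inputs illustrate the point made in the excerpts: the magnitude grows with the imbalance of each nucleotide across the three positions and vanishes when the distribution is even.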

Journal Article•DOI•

Evolution of long-range fractal correlations and 1/f noise in DNA base sequences.

Richard F. Voss¹•Institutions (1)

IBM¹

22 Jun 1992-Physical Review Letters

TL;DR: Spectral density measurements of individual base positions demonstrate the ubiquity of low-frequency 1/$f^{\beta}$ noise and long-range fractal correlations as well as prominent short-range periodicities.

Abstract: A new method of quantifying correlations in symbolic sequences is applied to DNA nucleotides. Spectral density measurements of individual base positions demonstrate the ubiquity of low-frequency 1/$f^{\beta}$ noise and long-range fractal correlations as well as prominent short-range periodicities. Ensemble averages over classifications in the GenBank databank (primate, invertebrate, plant, etc.) show systematic changes in spectral exponent $\beta$ with evolutionary category.

848 citations

"Prediction of protein coding region..." refers background in this paper

...A symbolic DNA sequence, denoted as $x(0), x(1), \ldots, x(N-1)$, is first converted to four binary indicator sequences, $u_A(n)$, $u_T(n)$, $u_C(n)$, and $u_G(n)$, which indicate the presence or absence of four nucleotides, A, T, C, and G, at the nth position, respectively (Voss, 1992; Tiwari et al., 1997; Anastassiou, 2000)....
[...]
...Tiwari et al. (1997) explored the measure of spectral content (SC) in DNA sequences based on the fact that the 3-base periodicity, identified as a pronounced peak at the frequency N/3 of the Fourier power spectrum of the DNA sequences (N is the length of the DNA sequence), is prevalent in most protein coding regions, but does not exist in noncoding regions (Tsonis et al., 1991; Voss, 1992; Chechetkin and Turygin, 1995; Dodin et al., 2000)....
[...]
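
The excerpts above describe converting a symbolic sequence into four binary indicator sequences and looking for a peak at frequency N/3 in the Fourier power spectrum. The sketch below illustrates that pipeline with a naive O(N²) DFT from the standard library. The function names and the strictly periodic toy sequence are assumptions for illustration only; real coding regions show a much weaker peak over a noisy background.

```python
import cmath

def indicator_sequences(seq):
    """Map a DNA string to four binary indicator sequences u_A, u_T, u_C, u_G."""
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in "ATCG"}

def power_spectrum(seq):
    """Return PS(k), the sum over bases of |U_x(k)|^2, for k = 0..N-1."""
    n_total = len(seq)
    u = indicator_sequences(seq)
    spectrum = []
    for k in range(n_total):
        total = 0.0
        for base in "ATCG":
            # Naive DFT coefficient of the indicator sequence at frequency index k.
            coeff = sum(u[base][n] * cmath.exp(-2j * cmath.pi * k * n / n_total)
                        for n in range(n_total))
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum

toy = "ATG" * 20  # hypothetical toy sequence, N = 60
ps = power_spectrum(toy)
# Skip k = 0 (the DC term) and search up to N/2, since the spectrum is symmetric.
peak_k = max(range(1, len(toy) // 2 + 1), key=lambda k: ps[k])
print(peak_k)  # 20, i.e. N/3
```

For sequences of realistic length, a fast Fourier transform would replace the double loop; the naive form is kept here only so that the definition stays visible.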

Journal Article•DOI•

Evaluation of gene structure prediction programs.

Moisès Burset, Roderic Guigó¹•Institutions (1)

University of Barcelona¹

15 Jun 1996-Genomics

TL;DR: The results indicated that the predictive accuracy of the programs analyzed was lower than originally found, which indicates that the programs are overly dependent on the particularities of the examples they learn from.

749 citations

"Prediction of protein coding region..." refers background in this paper

...N/3 of the Fourier power spectrum of the DNA sequences (N is the length of the DNA sequence), is prevalent in most protein coding regions, but does not exist in noncoding regions (Tsonis et al., 1991; Voss, 1992; Chechetkin and Turygin, 1995; Dodin et al., 2000)....
[...]

Journal Article•DOI•

Current methods of gene prediction, their strengths and weaknesses

Catherine Mathé, Marie-France Sagot, Thomas Schiex, Pierre Rouzé

01 Oct 2002-Nucleic Acids Research

TL;DR: The existing approaches to predicting genes in eukaryotic genomes are reviewed and their intrinsic advantages and limitations are highlighted, showing that improvements are needed and that new directions must be considered.

Abstract: While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.

478 citations

"Prediction of protein coding region..." refers background in this paper

...During the last two decades, a variety of computational algorithms have been developed to predict exons (for reviews, Fickett and Tung, 1992; Fickett, 1996; Zhang, 2002; Mathé et al., 2002)....
[...]