Journal ArticleDOI

Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence.

21 Aug 2007-Journal of Theoretical Biology (J Theor Biol)-Vol. 247, Iss: 4, pp 687-694
TL;DR: A new method to predict protein coding regions is developed based on the fact that most exon sequences have a 3-base periodicity, while intron sequences do not have this feature.
About: This article was published in the Journal of Theoretical Biology on 2007-08-21. It has received 169 citations to date. The article focuses on the topic: Sequence analysis.
Citations
Journal ArticleDOI
TL;DR: This work provides an accessible introduction and comparative review of DSP methods for the identification of protein-coding regions by breaking down the approaches into four steps, and suggests new combinations that may be worthy of future study.
Abstract: The identification of regions of DNA sequences that code for proteins is one of the most fundamental applications in bioinformatics. These protein-coding regions are in contrast to other DNA regions that encode functional RNA molecules, provide structural stability of chromosomes, serve as genetic raw materials, represent molecular fossils, or have no known purpose (sometimes called “junk DNA”). A number of approaches have been suggested for differentiating between the protein-coding and non-protein-coding regions of DNA. A selection of these approaches is based on digital signal processing (DSP) techniques. These DSP techniques rely on the phenomenon that protein-coding regions have a prominent power spectrum peak at frequency f = ⅓ arising from the length of codons (three nucleic acids). This article partitions the identification of protein-coding regions into four discrete steps. Based on this partitioning, DSP techniques can be easily described and compared based on their unique implementatio...

75 citations


Cites background or methods from "Prediction of protein coding region..."

  • ...Yin and Yau (2007) also used the SNR....

    [...]

  • ...Other techniques used other tools, but the goal is the same which is analyzing the 3-base periodicity of DNA sequences to differentiate between coding and non-coding regions (Yin and Yau, 2007; Mena-Chalco et al., 2008; Ma and Zhu, 2007; Kahumani et al., 2008)....

    [...]

  • ...In general, compared with EPND (Yin and Yau, 2007), the threshold in this technique is more accurate since it is calculated based on the sequence to be predicted....

    [...]

  • ...Yin and Yau (2007) used the nucleotide distributions to compute PS(N/3) of a DNA sequence accumulatively....

    [...]

  • ...Other DSP-based methods that measure the 3-base periodicity without computing the DFT sometimes do not use a sliding window in the analysis of DNA sequences (Yin and Yau, 2007; Mena-Chalco et al., 2008)....

    [...]
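
As a rough illustration of the accumulative computation mentioned in the excerpts above, the sketch below tracks the power at frequency 1/3 over growing prefixes of a sequence, using running nucleotide counts at the three codon positions. The function name `cumulative_ps_n3`, the handling of non-A/T/C/G symbols, and the toy sequence are assumptions made for illustration; this is not the cited authors' implementation.

```python
import cmath

# Complex cube root of unity that appears in the DFT at frequency 1/3.
OMEGA = cmath.exp(-2j * cmath.pi / 3)

def cumulative_ps_n3(seq):
    """Return the power at frequency 1/3 for every prefix of seq."""
    # counts[base][j] = occurrences of base at positions n with n % 3 == j
    counts = {base: [0, 0, 0] for base in "ATCG"}
    values = []
    for n, base in enumerate(seq.upper()):
        if base in counts:  # assumption: ambiguous symbols are skipped
            counts[base][n % 3] += 1
        power = 0.0
        for f0, f1, f2 in counts.values():
            power += abs(f0 + f1 * OMEGA + f2 * OMEGA ** 2) ** 2
        values.append(power)
    return values

# Toy, perfectly periodic sequence (hypothetical input).
profile = cumulative_ps_n3("ATGATG")
```

Because only twelve counters are updated per base, the whole profile costs one pass over the sequence, with no sliding-window DFT.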

Journal ArticleDOI
TL;DR: This work proposes a new alignment-free similarity measure of DNA sequences using the Discrete Fourier Transform (DFT), and assesses the accuracy of the similarity metric in hierarchical clustering using simulated DNA and virus sequences.

72 citations

Journal ArticleDOI
TL;DR: Experimental results on various datasets show that the proposed clustering method provides an efficient tool to classify genes and genomes and is remarkably faster than other multiple sequence alignment and alignment-free methods.

71 citations


Cites methods from "Prediction of protein coding region..."

  • ...patterns of that sequence, and it has been applied to identify protein coding regions in genomic sequences (Fukushima et al., 2002; Yin and Yau, 2005, 2007)....

    [...]

Journal ArticleDOI
TL;DR: A new DSP based model is derived that directly relates the identification of the period-3 component to the detection of nucleotide bias in the codon structure, and completely characterizes the DNA spectrum by a set of numerical sequences termed the filtered polyphase sequences.
Abstract: The detection of different forms of periodicities in DNA sequences has been an active area of research in recent years. Most of the signal processing based methods have primarily focussed on assigning numerical values to the symbolic DNA sequence and then applying spectral analysis tools such as the short-time discrete Fourier transform (ST-DFT) to locate these repeats. A key application of DNA periodicity finding has been in the identification of the protein coding regions in DNA sequences by tracking the so-called period-3 component using the DNA spectrum. The main problem with this gene detection approach is that it is successful for certain genes but does not work for others. An interesting open research problem is to therefore determine the underlying reasons behind this disparity in performance. This requires, in turn, a solid understanding of the working principles of the period-3 component and the DNA spectrum. In this paper, we present a DSP-based approach that provides a complete analysis of this phenomenon. Specifically, we derive a new DSP based model that 1) clearly explains the underlying mechanism of the period-3 component, 2) directly relates the identification of the period-3 component to the detection of nucleotide bias in the codon structure, and 3) completely characterizes the DNA spectrum by a set of numerical sequences termed the filtered polyphase sequences. Furthermore, by adhering to the specific structure of the derived model, we can show that standard signal processing tools such as digital filtering can substantially enhance the detection of the codon bias. Several performance measures of DNA periodicity detection are also proposed and experimental results are provided to illustrate the key findings of our work.

65 citations


Cites background from "Prediction of protein coding region..."

  • ...[39] where the authors also study the relationship between the relative abundance of the nucleotides and the period-3 property....

    [...]

Journal ArticleDOI
01 Oct 2016-Genomics
TL;DR: This research proposes to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship and gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega.

61 citations


Cites methods from "Prediction of protein coding region..."

  • ...With this characteristic, DFT has been used in numerous DNA researches, such as gene prediction [22], protein coding region [23], and periodicity analysis [24]....

    [...]

References
Journal ArticleDOI
TL;DR: A general probabilistic model of the gene structure of human genomic sequences which incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions is introduced.

3,709 citations


"Prediction of protein coding region..." refers methods in this paper

  • ...The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences....

    [...]

  • ...As examples, GenScan algorithm (Burge and Karlin, 1997) measured distinct statistics features of exons and introns within genomes and employed them in prediction via hidden Markov model (HMM); MZFF method (Zhang, 1997) was developed for predicting protein coding regions using quadratic discriminant…...

    [...]

Journal ArticleDOI
TL;DR: The test has been thoroughly proven on 400,000 bases of sequence data: it misclassifies 5% of the regions tested and gives an answer of "No Opinion" one fifth of the time.
Abstract: We give a test for protein coding regions which is based on simple and universal differences between protein-coding and noncoding DNA. The test is simple enough to use without a computer and is completely objective. The test has been thoroughly proven on 400,000 bases of sequence data: it misclassifies 5% of the regions tested and gives an answer of "No Opinion" one fifth of the time. We predict some new coding and noncoding regions in published sequences.

875 citations


"Prediction of protein coding region..." refers background in this paper

  • ...Keywords: Exon; Intron; 3-Base periodicity; Fourier transform...

    [...]

  • ...N/3 at coding regions were addressed by Fickett (Fickett, 1982; Fickett and Tung, 1992)....

    [...]

  • ...It was demonstrated that the 3-base periodicity in a DNA sequence is partly caused by the unbalanced nucleotide distributions in the three coding positions in the sequence (Fickett, 1982; Fickett and Tung, 1992; Tiwari et al., 1997; Yin and Yau, 2005)....

    [...]

  • ...The 3-base periodicity magnitude and background noise can be directly computed from the nucleotide distributions (Fickett and Tung, 1992; Yin and Yau, 2005)....

    [...]

  • ...During the last two decades, a variety of computational algorithms have been developed to predict exons (for reviews, Fickett and Tung, 1992; Fickett, 1996; Zhang, 2002; Mathé et al., 2002)....

    [...]
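
The excerpt above, stating that the periodicity magnitude can be computed directly from nucleotide distributions, can be sketched as follows. The identity used is |a + bω + cω²|² = a² + b² + c² − ab − ac − bc for ω = e^(−2πi/3), where a, b, c are the counts of one nucleotide at the three codon positions. The names `position_counts`, `ps_n3_from_counts`, and `signal_to_noise` are hypothetical. Dividing by the sequence length is one plausible noise normalization (by Parseval's theorem the mean total power over all N frequencies equals N for a pure A/T/C/G sequence); it is an assumption, not necessarily the cited papers' definition.

```python
def position_counts(seq):
    """Count each nucleotide at the three codon positions (n % 3)."""
    counts = {base: [0, 0, 0] for base in "ATCG"}
    for n, base in enumerate(seq.upper()):
        if base in counts:  # assumption: ambiguous symbols are skipped
            counts[base][n % 3] += 1
    return counts

def ps_n3_from_counts(counts):
    """Power at frequency 1/3 computed from the counts alone, no DFT."""
    total = 0
    for a, b, c in counts.values():
        total += a * a + b * b + c * c - a * b - a * c - b * c
    return total

def signal_to_noise(seq):
    """Power at frequency 1/3 divided by the mean power (assumed to be N)."""
    return ps_n3_from_counts(position_counts(seq)) / len(seq)

# Toy inputs (hypothetical): a perfectly periodic sequence and a short mixed one.
periodic_power = ps_n3_from_counts(position_counts("ATG" * 30))
mixed_power = ps_n3_from_counts(position_counts("ATGAAA"))
periodic_snr = signal_to_noise("ATG" * 30)
```

The integer arithmetic keeps the result exact, which is convenient when comparing segments against a threshold.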

Journal ArticleDOI
Richard F. Voss
TL;DR: Spectral density measurements of individual base positions demonstrate the ubiquity of low-frequency $1/f^{\beta}$ noise and long-range fractal correlations as well as prominent short-range periodicities.
Abstract: A new method of quantifying correlations in symbolic sequences is applied to DNA nucleotides. Spectral density measurements of individual base positions demonstrate the ubiquity of low-frequency $1/f^{\beta}$ noise and long-range fractal correlations as well as prominent short-range periodicities. Ensemble averages over classifications in the GenBank databank (primate, invertebrate, plant, etc.) show systematic changes in spectral exponent $\beta$ with evolutionary category.

848 citations


"Prediction of protein coding region..." refers background in this paper

  • ...A symbolic DNA sequence, denoted as x(0), x(1), ..., x(N − 1), is first converted to four binary indicator sequences, u_A(n), u_T(n), u_C(n), and u_G(n), which indicate the presence or absence of the four nucleotides A, T, C, and G at the nth position, respectively (Voss, 1992; Tiwari et al., 1997; Anastassiou, 2000)....

    [...]

  • ...Tiwari et al. (1997) explored the measure of spectral content (SC) in DNA sequences based on the fact that the 3-base periodicity, identified as a pronounced peak at the frequency N/3 of the Fourier power spectrum of the DNA sequences (N is the length of the DNA sequence), is prevalent in most protein coding regions, but does not exist in noncoding regions (Tsonis et al., 1991; Voss, 1992; Chechetkin and Turygin, 1995; Dodin et al., 2000)....

    [...]
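
The indicator-sequence construction and the spectral peak described in the excerpts above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names `indicator_sequences` and `power_spectrum` and the toy sequence are assumptions, and a practical implementation would more likely use an FFT.

```python
import cmath

NUCLEOTIDES = "ATCG"

def indicator_sequences(seq):
    """Map a DNA string to four 0/1 sequences, one per nucleotide."""
    seq = seq.upper()
    return {x: [1 if base == x else 0 for base in seq] for x in NUCLEOTIDES}

def power_spectrum(seq, k):
    """Sum over the four nucleotides of |U_x(k)|^2, the DFT power at index k."""
    n_total = len(seq)
    total = 0.0
    for u in indicator_sequences(seq).values():
        coeff = sum(
            value * cmath.exp(-2j * cmath.pi * k * n / n_total)
            for n, value in enumerate(u)
        )
        total += abs(coeff) ** 2
    return total

# Toy, perfectly periodic sequence of length N = 90 (hypothetical input).
toy = "ATG" * 30
peak = power_spectrum(toy, len(toy) // 3)  # index N/3
off_peak = power_spectrum(toy, 7)          # an arbitrary other index
```

For this artificial sequence all of the non-zero-frequency power sits at index N/3; real coding regions show a weaker peak over a noisy background.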

Journal ArticleDOI
15 Jun 1996-Genomics
TL;DR: The results indicated that the predictive accuracy of the programs analyzed was lower than originally found, which indicates that the programs are overly dependent on the particularities of the examples they learn from.

749 citations


"Prediction of protein coding region..." refers background in this paper

  • ...N/3 of the Fourier power spectrum of the DNA sequences (N is the length of the DNA sequence), is prevalent in most protein coding regions, but does not exist in noncoding regions (Tsonis et al., 1991; Voss, 1992; Chechetkin and Turygin, 1995; Dodin et al., 2000)....

    [...]

Journal ArticleDOI
TL;DR: The existing approaches to predicting genes in eukaryotic genomes are reviewed and their intrinsic advantages and limitations are highlighted, showing that improvements are needed and that new directions must be considered.
Abstract: While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.

478 citations


"Prediction of protein coding region..." refers background in this paper

  • ...During the last two decades, a variety of computational algorithms have been developed to predict exons (for reviews, Fickett and Tung, 1992; Fickett, 1996; Zhang, 2002; Mathé et al., 2002)....

    [...]
