Journal ArticleDOI

Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats.

03 Nov 2008-BMC Bioinformatics (BioMed Central)-Vol. 9, Iss: 1, pp 466-466
TL;DR: DFT provides a robust detection method for higher order periodicity that tolerates monomer insertions and deletions, random sequence insertions, etc.
Abstract: Background: Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been used extensively to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats.
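
As a rough illustration of how a DFT power spectrum exposes a periodicity in a symbolic sequence, the sketch below sums the spectral power of four binary indicator sequences, one per nucleotide. The indicator mapping and the helper names `power_spectrum` and `dominant_period` are assumptions made for this example; the abstract does not say which numerical mapping the paper uses.

```python
import cmath

def power_spectrum(seq):
    """Total DFT power S(k) for k = 1..N//2, summed over the four base indicators."""
    n = len(seq)
    spectrum = []
    for k in range(1, n // 2 + 1):
        total = 0.0
        for base in "ACGT":
            # Indicator sequence: 1 where `base` occurs, 0 elsewhere,
            # so only the positions holding `base` contribute to the sum.
            coeff = sum(cmath.exp(-2j * cmath.pi * k * j / n)
                        for j, s in enumerate(seq) if s == base)
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum

def dominant_period(seq):
    """Period N/k of the strongest spectral peak (hypothetical helper)."""
    spectrum = power_spectrum(seq)
    k_peak = max(range(len(spectrum)), key=spectrum.__getitem__) + 1
    return len(seq) / k_peak

# An exact 3-base repeat concentrates its power at frequency index k = N/3.
print(dominant_period("ACG" * 20))
```

This direct evaluation costs O(N^2) and is only meant for short toy sequences; an FFT would be the natural choice at genomic scale.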


Citations
BookDOI
14 Dec 2009
TL;DR: "Data Mining Techniques for the Life Sciences" seeks to aid students and researchers in the life sciences who wish to get a condensed introduction into the vital world of biological databases and their many applications.
Abstract: Whereas getting exact data about living systems and sophisticated experimental procedures have primarily absorbed the minds of researchers previously, the development of high-throughput technologies has caused the weight to increasingly shift to the problem of interpreting accumulated data in terms of biological function and biomolecular mechanisms. In "Data Mining Techniques for the Life Sciences", experts in the field contribute valuable information about the sources of information and the techniques used for "mining" new insights out of databases. Beginning with a section covering the concepts and structures of important groups of databases for biomolecular mechanism research, the book then continues with sections on formal methods for analyzing biomolecular data and reviews of concepts for analyzing biomolecular sequence data in context with other experimental results that can be mapped onto genomes. As a volume of the highly successful Methods in Molecular Biology series, this work provides the kind of detailed description and implementation advice that is crucial for getting optimal results. Authoritative and easy to reference, "Data Mining Techniques for the Life Sciences" seeks to aid students and researchers in the life sciences who wish to get a condensed introduction into the vital world of biological databases and their many applications.

135 citations


Cites background or methods from "Hierarchical structure of cascade o..."

  • ...More recently, new technologies of third and fourth generation sequencing [5] such as single cell molecule [6], nanopore-based [7] have been applied to whole-transcriptome analysis that opened a possibility for profiling rare or heterogeneous populations of cells....

    [...]

  • ...The frequency of stabilizing and destabilizing mutations in all single mutants [5] showed that most of the mutational experiments have been carried out with hydrophobic substitutions (replacement of one hydrophobic residue with another, e....

    [...]

  • ...The stability data for a set of 180 double mutants have been collected from ProTherm database [3, 5] and related them with sequence based features such as wild-type residue, mutant residue, and three neighboring residues on both directions of the mutant site....

    [...]

  • ...org/) [5], with the aim of coordinating and synchronizing the curation effort of all the participants and to offer a unified, freely available, consistently annotated and nonredundant molecular interaction dataset....

    [...]

  • ...The data come in three different formats: old-style PDB-format files, macromolecular Crystallographic Information File (mmCIF) format [5], and a XMLstyle format called PDBML/XML [6]....

    [...]

Journal ArticleDOI
TL;DR: A review of the literature on statistical long-range correlation in DNA sequences can be found in this paper, where the authors conclude that a mixture of many length scales (including some relatively long ones) is responsible for the observed 1/f-like spectral component.
Abstract: In this paper, we review the literature on statistical long-range correlation in DNA sequences. We examine the current evidence for these correlations, and conclude that a mixture of many length scales (including some relatively long ones) in DNA sequences is responsible for the observed 1/f-like spectral component. We note the complexity of the correlation structure in DNA sequences. The observed complexity often makes it hard, or impossible, to decompose the sequence into a few statistically stationary regions. We suggest that, based on the complexity of DNA sequences, a fruitful approach to understand long-range correlation is to model duplication, and other rearrangement processes, in DNA sequences. One model, called "expansion-modification system", contains only point duplication and point mutation. Though simplistic, this model is able to generate sequences with 1/f spectra. We emphasize the importance of DNA duplication in its contribution to the observed long-range correlation in DNA sequences.

130 citations

01 Aug 2001
TL;DR: Spectral analyses indicate that measure representations of complete genomes, considered as time series, exhibit strong long-range correlation; the paper also discusses the multifractal property of the measure representation and the classification of bacteria.
Abstract: This paper introduces the notion of measure representation of DNA sequences. Spectral analysis and multifractal analysis are then performed on the measure representations of a large number of complete genomes. The main aim of this paper is to discuss the multifractal property of the measure representation and the classification of bacteria. From the measure representations and the values of the D_q spectra and related C_q curves, it is concluded that these complete genomes are not random sequences. In fact, spectral analyses performed indicate that these measure representations, considered as time series, exhibit strong long-range correlation. Here the long-range correlation is for the K-strings with dictionary ordering, and it is different from the base pair correlations introduced by other people. For substrings with length K=8, the D_q spectra of all organisms studied are multifractal-like and sufficiently smooth for the C_q curves to be meaningful. With the decreasing value of K, the multifractality lessens. The C_q curves of all bacteria resemble a classical phase transition at a critical point. But the "analogous" phase transitions of chromosomes of nonbacteria organisms are different. Apart from chromosome 1 of C. elegans, they exhibit the shape of double-peaked specific heat function. A classification of genomes of bacteria by assigning to each sequence a point in two-dimensional space (D_{-1}, D_1) and in three-dimensional space (D_{-1}, D_1, D_{-2}) was given. Bacteria that are close phylogenetically are almost close in the spaces (D_{-1}, D_1) and (D_{-1}, D_1, D_{-2}).
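
A minimal sketch of what a measure representation over K-strings in dictionary order might look like is given below. The abstract only mentions "K-strings with dictionary ordering", so the concrete choices here are assumptions: overlapping K-strings, the ordering A < C < G < T, normalization by the number of K-strings, and the function name `measure_representation`.

```python
from collections import Counter
from itertools import product

def measure_representation(seq, k):
    """Frequencies of all 4**k K-strings, listed in dictionary order (A < C < G < T)."""
    total = len(seq) - k + 1  # number of overlapping K-strings in seq
    counts = Counter(seq[i:i + k] for i in range(total))
    # product() enumerates the K-strings in lexicographic order of "ACGT".
    words = ("".join(p) for p in product("ACGT", repeat=k))
    # Entry i is the mass assigned to the subinterval [i / 4**k, (i + 1) / 4**k).
    return [counts[w] / total for w in words]

measure = measure_representation("ACGTAC", 2)
print(len(measure), measure[1])  # 16 entries; index 1 corresponds to "AC"
```

The resulting list can be read as a histogram on the unit interval, or as a series indexed by K-string rank, which is the form a spectral analysis would operate on. The sketch assumes the input contains only the letters A, C, G and T.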

102 citations

Journal ArticleDOI
TL;DR: This work presents several case studies of the GRM algorithm, which uses a complete K-string ensemble to map a symbolic DNA sequence directly into the frequency domain, so that repeats are identified as peaks in the GRM diagram.
Abstract: The main feature of global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. The efficacy is due to the use of complete set of a K-string ensemble which enables a new method of direct mapping of symbolic DNA sequence into frequency domain, with straightforward identification of repeats as peaks in GRM diagram. In this way, we obtain very fast, efficient and highly automatized repeat finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use, in order to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of Period 3 pattern in exons, implementation of 'magnifying glass' effect, identification of complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. GRM algorithm is convenient for use, in particular, in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes).

27 citations


Cites result from "Hierarchical structure of cascade o..."

  • ...These GRM results are in accordance with the pattern of previous results obtained by using heuristic algorithms (96)....

    [...]

Journal ArticleDOI
01 Sep 2011-Genomics
TL;DR: The comparison with available experimental data indicates that promoters with the most pronounced periodicities may be related to the supercoiling-sensitive genes.

23 citations


Cites background from "Hierarchical structure of cascade o..."

  • ...This sum is invariant with respect to complementary inversion of a sequence [34] and is more convenient for comparing the helical periodicities in the complete genome and in the promoter sequences (in the latter case, the promoters on two chains were always compiled as 5′–3′ sequences)....

    [...]

  • ...through sets of equidistant peaks [31, 34, 35]....

    [...]

References
Journal ArticleDOI
TL;DR: A new approach to the identification of tandem repeats in DNA sequences is proposed: a refinement of a previously considered method based on the complex periodicity transform, obtained by mapping DNA symbols to pure quaternions.
Abstract: Motivation: One of the main tasks of DNA sequence analysis is identification of repetitive patterns. DNA symbol repetitions play a key role in a number of applications, including prediction of gene and exon locations, identification of diseases, reconstruction of human evolutionary history and DNA forensics. Results: A new approach towards identification of tandem repeats in DNA sequences is proposed. The approach is a refinement of previously considered method, based on the complex periodicity transform. The refinement is obtained, among others, by mapping of DNA symbols to pure quaternions. This mapping results in an enhanced, symbol-balanced sensitivity of the transform to DNA patterns, and an unambiguous threshold selection criterion. Computational efficiency of the transform is further improved, and coupling of the computation with the period value is removed, thereby facilitating parallel implementation of the algorithm. Additionally, a post-processing stage is inserted into the algorithm, enabling unambiguous display of results in a convenient graphical format. Comparison of the quaternionic periodicity transform with two well-known pattern detection techniques shows that the new approach is competitive with these two techniques in detection of exact and approximate repeats. Supplementary information: Supplementary data are available at Bioinformatics online.

37 citations


"Hierarchical structure of cascade o..." refers methods in this paper

  • ...Alternatively, mapping of DNA symbolic sequence into a set of quaternions could be utilized via the use of quaternionic Fourier transform [31]....

    [...]

  • ...Different computational techniques have been used: Fourier spectral analysis [4-20], wavelet transform [21], DNA walk analysis [22-25], information theory measures [26-28], informational decomposition [29,30], quaternionic periodicity transform [31], exactly periodic subspace decomposition [32,33], portrait method [34], enhance algorithm for distance frequency distribution [35], etc....

    [...]

Journal ArticleDOI
TL;DR: Using the Key String Algorithm (KSA) to analyze Build 35.1 assembly, consensus alpha satellite higher-order repeats (HOR) and consensus distributions of CENP-B box and pJα motif in human chromosomes 1, 4, 5, 7, 8, 10, 11, 17, 19, and X are determined.
Abstract: Using our Key String Algorithm (KSA) to analyze Build 35.1 assembly we determined consensus alpha satellite higher-order repeats (HOR) and consensus distributions of CENP-B box and pJα motif in human chromosomes 1, 4, 5, 7, 8, 10, 11, 17, 19, and X. We determined new suprachromosomal family (SF) assignments: SF5 for 13mer (2211 bp), SF5 for 13mer (2214 bp), SF2 for 11mer (1869 bp), SF1 for 18mer (3058 bp), SF3 for 12mer (2047 bp), SF3 for 14mer (2379 bp), and SF5 for 17mer (2896 bp) in chromosomes 4, 5, 8, 10, 11, 17, and 19, respectively. In chromosome 5 we identified SF5 13mer without any CENP-B box and pJα motif, highly homologous (96%) to 13mer in chromosome 19. Additionally, in chromosome 19 we identified new SF5 17mer with one CENP-B box and pJα motif, aligned to 13mer by deleting four monomers. In chromosome 11 we identified SF3 12mer, homologous to 12mer in chromosome X. In chromosome 10 we identified new SF1 18mer with eight CENP-B boxes in every other monomer (except one). In chromosome 4 we identified new SF5 13mer with CENP-B box in three consecutive monomers. We found four exceptions to the rule that CENP-B box belongs to type B and pJα motif to type A monomers.

37 citations

Journal ArticleDOI
TL;DR: It is demonstrated here that the long-range correlations are trivially equivalent to the varying ratio R between pyrimidines and purines (or any other nucleotide combinations) in different regions of a DNA sequence.
Abstract: The occurrence of certain long-range correlations between nucleotides in DNA sequences of living organisms has recently been reported. The biological origin of these correlations was unknown. The correlations were proposed to be concerned with fractal structure and differences between intron-containing and intron-less sequences. We and others have reported that no consistent difference exists between intron-containing and intron-less sequences. In agreement with this, we demonstrate here that the long-range correlations are trivially equivalent to the varying ratio R between pyrimidines and purines (or any other nucleotide combinations) in different regions of a DNA sequence. Moreover, we show that this variation of R has simple biological explanations: Differences in base composition occur along most DNA sequences and are associated with (i) simple repeats (ii) differences in codon composition (due to the amino acid composition in the encoded protein), (iii) change of the direction of transcription (and thus also translation), and (iv) differences between protein- and rRNA-encoding segments. Seven biological examples are given.

36 citations

Journal ArticleDOI
TL;DR: It is reported that, through the use of alternative encodings of the DNA sequence in the complex plane, the number of FFTs performed can be traded off against (i) signal-to-noise ratio, and (ii) a certain degree of filtering for local similarity via k-tuple correlation.
Abstract: The detection of similarities between DNA sequences can be accomplished using the signal-processing technique of cross-correlation. An early method used the fast Fourier transform (FFT) to perform correlations on DNA sequences in O(n log n) time for any length sequence. However, this method requires many FFTs (nine), runs no faster if one sequence is much shorter than the other, and measures only global similarity, so that significant short local matches may be missed. We report that, through the use of alternative encodings of the DNA sequence in the complex plane, the number of FFTs performed can be traded off against (i) signal-to-noise ratio, and (ii) a certain degree of filtering for local similarity via k-tuple correlation. Also, when comparing probe sequences against much longer targets, the algorithm can be sped up by decomposing the target and performing multiple small FFTs in an overlap-save arrangement. Finally, by decomposing the probe sequence as well, the detection of local similarities can be further enhanced. With current advances in extremely fast hardware implementations of signal-processing operations, this approach may prove more practical than heretofore.
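
The basic idea of FFT-based cross-correlation for sequence comparison can be sketched as follows. This is a plain version that uses one binary indicator sequence per base and three transforms per base; it does not reproduce the alternative complex-plane encodings, the k-tuple filtering, or the overlap-save decomposition described in the abstract. The functions `fft` and `match_counts` are written for this example only.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def match_counts(target, probe):
    """Entry s = number of matching letters when probe is placed at offset s of target."""
    size = 1
    while size < len(target) + len(probe):  # zero-pad to avoid circular wrap-around
        size *= 2
    totals = [0.0] * size
    for base in "ACGT":
        a = [1.0 if c == base else 0.0 for c in target] + [0.0] * (size - len(target))
        b = [1.0 if c == base else 0.0 for c in probe] + [0.0] * (size - len(probe))
        fa, fb = fft(a), fft(b)
        # Correlation theorem: corr = IFFT(FFT(a) * conj(FFT(b)))
        corr = fft([u * v.conjugate() for u, v in zip(fa, fb)], inverse=True)
        for s in range(size):
            totals[s] += corr[s].real / size  # divide by size to normalize the inverse
    return [round(t) for t in totals[:len(target)]]

counts = match_counts("GGACGTGG", "ACGT")
print(counts.index(max(counts)))  # offset with the most matching letters
```

All offsets are scored in O(n log n) time, which is the advantage over comparing the probe against every offset directly.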

35 citations


"Hierarchical structure of cascade o..." refers methods in this paper

  • ...Spectral analysis employing Discrete Fourier transform is used to reveal periodicity in symbolic sequences, like genomic and protein sequences [7,9,14,16,17,20,36-53], to investigate long-range correlations [4,5,54,55] and to study the problem of sequence similarity [14,56-62]....

    [...]

Journal ArticleDOI
TL;DR: Novel methods are given which enable the detection of clusters of matching letters, facilitate the insertion of gaps to enhance sequence similarity, and accommodate to varying densities of letters in the input sequences.
Abstract: Novel methods are discussed for using fast Fourier transforms for DNA or protein sequence comparison. These methods are also intended as a contribution to the more general computer science problem of text search. These methods extend the capabilities of previous FFT methods and show that these methods are capable of considerable refinement. In particular, novel methods are given which (1) enable the detection of clusters of matching letters, (2) facilitate the insertion of gaps to enhance sequence similarity, and (3) accommodate to varying densities of letters in the input sequences. These methods use Fourier analysis in two distinct ways. (1) Fast Fourier transforms are used to facilitate rapid computation. (2) Fourier expansions are used to form an 'image' of the sequence comparison.

32 citations


"Hierarchical structure of cascade o..." refers methods in this paper

  • ...Spectral analysis employing Discrete Fourier transform is used to reveal periodicity in symbolic sequences, like genomic and protein sequences [7,9,14,16,17,20,36-53], to investigate long-range correlations [4,5,54,55] and to study the problem of sequence similarity [14,56-62]....

    [...]