Journal ArticleDOI

Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats.

03 Nov 2008-BMC Bioinformatics (BioMed Central)-Vol. 9, Iss: 1, pp 466-466

TL;DR: DFT provides a robust method for detecting higher order periodicity; it tolerates monomer insertions and deletions, random sequence insertions, etc.

Abstract: Background: Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve prediction accuracy. The discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats.
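As a rough illustration of how a DFT can expose a primary periodicity in a DNA sequence, the sketch below sums the Fourier power of four binary indicator sequences (one per base) and reports the period of the strongest peak. The indicator mapping, the function names `power_spectrum` and `dominant_period`, and the toy sequence are assumptions made for this example; they are not taken from the paper, whose exact mapping and normalization are not given in this abstract.

```python
import cmath


def power_spectrum(seq):
    """Summed DFT power over the four base indicator sequences.

    Returns a list p where p[k] is the total power at frequency k/N
    for k = 1 .. N//2. Index 0 is unused and left at 0.0.
    """
    n = len(seq)
    spectrum = [0.0] * (n // 2 + 1)
    for k in range(1, n // 2 + 1):
        total = 0.0
        for base in "ACGT":
            # Indicator sequence: 1 where `base` occurs, 0 elsewhere,
            # so only the positions holding `base` contribute to the sum.
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * pos / n)
                for pos, symbol in enumerate(seq)
                if symbol == base
            )
            total += abs(coeff) ** 2
        spectrum[k] = total
    return spectrum


def dominant_period(seq):
    """Period (in bases) of the strongest spectral peak."""
    spectrum = power_spectrum(seq)
    k_best = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return len(seq) / k_best


# Toy input (hypothetical): a perfect tandem repeat with a 5-base unit.
toy = "ACGTT" * 20
period = dominant_period(toy)
```

For a perfectly periodic toy sequence the power concentrates at multiples of the fundamental frequency 1/5, and the strongest peak is the fundamental. This direct O(N^2) evaluation is only practical for short sequences; a real analysis would use an FFT.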


Citations

BookDOI
14 Dec 2009
TL;DR: "Data Mining Techniques for the Life Sciences" seeks to aid students and researchers in the life sciences who wish to get a condensed introduction into the vital world of biological databases and their many applications.
Abstract: Whereas getting exact data about living systems and sophisticated experimental procedures have primarily absorbed the minds of researchers previously, the development of high-throughput technologies has caused the weight to increasingly shift to the problem of interpreting accumulated data in terms of biological function and biomolecular mechanisms. In "Data Mining Techniques for the Life Sciences", experts in the field contribute valuable information about the sources of information and the techniques used for "mining" new insights out of databases. Beginning with a section covering the concepts and structures of important groups of databases for biomolecular mechanism research, the book then continues with sections on formal methods for analyzing biomolecular data and reviews of concepts for analyzing biomolecular sequence data in context with other experimental results that can be mapped onto genomes. As a volume of the highly successful Methods in Molecular Biology series, this work provides the kind of detailed description and implementation advice that is crucial for getting optimal results. Authoritative and easy to reference, "Data Mining Techniques for the Life Sciences" seeks to aid students and researchers in the life sciences who wish to get a condensed introduction into the vital world of biological databases and their many applications.

131 citations


Cites background or methods from "Hierarchical structure of cascade o..."

  • ...More recently, new technologies of third and fourth generation sequencing [5] such as single cell molecule [6], nanopore-based [7] have been applied to whole-transcriptome analysis that opened a possibility for profiling rare or heterogeneous populations of cells....

    [...]

  • ...The frequency of stabilizing and destabilizing mutations in all single mutants [5] showed that most of the mutational experiments have been carried out with hydrophobic substitutions (replacement of one hydrophobic residue with another, e....

    [...]

  • ...The stability data for a set of 180 double mutants have been collected from ProTherm database [3, 5] and related them with sequence based features such as wild-type residue, mutant residue, and three neighboring residues on both directions of the mutant site....

    [...]

  • ...org/) [5], with the aim of coordinating and synchronizing the curation effort of all the participants and to offer a unified, freely available, consistently annotated and nonredundant molecular interaction dataset....

    [...]

  • ...The data come in three different formats: old-style PDB-format files, macromolecular Crystallographic Information File (mmCIF) format [5], and an XML-style format called PDBML/XML [6]....

    [...]


Journal ArticleDOI
Abstract: In this paper, we review the literature on statistical long-range correlation in DNA sequences. We examine the current evidence for these correlations, and conclude that a mixture of many length scales (including some relatively long ones) in DNA sequences is responsible for the observed 1/f-like spectral component. We note the complexity of the correlation structure in DNA sequences. The observed complexity often makes it hard, or impossible, to decompose the sequence into a few statistically stationary regions. We suggest that, based on the complexity of DNA sequences, a fruitful approach to understand long-range correlation is to model duplication, and other rearrangement processes, in DNA sequences. One model, called ``expansion-modification system", contains only point duplication and point mutation. Though simplistic, this model is able to generate sequences with 1/f spectra. We emphasize the importance of DNA duplication in its contribution to the observed long-range correlation in DNA sequences.

130 citations


01 Aug 2001
TL;DR: Spectral analyses indicate that measure representations of complete genomes, considered as time series, exhibit strong long-range correlation; the paper also discusses the multifractal property of the measure representation and the classification of bacteria.
Abstract: This paper introduces the notion of measure representation of DNA sequences. Spectral analysis and multifractal analysis are then performed on the measure representations of a large number of complete genomes. The main aim of this paper is to discuss the multifractal property of the measure representation and the classification of bacteria. From the measure representations and the values of the D_q spectra and related C_q curves, it is concluded that these complete genomes are not random sequences. In fact, spectral analyses performed indicate that these measure representations, considered as time series, exhibit strong long-range correlation. Here the long-range correlation is for the K-strings with dictionary ordering, and it is different from the base pair correlations introduced by other people. For substrings with length K=8, the D_q spectra of all organisms studied are multifractal-like and sufficiently smooth for the C_q curves to be meaningful. With the decreasing value of K, the multifractality lessens. The C_q curves of all bacteria resemble a classical phase transition at a critical point. But the "analogous" phase transitions of chromosomes of nonbacteria organisms are different. Apart from chromosome 1 of C. elegans, they exhibit the shape of double-peaked specific heat function. A classification of genomes of bacteria by assigning to each sequence a point in two-dimensional space (D_{-1}, D_1) and in three-dimensional space (D_{-1}, D_1, D_{-2}) was given. Bacteria that are close phylogenetically are almost close in the spaces (D_{-1}, D_1) and (D_{-1}, D_1, D_{-2}).

101 citations


Journal ArticleDOI
TL;DR: This work presents several case studies of the GRM algorithm, which uses a complete set of a K-string ensemble to map a symbolic DNA sequence directly into the frequency domain, so that repeats are identified as peaks in the GRM diagram.
Abstract: The main feature of the global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. The efficacy is due to the use of a complete set of a K-string ensemble, which enables a new method of direct mapping of a symbolic DNA sequence into the frequency domain, with straightforward identification of repeats as peaks in the GRM diagram. In this way, we obtain a very fast, efficient and highly automatized repeat finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use, in order to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of Period 3 pattern in exons, implementation of the 'magnifying glass' effect, identification of complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. The GRM algorithm is convenient for use, in particular, in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes).

24 citations


Cites result from "Hierarchical structure of cascade o..."

  • ...These GRM results are in accordance with the pattern of previous results obtained by using heuristic algorithms (96)....

    [...]


Journal ArticleDOI
01 Sep 2011-Genomics
TL;DR: The comparison with available experimental data indicates that promoters with the most pronounced periodicities may be related to the supercoiling-sensitive genes.
Abstract: We analyzed the periodic patterns in E. coli promoters and compared the distributions of the corresponding patterns in promoters and in the complete genome to elucidate their function. Except the three-base periodicity, coincident with that in the coding regions and growing stronger in the region downstream from the transcriptions start (TS), all other salient periodicities are peaked upstream of TS. We found that helical periodicities with the lengths about B-helix pitch ~ 10.2–10.5 bp and A-helix pitch ~ 10.8–11.1 bp coexist in the genomic sequences. We mapped the distributions of stretches with A-, B-, and Z-like DNA periodicities onto E. coli genome. All three periodicities tend to concentrate within non-coding regions when their intensity becomes stronger and prevail in the promoter sequences. The comparison with available experimental data indicates that promoters with the most pronounced periodicities may be related to the supercoiling-sensitive genes.

23 citations


Cites background from "Hierarchical structure of cascade o..."

  • ...This sum is invariant with respect to complementary inversion of a sequence [34] and is more convenient for comparing the helical periodicities in the complete genome and in the promoter sequences (in the latter case, the promoters on two chains were always compiled as 5′–3′ sequences)....

    [...]

  • ...through sets of equidistant peaks [31, 34, 35]....

    [...]


References


Journal ArticleDOI
12 Mar 1992-Nature
TL;DR: This work proposes a method for studying the stochastic properties of nucleotide sequences by constructing a 1:1 map of the nucleotide sequence onto a walk, which it refers to as a 'DNA walk', and uncovers a remarkably long-range power law correlation.
Abstract: DNA sequences have been analysed using models, such as an n-step Markov chain, that incorporate the possibility of short-range nucleotide correlations. We propose here a method for studying the stochastic properties of nucleotide sequences by constructing a 1:1 map of the nucleotide sequence onto a walk, which we term a 'DNA walk'. We then use the mapping to provide a quantitative measure of the correlation between nucleotides over long distances along the DNA chain. Thus we uncover in the nucleotide sequence a remarkably long-range power law correlation that implies a new scale-invariant property of DNA. We find such long-range correlations in intron-containing genes and in nontranscribed regulatory DNA sequences, but not in complementary DNA sequences or intron-less genes.
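A minimal sketch of a DNA walk follows, assuming the purine/pyrimidine step rule: +1 for a pyrimidine (C, T) and -1 for a purine (A, G). The abstract above does not state the step rule, so both the rule and the function name `dna_walk` are assumptions of this example.

```python
from itertools import accumulate

PYRIMIDINES = {"C", "T"}
PURINES = {"A", "G"}


def dna_walk(seq):
    """Cumulative displacement of a one-dimensional walk along `seq`.

    Assumed step rule: +1 for a pyrimidine, -1 for a purine.
    Element i of the result is the walker's position after base i.
    """
    steps = []
    for symbol in seq.upper():
        if symbol in PYRIMIDINES:
            steps.append(1)
        elif symbol in PURINES:
            steps.append(-1)
        else:
            # Ambiguity codes such as N have no defined step in this sketch.
            raise ValueError(f"unsupported symbol: {symbol!r}")
    return list(accumulate(steps))


# Toy input (hypothetical): two pyrimidines followed by three purines.
walk = dna_walk("CTAGG")
```

Long-range correlation is then assessed from how the fluctuation of the displacement grows with window length, which this sketch does not compute.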

1,247 citations


"Hierarchical structure of cascade o..." refers background or methods in this paper

  • ...Statistical studies of DNA sequences have been instigated by finding of the 1/f^β long-range power-law correlations in human genomic sequences, indicating the presence of scale invariant structure [4,5,22], implying that the underlying system shows fractal properties [25,76,77]....

    [...]

  • ...Different computational techniques have been used: Fourier spectral analysis [4-20], wavelet transform [21], DNA walk analysis [22-25], information theory measures [26-28], informational decomposition [29,30], quaternionic periodicity transform [31], exactly periodic subspace decomposition [32,33], portrait method [34], enhance algorithm for distance frequency distribution [35], etc....

    [...]

  • ...A single binary sequence was used by mapping genomic sequence into purine/pyrimidine representation [22], or into weak bond/strong bond representation [109]....

    [...]

  • ...A sharp peak of period three was found in a search for periodic regularities on a sample set of human exons [5,9,10,22,54,60,64]....

    [...]


Journal ArticleDOI
Ian Dunham, Nobuyoshi Shimizu, Bruce A. Roe, S. Chissoe, +220 more (15 institutions)
02 Dec 1999-Nature
TL;DR: The sequence of the euchromatic part of human chromosome 22 is reported, which consists of 12 contiguous segments spanning 33.4 megabases, contains at least 545 genes and 134 pseudogenes, and provides the first view of the complex chromosomal landscapes that will be found in the rest of the genome.
Abstract: Knowledge of the complete genomic DNA sequence of an organism allows a systematic approach to defining its genetic components. The genomic sequence provides access to the complete structures of all genes, including those without known function, their control elements, and, by inference, the proteins they encode, as well as all other biologically important sequences. Furthermore, the sequence is a rich and permanent source of information for the design of further biological studies of the organism and for the study of evolution through cross-species sequence comparison. The power of this approach has been amply demonstrated by the determination of the sequences of a number of microbial and model organisms. The next step is to obtain the complete sequence of the entire human genome. Here we report the sequence of the euchromatic part of human chromosome 22. The sequence obtained consists of 12 contiguous segments spanning 33.4 megabases, contains at least 545 genes and 134 pseudogenes, and provides the first view of the complex chromosomal landscapes that will be found in the rest of the genome.

1,064 citations


"Hierarchical structure of cascade o..." refers methods in this paper

  • ...The relative height of the corresponding peak in Fourier spectrum is a good discriminator of coding potential and has been used to detect coding regions [9,14,37,45,49,65-75]....

    [...]


Journal ArticleDOI
TL;DR: The test has been thoroughly proven on 400,000 bases of sequence data: it misclassifies 5% of the regions tested and gives an answer of "No Opinion" one fifth of the time.
Abstract: We give a test for protein coding regions which is based on simple and universal differences between protein-coding and noncoding DNA. The test is simple enough to use without a computer and is completely objective. The test has been thoroughly proven on 400,000 bases of sequence data: it misclassifies 5% of the regions tested and gives an answer of "No Opinion" one fifth of the time. We predict some new coding and noncoding regions in published sequences.

857 citations


"Hierarchical structure of cascade o..." refers methods or result in this paper

  • ...This is in accordance with previous conclusions that the period-3 feature is usually lacking or is weak in noncoding regions [7,9,37,39,41,66]....

    [...]

  • ...The relative height of the corresponding peak in Fourier spectrum is a good discriminator of coding potential and has been used to detect coding regions [9,14,37,45,49,65-75]....

    [...]


Journal ArticleDOI
24 May 1985-Science
TL;DR: This approach has revealed that the distribution of genes, integrated viral sequences, and interspersed repeats is highly nonuniform in the genome, and that the base composition and ratio of CpG to GpC in both coding and noncoding sequences, as well as codon usage, mainly depend on the GC content of the isochores harboring the sequences.
Abstract: Most of the nuclear genome of warm-blooded vertebrates is a mosaic of very long (much greater than 200 kilobases) DNA segments, the isochores; these isochores are fairly homogeneous in base composition and belong to a small number of major classes distinguished by differences in guanine-cytosine (GC) content. The families of DNA molecules derived from such classes can be separated and used to study the genome distribution of any sequence which can be probed. This approach has revealed (i) that the distribution of genes, integrated viral sequences, and interspersed repeats is highly nonuniform in the genome, and (ii) that the base composition and ratio of CpG to GpC in both coding and noncoding sequences, as well as codon usage, mainly depend on the GC content of the isochores harboring the sequences. The compositional compartmentalization of the genome of warm-blooded vertebrates is discussed with respect to its evolutionary origin, its causes, and its effects on chromosome structure and function.

843 citations


"Hierarchical structure of cascade o..." refers background in this paper

  • ...It has been pointed out that the mosaic structure of genome is presumably responsible for long-range correlations [79,85,86]....

    [...]