Journal Article

Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm

01 Jan 2013 - Nucleic Acids Research (Oxford University Press) - Vol. 41, Iss. 1
TL;DR: This work introduces the use of a complete K-string ensemble, which enables direct mapping of a symbolic DNA sequence into the frequency domain so that repeats appear as peaks in the GRM diagram, and presents several case studies of GRM use.
Abstract: The main feature of the global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. The efficacy is due to the use of a complete K-string ensemble, which enables a new method of direct mapping of a symbolic DNA sequence into the frequency domain, with straightforward identification of repeats as peaks in the GRM diagram. In this way, we obtain a very fast, efficient and highly automated repeat-finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use in order to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of the Period 3 pattern in exons, implementation of a ‘magnifying glass’ effect, identification of a complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. The GRM algorithm is particularly convenient for large repeat units, for highly mutated and/or complex repeats, and for global repeat maps of large genomic sequences (chromosomes and genomes).
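The idea of repeats showing up as peaks can be illustrated with a small sketch. It is not the authors' GRM implementation. It assumes, for illustration only, that the diagram can be approximated by a histogram of distances between consecutive occurrences of every K-string in the sequence, so that a repeat of period p contributes a peak at fragment length p. The function names `grm_like_histogram` and `top_peaks`, the default `k=8`, and the example repeat unit are all hypothetical.

```python
from collections import Counter


def grm_like_histogram(seq, k=8):
    """Tally distances between consecutive occurrences of each k-string.

    Illustrative approximation only: every k-string found in `seq` is
    used, and the distance from its previous occurrence to the current
    one is counted as a "fragment length".
    """
    last_seen = {}    # k-string -> start position of its latest occurrence
    hist = Counter()  # fragment length -> number of times observed
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            hist[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    return hist


def top_peaks(hist, n=3):
    """Return the n most frequent fragment lengths (candidate repeat periods)."""
    return [length for length, _ in hist.most_common(n)]


# Hypothetical 17-bp unit repeated in tandem 20 times.
unit = "ACGTTGCAAGCTTAGGC"
hist = grm_like_histogram(unit * 20, k=8)
print(top_peaks(hist, 1))  # [17]
```

Because every K-string in the sequence votes for a fragment length, a substitution or indel in one repeat copy disturbs only the few K-strings that overlap it, which is one plausible reading of the robustness claimed in the abstract.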
Citations
Journal Article
TL;DR: A novel computational pipeline (TAREAN) that circumvents the difficulty of assembling satellite DNA by detecting satellite repeats directly from unassembled short reads, using graph-based sequence clustering to identify groups of reads that represent repetitive elements.
Abstract: Satellite DNA is one of the major classes of repetitive DNA, characterized by tandemly arranged repeat copies that form contiguous arrays up to megabases in length. This type of genomic organization makes satellite DNA difficult to assemble, which hampers characterization of satellite sequences by computational analysis of genomic contigs. Here, we present tandem repeat analyzer (TAREAN), a novel computational pipeline that circumvents this problem by detecting satellite repeats directly from unassembled short reads. The pipeline first employs graph-based sequence clustering to identify groups of reads that represent repetitive elements. Putative satellite repeats are subsequently detected by the presence of circular structures in their cluster graphs. Consensus sequences of repeat monomers are then reconstructed from the most frequent k-mers obtained by decomposing read sequences from corresponding clusters. The pipeline performance was successfully validated by analyzing low-pass genome sequencing data from five plant species where satellite DNA was previously experimentally characterized. Moreover, novel satellite repeats were predicted for the genome of Vicia faba and three of these repeats were verified by detecting their sequences on metaphase chromosomes using fluorescence in situ hybridization.

181 citations


Cites background from "Direct mapping of symbolic DNA sequ..."

  • ...As reviewed by Glunčić and Paar (14), TRF is a representative of string matching algorithms, which are utilized in a number of computational tools for tandem repeat prediction, along with alternative approaches based on nucleotide autocorrelation functions (15,16) and Fourier transforms (17)....

    [...]

Book
14 Dec 2009
TL;DR: "Data Mining Techniques for the Life Sciences" seeks to aid students and researchers in the life sciences who wish to get a condensed introduction into the vital world of biological databases and their many applications.
Abstract: Whereas getting exact data about living systems and sophisticated experimental procedures have primarily absorbed the minds of researchers previously, the development of high-throughput technologies has caused the weight to increasingly shift to the problem of interpreting accumulated data in terms of biological function and biomolecular mechanisms. In "Data Mining Techniques for the Life Sciences", experts in the field contribute valuable information about the sources of information and the techniques used for "mining" new insights out of databases. Beginning with a section covering the concepts and structures of important groups of databases for biomolecular mechanism research, the book then continues with sections on formal methods for analyzing biomolecular data and reviews of concepts for analyzing biomolecular sequence data in context with other experimental results that can be mapped onto genomes. As a volume of the highly successful Methods in Molecular Biology series, this work provides the kind of detailed description and implementation advice that is crucial for getting optimal results. Authoritative and easy to reference, "Data Mining Techniques for the Life Sciences" seeks to aid students and researchers in the life sciences who wish to get a condensed introduction into the vital world of biological databases and their many applications.

135 citations


Cites background or methods from "Direct mapping of symbolic DNA sequ..."

  • ...Curated databases such as Reference Sequence Collection [14, 15] provide a curated/expert view by compilation and correction of the data....

    [...]

  • ...These residues are often functionally important such as in enzyme catalysis and ligand binding or involved in the formation of protein–protein complexes [15, 16]....

    [...]

  • ...UniProtKB [15] is the protein sequence reference database chosen by the majority of the interaction databases....

    [...]

  • ...nl/cgibin/emboss/skipredundant; [15]) and with the computer program cd-hit (http://weizhongli-lab....

    [...]

  • ...It has been effectively used for understanding the relationship between amino acid properties and stability of protein mutants based on their secondary structure and locations in protein structure [6], inverse hydrophobic effect on the stability of exposed/partially exposed coil mutants [7], the stability of mutant proteins based on empirical energy functions [8, 9], stability scale [10], contact potentials [11], neural networks [12], support vector machines [13, 14], relative importance of secondary structure and solvent accessibility [15], average assignment [16], Bayesian networks [17], distance and torsion potentials [18], decision trees [19], and physical force field with atomic modeling [20]....

    [...]

Journal Article
TL;DR: A review of the literature on statistical long-range correlation in DNA sequences can be found in this paper, where the authors conclude that a mixture of many length scales (including some relatively long ones) is responsible for the observed 1/f-like spectral component.
Abstract: In this paper, we review the literature on statistical long-range correlation in DNA sequences. We examine the current evidence for these correlations, and conclude that a mixture of many length scales (including some relatively long ones) in DNA sequences is responsible for the observed 1/f-like spectral component. We note the complexity of the correlation structure in DNA sequences. The observed complexity often makes it hard, or impossible, to decompose the sequence into a few statistically stationary regions. We suggest that, based on the complexity of DNA sequences, a fruitful approach to understand long-range correlation is to model duplication, and other rearrangement processes, in DNA sequences. One model, called "expansion-modification system", contains only point duplication and point mutation. Though simplistic, this model is able to generate sequences with 1/f spectra. We emphasize the importance of DNA duplication in its contribution to the observed long-range correlation in DNA sequences.

130 citations

Journal Article
TL;DR: Describes how advances in computational tools and sequencing technologies now enable genome-wide identification and quantification of satellite sequences, and how their applications are furthering knowledge of satellite evolution and function.

120 citations

Journal Article
TL;DR: Bulk RNA-sequencing is used to describe the unique genetic program of in vivo murine lung primordial progenitors and computationally identify signaling pathways that are involved in their cell-fate determination from pre-specified embryonic foregut.
Abstract: Multipotent Nkx2-1-positive lung epithelial primordial progenitors of the foregut endoderm are thought to be the developmental precursors to all adult lung epithelial lineages. However, little is known about the global transcriptomic programs or gene networks that regulate these gateway progenitors in vivo. Here we use bulk RNA-sequencing to describe the unique genetic program of in vivo murine lung primordial progenitors and computationally identify signaling pathways, such as Wnt and Tgf-β superfamily pathways, that are involved in their cell-fate determination from pre-specified embryonic foregut. We integrate this information in computational models to generate in vitro engineered lung primordial progenitors from mouse pluripotent stem cells, improving the fidelity of the resulting cells through unbiased, easy-to-interpret similarity scores and modulation of cell culture conditions, including substratum elastic modulus and extracellular matrix composition. The methodology proposed here can have wide applicability to the in vitro derivation of bona fide tissue progenitors of all germ layers.

41 citations

References
Journal Article
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations


"Direct mapping of symbolic DNA sequ..." refers background or methods in this paper

  • ...Using BLAST (75) we have checked that there are no other sub-sequences in NT_011903....

    [...]

  • ...In the standard string matching algorithms, like TRF (30) and BLAST (75), the optimal strings for analysis are determined locally by using statistical methods, so the chosen strings can differ from segment to segment within genomic sequence....

    [...]

  • ...Some ideas incorporated in TRF have been used in earlier homology detection program BLAST (75), but the goals and methods differ (30)....

    [...]

Journal Article
Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum, +245 more (29 institutions)
15 Feb 2001 - Nature
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations

Journal Article
TL;DR: A new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size is presented and its ability to detect tandem repeats that have undergone extensive mutational change is demonstrated.
Abstract: A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm’s speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human β T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.

6,577 citations

Journal Article
TL;DR: The synthesis of enzymes in bacteria follows a double genetic control, which appears to operate directly at the level of the synthesis by the gene of a short-lived intermediate, or messenger, which becomes associated with the ribosomes where protein synthesis takes place.

5,588 citations


"Direct mapping of symbolic DNA sequ..." refers background in this paper

  • ...(1) In the case of dispersed repeats, the fragment length is equal to a distance between start positions of dispersed repeat copies....

    [...]

  • ...The distance between the starts of the first and of the second Alu element is 282+DA(1), where DA(1) denotes the length of poly-A tail in the first Alu element (on the left hand side)....

    [...]

  • ...The lengths of A-tail in the first and second Alu element are denoted DA(1) and DA(2), respectively....

    [...]

  • ...Positions of duplicated sub-regions (in megabases): (1) 0....

    [...]

  • ...The distance between the starts of the first and second Alu element is 282+DA(1)....

    [...]

Journal Article
TL;DR: The rapidly advancing field of long ncRNAs is reviewed, describing their conservation, their organization in the genome and their roles in gene regulation, and the medical implications.
Abstract: In mammals and other eukaryotes most of the genome is transcribed in a developmentally regulated manner to produce large numbers of long non-coding RNAs (ncRNAs). Here we review the rapidly advancing field of long ncRNAs, describing their conservation, their organization in the genome and their roles in gene regulation. We also consider the medical implications, and the emerging recognition that any transcript, regardless of coding potential, can have an intrinsic function as an RNA.

4,911 citations