Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm

doi:10.1093/NAR/GKS721

Journal Article•DOI•

Matko Glunčić¹, Vladimir Paar¹

University of Zagreb¹

01 Jan 2013-Nucleic Acids Research (Oxford University Press)-Vol. 41, Iss: 1

TL;DR: This work introduces a complete K-string ensemble that enables direct mapping of a symbolic DNA sequence into the frequency domain, so that repeats are identified as peaks in the GRM diagram, and presents several case studies of GRM use.

Abstract: The main feature of the global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. This efficacy is due to the use of a complete K-string ensemble, which enables a new method of direct mapping of a symbolic DNA sequence into the frequency domain, with straightforward identification of repeats as peaks in the GRM diagram. In this way, we obtain a very fast, efficient and highly automated repeat-finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of the Period 3 pattern in exons, implementation of the ‘magnifying glass’ effect, identification of a complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. The GRM algorithm is particularly convenient in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes).
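
The central idea of the abstract, that repeats appear as peaks once the sequence is mapped through K-strings into a length (frequency) domain, can be illustrated with a minimal Python sketch. It is not the GRM implementation. It assumes that, for every K-string occurring in the sequence, one records the distances between consecutive occurrences and sums the counts over all K-strings. The names `kstring_distance_spectrum` and `top_peaks`, and the default `k=8`, are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def kstring_distance_spectrum(seq, k=8, max_len=None):
    """Count distances between consecutive occurrences of each K-string.

    Assumption (not taken from the paper): the spectrum is the sum, over
    all K-strings present in `seq`, of a histogram of the distances
    between neighbouring occurrences of that K-string.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    spectrum = Counter()
    for pos in positions.values():
        # Positions are appended in increasing order, so neighbours in
        # the list are consecutive occurrences along the sequence.
        for a, b in zip(pos, pos[1:]):
            distance = b - a
            if max_len is None or distance <= max_len:
                spectrum[distance] += 1
    return spectrum


def top_peaks(spectrum, n=3):
    """Return the n distances with the highest counts."""
    return [length for length, _ in spectrum.most_common(n)]


if __name__ == "__main__":
    unit = "ACGTTGCAAGCT"          # hypothetical 12 bp repeat unit
    sequence = unit * 20           # perfect tandem array
    print(top_peaks(kstring_distance_spectrum(sequence), n=1))
```

For a perfect tandem array with a 12 bp unit, every recorded distance is 12, so the spectrum has a single peak at that length. A substitution in one copy removes some K-string matches, which moves a few counts to multiples of 12 while the main peak remains.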

Citations

Journal Article•DOI•

TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads

Petr Novák, Laura Ávila Robledillo, Andrea Koblížková, Iva Vrbová, Pavel Neumann, Jiří Macas

07 Jul 2017-Nucleic Acids Research

TL;DR: A novel computational pipeline that circumvents the difficulty of assembling satellite DNA by detecting satellite repeats directly from unassembled short reads, using graph-based sequence clustering to identify groups of reads that represent repetitive elements.

Abstract: Satellite DNA is one of the major classes of repetitive DNA, characterized by tandemly arranged repeat copies that form contiguous arrays up to megabases in length. This type of genomic organization makes satellite DNA difficult to assemble, which hampers characterization of satellite sequences by computational analysis of genomic contigs. Here, we present tandem repeat analyzer (TAREAN), a novel computational pipeline that circumvents this problem by detecting satellite repeats directly from unassembled short reads. The pipeline first employs graph-based sequence clustering to identify groups of reads that represent repetitive elements. Putative satellite repeats are subsequently detected by the presence of circular structures in their cluster graphs. Consensus sequences of repeat monomers are then reconstructed from the most frequent k-mers obtained by decomposing read sequences from corresponding clusters. The pipeline performance was successfully validated by analyzing low-pass genome sequencing data from five plant species where satellite DNA was previously experimentally characterized. Moreover, novel satellite repeats were predicted for the genome of Vicia faba and three of these repeats were verified by detecting their sequences on metaphase chromosomes using fluorescence in situ hybridization.
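
The k-mer decomposition step mentioned in this abstract can be sketched as a simple count over the reads of one cluster. This only illustrates how the most frequent k-mers might be found; it is not TAREAN's consensus reconstruction, and `most_frequent_kmers` with its parameters is a hypothetical name introduced here.

```python
from collections import Counter


def most_frequent_kmers(reads, k, n=5):
    """Return the n most frequent k-mers across all reads as (kmer, count).

    Assumption: reads are plain strings over A/C/G/T and every
    overlapping window of length k is counted once.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    cluster_reads = ["ACGTACGT", "TTACGT"]   # toy reads, not real data
    print(most_frequent_kmers(cluster_reads, k=4, n=1))
```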

181 citations

Cites background from "Direct mapping of symbolic DNA sequ..."

...As reviewed by Glunčić and Paar (14), TRF is a representative of string matching algorithms, which are utilized in a number of computational tools for tandem repeat prediction, along with alternative approaches based on nucleotide autocorrelation functions (15,16) and Fourier transforms (17)....

Book•DOI•

Data Mining Techniques for the Life Sciences

Oliviero Carugo, Frank Eisenhaber

14 Dec 2009

TL;DR: "Data Mining Techniques for the Life Sciences" seeks to aid students and researchers in the life sciences who wish to get a condensed introduction into the vital world of biological databases and their many applications.

Abstract: Whereas getting exact data about living systems and sophisticated experimental procedures have primarily absorbed the minds of researchers previously, the development of high-throughput technologies has caused the weight to increasingly shift to the problem of interpreting accumulated data in terms of biological function and biomolecular mechanisms. In "Data Mining Techniques for the Life Sciences", experts in the field contribute valuable information about the sources of information and the techniques used for "mining" new insights out of databases. Beginning with a section covering the concepts and structures of important groups of databases for biomolecular mechanism research, the book then continues with sections on formal methods for analyzing biomolecular data and reviews of concepts for analyzing biomolecular sequence data in context with other experimental results that can be mapped onto genomes. As a volume of the highly successful Methods in Molecular Biology series, this work provides the kind of detailed description and implementation advice that is crucial for getting optimal results. Authoritative and easy to reference, "Data Mining Techniques for the Life Sciences" seeks to aid students and researchers in the life sciences who wish to get a condensed introduction into the vital world of biological databases and their many applications.

135 citations

Cites background or methods from "Direct mapping of symbolic DNA sequ..."

...Curated databases such as Reference Sequence Collection [14, 15] provide a curated/expert view by compilation and correction of the data....
[...]
...These residues are often functionally important such as in enzyme catalysis and ligand binding or involved in the formation of protein–protein complexes [15, 16]....
[...]
...UniProtKB [15] is the protein sequence reference database chosen by the majority of the interaction databases....
[...]
...nl/cgibin/emboss/skipredundant; [15]) and with the computer program cd-hit (http://weizhongli-lab....
[...]
...It has been effectively used for understanding the relationship between amino acid properties and stability of protein mutants based on their secondary structure and locations in protein structure [6], inverse hydrophobic effect on the stability of exposed/partially exposed coil mutants [7], the stability of mutant proteins based on empirical energy functions [8, 9], stability scale [10], contact potentials [11], neural networks [12], support vector machines [13, 14], relative importance of secondary structure and solvent accessibility [15], average assignment [16], Bayesian networks [17], distance and torsion potentials [18], decision trees [19], and physical force field with atomic modeling [20]....
[...]

Journal Article•DOI•

Understanding Long-range Correlations in DNA Sequences

Wentian Li¹, Thomas G. Marr¹, Kunihiko Kaneko²

Cold Spring Harbor Laboratory¹, University of Tokyo²

22 Mar 1994-arXiv: Chaotic Dynamics

TL;DR: A review of the literature on statistical long-range correlation in DNA sequences can be found in this paper, where the authors conclude that a mixture of many length scales (including some relatively long ones) is responsible for the observed 1/f-like spectral component.

Abstract: In this paper, we review the literature on statistical long-range correlation in DNA sequences. We examine the current evidence for these correlations, and conclude that a mixture of many length scales (including some relatively long ones) in DNA sequences is responsible for the observed 1/f-like spectral component. We note the complexity of the correlation structure in DNA sequences. The observed complexity often makes it hard, or impossible, to decompose the sequence into a few statistically stationary regions. We suggest that, based on the complexity of DNA sequences, a fruitful approach to understand long-range correlation is to model duplication, and other rearrangement processes, in DNA sequences. One model, called "expansion-modification system", contains only point duplication and point mutation. Though simplistic, this model is able to generate sequences with 1/f spectra. We emphasize the importance of DNA duplication in its contribution to the observed long-range correlation in DNA sequences.
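
A model with only point duplication and point mutation can be sketched in a few lines of Python. The sketch assumes a binary alphabet and a rule in which, at every step, each symbol is either flipped (point mutation, with probability `p_mutation`) or doubled (point duplication). The function name, the parameter names and the default probability are assumptions for illustration and are not taken from the paper.

```python
import random


def expansion_modification(steps, p_mutation=0.1, seed_symbol=0, rng=None):
    """Grow a binary sequence by repeated point duplication and mutation.

    Assumed rule: at each step every symbol s becomes (1 - s) with
    probability p_mutation, and (s, s) otherwise.
    """
    rng = rng or random.Random(0)   # fixed seed for reproducible runs
    seq = [seed_symbol]
    for _ in range(steps):
        next_seq = []
        for s in seq:
            if rng.random() < p_mutation:
                next_seq.append(1 - s)      # point mutation
            else:
                next_seq.extend((s, s))     # point duplication
        seq = next_seq
    return seq


if __name__ == "__main__":
    sequence = expansion_modification(steps=12, p_mutation=0.1)
    print(len(sequence), sequence[:40])
```

With `p_mutation=0` the sequence simply doubles at every step, and with `p_mutation=1` it stays one symbol long and alternates; intermediate values mix the two processes on many length scales.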

130 citations

Journal Article•DOI•

Satellite DNA evolution: old ideas, new approaches.

Sarah Sander Lower¹, Michael P McGurk¹, Andrew G. Clark¹, Daniel A. Barbash¹

Cornell University¹

23 Mar 2018-Current Opinion in Genetics & Development

TL;DR: Advances in computational tools and sequencing technologies now enable genome-wide identification and quantification of satellite sequences; this review describes how their application is furthering knowledge of satellite evolution and function.

120 citations

Journal Article•DOI•

The in vivo genetic program of murine primordial lung epithelial progenitors

Laertis Ikonomou¹,², Michael J. Herriges¹,², Sara L. Lewandowski¹,², Robert Marsland¹, Carlos Villacorta-Martin², Ignacio S. Caballero², David B. Frank³, Reeti M. Sanghrajka¹,², Keri Dame¹,², Maciej M. Kańduła¹, Julia Hicks-Berthet¹, Matthew L. Lawton¹,², Constantina Christodoulou¹, Attila J. Fabian⁴, Eric D. Kolaczyk¹, Xaralabos Varelas¹, Edward E. Morrisey⁵, John M. Shannon⁶, Pankaj Mehta¹, Darrell N. Kotton¹,²

Boston University¹, Boston Medical Center², Children's Hospital of Philadelphia³, Biogen Idec⁴, University of Pennsylvania⁵, Cincinnati Children's Hospital Medical Center⁶

31 Jan 2020-Nature Communications

TL;DR: Bulk RNA-sequencing is used to describe the unique genetic program of in vivo murine lung primordial progenitors and computationally identify signaling pathways that are involved in their cell-fate determination from pre-specified embryonic foregut.

Abstract: Multipotent Nkx2-1-positive lung epithelial primordial progenitors of the foregut endoderm are thought to be the developmental precursors to all adult lung epithelial lineages. However, little is known about the global transcriptomic programs or gene networks that regulate these gateway progenitors in vivo. Here we use bulk RNA-sequencing to describe the unique genetic program of in vivo murine lung primordial progenitors and computationally identify signaling pathways, such as Wnt and Tgf-β superfamily pathways, that are involved in their cell-fate determination from pre-specified embryonic foregut. We integrate this information in computational models to generate in vitro engineered lung primordial progenitors from mouse pluripotent stem cells, improving the fidelity of the resulting cells through unbiased, easy-to-interpret similarity scores and modulation of cell culture conditions, including substratum elastic modulus and extracellular matrix composition. The methodology proposed here can have wide applicability to the in vitro derivation of bona fide tissue progenitors of all germ layers.

41 citations

References

Journal Article•DOI•

Basic Local Alignment Search Tool

Stephen F. Altschul¹, Warren Gish¹, Webb Miller², Eugene W. Myers³, David J. Lipman¹

National Institutes of Health¹, Pennsylvania State University², University of Arizona³

01 Oct 1990-Journal of Molecular Biology

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

"Direct mapping of symbolic DNA sequ..." refers background or methods in this paper

...Using BLAST (75) we have checked that there are no other sub-sequences in NT_011903....
[...]
...In the standard string matching algorithms, like TRF (30) and BLAST (75), the optimal strings for analysis are determined locally by using statistical methods, so the chosen strings can differ from segment to segment within genomic sequence....
[...]
...Some ideas incorporated in TRF have been used in earlier homology detection program BLAST (75), but the goals and methods differ (30)....
[...]

Journal Article•DOI•

Initial sequencing and analysis of the human genome.

Eric S. Lander¹, Lauren Linton¹, Bruce W. Birren¹, Chad Nusbaum¹ +245 more

15 Feb 2001-Nature

TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.

Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations

Journal Article•DOI•

Tandem repeats finder: a program to analyze DNA sequences

Gary Benson¹

Icahn School of Medicine at Mount Sinai¹

01 Jan 1999-Nucleic Acids Research

TL;DR: A new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size is presented and its ability to detect tandem repeats that have undergone extensive mutational change is demonstrated.

Abstract: A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm’s speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human β T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.

6,577 citations

Journal Article•DOI•

Genetic regulatory mechanisms in the synthesis of proteins.

François Jacob¹, Jacques Monod¹

Pasteur Institute¹

01 Jun 1961-Journal of Molecular Biology

TL;DR: The synthesis of enzymes in bacteria follows a double genetic control, which appears to operate directly at the level of the synthesis, by the gene, of a short-lived intermediate, or messenger, which becomes associated with the ribosomes where protein synthesis takes place.

5,588 citations

"Direct mapping of symbolic DNA sequ..." refers background in this paper

...(1) In the case of dispersed repeats, the fragment length is equal to a distance between start positions of dispersed repeat copies....
[...]
...The distance between the starts of the first and of the second Alu element is 282+DA(1), where DA(1) denotes the length of poly-A tail in the first Alu element (on the left hand side)....
[...]
...The lengths of A-tail in the first and second Alu element are denoted DA(1) and DA(2), respectively....
[...]
...Positions of duplicated sub-regions (in megabases): (1) 0....
[...]
...The distance between the starts of the first and second Alu element is 282+DA(1)....
[...]

Journal Article•DOI•

Long non-coding RNAs: insights into functions

Tim R. Mercer¹, Marcel E. Dinger¹, John S. Mattick¹

University of Queensland¹

01 Mar 2009-Nature Reviews Genetics

TL;DR: The rapidly advancing field of long ncRNAs is reviewed, describing their conservation, their organization in the genome and their roles in gene regulation, and the medical implications.

Abstract: In mammals and other eukaryotes most of the genome is transcribed in a developmentally regulated manner to produce large numbers of long non-coding RNAs (ncRNAs). Here we review the rapidly advancing field of long ncRNAs, describing their conservation, their organization in the genome and their roles in gene regulation. We also consider the medical implications, and the emerging recognition that any transcript, regardless of coding potential, can have an intrinsic function as an RNA.

4,911 citations