Journal Article

An algorithm for approximate tandem repeats.

TLDR
This paper precisely defines approximate multiple repeats and presents an algorithm that finds all repeats matching that definition. When searching for repeats with up to k errors in a string S of length n, the algorithm runs in O(nka log(n/k)) time, where a is the maximum number of periods in any reported repeat.
Abstract
A perfect single tandem repeat is a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the two substrings are similar but not identical, e.g., abcdaacd. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k, our algorithm reports all locally optimal approximate repeats r = ūu for which the Hamming distance between ū and u is at most k in O(nk log(n/k)) time, or all those for which the edit distance between ū and u is at most k in O(nk log k log(n/k)) time. This paper concentrates on a more general type of repeat called a multiple tandem repeat. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u^a u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors: the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats and present an algorithm that finds all repeats that match our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log(n/k)), where a is the maximum number of periods in any reported repeat. We present experimental results on the performance and sensitivity of our algorithm. Finding repeats within a string is a computational problem with important applications in molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats are known to be related to human diseases.
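To make the definitions above concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the paper: it runs in cubic time in the worst case, covers only the Hamming-distance criterion, and does not filter for local optimality. The function names (`hamming`, `approx_tandem_repeats`, `has_period`) and the `min_period` parameter are hypothetical, introduced only for illustration.

```python
from typing import Iterator, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def approx_tandem_repeats(
    s: str, k: int, min_period: int = 1
) -> Iterator[Tuple[int, int, int]]:
    """Yield (start, period, mismatches) for every substring
    s[start:start + 2*period] whose two halves differ in at most k positions.

    Naive enumeration (an assumption of this sketch, not the paper's method):
    every period and every start position is tried.
    """
    n = len(s)
    for period in range(min_period, n // 2 + 1):
        for start in range(n - 2 * period + 1):
            left = s[start:start + period]
            right = s[start + period:start + 2 * period]
            d = hamming(left, right)
            if d <= k:
                yield start, period, d


def has_period(r: str, p: int) -> bool:
    """True if r has the exact form u^a u' with |u| = p, i.e. r[i] == r[i + p]
    for every valid i (u' is a possibly empty prefix of u)."""
    return 0 < p <= len(r) and all(r[i] == r[i + p] for i in range(len(r) - p))


if __name__ == "__main__":
    # The abstract's two examples: a perfect and an approximate repeat.
    print(list(approx_tandem_repeats("abcabc", 0)))
    print(list(approx_tandem_repeats("abcdaacd", 1, min_period=4)))
    print(has_period("abcabcab", 3))
```

With `k = 0` the enumeration reduces to exact tandem repeats. Raising `min_period` keeps short periods from dominating the output, since with `k >= 1` any two adjacent characters trivially qualify as a period-1 repeat.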


Citations
Journal Article

mreps: efficient and flexible detection of tandem repeats in DNA

TL;DR: mreps is a software tool for fast identification of tandemly repeated structures in DNA sequences; it can identify all types of repeat structures within a single run on a whole genomic sequence.
Journal Article

Linear time algorithms for finding and representing all the tandem repeats in a string

TL;DR: An O(|S|)-time algorithm that operates on the suffix tree T(S) for a string S, finding and marking the endpoint of every tandem repeat that occurs in S, improves and generalizes several prior efforts to efficiently capture large subsets of tandem repeats.
Book Chapter

Theoretical and practical improvements on the RMQ-Problem, with applications to LCA and LCE

TL;DR: This work presents a direct algorithm for the general RMQ-problem with linear preprocessing time and constant query time, without making use of any dynamic data structure, and consumes less than half of the space needed by the method by Berkman and Vishkin.
Journal Article

T-reks

TL;DR: A new program called T-REKS is developed, based on clustering of lengths between identical short strings by using a K-means algorithm, which opens the way for large-scale analysis of protein tandem repeats.
Journal Article

XSTREAM: A practical algorithm for identification and architecture modeling of tandem repeats in protein sequences

TL;DR: XSTREAM is a practical and valuable tool for TR detection in protein and nucleotide sequences at the multi-genome scale, and an effective tool for modeling TR domains with diverse architectures and varied levels of degeneracy.