An algorithm for approximate tandem repeats.

doi:10.1089/106652701300099038

Gad M. Landau, +2 more

- 01 Jan 2001 -

Journal of Computational Biology

- Vol. 8, Iss: 1, pp 1-18

TLDR

This paper precisely defines approximate multiple tandem repeats and presents an algorithm that finds all repeats matching the definition. When searching for repeats with up to k errors in a string S of length n, the algorithm runs in O(nka log(n/k)) time, where a is the maximum number of periods in any reported repeat.

Abstract:

A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k, our algorithm reports all locally optimal approximate repeats, r = ū u, for which the Hamming distance of ū and u is at most k, in O(nk log(n/k)) time, or all those for which the edit distance of ū and u is at most k, in O(nk log k log(n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u^a u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log(n/k)), where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to human diseases.
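The definitions in the abstract can be made concrete with a small brute-force sketch. This is not the paper's algorithm: it is cubic in the worst case rather than O(nk log(n/k)), it does not filter for local optimality, and it covers only the Hamming-distance case. The function names (`hamming`, `approx_single_repeats`, `is_multiple_repeat`) and the `min_period` parameter are hypothetical, chosen for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def approx_single_repeats(s: str, k: int, min_period: int = 1):
    """Brute force (illustrative only, not the paper's method).

    Yield (start, period, mismatches) for every substring
    s[start:start + 2*period] whose two halves differ in at most
    k positions. No local-optimality filtering is applied.
    """
    n = len(s)
    for period in range(min_period, n // 2 + 1):
        for start in range(n - 2 * period + 1):
            left = s[start:start + period]
            right = s[start + period:start + 2 * period]
            d = hamming(left, right)
            if d <= k:
                yield (start, period, d)


def is_multiple_repeat(r: str, period: int) -> bool:
    """True if r = u^a u' exactly, with |u| = period, a >= 2,
    and u' a (possibly empty) prefix of u."""
    if period <= 0 or len(r) < 2 * period:
        return False
    # r has this form iff every character equals the one `period` earlier.
    return all(r[i] == r[i - period] for i in range(period, len(r)))
```

With the abstract's own examples, `abcabc` is a perfect repeat of period 3, and `abcdaacd` is an approximate repeat of period 4 with one mismatch (`abcd` versus `aacd`). The string `abcabcab` has the multiple-repeat form with u = `abc`, a = 2, and u' = `ab`.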
