
Showing papers by "Costas S. Iliopoulos" published in 2013


Journal ArticleDOI
03 Jul 2013-Genomics
TL;DR: It is demonstrated that the CoVEC approach outperforms most individual methods and highlights the benefit of combining results from multiple tools.

92 citations


Posted Content
TL;DR: Order-preserving matching on numeric strings was introduced in this article, where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern.
Abstract: We introduce a new string matching problem called order-preserving matching on numeric strings where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern. Order-preserving matching is applicable to many scenarios such as stock price analysis and musical melody matching in which the order relations should be matched instead of the strings themselves. Solving order-preserving matching has to do with representations of order relations of a numeric string. We define prefix representation and nearest neighbor representation, which lead to efficient algorithms for order-preserving matching. We present efficient algorithms for single and multiple pattern cases. For the single pattern case, we give an O(n log m) time algorithm and optimize it further to obtain O(n + m log m) time. For the multiple pattern case, we give an O(n log m) time algorithm.
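
To make the matching criterion concrete, here is a naive baseline sketch. It is not the O(n log m) or O(n + m log m) algorithm from the abstract, and it does not use the prefix or nearest neighbor representations. It simply compares the dense ranks of every text window with those of the pattern. The function names and the treatment of ties (equal values share a rank) are assumptions made for illustration.

```python
def dense_ranks(values):
    """Map each value to its rank among the distinct values (ties share a rank)."""
    order = {v: r for r, v in enumerate(sorted(set(values)))}
    return tuple(order[v] for v in values)


def op_match_naive(text, pattern):
    """Return every start index where a window of `text` has the same
    relative order as `pattern`. Runs in O(n * m log m) time."""
    m = len(pattern)
    target = dense_ranks(pattern)
    return [i for i in range(len(text) - m + 1)
            if dense_ranks(text[i:i + m]) == target]


# Example: the shape "low, high, middle" occurs at positions 0 and 2.
print(op_match_naive([10, 20, 15, 30, 25, 40], [1, 3, 2]))  # [0, 2]
```

Two windows are order-isomorphic exactly when their dense-rank tuples are equal, which is why a plain tuple comparison suffices here.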

75 citations


Book ChapterDOI
07 Oct 2013
TL;DR: In this article, an O(n log log n) time algorithm is proposed to construct an index that enables order-preserving pattern matching queries in time proportional to the pattern length.
Abstract: Recently Kubica et al. (Inf. Process. Lett., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching: for a given text the goal is to find its factors having the same 'shape' as a given pattern. Known results include a linear-time algorithm for this problem (in case of polynomially-bounded alphabet) and a generalization to multiple patterns. We give an O(n log log n) time construction of an index that enables order-preserving pattern matching queries in time proportional to pattern length. The main component is a data structure being an incomplete suffix tree in the order-preserving setting. The tree can miss single letters related to branching at internal nodes. Such incompleteness results from the weakness of our so-called weak character oracle. However, due to its weakness, such an oracle can answer queries on-line in O(log log n) time using a sliding-window approach. For most of the applications such incomplete suffix trees provide the same functional power as the complete ones. We also give an $O(\frac{n\log{n}}{\log\log{n}})$ time algorithm constructing complete order-preserving suffix trees.

47 citations



Journal ArticleDOI
TL;DR: New, simple, easily-computed, and widely applicable notions of string covering that provide an intuitive and useful characterisation of a string are proposed: the enhanced cover; the enhanced left cover; and the enhanced left seed.

35 citations


Book ChapterDOI
10 Jul 2013
TL;DR: This work considers an index data structure for similar strings. The generalized suffix tree, a compacted trie representing all suffixes of two strings A and B, is one solution, but it does not exploit the similarity between the strings.
Abstract: We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings A and B is a compacted trie representing all suffixes in A and B. It has |A| + |B| leaves and can be constructed in O(|A| + |B|) time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of A and B.

27 citations


Journal ArticleDOI
TL;DR: In this article, a linear time algorithm for finding all Abelian periods in a string is presented. The algorithm is based on a reduction of the problem of all Abelian periods to that of (already solved) Abelian squares, which provides new insight into both connected problems.
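
As background for the reduction mentioned above, the following is a brute-force sketch that lists Abelian squares. It assumes the standard definition (a factor uv in which v is a permutation of u), which the summary itself does not spell out. It is not the algorithm of the paper, and the function name and output format are hypothetical.

```python
from collections import Counter


def abelian_squares_naive(s):
    """Return (start, half_length) for every factor s[start:start + 2h]
    whose second half is a permutation of its first half."""
    n = len(s)
    found = []
    for i in range(n):
        # h ranges over all half-lengths that still fit inside s.
        for h in range(1, (n - i) // 2 + 1):
            if Counter(s[i:i + h]) == Counter(s[i + h:i + 2 * h]):
                found.append((i, h))
    return found


print(abelian_squares_naive("abba"))  # [(0, 2), (1, 1)]
```

Comparing letter counts rather than the halves themselves is what distinguishes an Abelian square from an ordinary square.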

25 citations


Journal ArticleDOI
TL;DR: This work gives the first time-space optimal algorithm that computes the Longest Previous Factor array, given the Suffix Array and the Longest Common Prefix array.
Abstract: The Longest Previous Factor array gives, for each position i in a string y, the length of the longest factor (substring) of y that occurs both at i and to the left of i in y. The Longest Previous Factor array is central in many text compression techniques as well as in the most efficient algorithms for detecting motifs and repetitions occurring in a text. Computing the Longest Previous Factor array usually requires the Suffix Array and the Longest Common Prefix array. We give the first time-space optimal algorithm that computes the Longest Previous Factor array, given the Suffix Array and the Longest Common Prefix array. We also give the first linear-time algorithm that computes the permutation that applied to the Longest Common Prefix array produces the Longest Previous Factor array.
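
The definition of the array can be checked against a direct, quadratic-or-worse sketch. This is only a reference implementation of the definition, not the time-space optimal algorithm of the paper, and it uses neither the Suffix Array nor the Longest Common Prefix array. It assumes 0-based positions and allows the earlier occurrence to overlap position i; the function name is hypothetical.

```python
def lpf_naive(y):
    """lpf[i] = length of the longest factor starting at i that also
    starts at some position j < i (occurrences may overlap)."""
    n = len(y)
    lpf = [0] * n
    for i in range(n):
        for j in range(i):
            k = 0
            # Extend the common prefix of y[j:] and y[i:] letter by letter.
            while i + k < n and y[j + k] == y[i + k]:
                k += 1
            lpf[i] = max(lpf[i], k)
    return lpf


print(lpf_naive("abab"))  # [0, 0, 2, 1]
print(lpf_naive("aaaa"))  # [0, 3, 2, 1]
```

A slow reference like this is mainly useful for testing a faster implementation on short inputs.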

24 citations


Proceedings Article
01 Jan 2013
TL;DR: An algorithm for maximal palindromic factorization of a finite string is presented, obtained by adapting Gusfield's algorithm for detecting all occurrences of maximal palindromes in a string in time linear in the length of the string, and then using breadth-first search (BFS) to find the maximal palindromic factorization set.
Abstract: A palindrome is a symmetric string, phrase, number, or other sequence of units that reads the same forward and backward. We present an algorithm for maximal palindromic factorization of a finite string by adapting Gusfield's algorithm [15] for detecting all occurrences of maximal palindromes in a string in time linear in the length of the given string, and then using breadth-first search (BFS) to find the maximal palindromic factorization set. A factorization F of s with respect to S refers to a decomposition of s such that $s = s_{i_1} s_{i_2} \cdots s_{i_l}$, where $s_{i_j} \in S$ and $l$ is minimum. In this context the set S is referred to as the factorization set. In this paper, we tackle the following problem. Given a string s, find the maximal palindromic factorization of s, that is, a factorization of s where the factorization set is the set of all center-distinct maximal palindromes of s, denoted MP(s).
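
The two stages described in the abstract can be sketched as follows. One substitution is made and should be noted plainly: the maximal palindromes are found here by quadratic center expansion rather than by Gusfield's linear-time method. The BFS then runs over string positions, with an edge from i to j whenever s[i:j] is a maximal palindrome. All function names are hypothetical.

```python
from collections import deque


def maximal_palindromes(s):
    """Return (start, end) spans, end exclusive, of the longest nonempty
    palindrome at each of the 2n - 1 centers (naive center expansion)."""
    n = len(s)
    spans = []
    for center in range(2 * n - 1):
        lo, hi = center // 2, (center + 1) // 2  # lo == hi: letter center
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        if hi - lo - 1 > 0:
            spans.append((lo + 1, hi))
    return spans


def min_palindromic_factorization(s):
    """BFS over positions 0..n; returns a shortest list of maximal
    palindromes whose concatenation is s, or None if none exists."""
    n = len(s)
    ends_from = {}
    for start, end in maximal_palindromes(s):
        ends_from.setdefault(start, []).append(end)
    parent = {0: None}
    queue = deque([0])
    while queue:
        pos = queue.popleft()
        if pos == n:
            break
        for end in ends_from.get(pos, []):
            if end not in parent:
                parent[end] = pos
                queue.append(end)
    if n not in parent:
        return None
    factors = []
    pos = n
    while parent[pos] is not None:  # walk back from n to 0
        factors.append(s[parent[pos]:pos])
        pos = parent[pos]
    return factors[::-1]


print(min_palindromic_factorization("abaab"))  # ['a', 'baab']
```

Because every edge has unit cost, the first time BFS reaches position n it has used the minimum number of factors.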

17 citations


Journal ArticleDOI
TL;DR: In this paper, the shortest seed problem is solved in O(n log n/m) time, where m is the length of a seed and n is the number of prefixes in the string.

15 citations


DOI
01 Aug 2013
TL;DR: The principal aim of this journal is to promote the growth of computing science, to show its relation to practice and to stimulate applications of apposite formalisms to practical problems.
Abstract: This journal aims to publish contributions at the junction of theory and practice. The objective is to disseminate applicable research. Thus new theoretical contributions are welcome where they are motivated by potential application; applications of existing formalisms are of interest if they show something novel about the approach or application. The term "formal methods" has been applied to a range of notations, theories and tools. There is no doubt that some of these have already had a significant impact on practical applications of computing. Indeed, it is interesting to note that once something is adopted into practical use it is no longer thought of as a formal method. Apart from widely used notations such as those for syntax and state machines, there have been significant applications of specification notations, development methods and tools both for proving general results and for searching for specific conditions. However, the most profound and lasting influence of the formal approach is the way it has illuminated fundamental concepts like those of communication. In this spirit, the principal aim of this journal is to promote the growth of computing science, to show its relation to practice and to stimulate applications of apposite formalisms to practical problems. One significant challenge is to show how a range of formal models can be related to each other.

Journal ArticleDOI
TL;DR: By introducing the idea of equivalence classes in weighted sequences, this work identifies the tandem repeats of every possible length using an iterative partitioning technique and proves that the problem can be solved in O(n^2) time.
Abstract: A weighted biological sequence is a string in which a set of characters may appear at each position with respective probabilities of occurrence. We attempt to locate all the tandem repeats in a weighted sequence. A repeated substring is called a tandem repeat if each occurrence of the substring is directly adjacent to each other. By introducing the idea of equivalence classes in weighted sequences, we identify the tandem repeats of every possible length using an iterative partitioning technique. We also present the algorithm for recording the tandem repeats, and prove that the problem can be solved in O(n^2) time.

Journal ArticleDOI
13 Nov 2013
TL;DR: This work investigates the appearance of simpler monochromatic graphs such as stripes, stars and trees under a 2-colouring of the edges of a bipartite graph.
Abstract: The Ramsey number $R(m, n)$ is the smallest integer $p$ such that any blue-red colouring of the edges of the complete graph $K_p$ forces the appearance of a blue $K_m$ or a red $K_n$. Bipartite Ramsey problems deal with the same questions but the graph explored is the complete bipartite graph instead of the complete graph. We consider special cases of the bipartite Ramsey problem. More specifically we investigate the appearance of simpler monochromatic graphs such as stripes, stars and trees under a 2-colouring of the edges of a bipartite graph. We give the Ramsey numbers $R_b(mP_2, nP_2)$, $R_b(T_m, T_n)$, $R_b(S_m, nP_2)$, $R_b(T_m, nP_2)$ and $R_b(S_m, T_n)$.

Proceedings ArticleDOI
22 Sep 2013
TL;DR: Millions of pairwise sequence alignments, performed under realistic conditions based on the properties of real full-length genomes, show that GapsMis can increase the accuracy of extending short-read alignments end-to-end compared to more traditional approaches.
Abstract: Motivation: Recent developments in next-generation sequencing technologies have renewed interest in pairwise sequence alignment techniques, particularly so for the application of re-sequencing---the assembly of a genome directed by a reference sequence. After the fast alignment between a factor of the reference sequence and the high-quality fragment of a short read, an important problem is to find the best possible alignment between a succeeding factor of the reference sequence and the remaining low-quality part of the read; allowing a number of mismatches and the insertion of gaps in the alignment. Results: We present GapsMis, a tool for pairwise global and semi-global sequence alignment with a variable, but bounded, number of gaps. It is based on a new algorithm, which computes a different version of the traditional dynamic programming matrix. Millions of pairwise sequence alignments, performed under realistic conditions based on the properties of real full-length genomes, show that GapsMis can increase the accuracy of extending short-read alignments end-to-end compared to more traditional approaches. Availability: http://www.exelixis-lab.org/gapmis

Journal ArticleDOI
TL;DR: GapMis is based on a simple algorithm that computes a different version of the traditional dynamic programming matrix, and the presented experimental results demonstrate that it is more suitable and efficient than most popular tools for this task.
Abstract: Motivation: Pairwise sequence alignment has received a new motivation due to the advent of recent patents in next-generation sequencing technologies, particularly so for the application of re-sequencing---the assembly of a genome directed by a reference sequence. After the fast alignment between a factor of the reference sequence and a high-quality fragment of a short read by a short-read alignment programme, an important problem is to find the alignment between a relatively short succeeding factor of the reference sequence and the remaining low-quality part of the read, allowing a number of mismatches and the insertion of a single gap in the alignment. Results: We present GapMis, a tool for pairwise sequence alignment with a single gap. It is based on a simple algorithm, which computes a different version of the traditional dynamic programming matrix. The presented experimental results demonstrate that GapMis is more suitable and efficient than most popular tools for this task.

Posted Content
TL;DR: Building on a linear-time order-preserving pattern matching algorithm for polynomially-bounded alphabets and its extension to multiple patterns, this work gives a randomized algorithm for constructing suffix trees in the order-preserving setting and shows a number of applications of order-preserving suffix trees to identifying patterns and repetitions in time series.
Abstract: Recently Kubica et al. (Inf. Process. Lett., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for polynomially-bounded alphabet and an extension of this result to pattern matching with multiple patterns. We make one step forward in the analysis and give an $O(\frac{n\log{n}}{\log\log{n}})$ time randomized algorithm constructing suffix trees in the order-preserving setting. We show a number of applications of order-preserving suffix trees to identify patterns and repetitions in time series.

Book ChapterDOI
01 Jan 2013
TL;DR: Generic RAM and PRAM algorithms for factoring words over sets of strings known as circ-UMFFs are described, generalizations of the well-known Lyndon words based on lexorder, whose properties were first studied in 1958 by Chen, Fox and Lyndon.
Abstract: In this paper we describe algorithms for factoring words over sets of strings known as circ-UMFFs, generalizations of the well-known Lyndon words based on lexorder, whose properties were first studied in 1958 by Chen, Fox and Lyndon. In 1983 Duval designed an elegant linear-time sequential (RAM) Lyndon factorization algorithm; a corresponding parallel (PRAM) algorithm was described in 1994 by Daykin, Iliopoulos and Smyth. In 2003 Daykin and Daykin introduced various circ-UMFFs, including one based on V-words and V-ordering; in 2011 linear string comparison and sequential factorization algorithms based on V-order were given by Daykin, Daykin and Smyth. Here we first describe generic RAM and PRAM algorithms for factoring a word over any circ-UMFF; then we show how to customize these generic algorithms to yield optimal parallel Lyndon-like V-word factorization.
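
For reference, the classical sequential case mentioned in the abstract, Duval's linear-time Lyndon factorization under ordinary lexicographic order, can be sketched as below. This covers only the Lyndon baseline. It does not implement the generic circ-UMFF algorithms, V-order, or the PRAM variants the chapter describes, and the function name is an assumption.

```python
def lyndon_factorization(s):
    """Duval's algorithm: split s into a non-increasing sequence of
    Lyndon words in O(len(s)) time and O(1) extra space."""
    n = len(s)
    i = 0
    factors = []
    while i < n:
        j, k = i + 1, i
        # Grow the longest prefix of s[i:] that is a power of a Lyndon
        # word followed by a proper prefix of that word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i        # the whole prefix s[i..j] is a Lyndon word
            else:
                k += 1       # still repeating the current period
            j += 1
        period = j - k
        # Emit every complete copy of the Lyndon word found.
        while i <= k:
            factors.append(s[i:i + period])
            i += period
    return factors


print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```

The output is a sequence of Lyndon words in non-increasing lexicographic order, and this factorization is unique for every string.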

Posted Content
TL;DR: In this paper, the longest common compatible prefix (LCCP) problem for partial words is solved with $O(n\mu+n)$ preprocessing time and $O(1)$ query time, where $\mu$ is the number of blocks of holes, using ideas from alignment algorithms and dynamic programming.
Abstract: For a partial word $w$ the longest common compatible prefix of two positions $i,j$, denoted $lccp(i,j)$, is the largest $k$ such that $w[i,i+k-1]\uparrow w[j,j+k-1]$, where $\uparrow$ is the compatibility relation of partial words (it is not an equivalence relation). The LCCP problem is to preprocess a partial word in such a way that any query $lccp(i,j)$ about this word can be answered in $O(1)$ time. It is a natural generalization of the longest common prefix (LCP) problem for regular words, for which an $O(n)$ preprocessing time and $O(1)$ query time solution exists. Recently an efficient algorithm for this problem has been given by F. Blanchet-Sadri and J. Lazarow (LATA 2013). The preprocessing time was $O(nh+n)$, where $h$ is the number of "holes" in $w$. The algorithm was designed for partial words over a constant alphabet and was quite involved. We present a simple solution to this problem with slightly better runtime that works for any linearly-sortable alphabet. Our preprocessing is in time $O(n\mu+n)$, where $\mu$ is the number of blocks of holes in $w$. Our algorithm uses ideas from alignment algorithms and dynamic programming.
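
The query itself is easy to state in code. The sketch below precomputes every answer with a straightforward quadratic dynamic programme, so queries take constant time but preprocessing costs O(n^2) time and space. It is not the $O(n\mu+n)$ method of the paper. The hole symbol `?`, the 0-based indexing and the function name are assumptions for illustration.

```python
HOLE = "?"  # assumed marker for a hole in the partial word


def lccp_table(w):
    """table[i][j] = largest k such that w[i:i+k] and w[j:j+k] are
    compatible, i.e. agree wherever neither has a hole."""
    n = len(w)
    table = [[0] * (n + 1) for _ in range(n + 1)]
    # Fill from the end so that table[i + 1][j + 1] is already known.
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if w[i] == w[j] or w[i] == HOLE or w[j] == HOLE:
                table[i][j] = 1 + table[i + 1][j + 1]
    return table


t = lccp_table("ab?b")
print(t[0][2])  # 2: "ab" is compatible with "?b"
print(t[0][1])  # 0: "a" and "b" clash immediately
```

Because compatibility is not transitive, the usual suffix-structure shortcuts for ordinary LCP queries do not carry over directly, which is why the sketch falls back on a full table.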

Proceedings Article
01 Jan 2013
TL;DR: This paper proposes a static analysis approach using a text-based search technique, control flow graphs, hashing, and machine learning to cluster malware variants accordingly.
Abstract: Malware is computer software intended to harm computers and networks. Anti-virus companies receive large numbers of malware variants daily, so there is an essential need to cluster malware variants automatically into their corresponding families in order to reduce the effort and time spent on manual analysis. Because malware variants that belong to the same family share a certain amount of code, we classify them into the same cluster based on shared features that we extract from them. In this paper we propose a static analysis approach using a text-based search technique, control flow graphs, hashing, and machine learning to cluster malware variants accordingly. This is ongoing work, but we explain our methodology and the preliminary results achieved.

Proceedings ArticleDOI
09 May 2013
TL;DR: The experimental results show that the proposed system is successful for on-the-fly classification of web spambots and computer viruses, hence eliminating spam in web 2.0 applications and detecting infected files in computers.
Abstract: In this paper, we describe the use of REAL, an efficient read aligner for next-generation sequencing reads, to detect web spambots and viruses and to compare the results. Email spam, also known as junk email or unsolicited bulk email (UBE), is a subset of electronic spam involving nearly identical messages sent to numerous recipients by email. In the last decade or so, Web spam has emerged as a bigger problem than previously thought. It not only wastes resources and misleads people but can also trick search algorithms into granting unfair search result rankings, thereby decreasing the quality and reliability of the World Wide Web (WWW) and its content. The Internet brings a new dimension to the virus problem. Before, viruses generally spread from system to system on physical media, often the floppy disk. This is a fundamentally slow way for viruses to spread. The Internet changes all this. The viruses that really win in the Internet environment are the macro viruses. They are attached to data, not code, making them harder to avoid. An increasing number of documents on the Net are available as Word files, for example, with no alternative format, and Word documents are frequently exchanged via email. Our experimental results show that the proposed system is successful for on-the-fly classification of web spambots and computer viruses, hence eliminating spam in web 2.0 applications and detecting infected files in computers. Our comparison shows that viruses are slightly harder to detect owing to their complexity, especially when they use executable packing to evade antivirus engines.

Journal ArticleDOI
TL;DR: An asymptotically fast O(n + occ log occ) time algorithm, as well as a practical O(nk/w) time algorithm, for solving the extreme similarity sequencing problem.
Abstract: In this paper, we present a solution to the extreme similarity sequencing problem. The extreme similarity sequencing problem consists of finding occurrences of a pattern p in a set $S_0, S_1, \ldots, S_k$ of sequences of equal length, where $S_i$, for all $1 \le i \le k$, differs from $S_0$ by a constant number of errors (around 10 in practice). We present an asymptotically fast O(n + occ log occ) time algorithm, as well as a practical O(nk/w) time algorithm for solving this problem, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k, the number of sequences.

Journal ArticleDOI
01 Feb 2013-Genomics
TL;DR: Using RNA-seq data from two distinct developmental stages of the mouse cortex, embryonic day 18 (E18) and postnatal day 7 (P7), this work established for the first time a developmental-related transcriptome map of the mouse isochores and estimated the correlation between the isochores' GC level and their expression activity, and the genes' expression patterns for each isochore family.



Journal Article
TL;DR: A linear time algorithm is proposed for the identification of all overlapping factors of a word, the appearance of overlapping factors in Fibonacci words is investigated, and some bounds on the maximum number of distinct overlapping factors in a word are provided.
Abstract: The concept of quasiperiodicity is a generalization of the notion of periodicity where in contrast to periodicity the quasiperiods of a quasiperiodic string may overlap. A lot of research has been concentrated around algorithms for the computation of quasiperiodicities in strings while not much is known about bounds on their maximum number of occurrences in words. We study the overlapping factors of a word as a means to provide more insight into quasiperiodic structures of words. We propose a linear time algorithm for the identification of all overlapping factors of a word, we investigate the appearance of overlapping factors in Fibonacci words and we provide some bounds on the maximum number of distinct overlapping factors in a word.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed system is efficient and offers a novel way of detecting malware code embedded in different types of computer files: using bioinformatics tools, it detected malware consistently and accurately, and it completed the task at high speed without excessive memory usage.
Abstract: The Internet is a rich platform of information from which many people benefit, yet its users are still attacked by computer malware and various other threats that disrupt their normal workflow. In this paper, we give an overview of REAL, an efficient read aligner used for next-generation sequencing, and of its use as a tool to detect computer malware. Using this tool, we present a dynamic computer malware detection model that can detect malware and so prevent attacks that might damage or steal sensitive information. The model is inspired by REAL, which was designed for processing biological data. New anti-malware technologies are introduced around the clock, but new malware techniques emerge at the same time to misuse these technologies. Experimental results show that the proposed system is efficient and offers a novel way of detecting malware code embedded in different types of computer files: using bioinformatics tools, it detected malware consistently and accurately, and it completed the task at high speed without excessive memory usage.

Journal ArticleDOI
TL;DR: The tree pattern matching problem for unranked ordered trees is transformed to a string matching problem, by transforming the tree template and the subject tree to strings representing their postfix bar notation, and a table-driven algorithm is proposed to solve it.

Book ChapterDOI
13 Sep 2013
TL;DR: The detection of various types of repeats is a fundamental and well studied problem in stringology and extensions to this problem with applications to bioinformatics are presented.
Abstract: The detection of various types of repeats is a fundamental and well studied problem in stringology. In this paper we present extensions to this problem with applications to bioinformatics. We consider the detection of all exact and approximate inverted repeats, as well as all exact and approximate weighted inverted repeats, and give efficient algorithms for their computation.

Posted Content
TL;DR: In this article, a space/time-efficient suffix tree of alignment is proposed, which wisely exploits the similarity in an alignment of two similar strings and can be constructed in O(|A|+|B|) time.
Abstract: We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and $B$. It has $|A|+|B|$ leaves and can be constructed in $O(|A|+|B|)$ time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of $A$ and $B$. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of $A$ and $B$ has $|A| + l_d + l_1$ leaves where $l_d$ is the sum of the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the lengths of some common parts of $A$ and $B$. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern $P$ in $O(|P|+occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and $B$. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires $O(|A| + l_d + l_1 + l_2)$ time where $l_2$ is the sum of the lengths of other common substrings of $A$ and $B$. When the suffix tree of $A$ is already given, it requires $O(l_d + l_1 + l_2)$ time.

Journal ArticleDOI
TL;DR: In this article, the degree/diameter problem on trees was considered for Cayley trees, caterpillars, lobsters, banana trees and firecracker trees, as well as for tree-like structures such as pseudotrees.