Showing papers by "Costas S. Iliopoulos" published in 2014


Journal ArticleDOI
TL;DR: This work introduces a new string matching problem called order-preserving matching on numeric strings, where a pattern matches a text if the text contains a substring of values whose relative orders coincide with those of the pattern.
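
To make the matching condition concrete, the following is a minimal brute-force sketch, not the algorithm from the paper. It assumes that "relative orders coincide" means the two sequences have identical dense ranks (so ties must fall in the same places); the helper names `dense_ranks` and `order_preserving_matches` are illustrative only.

```python
def dense_ranks(values):
    """Map each value to its rank among the distinct values (0 = smallest)."""
    rank_of = {v: r for r, v in enumerate(sorted(set(values)))}
    return [rank_of[v] for v in values]


def order_preserving_matches(text, pattern):
    """Return start indices of windows of `text` whose relative order
    equals that of `pattern`. Naive: re-ranks every window from scratch."""
    m = len(pattern)
    target = dense_ranks(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if dense_ranks(text[i:i + m]) == target
    ]


if __name__ == "__main__":
    # Pattern shape: low, high, middle.
    print(order_preserving_matches([10, 20, 15, 30, 25, 40, 35], [1, 3, 2]))
```

Re-ranking every window costs O(m log m) per position, so this sketch only illustrates the definition and makes no claim about the bounds obtained in the paper.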

74 citations


Journal ArticleDOI
TL;DR: Lyndon words are used, and the Lyndon structure of runs is introduced, as a useful tool when computing powers; in problems related to periods, some versions of the Manhattan skyline problem are used.

68 citations


Journal ArticleDOI
TL;DR: A suboptimal average-case algorithm for exact circular string matching requiring time O(n), two fast average-case algorithms for approximate circular string matching with k-mismatches requiring time O(n) for moderate values of k, that is k = O(m/log m), and how the same results can be easily obtained under the edit distance model.
Abstract: Background: Circular string matching is a problem which naturally arises in many biological contexts. It consists in finding all occurrences of the rotations of a pattern of length m in a text of length n. There exist optimal average-case algorithms for exact circular string matching. Approximate circular string matching is a rather undeveloped area.

34 citations


Journal ArticleDOI
TL;DR: It is shown how many binary words have a shortest border of a given length by identifying relations with Dyck words, and some bounds on the number of abelian border-free words of a given length are given.

15 citations


Journal ArticleDOI
TL;DR: A novel variant of Crochemore’s partitioning algorithm for weighted sequences, which requires optimal O(n log n) time, is presented, thus improving on the best known O(n^2)-time algorithm for computing all repetitions in a weighted sequence of length n.
Abstract: Tandem duplication, in the context of molecular biology, occurs as a result of mutational events in which an original segment of DNA is converted into a sequence of individual copies. More formally, a repetition or tandem repeat in a string of letters consists of exact concatenations of identical factors of the string. Biologists are interested in approximate tandem repeats and not necessarily only in exact tandem repeats. A weighted sequence is a string in which a set of letters may occur at each position with respective probabilities of occurrence. It naturally arises in many biological contexts and provides a method to realise the approximation among distinct adjacent occurrences of the same DNA segment. Crochemore’s repetitions algorithm, also referred to as Crochemore’s partitioning algorithm, was introduced in 1981, and was the first optimal O(n log n)-time algorithm to compute all repetitions in a string of length n. In this article, we present a novel variant of Crochemore’s partitioning algorithm for weighted sequences, which requires optimal O(n log n) time, thus improving on the best known O(n^2)-time algorithm (Zhang et al., 2013) for computing all repetitions in a weighted sequence of length n.

13 citations


Proceedings ArticleDOI
16 Sep 2014
TL;DR: A systematic review of the current developments in assessing information credibility automatically in UGC platforms, focusing on microblogging service, covers different aspects from dataset collection and feature usage, through classification techniques, to performance evaluation.
Abstract: Due to their openness and low publishing barrier nature, User-Generated Content (UGC) platforms facilitate the creation of huge amounts of inaccurate content. Consequently, assessing UGC information credibility is developing into a vitally important research topic. This paper offers a systematic review of the current developments in assessing information credibility automatically in UGC platforms, focusing on microblogging service. It covers different aspects from dataset collection and feature usage, through classification techniques, to performance evaluation. A novel theoretical credibility model which integrates the evaluators' traits and context factors to assess information credibility is also presented along with important directions for future research on UGC information credibility.

11 citations


Proceedings ArticleDOI
20 Sep 2014
TL;DR: A fast filter-based algorithm for exact Circular Pattern Matching, which solves the problem of finding all occurrences of the rotations of a pattern P of length m in a text T of length n.
Abstract: The Exact Circular Pattern Matching (ECPM) problem consists in finding all occurrences of the rotations of a pattern P of length m in a text T of length n. In this paper we present a fast filter-based algorithm for this problem.
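
The problem statement can be illustrated with a naive baseline, which is not the filter-based algorithm of the paper. The sketch relies on the standard fact that a string of length m is a rotation of P exactly when it occurs as a substring of PP; the function name `circular_matches` is an assumption made for this example.

```python
def circular_matches(text, pattern):
    """Return start indices in `text` where some rotation of `pattern` occurs.

    Naive baseline: a window of length m is a rotation of `pattern`
    if and only if it is a substring of pattern + pattern.
    """
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    doubled = pattern + pattern
    return [
        i
        for i in range(len(text) - m + 1)
        if text[i:i + m] in doubled
    ]


if __name__ == "__main__":
    print(circular_matches("abcabca", "bca"))
```

Each window is checked independently, so the running time is far from what a filtering approach would aim for; the sketch is only meant to pin down what counts as an occurrence.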

9 citations


Journal ArticleDOI
TL;DR: Three new simple O(nlogn) time algorithms related to repeating factors and novel algorithmic solutions for several classical string problems which are much simpler than (usually quite sophisticated) linear time algorithms are presented.

8 citations


Book ChapterDOI
15 Oct 2014
TL;DR: This article introduces a new and simple data structure, the prefix table under Hamming distance, and presents two algorithms to compute it efficiently: one asymptotically fast; the other very fast on average and in practice.
Abstract: In this article, we introduce a new and simple data structure, the prefix table under Hamming distance, and present two algorithms to compute it efficiently: one asymptotically fast; the other very fast on average and in practice. Because the latter approach avoids the computation of global data structures, such as the suffix array and the longest common prefix array, it yields algorithms much faster in practice than existing methods. We show how this data structure can be used to solve two string problems of interest: (a) approximate string matching under Hamming distance; and (b) longest approximate overlap under Hamming distance. Analogously, we introduce the prefix table under edit distance, and present an efficient algorithm for its computation. In the process, we also define the border array under both distance measures, and provide an algorithm for conversion between prefix tables and border arrays.
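
As a rough illustration of the data structure, the sketch below computes a prefix table under Hamming distance by direct comparison. It assumes the table stores, for each position i, the length of the longest substring starting at i that matches the prefix of the same length with at most k mismatches; that reading of the definition, the function name and the quadratic method are all assumptions and do not reproduce either algorithm from the article.

```python
def hamming_prefix_table(x, k):
    """For each i, length of the longest substring of x starting at i that
    matches the prefix of x of the same length with at most k mismatches.

    Quadratic-time sketch that compares characters one by one.
    """
    n = len(x)
    table = []
    for i in range(n):
        mismatches = 0
        length = 0
        while i + length < n:
            if x[i + length] != x[length]:
                if mismatches == k:
                    break  # one more mismatch would exceed the budget
                mismatches += 1
            length += 1
        table.append(length)
    return table


if __name__ == "__main__":
    print(hamming_prefix_table("abaab", 0))
    print(hamming_prefix_table("abaab", 1))
```

With k = 0 this reduces to the ordinary prefix table, which gives a quick sanity check on the definition.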

8 citations


Posted Content
TL;DR: In this paper, it was shown that the problem of computing a shortest solid cover of an indeterminate string is NP-complete for binary alphabet, and that the partial word covering problem is fixed-parameter tractable with respect to the number of non-solid symbols.
Abstract: We consider the problem of computing a shortest solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don't care symbol. We prove that indeterminate string covering problem and partial word covering problem are NP-complete for binary alphabet and show that both problems are fixed-parameter tractable with respect to $k$, the number of non-solid symbols. For the indeterminate string covering problem we obtain a $2^{O(k \log k)} + n k^{O(1)}$-time algorithm. For the partial word covering problem we obtain a $2^{O(\sqrt{k}\log k)} + nk^{O(1)}$-time algorithm. We prove that, unless the Exponential Time Hypothesis is false, no $2^{o(\sqrt{k})} n^{O(1)}$-time solution exists for either problem, which shows that our algorithm for this case is close to optimal. We also present an algorithm for both problems which is feasible in practice.

8 citations


Posted Content
TL;DR: These are the first results on the average-case complexity of pattern matching with wildcards which, as a by-product, provide the first provable separation in complexity between exact pattern matching and pattern matching with wildcards in the word RAM model.
Abstract: Pattern matching with wildcards is the problem of finding all factors of a text $t$ of length $n$ that match a pattern $x$ of length $m$, where wildcards (characters that match everything) may be present. In this paper we present a number of fast average-case algorithms for pattern matching where wildcards are restricted to either the pattern or the text; however, the results are easily adapted to the case where wildcards are allowed in both. We analyse the average-case complexity of these algorithms and show the first non-trivial time bounds. These are the first results on the average-case complexity of pattern matching with wildcards which, as a by-product, provide the first provable separation in complexity between exact pattern matching and pattern matching with wildcards in the word RAM model.
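
The problem definition can be sketched with a naive O(nm) scan. This is only a baseline for the definition, not one of the average-case algorithms analysed in the paper; the wildcard symbol `?` and the function name are assumptions chosen for the example.

```python
WILDCARD = "?"  # assumed wildcard symbol; it matches any single character


def wildcard_matches(text, pattern):
    """Return start indices where `pattern` matches a factor of `text`.

    Wildcards may appear in the text, the pattern, or both.
    """
    m = len(pattern)

    def chars_match(a, b):
        return a == b or a == WILDCARD or b == WILDCARD

    return [
        i
        for i in range(len(text) - m + 1)
        if all(chars_match(text[i + j], pattern[j]) for j in range(m))
    ]


if __name__ == "__main__":
    print(wildcard_matches("abcabd", "ab?"))
```

Because wildcard matching is not transitive, the usual failure-function shortcuts of exact matching do not carry over directly, which is one reason the naive scan is a natural reference point.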

Book ChapterDOI
15 Dec 2014
TL;DR: It is proved that both the indeterminate string covering problem and the partial word covering problem are NP-complete for binary alphabet, and shown that both problems are fixed-parameter tractable with respect to \(k\), the number of non-solid symbols.
Abstract: We consider the problem of computing a solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don’t care symbol. We prove that both indeterminate string covering problem and partial word covering problem are NP-complete for binary alphabet and show that both problems are fixed-parameter tractable with respect to \(k\), the number of non-solid symbols. For the indeterminate string covering problem we obtain a \(2^{\mathcal {O}(k\log k)} + n k^{\mathcal {O}(1)}\)-time algorithm. For the partial word covering problem we obtain a \(2^{\mathcal {O}(\sqrt{k}\log k)} + nk^{\mathcal {O}(1)}\)-time algorithm. We prove that, unless the Exponential Time Hypothesis is false, no \(2^{o(\sqrt{k})} n^{\mathcal {O}(1)}\)-time solution exists for this problem, which shows that our algorithm for this case is close to optimal. We also present an algorithm for both problems which is feasible in practice.

Journal ArticleDOI
TL;DR: The average number of powers and runs occurring in a word of length n drawn from an alphabet of size σ is studied; estimates with an o(n) error term are given for the number of powers of exponent r, for an upper bound on the number of runs, and for the number of palindromes.

Journal ArticleDOI
TL;DR: In this paper, a graph-theoretic model was proposed to solve the swap matching problem, and the resulting algorithms are adaptations of the classic shift-and algorithm for patterns having length similar to the word-size of the target machine.
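
For readers unfamiliar with the classic shift-and algorithm mentioned above, here is a minimal sketch of its exact-matching form. It does not handle swaps and is not the adaptation described in the paper; the function name is illustrative, and Python integers stand in for the machine word, so the pattern-length restriction does not apply here.

```python
def shift_and(text, pattern):
    """Classic bit-parallel shift-and exact matching.

    Bit j of `state` is set when pattern[0..j] matches the text
    ending at the current position.
    """
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit j set exactly when pattern[j] == c.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)
    state = 0
    matches = []
    for i, c in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            matches.append(i - m + 1)
    return matches


if __name__ == "__main__":
    print(shift_and("abababa", "aba"))
```

On a real machine the state fits in one word only when the pattern length is at most the word size, which is the regime the TL;DR refers to.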

01 Jan 2014
TL;DR: A suboptimal average-case algorithm for exact circular string matching requiring time O(n) and two fast average-case algorithms for approximate circular string matching with k-mismatches, under the Hamming distance model, are presented.
Abstract: Background: Circular string matching is a problem which naturally arises in many biological contexts. It consists in finding all occurrences of the rotations of a pattern of length m in a text of length n. There exist optimal average-case algorithms for exact circular string matching. Approximate circular string matching is a rather undeveloped area. Results: In this article, we present a suboptimal average-case algorithm for exact circular string matching requiring time O(n). Based on our solution for the exact case, we present two fast average-case algorithms for approximate circular string matching with k-mismatches, under the Hamming distance model, requiring time O(n) for moderate values of k, that is k = O(m/log m). We show how the same results can be easily obtained under the edit distance model. The presented algorithms are also implemented as library functions. Experimental results demonstrate that the functions provided in this library accelerate the computations by more than three orders of magnitude compared to a naive approach. Conclusions: We present two fast average-case algorithms for approximate circular string matching with k-mismatches; and show that they also perform very well in practice. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any biological pipeline. The source code of the library is freely available at http://www.inf.kcl.ac.uk/research/projects/asmf/.
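
A naive approach of the kind the abstract compares against might look like the sketch below, which tests every window against every rotation under Hamming distance. The function names are hypothetical, and this is not the library's interface or either of the average-case algorithms.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def approx_circular_matches(text, pattern, k):
    """Start indices of windows within Hamming distance k of some rotation
    of `pattern`. Naive O(n * m^2) baseline for illustration only."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    rotations = {pattern[r:] + pattern[:r] for r in range(m)}
    return [
        i
        for i in range(len(text) - m + 1)
        if any(hamming(text[i:i + m], rot) <= k for rot in rotations)
    ]


if __name__ == "__main__":
    print(approx_circular_matches("acgta", "gta", 0))
    print(approx_circular_matches("acgta", "gta", 1))
```

Setting k = 0 recovers exact circular string matching, so the same routine can be used to cross-check a faster exact implementation on small inputs.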

Proceedings ArticleDOI
03 Dec 2014
TL;DR: The preliminary experimental results show that the proposed method is successful for the classification of web spam bot in the presence of decoy actions, hence eliminating spam in Web 2.0 applications.
Abstract: Based on recent research and statistics by Symantec, a significant amount of all global web traffic and email traffic is marked as spam. A spambot is basically a robot that maliciously traverses the World Wide Web (WWW) and gathers information, email addresses, etc. for the spammer. The growing sophistication of spam bots has led to the introduction of Spam 2.0, which infiltrates legitimate Web 2.0 applications with unsolicited content. This leads to various unwanted outcomes, such as the appearance of spam pages among the top search engine results due to excessive usage of popular terms, unreal web-page visit rates, spam emails, and wasted resources. Here we present an efficient method to detect web spam bot in the presence of decoy actions, by applying efficient approximate string-matching techniques. Our preliminary experimental results show that the proposed method is successful for the classification of web spam bot in the presence of decoy actions, hence eliminating spam in Web 2.0 applications.

Posted Content
TL;DR: In this article, the authors presented a new algorithm for approximate circular string matching under the edit distance model with optimal average case search time O(n(k + log m)/m).
Abstract: Approximate string matching is the problem of finding all factors of a text t of length n that are at a distance at most k from a pattern x of length m. Approximate circular string matching is the problem of finding all factors of t that are at a distance at most k from x or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time O(n(k + log m)/m). Optimal average-case search time can also be achieved by the algorithms for multiple approximate string matching (Fredriksson and Navarro, 2004) using x and its rotations as the set of multiple patterns. Here we reduce the preprocessing time and space requirements compared to that approach.

Journal ArticleDOI
TL;DR: A generalisation of the authors' solution is presented for extending an alignment with k-mismatches and ℓ-gaps, with a tight (Θ) bound on the running time.

DOI
01 Jan 2014
TL;DR: In this paper, the problem of finding a shortest solid string whose occurrences cover the whole indeterminate string was shown to be NP-complete for indeterminate strings and even for partial words.
Abstract: Indeterminate strings are a subclass of non-standard words having non-deterministic nature. In a classic string every position contains exactly one symbol (we say it is a solid symbol), while in an indeterminate string a position may contain a set of symbols (possible at this position); such sets are called non-solid symbols. The most important subclass of indeterminate strings are partial words, where each non-solid symbol is the whole alphabet; in this case non-solid symbols are also called don't care symbols. We consider the problem of finding a shortest cover of an indeterminate string, i.e., finding a shortest solid string whose occurrences cover the whole indeterminate string. We show that this classical problem becomes NP-complete for indeterminate strings and even for partial words. The proof of this fact is one of the main results of this paper. Our other main results focus on design of algorithms efficient with respect to certain parameters of the input (so called FPT algorithms) for the shortest cover problem. For the indeterminate string covering problem we obtain an $O(nk^2 + 2^k k^3)$-time algorithm, where $k$ is the number of non-solid symbols, while for the partial word covering problem we obtain a running time of $O(nk^2 + 2^{O(\sqrt{k}\log k)})$. Additionally, we prove that, unless the Exponential Time Hypothesis is false, no $2^{o(\sqrt{k})} n^{O(1)}$-time solution exists for either problem, which shows that our algorithm for partial words is close to optimal. We also present an algorithm for both problems parameterized both by $k$ and the alphabet size with a simple implementation. A preliminary version of this article was presented at the 25th International Symposium on Algorithms and Computation (ISAAC 2014), LNCS, vol. 8889, pp. 220–232, Springer (2014) [12].
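
The definition of a shortest solid cover of a partial word can be made concrete with a brute-force sketch. It is exponential in the number of don't care symbols in the candidate prefix and is not one of the FPT algorithms from the article. The hole symbol `*`, the explicit `alphabet` argument and the function names are assumptions; the search uses the fact that any cover must occur at position 0, so candidates are obtained by filling the holes of a prefix.

```python
from itertools import product

HOLE = "*"  # assumed don't care symbol


def occurs_at(w, c, i):
    """True if solid string c matches partial word w at position i."""
    return all(w[i + j] == HOLE or w[i + j] == c[j] for j in range(len(c)))


def is_cover(w, c):
    """True if occurrences of c cover every position of w."""
    n, m = len(w), len(c)
    covered_up_to = 0  # all positions before this index are covered
    for i in range(n - m + 1):
        if i > covered_up_to:
            return False  # position covered_up_to can no longer be covered
        if occurs_at(w, c, i):
            covered_up_to = i + m
    return covered_up_to == n


def shortest_cover(w, alphabet):
    """Brute force: try prefixes by increasing length, filling their holes."""
    for m in range(1, len(w) + 1):
        holes = [j for j in range(m) if w[j] == HOLE]
        for fill in product(sorted(alphabet), repeat=len(holes)):
            candidate = list(w[:m])
            for j, ch in zip(holes, fill):
                candidate[j] = ch
            candidate = "".join(candidate)
            if is_cover(w, candidate):
                return candidate
    return None  # only reached for the empty word


if __name__ == "__main__":
    print(shortest_cover("aba*a", "ab"))
```

For a non-empty word the loop always terminates with an answer, because filling the holes of the whole word yields a trivial cover.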

Book ChapterDOI
19 Sep 2014
TL;DR: This paper presents an on-line O(n)-time algorithm to calculate the size of a minimum closed cover for each prefix of a given string w of length n and shows a method to recover a minimum closed cover of each prefix in a greedy manner from right to left.
Abstract: The Minimum Closed Covers problem asks us to compute a minimum size of a closed cover of a given string. In this paper we present an on-line O(n)-time algorithm to calculate the size of a minimum closed cover for each prefix of a given string w of length n. We also show a method to recover a minimum closed cover of each prefix of w in a greedy manner from right to left.