
Showing papers by "Arlindo L. Oliveira published in 2008"


Journal ArticleDOI
TL;DR: A new compressed self-index locates the occurrences of P in O((m + occ) log u) time, where occ is the number of occurrences; the fundamental improvement over previous LZ78-based indexes is the reduction of the search time dependency on m from O(m²) to O(m).
Abstract: A compressed full-text self-index for a text T, of size u, is a data structure used to search for patterns P, of size m, in T, that requires reduced space, i.e. space that depends on the empirical entropy (H_k or H_0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log u) time, where occ is the number of occurrences. The fundamental improvement over previous LZ78-based indexes is the reduction of the search time dependency on m from O(m²) to O(m). To achieve this result we point out the main obstacle to linear-time algorithms based on LZ78 data compression, and expose and explore the nature of a recurrent structure in LZ-indexes, the T_78 suffix tree. We show that our method is very competitive in practice by comparing it against other state-of-the-art compressed indexes.
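For readers unfamiliar with the underlying compression scheme, the sketch below (ours, not the paper's index) shows the LZ78 parsing that LZ78-based self-indexes build on: each phrase extends a previously seen phrase by one character, which is what gives rise to trie-shaped structures such as the T_78 suffix tree mentioned above.

```python
def lz78_parse(text):
    """Return the LZ78 parsing of `text` as a list of (phrase_id, char) pairs."""
    dictionary = {"": 0}            # phrase -> id; id 0 is the empty phrase
    phrases = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch           # keep extending a phrase we have already seen
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                     # flush a trailing, already-known phrase
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases

print(lz78_parse("abababababb"))    # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (2, 'b')]
```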

46 citations


Book ChapterDOI
07 Apr 2008
TL;DR: This paper introduces the first compressed suffix tree representation that requires sublinear extra space and supports a large set of navigational operations in logarithmic time, and reveals important connections between LCA queries and suffix tree navigation.
Abstract: Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small, as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
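As a back-of-envelope illustration of the space gap described above (our numbers, not the paper's), the snippet below compares the text itself, a classical O(n log n)-bit suffix tree, and the Θ(n) extra bits of earlier compressed representations for a DNA-sized input.

```python
import math

n = 3_000_000_000           # characters, roughly a human genome
sigma = 4                   # DNA alphabet size

string_bits    = n * math.log2(sigma)   # n log2(sigma) bits for the plain text
classical_bits = n * math.log2(n)       # O(n log n) bits, constant factors ignored
extra_bits     = n                      # Theta(n) extra bits of earlier compressed suffix trees

gib = lambda bits: bits / 8 / 2**30
print(f"plain text            : {gib(string_bits):6.2f} GiB")
print(f"classical suffix tree : {gib(classical_bits):6.2f} GiB (ignoring constants)")
print(f"Theta(n) extra bits   : {gib(extra_bits):6.2f} GiB on top of the compressed text")
```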

26 citations


Book ChapterDOI
18 Jun 2008
TL;DR: This paper shows how to support dynamic FCSTs within the same optimal space as the static version, executing all operations in polylogarithmic time; in particular, the suffix tree can be built within optimal space.
Abstract: Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics, data compression and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. A recent so-called fully-compressed suffix tree (FCST) requires asymptotically only the space of the text entropy. FCSTs, however, have the disadvantage of being static, not supporting updates to the text. In this paper we show how to support dynamic FCSTs within the same optimal space as the static version, executing all the operations in polylogarithmic time. In particular, we are able to build the suffix tree within optimal space.

25 citations


Journal ArticleDOI
TL;DR: This work confirmed the very close relationships among ccrAB alleles associated with SCCmec types I–IV and VI, which were found to be independent of the MRSA lineage, geographic origin or isolation period, and developed an online resource for storage and automatic analysis of ccrB internal sequences obtained using the previously published protocol.
Abstract: Sir, In 2006, we published a manuscript in this journal describing the allelic variation in the ccrAB locus in a representative collection of methicillin-resistant Staphylococcus aureus (MRSA) based on DNA sequencing of internal fragments of both genes. This work confirmed the very close relationships among ccrAB alleles associated with SCCmec types I–IV and VI, which were found to be independent of the MRSA lineage, geographic origin or isolation period. Moreover, particularly for the ccrB gene, SCCmec types II and IV, both defined by ccrAB allotype 2, could be discriminated. This method provides a significant improvement in ccrAB typing resolution since these SCCmec types have different epidemiological characteristics: type II is mostly found among hospital-acquired MRSA (e.g. ST5/USA100 and ST36/EMRSA-16 clones), whereas type IV is mostly found among community-acquired MRSA (e.g. ST80, ST30 and ST1 clones). Based on these observations and since SCCmec types are defined based on the ccrAB allotype and the genetic organization of the mecA locus, we have proposed that sequencing an internal fragment of ccrB could be used as a SCCmec typing strategy, either as a first-line assay or as a confirmation tool for SCCmec type assignments. From a practical perspective, ccrB sequence typing can be easily incorporated in other widespread sequence-based MRSA typing strategies, such as multilocus sequence typing (MLST) and spa typing. Under this rationale, we have developed an online resource for storage and automatic analysis of ccrB internal sequences obtained using our previously published protocol. The so-called ‘ccrB typing tool’ was launched in late October 2007 and is freely available at http://www.ccrbtyping.net. A detailed tutorial is available online, as well as the contacts of the site developers. Users can access the ccrB typing online database either anonymously or as registered users; registration is required for submission of data to the public database or to create a personal and private online database of ccrB alleles. Users can paste ccrB internal sequences in the FASTA format, which can be automatically trimmed to fix the sequence length for analysis at 455 bp. Then, after an automatic multiple sequence alignment to the known ccrB alleles in the database, the user’s sequence is either assigned to a ccrB allele (based on 100% homology) or to a new one, if the homology to any of the available alleles is between 90% and 100%. If a new allele is found, the most similar allele is indicated and, after submission to the public database, an allele number is assigned. Based on this assignment, a prediction of the ccrAB allotype and SCCmec type is also outputted. The user can also check all outputs by inspecting a graphical display of the multiple sequence alignment and the reconstruction of neighbour-joining or average-distance trees available through a Java applet. Users can also select subsets of private and public sequences to run the multiple sequence alignment algorithm and visualize the resulting trees. If users choose to submit their data to the public database, the submission process is validated by a curator who checks for data consistency and quality. If a new ccrB allele is found, users are requested to upload both trace files. Upon development of the ‘ccrB typing tool’, we have deposited all sequences described in Oliveira et al. and also all ccrB sequences available at GenBank (www.ncbi.nlm.nih.gov, last accessed on 16 November 2007) covering the same 455 bp used in the ccrB typing tool.
Besides ccrB sequences for S. aureus, 13 sequences for coagulase-negative staphylococci (CoNS), such as Staphylococcus epidermidis, Staphylococcus hominis, Staphylococcus saprophyticus and Staphylococcus warneri, were inserted. Altogether, as of 15 November 2007, 96 ccrB internal sequences were made available, which were assigned to 17 alleles (Figure 1). In spite of the increased size of the collection and the extension to staphylococcal species other than S. aureus, the conclusions obtained for the well-defined MRSA collection are still valid. This is particularly relevant if one takes into account that 45 sequences were assigned to SCCmec type IV, the most variable structural type, and that there is a great diversity in the SCCmec elements circulating in the CoNS population. Among the 96 ccrB isolates, five were described for methicillin-susceptible strains (i.e. SCCmec negative). Although one sequence was assigned to a new cluster (ccrB allele 700), the remaining sequences were clustered in previously existing groups, suggesting that ccrB typing might also be useful for the characterization of other SCC elements. In conclusion, ccrB typing is indeed a promising SCCmec typing strategy since there is a robust correlation between ccrB allelic clusters and SCCmec types.
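A minimal sketch of the allele-assignment rule described above, assuming the query has already been trimmed to the same 455 bp window as the reference alleles (the actual tool at http://www.ccrbtyping.net performs a multiple sequence alignment; the sequences and allele names below are toy examples).

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def assign_allele(query, known_alleles):
    """known_alleles: dict mapping allele name -> trimmed reference sequence."""
    best_name, best_id = None, 0.0
    for name, ref in known_alleles.items():
        ident = identity(query, ref)
        if ident > best_id:
            best_name, best_id = name, ident
    if best_id == 1.0:
        return f"allele {best_name}"
    if best_id >= 0.90:
        return f"new allele (closest: {best_name}, {best_id:.1%} identity)"
    return "no ccrB allele call"

known = {"4": "ACGTACGTAC", "5": "ACGTACGTTT"}   # toy 10 bp stand-ins for 455 bp alleles
print(assign_allele("ACGTACGTAA", known))        # new allele (closest: 4, 90.0% identity)
```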

21 citations


Journal ArticleDOI
TL;DR: It is concluded that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore, that non-uniformity is a good indicator of biological relevance and can be used to complement over-representation tests commonly used.
Abstract: Motif finding algorithms have improved in their ability to detect patterns in biological sequences using computationally efficient methods. However, the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in the DNA sequences, which might indicate not only that the pattern is important, but also provide hints about the positions where these patterns occur preferentially. We propose to integrate position uniformity tests and over-representation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (chi-square, Kolmogorov-Smirnov and a chi-square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed best on this dataset, we proceeded to study the positional distribution of several well-known cis-regulatory elements in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several dicotyledonous plants). The results show that position conservation is relevant for the transcriptional machinery. We conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore that non-uniformity is a good indicator of biological relevance and can be used to complement the over-representation tests commonly used. In this article we present the results obtained for the S. cerevisiae data sets.
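The snippet below sketches the kind of positional-uniformity check compared in the paper, using the plain chi-square variant; the bin count, promoter length and the numpy/scipy dependency are our choices for illustration, not the paper's exact protocol.

```python
import numpy as np
from scipy.stats import chisquare

def positional_bias_pvalue(positions, promoter_length=1000, n_bins=10):
    """Low p-value -> motif start positions are unlikely to be uniformly distributed."""
    counts, _ = np.histogram(positions, bins=n_bins, range=(0, promoter_length))
    _, pvalue = chisquare(counts)    # expected frequencies default to uniform
    return pvalue

# A motif piling up close to one end of the promoter:
biased = np.random.default_rng(0).normal(loc=900, scale=40, size=200).clip(0, 999)
print(positional_bias_pvalue(biased))   # tiny p-value -> non-uniform, potentially relevant
```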

21 citations


Book ChapterDOI
20 May 2008
TL;DR: A number of effective improvements to PBO-based HIPP are proposed, including the use of lower bounding and pruning techniques that have proved effective in other approaches; this reduces by 50% the number of instances that remain unsolvable by HIPP-based approaches.
Abstract: Haplotype inference has relevant biological applications, and represents a challenging computational problem. Among other formulations, pure parsimony provides a viable modeling approach for haplotype inference, with a simple optimization criterion. Alternative approaches have been proposed for haplotype inference by pure parsimony (HIPP), including branch and bound, integer programming and, more recently, propositional satisfiability and pseudo-Boolean optimization (PBO). Among these, the currently best performing HIPP approach is based on PBO. This paper proposes a number of effective improvements to PBO-based HIPP, including the use of lower bounding and pruning techniques that have proved effective in other approaches. The new PBO-based HIPP approach reduces by 50% the number of instances that remain unsolvable by HIPP-based approaches.
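To make the lower-bounding idea concrete, here is a standard HIPP bound sketched from the literature (not necessarily the exact procedure of this paper): two genotypes that are homozygous for different values at some site cannot share a haplotype, so a set of pairwise-incompatible genotypes gives an additive lower bound on the haplotype count.

```python
# Sites are encoded 0 (homozygous 0), 1 (homozygous 1) or 2 (heterozygous).
def incompatible(g1, g2):
    return any(a != b and a != 2 and b != 2 for a, b in zip(g1, g2))

def hipp_lower_bound(genotypes):
    clique = []
    for g in genotypes:                      # greedy, order-dependent clique
        if all(incompatible(g, h) for h in clique):
            clique.append(g)
    # a genotype with a heterozygous site needs 2 haplotypes, otherwise 1
    return sum(2 if 2 in g else 1 for g in clique)

genotypes = [(0, 2, 1), (1, 2, 0), (0, 0, 0), (2, 1, 1)]
print(hipp_lower_bound(genotypes))           # 5 for this toy instance
```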

13 citations


Book ChapterDOI
10 Nov 2008
TL;DR: A new approach is taken to successfully and efficiently cluster these large graphs by analyzing clique overlap and a priori induced cliques, using query coverage inspection to extract semantic relations between queries and their terms.
Abstract: In this paper we propose a method for the analysis of very large graphs obtained from query logs, using query coverage inspection. The goal is to extract semantic relations between queries and their terms. We take a new approach to successfully and efficiently cluster these large graphs by analyzing clique overlap and a priori induced cliques. The clustering quality is evaluated with an extension of the modularity score. Results obtained with real data show that the identified clusters can be used to infer properties of the queries and interesting semantic relations between them and their terms. The quality of the semantic relations is evaluated both using a tf-idf based score and data from the Open Directory Project. The proposed approach is also able to identify and filter out multitopical URLs, a feature that is interesting in itself.
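The toy example below is not the paper's algorithm, but it illustrates the two ingredients named in the abstract, clique-based clustering of a query graph and a modularity score, using networkx; the graph, the edge rule and the clique size k=3 are our assumptions.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities, modularity

G = nx.Graph()
G.add_edges_from([
    ("cheap flights", "flight deals"), ("flight deals", "airline tickets"),
    ("cheap flights", "airline tickets"),        # a travel clique
    ("python tutorial", "learn python"), ("learn python", "python course"),
    ("python tutorial", "python course"),        # a programming clique
    ("airline tickets", "python tutorial"),      # a spurious bridge edge
])

clusters = [set(c) for c in k_clique_communities(G, 3)]   # clusters from overlapping 3-cliques
print(clusters)
print(modularity(G, clusters))   # valid here because the clusters happen to partition the nodes
```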

11 citations


01 Jan 2008
TL;DR: This paper describes and evaluates two fundamentally different modeling and algorithmic solutions for the computation of minimum-size prime implicants of Boolean functions: one is based on explicit search methods and uses Integer Linear Programming models and algorithms, whereas the other is based on implicit techniques and uses Binary Decision Diagrams.
Abstract: Minimum-size prime implicants of Boolean functions find application in many areas of Computer Science including, among others, Electronic Design Automation and Artificial Intelligence. The main purpose of this paper is to describe and evaluate two fundamentally different modeling and algorithmic solutions for the computation of minimum-size prime implicants. One is based on explicit search methods, and uses Integer Linear Programming models and algorithms, whereas the other is based on implicit techniques, and so it uses Binary Decision Diagrams. For the explicit approach we propose new dedicated ILP algorithms, specifically targeted at solving these types of problems. As shown by the experimental results, other well-known ILP algorithms are in general impractical for computing minimum-size prime implicants. Moreover, we experimentally evaluate the two proposed algorithmic strategies.
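The sketch below only illustrates the problem being solved (the paper attacks it with dedicated ILP algorithms and with BDDs; exhaustive search like this is feasible only for tiny functions). It relies on the fact that a minimum-size implicant is necessarily prime, since dropping any literal would yield an even smaller implicant.

```python
from itertools import combinations, product

def min_size_prime_implicant(f, n_vars):
    """f: callable on a tuple of n_vars bools. Returns a smallest implicant as a
    dict {var_index: required_value}, or None if f is unsatisfiable."""
    assignments = list(product([False, True], repeat=n_vars))
    for size in range(n_vars + 1):                       # try terms of increasing size
        for vars_ in combinations(range(n_vars), size):
            for values in product([False, True], repeat=size):
                term = dict(zip(vars_, values))
                covered = [a for a in assignments
                           if all(a[v] == val for v, val in term.items())]
                if covered and all(f(a) for a in covered):
                    return term                          # the term implies f: an implicant
    return None

# Example: f = (x0 AND x1) OR x2 has the single-literal prime implicant x2.
print(min_size_prime_implicant(lambda a: (a[0] and a[1]) or a[2], 3))   # {2: True}
```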

10 citations


Book ChapterDOI
10 Nov 2008
TL;DR: This work presents a new search procedure for approximate string matching over suffix trees, and shows that hierarchical verification, which is a well-established technique for on-line searching, can also be used with an indexed approach.
Abstract: We present a new search procedure for approximate string matching over suffix trees. We show that hierarchical verification, which is a well-established technique for on-line searching, can also be used with an indexed approach. For this, we need the index to support bidirectionality, meaning that the search for a pattern can be updated by adding a letter at the right or at the left. This turns out to be easily supported by most compressed text self-indexes, which represent the index and the text essentially in the same space as the compressed text alone. To complete the symbiotic exchange, our hierarchical verification largely reduces the need to access the text, which is expensive in compressed text self-indexes. The resulting algorithm can, in particular, run over an existing fully-compressed suffix tree, which makes it very appealing for applications in computational biology. We compare our algorithm with related approaches, showing that our method offers an interesting space/time tradeoff and, in particular, does not need any parameterization, which is necessary in the most successful competing approaches.
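The sketch below shows the filtration-and-verification pattern this line of work builds on, with plain str.find standing in for the compressed index and a single full verification instead of the paper's hierarchical one; the piece splitting and window sizes are our simplifications.

```python
def best_match_distance(pattern, window):
    """Minimum edit distance between `pattern` and any substring of `window`."""
    prev = [0] * (len(window) + 1)                 # a match may start anywhere in the window
    for i, pc in enumerate(pattern, 1):
        cur = [i]
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (pc != wc)))
        prev = cur
    return min(prev)                               # a match may end anywhere in the window

def approx_search(text, pattern, k):
    """Candidate start positions of regions containing `pattern` with at most k edits."""
    m = len(pattern)
    step = m // (k + 1)
    pieces = [pattern[i * step:(i + 1) * step] for i in range(k)] + [pattern[k * step:]]
    hits = set()
    for idx, piece in enumerate(pieces):           # with <= k errors, one piece survives intact
        start = text.find(piece)
        while start != -1:
            lo = max(0, start - idx * step - k)    # window around the exact piece hit
            window = text[lo:lo + m + 2 * k]
            if best_match_distance(pattern, window) <= k:
                hits.add(lo)
            start = text.find(piece, start + 1)
    return sorted(hits)

print(approx_search("the qvick brown fox", "quick", 1))   # [3]: the region around "qvick"
```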

8 citations


Book ChapterDOI
26 Jul 2008
TL;DR: It is concluded that suffix arrays, when compared to suffix trees in terms of the trade-off among time, memory, and compression ratio, may be preferable in scenarios where memory is at a premium and high speed is not critical.
Abstract: Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used nowadays. Regarding time and memory requirements, LZ encoding is much more demanding than decoding. In order to speed up the encoding process, efficient data structures, like suffix trees, have been used. In this paper, we explore the use of suffix arrays to hold the dictionary of the LZ encoder, and propose an algorithm to search over it. We show that the resulting encoder attains roughly the same compression ratios as those based on suffix trees. However, the amount of memory required by the suffix array is fixed, and much lower than the variable amount of memory used by encoders based on suffix trees (which depends on the text to encode). We conclude that suffix arrays, when compared to suffix trees in terms of the trade-off among time, memory, and compression ratio, may be preferable in scenarios (e.g., embedded systems) where memory is at a premium and high speed is not critical.
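A toy version of the core operation, with our simplifications: the suffix array is rebuilt by naive sorting (for clarity only), and the search inspects the two suffixes adjacent to the binary-search position, which are the ones sharing the longest prefix with the lookahead.

```python
import bisect

def longest_match(dictionary, lookahead):
    """(position, length) of the longest prefix of `lookahead` seen in `dictionary`."""
    sa = sorted(range(len(dictionary)), key=lambda i: dictionary[i:])   # naive suffix array
    suffixes = [dictionary[i:] for i in sa]
    best_pos, best_len = 0, 0
    idx = bisect.bisect_left(suffixes, lookahead)
    for j in (idx - 1, idx):                      # neighbours share the longest common prefix
        if 0 <= j < len(suffixes):
            s, l = suffixes[j], 0
            while l < min(len(s), len(lookahead)) and s[l] == lookahead[l]:
                l += 1
            if l > best_len:
                best_pos, best_len = sa[j], l
    return best_pos, best_len

print(longest_match("abracadabra", "abrac"))      # (0, 5): "abrac" was seen at position 0
```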

5 citations


Proceedings ArticleDOI
03 Nov 2008
TL;DR: This paper provides an overview of SAT-based approaches for solving the HIPP problem and identifies current research directions.
Abstract: Boolean satisfiability (SAT) finds a wide range of practical applications, including Artificial Intelligence and, more recently, Bioinformatics. Although encoding some combinatorial problems using Boolean logic may not be the most intuitive solution, the efficiency of state-of-the-art SAT solvers often makes it worthwhile to consider encoding a problem to SAT. One representative application of SAT in Bioinformatics is haplotype inference. The problem of haplotype inference under the assumption of pure parsimony consists in finding the smallest number of haplotypes that explains a given set of genotypes. The original formulations for solving the problem of Haplotype Inference by Pure Parsimony (HIPP) were based on Integer Linear Programming. More recently, solutions based on SAT have been shown to be remarkably more efficient. This paper provides an overview of SAT-based approaches for solving the HIPP problem and identifies current research directions.
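The relation every SAT/PBO encoding of HIPP is built around can be stated in a few lines; the sketch below (ours, for illustration) checks whether a candidate set of haplotypes explains a set of genotypes, which is the decision question answered for each candidate size.

```python
from itertools import combinations_with_replacement

# Genotype sites: 0 (homozygous 0), 1 (homozygous 1), 2 (heterozygous).
def explains(h1, h2, g):
    """True if the haplotype pair (h1, h2) explains genotype g."""
    return all((a == b == s) if s != 2 else (a != b)
               for a, b, s in zip(h1, h2, g))

def set_explains_all(haplotypes, genotypes):
    return all(any(explains(h1, h2, g)
                   for h1, h2 in combinations_with_replacement(haplotypes, 2))
               for g in genotypes)

genotypes  = [(0, 2, 1), (2, 0, 1)]
candidates = [(0, 0, 1), (0, 1, 1), (1, 0, 1)]
print(set_explains_all(candidates, genotypes))    # True: three haplotypes suffice here
```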

01 May 2008
TL;DR: It is concluded that specialized PBO solvers are more suitable than generic ILP solvers on a variety of HIPP models.
Abstract: Haplotype inference is an important and computationally challenging problem in genetics. A well-known approach to haplotype inference is pure parsimony (HIPP). Despite being based on a simple optimization criterion, HIPP is a computationally hard problem. Recent work has shown that approaches based on Boolean satisfiability, namely pseudo-Boolean optimization (PBO), are very effective at tackling the HIPP problem. Extensive work on PBO-based HIPP approaches has recently been developed. Considering that the PBO problem, also known as the 0-1 ILP problem, is a particular case of the integer linear programming (ILP) problem, generic ILP solvers can be considered. This paper compares the performance of PBO and ILP solvers on a variety of HIPP models. We conclude that specialized PBO solvers are more suitable than generic ILP solvers.

Book ChapterDOI
25 Jun 2008
TL;DR: In this article, the authors identify three design principles (minimal ontological commitment, granularity separation, and orthogonal domain) and two deployment techniques (intensionally normalized form (INF) and extensionally normalized form (ENF)) as potential remedies for these problems.
Abstract: The fundamental issue of knowledge sharing on the web is the ability to share the ontological constraints associated with Uniform Resource Identifiers (URIs). To maximize the expressiveness and robustness of an ontological system on the web, each ontology should ideally be designed for a confined conceptual domain and deployed with minimal dependencies upon others. Through a retrospective analysis of the existing design of the BioPAX ontologies, we illustrate the problems often encountered in ontology design and deployment. In this paper, we identify three design principles (minimal ontological commitment, granularity separation, and orthogonal domain) and two deployment techniques (intensionally normalized form, INF, and extensionally normalized form, ENF) as potential remedies for these problems.

Book ChapterDOI
06 May 2008
TL;DR: This work proposes a method that processes the output of combinatorial motif finders in order to find groups of motifs that represent variations of the same motif, thus reducing the output to a manageable size.
Abstract: Many algorithms have been proposed to date for the problem of finding biologically significant motifs in promoter regions. They can be classified into two large families: combinatorial methods and probabilistic methods. Probabilistic methods have been used more extensively, since their output is easier to interpret. Combinatorial methods have the potential to identify hard-to-detect motifs, but their output is much harder to interpret, since it may consist of hundreds or thousands of motifs. In this work, we propose a method that processes the output of combinatorial motif finders in order to find groups of motifs that represent variations of the same motif, thus reducing the output to a manageable size. This processing is done by building a graph that represents the co-occurrences of motifs, and finding communities in this graph. We show that this innovative approach leads to a method that is as easy to use as a probabilistic motif finder, and as sensitive to low-quorum motifs as a combinatorial motif finder. The method was integrated with two combinatorial motif finders, and made available on the Web.
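The pipeline can be pictured with the toy sketch below; the edge rule (Jaccard overlap of occurrence sets) and the community algorithm (networkx's greedy modularity maximization) are our stand-ins, not necessarily the choices made in the paper.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def motif_communities(occurrences, min_jaccard=0.5):
    """occurrences: dict motif -> set of sequences (or positions) where it occurs."""
    G = nx.Graph()
    G.add_nodes_from(occurrences)
    motifs = list(occurrences)
    for i, a in enumerate(motifs):
        for b in motifs[i + 1:]:
            inter = occurrences[a] & occurrences[b]
            union = occurrences[a] | occurrences[b]
            if union and len(inter) / len(union) >= min_jaccard:
                G.add_edge(a, b)                  # link motifs that co-occur strongly
    return [set(c) for c in greedy_modularity_communities(G)]

occurrences = {
    "TGACTC": {1, 2, 3, 4}, "TGACTA": {1, 2, 3},  # variants of one motif
    "CCGGAA": {7, 8, 9},    "CCGGTA": {7, 8},     # variants of another
}
print(motif_communities(occurrences))
```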

Book ChapterDOI
28 Aug 2008
TL;DR: The results prove that discrete models can be used in protein folding to obtain low resolution models and since the side chains are already present in the models, the refinement of these solutions is simpler and more effective.
Abstract: Discrete models are important to reduce the complexity of the protein folding problem. However, a compromise must be made between the model complexity and the accuracy of the model. Previous work by Park and Levitt has shown that the protein backbone can be modeled with good accuracy by four state discrete models. Nonetheless, for ab-initio protein folding, the side chains are important to determine if the structure is physically possible and well packed. We extend the work of Park and Levitt by taking into account the positioning of the side chain in the evaluation of the accuracy. We show that the problem becomes much harder and more dependent on the type of protein being modeled. In fact, the structure fitting method used in their work is no longer adequate to this extended version of the problem. We propose a new method to test the model accuracy. The presented results show that, for some proteins, the discrete models with side chains cannot achieve the accuracy of the backbone only discrete models. Nevertheless, for the majority of the proteins an RMSD of four angstrom or less is obtained, and, for many of those, we reach an accuracy near the two angstrom limit. These results prove that discrete models can be used in protein folding to obtain low resolution models. Since the side chains are already present in the models, the refinement of these solutions is simpler and more effective.
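The accuracy figures quoted above rest on RMSD after optimal superposition; a compact version using the standard Kabsch algorithm is sketched below (numpy-based, with made-up coordinates; this is the generic measure, not code from the paper).

```python
import numpy as np

def rmsd_after_superposition(P, Q):
    """RMSD between (N, 3) coordinate arrays P and Q after optimal rotation."""
    P = P - P.mean(axis=0)                        # centre both point sets
    Q = Q - Q.mean(axis=0)
    V, _, Wt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(V @ Wt))            # guard against improper rotations
    R = V @ np.diag([1.0, 1.0, d]) @ Wt           # optimal rotation taking Q onto P
    diff = P - Q @ R.T
    return np.sqrt((diff ** 2).sum() / len(P))

native = np.array([[0.0, 0, 0], [1.5, 0, 0], [3.0, 0.2, 0], [4.4, 0.9, 0.3]])
model  = native + np.random.default_rng(1).normal(scale=0.5, size=native.shape)
print(rmsd_after_superposition(model, native))    # RMSD in the same units as the coordinates
```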

Proceedings Article
01 Jan 2008
TL;DR: Suffix arrays are a very interesting option regarding the tradeoff between time, memory, and compression ratio when compared with suffix trees, which makes them preferable in some compression scenarios.
Keywords: Lempel-Ziv, Lossless Data Compression, Suffix Arrays, Suffix Trees, String Matching.
Abstract: Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used in a variety of applications. The LZ encoder and decoder exhibit a high asymmetry regarding time and memory requirements, with the former being much more demanding. Several techniques have been used to speed up the encoding process; among them is the use of suffix trees. In this paper, we explore the use of a simple data structure, named suffix array, to hold the dictionary of the LZ encoder, and propose an algorithm to search the dictionary. A comparison with the suffix tree based LZ encoder is carried out, showing that the compression ratios are roughly the same. The amount of memory required by the suffix array is fixed, being much lower than the variable memory requirements of the suffix tree encoder, which depend on the text to encode. We conclude that suffix arrays are a very interesting option regarding the tradeoff between time, memory, and compression ratio when compared with suffix trees, which makes them preferable in some compression scenarios.

Journal ArticleDOI
TL;DR: This work proposes a method to build ontologies encoding this structure information by applying grammar inference techniques, which results in a semi-automatic approach to the inference of such ontologies.
Abstract: Information produced by people usually has an implicit agreed-upon structure. However, this structure is not usually available to computer programs, where it could be used, for example, to aid in answering search queries. For example, when considering technical articles, one could ask for the occurrence of a keyword in a particular part of the article, such as the reference section. This implicit structure could be used, in the form of an ontology, to further the efforts of improving search in the semantic web. We propose a method to build ontologies encoding this structure information by the application of grammar inference techniques. This results in a semi-automatic approach to the inference of such ontologies. Our approach has two main components: (1) the inference of a grammatical description of the implicit structure of the supplied examples, and (2) the transformation of that description into an ontology. We present the application of the method to the inference of an ontology describing the structure of technical articles.