Showing papers by "Arlindo L. Oliveira" published in 2005


Proceedings ArticleDOI
01 Jan 2005
TL;DR: A new algorithm for identifying cis-regulatory modules in genomic sequences. It extracts structured motifs, defined as collections of highly conserved regions with pre-specified sizes and spacings between them, a type of motif that is highly relevant to research on gene regulatory mechanisms.
Abstract: In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motif is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential on the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensi.
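
Illustrative sketch (not the box-link algorithm from the paper): the definition of a structured motif, a sequence of fixed-size boxes separated by fixed spacings, can be made concrete with a brute-force occurrence check. The function names, the per-box mismatch threshold and the example sequence are hypothetical and chosen only for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurrences(seq, boxes, gaps, max_mismatches=0):
    """Return the start positions where a structured motif occurs in seq.

    boxes: list of box strings (the conserved regions).
    gaps:  list of spacings between consecutive boxes, len(boxes) - 1 long.
    max_mismatches: assumed per-box mismatch tolerance (an illustration
    parameter, not taken from the paper).
    """
    assert len(gaps) == len(boxes) - 1
    span = sum(len(b) for b in boxes) + sum(gaps)
    hits = []
    for start in range(len(seq) - span + 1):
        pos = start
        matched = True
        for i, box in enumerate(boxes):
            if hamming(seq[pos:pos + len(box)], box) > max_mismatches:
                matched = False
                break
            pos += len(box)
            if i < len(gaps):
                pos += gaps[i]  # skip the unconstrained spacer
        if matched:
            hits.append(start)
    return hits


# Two 6-letter boxes separated by a 5-letter spacer.
print(occurrences("AATTGACAGGGGGTATAATCC", ["TTGACA", "TATAAT"], [5]))
```

This check costs time proportional to the sequence length times the motif span for a single candidate motif; the paper's contribution concerns extracting such motifs efficiently, which this sketch does not attempt.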

51 citations


Journal ArticleDOI
TL;DR: This work surveys the existing approaches that generalize state merging algorithms by using search to explore the tree that represents the space of possible sequences of state mergings and presents comparisons of existing algorithms that show that the quality of the derived solutions is improved by applying this type of search.

46 citations


Book ChapterDOI
03 Oct 2005
TL;DR: This work proposes an algorithm that finds and reports all relevant biclusters in time linear on the size of the data matrix by manipulating a discretized version of the matrix and by using string processing techniques based on suffix trees.
Abstract: Several non-supervised machine learning methods have been used in the analysis of gene expression data obtained from microarray experiments. Recently, biclustering, a non-supervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The goal of biclustering is to find subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated behaviors. In the most common settings, biclustering is an NP-complete problem, and heuristic approaches are used to obtain sub-optimal solutions using reasonable computational resources. In this work, we examine a particular setting of the problem, where we are concerned with finding biclusters in time series expression data. In this context, we are interested in finding biclusters with consecutive columns. For this particular version of the problem, we propose an algorithm that finds and reports all relevant biclusters in time linear on the size of the data matrix. This complexity is obtained by manipulating a discretized version of the matrix and by using string processing techniques based on suffix trees. We report results in both synthetic and real data that show the effectiveness of the approach.
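
Illustrative sketch (not the linear-time suffix-tree algorithm from the paper): the idea of discretizing a time-series matrix and then looking for rows that share the same pattern over consecutive columns can be shown with a naive grouping approach. The up/down/no-change discretization, the threshold, and all function names are assumptions made for this example; the abstract does not specify the discretization scheme.

```python
from collections import defaultdict


def discretize(row, threshold=0.5):
    """Encode each step between consecutive time points as
    'U' (up), 'D' (down) or 'N' (no significant change)."""
    symbols = []
    for prev, curr in zip(row, row[1:]):
        delta = curr - prev
        if delta > threshold:
            symbols.append("U")
        elif delta < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)


def contiguous_biclusters(matrix, min_rows=2, min_cols=2, threshold=0.5):
    """Naively report every group of rows that share an identical symbol
    pattern over a window of consecutive discretized columns.

    Returns tuples (rows, start, end, pattern) with end exclusive.
    Reports all such groups, not only maximal ones.
    """
    strings = [discretize(r, threshold) for r in matrix]
    n_cols = len(strings[0]) if strings else 0
    found = []
    for start in range(n_cols):
        for end in range(start + min_cols, n_cols + 1):
            groups = defaultdict(list)
            for i, s in enumerate(strings):
                groups[s[start:end]].append(i)
            for pattern, rows in groups.items():
                if len(rows) >= min_rows:
                    found.append((rows, start, end, pattern))
    return found


expression = [
    [1, 2, 3, 2],  # gene 0: up, up, down
    [0, 1, 2, 1],  # gene 1: up, up, down
    [5, 4, 3, 4],  # gene 2: down, down, up
]
for bicluster in contiguous_biclusters(expression):
    print(bicluster)
```

This version examines every column window, so its cost grows quadratically with the number of columns. The abstract attributes the linear-time bound to suffix-tree techniques, which replace this explicit enumeration.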

45 citations


Book ChapterDOI
19 Jun 2005
TL;DR: It is pointed out that condensed neighborhoods are not a minimal representation of a pattern neighborhood, and an algorithm for generating Super Condensed Neighborhoods is presented, which takes O(m⌈ m / w ⌉ s) time and is very fast.
Abstract: Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Here, we point out that condensed neighborhoods are not a minimal representation of a pattern neighborhood. We show that we can restrict our attention to super condensed neighborhoods which are minimal. We then present an algorithm for generating Super Condensed Neighborhoods. The algorithm runs in O(m⌈ m / w ⌉ s), where m is the pattern size, s is the size of the super condensed neighborhood and w the size of the processor word. Previous algorithms took O(m⌈ m / w ⌉ c) time, where c is the size of the condensed neighborhood. We further improve this algorithm by using Bit-Parallelism and Increased Bit-Parallelism techniques. Our experimental results show that the resulting algorithm is very fast.
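
Illustrative sketch (a brute-force enumeration, not the bit-parallel algorithm from the paper): the abstract does not restate the definitions, so the code below assumes the ones commonly used in the approximate string matching literature. The k-neighborhood of a pattern is the set of strings within edit distance k of it; the condensed neighborhood keeps the strings with no proper prefix in the neighborhood; the super condensed neighborhood is assumed to keep the strings with no proper substring in the neighborhood. All function names are hypothetical.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def neighborhood(pattern, k, alphabet):
    """All strings over alphabet within edit distance k of pattern."""
    m = len(pattern)
    result = set()
    for length in range(max(0, m - k), m + k + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if edit_distance(s, pattern) <= k:
                result.add(s)
    return result


def condensed(nbh):
    """Strings with no proper prefix in the neighborhood (assumed definition)."""
    return {s for s in nbh if not any(s[:i] in nbh for i in range(len(s)))}


def super_condensed(nbh):
    """Strings with no proper substring in the neighborhood (assumed definition)."""
    def proper_substrings(s):
        n = len(s)
        return {s[i:j] for i in range(n) for j in range(i, n + 1) if j - i < n}
    return {s for s in nbh if not any(t in nbh for t in proper_substrings(s))}


nbh = neighborhood("abc", 1, "abc")
print(len(nbh), sorted(condensed(nbh)), sorted(super_condensed(nbh)))
```

Under these assumed definitions the super condensed set is always a subset of the condensed set, which is consistent with the abstract's s replacing c in the running time. The enumeration here is exponential in the pattern length and is only meant for tiny examples.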

6 citations


Book ChapterDOI
05 Dec 2005
TL;DR: This work defines a more powerful set of replies to the membership queries posed by the L* algorithm, reducing the number of such queries by several orders of magnitude in a practical application.
Abstract: In this work we propose to use a more powerful teacher to effectively apply query learning algorithms to identify regular languages in practical, real-world problems. More specifically, we define a more powerful set of replies to the membership queries posed by the L* algorithm that reduces the number of such queries by several orders of magnitude in a practical application. The basic idea is to avoid the needless repetition of membership queries in cases where the reply will be negative as long as a particular condition is met by the string in the membership query. We present an example of the application of this method to a real problem, that of inferring a grammar for the structure of technical articles.
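
Illustrative sketch (not the paper's teacher or its reply format): the idea of avoiding repeated membership queries whose answer is known to be negative can be shown with a small caching wrapper. Here the "particular condition" is assumed, purely for illustration, to be "the string starts with a prefix the teacher has declared dead"; the class name, the reply shape `(accepted, dead_prefix)` and the example language a b* are all hypothetical.

```python
class CachingOracle:
    """Wraps a teacher whose negative replies may carry a dead prefix,
    meaning every string starting with that prefix is also rejected."""

    def __init__(self, teacher):
        self.teacher = teacher      # callable: str -> (bool, str or None)
        self.dead_prefixes = set()
        self.queries_asked = 0      # queries actually sent to the teacher

    def member(self, word):
        # Answer locally if any prefix of the word is already known dead.
        for i in range(len(word) + 1):
            if word[:i] in self.dead_prefixes:
                return False
        self.queries_asked += 1
        accepted, dead_prefix = self.teacher(word)
        if not accepted and dead_prefix is not None:
            self.dead_prefixes.add(dead_prefix)
        return accepted


def teacher(word):
    """Hypothetical teacher for the regular language a b*."""
    if word == "":
        return False, None          # rejected, but extensions may be accepted
    if word[0] != "a":
        return False, word[:1]      # nothing starting this way is accepted
    for i, ch in enumerate(word[1:], start=1):
        if ch != "b":
            return False, word[:i + 1]
    return True, None


oracle = CachingOracle(teacher)
for w in ["ab", "ba", "bbb", "abab", "abaa"]:
    print(w, oracle.member(w))
print("teacher queries:", oracle.queries_asked)
```

In this run five membership questions are answered with three calls to the teacher, because "bbb" and "abaa" are covered by previously reported dead prefixes. A learner such as L* would call `member` in place of querying the teacher directly.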

5 citations


Book ChapterDOI
02 Nov 2005
TL;DR: This work presents a bit-parallel algorithm based on automata which is faster, conceptually much simpler and uses less memory than the existing method.
Abstract: We present a new algorithm for generating super condensed neighbourhoods. Super condensed neighbourhoods have recently been presented as the minimal set of words that represent a pattern neighbourhood. These sets play an important role in the generation phase of hybrid algorithms for indexed approximate string matching. An existing algorithm for this purpose is based on a dynamic programming approach, implemented using bit-parallelism. In this work we present a bit-parallel algorithm based on automata which is faster, conceptually much simpler and uses less memory than the existing method.

2 citations