Showing papers by "Arlindo L. Oliveira" published in 2011


Journal ArticleDOI
TL;DR: The YEAst Search for Transcriptional Regulators And Consensus Tracking (YEASTRACT) information system, developed to support the analysis of transcription regulatory associations in Saccharomyces cerevisiae, was revisited, and detailed information on the experimental evidence that sustains those associations was added.
Abstract: The YEAst Search for Transcriptional Regulators And Consensus Tracking (YEASTRACT) information system (http://www.yeastract.com) was developed to support the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in June 2010, this database contains over 48 200 regulatory associations between transcription factors (TFs) and target genes, including 298 specific DNA-binding sites for 110 characterized TFs. All regulatory associations stored in the database were revisited, and detailed information on the experimental evidence that sustains those associations was added and classified as direct or indirect evidence. The inclusion of this new data, gathered in response to requests from YEASTRACT users, allows users to restrict their queries to subsets of the data based on whether experimental evidence exists for the direct action of the TFs in the promoter region of their target genes. Another new feature of this release is the availability of all data through a machine-readable web service interface. Users are no longer restricted to the set of queries available through the existing web interface, and can use the web service interface to query, retrieve and exploit the YEASTRACT data using their own implementation of additional functionalities. The YEASTRACT information system is further complemented with several computational tools that facilitate the use of the curated data when answering a number of important biological questions. Since its first release in 2006, YEASTRACT has been extensively used by hundreds of researchers from all over the world. We expect that by making the new data and services available, the system will continue to be instrumental for yeast biologists and systems biology researchers.

206 citations


Journal ArticleDOI
TL;DR: This article introduces the first compressed suffix tree representation that requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time.
Abstract: Suffix trees are by far the most important data structure in stringology, with a myriad of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but the linear extra bits are still unsatisfactory when σ is small as in DNA sequences. In this article, we introduce the first compressed suffix tree representation that breaks this Θ(n)-bit space barrier. The Fully Compressed Suffix Tree (FCST) representation requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time. This includes extracting arbitrary text substrings, so the FCST replaces the text using almost the same space as the compressed text. An essential ingredient of FCSTs is the lowest common ancestor (LCA) operation. We reveal important connections between LCAs and suffix tree navigation. We also describe how to make FCSTs dynamic, that is, support updates to the text. The dynamic FCST also supports several operations. In particular, it can build the static FCST within optimal space and polylogarithmic time per symbol. Our theoretical results are also validated experimentally, showing that FCSTs are very effective in practice as well.

98 citations


Journal ArticleDOI
TL;DR: Results on a large suite of benchmark data sets from the UCI repository show that fCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources.
Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (fCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that fCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources.

55 citations


Journal ArticleDOI
TL;DR: This paper provides a detailed description of RPoly, a PBO approach for the haplotype inference by pure parsimony (HIPP) problem, and an extensive evaluation of existing HIPP solvers confirms that RPoly is currently the most efficient and robust HIPP approach.
Abstract: The fast development of sequencing techniques in the recent past has created an urgent need for efficient and accurate haplotype inference tools. Besides being a crucial issue in genetics, haplotype inference is also a challenging computational problem. Among others, pure parsimony is a viable modeling approach to solve the problem of haplotype inference and also an interesting NP-hard problem in itself. Recently, the introduction of SAT-based methods, including pseudo-Boolean optimization (PBO) methods, has produced very efficient solvers. This paper provides a detailed description of RPoly, a PBO approach for the haplotype inference by pure parsimony (HIPP) problem. Moreover, an extensive evaluation of existing HIPP solvers, on a comprehensive set of instances, confirms that RPoly is currently the most efficient and robust HIPP approach.

21 citations
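
To make the HIPP problem statement above concrete, here is a minimal brute-force sketch. It is not RPoly and does not use pseudo-Boolean optimization; it simply enumerates candidate haplotype sets in order of increasing size, which is only feasible for toy inputs. The genotype encoding ('0' and '1' for homozygous sites, '2' for heterozygous sites) and the function names are assumptions made for illustration.

```python
from itertools import combinations, product


def explains(h1, h2, genotype):
    """Return True if the haplotype pair (h1, h2) is consistent with the genotype."""
    for a, b, g in zip(h1, h2, genotype):
        if g == "2":
            # Heterozygous site: the two haplotypes must differ here.
            if a == b:
                return False
        elif not (a == b == g):
            # Homozygous site: both haplotypes must carry the genotype's allele.
            return False
    return True


def hipp_brute_force(genotypes):
    """Smallest set of haplotypes such that every genotype is explained by a pair from it.

    Exponential in the number of sites; intended only to illustrate the objective.
    """
    num_sites = len(genotypes[0])
    all_haplotypes = ["".join(bits) for bits in product("01", repeat=num_sites)]
    for size in range(1, len(all_haplotypes) + 1):
        for subset in combinations(all_haplotypes, size):
            if all(
                any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                for g in genotypes
            ):
                return set(subset)
    return None


if __name__ == "__main__":
    print(sorted(hipp_brute_force(["02", "20", "22"])))
```

For the three genotypes in the usage example, genotype "02" forces haplotypes 00 and 01, and "20" forces 00 and 10, so no set of fewer than three haplotypes can work; the pair (01, 10) then explains "22" without a fourth haplotype.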


Journal ArticleDOI
TL;DR: TFRank is presented, a graph-based framework to prioritize regulatory players involved in transcriptional responses within the regulatory network of an organism, whereby every regulatory path containing genes of interest is explored and incorporated into the analysis.
Abstract: Motivation: Uncovering mechanisms underlying gene expression control is crucial to understand complex cellular responses. Studies in gene regulation often aim to identify regulatory players involved in a biological process of interest, either transcription factors coregulating a set of target genes or genes eventually controlled by a set of regulators. These are frequently prioritized with respect to a context-specific relevance score. Current approaches rely on relevance measures accounting exclusively for direct transcription factor–target interactions, namely overrepresentation of binding sites or target ratios. Gene regulation has, however, intricate behavior with overlapping, indirect effects that should not be neglected. In addition, the rapid accumulation of regulatory data already enables the prediction of large-scale networks suitable for higher level exploration by methods based on graph theory. A paradigm shift is thus emerging, where isolated and constrained analyses will likely be replaced by whole-network, systemic-aware strategies. Results: We present TFRank, a graph-based framework to prioritize regulatory players involved in transcriptional responses within the regulatory network of an organism, whereby every regulatory path containing genes of interest is explored and incorporated into the analysis. TFRank selected important regulators of yeast adaptation to stress induced by quinine and acetic acid, which were missed by a direct effect approach. Notably, they reportedly confer resistance toward the chemicals. In a preliminary study in humans, TFRank unveiled regulators involved in breast tumor growth and metastasis when applied to genes whose expression signatures correlated with short interval to metastasis. Availability: Prototype at http://kdbio.inesc-id.pt/software/tfrank/. Contact: jpg@kdbio.inesc-id.pt; sara.madeira@ist.utl.pt Supplementary Information: Supplementary data are available at Bioinformatics online.

18 citations


Journal ArticleDOI
TL;DR: This work presents an efficient method for the local alignment of pyrosequencing reads produced by the GS FLX (454) system against a reference sequence, which outperforms a number of mainstream tools on the quantity and quality of successful alignments, as well as on the execution time.
Abstract: Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the huge volume of data produced, but also because of some of their specific characteristics such as read length and sequencing errors. Among the most critical problems is that of efficiently and accurately mapping reads to a reference genome in the context of re-sequencing projects. We present an efficient method for the local alignment of pyrosequencing reads produced by the GS FLX (454) system against a reference sequence. Our approach exploits the characteristics of the data in these re-sequencing applications and uses state-of-the-art indexing techniques combined with a flexible seed-based approach, leading to a fast and accurate algorithm which needs very little user parameterization. An evaluation performed using real and simulated data shows that our proposed method outperforms a number of mainstream tools on the quantity and quality of successful alignments, as well as on the execution time. The proposed methodology was implemented in a software tool called TAPyR (Tool for the Alignment of Pyrosequencing Reads), which is publicly available from http://www.tapyr.net.

14 citations


Journal ArticleDOI
TL;DR: The qualitative modelling and simulation of the transcriptional regulatory network controlling the response of the model eukaryote Saccharomyces cerevisiae to the agricultural fungicide mancozeb resulted in a computable model, permitting a quick and cost-effective test of hypotheses prior to experimental validation.
Abstract: Background: Qualitative models help to understand the relation between the structure and the dynamics of gene regulatory networks. The dynamical properties of these models can be automatically analysed by means of formal verification methods, like model checking. This facilitates the model-validation process and the test of new hypotheses to reconcile model predictions with the experimental data. Results: The authors report in this study the qualitative modelling and simulation of the transcriptional regulatory network controlling the response of the model eukaryote Saccharomyces cerevisiae to the agricultural fungicide mancozeb. The model allowed the analysis of the regulation level and activity of the components of the mancozeb-induced gene network controlling the transcriptional activation of the FLR1 gene, which is proposed to confer multidrug resistance through its putative role as a drug efflux pump. Formal verification analysis of the network allowed us to confront model predictions with the experimental data and to assess the model robustness to parameter ordering and gene deletion. Conclusions: This analysis enabled us to better understand the mechanisms regulating the FLR1 gene mancozeb response and confirmed the need for a new transcription factor for the full transcriptional activation of YAP1. The result is a computable model of the FLR1 gene response to mancozeb, permitting a quick and cost-effective test of hypotheses prior to experimental validation.

8 citations


Journal ArticleDOI
TL;DR: Post-processing the output of combinatorial algorithms by incorporating prior information leads to a very efficient and effective motif discovery method, and combining priors from different sources is even more beneficial than considering them separately.
Abstract: Position-specific priors (PSP) have been used with success to boost EM and Gibbs sampler-based motif discovery algorithms. PSP information has been computed from different sources, including orthologous conservation, DNA duplex stability, and nucleosome positioning. Prior information has not yet been used in the context of combinatorial algorithms. Moreover, priors have been used only independently, and the gain of combining priors from different sources has not yet been studied. We extend RISOTTO, a combinatorial algorithm for motif discovery, by post-processing its output with a greedy procedure that uses prior information. PSPs from different sources are combined into a scoring criterion that guides the greedy search procedure. The resulting method, called GRISOTTO, was evaluated over 156 yeast TF ChIP-chip sequence-sets commonly used to benchmark prior-based motif discovery algorithms. Results show that GRISOTTO is at least as accurate as twelve other state-of-the-art approaches for the same task, even without combining priors. Furthermore, by considering combined priors, GRISOTTO is considerably more accurate than the state-of-the-art approaches for the same task. We also show that PSPs improve GRISOTTO's ability to retrieve motifs from mouse ChIP-seq data, indicating that the proposed algorithm can be applied to data from a different technology and for a higher eukaryote. The conclusions of this work are twofold. First, post-processing the output of combinatorial algorithms by incorporating prior information leads to a very efficient and effective motif discovery method. Second, combining priors from different sources is even more beneficial than considering them separately.

8 citations


Book ChapterDOI
01 Jan 2011
TL;DR: New data structures and a new implementation of the CNM algorithm, a well-known agglomerative greedy algorithm for finding community structure in large networks, are proposed; results show that the improved data structures speed up the method by a large factor, making it competitive with other state-of-the-art algorithms.
Abstract: Community detection or graph clustering is an important problem in the analysis of computer networks, social networks, biological networks and many other natural and artificial networks. These networks are in general very large and, thus, finding hidden structures and functional modules is a very hard task. In this paper we propose new data structures and a new implementation of a well-known agglomerative greedy algorithm to find community structure in large networks, the CNM algorithm. The experimental results show that the improved data structures speed up the method by a large factor, for large networks, making it competitive with other state-of-the-art algorithms.

7 citations
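
The agglomerative greedy idea behind CNM can be sketched as follows: start with every vertex in its own community and repeatedly merge the pair of communities that increases modularity the most, stopping when no merge helps. The sketch below deliberately recomputes modularity from scratch for each candidate merge, so it has none of the efficient data structures the chapter is about; the function names and the edge-list input format are assumptions for illustration only.

```python
from itertools import combinations


def modularity(edges, community_of):
    """Modularity Q = sum over communities of (internal_edges/m - (degree_sum/(2m))^2)."""
    m = len(edges)
    internal = {}
    degree = {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


def greedy_communities(edges):
    """Naive greedy agglomeration: merge the best pair until modularity stops improving."""
    nodes = sorted({x for edge in edges for x in edge})
    community_of = {v: v for v in nodes}  # every vertex starts alone
    best_q = modularity(edges, community_of)
    while True:
        labels = sorted(set(community_of.values()))
        best = None
        for a, b in combinations(labels, 2):
            # Tentatively relabel community b as community a.
            trial = {v: (a if c == b else c) for v, c in community_of.items()}
            q = modularity(edges, trial)
            if q > best_q + 1e-12 and (best is None or q > best[0]):
                best = (q, trial)
        if best is None:
            return community_of, best_q
        best_q, community_of = best


if __name__ == "__main__":
    # Two triangles joined by a single edge (2, 3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    print(greedy_communities(edges))
```

On the two-triangle example the procedure ends with one community per triangle and a modularity of 5/14. An efficient implementation would instead maintain the modularity gain of each candidate merge incrementally, which is where the choice of data structures matters.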


Journal ArticleDOI
TL;DR: This work addresses the problem of finding a mathematical model for the genetic network regulating the stress response of the yeast Saccharomyces cerevisiae to the fungicide mancozeb and achieves partial success when trained on the non-mutant datasets.
Abstract: In this study we address the problem of finding a quantitative mathematical model for the genetic network regulating the stress response of the yeast Saccharomyces cerevisiae to the agricultural fungicide mancozeb. An S-system formalism was used to model the interactions of a five-gene network encoding four transcription factors (Yap1, Yrr1, Rpn4 and Pdr3) regulating the transcriptional activation of the FLR1 gene. Parameter estimation was accomplished by decoupling the resulting system of nonlinear ordinary differential equations into a larger nonlinear algebraic system, and using the Levenberg-Marquardt algorithm to fit the model's predictions to experimental data. The introduction of constraints in the model, related to the putative topology of the network, was explored. The results show that forcing the network connectivity to adhere to this topology did not lead to better results than the ones obtained using an unrestricted network topology. Overall, the modeling approach obtained partial success when trained on the non-mutant datasets, although further work is required if one wishes to obtain more accurate prediction of the time courses.

4 citations
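
As a rough illustration of the S-system formalism mentioned above, each variable follows dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij. The sketch below evaluates that right-hand side and the residuals of a decoupled fit, assuming that decoupling means replacing each derivative with a slope estimated from the measured time course; that reading, the parameter names, and the toy values are assumptions, not details taken from the paper.

```python
import math


def s_system_rhs(x, alpha, beta, g, h):
    """Right-hand side of an S-system: production term minus degradation term."""
    n = len(x)
    derivatives = []
    for i in range(n):
        production = alpha[i] * math.prod(x[j] ** g[i][j] for j in range(n))
        degradation = beta[i] * math.prod(x[j] ** h[i][j] for j in range(n))
        derivatives.append(production - degradation)
    return derivatives


def decoupled_residuals(x_samples, slopes, alpha, beta, g, h):
    """Algebraic residuals: estimated slope minus model derivative, per sample and variable.

    x_samples[k] holds the measured state at time point k and slopes[k] the
    slopes estimated from the data at that point (hypothetical inputs).
    """
    residuals = []
    for x, slope in zip(x_samples, slopes):
        model = s_system_rhs(x, alpha, beta, g, h)
        residuals.extend(s - f for s, f in zip(slope, model))
    return residuals


if __name__ == "__main__":
    # Toy two-variable system with made-up parameters.
    alpha, beta = [1.0, 2.0], [0.5, 1.0]
    g = [[0.0, 0.5], [1.0, 0.0]]
    h = [[1.0, 0.0], [0.0, 1.0]]
    print(s_system_rhs([2.0, 4.0], alpha, beta, g, h))
```

A least-squares routine would then minimise the sum of squared residuals over the parameters; SciPy's `scipy.optimize.least_squares` accepts `method="lm"` for a Levenberg-Marquardt solver, though whether the authors used that library is not stated.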


Proceedings ArticleDOI
29 Mar 2011
TL;DR: This paper proposes a new efficient algorithm to update the sliding window each time a token is output; the algorithm toggles between two SAs on consecutive tokens, and the resulting encoder requires less memory than tree-based encoders.
Abstract: The sliding window dictionary-based algorithms of the Lempel-Ziv (LZ) 77 family are widely used for universal lossless data compression. The encoding component of these algorithms performs repeated substring search. Data structures such as hash tables, binary search trees, and suffix trees have been used to speed up these searches, at the expense of memory usage. Previous work has shown how suffix arrays (SA) can be used for dictionary representation and LZ77 decomposition. In this paper, we improve over that work by proposing a new efficient algorithm to update the sliding window each time a token is output. Our algorithm toggles between two SAs on consecutive tokens. The resulting SA-based encoder requires less memory than tree-based encoders. We compare our technique against tree-based encoders, on a large set of benchmark files. In some cases, our encoder is also faster than tree-based encoders.
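
For readers unfamiliar with the repeated substring search that the encoder performs, the following is a minimal sliding-window LZ77 sketch. It finds the longest match by scanning the window naively, which is exactly the step that hash tables, trees, or suffix arrays are meant to accelerate; it does not implement the suffix-array technique of the paper. The token layout (offset, length, next symbol), the default window size, and the function names are assumptions for illustration.

```python
def lz77_encode(text, window_size=16):
    """Encode text as (offset, length, next_symbol) tokens using a naive window search."""
    tokens = []
    i = 0
    while i < len(text):
        start = max(0, i - window_size)
        best_offset, best_length = 0, 0
        for j in range(start, i):
            length = 0
            # Matches may overlap the lookahead; one symbol is kept back as the literal.
            while i + length < len(text) - 1 and text[j + length] == text[i + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        tokens.append((best_offset, best_length, text[i + best_length]))
        i += best_length + 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the text by copying `length` symbols from `offset` back, then the literal."""
    out = []
    for offset, length, symbol in tokens:
        for _ in range(length):
            out.append(out[-offset])
        out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    sample = "abracadabra abracadabra"
    encoded = lz77_encode(sample)
    print(encoded)
    print(lz77_decode(encoded) == sample)
```

Each emitted token advances the window, so whatever index supports the search has to be updated after every token; that update step is the cost the paper targets.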

Book ChapterDOI
01 Jan 2011
TL;DR: This approach generalizes several notions of graph core proposed independently in the literature, introducing a general and theoretically sound framework for the study of fully generalized graph cores and discussing emerging applications of graph cores, such as improved graph clustering methods and complex network motif detection.
Abstract: A core in a graph is usually taken as a set of highly connected vertices. Although general, this definition is intuitive and useful for studying the structure of many real networks. Nevertheless, depending on the problem, different formulations of graph core may be required, leading us to the known concept of generalized core. In this paper we study and further extend the notion of generalized core. Given a graph, we propose a definition of graph core based on a subset of its subgraphs and on a subgraph property function. Our approach generalizes several notions of graph core proposed independently in the literature, introducing a general and theoretically sound framework for the study of fully generalized graph cores. Moreover, we discuss emerging applications of graph cores, such as improved graph clustering methods and complex network motif detection.
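
The most familiar special case of a graph core is the k-core: the largest subgraph in which every vertex has at least k neighbours inside the subgraph. The sketch below computes it by repeatedly peeling low-degree vertices. It covers only this classic degree-based case, not the generalized definition based on subgraph property functions that the chapter proposes; the adjacency-dictionary input and the function name are assumptions.

```python
def k_core(adjacency, k):
    """Return the vertex set of the k-core of an undirected graph.

    `adjacency` maps each vertex to an iterable of its neighbours and is
    assumed to be symmetric (u lists v whenever v lists u).
    """
    remaining = {v: set(neighbours) for v, neighbours in adjacency.items()}
    queue = [v for v, neighbours in remaining.items() if len(neighbours) < k]
    while queue:
        v = queue.pop()
        if v not in remaining:
            continue  # already peeled through an earlier queue entry
        for u in remaining.pop(v):
            if u in remaining:
                remaining[u].discard(v)
                if len(remaining[u]) < k:
                    queue.append(u)
    return set(remaining)


if __name__ == "__main__":
    # Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(k_core(graph, 2))
```

In the usage example the pendant vertex is peeled first, leaving the triangle as the 2-core. A generalized core would replace the degree test with another vertex or subgraph property.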

Proceedings ArticleDOI
01 Mar 2011
TL;DR: The transcription regulatory network underlying FLR1 activation was defined based on experimental data and a mathematical model describing this network was built and its response to mancozeb stress in different genetic backgrounds was simulated, using the Genetic Network Analyzer software.
Abstract: Multidrug resistance (MDR), a phenomenon with impact on Human Health and on Agro-Food and Environmental Biotechnology, often results from the activation of drug efflux pumps, which are frequently controlled at the transcriptional level. The complex transcriptional control of these genes has been the focus of our research, guided by the information gathered in the YEASTRACT database. In this paper, the approach used to elucidate the transcriptional control of FLR1, encoding a Saccharomyces cerevisiae Drug:H+ Antiporter, in response to stress induced by the fungicide mancozeb is explained. The transcription regulatory network underlying FLR1 activation was defined based on experimental data. Subsequently, a mathematical model describing this network was built and its response to mancozeb stress in different genetic backgrounds was simulated, using the Genetic Network Analyzer (GNA) software. This approach allowed the identification of essential features of the transition from unstressed to fungicide-stressed cells and yielded new predictions on the dynamical behavior of the system, which were validated experimentally. This work provides a good example of the successful combination of experimental and computational approaches in a systems biology perspective.

Proceedings ArticleDOI
26 Jul 2011
TL;DR: This work addresses the problem of finding a mathematical model for the genetic network regulating the stress response of the yeast Saccharomyces cerevisiae to the fungicide mancozeb and achieves partial success when trained on the non-mutant datasets.
Abstract: We address the problem of finding a mathematical model for the genetic network regulating the stress response of the yeast Saccharomyces cerevisiae to the fungicide mancozeb. An S-system formalism was used to model the interactions of this 5-gene network. Parameter estimation was accomplished by decoupling the resulting system of nonlinear ordinary differential equations into a larger nonlinear algebraic system, and using the Levenberg-Marquardt algorithm to fit the model's predictions to experimental data. The introduction of constraints in the model, related to the putative topology of the network, was explored. The results show that forcing the network connectivity to adhere to this topology did not lead to better results than the ones obtained using an unrestricted network topology. Overall, the modeling approach obtained partial success when trained on the non-mutant datasets, although further work is required if one wishes to obtain more accurate prediction of the time courses.