
Showing papers in "Journal of Bioinformatics and Computational Biology in 2009"


Journal ArticleDOI
TL;DR: This work presents an exact algorithm, based on techniques from the field of Model Checking, for finding control policies for Boolean Networks (BN) with control nodes, and extends it to automatically identify a set of Boolean transfer functions that reproduce the qualitative behavior of gene regulatory networks.
Abstract: We present an exact algorithm, based on techniques from the field of Model Checking, for finding control policies for Boolean Networks (BN) with control nodes. Given a BN, a set of starting states, I, a set of goal states, F, and a target time, t, our algorithm automatically finds a sequence of control signals that deterministically drives the BN from I to F at or before time t, or else guarantees that no such policy exists. Despite recent hardness results for finding control policies for BNs, we show that, in practice, our algorithm runs in seconds to minutes on over 13,400 BNs of varying sizes and topologies, including a BN model of embryogenesis in Drosophila melanogaster with 15,360 Boolean variables. We then extend our method to automatically identify a set of Boolean transfer functions that reproduce the qualitative behavior of gene regulatory networks. Specifically, we automatically learn a BN model of D. melanogaster embryogenesis in 5.3 seconds, from a space containing 6.9 × 10^10 possible models.
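The control-policy search the abstract describes can be illustrated with a toy sketch: a brute-force bounded search over control sequences for a two-variable BN. The update rules and the control wiring below are invented for illustration only; the paper's actual algorithm uses symbolic Model Checking rather than explicit enumeration.

```python
from itertools import product

# Hypothetical 2-gene chain driven by one control input u:
#   x1' = u,  x2' = x1   (illustrative update rules, not from the paper)
def step(state, u):
    x1, x2 = state
    return (u, x1)

def find_policy(start_states, goal_states, t):
    """Search all control sequences of length <= t for one that drives
    EVERY start state into the goal set (a deterministic policy)."""
    for horizon in range(1, t + 1):
        for controls in product([False, True], repeat=horizon):
            states = set(start_states)
            for u in controls:
                states = {step(s, u) for s in states}
                if states <= set(goal_states):   # all runs are inside F
                    return list(controls)        # goal reached at or before t
    return None  # no policy exists within the bound

policy = find_policy({(False, False)}, {(True, True)}, t=4)
# -> [True, True]
```

Exhaustive enumeration is exponential in the horizon, which is exactly why the paper resorts to Model Checking machinery for networks with thousands of variables.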

65 citations


Journal ArticleDOI
TL;DR: This review takes a fresh look at a specific family of models used for constructing genetic networks, the so-called Boolean networks, and outlines the various different types of Boolean network developed to date.
Abstract: The modeling of genetic networks especially from microarray and related data has become an important aspect of the biosciences. This review takes a fresh look at a specific family of models used for constructing genetic networks, the so-called Boolean networks. The review outlines the various different types of Boolean network developed to date, from the original Random Boolean Network to the current Probabilistic Boolean Network. In addition, some of the different inference methods available to infer these genetic networks are also examined. Where possible, particular attention is paid to input requirements as well as the efficiency, advantages and drawbacks of each method. Though the Boolean network model is one of many models available for network inference today, it is well established and remains a topic of considerable interest in the field of genetic network inference. Hybrids of Boolean networks with other approaches may well be the way forward in inferring the most informative networks.

50 citations


Journal ArticleDOI
TL;DR: It is demonstrated that Hoeffding's D measure outperforms Pearson's and Spearman's approaches and is a more powerful tool for identifying nonlinear and non-monotonic associations.
Abstract: DNA microarrays have become a powerful tool to describe gene expression profiles associated with different cellular states, various phenotypes and responses to drugs and other extra- or intra-cellu...
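As background for the comparison, here is a minimal sketch (hand-rolled coefficients in plain Python, not any particular statistics library) of why measures aimed at linear or monotonic trends can miss a perfect non-monotonic dependence. Hoeffding's D itself is not implemented here.

```python
import statistics

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Pearson correlation of the ranks (ties broken by position).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))

x = [i / 10 for i in range(-50, 51)]   # symmetric around 0
y = [v * v for v in x]                 # perfect but non-monotonic dependence
# Both coefficients are ~0 even though y is a deterministic function of x,
# which is the kind of association a measure like Hoeffding's D can detect.
```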

47 citations


Journal ArticleDOI
TL;DR: A new coherence measure called scaling mean squared residue (SMSR) is proposed and it has been proved that the proposed new measure is able to detect the scaling patterns effectively and it is invariant to local or global scaling of the input dataset.
Abstract: Biclustering methods are used to identify a subset of genes that are co-regulated in a subset of experimental conditions in microarray gene expression data. Many biclustering algorithms rely on optimizing mean squared residue to discover biclusters from a gene expression dataset. Recently it has been proved that mean squared residue is only good in capturing constant and shifting biclusters. However, scaling biclusters cannot be detected using this metric. In this article, a new coherence measure called scaling mean squared residue (SMSR) is proposed. Theoretically it has been proved that the proposed new measure is able to detect the scaling patterns effectively and it is invariant to local or global scaling of the input dataset. The effectiveness of the proposed coherence measure in detecting scaling patterns has been demonstrated experimentally on artificial and real-life benchmark gene expression datasets. Moreover, biological significance tests have been conducted to show that the biclusters identified using the proposed measure are composed of functionally enriched sets of genes.
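The limitation of mean squared residue that motivates the paper can be seen in a few lines: the classical Cheng-Church MSR is exactly zero on an additive (shifting) bicluster but not on a multiplicative (scaling) one. The toy matrices below are invented for illustration; the SMSR measure itself is not reproduced here.

```python
def msr(M):
    """Cheng-Church mean squared residue of a bicluster matrix M."""
    n, m = len(M), len(M[0])
    row = [sum(r) / m for r in M]
    col = [sum(M[i][j] for i in range(n)) / n for j in range(m)]
    all_mean = sum(row) / n
    return sum((M[i][j] - row[i] - col[j] + all_mean) ** 2
               for i in range(n) for j in range(m)) / (n * m)

base = [1.0, 2.0, 3.0]
shifting = [[b + s for s in (0.0, 5.0, 9.0)] for b in base]  # additive pattern
scaling  = [[b * s for s in (1.0, 5.0, 9.0)] for b in base]  # multiplicative pattern

# MSR is ~0 for the shifting bicluster but large for the scaling one,
# which is why a scaling-aware coherence measure such as SMSR is needed.
```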

44 citations


Journal ArticleDOI
TL;DR: The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low, and this improvement is especially useful for metagenomic projects in which genome assembly does not work because of low sequence coverage.
Abstract: Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. The current analyses of metagenomics data largely rely on the computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads can be searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The whole computational framework consists of three steps. Each read from a metagenomics project will first be annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e. ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, the ORFome assembly significantly increases the sensitivity of homology searching, and may potentially improve the diversity analysis of the metagenomic data. This improvement is especially useful for metagenomic projects when the genome assembly does not work because of the low sequence coverage.
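The first step of the pipeline, annotating a read with putative ORFs, can be sketched naively: scan each reading frame for an ATG-to-stop stretch. This is only a minimal stand-in for the paper's ORF prediction step (a real pipeline would also scan the reverse complement, handle partial ORFs at read ends, and translate to peptides).

```python
STOP = {"TAA", "TAG", "TGA"}

def orfs(read, min_codons=3):
    """Return ATG-to-stop ORFs (as nucleotide substrings) found in the
    three forward reading frames of a short read."""
    found = []
    for frame in range(3):
        codons = [read[i:i + 3] for i in range(frame, len(read) - 2, 3)]
        start = None
        for idx, c in enumerate(codons):
            if start is None and c == "ATG":
                start = idx
            elif start is not None and c in STOP:
                if idx - start >= min_codons:     # skip very short ORFs
                    found.append("".join(codons[start:idx + 1]))
                start = None
    return found

print(orfs("CCATGAAATTTTAGC"))  # ORF found in frame 2
```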

40 citations


Journal ArticleDOI
TL;DR: This work presents an efficient method for finding potential ncRNAs in bacteria by clustering genomic sequences based on homology inferred from both primary sequence and secondary structure, and shows promise for discovering new families of ncRNA.
Abstract: Non-coding RNAs (ncRNAs) are transcripts that do not code for proteins. Recent findings have shown that RNA-mediated regulatory mechanisms influence a substantial portion of typical microbial genom...

40 citations


Journal ArticleDOI
TL;DR: This paper reviews some representative algorithms in identifying functional modules in PPI networks, focusing on the algorithms underlying the approaches and how the algorithms relate to each other.
Abstract: Protein–Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. When studying the workings of a biological cell, it is useful to be able to detect known and predict still undiscovered protein complexes within the cell's PPI networks. Such predictions may be used as an inexpensive tool to direct biological experiments. The increasing amount of available PPI data necessitates a fast, accurate approach to biological complex identification. Because of its importance in the study of protein interaction networks, many models and algorithms have been proposed for identifying functional modules in PPI networks. In this paper, we review some representative algorithms, focusing on the algorithms underlying the approaches and how the algorithms relate to each other. In particular, a comparison is given based on the properties of the algorithms. Since PPI networks are noisy and still incomplete, some methods that consider additional properties for preprocessing and purifying PPI data are also presented. We also give a discussion about the functional annotation and validation of protein complexes. Finally, new progress and future research directions are discussed from the computational viewpoint.

38 citations


Journal ArticleDOI
TL;DR: The genome halving problem, previously solved by El-Mabrouk for inversions and reciprocal translocations, is here solved in a more general context allowing transpositions and block interchange as well, for genomes including multiple linear and circular chromosomes.
Abstract: The genome halving problem, previously solved by El-Mabrouk for inversions and reciprocal translocations, is here solved in a more general context allowing transpositions and block interchange as well, for genomes including multiple linear and circular chromosomes. We apply this to several datasets and compare the results to the previous algorithm.

33 citations


Journal ArticleDOI
TL;DR: This work gives an exact algorithm for constructing level-1 networks consistent with a maximum number of input triplets, and proves that for all k ≥ 1 it is NP-hard to construct a level-k network consistent with all input triplets.
Abstract: Phylogenetic networks provide a way to describe and visualize evolutionary histories that have undergone so-called reticulate evolutionary events such as recombination, hybridization or horizontal gene transfer. The level k of a network determines how non-treelike the evolution can be, with level-0 networks being trees. We study the problem of constructing level-k phylogenetic networks from triplets, i.e. phylogenetic trees for three leaves (taxa). We give, for each k, a level-k network that is uniquely defined by its triplets. We demonstrate the applicability of this result by using it to prove that (1) for all k ≥ 1 it is NP-hard to construct a level-k network consistent with all input triplets, and (2) for all k it is NP-hard to construct a level-k network consistent with a maximum number of input triplets, even when the input is dense. As a response to this intractability we give an exact algorithm for constructing level-1 networks consistent with a maximum number of input triplets.
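The triplets that serve as input to such methods can be read off a rooted tree directly: for every three leaves, the pair whose lowest common ancestor is deepest forms the "cherry" of the triplet. A small sketch (nested-tuple trees, invented for illustration):

```python
from itertools import combinations

def triplets(tree):
    """Enumerate rooted triplets of a rooted tree given as nested tuples,
    e.g. ((('a','b'),'c'),'d'). A result ('x','y','z') encodes xy|z: the
    lowest common ancestor of x and y lies strictly below that of x,z and y,z."""
    paths = {}  # leaf -> list of subtrees on the root-to-leaf path
    def walk(node, ancestors):
        if isinstance(node, tuple):
            for child in node:
                walk(child, ancestors + [node])
        else:
            paths[node] = ancestors + [node]
    walk(tree, [])
    def lca_depth(a, b):  # length of the shared root-to-leaf prefix
        d = 0
        while d < len(paths[a]) and d < len(paths[b]) and paths[a][d] == paths[b][d]:
            d += 1
        return d
    out = set()
    for x, y, z in combinations(sorted(paths), 3):
        dxy, dxz, dyz = lca_depth(x, y), lca_depth(x, z), lca_depth(y, z)
        if dxy > dxz and dxy > dyz:
            out.add((x, y, z))        # xy|z
        elif dxz > dxy and dxz > dyz:
            out.add((x, z, y))        # xz|y
        elif dyz > dxy and dyz > dxz:
            out.add((y, z, x))        # yz|x
    return out
```

A "dense" input in the abstract's sense is one containing a triplet for every three taxa, as this enumeration produces.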

31 citations


Journal ArticleDOI
TL;DR: A novel statistical modeling technique for target property prediction, with applications to virtual screening and drug design, and a novel graph kernel function to utilize the topology features to build predictive models for chemicals via Support Vector Machine classifier.
Abstract: In this paper, we introduce a novel statistical modeling technique for target property prediction, with applications to virtual screening and drug design. In our method, we use graphs to model chemical structures and apply a wavelet analysis of graphs to summarize features capturing graph local topology. We design a novel graph kernel function to utilize the topology features to build predictive models for chemicals via a Support Vector Machine classifier. We call the new graph kernel a graph wavelet-alignment kernel. We have evaluated the efficacy of the wavelet-alignment kernel using a set of chemical structure–activity prediction benchmarks. Our results indicate that the use of the kernel function yields performance profiles comparable to, and sometimes exceeding, those of existing state-of-the-art chemical classification approaches. In addition, our results also show that the use of wavelet functions significantly decreases the computational costs for graph kernel computation, with a more than ten-fold speedup.

28 citations


Journal ArticleDOI
TL;DR: This study shows how the canonical GMA and S-system models in BST can be directly implemented in a standard Petri Net framework and validate the hybrid modeling approach through comparative analyses and simulations with other approaches and highlight the feasibility, quality, and efficiency of the hybrid method.
Abstract: Many biological systems are genuinely hybrids consisting of interacting discrete and continuous components and processes that often operate at different time scales. It is therefore desirable to create modeling frameworks capable of combining differently structured processes and permitting their analysis over multiple time horizons. During the past 40 years, Biochemical Systems Theory (BST) has been a very successful approach to elucidating metabolic, gene regulatory, and signaling systems. However, its foundation in ordinary differential equations has precluded BST from directly addressing problems containing switches, delays, and stochastic effects. In this study, we extend BST to hybrid modeling within the framework of Hybrid Functional Petri Nets (HFPN). First, we show how the canonical GMA and S-system models in BST can be directly implemented in a standard Petri Net framework. In a second step we demonstrate how to account for different types of time delays as well as for discrete, stochastic, and switching effects. Using representative test cases, we validate the hybrid modeling approach through comparative analyses and simulations with other approaches and highlight the feasibility, quality, and efficiency of the hybrid method.
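The continuous side of such hybrid models, the canonical S-system form of BST, is a pair of products of power laws per variable. A minimal explicit-Euler sketch (the two-variable cascade and all rate constants below are hypothetical, chosen only so the steady state is easy to verify):

```python
def s_system_step(X, alpha, g, beta, h, dt):
    """One explicit-Euler step of an S-system:
       dX_i/dt = alpha_i * prod_j X_j**g[i][j] - beta_i * prod_j X_j**h[i][j]"""
    n = len(X)
    def power_law(rate, exps):
        v = rate
        for j in range(n):
            v *= X[j] ** exps[j]
        return v
    return [X[i] + dt * (power_law(alpha[i], g[i]) - power_law(beta[i], h[i]))
            for i in range(n)]

# Hypothetical cascade: X1 produced at a constant rate, degraded linearly;
# X2 produced from X1, degraded linearly.
alpha, beta = [2.0, 1.0], [1.0, 1.0]
g = [[0, 0], [1, 0]]      # production exponents
h = [[1, 0], [0, 1]]      # degradation exponents
X = [0.5, 0.5]
for _ in range(10000):
    X = s_system_step(X, alpha, g, beta, h, dt=0.01)
# steady state: X1 -> alpha1/beta1 = 2, and then X2 -> X1 = 2
```

Mapping each such production and degradation term to a Petri-net transition, as the paper does, is what opens the door to adding discrete, delayed, and stochastic events.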

Journal ArticleDOI
TL;DR: The review shows that simple linear rules govern the response behavior of biological networks in an ensemble of cells, where without the requirement of detailed in vivo physiological parameters, the analysis of temporal concentration or activation response unravels biological network features.
Abstract: Complex living systems have shown remarkably well-orchestrated, self-organized, robust, and stable behavior under a wide range of perturbations. However, despite the recent generation of high-throu...

Journal ArticleDOI
TL;DR: The proposed approach to computational PPI detection is a promising methodology for mediating between structural studies and systems biology by utilizing cumulative protein structure data for pathway analysis.
Abstract: We propose a computational screening system of protein-protein interactions using tertiary structure data. Our system combines all-to-all protein docking and clustering to find interacting protein pairs. We tuned our prediction system by applying various parameters and clustering algorithms and succeeded in outperforming previous methods. This method was also applied to a biological pathway estimation problem to show its use in network level analysis. The structural data were collected from the Protein Data Bank, PDB. Then all-to-all docking among target protein structures was conducted using a conventional protein-protein docking software package, ZDOCK. The highest-ranked 2000 decoys were clustered based on structural similarity among the predicted docking forms. The features of generated clusters were analyzed to estimate the biological relevance of protein-protein interactions. Our system achieves a best F-measure value of 0.43 when applied to a subset of general protein-protein docking benchmark data. The same system was applied to protein data in a bacterial chemotaxis pathway, utilizing essentially the same parameter set as the benchmark data. We obtained 0.45 for the F-measure value. The proposed approach to computational PPI detection is a promising methodology for mediating between structural studies and systems biology by utilizing cumulative protein structure data for pathway analysis.
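The clustering stage of such a screen is often a simple greedy procedure: repeatedly pick the decoy with the most neighbours within a distance cutoff as a cluster centre. A toy sketch with 1-D stand-ins for docking decoys (the real systems cluster by structural similarity such as RMSD):

```python
def greedy_cluster(decoys, dist, cutoff):
    """Greedy clustering: take the decoy with the most neighbours within
    `cutoff` as a cluster centre, remove it with its neighbours, repeat."""
    remaining = set(range(len(decoys)))
    clusters = []
    while remaining:
        best = max(remaining,
                   key=lambda i: sum(dist(decoys[i], decoys[j]) <= cutoff
                                     for j in remaining))
        members = {j for j in remaining
                   if dist(decoys[best], decoys[j]) <= cutoff}
        clusters.append(sorted(members))
        remaining -= members
    return clusters

# 1-D toy "decoys" with absolute difference as the distance measure
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
clusters = greedy_cluster(points, lambda a, b: abs(a - b), cutoff=0.5)
# -> [[0, 1, 2], [3, 4], [5]]
```

Large, tight clusters of decoys are then taken as evidence that the docked pair interacts, which is the signal the paper's F-measure evaluates.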

Journal ArticleDOI
TL;DR: A novel operon prediction method that is applicable to any prokaryotic genome with high prediction accuracy and has higher prediction sensitivity as well as specificity than most of the published methods.
Abstract: Identification of operons at the genome scale of prokaryotic organisms represents a key step in deciphering their transcriptional regulation machinery, biological pathways, and networks. While numerous computational methods have been shown to be effective in predicting operons for well-studied organisms such as Escherichia coli K12 and Bacillus subtilis 168, these methods generally do not generalize well to genomes other than the ones used to train the methods, or their close relatives, because they rely on organism-specific information. Several methods have been explored to address this problem through utilizing only genomic structural information conserved across multiple organisms, but they all suffer from the issue of low prediction sensitivity. In this paper, we report a novel operon prediction method that is applicable to any prokaryotic genome with high prediction accuracy. The key idea of the method is to predict operons through identification of conserved gene clusters across multiple genomes and through deriving a key parameter relevant to the distribution of intergenic distances in genomes. We have implemented this method using a graph-theoretic approach, to calculate a set of maximum gene clusters in the target genome that are conserved across multiple reference genomes. Our computational results have shown that this method has higher prediction sensitivity as well as specificity than most of the published methods. We have carried out a preliminary study on operons unique to archaea and bacteria, respectively, and derived a number of interesting new insights about operons between these two kingdoms. The software and predicted operons of 365 prokaryotic genomes are available at .
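The intergenic-distance part of the idea can be sketched in isolation: group consecutive same-strand genes whose gap is below a threshold. The gene coordinates and the 50 bp cutoff below are invented for illustration; the paper derives its distance parameter from data and combines it with conservation across reference genomes.

```python
def predict_operons(genes, max_gap=50):
    """Group consecutive same-strand genes with intergenic distance
    <= max_gap (bp) into predicted operons. `genes` is a position-sorted
    list of (name, start, end, strand) tuples."""
    operons, current = [], [genes[0]]
    for prev, gene in zip(genes, genes[1:]):
        same_strand = prev[3] == gene[3]
        gap = gene[1] - prev[2]
        if same_strand and gap <= max_gap:
            current.append(gene)          # extend the current operon
        else:
            operons.append([g[0] for g in current])
            current = [gene]
    operons.append([g[0] for g in current])
    return operons

genes = [("a", 0, 900, "+"), ("b", 920, 1800, "+"),
         ("c", 1830, 2500, "+"), ("d", 2900, 3600, "-")]
# -> [['a', 'b', 'c'], ['d']]
```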

Journal ArticleDOI
TL;DR: A novel method for modeling signaling pathways from PPI networks automatically is presented, formalized and solved as a mixed integer linear programming (MILP) model, which is simple in algorithm and efficient in computation.
Abstract: Signal transduction is an important process that controls cell proliferation, metabolism, differentiation, and so on. Effective computational models which unravel such a process by taking advantage of high-throughput genomic and proteomic data are in high demand for understanding the essential mechanisms underlying signal transduction. Since protein-protein interaction (PPI) plays an important role in signal transduction, in this paper, we present a novel method for modeling signaling pathways from PPI networks automatically. Given an undirected weighted protein interaction network, finding signaling pathways is treated as searching for optimal subnetworks according to some cost function. To cope with this optimization problem, a network flow model is proposed in this work to extract signaling pathways from protein interaction networks. In particular, the network flow model is formalized and solved as a mixed integer linear programming (MILP) model, which is simple in algorithm and efficient in computation. The numerical results on two known yeast MAPK signaling pathways demonstrate the efficiency and effectiveness of the proposed method.
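A much-simplified relative of this idea, often used as a baseline, is to find the single highest-reliability path between a receptor and a target protein: maximising the product of edge confidences equals minimising the sum of their negative logs, so plain Dijkstra applies. The protein names below are from the yeast pheromone MAPK pathway, but the confidence weights are invented; the paper's MILP network-flow model extracts whole subnetworks, not just one path.

```python
import heapq, math

def best_path(edges, source, target):
    """Highest-confidence path in an undirected weighted PPI network.
    Edge weights in (0, 1] are interaction confidences; assumes the
    target is reachable from the source."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, -math.log(w)))
        graph.setdefault(v, []).append((u, -math.log(w)))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue                      # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1]

edges = [("Ste2", "Ste4", 0.9), ("Ste4", "Ste5", 0.8),
         ("Ste2", "Ste5", 0.3), ("Ste5", "Fus3", 0.9)]
# -> ['Ste2', 'Ste4', 'Ste5', 'Fus3']
```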

Journal ArticleDOI
TL;DR: It is shown that DNA microarrays can produce highly repeatable data in a cross-platform cross-lab manner, when one focuses on the discriminating genes and classifiers.
Abstract: Microarray technology has great potential for improving our understanding of biological processes, medical conditions, and diseases. Often, microarray datasets are collected using different microarray platforms (provided by different companies) under different conditions in different laboratories. The cross-platform and cross-laboratory concordance of the microarray technology needs to be evaluated before it can be successfully and reliably applied in biological/clinical practice. New measures and techniques are proposed for comparing and evaluating the quality of microarray datasets generated from different platforms/laboratories. These measures and techniques are based on the following philosophy: the practical usefulness of the microarray technology may be confirmed if discriminating genes and classifiers, which are the focus of most, if not all, comparative investigations, discovered/trained from data collected in one lab/platform combination can be transferred to another lab/platform combination. The rationale is that the nondiscriminating genes might not be as strongly regulated as the discriminating genes, by the biological process of the tissue cells under study, and hence they may behave more randomly than the discriminating genes. Our experiment results, on microarray datasets generated from different platforms/laboratories using the reference mRNA samples in the Microarray Quality Control (MAQC) project, showed that DNA microarrays can produce highly repeatable data in a cross-platform cross-lab manner, when one focuses on the discriminating genes and classifiers. In our comparative study, we compare samples of one type against samples of another type; the methodology can be applied to situations where one compares one arbitrary class of data against another. 
Other findings include: (1) using three discriminating-gene/classifier-based methods to test the concordance between microarray datasets gave consistent results; (2) when noisy (nondiscriminating) genes were removed, the microarray datasets from different laboratories using common platform were found to be highly concordant, and the data generated using most of the commercial platforms studied here were also found to be concordant with each other; (3) several series of artificial datasets with known degree of difference were created, to establish a bridge between consistency rate and P-value, allowing us to estimate P-value if consistency rate between two datasets is known.

Journal ArticleDOI
TL;DR: The haplotype inference problem from pedigree data under the zero recombination assumption is studied, which is well supported by real data for tightly linked markers (i.e. single nucleotide polymorphisms (SNPs) over a relatively large chromosome segment) by formulating genotype constraints as a linear system of inheritance variables.
Abstract: We study the haplotype inference problem from pedigree data under the zero recombination assumption, which is well supported by real data for tightly linked markers (i.e. single nucleotide polymorphisms (SNPs)) over a relatively large chromosome segment. We solve the problem in a rigorous mathematical manner by formulating genotype constraints as a linear system of inheritance variables. We then utilize disjoint-set structures to encode connectivity information among individuals, to detect constraints from genotypes, and to check consistency of constraints. On a tree pedigree without missing data, our algorithm can output a general solution as well as the number of total specific solutions in a nearly linear time O(mn · α(n)), where m is the number of loci, n is the number of individuals and α is the inverse Ackermann function, which is a further improvement over existing ones. We also extend the idea to looped pedigrees and pedigrees with missing data by considering existing (partial) constraints on inheritance variables. The algorithm has been implemented in C++ and will be incorporated into our PedPhase package. Experimental results show that it can correctly identify all 0-recombinant solutions with great efficiency. Comparisons with two other popular algorithms show that the proposed algorithm achieves 10- to 10^5-fold improvements over a variety of parameter settings. The experimental study also provides empirical evidence on the complexity bounds suggested by theoretical analysis.
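The disjoint-set machinery behind the near-linear bound can be sketched as a union-find structure carrying one parity bit per element, which is the standard way to propagate XOR constraints like those on binary inheritance variables. The variable names in the usage example are hypothetical:

```python
class ParityDSU:
    """Disjoint sets with a parity bit per element: add(a, b, p) asserts
    the XOR constraint a ^ b = p and reports contradictions. This sketches
    how linear constraints over GF(2) can be checked in near-linear time."""
    def __init__(self):
        self.parent, self.parity = {}, {}

    def find(self, x):
        """Return (root of x, parity of x relative to that root)."""
        if self.parent.setdefault(x, x) == x:
            return x, self.parity.setdefault(x, 0)
        root, p = self.find(self.parent[x])
        self.parity[x] ^= p          # re-express x's parity relative to root
        self.parent[x] = root        # path compression
        return root, self.parity[x]

    def add(self, a, b, p):
        ra, pa = self.find(a)
        rb, pb = self.find(b)
        if ra == rb:
            return (pa ^ pb) == p    # consistent only if parities agree
        self.parent[rb] = ra
        self.parity[rb] = pa ^ pb ^ p
        return True

d = ParityDSU()
d.add('m1', 'm2', 1)        # m1 and m2 differ
d.add('m2', 'm3', 0)        # m2 and m3 agree
# d.add('m1', 'm3', 0) would now return False: a detected contradiction
```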

Journal ArticleDOI
TL;DR: This paper describes an algorithm to solve the pairwise alignment problem for metabolic pathways by pursuing the intuition that both pairwise similarities of entities and the organization of their interactions are important for metabolic pathway alignment.
Abstract: Pathways show how different biochemical entities interact with one another to perform vital functions for the survival of an organism. Comparative analysis of pathways is crucial in identifying functional similarities that are difficult to identify by comparing individual entities that build up these pathways. When interacting entities are of a single type, the problem of identifying similarities by aligning the pathways can be reduced to the graph isomorphism problem. For pathways with varying types of entities, such as metabolic pathways, the alignment problem is even more challenging. In order to simplify this problem, existing methods often reduce metabolic pathways to graphs with restricted topologies and a single type of node. However, these abstractions reduce the relevance of the alignment significantly as they cause losses in the information content. In this paper, we describe an algorithm to solve the pairwise alignment problem for metabolic pathways. A distinguishing feature of our method is that it aligns different types of entities, such as enzymes, reactions and compounds. Also, our approach is free of any abstraction in modeling the pathways. We pursue the intuition that both pairwise similarities of entities (homology) and the organization of their interactions (topology) are important for metabolic pathway alignment. In our algorithm, we account for both by creating an eigenvalue problem for each entity type. We enforce consistency while combining the alignments of different entity types by considering the reachability sets of entities. Our experiments show that our method finds biologically and statistically significant alignments on the order of milliseconds. Availability: Our software and its source code in the C programming language are available at .
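The homology-plus-topology intuition leads to an eigenvalue-style iteration in which the score of a node pair mixes its own similarity with the scores of neighbouring pairs. A minimal power-iteration sketch in that spirit (in the style of IsoRank-like network alignment, not the paper's per-entity-type formulation; all matrices are toy data):

```python
def align_scores(adj1, adj2, hom, iters=50, alpha=0.8):
    """Pair scores S[i][j] mixing homology hom[i][j] with the average
    score of neighbouring pairs (topology), iterated to a fixed point."""
    n, m = len(adj1), len(adj2)
    S = [[1.0 / (n * m)] * m for _ in range(n)]
    for _ in range(iters):
        T = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                neigh = [(u, v) for u in range(n) if adj1[i][u]
                                for v in range(m) if adj2[j][v]]
                spread = (sum(S[u][v] for u, v in neigh) / len(neigh)
                          if neigh else 0.0)
                T[i][j] = alpha * spread + (1 - alpha) * hom[i][j]
        total = sum(sum(row) for row in T)   # renormalize each iteration
        S = [[x / total for x in row] for row in T]
    return S

# Two identical 3-node path graphs with identity homology: the diagonal
# pairs should come out as the best matches.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
hom = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
S = align_scores(path, path, hom)
```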

Journal ArticleDOI
TL;DR: The results suggest that translational repression (which has no effect on mRNA level), instead of mRNA degradation, is the dominant mechanism in miRNA regulation, which explained the previously observed discordant correlation between mRNA expression and protein abundance.
Abstract: Due to the difficulties in identifying microRNA (miRNA) targets experimentally in a high-throughput manner, several computational approaches have been proposed. To date, most leading algorithms are based on sequence information alone. However, there has been limited overlap between these predictions, implying high false-positive rates, which underlines the limitation of sequence-based approaches. Considering the repressive nature of miRNAs at the mRNA translational level, here we describe a probabilistic model to make predictions by combining sequence complementarity, miRNA expression level, and protein abundance. Our underlying assumption is that, given sequence complementarity between a miRNA and its putative mRNA targets, the miRNA expression level should be high and the protein abundance of the mRNA should be low. Having identified a set of confident predictions, we then built a second probabilistic model to trace back to the mRNA expression of the confident targets to investigate the mechanisms of the miRNA-mediated post-transcriptional regulation. Our results suggest that translational repression (which has no effect on mRNA level), instead of mRNA degradation, is the dominant mechanism in miRNA regulation. This observation explains the previously observed discordant correlation between mRNA expression and protein abundance.

Journal ArticleDOI
TL;DR: This work provides the global evidence that interaction hubs obtain their robustness against uneven protein concentrations through co-expression of the constituents, and that the degree of co- expression correlates strongly with the complexity of the embedded motif.
Abstract: Almost all cellular functions are the results of well-coordinated interactions between various proteins. A more connected hub or motif in the interaction network is expected to be more important, and any perturbation in this motif would be more damaging to the smooth performance of the related functions. Thus, some coherent robustness of these hubs has to be derived. Here, we provide the global evidence that interaction hubs obtain their robustness against uneven protein concentrations through co-expression of the constituents, and that the degree of co-expression correlates strongly with the complexity of the embedded motif. We calculated the gene expression correlations between the proteins embedded in 3-, 4-, 5-, and 6-node interaction motifs of increasing complexities, and compared them to those between proteins from random motifs of similar complexities. We find that as the connectedness of these motifs increases, there is higher co-expression between the constituent proteins. For example, when the expression correlation is 0.7, the kernel density of the correlation increases from 0.152 for 4-node motifs with three edges to 0.403 for 4-node cliques. This implies that the robustness of the interaction system emerges from a proportionate synchronicity among the constituents of the motif via co-expression. We further show that such biological coherence via co-expression of component proteins can be reinforced by integrating conservation data in the analysis. For example, with addition of evolutionary information from other genomes, the ratio of kernel density for interaction and random data in the case of 5- and 6-node cliques in yeast increases from 37.8 to 123 and 98.4 to 1300, respectively, given that the expression correlation is 0.8. Our results show that genes whose products are involved in motifs have transcription and translation properties that minimize the noise in final protein concentrations, compared to random sets of genes.

Journal ArticleDOI
TL;DR: A novel heuristic approach of simulating ligand-receptor binding processes is introduced, which is not dependent on calculating lengthy molecular dynamics trajectories and characterizes the metastable steps of the binding process and can yield the corresponding transition probabilities.
Abstract: The understanding of biological ligand-receptor binding processes is relevant for a variety of research topics and assists the rational design of novel drug molecules. Computer simulation can help to advance this understanding but, due to the high dimensionality of such systems, suffers from severe computational costs. Based on the framework provided by conformation dynamics and transition state theory, a novel heuristic approach to simulating ligand-receptor binding processes is introduced, which is not dependent on calculating lengthy molecular dynamics trajectories. First, the relevant portion of conformational space is partitioned with meshless methods. Then, each region is sampled separately, using hybrid Monte Carlo. Finally, the dynamical binding process is reconstructed from the static overlaps between the partial densities obtained in the sampling step. The method characterizes the metastable steps of the binding process and can yield the corresponding transition probabilities.

Journal ArticleDOI
TL;DR: Two novel methods, named GaborLocal and GaborEnvelop, are proposed, both of which can detect more true peaks with a lower false discovery rate than previous methods, and outperform other commonly used methods in the Receiver Operating Characteristic (ROC) curve.
Abstract: Mass Spectrometry (MS) is increasingly being used to discover disease-related proteomic patterns. The peak detection step is one of the most important steps in the typical analysis of MS data. Recently, many new algorithms have been proposed to increase the true positive rate while keeping a low false discovery rate in peak detection. Most of them follow two approaches: one is the denoising approach and the other is the decomposing approach. In previous studies, the decomposing approach has shown more potential than the denoising one. In this paper, we propose two novel methods, named GaborLocal and GaborEnvelop, both of which can detect more true peaks with a lower false discovery rate than previous methods. We employ the method of Gaussian local maxima to detect peaks, because it is robust to noise in signals. A new approach, peak rank, is defined for the first time to identify peaks instead of using the signal-to-noise ratio. Meanwhile, the Gabor filter is used to amplify important information and compress noise in the raw MS signal. Moreover, we also propose the envelope analysis to improve the quantification of peaks and remove more false peaks. The proposed methods have been evaluated on a real SELDI-TOF spectrum with known polypeptide positions. The experimental results demonstrate that our methods outperform other commonly used methods in the Receiver Operating Characteristic (ROC) curve.
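The Gaussian-local-maxima plus peak-rank idea can be sketched as: smooth the signal, take local maxima of the smoothed signal, then rank the maxima by height and keep the top ones. This is a simplified stand-in (a plain Gaussian filter instead of the paper's Gabor filter, and a synthetic signal instead of a SELDI-TOF spectrum):

```python
import math

def gaussian_kernel(sigma, radius):
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def detect_peaks(signal, sigma=2.0, top=3):
    """Smooth with a Gaussian, find local maxima of the smoothed signal,
    then rank peaks by smoothed height and keep the `top` ones."""
    radius = int(3 * sigma)
    k = gaussian_kernel(sigma, radius)
    smoothed = []
    for i in range(len(signal)):
        acc = 0.0
        for d, w in zip(range(-radius, radius + 1), k):
            j = min(max(i + d, 0), len(signal) - 1)  # clamp at the edges
            acc += w * signal[j]
        smoothed.append(acc)
    maxima = [i for i in range(1, len(signal) - 1)
              if smoothed[i - 1] < smoothed[i] >= smoothed[i + 1]]
    return sorted(maxima, key=lambda i: -smoothed[i])[:top]  # peak rank

# synthetic spectrum: two Gaussian bumps at positions 15 and 35
sig = [math.exp(-((i - 15) ** 2) / 8) + 0.5 * math.exp(-((i - 35) ** 2) / 8)
       for i in range(50)]
# -> detect_peaks(sig, top=2) finds [15, 35]
```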

Journal ArticleDOI
TL;DR: This paper studies combinatorial asymptotics for two special subclasses of RNA secondary structures — canonical and saturated structures — introduces a stochastic greedy method to sample random saturated structures, and applies the work of Drmota to show that the density of states for [all resp. canonical resp. saturated] secondary structures is asymptotically Gaussian.
Abstract: It is a classical result of Stein and Waterman that the asymptotic number of RNA secondary structures is 1.104366 · n^(-3/2) · 2.618034^n. In this paper, we study combinatorial asymptotics for two special subclasses of RNA secondary structures — canonical and saturated structures. Canonical secondary structures are defined to have no lonely (isolated) base pairs. This class of secondary structures was introduced by Bompfünewerer et al., who noted that the run time of the Vienna RNA Package is substantially reduced when restricting computations to canonical structures. Here we provide an explanation for the speed-up, by proving that the asymptotic number of canonical RNA secondary structures is 2.1614 · n^(-3/2) · 1.96798^n and that the expected number of base pairs in a canonical secondary structure is 0.31724 · n. The asymptotic number of canonical secondary structures was obtained much earlier by Hofacker, Schuster and Stadler using a different method. Saturated secondary structures have the property that no base pairs can be added without violating the definition of secondary structure (i.e. introducing a pseudoknot or base triple). Here we show that the asymptotic number of saturated structures is 1.07427 · n^(-3/2) · 2.35467^n, the asymptotic expected number of base pairs is 0.337361 · n, and the asymptotic number of saturated stem-loop structures is 0.323954 · 1.69562^n, in contrast to the number 2^(n-2) of (arbitrary) stem-loop structures as classically computed by Stein and Waterman. Finally, we apply the work of Drmota to show that the density of states for [all resp. canonical resp. saturated] secondary structures is asymptotically Gaussian. We introduce a stochastic greedy method to sample random saturated structures, called quasi-random saturated structures, and show that the expected number of base pairs is 0.340633 · n.

Journal ArticleDOI
TL;DR: A new approach to segmenting multiple time series by analyzing the dynamics of cluster formation and rearrangement around putative segment boundaries is presented, revealing clusters of genes along with a segmentation such that clusters show concerted behavior within segments but exhibit significant regrouping across segmentation boundaries.
Abstract: We present a new approach to segmenting multiple time series by analyzing the dynamics of cluster formation and rearrangement around putative segment boundaries. This approach finds application in distilling large numbers of gene expression profiles into temporal relationships underlying biological processes. By directly minimizing information-theoretic measures of segmentation quality derived from Kullback-Leibler (KL) divergences, our formulation reveals clusters of genes along with a segmentation such that clusters show concerted behavior within segments but exhibit significant regrouping across segmentation boundaries. The results of the segmentation algorithm can be summarized as Gantt charts revealing temporal dependencies in the ordering of key biological processes. Applications to the yeast metabolic cycle and the yeast cell cycle are described.
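The information-theoretic core of such a formulation — scoring how strongly cluster-membership distributions regroup across a putative boundary — rests on the Kullback-Leibler divergence, which for discrete distributions is computed as below. This is a minimal sketch of the measure itself, not the authors' full objective function:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as equal-length lists."""
    # eps guards against zero probabilities in q; terms with p_i = 0 contribute 0
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

Within a segment, a cluster's membership distribution changes little (small divergence); across a good segment boundary, the divergence between the distributions on either side is large.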

Journal ArticleDOI
TL;DR: It is concluded that, after chromosome doubling, the "choice" of which paralogous gene pairs will lose copies is random, but that the retention of strings of single-copy genes on one chromosome versus the other is decidedly non-random.
Abstract: We develop criteria to detect neighborhood selection effects on gene loss following whole genome duplication, and apply them to the recently sequenced poplar (Populus trichocarpa) genome. We improve on guided genome halving algorithms so that several thousand gene sets, each containing two paralogs in the descendant T of the doubling event and their single ortholog from an undoubled reference genome R, can be analyzed to reconstruct the ancestor A of T at the time of doubling. At the same time, large numbers of defective gene sets, either missing one paralog from T or missing their ortholog in R, may be incorporated into the analysis in a consistent way. We apply this genomic rearrangement distance-based approach to the poplar and grapevine (Vitis vinifera) genomes, as T and R respectively. We conclude that, after chromosome doubling, the "choice" of which paralogous gene pairs will lose copies is random, but that the retention of strings of single-copy genes on one chromosome versus the other is decidedly non-random.

Journal ArticleDOI
TL;DR: This work proposes to combine a well-balanced set of existing approaches into new, ensemble-based prediction methods for the subcellular localization of proteins, and shows that the ensembles improve substantially over the underlying base methods.
Abstract: In the past decade, many automated prediction methods for the subcellular localization of proteins have been proposed, utilizing a wide range of principles and learning approaches. Based on an experimental evaluation of different methods and their theoretical properties, we propose to combine a well-balanced set of existing approaches into new, ensemble-based prediction methods. The experimental evaluation shows that our ensembles improve substantially over the underlying base methods.
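A simple ensemble combiner of the kind described — here plain majority voting over base predictors, which is one possible combination rule rather than necessarily the one used in the paper — looks like:

```python
from collections import Counter

def ensemble_predict(base_predictions):
    """Majority vote over the labels returned by the base localization predictors."""
    # base_predictions: list of predicted compartments for one protein
    return Counter(base_predictions).most_common(1)[0][0]
```

More refined combiners weight each base method by its estimated reliability, but even unweighted voting often reduces the variance of individual predictors.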

Journal ArticleDOI
TL;DR: CurveSOM is a very promising tool for the exploratory analysis of time course expression data, as it is not only able to group genes into clusters with high accuracy but also able to find true time-shifted correlations of expression patterns across clusters.
Abstract: There is increasing interest in clustering time course gene expression data to investigate a wide range of biological processes. However, developing a clustering algorithm ideal for time course gene expression data is still challenging. As timing is an important factor in defining true clusters, a clustering algorithm should exploit expression correlations between time points in order to achieve high clustering accuracy. Moreover, inter-cluster gene relationships are often desired in order to facilitate the computational inference of biological pathways and regulatory networks. In this paper, a new clustering algorithm called CurveSOM is developed to offer both of these features. It first represents each gene by a cubic smoothing spline fitted to the time course expression profile, and then groups genes into clusters by applying self-organizing map-based clustering to the resulting splines. CurveSOM has been tested on three well-studied yeast cell cycle datasets and compared with four popular programs: Cluster 3.0, GENECLUSTER, MCLUST, and SSClust. The results show that CurveSOM is a very promising tool for the exploratory analysis of time course expression data, as it is not only able to group genes into clusters with high accuracy but also able to find true time-shifted correlations of expression patterns across clusters.
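To illustrate the fit-then-compare idea, the sketch below uses a plain cubic least-squares polynomial as a simple stand-in for the cubic smoothing spline (the real method fits splines and then clusters with a self-organizing map; function names are ours):

```python
import numpy as np

def fit_profile(times, expr, degree=3):
    """Cubic least-squares fit: a simple stand-in for the cubic smoothing spline."""
    coeffs = np.polyfit(times, expr, degree)
    return lambda t: np.polyval(coeffs, t)

def profile_correlation(f, g, grid):
    """Pearson correlation of two fitted profiles evaluated on a common grid."""
    return float(np.corrcoef(f(grid), g(grid))[0, 1])
```

Comparing fitted curves rather than raw measurements smooths over noise at individual time points, and evaluating one curve against a time-shifted grid is what exposes time-shifted correlations between clusters.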

Journal ArticleDOI
TL;DR: A novel approach called SNAP for non-sequential pair-wise structural alignment that achieves both high sensitivity and high specificity, is competitive with state-of-the-art alignment methods, and gives longer alignments with lower rmsd.
Abstract: Structural similarity between proteins gives us insights into their evolutionary relationships when there is low sequence similarity. In this paper, we present a novel approach called SNAP for non-sequential pair-wise structural alignment. Starting from an initial alignment, our approach iterates over a two-step process consisting of a superposition step and an alignment step, until convergence. We propose a novel greedy algorithm to construct both sequential and non-sequential alignments. The quality of SNAP alignments was assessed by comparing against the manually curated reference alignments in the challenging SISY and RIPC datasets. Moreover, when applied to a dataset of 4410 protein pairs selected from the CATH database, SNAP produced longer alignments with lower rmsd than several state-of-the-art alignment methods. Classification of folds using SNAP alignments was both highly sensitive and highly selective. The SNAP software along with the datasets are available online at http://www.cs.rpi.edu/~zaki/software/SNAP.
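The superposition step in such iterate-superpose-align schemes is typically solved with the Kabsch algorithm, which finds the optimal rotation between two matched point sets via an SVD. A sketch of that standard building block (not SNAP's actual code) is:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD after optimally superposing point set P (N x 3) onto Q (N x 3)."""
    P = P - P.mean(axis=0)                 # center both point sets
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                            # covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid improper rotations (reflections)
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ P.T).T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Given a candidate residue correspondence, the superposition step minimizes this rmsd; the alignment step then revises the correspondence under the new superposition, and the two steps alternate until convergence.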

Journal ArticleDOI
TL;DR: This paper presents an algorithm that is alphabet-independent and based on the run-length encoding scheme, and derives quantitative evidence that there is a directional bias in the growth of minisatellites in the MSY1 dataset.
Abstract: Subsequent duplication events are responsible for the evolution of minisatellite maps. Alignment of two minisatellite maps should therefore take these duplication events into account, in addition to the well-known edit operations. All algorithms for computing an optimal alignment of two maps, including the one presented here, first deduce the costs of optimal duplication scenarios for all substrings of the given maps. Then, they incorporate the pre-computed costs into the alignment recurrence. However, all previous algorithms addressing this problem depend on the number of distinct map units (the map alphabet) and do not fully exploit the repetitiveness of the map units. In this paper, we present an algorithm that remedies these shortcomings: our algorithm is alphabet-independent and is based on the run-length encoding scheme. It is the fastest in theory and, as experimental results show, in practice as well. Furthermore, our alignment model is more general than that of the previous algorithms and better captures the duplication mechanism. Using our algorithm, we derive quantitative evidence that there is a directional bias in the growth of minisatellites in the MSY1 dataset.
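Run-length encoding itself is straightforward; a minimal sketch of compressing a map into (unit, run-length) pairs, so that runs of repeated units can be handled in bulk:

```python
def run_length_encode(units):
    """Collapse a minisatellite map (sequence of units) into (unit, run-length) pairs."""
    runs = []
    for u in units:
        if runs and runs[-1][0] == u:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([u, 1])    # start a new run
    return [(u, n) for u, n in runs]
```

Working on runs rather than individual units is what lets the alignment recurrence exploit the repetitiveness of the maps.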

Journal ArticleDOI
TL;DR: By means of the technique of the imbedded Markov chain, an efficient algorithm is proposed to exactly calculate the first and second moments of word counts, and the probability for a word to occur at least once, in random texts generated by a Markov chain.
Abstract: By means of the technique of the imbedded Markov chain, an efficient algorithm is proposed to exactly calculate the first and second moments of word counts, and the probability for a word to occur at least once, in random texts generated by a Markov chain. A generating function is introduced directly from the imbedded Markov chain to derive asymptotic approximations for the problem. Two Z-scores, one based on the number of sequences with hits and the other on the total number of word hits in a set of sequences, are examined for motif discovery on a set of promoter sequences extracted from the A. thaliana genome. Source code is available at http://www.itp.ac.cn/zheng/oligo.c.
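To illustrate the imbedded Markov chain technique, the sketch below computes the exact probability that a word occurs at least once, for the simpler case of i.i.d. letters (the paper handles Markov-generated text, which would enlarge the state space); names are ours:

```python
def occurrence_probability(word, alphabet, probs, n):
    """P(word occurs at least once in a random i.i.d. text of length n)."""
    L = len(word)
    # KMP-style transition: longest prefix of `word` that is a suffix of prefix_s + c
    delta = {}
    for s in range(L):
        for c in alphabet:
            t = word[:s] + c
            k = min(len(t), L)
            while k > 0 and t[-k:] != word[:k]:
                k -= 1
            delta[(s, c)] = k
    dist = [0.0] * L        # probability mass on non-absorbing prefix states 0..L-1
    dist[0] = 1.0
    hit = 0.0               # absorbed mass = P(word already seen)
    for _ in range(n):
        new = [0.0] * L
        for s, mass in enumerate(dist):
            if mass == 0.0:
                continue
            for c in alphabet:
                nxt = delta[(s, c)]
                if nxt == L:
                    hit += mass * probs[c]   # word completed: absorb
                else:
                    new[nxt] += mass * probs[c]
        dist = new
    return hit
```

The chain's states are the matched prefix lengths of the word, with the full match made absorbing; iterating the transition for n steps gives the exact occurrence probability, and the same construction yields the moments of the word count.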