
Showing papers in "Journal of Computational Biology in 2008"


Journal ArticleDOI
TL;DR: This paper significantly extends the class of pathways that can be efficiently queried to the case of trees, and graphs of bounded treewidth, and implements a tool for tree queries, called QNet, and uses it to perform the first large-scale cross-species comparison of protein complexes.
Abstract: Molecular interaction databases can be used to study the evolution of molecular pathways across species. Querying such pathways is a challenging computational problem, and recent efforts have been limited to simple queries (paths), or simple networks (forests). In this paper, we significantly extend the class of pathways that can be efficiently queried to the case of trees, and graphs of bounded treewidth. Our algorithm allows the identification of non-exact (homeomorphic) matches, exploiting the color coding technique of Alon et al. (1995). We implement a tool for tree queries, called QNet, and test its retrieval properties in simulations and on real network data. We show that QNet searches queries with up to nine proteins in seconds on current networks, and outperforms sequence-based searches. We also use QNet to perform the first large-scale cross-species comparison of protein complexes, by querying known yeast complexes against a fly protein interaction network. This comparison points to strong conservation between the two species, and underscores the importance of our tool in mining protein interaction networks.
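As a rough illustration of the color-coding technique of Alon et al. that QNet builds on (not the QNet implementation itself), the Python sketch below searches an interaction network for a simple path on k proteins; the graph format, function name, and trial count are illustrative, and the tree/bounded-treewidth and homeomorphic-match extensions of the paper are omitted.

```python
import random

def colorful_path(graph, k, trials=300, seed=0):
    """Search `graph` (dict: node -> set of neighbors) for a simple path on k
    nodes using the color-coding idea of Alon et al.: color nodes with k random
    colors, then find a path whose nodes carry pairwise-distinct colors by
    dynamic programming over (endpoint, used-color-set) states."""
    rng = random.Random(seed)
    nodes = list(graph)
    for _ in range(trials):
        color = {v: rng.randrange(k) for v in nodes}
        # paths[(v, S)]: some colorful path ending at v whose color set is exactly S
        paths = {(v, frozenset([color[v]])): [v] for v in nodes}
        for size in range(1, k):
            extensions = {}
            for (v, used), p in paths.items():
                if len(used) != size:
                    continue
                for w in graph.get(v, ()):
                    if color[w] in used:
                        continue
                    key = (w, used | {color[w]})
                    if key not in paths and key not in extensions:
                        extensions[key] = p + [w]
            paths.update(extensions)
        for (v, used), p in paths.items():
            if len(used) == k:
                return p          # a simple path on k proteins
    return None                   # none found in these trials (one may still exist)

# toy interaction network; query a linear pathway of four proteins
net = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(colorful_path(net, 4))      # e.g. ['a', 'b', 'c', 'd'] once a coloring succeeds
```

Each random coloring makes an existing k-node match colorful with probability k!/k^k, so on the order of e^k independent trials find it with high probability, which is why queries of up to nine proteins remain fast.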

150 citations


Journal ArticleDOI
TL;DR: In this article, a statistical and analytical method is proposed to identify exceptional motifs in a given network; it does not require any simulation and is applied to protein-protein interaction (PPI) networks.
Abstract: Obtaining and analyzing biological interaction networks is at the core of systems biology. To help understand these complex networks, many recent works have suggested focusing on motifs that occur more frequently than expected at random. To identify such exceptional motifs in a given network, we propose a statistical and analytical method which does not require any simulation. For this, we first provide an analytical expression of the mean and variance of the count under any exchangeable random graph model. Then we approximate the motif count distribution by a compound Poisson distribution whose parameters are derived from the mean and variance of the count. Using simulations, we show that the compound Poisson approximation outperforms the Gaussian approximation. The compound Poisson distribution can then be used to get an approximate p-value and to decide whether an observed count is significantly high. Our methodology is applied to protein-protein interaction (PPI) networks, and statistical issues related to exceptional motif detection are discussed.
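To make the final step concrete: once the mean and variance of a motif count are available, a geometric-Poisson (Pólya–Aeppli) distribution can be fitted by moment matching and used to obtain a tail p-value. The sketch below is a minimal illustration under one standard parameterization (it assumes variance > mean); it is not the authors' code.

```python
import math

def polya_aeppli_pvalue(n_obs, mean, var):
    """Approximate P(count >= n_obs) for a motif count with the given mean and
    variance, using a geometric-Poisson (Polya-Aeppli) distribution fitted by
    moment matching (requires var > mean, i.e., an overdispersed count)."""
    a = (var - mean) / (var + mean)      # geometric parameter of the cluster sizes
    lam = mean * (1.0 - a)               # Poisson rate of the number of clusters
    def pmf(n):
        if n == 0:
            return math.exp(-lam)
        total = 0.0
        for j in range(1, n + 1):        # j = number of clusters contributing to n
            pois = math.exp(-lam) * lam**j / math.factorial(j)
            negbin = math.comb(n - 1, j - 1) * (1 - a)**j * a**(n - j)
            total += pois * negbin
        return total
    cdf = sum(pmf(n) for n in range(n_obs))
    return max(1.0 - cdf, 0.0)

# e.g. an observed count of 30 for a motif with expected count 8 and variance 20
print(polya_aeppli_pvalue(30, 8.0, 20.0))
```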

102 citations


Journal ArticleDOI
TL;DR: An algorithm is described that computes both the knock-out sets and the elementary modes containing the blocked reactions directly from the description of the network and whose worst-case computational complexity is better than the algorithms currently in use for these problems.
Abstract: Given a metabolic network in terms of its metabolites and reactions, our goal is to efficiently compute the minimal knock-out sets of reactions required to block a given behavior. We describe an algorithm that improves the computation of these knock-out sets when the elementary modes (minimal functional subsystems) of the network are given. We also describe an algorithm that computes both the knock-out sets and the elementary modes containing the blocked reactions directly from the description of the network and whose worst-case computational complexity is better than the algorithms currently in use for these problems. Computational results are included.
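A brute-force sketch of the underlying combinatorial task may help: given the elementary modes, a knock-out set for a target reaction is a set of reactions that hits every elementary mode containing the target. The enumeration below is exponential and purely illustrative; the paper's contribution is algorithms that avoid this blow-up (and that compute the relevant modes directly from the network description). The function name and toy modes are made up.

```python
from itertools import combinations

def minimal_knockout_sets(elementary_modes, target, max_size=4):
    """Enumerate minimal reaction knock-out sets that block every elementary
    mode involving `target` (each mode given as a set of reaction names).
    The target itself is not knocked out.  Brute force over set sizes."""
    modes = [m - {target} for m in elementary_modes if target in m]
    reactions = sorted(set().union(*modes)) if modes else []
    cuts = []
    for size in range(1, max_size + 1):
        for cand in combinations(reactions, size):
            cset = set(cand)
            if any(c <= cset for c in cuts):        # a smaller cut is contained: not minimal
                continue
            if all(cset & m for m in modes):        # hits every mode through the target
                cuts.append(cset)
    return cuts

# toy network: three elementary modes supporting the reaction "biomass"
modes = [{"r1", "r2", "biomass"}, {"r1", "r3", "biomass"}, {"r4", "biomass"}]
print(minimal_knockout_sets(modes, "biomass"))      # [{'r1', 'r4'}, {'r2', 'r3', 'r4'}]
```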

92 citations


Journal ArticleDOI
TL;DR: A novel search method and a novel method for learning energy functions from training data that are both based on Tree Reweighted Belief Propagation are presented, which suggest that combining machine learning with approximate inference can improve the state-of-the-art in side-chain prediction.
Abstract: Side-chain prediction is an important subproblem of the general protein folding problem. Despite much progress in side-chain prediction, performance is far from satisfactory. As an example, the ROSETTA program, which uses simulated annealing to select minimum-energy conformations, correctly predicts the first two side-chain angles for approximately 72% of the buried residues in a standard data set. Is further improvement more likely to come from better search methods, or from better energy functions? Given that exact minimization of the energy is NP-hard, it is difficult to get a systematic answer to this question. In this paper, we present a novel search method and a novel method for learning energy functions from training data that are both based on Tree Reweighted Belief Propagation (TRBP). We find that TRBP can obtain the global optimum of the ROSETTA energy function in a few minutes of computation for approximately 85% of the proteins in a standard benchmark set. TRBP can also effectively bound the partition function, which enables using the Conditional Random Fields (CRF) framework for learning. Interestingly, finding the global minimum does not significantly improve side-chain prediction for an energy function based on ROSETTA's default energy terms (less than 0.1%), while learning new weights gives a significant boost from 72% to 78%. Using a recently modified ROSETTA energy function with a softer Lennard-Jones repulsive term, the global optimum does improve prediction accuracy from 77% to 78%. Here again, learning new weights improves side-chain modeling even further to 80%. Finally, the highest accuracy (82.6%) is obtained using an extended rotamer library and CRF-learned weights. Our results suggest that combining machine learning with approximate inference can improve the state-of-the-art in side-chain prediction.

82 citations


Journal ArticleDOI
TL;DR: New one- and two-stage pooling designs, together with new probabilistic pooling designs, for both error-free and error-tolerant scenarios are presented.
Abstract: The study of gene functions requires a high-quality DNA library; such a library is obtained through a large amount of testing and screening. Pooling design is a very helpful tool for reducing the number of tests for DNA library screening. In this paper, we present new one- and two-stage pooling designs, together with new probabilistic pooling designs. The approach in this paper works in both error-free and error-tolerant scenarios.

75 citations


Journal ArticleDOI
TL;DR: A novel highly efficient method for the detection of a pharmacophore from a set of drug-like ligands that interact with a target receptor, which is expected to be a key component in the discovery of new leads by screening large databases of drug-like molecules.
Abstract: We present a novel, highly efficient method for the detection of a pharmacophore from a set of drug-like ligands that interact with a target receptor. A pharmacophore is a spatial arrangement of physico-chemical features in a ligand that is essential for the interaction with a specific receptor. In the absence of a known 3D receptor structure, a pharmacophore can be identified from a multiple structural alignment of ligand molecules. The key advantages of the presented algorithm are: (a) its ability to multiply align flexible ligands in a deterministic manner, (b) its ability to focus on subsets of the input ligands, which may share a large common substructure, resulting in the detection of both outlier molecules and alternative binding modes, and (c) its computational efficiency, which allows pharmacophores shared by a large number of molecules to be detected on a standard PC. The algorithm was extensively tested on a dataset of 74 ligands that are classified into 12 cases according to the protein receptor they bind to. The results, which were achieved using a set of standard default parameters, were consistent with reference pharmacophores that were derived from the bound ligand-receptor complexes. The pharmacophores detected by the algorithm are expected to be a key component in the discovery of new leads by screening large databases of drug-like molecules.

72 citations


Journal ArticleDOI
TL;DR: A heuristic algorithm, called DUPCAR, is proposed for reconstructing ancestral genomic orders with duplications and is applied to reconstruct the ancestral chromosome X of placental mammals and the ancestral genomes of the ciliate Paramecium tetraurelia.
Abstract: Accurately reconstructing the large-scale gene order in an ancestral genome is a critical step to better understand genome evolution. In this paper, we propose a heuristic algorithm, called DUPCAR, for reconstructing ancestral genomic orders with duplications. The method starts from the order of genes in modern genomes and predicts predecessor and successor relationships in the ancestor. Then a greedy algorithm is used to reconstruct the ancestral orders by connecting genes into contiguous regions based on predicted adjacencies. Computer simulation was used to validate the algorithm. We also applied the method to reconstruct the ancestral chromosome X of placental mammals and the ancestral genomes of the ciliate Paramecium tetraurelia.
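As an illustration of the greedy reconstruction step (connecting genes into contiguous ancestral regions from predicted adjacencies), the sketch below chains genes along their highest-weight predicted adjacencies, refusing edges that would give a gene more than two neighbors or close a cycle. It ignores gene orientation and duplications, both of which DUPCAR handles explicitly; the input format is an assumption.

```python
def greedy_cars(adjacencies):
    """Greedily chain genes into contiguous ancestral regions (CARs) from
    weighted predicted adjacencies given as (gene_a, gene_b, weight) triples.
    Simplified sketch: no gene orientation, no duplicated gene copies."""
    parent, degree, neighbors = {}, {}, {}

    def find(x):                                  # union-find root with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, w in sorted(adjacencies, key=lambda t: -t[2]):
        for g in (a, b):
            degree.setdefault(g, 0)
            neighbors.setdefault(g, [])
        if degree[a] >= 2 or degree[b] >= 2:      # a gene gets at most two neighbors
            continue
        ra, rb = find(a), find(b)
        if ra == rb:                              # would close a cycle
            continue
        parent[ra] = rb
        degree[a] += 1; degree[b] += 1
        neighbors[a].append(b); neighbors[b].append(a)

    cars, seen = [], set()                        # walk each path from an endpoint
    for g in neighbors:
        if g in seen or degree[g] > 1:
            continue
        car, prev, cur = [g], None, g
        seen.add(g)
        while True:
            nxt = [n for n in neighbors[cur] if n != prev]
            if not nxt:
                break
            prev, cur = cur, nxt[0]
            car.append(cur)
            seen.add(cur)
        cars.append(car)
    return cars

# toy predicted adjacencies with confidence weights
print(greedy_cars([("g1", "g2", 0.9), ("g2", "g3", 0.8), ("g4", "g5", 0.7), ("g3", "g1", 0.2)]))
```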

69 citations


Journal ArticleDOI
TL;DR: A kinetic model of post-transcriptional gene regulation by miRNAs is developed, focusing on the miRNA-mediated effect of increasing target mRNA degradation rates, and yields a good correspondence between the inferred and experimentally measured decay rates of human target mRNAs.
Abstract: MicroRNAs (miRNAs) have recently emerged as a new, complex layer of gene regulation. MiRNAs act post-transcriptionally, influencing the stability, compartmentalization, and translation of their target mRNAs. Computational efforts to understand post-transcriptional gene regulation by miRNAs have focused on target prediction tools, while quantitative kinetic models of gene regulation by miRNAs have so far largely been overlooked. We here develop a kinetic model of post-transcriptional gene regulation by miRNAs, focusing on the miRNAs' effect of increasing the degradation rates of their target mRNAs. The model is fitted to a temporal microarray dataset where human mRNAs are measured upon transfection with a specific miRNA (miRNA124a). The proposed model exhibits a good fit to many target mRNA profiles, indicating that such models can be used for studying post-transcriptional gene regulation by miRNAs. In particular, the proposed kinetic model can be used for quantifying the miRNA-mediated effects o...
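A minimal kinetic model of this kind can be written as a single ODE in which the miRNA contributes an extra first-order degradation term for its target mRNA. The sketch below assumes an exponential decay of the transfected miRNA and uses illustrative parameter values; it is not the fitted model of the paper.

```python
import numpy as np

def simulate_target_mrna(t_end=48.0, dt=0.01, s=1.0, d0=0.10,
                         d_mir=0.35, mir0=1.0, k_mir=0.05):
    """Toy kinetic model: the miRNA adds an extra first-order degradation term
    to its target mRNA,
        dm/dt = s - (d0 + d_mir * miR(t)) * m,   miR(t) = mir0 * exp(-k_mir * t).
    All parameter values and the miRNA decay law are illustrative assumptions."""
    t = np.arange(0.0, t_end, dt)
    m = np.empty_like(t)
    m[0] = s / d0                      # start at the pre-transfection steady state
    for i in range(1, len(t)):
        mir = mir0 * np.exp(-k_mir * t[i - 1])
        dm = s - (d0 + d_mir * mir) * m[i - 1]
        m[i] = m[i - 1] + dt * dm      # forward Euler step
    return t, m

t, m = simulate_target_mrna()
# m drops after transfection and relaxes back as the miRNA decays away
```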

66 citations


Journal ArticleDOI
TL;DR: The mathematical foundations of this design based on de Bruijn sequences generated by linear feedback shift registers are presented, showing that these sequences represent the maximum number of variants for any given set of array dimensions, while also exhibiting desirable pseudo-randomness properties.
Abstract: Our group has recently developed a compact, universal protein binding microarray (PBM) that can be used to determine the binding preferences of transcription factors (TFs). This design represents all possible sequence variants of a given length k (i.e., all k-mers) on a single array, allowing a complete characterization of the binding specificities of a given TF. Here, we present the mathematical foundations of this design based on de Bruijn sequences generated by linear feedback shift registers. We show that these sequences represent the maximum number of variants for any given set of array dimensions (i.e., number of spots and spot lengths), while also exhibiting desirable pseudo-randomness properties. Moreover, de Bruijn sequences can be selected that represent gapped sequence patterns, further increasing the coverage of the array. This design yields a powerful experimental platform that allows the binding preferences of TFs to be determined with unprecedented resolution.
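For intuition, a de Bruijn sequence of order k over {A, C, G, T} contains every k-mer exactly once when read cyclically, so all 4^k words can be packed into a sequence of length 4^k. The sketch below uses the standard Lyndon-word construction rather than the linear feedback shift registers of the paper (LFSRs additionally give the pseudo-randomness and gapped-pattern properties discussed there).

```python
def de_bruijn(alphabet, k):
    """Return a cyclic de Bruijn sequence containing every k-mer over
    `alphabet` exactly once (standard Lyndon-word construction)."""
    n = len(alphabet)
    a = [0] * (n * k)
    seq = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)

s = de_bruijn("ACGT", 3)          # length 4**3 = 64, covering all 64 trimers cyclically
probe = s + s[:2]                 # linearize by wrapping the first k-1 letters
assert len({probe[i:i + 3] for i in range(64)}) == 64
```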

64 citations


Journal ArticleDOI
TL;DR: The paper's characterization of DS-trees shows that whether a gene tree can be explained by speciation and duplication events alone can be decided in linear time, and that a DS-tree induces a single species tree.
Abstract: We consider two algorithmic questions related to the evolution of gene families. First, given a gene tree for a gene family, can the evolutionary history of this family be explained with only speciation and duplication events? Such gene trees are called DS-trees. We show that this question can be answered in linear time, and that a DS-tree induces a single species tree. We then study a natural extension of this problem: what is the minimum number of gene losses involved in an evolutionary history leading to an observed gene tree or set of gene trees? Based on our characterization of DS-trees, we propose a heuristic for this problem, and evaluate it on a dataset of plant gene families and on simulated data.

64 citations


Journal ArticleDOI
TL;DR: This article presents a new algorithm that is able to extract biclusters from sparse, binary datasets and finds transcription factors with distinctly dissimilar binding motifs, but a clear set of common targets that are significantly enriched for GO categories.
Abstract: Genomic datasets often consist of large, binary, sparse data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed to deal with gene expression data, only two algorithms have been proposed that specifically deal with binary matrices. None of the gene expression biclustering algorithms can handle the large number of zeros in sparse binary matrices. The two proposed binary algorithms failed to produce meaningful results. In this article, we present a new algorithm that is able to extract biclusters from sparse, binary datasets. A powerful feature is that biclusters with different numbers of rows and columns can be detected, varying from many rows to few columns and few rows to many columns. It allows the user to guide the search towards biclusters of specific dimensions. When applying our algorithm to an input matrix derived from TRANSFAC, we find transcription factors with distinctly dissimilar binding motifs, but a clear set of common targets that are significantly enriched for GO categories.

Journal ArticleDOI
TL;DR: The results provide a theoretical upper bound on the length of RNA sequences amenable to probabilistic shape analysis, and a surprising 1-to-1 correspondence between π-shapes and Motzkin numbers.
Abstract: RNA shapes, introduced by Giegerich et al. (2004), provide a useful classification of the branching complexity for RNA secondary structures. In this paper, we derive an exact value for the asymptotic number of RNA shapes, by relying on an elegant relation between non-ambiguous, context-free grammars, and generating functions. Our results provide a theoretical upper bound on the length of RNA sequences amenable to probabilistic shape analysis (Steffen et al., 2006; Voss et al., 2006), under the assumption that any base can basepair with any other base. Since the relation between context-free grammars and asymptotic enumeration is simple, yet not well-known in bioinformatics, we give a self-contained presentation with illustrative examples. Additionally, we prove a surprising 1-to-1 correspondence between π-shapes and Motzkin numbers.
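For reference, the Motzkin numbers mentioned in that correspondence satisfy a simple convolution recurrence and grow roughly like 3^n·n^(-3/2), the same style of exponential asymptotics the paper derives for shape counts. A few-line sketch:

```python
def motzkin(n_max):
    """Motzkin numbers M_0..M_{n_max} via the recurrence
    M_{n+1} = M_n + sum_{i=0}^{n-1} M_i * M_{n-1-i}."""
    m = [1, 1]
    for n in range(1, n_max):
        m.append(m[n] + sum(m[i] * m[n - 1 - i] for i in range(n)))
    return m[:n_max + 1]

print(motzkin(8))   # [1, 1, 2, 4, 9, 21, 51, 127, 323]
```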

Journal ArticleDOI
TL;DR: A novel approach for finding relevant routes based on atom mapping rules (describing which educt atoms are mapped onto which product atoms in a chemical reaction) is presented, leading to a reformulation of the problem as a lightest path search in a degree-weighted metabolic network.
Abstract: Computational analysis of pathways in metabolic networks has numerous applications in systems biology. While graph theory–based approaches have been presented that find biotransformation routes from one metabolite to another in these networks, most of these approaches suffer from finding too many routes, most of which are biologically infeasible or meaningless. We present a novel approach for finding relevant routes based on atom mapping rules (describing which educt atoms are mapped onto which product atoms in a chemical reaction). This leads to a reformulation of the problem as a lightest path search in a degree-weighted metabolic network. The key component of the approach is a new method of computing optimal atom mapping rules.
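To illustrate the reformulation as a lightest path search in a degree-weighted network, the sketch below runs Dijkstra's algorithm where stepping onto a metabolite costs the number of reactions it participates in, so hub compounds such as ATP are avoided. The reaction format and weighting are simplified assumptions; the paper derives its weights from computed atom mappings rather than raw degrees.

```python
import heapq

def lightest_path(reactions, source, target):
    """Lightest path between two metabolites in a degree-weighted network.
    `reactions` is a list of (substrates, products) pairs of metabolite names.
    Entering a metabolite costs the number of reactions it takes part in, a
    simplified stand-in for the atom-mapping-derived weights of the paper."""
    graph, degree = {}, {}
    for subs, prods in reactions:
        for s in subs:
            graph.setdefault(s, set()).update(prods)
        for m in set(subs) | set(prods):
            degree[m] = degree.get(m, 0) + 1
    dist, queue = {source: 0}, [(0, source, [source])]
    while queue:
        d, m, path = heapq.heappop(queue)
        if m == target:
            return d, path
        if d > dist.get(m, float("inf")):
            continue
        for nxt in graph.get(m, ()):
            nd = d + degree[nxt]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(queue, (nd, nxt, path + [nxt]))
    return None

# toy example: glucose to pyruvate without routing through the ATP/ADP hub
rxns = [({"glc"}, {"g6p", "adp"}), ({"g6p"}, {"f6p"}), ({"f6p"}, {"pyr", "atp"}),
        ({"atp"}, {"adp"}), ({"adp"}, {"atp"})]
print(lightest_path(rxns, "glc", "pyr"))
```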

Journal ArticleDOI
TL;DR: The applicability of the PL identification method has been evaluated using simulated data obtained from a model of the carbon starvation response in the bacterium Escherichia coli, allowing us to systematically test the performance of the method under different data characteristics, notably variations in the noise level and the sampling density.
Abstract: We present a method for the structural identification of genetic regulatory networks (GRNs), based on the use of a class of Piecewise-Linear (PL) models. These models consist of a set of decoupled linear models describing the different modes of operation of the GRN and discrete switches between the modes accounting for the nonlinear character of gene regulation. They thus form a compromise between the mathematical simplicity of linear models and the biological expressiveness of nonlinear models. The input of the PL identification method consists of time-series measurements of concentrations of gene products. As output it produces estimates of the modes of operation of the GRN, as well as all possible minimal combinations of threshold concentrations of the gene products accounting for switches between the modes of operation. The applicability of the PL identification method has been evaluated using simulated data obtained from a model of the carbon starvation response in the bacterium Escherichia coli. This has allowed us to systematically test the performance of the method under different data characteristics, notably variations in the noise level and the sampling density.
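To illustrate the model class, the sketch below simulates a two-gene piecewise-linear system in which synthesis rates switch via step functions at threshold concentrations, so the dynamics are linear within each mode of operation. Rates and thresholds are illustrative, not values identified from data, and the identification procedure itself is not shown.

```python
import numpy as np

def step(x, theta):
    """Step regulation function: 1 if the concentration exceeds its threshold."""
    return 1.0 if x > theta else 0.0

def simulate_pl_grn(t_end=100.0, dt=0.01):
    """Minimal piecewise-linear model of two interacting genes:
        dx_a/dt = k_a * (1 - step(x_b, theta_b)) - g_a * x_a   (a repressed by b)
        dx_b/dt = k_b * step(x_a, theta_a)       - g_b * x_b   (b activated by a)
    Within each mode (fixed step values) the system is a decoupled linear ODE."""
    k_a, k_b, g_a, g_b = 2.0, 1.5, 0.2, 0.3
    theta_a, theta_b = 4.0, 2.0
    t = np.arange(0.0, t_end, dt)
    xa, xb = np.zeros_like(t), np.zeros_like(t)
    for i in range(1, len(t)):
        dxa = k_a * (1.0 - step(xb[i - 1], theta_b)) - g_a * xa[i - 1]
        dxb = k_b * step(xa[i - 1], theta_a) - g_b * xb[i - 1]
        xa[i] = xa[i - 1] + dt * dxa
        xb[i] = xb[i - 1] + dt * dxb
    return t, xa, xb

t, xa, xb = simulate_pl_grn()   # the negative feedback loop produces switching dynamics
```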

Journal ArticleDOI
TL;DR: This study presents Short Quartet Puzzling, a new quartet-based phylogeny reconstruction algorithm, and demonstrates the improved topological accuracy of the new method over maximum parsimony and neighbor joining, disproving the conjecture of Ranwez and Gascuel.
Abstract: Quartet-based phylogeny reconstruction methods, such as Quartet Puzzling, were introduced in the hope that they might be competitive with maximum likelihood methods, without being as computationally intensive. However, despite the numerous quartet-based methods that have been developed, their performance in simulation has been disappointing. In particular, Ranwez and Gascuel, the developers of one of the best quartet methods, conjecture that quartet-based methods have inherent limitations that make them unable to produce trees as accurate as neighbor joining or maximum parsimony. In this paper, we present Short Quartet Puzzling, a new quartet-based phylogeny reconstruction algorithm, and we demonstrate the improved topological accuracy of the new method over maximum parsimony and neighbor joining, disproving the conjecture of Ranwez and Gascuel. We also show a dramatic improvement over Quartet Puzzling. Thus, while our new method is not compared to any ML method (as it is not expected to be as accurate as...

Journal ArticleDOI
TL;DR: GBP is an effective means for computing free energy in all-atom models of protein structures and is also efficient, taking a few minutes to run on a typical sized protein, further suggesting that GBP may be an attractive alternative to more costly molecular dynamic simulations for some tasks.
Abstract: We present a technique for approximating the free energy of protein structures using generalized belief propagation (GBP). The accuracy and utility of these estimates are then demonstrated in two different application domains. First, we show that the entropy component of our free energy estimates can be useful in distinguishing native protein structures from decoys: structures with similar internal energy to that of the native structure, but otherwise incorrect. Our method is able to correctly identify the native fold from among a set of decoys with 87.5% accuracy over a total of 48 different immunoglobulin folds. The remaining 12.5% of native structures are ranked among the top four of all structures. Second, we show that our estimates of ΔΔG upon mutation for three different data sets have linear correlations of 0.63-0.70 with experimental measurements and statistically significant p-values. Together, these results suggest that GBP is an effective means for computing free energy in all-atom models of protein structures. GBP is also efficient, taking a few minutes to run on a typical-sized protein, further suggesting that GBP may be an attractive alternative to more costly molecular dynamics simulations for some tasks.

Journal ArticleDOI
TL;DR: This paper addresses the problem of discovering novel non-coding RNA (ncRNA) using primary sequence, and secondary structure conservation, focusing on ncRNA families with pseudoknotted structures, and develops an efficient algorithm for computing an optimum structural alignment of an RNA sequence against a genomic substring.
Abstract: In this paper, we address the problem of discovering novel non-coding RNA (ncRNA) using primary sequence, and secondary structure conservation, focusing on ncRNA families with pseudoknotted structures. Our main technical result is an efficient algorithm for computing an optimum structural alignment of an RNA sequence against a genomic substring. This algorithm has two applications. First, by scanning a genome, we can identify novel (homologous) pseudoknotted ncRNA, and second, we can infer the secondary structure of the target aligned sequence. We test an implementation of our algorithm (PAL) and show that it has near-perfect behavior for predicting the structure of many known pseudoknots. Additionally, it can detect the true homologs with high sensitivity and specificity in controlled tests. We also use PAL to search entire viral genomes and the mouse genome for novel homologs of some viral and eukaryotic pseudoknots, respectively. In each case, we have found strong support for novel homologs.

Journal ArticleDOI
TL;DR: A Support Vector Machine (SVM) method is used that provides a well-founded way of estimating complex alignment models with hundreds of thousands of parameters and outperforms the generative alignment method SSALN, a highly accurate generative alignment model that incorporates structural information.
Abstract: Sequence to structure alignment is an important step in homology modeling of protein structures. Incorporation of features such as secondary structure, solvent accessibility, or evolutionary information improves sequence to structure alignment accuracy, but conventional generative estimation techniques for alignment models impose independence assumptions that make these features difficult to include in a principled way. In this paper, we overcome this problem using a Support Vector Machine (SVM) method that provides a well-founded way of estimating complex alignment models with hundreds of thousands of parameters. Furthermore, we show that the method can be trained using a variety of loss functions. In a rigorous empirical evaluation, the SVM algorithm outperforms the generative alignment method SSALN, a highly accurate generative alignment model that incorporates structural information. The alignment model learned by the SVM aligns 50% of the residues correctly and aligns over 70% of the residues within a shift of four positions.

Journal ArticleDOI
TL;DR: A probabilistic model is developed to approach two realistic scenarios regarding the singular haplotype reconstruction problem--the incompleteness and inconsistency that occurred in the DNA sequencing process to generate the input haplotype fragments, and the common practice used to generate synthetic data in experimental algorithm studies.
Abstract: In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem--the incompleteness and inconsistency that occurred in the DNA sequencing process to generate the input haplotype fragments, and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in linear time in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration.
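As a toy illustration of the reconstruction task (not the linear-time algorithms of the paper, and without their probabilistic guarantees), the sketch below alternates between assigning fragments to the closer of two working haplotypes and re-estimating each haplotype by a column-wise majority vote over its assigned fragments; '-' marks positions a fragment does not cover, and the seeding choice is arbitrary.

```python
def reconstruct_haplotypes(fragments, n_iter=10):
    """Toy reconstruction of two haplotypes from SNP fragments, each fragment a
    string over '0', '1', '-' ('-' = position not covered).  Fragments are
    repeatedly assigned to the closer working haplotype, and each haplotype is
    re-estimated as the column-wise majority vote of its fragments."""
    m = len(fragments[0])

    def vote(frags):
        hap = []
        for j in range(m):
            col = [f[j] for f in frags if f[j] != "-"]
            ones = col.count("1")
            hap.append("1" if col and ones * 2 >= len(col) else "0")
        return "".join(hap)

    def dist(f, h):
        return sum(1 for a, b in zip(f, h) if a != "-" and a != b)

    h1, h2 = fragments[0].replace("-", "0"), fragments[-1].replace("-", "1")
    for _ in range(n_iter):
        g1 = [f for f in fragments if dist(f, h1) <= dist(f, h2)]
        g2 = [f for f in fragments if dist(f, h1) > dist(f, h2)]
        if g1:
            h1 = vote(g1)
        if g2:
            h2 = vote(g2)
    return h1, h2

frags = ["01-0-", "-1101", "10-1-", "-0010", "011--", "1--10"]
print(reconstruct_haplotypes(frags))
```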

Journal ArticleDOI
TL;DR: A theoretical analysis of the error and computational complexity of the two most basic clustering algorithms previously applied in the context of biomolecular electrostatics is presented, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms in practical applications.
Abstract: In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calc...

Journal ArticleDOI
TL;DR: This paper proposes to compute the minimum number of breakpoints and the maximum number of adjacencies between two genomes in presence of duplications using two different approaches: an exact, generic 0-1 linear programming approach, and a collection of three heuristics.
Abstract: Comparing genomes of different species is a fundamental problem in comparative genomics. Recent research has resulted in the introduction of different measures between pairs of genomes: for example, reversal distance, number of breakpoints, and number of common or conserved intervals. However, classical methods used for computing such measures are seriously compromised when genomes have several copies of the same gene scattered across them. Most approaches to overcome this difficulty are based either on the exemplar model, which keeps exactly one copy in each genome of each duplicated gene, or on the maximum matching model, which keeps as many copies as possible of each duplicated gene. The goal is to find an exemplar matching, respectively a maximum matching, that optimizes the studied measure. Unfortunately, it turns out that, in presence of duplications, this problem for each above-mentioned measure is NP-hard. In this paper, we propose to compute the minimum number of breakpoints and the maximum number of adjacencies between two genomes in presence of duplications using two different approaches. The first one is an exact, generic 0-1 linear programming approach, while the second is a collection of three heuristics. Each of these approaches is applied on each problem and for each of the following models: exemplar, maximum matching and intermediate model, that we introduce here. All these programs are run on a well-known public benchmark dataset of gamma-Proteobacteria, and their performances are discussed.
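For genomes without duplicated genes the two measures are elementary to compute, as the sketch below shows for signed gene orders; the difficulty addressed in the paper is choosing, in the presence of duplicates, the matching between gene copies (exemplar, maximum, or intermediate) that optimizes the measure, which is what the 0-1 linear program and the heuristics do.

```python
def adjacencies_and_breakpoints(genome_a, genome_b):
    """Count conserved adjacencies and breakpoints between two signed gene
    orders (single linear chromosomes, every gene appearing exactly once).
    The pair (a, b) is conserved if (a, b) or (-b, -a) is consecutive in the
    other genome.  With duplicated genes one must first fix a matching between
    copies, which is the NP-hard part addressed in the paper."""
    pairs_b = set()
    for x, y in zip(genome_b, genome_b[1:]):
        pairs_b.add((x, y))
        pairs_b.add((-y, -x))
    adjacencies = sum((x, y) in pairs_b for x, y in zip(genome_a, genome_a[1:]))
    breakpoints = (len(genome_a) - 1) - adjacencies
    return adjacencies, breakpoints

# genome_b differs from genome_a by one reversal of the segment (2, 3)
genome_a = [1, 2, 3, 4, 5]
genome_b = [1, -3, -2, 4, 5]
print(adjacencies_and_breakpoints(genome_a, genome_b))   # (2, 2)
```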

Journal ArticleDOI
TL;DR: The HFold algorithm uses two-phase energy minimization to predict hierarchically formed secondary structures in O(n^3) time, matching the complexity of the best algorithms for pseudoknot-free secondary structure prediction via energy minimization.
Abstract: Algorithms for prediction of RNA secondary structure—the set of base pairs that form when an RNA molecule folds—are valuable to biologists who aim to understand RNA structure and function. Improving the accuracy and efficiency of prediction methods is an ongoing challenge, particularly for pseudoknotted secondary structures, in which base pairs overlap. This challenge is biologically important, since pseudoknotted structures play essential roles in functions of many RNA molecules, such as splicing and ribosomal frameshifting. State-of-the-art methods, which are based on free energy minimization, have high run-time complexity (typically Θ(n5) or worse), and can handle (minimize over) only limited types of pseudoknotted structures. We propose a new approach for prediction of pseudoknotted structures, motivated by the hypothesis that RNA structures fold hierarchically, with pseudoknot-free (non-overlapping) base pairs forming first, and pseudoknots forming later so as to minimize energy relative to the folde...

Journal ArticleDOI
TL;DR: Lower bounds are given for the rearrangement distance between linear genomes and for the breakpoint re-use rate as functions of the number and proportion of transpositions.
Abstract: Multi-break rearrangements break a genome into multiple fragments and further glue them together in a new order. While 2-break rearrangements represent standard reversals, fusions, fissions, and translocations, 3-break rearrangements represent a natural generalization of transpositions. Alekseyev and Pevzner (2007a, 2008a) studied multi-break rearrangements in circular genomes and further applied them to the analysis of chromosomal evolution in mammalian genomes. In this paper, we extend these results to the more difficult case of linear genomes. In particular, we give lower bounds for the rearrangement distance between linear genomes and for the breakpoint re-use rate as functions of the number and proportion of transpositions. We further use these results to analyze comparative genomic architecture of mammalian genomes.

Journal ArticleDOI
TL;DR: A unified pathway-analysis method that can be used for diverse phenotypes including binary, multiclass, continuous, count, rate, and censored survival phenotypes is proposed.
Abstract: Pathway analysis of microarray data evaluates gene expression profiles of a priori defined biological pathways in association with a phenotype of interest. We propose a unified pathway-analysis method that can be used for diverse phenotypes including binary, multiclass, continuous, count, rate, and censored survival phenotypes. The proposed method also allows covariate adjustments and correlation in the phenotype variable that is encountered in longitudinal, cluster-sampled, and paired designs. These are accomplished by combining the regression-based test statistic for each individual gene in a pathway of interest into a pathway-level test statistic. Applications of the proposed method are illustrated with two real pathway-analysis examples: one evaluating relapse-associated gene expression involving a matched-pair binary phenotype in children with acute lymphoblastic leukemia; and the other investigating gene expression in breast cancer tissues in relation to patients' survival (a censored survival phenotype).

Journal ArticleDOI
TL;DR: This study applies frequency analysis to DNA sequences using the proposed representation method, demonstrating possible applications in coding region prediction and sequence analysis, and shows that the performance of the optimized predictor is comparable to that of other popular methods.
Abstract: Graphical representation of DNA sequences provides a simple and intuitive way of viewing, anchoring, and comparing various gene structures, so a simple and non-degenerate method is attractive to both biologists and computational biologists. In this study, a universal graphical representation method for DNA sequences based on S.S.-T. Yau's method is presented. The method adopts a trigonometric function to represent the four nucleotides A, G, C, and T. Some interesting characteristics of the universal representation are introduced. We exploit frequency analysis with our representation method on DNA sequences, demonstrating possible applications in coding region prediction, and sequence analysis. Based on the statistically experimental results from this frequency analysis, a simple coding region predictor and an optimized one are presented. An experiment on the broadly accepted ROSETTA data set demonstrates that the performance of the optimized predictor is comparable to that of other popular methods.
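The frequency-analysis step can be illustrated generically: map the sequence to numeric indicator signals, take a Fourier transform, and measure the power at period three, which tends to be elevated in coding regions. The sketch below uses plain 0/1 indicator sequences rather than the trigonometric representation of the paper, so it only conveys the flavor of the predictor, and the scoring choice is an assumption.

```python
import numpy as np

def period3_signal(seq):
    """Strength of the 3-base periodicity of a DNA sequence: the combined power
    of the four nucleotide indicator sequences at frequency N/3, relative to
    the average power.  Coding regions tend to give a high value; this is a
    generic formulation, not the exact representation used in the paper."""
    n = len(seq)
    power = np.zeros(n)
    for base in "ACGT":
        indicator = np.array([1.0 if c == base else 0.0 for c in seq.upper()])
        power += np.abs(np.fft.fft(indicator)) ** 2
    return power[n // 3] / power[1:].mean()    # exclude the DC component

# a toy "coding-like" sequence with a strong codon-position bias
coding_like = "ATGGCTGCAGCGGCAGCTGCGGCAGCTGCC" * 5
print(period3_signal(coding_like))             # noticeably above 1 for periodic input
```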

Journal ArticleDOI
TL;DR: The coherence that is observed in the human haplotypes as patterns is exploited and a network model of patterns is presented to reconstruct the ancestral recombination graph (ARG) from a collection of extant sequences.
Abstract: Traditionally, the nonrecombinant parts of the genome, i.e., mtDNA or the Y chromosome, have been used for phylogeography, notably for ease of analysis. The topology of the phylogeny in this case is an acyclic graph, often a tree, which is easy to comprehend and relatively easy to infer. However, recombination is an undeniable genetic fact for most of the genome. Driven by the need for a more complete analysis, we address the problem of estimating the ancestral recombination graph (ARG) from a collection of extant sequences. We exploit the coherence that is observed in the human haplotypes as patterns and present a network model of patterns to reconstruct the ARG. We test our model on simulations that closely mimic the observed haplotypes and observe promising results.

Journal ArticleDOI
Marshall Bern, David E. Goldberg
TL;DR: A new program called ComByne is described for scoring and ranking higher-level identifications for peptide tandem mass spectra, which corrects for protein lengths and makes use of more information, such as retention times and spectrum-to-spectrum corroborations.
Abstract: There are a number of computational tools for assigning identifications to peptide tandem mass spectra, but only a few tools, most notably ProteinProphet, for the crucial next step of integrating peptide identifications into higher-level identifications, such as proteins or modification sites. Here we describe a new program called ComByne for scoring and ranking higher-level identifications. Unlike other identification integration tools, ComByne corrects for protein lengths; it also makes use of more information, such as retention times and spectrum-to-spectrum corroborations. We compare ComByne to existing algorithms on several complex biological samples, including a sample of mouse blood plasma spiked with known concentrations of human proteins. On our samples, the combination of ComByne with our database search tool ByOnic is more sensitive than the combinations of Mascot with ProteinProphet and SEQUEST with DTASelect, with over 40% more proteins identified at 1% false discovery rate. A Web interface to our software is at http://bio.parc.xerox.com.

Journal ArticleDOI
TL;DR: A new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions is presented and an information criterion similar to the Akaike Information Criterion for determining the number of classes is described.
Abstract: Evolutionary conservation is an important indicator of function and a major component of bioinformatics methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions.

Journal ArticleDOI
TL;DR: This work draws random samples directly from a well chosen, importance-sampling probability distribution to approximate alignment score significance, and shows that the extreme value significance statistic for the local alignment model that is examined does not follow a Gumbel distribution.
Abstract: Measurement of the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well chosen, importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 ± 0.3) × 10^(−1314). Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.
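The general principle (sample from a well-chosen tilted proposal, then reweight by the likelihood ratio) can be illustrated on a toy score model. The sketch below estimates a far-tail probability for a sum of i.i.d. scores by exponential tilting; it is only an illustration of importance sampling for tail probabilities, not the alignment-specific sampler of the paper, and all names and parameter values are assumptions.

```python
import math, random

def tail_pvalue(values, probs, n, threshold, theta, n_samples=5000, seed=1):
    """Importance-sampling estimate of P(sum of n i.i.d. scores >= threshold).
    Samples are drawn from the exponentially tilted proposal
    q(v) proportional to p(v) * exp(theta * v) and reweighted by the
    likelihood ratio p^n / q^n = exp(-theta * s) * M(theta)^n."""
    rng = random.Random(seed)
    M = sum(p * math.exp(theta * v) for v, p in zip(values, probs))
    q = [p * math.exp(theta * v) / M for v, p in zip(values, probs)]
    log_weights = []
    for _ in range(n_samples):
        s = sum(rng.choices(values, weights=q, k=n))
        if s >= threshold:
            log_weights.append(-theta * s + n * math.log(M))
    if not log_weights:
        return 0.0
    top = max(log_weights)                       # log-sum-exp for numerical safety
    return math.exp(top) * sum(math.exp(w - top) for w in log_weights) / n_samples

# P(sum of 1000 scores >= 200) for a score that is +1 w.p. 0.25 and -1 w.p. 0.75:
# far beyond the reach of naive Monte Carlo, but cheap with a tilted proposal
print(tail_pvalue([1, -1], [0.25, 0.75], n=1000, threshold=200, theta=0.75))
```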

Journal ArticleDOI
TL;DR: It is shown that formal concept analysis is able to find a list of candidate genes for inclusion into a partially known basic network that can be reduced by a statistical analysis so that the resulting genes interact strongly with the basic network and therefore should be included when modeling the network.
Abstract: In order to understand the behavior of a gene regulatory network, it is essential to know the genes that belong to it. Identifying the correct members (e.g., in order to build a model) is a difficult task even for small subnetworks. Usually only a few members of a network are known, and one needs to guess the missing members based on experience or informed speculation. It is beneficial if one can additionally rely on experimental data to support this guess. In this work we present a new method based on formal concept analysis to detect unknown members of a gene regulatory network from gene expression time series data. We show that formal concept analysis is able to find a list of candidate genes for inclusion into a partially known basic network. This list can then be reduced by a statistical analysis so that the resulting genes interact strongly with the basic network and therefore should be included when modeling the network. The method has been applied to the DNA repair system of Mycobacterium tuberculosis.