
Showing papers in "Journal of Computational Biology in 2004"


Journal ArticleDOI
TL;DR: The best models outperform previous probabilistic models in the discrimination of human 5' and 3' splice sites from decoys, and mechanistically motivated ways of comparing models are discussed.
Abstract: We propose a framework for modeling sequence motifs based on the maximum entropy principle (MEP). We recommend approximating short sequence motif distributions with the maximum entropy distribution (MED) consistent with low-order marginal constraints estimated from available data, which may include dependencies between nonadjacent as well as adjacent positions. Many maximum entropy models (MEMs) are specified by simply changing the set of constraints. Such models can be utilized to discriminate between signals and decoys. Classification performance using different MEMs gives insight into the relative importance of dependencies between different positions. We apply our framework to large datasets of RNA splicing signals. Our best models outperform previous probabilistic models in the discrimination of human 5' (donor) and 3' (acceptor) splice sites from decoys. Finally, we discuss mechanistically motivated ways of comparing models.

1,667 citations
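
To make the maximum-entropy construction above concrete, here is a minimal Python sketch, assuming a toy motif of length 4 and iterative proportional fitting over an exhaustively enumerated sequence space; the constraint set PAIRS, the helper names, and the toy data are illustrative stand-ins, not the paper's.

    import itertools
    import numpy as np

    BASES = "ACGT"
    L = 4
    PAIRS = [(0, 1), (2, 3), (0, 3)]  # constrained position pairs (toy choice)

    def pair_marginals(seqs, i, j):
        """Empirical joint distribution of the bases at positions i and j."""
        m = np.zeros((4, 4))
        for s in seqs:
            m[BASES.index(s[i]), BASES.index(s[j])] += 1
        return m / m.sum()

    def fit_maxent(seqs, n_iter=200):
        """Iterative proportional fitting: start from the uniform distribution
        and rescale until each constrained pairwise marginal matches its
        empirical estimate."""
        space = ["".join(p) for p in itertools.product(BASES, repeat=L)]
        q = np.full(len(space), 1.0 / len(space))
        targets = {(i, j): pair_marginals(seqs, i, j) for i, j in PAIRS}
        for _ in range(n_iter):
            for (i, j), target in targets.items():
                current = np.zeros((4, 4))
                for k, s in enumerate(space):
                    current[BASES.index(s[i]), BASES.index(s[j])] += q[k]
                for k, s in enumerate(space):
                    a, b = BASES.index(s[i]), BASES.index(s[j])
                    if current[a, b] > 0:
                        q[k] *= target[a, b] / current[a, b]
                q /= q.sum()
        return dict(zip(space, q))

    signals = ["ACGT", "ACGA", "TCGT", "ACCT", "ACGT", "GCGT"]  # toy signals
    model = fit_maxent(signals)
    print(max(model, key=model.get))  # most probable sequence under the MED

Changing which position pairs are constrained changes the model, which is exactly how the framework probes the relative importance of different positional dependencies.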


Journal ArticleDOI
TL;DR: This paper presents experimental results indicating that higher-order states, autocorrelated rates, and multiple functional categories all lead to significant improvements in the fit of a combined phylogenetic and hidden Markov model, with the effect of higher-order states being particularly pronounced.
Abstract: A few models have appeared in recent years that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way the process changes from one site to the next. These models combine phylogenetic models of molecular evolution, which apply to individual sites, and hidden Markov models, which allow for changes from site to site. Besides improving the realism of ordinary phylogenetic models, they are potentially very powerful tools for inference and prediction--for example, for gene finding or prediction of secondary structure. In this paper, we review progress on combined phylogenetic and hidden Markov models and present some extensions to previous work. Our main result is a simple and efficient method for accommodating higher-order states in the HMM, which allows for context-dependent models of substitution--that is, models that consider the effects of neighboring bases on the pattern of substitution. We present experimental results indicating that higher-order states, autocorrelated rates, and multiple functional categories all lead to significant improvements in the fit of a combined phylogenetic and hidden Markov model, with the effect of higher-order states being particularly pronounced.

274 citations
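
The higher-order-state trick can be illustrated with a small sketch, assuming (unlike the paper, which conditions full phylogenetic substitution models on such states) a plain two-category HMM over a single observed DNA strand: because the previous base is visible, only the functional category stays hidden, and the forward recursion needs nothing beyond a context-dependent emission lookup. All probabilities below are invented for illustration.

    import itertools
    import numpy as np

    BASES = "ACGT"
    CATS = ["coding", "noncoding"]
    rng = np.random.default_rng(0)
    # Emission P(base | category, previous base): the "higher-order" context.
    emit = {(c, p): rng.dirichlet(np.ones(4))
            for c, p in itertools.product(CATS, BASES)}
    stay = 0.9  # probability of keeping the same category between sites

    def forward_loglik(seq):
        trans = np.array([[stay, 1 - stay], [1 - stay, stay]])  # 2 categories
        alpha = np.full(len(CATS), 1.0 / len(CATS))
        loglik = 0.0
        for t in range(1, len(seq)):
            e = np.array([emit[(c, seq[t - 1])][BASES.index(seq[t])]
                          for c in CATS])
            alpha = (alpha @ trans) * e
            loglik += np.log(alpha.sum())
            alpha /= alpha.sum()  # rescale to avoid underflow
        return loglik

    print(forward_loglik("ACGGTTACGATCGG"))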


Journal ArticleDOI
TL;DR: A new framework for inferring models of transcriptional regulation, which is based on annotated molecular interaction graphs, is developed and successfully predicts gene knock-out effects with a high degree of accuracy in a cross-validation setting.
Abstract: We develop a new framework for inferring models of transcriptional regulation. The models, which we call physical network models, are annotated molecular interaction graphs. The attributes in the model correspond to verifiable properties of the underlying biological system such as the existence of protein–protein and protein–DNA interactions, the directionality of signal transduction in protein–protein interactions, as well as signs of the immediate effects of these interactions. Possible configurations of these variables are constrained by the available data sources. Some of the data sources, such as factor-binding data, involve measurements that are directly tied to the variables in the model. Other sources, such as gene knock-outs, are functional in nature and provide only indirect evidence about the variables. We associate each observed knock-out effect in the deletion mutant data with a set of causal paths (molecular cascades) that could in principle explain the effect, resulting in aggregate constraints...

184 citations


Journal ArticleDOI
TL;DR: An integrated probabilistic model to combine protein physical interactions, genetic interactions, highly correlated gene expression networks, protein complex data, and domain structures of individual proteins to predict protein functions is developed.
Abstract: We develop an integrated probabilistic model to combine protein physical interactions, genetic interactions, highly correlated gene expression networks, protein complex data, and domain structures of individual proteins to predict protein functions. The model is an extension of our previous model for protein function prediction based on Markovian random field theory. The model is flexible in that other protein pairwise relationship information and features of individual proteins can be easily incorporated. Two features distinguish the integrated approach from other available methods for protein function prediction. One is that the integrated approach uses all available sources of information with different weights for different sources of data. It is a global approach that takes the whole network into consideration. The second feature is that the posterior probability that a protein has the function of interest is assigned. The posterior probability indicates how confident we are about assigning the function to the protein. We apply our integrated approach to predict functions of yeast proteins based upon MIPS protein function classifications and upon the interaction networks based on MIPS physical and genetic interactions, gene expression profiles, tandem affinity purification (TAP) protein complex data, and protein domain information. We study the recall and precision of the integrated approach using different sources of information by the leave-one-out approach. In contrast to using MIPS physical interactions only, the integrated approach combining all of the information increases the recall from 57% to 87% when the precision is set at 57% – an increase of 30 percentage points.

169 citations


Journal ArticleDOI
TL;DR: Although this measure is in principle computationally hard to optimize, it is shown how it can in fact be computed with great accuracy for related proteins by integer linear programming techniques and effective heuristics, such as local search and genetic algorithms.
Abstract: Protein structure comparison is a fundamental problem for structural genomics, with applications to drug design, fold prediction, protein clustering, and evolutionary studies. Despite its importance, there are very few rigorous methods and widely accepted similarity measures known for this problem. In this paper we describe the last few years of developments on the study of an emerging measure, the contact map overlap (CMO), for protein structure comparison. A contact map is a list of pairs of residues which lie in three-dimensional proximity in the protein's native fold. Although this measure is in principle computationally hard to optimize, we show how it can in fact be computed with great accuracy for related proteins by integer linear programming techniques. These methods have the advantage of providing certificates of near-optimality by means of upper bounds to the optimal alignment value. We also illustrate effective heuristics, such as local search and genetic algorithms. We were able to obtain for...

144 citations
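
For concreteness, here is a minimal Python sketch of contact maps and the CMO objective, assuming two toy structures given as Nx3 coordinate arrays and a fixed residue correspondence; the paper's actual contribution is optimizing over alignments via integer linear programming and heuristics, which this sketch does not attempt.

    import numpy as np

    def contact_map(coords, threshold=8.0):
        """Residue pairs (i < j, |i - j| > 1) within `threshold` of each other."""
        n = len(coords)
        d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        return {(i, j) for i in range(n) for j in range(i + 2, n)
                if d[i, j] <= threshold}

    def overlap(cm_a, cm_b, align):
        """Contacts (i, j) of A whose images (align[i], align[j]) are contacts
        of B -- the quantity that CMO maximizes over alignments."""
        return sum((align.get(i), align.get(j)) in cm_b for i, j in cm_a)

    rng = np.random.default_rng(1)
    a = np.cumsum(rng.normal(size=(30, 3)), axis=0)   # toy chain A
    b = a + rng.normal(scale=0.5, size=a.shape)       # perturbed copy as B
    ident = {i: i for i in range(30)}                 # identity alignment
    print(overlap(contact_map(a), contact_map(b), ident))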


Journal ArticleDOI
TL;DR: This work reports the use of partial least squares regression as an analytical method to glean signal-response relationships from heterogeneous multivariate signaling data collected from HT-29 human colon carcinoma cells stimulated to undergo programmed cell death.
Abstract: Biological signaling networks process extracellular cues to control important cell decisions such as death–survival, growth–quiescence, and proliferation–differentiation. After receptor activation, intracellular signaling proteins change in abundance, modification state, and enzymatic activity. Many of the proteins in signaling networks have been identified, but it is not known how signaling molecules work together to control cell decisions. To begin to address this issue, we report the use of partial least squares regression as an analytical method to glean signal–response relationships from heterogeneous multivariate signaling data collected from HT-29 human colon carcinoma cells stimulated to undergo programmed cell death. By partial least squares modeling, we relate dynamic and quantitative measurements of 20–30 intracellular signals to cell survival after treatment with tumor necrosis factor alpha (a death factor) and insulin (a survival factor). We find that partial least squares models can distinguish...

127 citations
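
A minimal sketch of the partial least squares step, assuming scikit-learn and synthetic stand-ins for the paper's measurements: rows are treatment conditions, columns are dynamic signaling measurements, and y is a survival readout. The dimensions and the choice of two latent components are illustrative only.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(24, 25))                # 24 conditions x 25 signals
    w = rng.normal(size=25)
    y = X @ w + rng.normal(scale=0.5, size=24)   # survival response

    pls = PLSRegression(n_components=2)
    pls.fit(X, y)
    print("R^2 on training data:", pls.score(X, y))
    # The first-component weights point at the signals that carry most of
    # the signal-response relationship.
    print("top signals:", np.argsort(-np.abs(pls.x_weights_[:, 0]))[:5])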


Journal ArticleDOI
TL;DR: The mathematical and algorithmic results underpinning the analysis of the genome sequences of S. paradoxus, S. mikatae, and S. bayanus are described, demonstrating the power of comparative genomics to further the understanding of any species.
Abstract: In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae, and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genomewide comparative analysis allowed the identification of functionally important sequences, both coding and noncoding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. (1) We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. (2) We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve...

111 citations


Journal ArticleDOI
TL;DR: An order of magnitude improvement in run times of likelihood computations is demonstrated using stochastic greedy algorithms for optimizing the order of conditioning and summation operations in genetic linkage analysis.
Abstract: Genetic linkage analysis is a challenging application which requires Bayesian networks consisting of thousands of vertices. Consequently, computing the probability of data, which is needed for learning linkage parameters, using exact computation procedures calls for an extremely efficient implementation that carefully optimizes the order of conditioning and summation operations. In this paper, we present the use of stochastic greedy algorithms for optimizing this order. Our algorithm has been incorporated into the newest version of SUPERLINK, which is a fast genetic linkage program for exact likelihood computations in general pedigrees. We demonstrate an order of magnitude improvement in run times of likelihood computations using our new optimization algorithm and hence enlarge the class of problems that can be handled effectively by exact computations.

100 citations
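
A minimal sketch of the stochastic greedy idea, assuming the classic min-fill heuristic with random tie-breaking and restarts as a stand-in for SUPERLINK's richer cost model; the graph is the moralized Bayesian network, given here as an adjacency dict.

    import random

    def fill_in(adj, v):
        """Edges that eliminating v would add among its neighbors."""
        nbrs = list(adj[v])
        return [(a, b) for i, a in enumerate(nbrs) for b in nbrs[i + 1:]
                if b not in adj[a]]

    def stochastic_min_fill(adj, rng):
        adj = {v: set(n) for v, n in adj.items()}   # work on a copy
        order, total_fill = [], 0
        while adj:
            # Greedy choice with random tie-breaking: the stochastic element.
            best = min(adj, key=lambda v: (len(fill_in(adj, v)), rng.random()))
            new_edges = fill_in(adj, best)
            total_fill += len(new_edges)
            for a, b in new_edges:
                adj[a].add(b); adj[b].add(a)
            for n in adj[best]:
                adj[n].discard(best)
            del adj[best]
            order.append(best)
        return order, total_fill

    g = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5}, 4: {2, 5}, 5: {3, 4}}
    # Restart many times and keep the cheapest elimination order found.
    order, cost = min((stochastic_min_fill(g, random.Random(s))
                       for s in range(20)), key=lambda t: t[1])
    print(order, "fill-in edges:", cost)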


Journal ArticleDOI
TL;DR: Using real-world SNP data, this work proposes a comprehensive statistical model of haplotype block variation and shows how the parameters of this model can be learned from haplotypes and/or unphased genotype data.
Abstract: The haplotype block structure of SNP variation in human DNA has been demonstrated by several recent studies. The presence of haplotype blocks can be used to dramatically increase the statistical power of genetic mapping. Several criteria have already been proposed for identifying these blocks, all of which require haplotypes as input. We propose a comprehensive statistical model of haplotype block variation and show how the parameters of this model can be learned from haplotypes and/or unphased genotype data. Using real-world SNP data, we demonstrate that our approach can be used to resolve genotypes into their constituent haplotypes with greater accuracy than previously known methods.

90 citations


Journal ArticleDOI
TL;DR: This model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest, and is more appropriate for the gene expression domain than other structurally similar Bayesian network classification models.
Abstract: We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the absence of prior expert knowledge. Our classifiers are developed under a cross validation regimen and then validated on corresponding out-of-sample test sets. The classifiers attain a classification rate in excess of 90% on out-of-sample test sets for two publicly available datasets. We present an extensive compilation of results reported in the literature for other classification methods run against these same two datasets. Our results are comparable to, or better than, any we have found reported for these two sets, when a train-test protocol as stringent as ours is followed.

88 citations


Journal ArticleDOI
TL;DR: An approach for the exploratory analysis of gene expression data, based upon blind source separation techniques, that yields "elementary expression patterns" (the "sources"), which may be interpreted as potential regulation pathways.
Abstract: We develop an approach for the exploratory analysis of gene expression data, based upon blind source separation techniques. This approach exploits higher-order statistics to identify a linear model for (logarithms of) expression profiles, described as linear combinations of "independent sources." As a result, it yields "elementary expression patterns" (the "sources"), which may be interpreted as potential regulation pathways. Further analysis of the so-obtained sources shows that they are generally characterized by a small number of specific coexpressed or antiexpressed genes. In addition, the projections of the expression profiles onto the estimated sources often provide significant clustering of conditions. The algorithm relies on a large number of runs of "independent component analysis" with random initializations, followed by a search of "consensus sources." It then provides estimates for independent sources, together with an assessment of their robustness. The results obtained on two datasets (namely, breast cancer data and Bacillus subtilis sulfur metabolism data) show that some of the obtained gene families correspond to well known families of coregulated genes, which validates the proposed approach.
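
A minimal sketch of the restart-and-compare step, assuming scikit-learn's FastICA on a synthetic log-expression matrix; the paper clusters components from many random restarts into consensus sources, whereas this sketch only measures component stability across a few runs.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    S = rng.laplace(size=(3, 200))                # 3 hidden "pathways" x 200 genes
    A = rng.normal(size=(10, 3))                  # mixing: 10 conditions
    X = A @ S + 0.1 * rng.normal(size=(10, 200))  # observed log-expression

    runs = []
    for s in range(5):  # several ICA runs with different random starts
        ica = FastICA(n_components=3, random_state=s, max_iter=1000)
        runs.append(ica.fit_transform(X.T).T)     # (3 sources, 200 genes)

    for comp in runs[0]:
        # Worst-case best match of this component across the other runs:
        stab = min(max(abs(np.corrcoef(comp, other)[0, 1]) for other in run)
                   for run in runs[1:])
        print(f"component stability: {stab:.2f}")

Components that reappear (up to sign) in every run are the natural candidates for "consensus sources."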

Journal ArticleDOI
TL;DR: This work proposes a novel clustering strategy based on minimizing mutual information among gene clusters by combining the mutual information criterion and traditional distance criteria such as the Euclidean distance and the fuzzy membership metric in designing the clustering algorithm.
Abstract: Cluster analysis of gene-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and constructing gene regulatory networks. The motivation for considering mutual information is its capacity to measure a general dependence among gene random variables. We propose a novel clustering strategy based on minimizing mutual information among gene clusters. Simulated annealing is employed to solve the optimization problem. Bootstrap techniques are employed to get more accurate estimates of mutual information when the data sample size is small. Moreover, we propose to combine the mutual information criterion and traditional distance criteria such as the Euclidean distance and the fuzzy membership metric in designing the clustering algorithm. The performances of the new clustering methods are compared with those of some existing methods, using both synthesized data and experimental data. It is seen that the clustering algorithms...
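
A minimal sketch of the objective and the annealing loop, assuming mutual information between clusters is estimated by binning the cluster mean profiles; the paper's estimators (with bootstrap correction) and its hybrid distance/MI criteria are more careful, and the cooling schedule here is arbitrary.

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def cluster_mi(X, labels, k):
        """Sum of pairwise MI between binned cluster mean profiles."""
        means = [X[labels == c].mean(axis=0) for c in range(k)]
        binned = [np.digitize(m, np.quantile(m, [0.25, 0.5, 0.75]))
                  for m in means]
        return sum(mutual_info_score(binned[i], binned[j])
                   for i in range(k) for j in range(i + 1, k))

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=mu, size=(20, 12)) for mu in (0, 2, 4)])
    k = 3
    labels = rng.permutation(np.repeat(np.arange(k), 20))  # random, nonempty
    temp, cost = 1.0, cluster_mi(X, labels, k)
    for step in range(3000):  # simulated annealing over reassignments
        g, new = rng.integers(60), rng.integers(k)
        trial = labels.copy(); trial[g] = new
        if np.min(np.bincount(trial, minlength=k)) == 0:
            continue                                  # keep clusters nonempty
        c2 = cluster_mi(X, trial, k)
        if c2 < cost or rng.random() < np.exp((cost - c2) / temp):
            labels, cost = trial, c2                  # accept move
        temp *= 0.999                                 # cool down
    print("final MI objective:", round(cost, 3))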

Journal ArticleDOI
TL;DR: A new algorithm is described for inferring reticulate phylogenies from evolutionary distances among species by gradually improving upon the initial solution provided by a phylogenetic tree model.
Abstract: In many phylogenetic problems, assuming that species have evolved from a common ancestor by a simple branching process is unrealistic. Reticulate phylogenetic models, however, have been largely neglected because the concept of reticulate evolution has not been supported by appropriate analytical tools and software. The reticulate model can adequately describe such complicated mechanisms as hybridization between species or lateral gene transfer in bacteria. In this paper, we describe a new algorithm for inferring reticulate phylogenies from evolutionary distances among species. The algorithm is capable of detecting contradictory signals encompassed in a phylogenetic tree and identifying possible reticulate events that may have occurred during evolution. The algorithm produces a reticulate phylogeny by gradually improving upon the initial solution provided by a phylogenetic tree model. The new algorithm is compared to the popular SplitsGraph method in a reanalysis of the evolution of photosynthetic organisms...

Journal ArticleDOI
TL;DR: FlexProt performs a structural comparison of a pair of proteins 300 amino acids long in about seven seconds on a standard desktop PC (400 MHz Pentium II processor with 256 MB internal memory), making it highly efficient in practice.
Abstract: FlexProt is a novel technique for the alignment of flexible proteins. Unlike all previous algorithms designed to solve the problem of structural comparisons allowing hinge-bending motions, FlexProt...

Journal ArticleDOI
TL;DR: This approach exposes novel characterizations for the base set of motifs in a string, hinged on notions of partial order, which support the design of ad hoc data structures and constructs and lead to the development of an O(n^3)-time incremental discovery algorithm.
Abstract: We examine the problem of extracting maximal irredundant motifs from a string. A combinatorial argument poses a linear bound on the total number of such motifs, thereby opening the way to the quest for the fastest and most efficient methods of extraction. The basic paradigm explored here is that of iterated updates of the set of irredundant motifs in a string under consecutive unit symbol extensions of the string itself. This approach exposes novel characterizations for the base set of motifs in a string, hinged on notions of partial order. Such properties support the design of ad hoc data structures and constructs, and lead to the development of an O(n^3)-time incremental discovery algorithm.

Journal ArticleDOI
TL;DR: This paper considers the case of a pair of homologous human chromosomes, one from the mother and the other from the father, and describes a method for statistical reconstruction of haplotypes from a set of aligned SNP fragments.
Abstract: In this paper, we describe a method for statistical reconstruction of haplotypes from a set of aligned SNP fragments. We consider the case of a pair of homologous human chromosomes, one from the mother and the other from the father. After fragment assembly, we wish to reconstruct the two haplotypes of the parents. Given a set of potential SNP sites inferred from the assembly alignment, we wish to divide the fragment set into two subsets, each of which represents one chromosome. Our method is based on a statistical model of sequencing errors, compositional information, and haplotype memberships. We calculate probabilities of different haplotypes conditional on the alignment. Due to computational complexity, we first determine phases for neighboring SNPs. Then we connect them and construct haplotype segments. Also, we compute the accuracy or confidence of the reconstructed haplotypes. We discuss other issues, such as alternative methods, parameter estimation, computational efficiency, and relaxation of assumptions.

Journal ArticleDOI
TL;DR: An automated procedure for high-throughput NMR resonance assignment for a protein of known structure, or of a homologous structure, is reported and Nuclear Vector Replacement (NVR) is introduced to compute assignments that optimally correlate experimentally measured NH residual dipolar couplings (RDCs) to a given a priori whole-protein 3D structural model.
Abstract: High-throughput NMR structural biology can play an important role in structural genomics. We report an automated procedure for high-throughput NMR resonance assignment for a protein of known structure, or of a homologous structure. These assignments are a prerequisite for probing protein-protein interactions, protein-ligand binding, and dynamics by NMR. Assignments are also the starting point for structure determination and refinement. A new algorithm, called Nuclear Vector Replacement (NVR), is introduced to compute assignments that optimally correlate experimentally measured NH residual dipolar couplings (RDCs) to a given a priori whole-protein 3D structural model. The algorithm requires only uniform 15N-labeling of the protein and processes unassigned HN-15N HSQC spectra, HN-15N RDCs, and sparse HN-HN NOEs (dNNs), all of which can be acquired in a fraction of the time needed to record the traditional suite of experiments used to perform resonance assignments. NVR runs in minutes and efficiently assigns the (HN, 15N) backbone resonances as well as the dNNs of the 3D 15N-NOESY spectrum, in O(n^3) time. The algorithm is demonstrated on NMR data from a 76-residue protein, human ubiquitin, matched to four structures, including one mutant (homolog), determined either by x-ray crystallography or by different NMR experiments (without RDCs). NVR achieves an assignment accuracy of 92-100%. We further demonstrate the feasibility of our algorithm for different and larger proteins, using NMR data for hen lysozyme (129 residues, 97-100% accuracy) and streptococcal protein G (56 residues, 100% accuracy), matched to a variety of 3D structural models. Finally, we extend NVR to a second application, 3D structural homology detection, and demonstrate that NVR is able to identify structural homologies between proteins with remote amino acid sequences using a database of structural models.
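
The combinatorial core of NVR-style assignment can be sketched as a bipartite matching, assuming (unrealistically) a single scalar RDC per peak and per residue and a known alignment tensor, so that assignment reduces to minimizing the total mismatch between measured and back-computed values; the real algorithm fuses several data types.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    rng = np.random.default_rng(0)
    n = 40
    predicted = rng.normal(size=n)            # RDCs back-computed from the model
    true_perm = rng.permutation(n)            # hidden peak -> residue mapping
    measured = predicted[true_perm] + rng.normal(scale=0.05, size=n)

    cost = (measured[:, None] - predicted[None, :]) ** 2   # peak x residue
    rows, cols = linear_sum_assignment(cost)               # Hungarian algorithm
    accuracy = np.mean(cols == true_perm)
    print(f"correctly assigned peaks: {accuracy:.0%}")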

Journal ArticleDOI
TL;DR: A preliminary analysis of the results shows that promoters are associated with SIDD regions in a highly statistically significant manner, suggesting that SIDD attributes may prove useful in the computational prediction of promoter locations in prokaryotes.
Abstract: We present a method for calculating predicted locations and extents of stress-induced DNA duplex destabilization (SIDD) as functions of base sequence and stress level in long DNA molecules. The base pair denaturation energies are assigned individually, so the influences of near neighbors, methylated bases, adducts, or lesions can be included. Sample calculations indicate that copolymeric energetics give results that are close to those derived when full near-neighbor energetics are used; small but potentially informative differences occur only in the calculated SIDD properties of moderately destabilized regions. The method presented here for analyzing long sequences calculates the destabilization properties within windows of fixed length N, with successive windows displaced by an offset distance d_o. The final values of the relevant destabilization parameters for each base pair are calculated as weighted averages of the values computed for each window in which that base pair appears. This approach implicitly...

Journal ArticleDOI
TL;DR: A p-value-based scoring scheme using probability generating functions to evaluate the statistical significance of potential TFBSs and introduces the local genomic context into the model so that candidate sites are evaluated based both on their similarities to known binding sites and on their contrasts against their respective local genomic contexts.
Abstract: High-level eukaryotic genomes present a particular challenge to the computational identification of transcription factor binding sites (TFBSs) because of their long noncoding regions and large numbers of repeat elements. This is evidenced by the noisy results generated by most current methods. In this paper, we present a p-value-based scoring scheme using probability generating functions to evaluate the statistical significance of potential TFBSs. Furthermore, we introduce the local genomic context into the model so that candidate sites are evaluated based both on their similarities to known binding sites and on their contrasts against their respective local genomic contexts. We demonstrate that our approach is advantageous in the prediction of myogenin and MEF2 binding sites in the human genome. We also apply LMM to large-scale human binding site sequences in situ and find that, compared to current popular methods, LMM analysis can reduce false positive errors by more than 50% without compromising sensitivity...
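
A minimal sketch of the probability-generating-function idea, assuming integer-rounded scores for a toy weight matrix and an i.i.d. background: convolving the per-position score distributions yields the exact null distribution, and hence exact p-values, without enumerating all 4^L sequences. The local-genomic-context conditioning that distinguishes the paper's method is omitted here.

    from collections import Counter

    background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    # Toy position weight matrix: integer log-odds score per base per position.
    pwm = [{"A": 2, "C": -1, "G": -1, "T": 0},
           {"A": -2, "C": 3, "G": 0, "T": -1},
           {"A": 0, "C": -1, "G": 3, "T": -2}]

    dist = Counter({0: 1.0})          # PGF coefficients: P(score so far)
    for col in pwm:                   # one convolution per motif position
        nxt = Counter()
        for s, p in dist.items():
            for base, prob in background.items():
                nxt[s + col[base]] += p * prob
        dist = nxt

    def p_value(threshold):
        """P(score >= threshold) under the background model."""
        return sum(p for s, p in dist.items() if s >= threshold)

    print(p_value(6))   # e.g., significance of a candidate site scoring 6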

Journal ArticleDOI
TL;DR: The clustering of (binary) oligonucleotide fingerprints is formulated as a combinatorial optimization problem that attempts to identify clusters and resolve the missing values in the fingerprints simultaneously, and an efficient greedy algorithm based on MINIMUM CLIQUE PARTITION on graphs is presented.
Abstract: Oligonucleotide fingerprinting is a powerful DNA array-based method to characterize cDNA and ribosomal RNA gene (rDNA) libraries and has many applications including gene expression profiling and DNA clone classification. We are especially interested in the latter application. A key step in the method is the cluster analysis of fingerprint data obtained from DNA array hybridization experiments. Most of the existing approaches to clustering use (normalized) real intensity values and thus do not treat positive and negative hybridization signals equally (positive signals are much more emphasized). In this paper, we consider a discrete approach. Fingerprint data are first normalized and binarized using control DNA clones. Because there may exist unresolved (or missing) values in this binarization process, we formulate the clustering of (binary) oligonucleotide fingerprints as a combinatorial optimization problem that attempts to identify clusters and resolve the missing values in the fingerprints simultaneously. We study the computational complexity of this clustering problem and a natural parameterized version and present an efficient greedy algorithm based on MINIMUM CLIQUE PARTITION on graphs. The algorithm takes advantage of some unique properties of the graphs considered here, which allow us to efficiently find the maximum cliques as well as some special maximal cliques. Our preliminary experimental results on simulated and real data demonstrate that the algorithm runs faster and performs better than some popular hierarchical and graph-based clustering methods. The results on real data from DNA clone classification also suggest that this discrete approach is more accurate than clustering methods based on real intensity values in terms of separating clones that have different characteristics with respect to the given oligonucleotide probes.
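
A minimal sketch of the compatibility relation behind the clique-partition view, assuming binary fingerprints with 'N' marking unresolved values: two fingerprints are compatible if they agree wherever both are resolved, and a single greedy pass both groups fingerprints and resolves missing values against a running consensus. The paper's algorithm instead searches for maximum and special maximal cliques explicitly.

    def compatible(a, b):
        """Agree at every position where both values are resolved."""
        return all(x == y or "N" in (x, y) for x, y in zip(a, b))

    def merge(a, b):
        """Resolve missing values in a using b where possible."""
        return "".join(y if x == "N" else x for x, y in zip(a, b))

    def greedy_cluster(fps):
        clusters = []                 # each cluster: [consensus, members...]
        for fp in fps:
            for cl in clusters:
                if compatible(cl[0], fp):
                    cl[0] = merge(cl[0], fp)   # refine the consensus
                    cl.append(fp)
                    break
            else:
                clusters.append([fp, fp])
        return [(cl[0], cl[1:]) for cl in clusters]

    fps = ["10N1", "1011", "0N00", "0100", "10NN"]
    for consensus, members in greedy_cluster(fps):
        print(consensus, members)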

Journal ArticleDOI
TL;DR: This paper formalizes the abstract problem as a discovery problem called the πpattern problem and gives an algorithm that automatically discovers the clusters of patterns in multiple data sequences and introduces a notation for maximal patterns that drastically reduces the number of valid cluster patterns.
Abstract: Functionally related genes often appear in each other's neighborhood on the genome; however, the order of the genes may not be the same. These groups or clusters of genes may have an ancient evolutionary origin or may signify some other critical phenomenon and may also aid in function prediction of genes. Such gene clusters also aid toward solving the problem of local alignment of genes. Similarly, clusters of protein domains, albeit appearing in different orders in the protein sequence, suggest common functionality in spite of being nonhomologous. In this paper, we address the problem of automatically discovering clusters of entities, be they genes or domains: we formalize the abstract problem as a discovery problem called the πpattern problem and give an algorithm that automatically discovers the clusters of patterns in multiple data sequences. We take a model-less approach and introduce a notation for maximal patterns that drastically reduces the number of valid cluster patterns, without any loss of information...

Journal ArticleDOI
TL;DR: The task of learning a robust sparse hyperplane from such data is formulated as a second order cone program (SOCP).
Abstract: Molecular profiling studies can generate abundance measurements for thousands of transcripts, proteins, metabolites, or other species in, for example, normal and tumor tissue samples. Treating such measurements as features and the samples as labeled data points, sparse hyperplanes provide a statistical methodology for classifying data points into one of two categories (classification and prediction) and defining a small subset of discriminatory features (relevant feature identification). However, this and other extant classification methods address only implicitly the issue of observed data being a combination of underlying signals and noise. Recently, robust optimization has emerged as a powerful framework for handling uncertain data explicitly. Here, ideas from this field are exploited to develop robust sparse hyperplanes, i.e., classification and relevant feature identification algorithms that are resilient to variation in the data. Specifically, each data point is associated with an explicit data uncertainty model in the form of an ellipsoid parameterized by a center and covariance matrix. The task of learning a robust sparse hyperplane from such data is formulated as a second order cone program (SOCP). Gaussian and distribution-free data uncertainty models are shown to yield SOCPs that are equivalent to the SOCP based on ellipsoidal uncertainty. The real-world utility of robust sparse hyperplanes is demonstrated via retrospective analysis of breast cancer related transcript profiles. Data-dependent heuristics are used to compute the parameters of each ellipsoidal data uncertainty model. The generalization performance of a specific implementation, designated "robust LIKNON," is better than its nominal counterpart. Finally, the strengths and limitations of robust sparse hyperplanes are discussed.
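
A minimal sketch of a robust sparse hyperplane, assuming cvxpy and one shared spherical uncertainty ball of radius kappa for every data point (the paper estimates a per-point ellipsoid from the data). Sparsity comes from the l1 term in the objective; robustness enters as the second-order-cone term kappa*||w||_2 subtracted from each margin.

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 60, 20
    X = rng.normal(size=(n, d))
    w_true = np.zeros(d); w_true[:3] = [2.0, -1.5, 1.0]   # sparse ground truth
    y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))

    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)          # slack variables
    kappa, C = 0.3, 1.0
    # Margin must hold for every point inside its uncertainty ball:
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi + kappa * cp.norm(w, 2)]
    prob = cp.Problem(cp.Minimize(cp.norm(w, 1) + C * cp.sum(xi)), constraints)
    prob.solve()
    print("nonzero weights:", np.flatnonzero(np.abs(w.value) > 1e-4))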

Journal ArticleDOI
TL;DR: A generic branch-and-bound approach to efficient exact p-value computation is defined and the required conditions for successful application are enumerated, which constitutes a first practical exact improvement over the exhaustive enumeration performed by existing statistical software.
Abstract: A major obstacle in applying various hypothesis testing procedures to datasets in bioinformatics is the computation of ensuing p-values. In this paper, we define a generic branch-and-bound approach to efficient exact p-value computation and enumerate the required conditions for successful application. Explicit procedures are developed for the entire Cressie–Read family of statistics, which includes the widely used Pearson and likelihood ratio statistics in a one-way frequency table goodness-of-fit test. This new formulation constitutes a first practical exact improvement over the exhaustive enumeration performed by existing statistical software. The general techniques we develop to exploit the convexity of many statistics are also shown to carry over to contingency table tests, suggesting that they are readily extendible to other tests and test statistics of interest. Our empirical results demonstrate a speed-up of orders of magnitude over the exhaustive computation, significantly extending the practical ...

Journal ArticleDOI
TL;DR: The JCFO is introduced, a novel algorithm that uses a sparse Bayesian approach to jointly identify both the optimal nonlinear classifier for diagnosis and the optimal set of genes on which to base that diagnosis.
Abstract: Recent research has demonstrated quite convincingly that accurate cancer diagnosis can be achieved by constructing classifiers that are designed to compare the gene expression profile of a tissue of unknown cancer status to a database of stored expression profiles from tissues of known cancer status. This paper introduces the JCFO, a novel algorithm that uses a sparse Bayesian approach to jointly identify both the optimal nonlinear classifier for diagnosis and the optimal set of genes on which to base that diagnosis. We show that the diagnostic classification accuracy of the proposed algorithm is superior to a number of current state-of-the-art methods in a full leave-one-out cross-validation study of five widely used benchmark datasets. In addition to its superior classification accuracy, the algorithm is designed to automatically identify a small subset of genes (typically around twenty in our experiments) that are capable of providing complete discriminatory information for diagnosis. Focusing attention...

Journal ArticleDOI
TL;DR: A new method for DNA sequencing using base-specific cleavage and mass spectrometry is proposed that appears to be a promising alternative to classical DNA sequencing approaches; application of this method to simulated data indicates that it might be capable of sequencing DNA molecules with 200+ nt.
Abstract: One of the main endeavors in today's life science remains the efficient sequencing of long DNA molecules. Today, most de novo sequencing of DNA is still performed using the electrophoresis-based Sanger concept of 1977, in spite of certain restrictions of this method. Methods using mass spectrometry to acquire the Sanger sequencing data are limited by short sequencing lengths of 15-25 nt. We propose a new method for DNA sequencing using base-specific cleavage and mass spectrometry that appears to be a promising alternative to classical DNA sequencing approaches. A single stranded DNA or RNA molecule is cleaved by a base-specific (bio-)chemical reaction using, for example, RNAses. The cleavage reaction is modified such that not all, but only a certain percentage of bases are cleaved. The resulting mixture of fragments is then analyzed using MALDI-TOF mass spectrometry, whereby we acquire the molecular masses of fragments. For every peak in the mass spectrum, we calculate those base compositions that will potentially create a peak of the observed mass and, repeating the cleavage reaction for all four bases, finally try to uniquely reconstruct the underlying sequence from these observed spectra. This leads us to the combinatorial problem of sequencing from compomers and, finally, to the graph-theoretical problem of finding a walk in a subgraph of the de Bruijn graph. Application of this method to simulated data indicates that it might be capable of sequencing DNA molecules with 200+ nt.
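
A minimal sketch of the graph-theoretic flavor of the reconstruction step, assuming (as a deliberate simplification) that all k-mers of the target are known rather than only the compomer (base-composition) information the paper works with; reconstruction is then an Eulerian walk in the de Bruijn graph on (k-1)-mers.

    from collections import defaultdict

    def eulerian_path(kmers):
        graph = defaultdict(list)   # (k-1)-mer -> successor (k-1)-mers
        deg = defaultdict(int)      # out-degree minus in-degree
        for km in kmers:
            graph[km[:-1]].append(km[1:])
            deg[km[:-1]] += 1
            deg[km[1:]] -= 1
        # Start at the node with surplus out-degree, if any.
        start = next((n for n in deg if deg[n] > 0), kmers[0][:-1])
        stack, path = [start], []
        while stack:                # Hierholzer's algorithm
            while graph[stack[-1]]:
                stack.append(graph[stack[-1]].pop())
            path.append(stack.pop())
        path.reverse()
        return path[0] + "".join(n[-1] for n in path[1:])

    seq = "ATGGCGTGCA"
    k = 4
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    print(eulerian_path(kmers) == seq)   # True: the walk recovers the sequence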

Journal ArticleDOI
TL;DR: A novel model for the representation of biological networks and algorithms for learning model parameters from experimental data is proposed and used to suggest novel regulatory hypotheses in several cases where the literature-based model was inconsistent with the experiments.
Abstract: In this study, we propose a novel model for the representation of biological networks and provide algorithms for learning model parameters from experimental data. Our approach is to build an initial model based on extant biological knowledge and refine it to increase the consistency between model predictions and experimental data. Our model encompasses networks which contain heterogeneous biological entities (mRNA, proteins, metabolites) and aims to capture diverse regulatory circuitry on several levels (metabolism, transcription, translation, post-translation and feedback loops, among them). Algorithmically, the study raises two basic questions: how to use the model for predictions and inference of hidden variables states, and how to extend and rectify model components. We show that these problems are hard in the biologically relevant case where the network contains cycles. We provide a prediction methodology in the presence of cycles and a polynomial time, constant factor approximation for learning the regulation of a single entity. A key feature of our approach is the ability to utilize both high-throughput experimental data, which measure many model entities in a single experiment, as well as specific experimental measurements of few entities or even a single one. In particular, we use together gene expression, growth phenotypes, and proteomics data. We tested our strategy on the lysine biosynthesis pathway in yeast. We constructed a model of more than 150 variables based on an extensive literature survey and evaluated it with diverse experimental data. We used our learning algorithms to propose novel regulatory hypotheses in several cases where the literature-based model was inconsistent with the experiments. We showed that our approach has better accuracy than extant methods of learning regulation.

Journal ArticleDOI
TL;DR: This short note addresses two questions that were left open about the perfect phylogeny haplotyping problem and shows that the problem is NP-hard using a reduction from Vertex Cover (Garey and Johnson, 1979).
Abstract: The problem of inferring haplotype phase from a population of genotypes has received a lot of attention recently. This is partly due to the observation that there are many regions on human genomic DNA where genetic recombination is rare (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). A Haplotype Map project has been announced by NIH to identify and characterize populations in terms of these haplotypes. Recently, Gusfield introduced the perfect phylogeny haplotyping problem, as an algorithmic implication of the no-recombination in long blocks observation, together with the standard population-genetic assumption of infinite sites. Gusfield's solution based on matroid theory was followed by direct Θ(nm^2) solutions that use simpler techniques (Bafna et al., 2003; Eskin et al., 2003), and also bound the number of solutions to the PPH problem. In this short note, we address two questions that were left open. First, can the algorithms of Bafna et al. (2003) and Eskin et al. (2003)...

Journal ArticleDOI
Tatsuya Akutsu
TL;DR: This paper focuses on extraction of mapping rules of atoms from enzymatic reaction data, which is useful for drug design, simulation of tracer experiments, and consistency checking of pathway databases, and proposes a novel approach based on graph partition and graph isomorphism.
Abstract: Many computational problems and methods have been proposed for analysis of biological pathways. Among them, this paper focuses on extraction of mapping rules of atoms from enzymatic reaction data, which is useful for drug design, simulation of tracer experiments, and consistency checking of pathway databases. Most existing methods for this problem are based on maximal common subgraph algorithms. In this paper, we propose a novel approach based on graph partition and graph isomorphism. We show that this problem is NP-hard in general, but can be solved in polynomial time for wide classes of enzymatic reactions. We also present an O(n^1.5)-time algorithm for a special but fundamental class of reactions, where n is the maximum size of compounds appearing in a reaction. We develop practical polynomial-time algorithms in which the Morgan algorithm is used for computing the normal form of a graph, where it is known that the Morgan algorithm works correctly for most chemical structures. Computational experiments are performed for these practical algorithms using the chemical reaction data stored in the KEGG/LIGAND database. The results of computational experiments suggest that practical algorithms are useful in many cases.
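
A minimal sketch of the Morgan algorithm used as a subroutine here: iteratively replace each atom's invariant with the sum of its neighbors' invariants until the partition into equivalence classes stops refining. Real implementations also fold in atom types and bond orders; this toy labels atoms by element only.

    def morgan(atoms, bonds, max_rounds=10):
        """atoms: list of element symbols; bonds: list of index pairs."""
        adj = {i: [] for i in range(len(atoms))}
        for a, b in bonds:
            adj[a].append(b); adj[b].append(a)
        inv = [len(adj[i]) for i in range(len(atoms))]   # round 0: degree
        for _ in range(max_rounds):
            nxt = [sum(inv[j] for j in adj[i]) for i in range(len(atoms))]
            if len(set(nxt)) <= len(set(inv)):           # no further refinement
                break
            inv = nxt
        # Pair with the element symbol so different elements never collide.
        return [(atoms[i], inv[i]) for i in range(len(atoms))]

    # Ethanol heavy atoms: C-C-O (hydrogens omitted for brevity).
    print(morgan(["C", "C", "O"], [(0, 1), (1, 2)]))

Matching these invariants between substrate and product graphs is the starting point for inferring which atom maps to which across a reaction.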

Journal ArticleDOI
TL;DR: An algorithm is presented that determines the longest path between every gene pair in an arbitrarily large genetic network from large-scale gene perturbation data and is able to distinguish a large fraction of direct regulatory interactions from indirect interactions.
Abstract: I present an algorithm that determines the longest path between every gene pair in an arbitrarily large genetic network from large-scale gene perturbation data. The algorithm's computational complexity is O(nk^2), where n is the number of genes in the network and k is the average number of genes affected by a genetic perturbation. The algorithm is able to distinguish a large fraction of direct regulatory interactions from indirect interactions, even if the accuracy of its input data is substantially compromised.
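
A minimal sketch of the pruning intuition, assuming the perturbation data is summarized as a boolean matrix reach[a][b] = "perturbing gene a changes gene b" (i.e., the transitive closure of direct regulation): an edge is flagged as indirect whenever an intermediate gene explains it. The paper's algorithm additionally computes longest paths and achieves O(nk^2) time; this sketch is cubic and purely illustrative.

    import numpy as np

    n = 6
    direct = np.zeros((n, n), dtype=bool)
    for a, b in [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5)]:   # true regulation
        direct[a, b] = True

    # Perturbation data sees the transitive closure of direct regulation.
    reach = direct.copy()
    for k in range(n):                       # boolean Floyd-Warshall closure
        reach |= np.outer(reach[:, k], reach[k, :])

    inferred = reach.copy()
    for a in range(n):
        for b in range(n):
            if reach[a, b] and any(reach[a, c] and reach[c, b]
                                   for c in range(n) if c not in (a, b)):
                inferred[a, b] = False       # explainable via an intermediate

    edges = sorted((int(a), int(b)) for a, b in zip(*np.nonzero(inferred)))
    print("recovered direct edges:", edges)  # matches the true edge list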

Journal ArticleDOI
TL;DR: Although the scoring function was developed based on the CE algorithm, it is portable to any other protein structure alignment algorithm, and it is evaluated by sensitivity, specificity, and ROC curves.
Abstract: A new scoring function for assessing the statistical significance of protein structure alignment has been developed. The new scores were tested empirically using the combinatorial extension (CE) algorithm. The significance of a given score was given a p-value by curve-fitting the distribution of the scores generated by a random comparison of proteins taken from the PDB_SELECT database and the structural classification of proteins (SCOP) database. Although the scoring function was developed based on the CE algorithm, it is portable to any other protein structure alignment algorithm. The new scoring function is examined by sensitivity, specificity, and ROC curves.
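
A minimal sketch of the p-value assignment, assuming scores from random (unrelated) structure pairs follow an extreme-value distribution, as is standard for alignment statistics; scipy's gumbel_r stands in for the paper's empirical curve fit over PDB_SELECT and SCOP comparisons, and the synthetic null scores are illustrative.

    import numpy as np
    from scipy.stats import gumbel_r

    # Stand-in for scores from many random structure-pair comparisons:
    null_scores = gumbel_r.rvs(loc=20, scale=4, size=5000, random_state=0)

    loc, scale = gumbel_r.fit(null_scores)   # fit the null distribution

    def p_value(score):
        """P(a random comparison scores >= observed) under the fitted EVD."""
        return gumbel_r.sf(score, loc=loc, scale=scale)

    print(f"p-value of score 35: {p_value(35):.2e}")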