
Showing papers in "Journal of Computational Biology in 1995"


Journal ArticleDOI
TL;DR: This paper proposes a new computer algorithm for DNA sequence assembly that combines in a novel way the techniques of both shotgun and SBH methods, and promises to be very fast and practical for DNA sequence assembly.
Abstract: Since the advent of rapid DNA sequencing methods in 1976, scientists have had the problem of inferring DNA sequences from sequenced fragments. Shotgun sequencing is a well-established biological and computational method used in practice. Many conventional algorithms for shotgun sequencing are based on the notion of pairwise fragment overlap. While shotgun sequencing infers a DNA sequence given the sequences of overlapping fragments, a recent and complementary method, called sequencing by hybridization (SBH), infers a DNA sequence given the set of oligomers that represents all subwords of some fixed length, k. In this paper, we propose a new computer algorithm for DNA sequence assembly that combines in a novel way the techniques of both shotgun and SBH methods. Based on our preliminary investigations, the algorithm promises to be very fast and practical for DNA sequence assembly.
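The SBH half of this hybrid can be illustrated with a toy spectrum walk. The sketch below (illustrative, not the paper's algorithm) computes the set of k-mers of a string and reassembles it by repeatedly appending the unique k-mer that overlaps the current suffix in k − 1 symbols; it deliberately fails when an extension is ambiguous, which is exactly the situation that repeats create.

```python
from collections import Counter

def spectrum(s, k):
    """All length-k subwords of s (the SBH 'spectrum'), with multiplicities."""
    return Counter(s[i:i+k] for i in range(len(s) - k + 1))

def reconstruct(spec, start, length):
    """Greedy walk: extend by the unique unused k-mer whose (k-1)-prefix
    matches the current suffix. Succeeds only when every extension is
    unambiguous (no repeated (k-1)-mers in the target)."""
    spec = dict(spec)
    k = len(start)
    s = start
    spec[start] -= 1
    while len(s) < length:
        suffix = s[-(k - 1):]
        nexts = [w for w, c in spec.items() if c > 0 and w.startswith(suffix)]
        if len(nexts) != 1:
            return None  # ambiguous extension or dead end
        s += nexts[0][-1]
        spec[nexts[0]] -= 1
    return s
```

For example, the spectrum of "ACGTTGCA" with k = 4 walks back to the original string from the seed "ACGT".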

336 citations


Journal ArticleDOI
TL;DR: The maximum discrimination method for building hidden Markov models (HMMs) of protein or nucleic acid primary sequence consensus compensates for biased representation in sequence data sets, superseding the need for sequence weighting methods.
Abstract: We introduce a maximum discrimination method for building hidden Markov models (HMMs) of protein or nucleic acid primary sequence consensus. The method compensates for biased representation in sequence data sets, superseding the need for sequence weighting methods. Maximum discrimination HMMs are more sensitive for detecting distant sequence homologs than various other HMM methods or BLAST when tested on globin and protein kinase catalytic domain sequences. Key words: hidden Markov model; database searching; sequence consensus; sequence weighting
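The scoring machinery underlying any such HMM approach is the forward algorithm; the maximum discrimination training criterion itself is the paper's contribution and is not reproduced here. A minimal sketch of forward-algorithm likelihood scoring, with illustrative parameters:

```python
import math

def forward_loglik(seq, states, init, trans, emit):
    """Forward algorithm: log P(seq | HMM). `trans[i][j]` and
    `emit[i][symbol]` are ordinary probabilities; sums are kept in
    probability space for clarity, which is fine for short sequences."""
    alpha = [init[i] * emit[i][seq[0]] for i in states]
    for sym in seq[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in states) * emit[j][sym]
                 for j in states]
    return math.log(sum(alpha))
```

A one-state model with uniform emissions over {A,C,G,T} assigns each sequence of length n the log-likelihood n·log(1/4), a handy sanity check.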

272 citations


Journal ArticleDOI
TL;DR: The fragment assembly problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the two-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem.
Abstract: The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally, the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the two-sided Kolmogorov–Smirnov statistic, and it is argued that this is a better formulation of the problem. Next the fragment assembly problem is recast in graph-theoretic terms as one of finding a noncyclic subgraph with certain properties and the objectives of being shortest or maximally likely are also recast in this framework. Finally, a series of graph reduction transformations are given that dramatically reduce the size of the graph to be explored in practical instances of the problem. This reduction is very important as the un...
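The graph-theoretic view can be made concrete with one classic reduction, removing transitive overlap edges. This is a standard transformation on overlap graphs, not necessarily one of the paper's specific reductions:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (>= min_len)."""
    for L in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:L]):
            return L
    return 0

def overlap_graph(frags, min_len=3):
    """Directed graph: edge i -> j weighted by the suffix/prefix overlap."""
    g = {i: {} for i in range(len(frags))}
    for i, a in enumerate(frags):
        for j, b in enumerate(frags):
            if i != j:
                o = overlap(a, b, min_len)
                if o:
                    g[i][j] = o
    return g

def remove_transitive(g):
    """Drop edge a->c whenever a->b and b->c exist: the layout survives,
    but the graph to explore shrinks (a sketch of one such reduction)."""
    drop = set()
    for a in g:
        for b in g[a]:
            for c in g.get(b, {}):
                if c in g[a] and c != a:
                    drop.add((a, c))
    return {a: {b: o for b, o in nbrs.items() if (a, b) not in drop}
            for a, nbrs in g.items()}
```

On three fragments sampled left to right from one string, the direct edge from the first to the third fragment is transitive and gets removed.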

258 citations


Journal ArticleDOI
TL;DR: It is shown that four simplified models of the physical mapping problem lead to NP-complete decision problems: colored unit interval graph completion, the maximum interval subgraph, the pathwidth of a bipartite graph, and the k-consecutive ones problem for k ≥ 2.
Abstract: Physical mapping is a central problem in molecular biology and the human genome project. The problem is to reconstruct the relative position of fragments of DNA along the genome from information on their pairwise overlaps. We show that four simplified models of the problem lead to NP-complete decision problems: Colored unit interval graph completion, the maximum interval (or unit interval) subgraph, the pathwidth of a bipartite graph, and the k -consecutive ones problem for k ≥ 2. These models have been chosen to reflect various features typical in biological data, including false-negative and positive errors, small width of the map, and chimericism. Key words: physical mapping; NP-completeness; interval graphs; k-consecutive ones problem
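The classical consecutive-ones property underlying these models can be tested by brute force on tiny instances; polynomial algorithms via PQ-trees exist, and the paper's point is that the error-tolerant and k ≥ 2 variants are NP-complete. An illustrative exponential check:

```python
from itertools import permutations

def _consecutive(row):
    """True if the 1s in the row occupy one contiguous block."""
    ones = [i for i, v in enumerate(row) if v]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def has_consecutive_ones(matrix):
    """Brute-force C1P test: does some column order make the 1s in
    every row consecutive? Exponential; tiny instances only."""
    ncols = len(matrix[0])
    for perm in permutations(range(ncols)):
        if all(_consecutive([row[c] for c in perm]) for row in matrix):
            return True
    return False
```

The minimal obstruction with three rows covering all three pairs of three columns has no valid ordering, since three columns admit only two adjacencies.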

220 citations


Journal ArticleDOI
TL;DR: This paper surveys the technical challenges to integration, classifies the approaches, and critiques the available tools and methodologies, motivated by the increasing dispersion and heterogeneity of biological data.
Abstract: Scientific data of importance to biologists reside in a number of different data sources, such as GenBank, GSDB, SWISS-PROT, EMBL, and OMIM, among many others. Some of these data sources are conventional databases implemented using database management systems (DBMSs) and others are structured files maintained in a number of different formats (e.g., ASN.1 and ACE). In addition, software packages such as sequence analysis packages (e.g., BLAST and FASTA) produce data and can therefore be viewed as data sources. To counter the increasing dispersion and heterogeneity of data, different approaches to integrating these data sources are appearing throughout the bioinformatics community. This paper surveys the technical challenges to integration, classifies the approaches, and critiques the available tools and methodologies.

200 citations


Journal ArticleDOI
TL;DR: The MSA program implements a branch-and-bound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph to find optimal alignments of multiple protein or DNA sequences.
Abstract: The MSA program, written and distributed in 1989, is one of the few existing programs that attempts to find optimal alignments of multiple protein or DNA sequences. The MSA program implements a branch-and-bound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph.
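The dynamic programming graph that MSA prunes generalizes the familiar two-sequence alignment lattice to many dimensions. A sketch of that pairwise base case with unit costs (illustrative parameters, not MSA's sum-of-pairs machinery):

```python
def align_cost(a, b, mismatch=1, gap=1):
    """Pairwise alignment DP (edit distance form): the two-sequence
    face of the lattice that MSA explores in higher dimensions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i * gap
    for j in range(n + 1):
        d[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            d[i][j] = min(sub, d[i - 1][j] + gap, d[i][j - 1] + gap)
    return d[m][n]
```

For k sequences the lattice has one cell per tuple of prefix lengths, which is why pruning via bounds and shortest-path estimates matters.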

165 citations


Journal ArticleDOI
TL;DR: An extensive review of methods for prediction of functional sites, tRNA, and protein-coding genes is presented, along with possible further directions of research in this area of computational molecular biology.
Abstract: Recognition of function of newly sequenced DNA fragments is an important area of computational molecular biology. Here we present an extensive review of methods for prediction of functional sites, tRNA, and protein-coding genes and discuss possible further directions of research in this area. Key words: DNA sequence analysis; functional sites; genes; protein-coding regions; exons; introns; prediction; tRNA

128 citations


Journal ArticleDOI
TL;DR: This work proposes an architecture for query-based interoperation that includes a number of novel components of an information infrastructure for molecular biology that bridge the heterogeneities that exist between biological DBs at several different levels.
Abstract: To realize the full potential of biological databases (DBs) requires more than the interactive, hypertext flavor of database interoperation that is now so popular in the bioinformatics community. Interoperation based on declarative queries to multiple network-accessible databases will support analyses and investigations that are orders of magnitude faster and more powerful than what can be accomplished through interactive navigation. I present a vision of the capabilities that a query-based interoperation infrastructure should provide, and identify assumptions underlying, and requirements of, this vision. I then propose an architecture for query-based interoperation that includes a number of novel components of an information infrastructure for molecular biology. These components include a knowledge base that describes relationships among the conceptualizations used in different biological databases, a module that can determine the DBs that are relevant to a particular query, a module that can tr...

120 citations


Journal ArticleDOI
TL;DR: The model is employed for embedding one phylogeny tree into another via the so-called duplication/speciation principle, requiring that the duplicated gene evolves in such a way that each of the contemporary species involved bears only one of the diverged gene copies.
Abstract: In the framework of the problem of combining different gene trees into a unique species phylogeny, a model for duplication/speciation/loss events along the evolutionary tree is introduced. The model is employed for embedding one phylogeny tree into another via the so-called duplication/speciation principle, requiring that the duplicated gene evolves in such a way that each of the contemporary species involved bears only one of the diverged gene copies. The number of biologically meaningful elements in the embedding result (duplications, losses, information gaps) is considered an (asymmetric) dissimilarity measure between the trees. The model's duplication concept is compared with the one defined previously in terms of a mapping procedure for the trees. A graph-theoretic reformulation of the measure is derived.
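The duplication/speciation principle is usually operationalized via the LCA mapping between a gene tree and a species tree. The sketch below counts duplications under that standard mapping, assuming binary trees encoded as nested 2-tuples with distinct species names at the leaves (an illustrative encoding, not the paper's formalism):

```python
def lca_reconcile(gene_tree, species_tree):
    """Count gene duplications by the standard LCA mapping: an internal
    gene-tree node is a duplication when it maps to the same
    species-tree node as one of its children."""
    depth, parent = {}, {}

    def index(node, d=0, par=None):
        depth[node], parent[node] = d, par
        if isinstance(node, tuple):
            for ch in node:
                index(ch, d + 1, node)
    index(species_tree)

    def lca(u, v):
        while u != v:
            if depth[u] >= depth[v]:
                u = parent[u]
            else:
                v = parent[v]
        return u

    dups = 0

    def walk(g):
        nonlocal dups
        if not isinstance(g, tuple):
            return g  # leaf: a species name present in species_tree
        maps = [walk(ch) for ch in g]
        m = lca(maps[0], maps[1])
        if m in maps:
            dups += 1
        return m

    walk(gene_tree)
    return dups
```

A gene tree with two copies from species A forces one duplication; a gene tree congruent with the species tree needs none.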

106 citations


Journal ArticleDOI
TL;DR: This work presents several algorithms to infer how clones overlap, given data about each clone, focusing on the data used to map human chromosomes 21 and Y, in which relatively short substrings, or probes, are extracted from the ends of clones.
Abstract: The goal of physical mapping of the genome is to reconstruct a strand of DNA given a collection of overlapping fragments, or clones, from the strand. We present several algorithms to infer how the clones overlap, given data about each clone. We focus on data used to map human chromosomes 21 and Y, in which relatively short substrings, or probes, are extracted from the ends of clones. The substrings are long enough to be unique with high probability. The data we are given is an incidence matrix of clones and probes. In the absence of error, the correct placement can be found easily using a PQ-tree. The data are never free from error, however, and algorithms are differentiated by their performance in the presence of errors. We approach errors from two angles: by detecting and removing them, and by using algorithms that are robust in the presence of errors. We have also developed a strategy to recover noiseless data through an interactive process that detects anomalies in the data and retests questionable entries in the incidence matrix of clones and probes. We evaluate the effectiveness of our algorithms empirically, using simulated data as well as real data from human chromosome 21.

101 citations


Journal ArticleDOI
TL;DR: Different Markov chain models, with either stationary or periodic transition probabilities, are considered; one finding is that many overabundant words are one-letter mutations of avoided palindromes.
Abstract: Identifying exceptional motifs is often used for extracting information from long DNA sequences. The two difficulties of the method are the choice of the model that defines the expected frequencies of words and the approximation of the variance of the difference T(W) between the number of occurrences of a word W and its estimation. We consider here different Markov chain models, either with stationary or periodic transition probabilities. We estimate the variance of the difference T(W) by the conditional variance of the number of occurrences of W given the oligonucleotides counts that define the model. Two applications show how to use asymptotically standard normal statistics associated with the counts to describe a given sequence in terms of its outlying words. Sequences of Escherichia coli and of Bacillus subtilis are compared with respect to their exceptional tri- and tetranucleotides. For both bacteria, exceptional 3-words are mainly found in the coding frame. E. coli palindrome counts are an...
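The first step of such an analysis is estimating the expected count of a word under a fitted Markov model. The sketch below uses the plug-in estimate for a first-order stationary model and a crude Poisson-style variance; the conditional variance derived in the paper is sharper:

```python
from collections import Counter

def expected_count(seq, word):
    """Plug-in estimate of E[count of `word`] under a first-order
    stationary Markov model fitted to `seq`: product of dinucleotide
    counts divided by product of interior single-letter counts."""
    c1 = Counter(seq)
    c2 = Counter(seq[i:i+2] for i in range(len(seq) - 1))
    num = 1.0
    for i in range(len(word) - 1):
        num *= c2[word[i:i+2]]
    den = 1.0
    for ch in word[1:-1]:
        den *= c1[ch]
    return num / den

def z_score(seq, word):
    """Standardized deviation using a Poisson-style variance
    (a simplification of the paper's conditional variance)."""
    obs = sum(seq.startswith(word, i) for i in range(len(seq)))
    exp = expected_count(seq, word)
    return (obs - exp) / exp ** 0.5
```

On "AAAA" the word "AAA" occurs twice while the model expects 3·3/4 = 2.25, giving a small negative z-score.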

Journal ArticleDOI
TL;DR: A mathematical model for the polymerase chain reaction and its mutations is constructed using the theory of branching processes and a method for estimating the mutation rate based on pairwise differences is proposed.
Abstract: We construct a mathematical model for the polymerase chain reaction and its mutations using the theory of branching processes. Under this model we study the number of mutations in a random...
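A branching-process view of PCR is easy to simulate. The sketch below uses a whole-molecule mutation probability per replication, a simplification of the paper's per-base model; all parameter values are illustrative:

```python
import random

def simulate_pcr(cycles, eff, mu, n0=10, rng=None):
    """Toy branching-process PCR: each molecule is replicated with
    probability `eff` per cycle, and each replication introduces one
    new mutation with probability `mu` (whole-molecule simplification).
    Returns the per-molecule mutation counts after all cycles."""
    rng = rng or random.Random(0)
    pool = [0] * n0  # mutation count carried by each molecule
    for _ in range(cycles):
        new = []
        for m in pool:
            if rng.random() < eff:
                new.append(m + (1 if rng.random() < mu else 0))
        pool += new
    return pool
```

With efficiency 1 and mutation probability 1, the pool doubles each cycle, the founders stay mutation-free, and the deepest lineage carries one mutation per cycle, a useful sanity check on the tree structure.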

Journal ArticleDOI
TL;DR: The conclusion is that decision trees are a highly effective tool for identifying protein coding regions on DNA sequences ranging from 54 to 162 base pairs in length.
Abstract: Genes in eukaryotic DNA cover hundreds or thousands of base pairs, while the regions of those genes that code for proteins may occupy only a small percentage of the sequence. Identifying the coding regions is of vital importance in understanding these genes. Many recent research efforts have studied computational methods for distinguishing between coding and noncoding regions, and several promising results have been reported. We describe here a new approach, using a machine learning system that builds decision trees from the data. This approach combines several coding measures to produce classifiers with consistently higher accuracies than previous methods, on DNA sequences ranging from 54 to 162 base pairs in length. The algorithm is very efficient, and it can easily be adapted to different sequence lengths. Our conclusion is that decision trees are a highly effective tool for identifying protein coding regions.
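A one-level "decision tree" over a single coding measure already illustrates the approach; the paper's system combines several measures in a full tree. The feature values below are illustrative stand-ins for a coding measure such as a codon-usage score:

```python
def best_stump(values, labels):
    """One-level decision tree: choose the threshold on a single
    coding measure that best separates coding (True) from noncoding
    (False) training examples. Returns (accuracy, threshold)."""
    best = (-1.0, 0.0)
    for t in sorted(set(values)):
        preds = [v >= t for v in values]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        best = max(best, (acc, t))
    return best
```

A full tree recursively applies such splits over several measures, which is how the combined classifier gains accuracy over any single measure.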

Journal ArticleDOI
TL;DR: A mathematical model to treat the polymerase chain reaction (PCR), where the accumulation of new molecules during a PCR cycle is regarded as a randomly bifurcating tree, enables an approximate formula for the distribution of the number of replications that have occurred between a pair of molecules to be computed.
Abstract: We introduce a mathematical model to treat the polymerase chain reaction (PCR), where we regard the accumulation of new molecules during a PCR cycle as a randomly bifurcating tree. This model enables an approximate formula for the distribution of the number of replications that have occurred between a pair of molecules to be computed.

Journal ArticleDOI
TL;DR: It is shown that building an optimal decision tree is NP-complete, then an approximation algorithm is given that gives trees within a constant multiplicative factor of optimal, and it is demonstrated that subsequence queries are significantly more powerful than substring queries, matching the information theoretic lower bound.
Abstract: We consider an interactive approach to DNA sequencing by hybridization, where we are permitted to ask questions of the form "is s a substring of the unknown sequence S?", where s is a specific query string. We are not told where s occurs in S, nor how many times it occurs, just whether or not s is a substring of S. Our goal is to determine the exact contents of S using as few queries as possible. Through interaction, far fewer queries are necessary than using conventional fixed sequencing by hybridization (SBH) sequencing chips. We provide tight bounds on the complexity of reconstructing unknown strings from substring queries. Our lower bound, which holds even for a stronger model that returns the number of occurrences of s as a substring of S, relies on interesting arguments based on de Bruijn sequences. We also demonstrate that subsequence queries are significantly more powerful than substring queries, matching the information theoretic lower bound. Finally, in certain applications, something may already be known about the unknown string, and hence it can be determined faster than an arbitrary string. We show that building an optimal decision tree is NP-complete, then give an approximation algorithm that gives trees within a constant multiplicative factor of optimal.
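The interactive strategy can be sketched as greedy one-symbol extension: query s+a for each letter a, extend on a yes answer, and switch to left extension when the right end is exhausted. This simple version is only guaranteed correct when each queried string occurs at most once in the target, a much weaker setting than the paper's bounds cover:

```python
def reconstruct_by_queries(is_substring, seed, alphabet="ACGT"):
    """Adaptive substring queries: greedily extend a known seed to the
    right, then to the left, one symbol per round of queries."""
    s = seed
    grown = True
    while grown:  # extend right
        grown = False
        for a in alphabet:
            if is_substring(s + a):
                s, grown = s + a, True
                break
    grown = True
    while grown:  # extend left
        grown = False
        for a in alphabet:
            if is_substring(a + s):
                s, grown = a + s, True
                break
    return s
```

With the oracle answering membership in a repeat-free target, the seed grows back to the full string.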

Journal ArticleDOI
TL;DR: This paper examines the construction of physical maps from hybridization data between sequence tag sites (STS) probes and clones of genomic fragments and proves that only certain types of mapping information can be reliably calculated by any algorithm.
Abstract: An important tool in the analysis of genomic sequences is the physical map. In this paper we examine the construction of physical maps from hybridization data between sequence tag sites (STS) probes and clones of genomic fragments. An algorithmic theory of the mapping process, a proposed performance evaluation procedure, and several new algorithmic strategies for mapping are given. A unifying theme for these developments is the idea of a "conservative extension." An algorithm, measure of algorithm quality, or description of physical map is a conservative extension if it is a generalization for data with errors of a corresponding concept in the error-free case. In our algorithmic theory we show that the nature of hybridization experiments imposes inherent limitations on the mapping information recorded in the experimental data. We prove that only certain types of mapping information can be reliably calculated by any algorithm. A test generator is then presented along with quantitative measures for determining how much of the possible information is being computed by a given algorithm. Weaknesses and strengths of these measures are discussed. Each of the new algorithms presented in this paper is based on combinatorial optimizations. Despite the fact that all the optimizations are NP-complete, we have developed algorithmic tools for the design of competitive approximation algorithms. We apply our performance evaluation program to our algorithms and obtain solid evidence that the algorithms are capable of retrieving high-level reliable mapping information.

Journal ArticleDOI
TL;DR: This work considers molecular models for computing and derives a DNA-based mechanism for solving intractable problems through massive parallelism, and suggests that such methods might reduce the effort needed to solve otherwise difficult tasks.
Abstract: We consider molecular models for computing and derive a DNA-based mechanism for solving intractable problems through massive parallelism. In principle, such methods might reduce the effort needed to solve otherwise difficult tasks, such as factoring large numbers, a computationally intensive task whose intractability forms the basis for much of modern cryptography. Key words: DNA; nanotechnology; recombination; site-directed mutagenesis; intractability; combinatorial search; NP-completeness

Journal ArticleDOI
TL;DR: This paper proposes criteria that facilitate characterizing, evaluating, and comparing heterogeneous molecular biology database systems.
Abstract: Molecular biology data are distributed among multiple databases. Although containing related data, these databases are often isolated and are characterized by various degrees of heterogeneity: they usually represent different views (schemas) of the scientific domain and are implemented using different data management systems. Currently, several systems support managing data in heterogeneous molecular biology databases. Lack of clear criteria for characterizing such systems precludes comprehensive evaluations of these systems or determining their relationships in terms of shared goals and facilities. In this paper, we propose criteria that would facilitate characterizing, evaluating, and comparing heterogeneous molecular biology database systems. Key words: characterization criteria, heterogeneous database systems, molecular biology databases

Journal ArticleDOI
TL;DR: This method applies a recently developed Hadamard matrix-based technique to describe elements of I(T) in terms of edge-disjoint packings of subtrees in T, and thereby complements earlier more algebraic treatments.
Abstract: Linear invariants are useful tools for testing phylogenetic hypotheses from aligned DNA/ RNA sequences, particularly when the sites evolve at different rates. Here we give a simple, graph theoretic classification for each phylogenetic tree T, of its associated vector space I(T) of linear invariants under the Jukes–Cantor one-parameter model of nucleotide substitution. We also provide an easily described basis for I(T), and show that if T is a binary (fully resolved) phylogenetic tree with n sequences at its leaves then: dim I(T) = 4^n − F_{2n−2}, where F_n is the nth Fibonacci number. Our method applies a recently developed Hadamard matrix-based technique to describe elements of I(T) in terms of edge-disjoint packings of subtrees in T, and thereby complements earlier more algebraic treatments. Key words: Phylogenetic invariants; trees; forests; Hadamard matrix; Jukes–Cantor model
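With the convention F_1 = F_2 = 1, the dimension formula is easy to tabulate. A sketch (the superscript and subscript placement in the formula is reconstructed from the statement above, so treat the indexing as an assumption):

```python
def fib(n):
    """Fibonacci numbers with F_0 = 0, F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def invariant_dim(n):
    """dim I(T) = 4^n - F_{2n-2} for a binary tree with n leaves,
    as read from the formula in the abstract."""
    return 4 ** n - fib(2 * n - 2)
```

The 4^n term is the number of site patterns on n sequences, which is why the dimension is naturally expressed as that count minus a Fibonacci correction.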

Journal ArticleDOI
TL;DR: Techniques are derived to estimate the conditional probability of gene function, given ORF length, based on evidence both from the databases and from simulation for Saccharomyces cerevisiae.
Abstract: The length of an open reading frame (ORF) is one important piece of evidence often used in locating new genes, particularly in organisms where splicing is rare. However, there have been no systematic studies quantifying the degree of correlation between length of ORF, on the one hand, and likelihood of gene function, on the other. In this paper, techniques are derived to estimate the conditional probability of gene function, given ORF length, based on evidence both from the databases and from simulation. Several complete chromosomes of Saccharomyces cerevisiae have now been sequenced, and considerable effort is being expended on locating and characterizing the genes in these sequences. Thus, we illustrate the techniques for this organism.
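A simple Bayesian version of the length argument uses a random-sequence null in which each codon is a stop codon with probability 3/64. The prior and the likelihood under the gene model below are illustrative stand-ins for the database and simulation estimates the paper derives:

```python
def p_functional(orf_codons, prior=0.01, p_len_given_gene=1.0):
    """Bayes estimate of P(real gene | ORF spans >= L codons).
    Null model: each codon is independently a stop with prob 3/64,
    so P(ORF length >= L | not a gene) = (61/64)^L.
    `prior` and `p_len_given_gene` are illustrative placeholders."""
    p_len_given_null = (61 / 64) ** orf_codons
    num = prior * p_len_given_gene
    return num / (num + (1 - prior) * p_len_given_null)
```

Long ORFs quickly become overwhelming evidence: 300 codons without a stop is vanishingly unlikely under the null, while a 10-codon ORF barely moves the prior.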

Journal ArticleDOI
TL;DR: A correlation method that runs in linear time and incorporates pairwise dependencies between amino acid residues at multiple distances to assess the conditional probability that a given residue is part of a given 3D structure is presented.
Abstract: The identification of protein sequences that fold into certain known three-dimensional (3D) structures, or motifs, is evaluated through a probabilistic analysis of their one-dimensional (1...

Journal ArticleDOI
TL;DR: A coherent language base for describing and working with characteristics of combinatorial optimization problems is introduced, which is at once general enough to be used in all such problems and precise enough to allow subtle concepts in this field to be discussed unambiguously.
Abstract: This article introduces a coherent language base for describing and working with characteristics of combinatorial optimization problems, which is at once general enough to be used in all such problems and precise enough to allow subtle concepts in this field to be discussed unambiguously. An example is provided of how this nomenclature is applied to an instance of the phylogeny problem. Also noted is the beneficial effect, on the landscape of the solution space, of transforming the observed data to account for multiple changes of character state.

Journal ArticleDOI
TL;DR: It is shown that the general case of determining perfect compatibility of generalized ordered characters is an NP-complete problem, but can be solved in polynomial time for a special case.
Abstract: We propose a new model of computation for deriving phylogenetic trees based upon a generalization of qualitative characters. The model we propose is based upon recent experimental research in molecular biology. We show that the general case of determining perfect compatibility of generalized ordered characters is an NP-complete problem, but can be solved in polynomial time for a special case.

Journal ArticleDOI
TL;DR: The results suggest that it is unlikely that the multiple sequence tree alignment problem has polynomial-time algorithms that produce either optimal solutions or approximate solutions whose cost may be arbitrarily close to optimal.
Abstract: We give a simple proof which shows that the multiple sequence tree alignment problem from molecular biology is both NP-complete and MAX SNP-hard. Our proof of MAX SNP-hardness is simpler than that given previously by Wang and Jiang. These results suggest that it is unlikely that the multiple sequence tree alignment problem has polynomial-time algorithms that produce either optimal solutions or approximate solutions whose cost may be arbitrarily close to optimal. Key words: multiple sequence tree alignment, computational complexity, approximability, NP-complete, MAX SNP-hard

Journal ArticleDOI
TL;DR: The method in general requires time and space exponential in the number of optional characters in the regular expression, but in practice was used to determine bounds for probabilities of matching all the ProSite patterns without difficulty.
Abstract: A method is presented for determining within strict bounds the probability of matching a regular expression with a match start point in a given section of a random data string. The method in general requires time and space exponential in the number of optional characters in the regular expression, but in practice was used to determine bounds for probabilities of matching all the ProSite patterns without difficulty.
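For patterns without optional characters, the match probability at a fixed start position simply factorizes over positions; optional characters are what force the exponential enumeration. A sketch of the fixed-length case over a 20-letter protein alphabet, with the pattern given as a list of allowed-character sets:

```python
def match_prob(pattern, alphabet_size=20):
    """Probability that a uniform random string matches, at one fixed
    start position, a ProSite-style pattern with no optional elements:
    the product over positions of |allowed set| / |alphabet|."""
    p = 1.0
    for allowed in pattern:
        p *= len(allowed) / alphabet_size
    return p
```

A pattern like A-[ACDE]-x has probability (1/20)(4/20)(20/20) = 0.01 per start position.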

Journal ArticleDOI
TL;DR: The structure of strings with large amounts of overlap is studied, and an algorithm is given that finds a superstring whose length is no more than 2 3/4 times that of the optimal superstring, matching the bound of previous algorithms.
Abstract: Given a collection of strings S = {s1,..., sn} over an alphabet Σ, a superstring α of S is a string containing each si as a substring, that is, for each i, 1 ≤ i ≤ n, α contains a block of...
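The standard baseline for this problem is the greedy merge heuristic, sketched below; it is not the paper's 2 3/4-factor algorithm, but it is the usual point of comparison (conjectured, not proven, to be within a factor of 2 of optimal):

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for L in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:L]):
            return L
    return 0

def greedy_superstring(strings):
    """Greedy heuristic: repeatedly merge the pair with the largest
    overlap until one superstring remains."""
    # discard strings already contained in another string
    s = [x for x in strings if not any(x != y and x in y for y in strings)]
    while len(s) > 1:
        o, i, j = max(((overlap_len(a, b), i, j)
                       for i, a in enumerate(s)
                       for j, b in enumerate(s) if i != j),
                      key=lambda t: t[0])
        merged = s[i] + s[j][o:]
        s = [x for k, x in enumerate(s) if k not in (i, j)] + [merged]
    return s[0]
```

On {"ACG", "CGT", "GTA"} the greedy merges recover a length-5 superstring containing all three blocks.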

Journal ArticleDOI
TL;DR: An algorithm is presented to construct lattice models of polymers with side chains; a dynamic programming search makes finding the global minimum of the error function for a given lattice-to-chain orientation both fast and complete.
Abstract: An algorithm to construct lattice models of polymers with side chains is presented. A search for the global minimum of the error function for a given lattice-to-chain orientation is done by dynamic programming, making the search both fast and complete. Application of the algorithm is illustrated by constructing lattice models for 12 proteins of different sizes and structural types. Key words: protein structure, lattice model, dynamic programming

Journal ArticleDOI
TL;DR: Algorithms for the perfect phylogeny problem restricted to binary characters are presented, including two online algorithms that can process any sequence of additions and deletions of species and characters.
Abstract: We present algorithms for the perfect phylogeny problem restricted to binary characters. The first algorithm is faster than a previous algorithm by Gusfield when the input matrix for the p...
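For binary characters, the existence of a perfect phylogeny reduces to the classical pairwise four-gamete test, shown below as a naive check; the algorithms discussed here and in Gusfield's earlier work are faster than this O(nm^2) scan:

```python
from itertools import combinations

def perfect_phylogeny_exists(matrix):
    """Four-gamete test for binary characters: a perfect phylogeny
    exists iff no pair of columns exhibits all four patterns
    00, 01, 10, 11 across the rows (species)."""
    ncols = len(matrix[0])
    for i, j in combinations(range(ncols), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            return False
    return True
```

Two characters displaying all four joint states cannot both change exactly once on any tree, which is the intuition behind the test.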


Journal ArticleDOI
TL;DR: In the process of constructing high-resolution restriction maps via greedy algorithms, a classical anomaly known as fragment collapsing introduces errors into the maps.
Abstract: In the process of constructing high-resolution restriction maps via greedy algorithms, a classical anomaly, known as fragment collapsing, introduces errors into the maps that impedes furth...