
Showing papers in "Journal of Computational Biology in 1998"


Journal ArticleDOI
TL;DR: It is shown that the protein folding problem in the two-dimensional H-P model is NP-complete.
Abstract: We show that the protein folding problem in the two-dimensional H-P model is NP-complete.

436 citations


Journal ArticleDOI
TL;DR: The protein folding problem under the HP model on the cubic lattice is shown to be NP-complete, which means that the protein folding problem belongs to a large set of problems that are believed to be computationally intractable.
Abstract: One of the simplest and most popular biophysical models of protein folding is the hydrophobic-hydrophilic (HP) model. The HP model abstracts the hydrophobic interaction in protein folding by labeling the amino acids as hydrophobic (H for nonpolar) or hydrophilic (P for polar). Chains of amino acids are configured as self-avoiding walks on the 3D cubic lattice, where an optimal conformation maximizes the number of adjacencies between H's. In this paper, the protein folding problem under the HP model on the cubic lattice is shown to be NP-complete. This means that the protein folding problem belongs to a large set of problems that are believed to be computationally intractable.

399 citations
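Although finding an optimal fold is NP-complete in both the 2-D and 3-D settings, scoring a given conformation is easy, which is what makes heuristic search in the HP model attractive. A minimal sketch of the 2-D objective (the sequence, walk, and scoring conventions below are illustrative, not taken from either paper):

```python
# Score a conformation in the 2-D HP model: count H-H contacts between
# residues that are lattice neighbours but not adjacent along the chain.
def hp_score(seq, walk):
    """seq: string over 'H'/'P'; walk: self-avoiding list of (x, y) points."""
    assert len(seq) == len(walk) and len(set(walk)) == len(walk)
    pos = {p: i for i, p in enumerate(walk)}
    score = 0
    for i, (x, y) in enumerate(walk):
        if seq[i] != 'H':
            continue
        for nb in ((x + 1, y), (x, y + 1)):  # two directions: each contact counted once
            j = pos.get(nb)
            if j is not None and seq[j] == 'H' and abs(i - j) > 1:
                score += 1
    return score

# A 6-residue chain folded into a U shape: the two terminal H's touch.
print(hp_score("HPPPPH", [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]))  # → 1
```

An optimal conformation maximizes this score; the NP-completeness results say that no polynomial-time algorithm is expected to do so in general.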


Journal ArticleDOI
TL;DR: A review of a number of existing methods for discovering patterns in biosequences and of how these methods relate to each other, focusing on the algorithms underlying the approaches.
Abstract: This paper surveys approaches to the discovery of patterns in biosequences and places these approaches within a formal framework that systematises the types of patterns and the discovery algorithms. Patterns with expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering the patterns that are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis is presented of the ways in which an assessment can be made of the significance of the discovered patterns. It is shown that the problem is related to problems studied in the field of machine learning. The major part of this paper comprises a review of a number of existing methods developed to solve the problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examp...

351 citations


Journal ArticleDOI
TL;DR: This paper shows which formulations of multiple alignment have counterparts in multiple rearrangement, and proposes a branch-and-bound solution particularly suited to the instances of the Travelling Salesman Problem that arise from breakpoint analysis.
Abstract: Multiple alignment of macromolecular sequences generalizes from N = 2 to N ≥ 3 the comparison of N sequences which have diverged through the local processes of insertion, deletion and substitution. Gene-order sequences diverge through non-local genome rearrangement processes such as inversion (or reversal) and transposition. In this paper we show which formulations of multiple alignment have counterparts in multiple rearrangement. Based on difficulties inherent in rearrangement edit-distance calculation and interpretation, we argue for the simpler "breakpoint analysis." Consensus-based multiple rearrangement of N ≥ 3 orders can be solved exactly through reduction to instances of the Travelling Salesman Problem (TSP). We propose a branch-and-bound solution to TSP particularly suited to these instances. Simulations show how non-uniqueness of the solution is attenuated with increasing numbers of data genomes. Tree-based multiple alignment can be achieved to a great degree of accuracy by decomposing ...

248 citations
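The breakpoint distance underlying "breakpoint analysis" is simple to compute, in contrast to rearrangement edit distances. A sketch for unsigned gene orders (hypothetical data; the consensus problem then seeks a median order minimizing the total distance to the N data genomes, via the TSP reduction):

```python
def breakpoints(pi, sigma):
    """Count adjacencies of pi that are not adjacencies of sigma (either orientation)."""
    adj = set()
    for a, b in zip(sigma, sigma[1:]):
        adj.add((a, b))
        adj.add((b, a))
    return sum(1 for a, b in zip(pi, pi[1:]) if (a, b) not in adj)

# Transposing genes 2 and 3 creates two breakpoints.
print(breakpoints([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # → 2
```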


Journal ArticleDOI
TL;DR: A new, faster algorithm for the key step in the HMM calculation employs a fast Fourier transform on the group of pedigree inheritance patterns, which substantially improves the overall performance of the software package GENEHUNTER for performing linkage analysis.
Abstract: Genetic linkage analysis of human pedigrees using many linked markers simultaneously is a difficult computational problem. We have previously described an approach to this problem that uses hidden Markov models (HMMs) and is quite efficient for pedigrees of moderate size. Here, we describe a new, faster algorithm for the key step in the HMM calculation. The algorithm employs a fast Fourier transform on the group of pedigree inheritance patterns. It substantially improves the overall performance of the software package GENEHUNTER for performing linkage analysis. The Fourier representation opens up new research directions for pedigree analysis.

224 citations
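The group of inheritance patterns in a pedigree with n nonfounders is (Z/2)^(2n), so the "fast Fourier transform on the group" is a Walsh-Hadamard transform, under which convolution of distributions over inheritance vectors becomes pointwise multiplication. A minimal in-place sketch of the transform itself (an illustration, not the GENEHUNTER code):

```python
def fwht(a):
    """In-place fast Walsh-Hadamard transform; len(a) must be a power of two.
    Runs in O(N log N) versus O(N^2) for the naive transform."""
    h, n = 1, len(a)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

print(fwht([1.0, 0.0, 0.0, 0.0]))  # a point mass transforms to a constant
```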


Journal ArticleDOI
TL;DR: In this article, the stickers model of molecular computation is introduced; it features a random access memory requiring no strand extension and no enzymes, and a microprocessor-controlled robotic workstation architecture is proposed for implementing it to solve a wide class of search problems.
Abstract: We introduce a new model of molecular computation that we call the sticker model. Like many previous proposals it makes use of DNA strands as the physical substrate in which information is represented and of separation by hybridization as a central mechanism. However, unlike previous models, the stickers model has a random access memory that requires no strand extension and uses no enzymes; also (at least in theory), its materials are reusable. The paper describes computation under the stickers model and discusses possible means for physically implementing each operation. Finally, we go on to propose a specific machine architecture for implementing the stickers model as a microprocessor-controlled parallel robotic workstation. In the course of this development a number of previous general concerns about molecular computation (Smith, 1996; Hartmanis, 1995; Linial et al., 1995) are addressed. First, it is clear that general-purpose algorithms can be implemented by DNA-based computers, potentially solving a wide class of search problems. Second, we find that there are challenging problems, for which only modest volumes of DNA should suffice. Third, we demonstrate that the formation and breaking of covalent bonds is not intrinsic to DNA-based computation. Fourth, we show that a single essential biotechnology, sequence-specific separation, suffices for constructing a general-purpose molecular computer. Concerns about errors in this separation operation and means to reduce them are addressed elsewhere (Karp et al., 1995; Roweis and Winfree, 1999). Despite these encouraging theoretical advances, we emphasize that substantial engineering challenges remain at almost all stages and that the ultimate success or failure of DNA computing will certainly depend on whether these challenges can be met in laboratory investigations.

214 citations


Journal ArticleDOI
TL;DR: It is shown that combining motif scores indeed gives better search accuracy, and that the MAST sequence homology search algorithm utilizing the product of p-values scoring method is available for interactive use and downloading.
Abstract: Position-specific scoring matrices are useful for representing and searching for protein sequence motifs. A sequence family can often be described by a group of one or more motifs, and an effective search must combine the scores for matching a sequence to each of the motifs in the group. We describe three methods for combining match scores and estimating the statistical significance of the combined scores and evaluate the search quality (classification accuracy) and the accuracy of the estimate of statistical significance of each. The three methods are: 1) sum of scores, 2) sum of reduced variates, 3) product of score p-values. We show that method 3) is superior to the other two methods in both regards, and that combining motif scores indeed gives better search accuracy. The MAST sequence homology search algorithm utilizing the product of p-values scoring method is available for interactive use and downloading at URL http://www.sdsc.edu/MEME.

210 citations
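Method 3 relies on the fact that for n independent p-values, each uniform on (0, 1) under the null, the product p = p1...pn satisfies P(product <= p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!. A sketch of that combination step (independence and uniformity of the per-motif p-values are the assumptions here):

```python
import math

def combined_pvalue(pvals):
    """Significance of the product of n independent uniform(0,1) p-values."""
    n = len(pvals)
    p = math.prod(pvals)          # Python 3.8+
    if p == 0.0:
        return 0.0
    logp = math.log(p)
    return p * sum((-logp) ** i / math.factorial(i) for i in range(n))

print(combined_pvalue([0.1, 0.1]))  # ≈ 0.0561: stronger than either motif alone
```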


PatentDOI
TL;DR: In this article, a method of producing high-resolution, high-accuracy ordered restriction maps based on data created from the images of populations of individual DNA molecules (clones) digested by restriction enzymes is presented.
Abstract: A method of producing high-resolution, high-accuracy ordered restriction maps based on data created from the images of populations of individual DNA molecules (clones) digested by restriction enzymes. Detailed modeling and a statistical algorithm, along with an interactive algorithm based on dynamic programming and a heuristic method employing branch-and-bound procedures, are used to find the most likely true restriction map, based on experimental data.

121 citations


Journal ArticleDOI
TL;DR: The MORGAN system, including its decision tree routines and the algorithms for site recognition, and its performance on a benchmark database of vertebrate DNA are described.
Abstract: MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95 %, with a correlation coefficient of 0.78, and a sensitivity and specificity for coding bases of 83 % and 79%. In addition, MORGAN identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly. This paper describes the MORGAN system, including its decision tree routines and the algorithms for site recognition, and its performance on a benchmark database of vertebrate DNA.

118 citations


Journal ArticleDOI
TL;DR: The accuracy of the standard global dynamic programming method is measured and it is shown that it can be reasonably well modelled by an "edge wander" approximation to the distribution of the optimal scoring path around the correct path in the vicinity of a gap.
Abstract: Algorithms for generating alignments of biological sequences have inherent statistical limitations when it comes to the accuracy of the alignments they produce. Using simulations, we measure the accuracy of the standard global dynamic programming method and show that it can be reasonably well modelled by an "edge wander" approximation to the distribution of the optimal scoring path around the correct path in the vicinity of a gap. We also give a table from which accuracy values can be predicted for commonly used scoring schemes and sequence divergences (the PAM and BLOSUM series). Finally we describe how to calculate the expected accuracy of a given alignment, and show how this can be used to construct an optimal accuracy alignment algorithm which generates significantly more accurate alignments than standard dynamic programming methods in simulated experiments.

107 citations


Journal ArticleDOI
TL;DR: An algorithm whose running time grows only linearly with the size of the set of predicted exons, which allows for multiple-gene two-strand predictions and for considering gene features other than coding exons in valid gene structures.
Abstract: In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
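Stripped of frame compatibility and the Gene Model, the key idea — scan exon boundaries in increasing order while keeping the best score of any candidate gene that has already ended — can be sketched as follows (hypothetical exon data; after sorting, the sweep itself is linear in the number of exons):

```python
def best_gene_score(exons):
    """exons: (start, end, score) triples; chain nonoverlapping exons
    (previous end < next start) so that the total score is maximal."""
    by_start = sorted(range(len(exons)), key=lambda k: exons[k][0])
    by_end = sorted(range(len(exons)), key=lambda k: exons[k][1])
    chain = [0.0] * len(exons)   # best chain score ending with exon k
    best_prefix = 0.0            # best chain score among exons already ended
    best, i = 0.0, 0
    for k in by_start:
        start = exons[k][0]
        while i < len(by_end) and exons[by_end[i]][1] < start:
            best_prefix = max(best_prefix, chain[by_end[i]])
            i += 1
        chain[k] = best_prefix + exons[k][2]
        best = max(best, chain[k])
    return best

# (1,10) followed by (12,20) beats the single overlapping exon (5,15).
print(best_gene_score([(1, 10, 5.0), (12, 20, 4.0), (5, 15, 8.0)]))  # → 9.0
```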

Journal ArticleDOI
TL;DR: Modeling a DNA sequence as a stationary Markov chain, it is shown as an application that the compound Poisson approximation is efficient for the number of occurrences of rare stem-loop motifs.
Abstract: We derive a Poisson process approximation for the occurrences of clumps of multiple words and a compound Poisson process approximation for the number of occurrences of multiple words in a sequence of letters generated by a stationary Markov chain. Using the Chen-Stein method, we provide a bound on the error in the approximations. For rare words, these errors tend to zero as the length of the sequence increases to infinity. Modeling a DNA sequence as a stationary Markov chain, we show as an application that the compound Poisson approximation is efficient for the number of occurrences of rare stem-loop motifs.
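As a back-of-the-envelope instance of the approximation, here is the Poisson tail for the count of a word in an i.i.d. letter sequence; this simplifies the paper's setting, which models the sequence as a stationary Markov chain and handles clumping of self-overlapping words with a compound Poisson law:

```python
import math

def word_count_pvalue(word, n, base_probs, k):
    """Poisson approximation to P(word occurs >= k times) in an i.i.d.
    sequence of length n. base_probs: letter -> probability."""
    p = math.prod(base_probs[c] for c in word)
    lam = (n - len(word) + 1) * p              # expected number of occurrences
    return 1.0 - sum(math.exp(-lam) * lam ** j / math.factorial(j)
                     for j in range(k))

uniform = {c: 0.25 for c in "ACGT"}
print(word_count_pvalue("TATA", 1000, uniform, 10))
```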

Journal ArticleDOI
TL;DR: This work presents a new method for protein fold recognition through optimally aligning an amino acid sequence and a protein fold template (protein threading), and demonstrates that C is less than or equal to 4 for about 75% of the 293 unique folds in the protein database.
Abstract: Computational recognition of native-like folds of an anonymous amino acid sequence from a protein fold database is considered to be a promising approach to the three-dimensional (3D) fold prediction of the amino acid sequence. We present a new method for protein fold recognition through optimally aligning an amino acid sequence and a protein fold template (protein threading). The fitness of aligning an amino acid sequence with a fold template is measured by (1) the singleton fitness, representing the compatibility of substituting one amino acid by another and the combined preference of secondary structure and solvent accessibility for a particular amino acid, (2) the pairwise interaction, representing the contact preference between a pair of amino acids, and (3) alignment gap penalties. Though a protein threading problem so defined is known to be NP-hard in the most general sense, our algorithm runs efficiently if we place a cutoff distance on the pairwise interactions, as many of the existing threading programs do. For an amino acid sequence of size n and a fold template of size m with M core secondary structures, the algorithm finds an optimal alignment in O(Mn^(1.5C+1) + mn^(C+1)) time and O(Mn^(C+1)) space, where C is a (small) nonnegative integer, determined by a particular mathematical property of the pairwise interactions. As a case study, we have demonstrated that C is less than or equal to 4 for about 75% of the 293 unique folds in our protein database when pairwise interactions are restricted to spatially close amino acids; for larger C, threading requires too much memory and time to be practical on a typical workstation.

Journal ArticleDOI
TL;DR: A scalable approach to DNA-based computations is described, where complex combinatorial mixtures of DNA molecules encoding all possible answers to a computational problem are synthesized and attached to the surface of a solid support.
Abstract: A scalable approach to DNA-based computations is described. Complex combinatorial mixtures of DNA molecules encoding all possible answers to a computational problem are synthesized and attached to the surface of a solid support. This set of molecules is queried in successive MARK (hybridization) and DESTROY (enzymatic digestion) operations. Determination of the sequence of the DNA molecules remaining on the surface after completion of these operations yields the answer to the computational problem. Experimental demonstrations of aspects of the strategy are presented.
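The MARK/DESTROY cycle can be illustrated with a toy in-silico simulation: start with every candidate answer "on the surface", and for each constraint keep only the strands a MARK would protect. The SAT-style encoding below is an illustration of the strategy, not the paper's laboratory protocol:

```python
def solve_sat(n_vars, clauses):
    """clauses: lists of signed 1-based literals, e.g. [1, -2] means x1 or not-x2.
    Returns the satisfying assignments, encoded as integers (bit i = x_{i+1})."""
    pool = set(range(2 ** n_vars))        # synthesize all possible answers
    for clause in clauses:
        satisfies = lambda a: any(((a >> (abs(l) - 1)) & 1) == (l > 0) for l in clause)
        pool = {a for a in pool if satisfies(a)}   # MARK by hybridization,
                                                   # then DESTROY the unmarked
    return pool

print(solve_sat(2, [[1], [-2]]))  # → {1}: x1 = true, x2 = false
```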

Journal ArticleDOI
TL;DR: This paper addresses the problem of optimally aligning a given RNA sequence of unknown structure to one of known sequence and structure using methods from polyhedral combinatorics and could solve large problem instances--23S ribosomal RNA with more than 1400 bases.
Abstract: Ribonucleic acid (RNA) is a polymer composed of four bases denoted A, C, G, and U. It generally is a single-stranded molecule where the bases form hydrogen bonds within the same molecule leading to structure formation. In comparing different homologous RNA molecules it is important to consider both the base sequence and the structure of the molecules. Traditional alignment algorithms can only account for the sequence of bases, but not for the base pairings. Considering the structure leads to significant computational problems because of the dependencies introduced by the base pairings. In this paper we address the problem of optimally aligning a given RNA sequence of unknown structure to one of known sequence and structure. We phrase the problem as an integer linear program and then solve it using methods from polyhedral combinatorics. In our computational experiments we could solve large problem instances--23S ribosomal RNA with more than 1400 bases--a size intractable for former algorithms.

Journal ArticleDOI
TL;DR: A survey of some criteria of wide use in sequence alignment and comparison problems, and of the corresponding solutions is attempted.
Abstract: Molecular biology is becoming a computationally intense realm of contemporary science and faces some of the current grand scientific challenges. In its context, tools that identify, store, compare and analyze effectively large and growing numbers of bio-sequences are found of increasingly crucial importance. Biosequences are routinely compared or aligned, in a variety of ways, to infer common ancestry, to detect functional equivalence, or simply while searching for similar entries in a database. A considerable body of knowledge has accumulated on sequence alignment during the past few decades. Without pretending to be exhaustive, this paper attempts a survey of some criteria of wide use in sequence alignment and comparison problems, and of the corresponding solutions. The paper is based on presentations and literature given at the Workshop on Sequence Alignment held at Princeton, N.J., in November 1994, as part of the DIMACS Special Year on Mathematical Support for Molecular Biology.

Journal ArticleDOI
TL;DR: The database is a novel application of ACEDB, which was the database originally developed to store the C. elegans genome and includes attractive graphical representations of signaling cascades and the three-dimensional structure of molecules.
Abstract: We developed a data and knowledge base for cellular signal transduction in human cells, to make this rapidly growing information available. The database includes all the biological properties of cellular signal transduction, including biological reactions that transfer cellular signals and molecular attributes characterized by sequences, structures, and functions. Since the database is based on the object-oriented technique, highly flexible methods of data definition and modification are necessary to handle this diverse and complex biological information. The database includes attractive graphical representations of signaling cascades and the three-dimensional structure of molecules. The database is a novel application of ACEDB, which was the database originally developed to store the C. elegans genome. The database can be accessed through the Internet at http://geo.nihs.go.jp/csndb.html.

Journal ArticleDOI
TL;DR: An algorithm developed to handle biomolecular structural recognition problems, based on an extension and generalization of the Hough transform and the Geometric Hashing paradigms for rigid object recognition, which allows hinge induced motions to exist in either the receptor or the ligand molecules of diverse sizes.
Abstract: In this work, we present an algorithm developed to handle biomolecular structural recognition problems, as part of an interdisciplinary research endeavor of the Computer Vision and Molecular Biology fields. A key problem in rational drug design and in biomolecular structural recognition is the generation of binding modes between two molecules, also known as molecular docking. Geometrical fitness is a necessary condition for molecular interaction. Hence, docking a ligand (e.g., a drug molecule or a protein molecule), to a protein receptor (e.g., enzyme), involves recognition of molecular surfaces. Conformational transitions by "hinge-bending" involves rotational movements of relatively rigid parts with respect to each other. The generation of docked binding modes between two associating molecules depends on their three dimensional structures (3-D) and their conformational flexibility. In comparison to the particular case of rigid-body docking, the computational difficulty grows considerably when t...

Journal ArticleDOI
TL;DR: In the framework of a duplication-based method for comparing gene and species trees, the concepts of "duplication" and "loss" are reformulated in set-theoretic terms, a number of related tree dissimilarity measures are suggested, and relations between them are analyzed.
Abstract: In the framework of a duplication-based method for comparing gene and species trees, the concepts of "duplication" and "loss" are reformulated in set-theoretic terms. A number of related tree dissimilarity measures are suggested, and relations between them are analyzed. For any node in the species tree, the number of gene duplications for which it is a "non-child" loss coincides with the number of times when the node's parent is an intermediate between the mapping images of a gene node and its parent. This implies that the total number of losses is equal to the number of intermediate nodes plus the number of one-side duplications and, thus, provides an alternative proof for a conjecture made by Mirkin, Muchnik, and Smith (1995). Another formula proven involves crossings (incompatible gene-species node pairs): the number of losses equals the number of crossings plus the number of duplications.
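The duplications being counted come from the standard least-common-ancestor mapping of the gene tree into the species tree: a gene node is a duplication when its image coincides with the image of one of its children. A small sketch of that mapping (hypothetical toy trees):

```python
def lca(parent, a, b):
    """parent: species-tree node -> its parent (root maps to None)."""
    seen = set()
    while a is not None:
        seen.add(a)
        a = parent[a]
    while b not in seen:
        b = parent[b]
    return b

def count_duplications(gene_tree, root, species_of, parent):
    """gene_tree: internal node -> (left, right); leaves map via species_of."""
    M = {}
    def walk(v):
        M[v] = species_of[v] if v not in gene_tree else \
               lca(parent, walk(gene_tree[v][0]), walk(gene_tree[v][1]))
        return M[v]
    walk(root)
    return sum(1 for v, (l, r) in gene_tree.items() if M[v] in (M[l], M[r]))

# Species tree S = (A, B); gene tree ((a1, b1), a2) duplicates at its root.
parent = {"A": "S", "B": "S", "S": None}
gene_tree = {"g0": ("g1", "a2"), "g1": ("a1", "b1")}
species_of = {"a1": "A", "b1": "B", "a2": "A"}
print(count_duplications(gene_tree, "g0", species_of, parent))  # → 1
```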

Journal ArticleDOI
TL;DR: Two new approaches for constructing phylogenetic trees are presented, based on geometric ideas and dynamic programming, and it is guaranteed to find the optimal tree (with respect to the given quartets).
Abstract: In this work we present two new approaches for constructing phylogenetic trees. The input is a list of weighted quartets over n taxa. Each quartet is a subtree on four taxa, and its weight represents a confidence level for the specific topology. The goal is to construct a binary tree with n leaves such that the total weight of the satisfied quartets is maximized (an NP hard problem). The first approach we present is based on geometric ideas. Using semidefinite programming, we embed the n points on the n-dimensional unit sphere, while maximizing an objective function. This function depends on Euclidean distances between the four points and reflects the quartet topology. Given the embedding, we construct a binary tree by performing geometric clustering. This process is similar to the traditional neighbor joining, with the difference that the update phase retains geometric meaning: When two neighbors are joined together, their common ancestor is taken to be the center of mass of the original points....

Journal ArticleDOI
TL;DR: An algorithm for identifying satellites in DNA sequences that is easily adapted to finding tandem repeats in protein sequences, as well as extended to identifying mixed direct-inverse tandem repeats.
Abstract: We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30–40 base pairs) approximate tandem repeats where copies may differ up to ϵ = 15–20% from a consensus model of the repeating unit (implying individual units may vary by 2ϵ from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 104 when ϵ = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repeated unit as well as t...
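The filtering phase can be caricatured in a few lines: slide a window and keep regions where most positions match the character p places downstream, for a candidate period p. This naive sketch (illustrative thresholds, not the paper's statistically calibrated filter) conveys why filtering is cheap compared with exploring the space of repeat models:

```python
def periodic_windows(s, period, window, min_frac):
    """Start positions of windows where >= min_frac of positions i satisfy
    s[i] == s[i + period], a crude signature of a tandem repeat."""
    hits = []
    for i in range(len(s) - window - period + 1):
        matches = sum(s[i + j] == s[i + j + period] for j in range(window))
        if matches / window >= min_frac:
            hits.append(i)
    return hits

print(periodic_windows("ACGACGACGACG", 3, 6, 1.0))  # → [0, 1, 2, 3]
```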

Journal ArticleDOI
TL;DR: The current work introduces a straightforward generalization of pairwise sequence comparison algorithms to the case when multiple query sequences are available, called Family Pairwise Search (FPS), which is much more efficient than the training algorithms for statistical models.
Abstract: The function of an unknown biological sequence can often be accurately inferred by identifying sequences homologous to the original sequence. Given a query set of known homologs, there exist at least three general classes of techniques for finding additional homologs: pairwise sequence comparisons, motif analysis, and hidden Markov modeling. Pairwise sequence comparisons are typically employed when only a single query sequence is known. Hidden Markov models (HMMs), on the other hand, are usually trained with sets of more than 100 sequences. Motif-based methods fall in between these two extremes. The current work introduces a straightforward generalization of pairwise sequence comparison algorithms to the case when multiple query sequences are available. This algorithm, called Family Pairwise Search (FPS), combines pairwise sequence comparison scores from each query sequence. A BLAST implementation of FPS is compared to representative examples of hidden Markov modeling (HMMER) and motif modeling (...
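The combination step of FPS is deliberately simple: score a candidate against every query and merge the pairwise scores, e.g. by summing. A sketch (the `score` callable stands in for a pairwise comparison such as a BLAST score; names and data are hypothetical):

```python
def fps_rank(candidates, queries, score, combine=sum):
    """Rank candidate sequences by the combined pairwise score against
    all query family members (Family Pairwise Search, in outline)."""
    return sorted(candidates,
                  key=lambda c: combine(score(q, c) for q in queries),
                  reverse=True)

# Toy pairwise scores: c2 matches q1 strongly, c1 matches both weakly.
pairwise = {("q1", "c1"): 1.0, ("q2", "c1"): 1.0,
            ("q1", "c2"): 3.0, ("q2", "c2"): 0.0}
print(fps_rank(["c1", "c2"], ["q1", "q2"], lambda q, c: pairwise[(q, c)]))  # → ['c2', 'c1']
```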

Journal ArticleDOI
TL;DR: An efficient, reliable shotgun sequence assembly algorithm is proposed, based on a fingerprinting scheme that is robust to both noise and repetitive sequences in the data, the two primary roadblocks to effective whole-genome shotgun sequencing.
Abstract: This thesis deals with the sequence assembly problem: reconstructing a DNA sequence from a collection of short DNA fragments taken from random positions by identifying overlaps between the fragments. This problem arises in practice when biologists use shotgun sequencing, a cost-effective method for reading DNA. Existing sequence assembly algorithms have demonstrated limited success in assembling (long) real DNA sequences because they are computationally expensive and fail to properly handle repetitive sequences which are common in real DNA. We propose an efficient, reliable, shotgun sequence assembly algorithm based on a fingerprinting scheme that is quite robust to both noise and repetitive sequences in the data. Our algorithm uses exact matches of short patterns randomly selected from fragment data to identify fragment overlaps, construct an overlap map, and finally deliver a consensus sequence. We show how statistical clues made explicit in our approach can easily be exploited to correctly assemble results even in the presence of extensive repetitive sequences. Our approach is exceptionally fast in practice: e.g., we have successfully assembled a whole Mycoplasma genitalium genome (approximately 580 kbps) in roughly 8 minutes of 64MB 200MHz Pentium Pro CPU time from real shotgun data, where most existing algorithms can be expected to run for several hours to a day on the data. In addition, experiments with shotgun data (data is taken from a wide range of organisms, including human DNA) synthetically prepared from real DNA sequences containing extensive repeats demonstrate our algorithm's robustness to repetitive sections in many different sequences. For example, we have correctly assembled a 238kbp Human DNA sequence in less than 3 minutes of 64MB 200MHz Pentium Pro CPU time.

Journal ArticleDOI
TL;DR: The results indicate that computing an optimal alignment under this constraint is very expensive, however, less rigorous conditions on the alignment can be guaranteed by quite efficient algorithms.
Abstract: Given a strong match between regions of two sequences, how far can the match be meaningfully extended if gaps are allowed in the resulting alignment? The aim is to avoid searching beyond the point that a useful extension of the alignment is likely to be found. Without loss of generality, we can restrict attention to the suffixes of the sequences that follow the strong match, which leads to the following formal problem. Given two sequences and a fixed X > 0, align initial portions of the sequences subject to the constraint that no section of the alignment scores below -X. Our results indicate that computing an optimal alignment under this constraint is very expensive. However, less rigorous conditions on the alignment can be guaranteed by quite efficient algorithms. One of these variants has been implemented in a new release of the Blast suite of database search programs.
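The "less rigorous conditions" include the X-drop heuristic that BLAST adopted: extend while the running score stays within X of the best score seen, and report the best-scoring prefix. An ungapped, one-directional sketch (scoring values are illustrative; the paper's setting additionally allows gaps):

```python
def xdrop_extend(a, b, x, match=1, mismatch=-1):
    """Extend an alignment to the right along a and b, stopping once the
    running score drops more than x below the best score so far.
    Returns (best score, length of the best-scoring extension)."""
    best = score = best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        if best - score > x:
            break                 # give up: unlikely to recover
    return best, best_len

print(xdrop_extend("AAAATTTT", "AAAACCCC", x=2))  # → (4, 4)
```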

Journal ArticleDOI
TL;DR: A statistical model, a hidden Markov model (HMM), of the DM domain has been created which identifies currently known DM domains and suggests new DM domains in viral, bacterial and eucaryotic proteins, but no DM domains were identified in the currently predicted proteins from the archaeon Methanococcus jannaschii.
Abstract: Deamination reactions are catalyzed by a variety of enzymes including those involved in nucleoside/nucleotide metabolism and cytosine to uracil (C→U) and adenosine to inosine (A→I) mRNA editing. The active site of the deaminase (DM) domain in these enzymes contains a conserved histidine (or rarely cysteine), two cysteines and a glutamate proposed to act as a proton shuttle during deamination. Here, a statistical model, a hidden Markov model (HMM), of the DM domain has been created which identifies currently known DM domains and suggests new DM domains in viral, bacterial and eucaryotic proteins. However, no DM domains were identified in the currently predicted proteins from the archaeon Methanococcus jannaschii and possible causes for, and a potential means to ameliorate this situation are discussed. In some of the newly identified DM domains, the glutamate is changed to a residue that could not function as a proton shuttle and in one instance (Mus musculus spermatid protein TENR) the cysteines a...

Journal ArticleDOI
TL;DR: An algorithm to find three-dimensional substructures common to two or more molecules and extended to perform multiple comparisons, by using one of the structures as a reference point (pivot) to which all other structures are compared.
Abstract: In this paper, we present an algorithm to find three-dimensional substructures common to two or more molecules. The basic algorithm is devoted to pairwise structural comparison. Given two sets of atomic coordinates, it finds the largest subsets of atoms which are "similar" in the sense that all internal distances are approximately conserved. The basic idea of the algorithm is to recursively build subsets of increasing sizes, combining two sets of size k to build a set of size k + 1. The algorithm can be used "as is" for small molecules or local parts of proteins (about 30 atoms). When a large number of atoms is involved, we use a two-step procedure. First we look for common "local" fragments by using the previous algorithm, and then we gather these fragments by using a Branch and Bound technique. We also extend the basic algorithm to perform multiple comparisons, by using one of the structures as a reference point (pivot) to which all other structures are compared. The solution is the largest subsets of atoms common to the pivot and at least q other structures. Although both algorithms are theoretically exponential in the number of atoms, experiments performed on biological data and using realistic parameters show that the solution is obtained within a few minutes. Finally, an application to the determination of the structural core of seven globins is presented.
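The core test in the pairwise comparison — that all internal distances of a candidate atom correspondence are approximately conserved — can be sketched as follows (a simplified illustration; the function name and tolerance are assumptions, not the paper's code):

```python
import math

def distances_conserved(coords_a, coords_b, pairs, tol=0.5):
    """Return True if every internal distance is conserved within tol
    under the correspondence pairs = [(i, j), ...], which matches atom i
    of molecule A to atom j of molecule B."""
    for x in range(len(pairs)):
        for y in range(x + 1, len(pairs)):
            ia, ib = pairs[x]
            ja, jb = pairs[y]
            d_a = math.dist(coords_a[ia], coords_a[ja])
            d_b = math.dist(coords_b[ib], coords_b[jb])
            if abs(d_a - d_b) > tol:
                return False
    return True
```

Note that when two conserved sets of size k share k - 1 pairs, only the single new cross-distance needs checking, which is what keeps the recursive growth step cheap.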

Journal ArticleDOI
TL;DR: A general framework is presented for analyzing multiple protein structures using statistical regression methods, and it is revealed that globins are most strongly conserved structurally in helical regions, particularly in the mid-regions of the E, F, and G helices.
Abstract: A general framework is presented for analyzing multiple protein structures using statistical regression methods. The regression approach can superimpose protein structures rigidly or with shear. Also, this approach can superimpose multiple structures explicitly, without resorting to pairwise superpositions. The algorithm alternates between matching corresponding landmarks among the protein structures and superimposing these landmarks. Matching is performed using a robust dynamic programming technique that uses gap penalties that adapt to the given data. Superposition is performed using either orthogonal transformations, which impose the rigid-body assumption, or affine transformations, which allow shear. The resulting regression model of a protein family measures the amount of structural variability at each landmark. A variation of our algorithm permits a separate weight for each landmark, thereby allowing one to emphasize particular segments of a protein structure or to compensate for variances that differ at various positions in a structure. In addition, a method is introduced for finding an initial correspondence, by measuring the discrete curvature along each protein backbone. Discrete curvature also characterizes the secondary structure of a protein backbone, distinguishing among helical, strand, and loop regions. An example is presented involving a set of seven globin structures. Regression analysis, using both affine and orthogonal transformations, reveals that globins are most strongly conserved structurally in helical regions, particularly in the mid-regions of the E, F, and G helices.
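The orthogonal (rigid-body) superposition step corresponds to the classical Procrustes/Kabsch solution, which can be sketched with an SVD (a generic illustration of the rigid case only; the paper's alternating landmark matching and affine variants are not shown):

```python
import numpy as np

def orthogonal_superpose(X, Y):
    """Rigidly superimpose landmark set Y (n x 3) onto X (n x 3) using the
    least-squares rotation plus translation (Kabsch algorithm), no shear."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    H = (Y - my).T @ (X - mx)           # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return (Y - my) @ R + mx            # Y mapped into X's frame
```

Repeating this superposition across all structures after each re-matching of landmarks is the alternation the abstract describes; replacing the rotation with a general linear map gives the affine (shear-permitting) variant.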

Journal ArticleDOI
TL;DR: It is found that under the conditions required to obtain single nucleotide specificity in the hybridization process, hybridization efficiency is low, compromising the utility of singleucleotide encoding for DNA computing applications in the absence of some additional mechanism for increasing specificity.
Abstract: The feasibility of encoding a bit (0 or 1) of information for DNA-based computations at the single nucleotide level is evaluated, particularly with regard to the efficiency and specificity of hybridization discrimination. Hybridization experiments are performed on addressed arrays of 32 (2^5) distinct oligonucleotides immobilized on chemically modified glass and gold surfaces with information encoded in a binary (base 2) format. Similar results are obtained on both glass and gold surfaces and the results are generally consistent with thermodynamic calculations of matched and mismatched duplex stabilities. It is found that under the conditions required to obtain single nucleotide specificity in the hybridization process, hybridization efficiency is low, compromising the utility of single nucleotide encoding for DNA computing applications in the absence of some additional mechanism for increasing specificity. Several methods are suggested to provide such increased discrimination.
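The idea of one bit per nucleotide position can be illustrated with a toy encoding (the paper's actual base choices and sequence design are not given here; using `A` for 0 and `G` for 1 is an assumption):

```python
from itertools import product

def encode_word(bits, zero="A", one="G"):
    """Map a tuple of bits to an oligonucleotide, one base per bit."""
    return "".join(one if b else zero for b in bits)

# 2^5 = 32 distinct 5-mers, one oligo per 5-bit address
library = [encode_word(bits) for bits in product((0, 1), repeat=5)]
```

Under such a scheme, two addresses differing in one bit differ in exactly one base, which is why hybridization must discriminate at single-nucleotide resolution for the encoding to work.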

Journal ArticleDOI
TL;DR: This paper shows that the three methods that combine natural hierarchies with empirical hierarchies create decompositions which increase the efficiency of computations by as much as 50-fold and suggests that a speedup of about five can be expected just by virtue of having a decomposition.
Abstract: The task of computing molecular structure from combinations of experimental and theoretical constraints is expensive because of the large number of estimated parameters (the 3D coordinates of each atom) and the rugged landscape of many objective functions. For large molecular ensembles with multiple protein and nucleic acid components, the problem of maintaining tractability in structural computations becomes critical. A well-known strategy for solving difficult problems is divide-and-conquer. For molecular computations, there are two ways in which problems can be divided: (1) using the natural hierarchy within biological macromolecules (taking advantage of primary sequence, secondary structural subunits and tertiary structural motifs, when they are known); and (2) using the hierarchy that results from analyzing the distribution of structural constraints (providing information about which substructures are constrained to one another). In this paper, we show that these two hierarchies can be complementary and can provide information for efficient decomposition of structural computations. We demonstrate five methods for building such hierarchies--two automated heuristics that use both natural and empirical hierarchies, one knowledge-based process using both hierarchies, one method based on the natural hierarchy alone, and for completeness one random hierarchy oblivious to auxiliary information--and apply them to a data set for the procaryotic 30S ribosomal subunit using our probabilistic least squares structure estimation algorithm. We show that the three methods that combine natural hierarchies with empirical hierarchies create decompositions which increase the efficiency of computations by as much as 50-fold. There is only half this gain when using the natural decomposition alone, while the random hierarchy suggests that a speedup of about five can be expected just by virtue of having a decomposition. Although the knowledge-based method performs marginally better, the automatic heuristics are easier to use, scale more reliably to larger problems, and can match the performance of knowledge-based methods if provided with basic structural information.
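One simple way to derive an "empirical" hierarchy from the constraint distribution — grouping atoms that are constrained to one another — is to take connected components of the constraint graph (a minimal sketch of the idea; the paper's heuristics are more elaborate):

```python
from collections import defaultdict

def constraint_components(n_atoms, constraints):
    """Partition atom indices 0..n_atoms-1 into substructures: the
    connected components of the graph whose edges are the pairwise
    structural constraints."""
    adj = defaultdict(set)
    for i, j in constraints:
        adj[i].add(j)
        adj[j].add(i)
    seen, components = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return components
```

Each component can then be estimated separately before the parts are assembled, which is the basic source of the speedups the abstract reports.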

Journal ArticleDOI
TL;DR: This work explains in detail how the observations of Evans and Speed lead to a simple, computationally feasible algorithm for constructing a minimal generating set for the ideal of invariants, and proves that the cardinality of such a generating set can be computed using a simple "degrees of freedom" formula.
Abstract: The method of invariants is an approach to the problem of reconstructing the phylogenetic tree of a collection of m taxa using nucleotide sequence data. Models for the respective probabilities of the 4^m possible vectors of bases at a given site will have unknown parameters that describe the random mechanism by which substitution occurs along the branches of a putative phylogenetic tree. An invariant is a polynomial in these probabilities that, for a given phylogeny, is zero for all choices of the substitution mechanism parameters. If the invariant is typically non-zero for another phylogenetic tree, then estimates of the invariant can be used as evidence to support one phylogeny over another. Previous work of Evans and Speed showed that, for certain commonly used substitution models, the problem of finding a minimal generating set for the ideal of invariants can be reduced to the linear algebra problem of finding a basis for a certain lattice (that is, a free ℤ-module). They also conjectured ...