
Showing papers in "Journal of Computational Biology in 1998"


Journal ArticleDOI
TL;DR: It is shown that the protein folding problem in the two-dimensional H-P model is NP-complete.
Abstract: We show that the protein folding problem in the two-dimensional H-P model is NP-complete.

436 citations


Journal ArticleDOI
TL;DR: The protein folding problem under the HP model on the cubic lattice is shown to be NP-complete, which means that the protein folding problem belongs to a large set of problems that are believed to be computationally intractable.
Abstract: One of the simplest and most popular biophysical models of protein folding is the hydrophobic-hydrophilic (HP) model. The HP model abstracts the hydrophobic interaction in protein folding by labeling the amino acids as hydrophobic (H for nonpolar) or hydrophilic (P for polar). Chains of amino acids are configured as self-avoiding walks on the 3D cubic lattice, where an optimal conformation maximizes the number of adjacencies between H's. In this paper, the protein folding problem under the HP model on the cubic lattice is shown to be NP-complete. This means that the protein folding problem belongs to a large set of problems that are believed to be computationally intractable.

399 citations
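Although finding an optimal fold is NP-complete in both the 2-D and 3-D settings, scoring a given conformation is easy, which is what makes heuristic search in the HP model attractive. A minimal sketch of the 2-D objective (the sequence, walk, and scoring conventions below are illustrative, not taken from either paper):

```python
# Score a conformation in the 2-D HP model: count H-H contacts between
# residues that are lattice neighbours but not adjacent along the chain.
def hp_score(seq, walk):
    """seq: string over 'H'/'P'; walk: self-avoiding list of (x, y) points."""
    assert len(seq) == len(walk) and len(set(walk)) == len(walk)
    pos = {p: i for i, p in enumerate(walk)}
    score = 0
    for i, (x, y) in enumerate(walk):
        if seq[i] != 'H':
            continue
        for nb in ((x + 1, y), (x, y + 1)):  # two directions: each contact counted once
            j = pos.get(nb)
            if j is not None and seq[j] == 'H' and abs(i - j) > 1:
                score += 1
    return score

# A 6-residue chain folded into a U shape: the two terminal H's touch.
print(hp_score("HPPPPH", [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]))  # → 1
```

An optimal conformation maximizes this score; the NP-completeness results say that no polynomial-time algorithm is expected to do so in general.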


Journal ArticleDOI
TL;DR: A review of a number of existing methods for discovering patterns in biosequences and of how these methods relate to each other, focusing on the algorithms underlying the approaches.
Abstract: This paper surveys approaches to the discovery of patterns in biosequences and places these approaches within a formal framework that systematises the types of patterns and the discovery algorithms. Patterns with expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering the patterns that are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis is presented of the ways in which an assessment can be made of the significance of the discovered patterns. It is shown that the problem is related to problems studied in the field of machine learning. The major part of this paper comprises a review of a number of existing methods developed to solve the problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examp...

351 citations


Journal ArticleDOI
TL;DR: This paper shows which formulations of multiple alignment have counterparts in multiple rearrangement, and proposes a branch-and-bound solution particularly suited to the instances of the Travelling Salesman Problem that arise from breakpoint analysis.
Abstract: Multiple alignment of macromolecular sequences generalizes from N = 2 to N ≥ 3 the comparison of N sequences which have diverged through the local processes of insertion, deletion and substitution. Gene-order sequences diverge through non-local genome rearrangement processes such as inversion (or reversal) and transposition. In this paper we show which formulations of multiple alignment have counterparts in multiple rearrangement. Based on difficulties inherent in rearrangement edit-distance calculation and interpretation, we argue for the simpler "breakpoint analysis." Consensus-based multiple rearrangement of N ≥ 3 orders can be solved exactly through reduction to instances of the Travelling Salesman Problem (TSP). We propose a branch-and-bound solution to TSP particularly suited to these instances. Simulations show how non-uniqueness of the solution is attenuated with increasing numbers of data genomes. Tree-based multiple alignment can be achieved to a great degree of accuracy by decomposing ...

248 citations
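The breakpoint distance underlying "breakpoint analysis" is simple to compute, in contrast to rearrangement edit distances. A sketch for unsigned gene orders (hypothetical data; the consensus problem then seeks a median order minimizing the total distance to the N data genomes, via the TSP reduction):

```python
def breakpoints(pi, sigma):
    """Count adjacencies of pi that are not adjacencies of sigma (either orientation)."""
    adj = set()
    for a, b in zip(sigma, sigma[1:]):
        adj.add((a, b))
        adj.add((b, a))
    return sum(1 for a, b in zip(pi, pi[1:]) if (a, b) not in adj)

# Transposing genes 2 and 3 creates two breakpoints.
print(breakpoints([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # → 2
```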


Journal ArticleDOI
TL;DR: A new, faster algorithm for the key step in the HMM calculation employs a fast Fourier transform on the group of pedigree inheritance patterns, which substantially improves the overall performance of the software package GENEHUNTER for performing linkage analysis.
Abstract: Genetic linkage analysis of human pedigrees using many linked markers simultaneously is a difficult computational problem. We have previously described an approach to this problem that uses hidden Markov models (HMMs) and is quite efficient for pedigrees of moderate size. Here, we describe a new, faster algorithm for the key step in the HMM calculation. The algorithm employs a fast Fourier transform on the group of pedigree inheritance patterns. It substantially improves the overall performance of the software package GENEHUNTER for performing linkage analysis. The Fourier representation opens up new research directions for pedigree analysis.

224 citations
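The group of inheritance patterns in a pedigree with n nonfounders is (Z/2)^(2n), so the "fast Fourier transform on the group" is a Walsh-Hadamard transform, under which convolution of distributions over inheritance vectors becomes pointwise multiplication. A minimal in-place sketch of the transform itself (an illustration, not the GENEHUNTER code):

```python
def fwht(a):
    """In-place fast Walsh-Hadamard transform; len(a) must be a power of two.
    Runs in O(N log N) versus O(N^2) for the naive transform."""
    h, n = 1, len(a)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

print(fwht([1.0, 0.0, 0.0, 0.0]))  # a point mass transforms to a constant
```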


Journal ArticleDOI
TL;DR: In this article, the stickers model of molecular computation is introduced; it features a random access memory requiring no strand extension and no enzymes, and a microprocessor-controlled robotic workstation architecture is proposed for implementing it to solve a wide class of search problems.
Abstract: We introduce a new model of molecular computation that we call the sticker model. Like many previous proposals it makes use of DNA strands as the physical substrate in which information is represented and of separation by hybridization as a central mechanism. However, unlike previous models, the stickers model has a random access memory that requires no strand extension and uses no enzymes; also (at least in theory), its materials are reusable. The paper describes computation under the stickers model and discusses possible means for physically implementing each operation. Finally, we go on to propose a specific machine architecture for implementing the stickers model as a microprocessor-controlled parallel robotic workstation. In the course of this development a number of previous general concerns about molecular computation (Smith, 1996; Hartmanis, 1995; Linial et al., 1995) are addressed. First, it is clear that general-purpose algorithms can be implemented by DNA-based computers, potentially solving a wide class of search problems. Second, we find that there are challenging problems, for which only modest volumes of DNA should suffice. Third, we demonstrate that the formation and breaking of covalent bonds is not intrinsic to DNA-based computation. Fourth, we show that a single essential biotechnology, sequence-specific separation, suffices for constructing a general-purpose molecular computer. Concerns about errors in this separation operation and means to reduce them are addressed elsewhere (Karp et al., 1995; Roweis and Winfree, 1999). Despite these encouraging theoretical advances, we emphasize that substantial engineering challenges remain at almost all stages and that the ultimate success or failure of DNA computing will certainly depend on whether these challenges can be met in laboratory investigations.

214 citations


Journal ArticleDOI
TL;DR: It is shown that combining motif scores indeed gives better search accuracy, and that the MAST sequence homology search algorithm utilizing the product of p-values scoring method is available for interactive use and downloading.
Abstract: Position-specific scoring matrices are useful for representing and searching for protein sequence motifs. A sequence family can often be described by a group of one or more motifs, and an effective search must combine the scores for matching a sequence to each of the motifs in the group. We describe three methods for combining match scores and estimating the statistical significance of the combined scores and evaluate the search quality (classification accuracy) and the accuracy of the estimate of statistical significance of each. The three methods are: 1) sum of scores, 2) sum of reduced variates, 3) product of score p-values. We show that method 3) is superior to the other two methods in both regards, and that combining motif scores indeed gives better search accuracy. The MAST sequence homology search algorithm utilizing the product of p-values scoring method is available for interactive use and downloading at URL http://www.sdsc.edu/MEME.

210 citations
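Method 3 relies on the fact that for n independent p-values, each uniform on (0, 1) under the null, the product p = p1...pn satisfies P(product <= p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!. A sketch of that combination step (independence and uniformity of the per-motif p-values are the assumptions here):

```python
import math

def combined_pvalue(pvals):
    """Significance of the product of n independent uniform(0,1) p-values."""
    n = len(pvals)
    p = math.prod(pvals)          # Python 3.8+
    if p == 0.0:
        return 0.0
    logp = math.log(p)
    return p * sum((-logp) ** i / math.factorial(i) for i in range(n))

print(combined_pvalue([0.1, 0.1]))  # ≈ 0.0561: stronger than either motif alone
```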


PatentDOI
TL;DR: In this article, a method of producing high-resolution, high-accuracy ordered restriction maps based on data created from the images of populations of individual DNA molecules (clones) digested by restriction enzymes is presented.
Abstract: A method of producing high-resolution, high-accuracy ordered restriction maps based on data created from the images of populations of individual DNA molecules (clones) digested by restriction enzymes. Detailed modeling and a statistical algorithm, along with an interactive algorithm based on dynamic programming and a heuristic method employing branch-and-bound procedures, are used to find the most likely true restriction map, based on experimental data.

121 citations


Journal ArticleDOI
TL;DR: The MORGAN system, including its decision tree routines and the algorithms for site recognition, and its performance on a benchmark database of vertebrate DNA are described.
Abstract: MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95 %, with a correlation coefficient of 0.78, and a sensitivity and specificity for coding bases of 83 % and 79%. In addition, MORGAN identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly. This paper describes the MORGAN system, including its decision tree routines and the algorithms for site recognition, and its performance on a benchmark database of vertebrate DNA.

118 citations


Journal ArticleDOI
TL;DR: The accuracy of the standard global dynamic programming method is measured and it is shown that it can be reasonably well modelled by an "edge wander" approximation to the distribution of the optimal scoring path around the correct path in the vicinity of a gap.
Abstract: Algorithms for generating alignments of biological sequences have inherent statistical limitations when it comes to the accuracy of the alignments they produce. Using simulations, we measure the accuracy of the standard global dynamic programming method and show that it can be reasonably well modelled by an "edge wander" approximation to the distribution of the optimal scoring path around the correct path in the vicinity of a gap. We also give a table from which accuracy values can be predicted for commonly used scoring schemes and sequence divergences (the PAM and BLOSUM series). Finally we describe how to calculate the expected accuracy of a given alignment, and show how this can be used to construct an optimal accuracy alignment algorithm which generates significantly more accurate alignments than standard dynamic programming methods in simulated experiments.

107 citations


Journal ArticleDOI
TL;DR: An algorithm whose running time grows only linearly with the size of the set of predicted exons, which allows for multiple-gene two-strand predictions and for considering gene features other than coding exons in valid gene structures.
Abstract: In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
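Stripped of frame compatibility and the Gene Model, the key idea — scan exon boundaries in increasing order while keeping the best score of any candidate gene that has already ended — can be sketched as follows (hypothetical exon data; after sorting, the sweep itself is linear in the number of exons):

```python
def best_gene_score(exons):
    """exons: (start, end, score) triples; chain nonoverlapping exons
    (previous end < next start) so that the total score is maximal."""
    by_start = sorted(range(len(exons)), key=lambda k: exons[k][0])
    by_end = sorted(range(len(exons)), key=lambda k: exons[k][1])
    chain = [0.0] * len(exons)   # best chain score ending with exon k
    best_prefix = 0.0            # best chain score among exons already ended
    best, i = 0.0, 0
    for k in by_start:
        start = exons[k][0]
        while i < len(by_end) and exons[by_end[i]][1] < start:
            best_prefix = max(best_prefix, chain[by_end[i]])
            i += 1
        chain[k] = best_prefix + exons[k][2]
        best = max(best, chain[k])
    return best

# (1,10) followed by (12,20) beats the single overlapping exon (5,15).
print(best_gene_score([(1, 10, 5.0), (12, 20, 4.0), (5, 15, 8.0)]))  # → 9.0
```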

Journal ArticleDOI
TL;DR: Modeling a DNA sequence as a stationary Markov chain, it is shown as an application that the compound Poisson approximation is efficient for the number of occurrences of rare stem-loop motifs.
Abstract: We derive a Poisson process approximation for the occurrences of clumps of multiple words and a compound Poisson process approximation for the number of occurrences of multiple words in a sequence of letters generated by a stationary Markov chain. Using the Chen-Stein method, we provide a bound on the error in the approximations. For rare words, these errors tend to zero as the length of the sequence increases to infinity. Modeling a DNA sequence as a stationary Markov chain, we show as an application that the compound Poisson approximation is efficient for the number of occurrences of rare stem-loop motifs.
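As a back-of-the-envelope instance of the approximation, here is the Poisson tail for the count of a word in an i.i.d. letter sequence; this simplifies the paper's setting, which models the sequence as a stationary Markov chain and handles clumping of self-overlapping words with a compound Poisson law:

```python
import math

def word_count_pvalue(word, n, base_probs, k):
    """Poisson approximation to P(word occurs >= k times) in an i.i.d.
    sequence of length n. base_probs: letter -> probability."""
    p = math.prod(base_probs[c] for c in word)
    lam = (n - len(word) + 1) * p              # expected number of occurrences
    return 1.0 - sum(math.exp(-lam) * lam ** j / math.factorial(j)
                     for j in range(k))

uniform = {c: 0.25 for c in "ACGT"}
print(word_count_pvalue("TATA", 1000, uniform, 10))
```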

Journal ArticleDOI
TL;DR: This work presents a new method for protein fold recognition through optimally aligning an amino acid sequence and a protein fold template (protein threading), and demonstrates that C is less than or equal to 4 for about 75% of the 293 unique folds in the protein database.
Abstract: Computational recognition of native-like folds of an anonymous amino acid sequence from a protein fold database is considered to be a promising approach to the three-dimensional (3D) fold prediction of the amino acid sequence. We present a new method for protein fold recognition through optimally aligning an amino acid sequence and a protein fold template (protein threading). The fitness of aligning an amino acid sequence with a fold template is measured by (1) the singleton fitness, representing the compatibility of substituting one amino acid by another and the combined preference of secondary structure and solvent accessibility for a particular amino acid, (2) the pairwise interaction, representing the contact preference between a pair of amino acids, and (3) alignment gap penalties. Though a protein threading problem so defined is known to be NP-hard in the most general sense, our algorithm runs efficiently if we place a cutoff distance on the pairwise interactions, as many of the existing threading programs do. For an amino acid sequence of size n and a fold template of size m with M core secondary structures, the algorithm finds an optimal alignment in O(Mn^(1.5C+1) + mn^(C+1)) time and O(Mn^(C+1)) space, where C is a (small) nonnegative integer, determined by a particular mathematical property of the pairwise interactions. As a case study, we have demonstrated that C is less than or equal to 4 for about 75% of the 293 unique folds in our protein database when pairwise interactions are restricted to spatially close amino acids; for larger C, threading requires too much memory and time to be practical on a typical workstation.

Journal ArticleDOI
TL;DR: A scalable approach to DNA-based computations is described, where complex combinatorial mixtures of DNA molecules encoding all possible answers to a computational problem are synthesized and attached to the surface of a solid support.
Abstract: A scalable approach to DNA-based computations is described. Complex combinatorial mixtures of DNA molecules encoding all possible answers to a computational problem are synthesized and attached to the surface of a solid support. This set of molecules is queried in successive MARK (hybridization) and DESTROY (enzymatic digestion) operations. Determination of the sequence of the DNA molecules remaining on the surface after completion of these operations yields the answer to the computational problem. Experimental demonstrations of aspects of the strategy are presented.
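The MARK/DESTROY cycle can be illustrated with a toy in-silico simulation: start with every candidate answer "on the surface", and for each constraint keep only the strands a MARK would protect. The SAT-style encoding below is an illustration of the strategy, not the paper's laboratory protocol:

```python
def solve_sat(n_vars, clauses):
    """clauses: lists of signed 1-based literals, e.g. [1, -2] means x1 or not-x2.
    Returns the satisfying assignments, encoded as integers (bit i = x_{i+1})."""
    pool = set(range(2 ** n_vars))        # synthesize all possible answers
    for clause in clauses:
        satisfies = lambda a: any(((a >> (abs(l) - 1)) & 1) == (l > 0) for l in clause)
        pool = {a for a in pool if satisfies(a)}   # MARK by hybridization,
                                                   # then DESTROY the unmarked
    return pool

print(solve_sat(2, [[1], [-2]]))  # → {1}: x1 = true, x2 = false
```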

Journal ArticleDOI
TL;DR: This paper addresses the problem of optimally aligning a given RNA sequence of unknown structure to one of known sequence and structure using methods from polyhedral combinatorics and could solve large problem instances--23S ribosomal RNA with more than 1400 bases.
Abstract: Ribonucleic acid (RNA) is a polymer composed of four bases denoted A, C, G, and U. It generally is a single-stranded molecule where the bases form hydrogen bonds within the same molecule leading to structure formation. In comparing different homologous RNA molecules it is important to consider both the base sequence and the structure of the molecules. Traditional alignment algorithms can only account for the sequence of bases, but not for the base pairings. Considering the structure leads to significant computational problems because of the dependencies introduced by the base pairings. In this paper we address the problem of optimally aligning a given RNA sequence of unknown structure to one of known sequence and structure. We phrase the problem as an integer linear program and then solve it using methods from polyhedral combinatorics. In our computational experiments we could solve large problem instances--23S ribosomal RNA with more than 1400 bases--a size intractable for former algorithms.

Journal ArticleDOI
TL;DR: A survey of some criteria of wide use in sequence alignment and comparison problems, and of the corresponding solutions is attempted.
Abstract: Molecular biology is becoming a computationally intense realm of contemporary science and faces some of the current grand scientific challenges. In its context, tools that identify, store, compare and analyze effectively large and growing numbers of bio-sequences are found of increasingly crucial importance. Biosequences are routinely compared or aligned, in a variety of ways, to infer common ancestry, to detect functional equivalence, or simply while searching for similar entries in a database. A considerable body of knowledge has accumulated on sequence alignment during the past few decades. Without pretending to be exhaustive, this paper attempts a survey of some criteria of wide use in sequence alignment and comparison problems, and of the corresponding solutions. The paper is based on presentations and literature given at the Workshop on Sequence Alignment held at Princeton, N.J., in November 1994, as part of the DIMACS Special Year on Mathematical Support for Molecular Biology.

Journal ArticleDOI
TL;DR: The database is a novel application of ACEDB, which was the database originally developed to store the C. elegans genome and includes attractive graphical representations of signaling cascades and the three-dimensional structure of molecules.
Abstract: We developed a data and knowledge base for cellular signal transduction in human cells, to make this rapidly growing information available. The database includes all the biological properties of cellular signal transduction, including biological reactions that transfer cellular signals and molecular attributes characterized by sequences, structures, and functions. Since the database is based on the object-oriented technique, highly flexible methods of data definition and modification are necessary to handle this diverse and complex biological information. The database includes attractive graphical representations of signaling cascades and the three-dimensional structure of molecules. The database is a novel application of ACEDB, which was the database originally developed to store the C. elegans genome. The database can be accessed through the Internet at http://geo.nihs.go.jp/csndb.html.

Journal ArticleDOI
TL;DR: An algorithm developed to handle biomolecular structural recognition problems, based on an extension and generalization of the Hough transform and the Geometric Hashing paradigms for rigid object recognition, which allows hinge induced motions to exist in either the receptor or the ligand molecules of diverse sizes.
Abstract: In this work, we present an algorithm developed to handle biomolecular structural recognition problems, as part of an interdisciplinary research endeavor of the Computer Vision and Molecular Biology fields. A key problem in rational drug design and in biomolecular structural recognition is the generation of binding modes between two molecules, also known as molecular docking. Geometrical fitness is a necessary condition for molecular interaction. Hence, docking a ligand (e.g., a drug molecule or a protein molecule), to a protein receptor (e.g., enzyme), involves recognition of molecular surfaces. Conformational transitions by "hinge-bending" involves rotational movements of relatively rigid parts with respect to each other. The generation of docked binding modes between two associating molecules depends on their three dimensional structures (3-D) and their conformational flexibility. In comparison to the particular case of rigid-body docking, the computational difficulty grows considerably when t...

Journal ArticleDOI
TL;DR: In the framework of a duplication-based method for comparing gene and species trees, the concepts of "duplication" and "loss" are reformulated in set-theoretic terms, a number of related tree dissimilarity measures are suggested, and relations between them are analyzed.
Abstract: In the framework of a duplication-based method for comparing gene and species trees, the concepts of "duplication" and "loss" are reformulated in set-theoretic terms. A number of related tree dissimilarity measures are suggested, and relations between them are analyzed. For any node in the species tree, the number of gene duplications for which it is a "non-child" loss coincides with the number of times when the node's parent is an intermediate between the mapping images of a gene node and its parent. This implies that the total number of losses is equal to the number of intermediate nodes plus the number of one-side duplications and, thus, provides an alternative proof for a conjecture made by Mirkin, Muchnik, and Smith (1995). Another formula proven involves crossings (incompatible gene-species node pairs): the number of losses equals the number of crossings plus the number of duplications.
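The duplications being counted come from the standard least-common-ancestor mapping of the gene tree into the species tree: a gene node is a duplication when its image coincides with the image of one of its children. A small sketch of that mapping (hypothetical toy trees):

```python
def lca(parent, a, b):
    """parent: species-tree node -> its parent (root maps to None)."""
    seen = set()
    while a is not None:
        seen.add(a)
        a = parent[a]
    while b not in seen:
        b = parent[b]
    return b

def count_duplications(gene_tree, root, species_of, parent):
    """gene_tree: internal node -> (left, right); leaves map via species_of."""
    M = {}
    def walk(v):
        M[v] = species_of[v] if v not in gene_tree else \
               lca(parent, walk(gene_tree[v][0]), walk(gene_tree[v][1]))
        return M[v]
    walk(root)
    return sum(1 for v, (l, r) in gene_tree.items() if M[v] in (M[l], M[r]))

# Species tree S = (A, B); gene tree ((a1, b1), a2) duplicates at its root.
parent = {"A": "S", "B": "S", "S": None}
gene_tree = {"g0": ("g1", "a2"), "g1": ("a1", "b1")}
species_of = {"a1": "A", "b1": "B", "a2": "A"}
print(count_duplications(gene_tree, "g0", species_of, parent))  # → 1
```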

Journal ArticleDOI
TL;DR: Two new approaches for constructing phylogenetic trees are presented, based on geometric ideas and dynamic programming, and it is guaranteed to find the optimal tree (with respect to the given quartets).
Abstract: In this work we present two new approaches for constructing phylogenetic trees. The input is a list of weighted quartets over n taxa. Each quartet is a subtree on four taxa, and its weight represents a confidence level for the specific topology. The goal is to construct a binary tree with n leaves such that the total weight of the satisfied quartets is maximized (an NP hard problem). The first approach we present is based on geometric ideas. Using semidefinite programming, we embed the n points on the n-dimensional unit sphere, while maximizing an objective function. This function depends on Euclidean distances between the four points and reflects the quartet topology. Given the embedding, we construct a binary tree by performing geometric clustering. This process is similar to the traditional neighbor joining, with the difference that the update phase retains geometric meaning: When two neighbors are joined together, their common ancestor is taken to be the center of mass of the original points....

Journal ArticleDOI
TL;DR: An algorithm for identifying satellites in DNA sequences that is easily adapted to finding tandem repeats in protein sequences, as well as extended to identifying mixed direct-inverse tandem repeats.
Abstract: We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30–40 base pairs) approximate tandem repeats where copies may differ up to ϵ = 15–20% from a consensus model of the repeating unit (implying individual units may vary by 2ϵ from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 104 when ϵ = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repeated unit as well as t...
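The filtering phase can be caricatured in a few lines: slide a window and keep regions where most positions match the character p places downstream, for a candidate period p. This naive sketch (illustrative thresholds, not the paper's statistically calibrated filter) conveys why filtering is cheap compared with exploring the space of repeat models:

```python
def periodic_windows(s, period, window, min_frac):
    """Start positions of windows where >= min_frac of positions i satisfy
    s[i] == s[i + period], a crude signature of a tandem repeat."""
    hits = []
    for i in range(len(s) - window - period + 1):
        matches = sum(s[i + j] == s[i + j + period] for j in range(window))
        if matches / window >= min_frac:
            hits.append(i)
    return hits

print(periodic_windows("ACGACGACGACG", 3, 6, 1.0))  # → [0, 1, 2, 3]
```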

Journal ArticleDOI
TL;DR: The current work introduces a straightforward generalization of pairwise sequence comparison algorithms to the case when multiple query sequences are available, called Family Pairwise Search (FPS), which is much more efficient than the training algorithms for statistical models.
Abstract: The function of an unknown biological sequence can often be accurately inferred by identifying sequences homologous to the original sequence. Given a query set of known homologs, there exist at least three general classes of techniques for finding additional homologs: pairwise sequence comparisons, motif analysis, and hidden Markov modeling. Pairwise sequence comparisons are typically employed when only a single query sequence is known. Hidden Markov models (HMMs), on the other hand, are usually trained with sets of more than 100 sequences. Motif-based methods fall in between these two extremes. The current work introduces a straightforward generalization of pairwise sequence comparison algorithms to the case when multiple query sequences are available. This algorithm, called Family Pairwise Search (FPS), combines pairwise sequence comparison scores from each query sequence. A BLAST implementation of FPS is compared to representative examples of hidden Markov modeling (HMMER) and motif modeling (...
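The combination step of FPS is deliberately simple: score a candidate against every query and merge the pairwise scores, e.g. by summing. A sketch (the `score` callable stands in for a pairwise comparison such as a BLAST score; names and data are hypothetical):

```python
def fps_rank(candidates, queries, score, combine=sum):
    """Rank candidate sequences by the combined pairwise score against
    all query family members (Family Pairwise Search, in outline)."""
    return sorted(candidates,
                  key=lambda c: combine(score(q, c) for q in queries),
                  reverse=True)

# Toy pairwise scores: c2 matches q1 strongly, c1 matches both weakly.
pairwise = {("q1", "c1"): 1.0, ("q2", "c1"): 1.0,
            ("q1", "c2"): 3.0, ("q2", "c2"): 0.0}
print(fps_rank(["c1", "c2"], ["q1", "q2"], lambda q, c: pairwise[(q, c)]))  # → ['c2', 'c1']
```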

Journal ArticleDOI
TL;DR: An efficient, reliable shotgun sequence assembly algorithm is proposed, based on a fingerprinting scheme that is robust to both noise and repetitive sequences in the data, the two primary roadblocks to effective whole-genome shotgun sequencing.
Abstract: This thesis deals with the sequence assembly problem: reconstructing a DNA sequence from a collection of short DNA fragments taken from random positions by identifying overlaps between the fragments. This problem arises in practice when biologists use shotgun sequencing, a cost-effective method for reading DNA. Existing sequence assembly algorithms have demonstrated limited success in assembling (long) real DNA sequences because they are computationally expensive and fail to properly handle repetitive sequences which are common in real DNA. We propose an efficient, reliable, shotgun sequence assembly algorithm based on a fingerprinting scheme that is quite robust to both noise and repetitive sequences in the data. Our algorithm uses exact matches of short patterns randomly selected from fragment data to identify fragment overlaps, construct an overlap map, and finally deliver a consensus sequence. We show how statistical clues made explicit in our approach can easily be exploited to correctly assemble results even in the presence of extensive repetitive sequences. Our approach is exceptionally fast in practice: e.g., we have successfully assembled a whole Mycoplasma genitalium genome (approximately 580 kbps) in roughly 8 minutes of 64MB 200MHz Pentium Pro CPU time from real shotgun data, where most existing algorithms can be expected to run for several hours to a day on the data. In addition, experiments with shotgun data (data is taken from a wide range of organisms, including human DNA) synthetically prepared from real DNA sequences containing extensive repeats demonstrate our algorithm's robustness to repetitive sections in many different sequences. For example, we have correctly assembled a 238kbp Human DNA sequence in less than 3 minutes of 64MB 200MHz Pentium Pro CPU time.

Journal ArticleDOI
TL;DR: The results indicate that computing an optimal alignment under this constraint is very expensive, however, less rigorous conditions on the alignment can be guaranteed by quite efficient algorithms.
Abstract: Given a strong match between regions of two sequences, how far can the match be meaningfully extended if gaps are allowed in the resulting alignment? The aim is to avoid searching beyond the point that a useful extension of the alignment is likely to be found. Without loss of generality, we can restrict attention to the suffixes of the sequences that follow the strong match, which leads to the following formal problem. Given two sequences and a fixed X > 0, align initial portions of the sequences subject to the constraint that no section of the alignment scores below -X. Our results indicate that computing an optimal alignment under this constraint is very expensive. However, less rigorous conditions on the alignment can be guaranteed by quite efficient algorithms. One of these variants has been implemented in a new release of the Blast suite of database search programs.
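The "less rigorous conditions" include the X-drop heuristic that BLAST adopted: extend while the running score stays within X of the best score seen, and report the best-scoring prefix. An ungapped, one-directional sketch (scoring values are illustrative; the paper's setting additionally allows gaps):

```python
def xdrop_extend(a, b, x, match=1, mismatch=-1):
    """Extend an alignment to the right along a and b, stopping once the
    running score drops more than x below the best score so far.
    Returns (best score, length of the best-scoring extension)."""
    best = score = best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        if best - score > x:
            break                 # give up: unlikely to recover
    return best, best_len

print(xdrop_extend("AAAATTTT", "AAAACCCC", x=2))  # → (4, 4)
```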

Journal ArticleDOI
TL;DR: A statistical model, a hidden Markov model (HMM), of the DM domain has been created which identifies currently known DM domains and suggests new DM domains in viral, bacterial and eucaryotic proteins, but no DM domains were identified in the currently predicted proteins from the archaeon Methanococcus jannaschii.
Abstract: Deamination reactions are catalyzed by a variety of enzymes including those involved in nucleoside/nucleotide metabolism and cytosine to uracil (C→U) and adenosine to inosine (A→I) mRNA editing. The active site of the deaminase (DM) domain in these enzymes contains a conserved histidine (or rarely cysteine), two cysteines and a glutamate proposed to act as a proton shuttle during deamination. Here, a statistical model, a hidden Markov model (HMM), of the DM domain has been created which identifies currently known DM domains and suggests new DM domains in viral, bacterial and eucaryotic proteins. However, no DM domains were identified in the currently predicted proteins from the archaeon Methanococcus jannaschii and possible causes for, and a potential means to ameliorate this situation are discussed. In some of the newly identified DM domains, the glutamate is changed to a residue that could not function as a proton shuttle and in one instance (Mus musculus spermatid protein TENR) the cysteines a...

Journal ArticleDOI
TL;DR: An algorithm to find three-dimensional substructures common to two or more molecules and extended to perform multiple comparisons, by using one of the structures as a reference point (pivot) to which all other structures are compared.
Abstract: In this paper, we present an algorithm to find three-dimensional substructures common to two or more molecules. The basic algorithm is devoted to pairwise structural comparison. Given two sets of atomic coordinates, it finds the largest subsets of atoms which are "similar" in the sense that all internal distances are approximately conserved. The basic idea of the algorithm is to recursively build subsets of increasing sizes, combining two sets of size k to build a set of size k + 1. The algorithm can be used "as is" for small molecules or local parts of proteins (about 30 atoms). When a large number of atoms is involved, we use a two-step procedure. First we look for common "local" fragments by using the previous algorithm, and then we gather these fragments by using a Branch and Bound technique. We also extend the basic algorithm to perform multiple comparisons, by using one of the structures as a reference point (pivot) to which all other structures are compared. The solution is the largest subsets of atoms common to the pivot and at least q other structures. Although both algorithms are theoretically exponential in the number of atoms, experiments performed on biological data and using realistic parameters show that the solution is obtained within a few minutes. Finally, an application to the determination of the structural core of seven globins is presented.
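The core test in the pairwise comparison — that all internal distances of a candidate atom correspondence are approximately conserved — can be sketched as follows (a simplified illustration; the function name and tolerance are assumptions, not the paper's code):

```python
import math

def distances_conserved(coords_a, coords_b, pairs, tol=0.5):
    """Return True if every internal distance is conserved within tol
    under the correspondence pairs = [(i, j), ...], which matches atom i
    of molecule A to atom j of molecule B."""
    for x in range(len(pairs)):
        for y in range(x + 1, len(pairs)):
            ia, ib = pairs[x]
            ja, jb = pairs[y]
            d_a = math.dist(coords_a[ia], coords_a[ja])
            d_b = math.dist(coords_b[ib], coords_b[jb])
            if abs(d_a - d_b) > tol:
                return False
    return True
```

Note that when two conserved sets of size k share k - 1 pairs, only the single new cross-distance needs checking, which is what keeps the recursive growth step cheap.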

Journal ArticleDOI
TL;DR: A general framework is presented for analyzing multiple protein structures using statistical regression methods, and it is revealed that globins are most strongly conserved structurally in helical regions, particularly in the mid-regions of the E, F, and G helices.
Abstract: A general framework is presented for analyzing multiple protein structures using statistical regression methods. The regression approach can superimpose protein structures rigidly or with shear. Also, this approach can superimpose multiple structures explicitly, without resorting to pairwise superpositions. The algorithm alternates between matching corresponding landmarks among the protein structures and superimposing these landmarks. Matching is performed using a robust dynamic programming technique that uses gap penalties that adapt to the given data. Superposition is performed using either orthogonal transformations, which impose the rigid-body assumption, or affine transformations, which allow shear. The resulting regression model of a protein family measures the amount of structural variability at each landmark. A variation of our algorithm permits a separate weight for each landmark, thereby allowing one to emphasize particular segments of a protein structure or to compensate for variances that differ at various positions in a structure. In addition, a method is introduced for finding an initial correspondence, by measuring the discrete curvature along each protein backbone. Discrete curvature also characterizes the secondary structure of a protein backbone, distinguishing among helical, strand, and loop regions. An example is presented involving a set of seven globin structures. Regression analysis, using both affine and orthogonal transformations, reveals that globins are most strongly conserved structurally in helical regions, particularly in the mid-regions of the E, F, and G helices.
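The orthogonal (rigid-body) superposition step corresponds to the classical Procrustes/Kabsch solution, which can be sketched with an SVD (a generic illustration of the rigid case only; the paper's alternating landmark matching and affine variants are not shown):

```python
import numpy as np

def orthogonal_superpose(X, Y):
    """Rigidly superimpose landmark set Y (n x 3) onto X (n x 3) using the
    least-squares rotation plus translation (Kabsch algorithm), no shear."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    H = (Y - my).T @ (X - mx)           # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return (Y - my) @ R + mx            # Y mapped into X's frame
```

Repeating this superposition across all structures after each re-matching of landmarks is the alternation the abstract describes; replacing the rotation with a general linear map gives the affine (shear-permitting) variant.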

Journal ArticleDOI
TL;DR: It is found that under the conditions required to obtain single nucleotide specificity in the hybridization process, hybridization efficiency is low, compromising the utility of singleucleotide encoding for DNA computing applications in the absence of some additional mechanism for increasing specificity.
Abstract: The feasibility of encoding a bit (0 or 1) of information for DNA-based computations at the single nucleotide level is evaluated, particularly with regard to the efficiency and specificity of hybridization discrimination. Hybridization experiments are performed on addressed arrays of 32 (2^5) distinct oligonucleotides immobilized on chemically modified glass and gold surfaces with information encoded in a binary (base 2) format. Similar results are obtained on both glass and gold surfaces and the results are generally consistent with thermodynamic calculations of matched and mismatched duplex stabilities. It is found that under the conditions required to obtain single nucleotide specificity in the hybridization process, hybridization efficiency is low, compromising the utility of single nucleotide encoding for DNA computing applications in the absence of some additional mechanism for increasing specificity. Several methods are suggested to provide such increased discrimination.
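The idea of one bit per nucleotide position can be illustrated with a toy encoding (the paper's actual base choices and sequence design are not given here; using `A` for 0 and `G` for 1 is an assumption):

```python
from itertools import product

def encode_word(bits, zero="A", one="G"):
    """Map a tuple of bits to an oligonucleotide, one base per bit."""
    return "".join(one if b else zero for b in bits)

# 2^5 = 32 distinct 5-mers, one oligo per 5-bit address
library = [encode_word(bits) for bits in product((0, 1), repeat=5)]
```

Under such a scheme, two addresses differing in one bit differ in exactly one base, which is why hybridization must discriminate at single-nucleotide resolution for the encoding to work.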

Journal ArticleDOI
TL;DR: This paper shows that the three methods that combine natural hierarchies with empirical hierarchies create decompositions which increase the efficiency of computations by as much as 50-fold and suggests that a speedup of about five can be expected just by virtue of having a decomposition.
Abstract: The task of computing molecular structure from combinations of experimental and theoretical constraints is expensive because of the large number of estimated parameters (the 3D coordinates of each atom) and the rugged landscape of many objective functions. For large molecular ensembles with multiple protein and nucleic acid components, the problem of maintaining tractability in structural computations becomes critical. A well-known strategy for solving difficult problems is divide-and-conquer. For molecular computations, there are two ways in which problems can be divided: (1) using the natural hierarchy within biological macromolecules (taking advantage of primary sequence, secondary structural subunits and tertiary structural motifs, when they are known); and (2) using the hierarchy that results from analyzing the distribution of structural constraints (providing information about which substructures are constrained to one another). In this paper, we show that these two hierarchies can be complementary and can provide information for efficient decomposition of structural computations. We demonstrate five methods for building such hierarchies--two automated heuristics that use both natural and empirical hierarchies, one knowledge-based process using both hierarchies, one method based on the natural hierarchy alone, and for completeness one random hierarchy oblivious to auxiliary information--and apply them to a data set for the procaryotic 30S ribosomal subunit using our probabilistic least squares structure estimation algorithm. We show that the three methods that combine natural hierarchies with empirical hierarchies create decompositions which increase the efficiency of computations by as much as 50-fold. There is only half this gain when using the natural decomposition alone, while the random hierarchy suggests that a speedup of about five can be expected just by virtue of having a decomposition. Although the knowledge-based method performs marginally better, the automatic heuristics are easier to use, scale more reliably to larger problems, and can match the performance of knowledge-based methods if provided with basic structural information.
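One simple way to derive an "empirical" hierarchy from the constraint distribution — grouping atoms that are constrained to one another — is to take connected components of the constraint graph (a minimal sketch of the idea; the paper's heuristics are more elaborate):

```python
from collections import defaultdict

def constraint_components(n_atoms, constraints):
    """Partition atom indices 0..n_atoms-1 into substructures: the
    connected components of the graph whose edges are the pairwise
    structural constraints."""
    adj = defaultdict(set)
    for i, j in constraints:
        adj[i].add(j)
        adj[j].add(i)
    seen, components = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return components
```

Each component can then be estimated separately before the parts are assembled, which is the basic source of the speedups the abstract reports.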

Journal ArticleDOI
TL;DR: This work explains in detail how the observations of Evans and Speed lead to a simple, computationally feasible algorithm for constructing a minimal generating set for the ideal of invariants, and proves that the cardinality of such a generating set can be computed using a simple "degrees of freedom" formula.
Abstract: The method of invariants is an approach to the problem of reconstructing the phylogenetic tree of a collection of m taxa using nucleotide sequence data. Models for the respective probabilities of the 4^m possible vectors of bases at a given site will have unknown parameters that describe the random mechanism by which substitution occurs along the branches of a putative phylogenetic tree. An invariant is a polynomial in these probabilities that, for a given phylogeny, is zero for all choices of the substitution mechanism parameters. If the invariant is typically non-zero for another phylogenetic tree, then estimates of the invariant can be used as evidence to support one phylogeny over another. Previous work of Evans and Speed showed that, for certain commonly used substitution models, the problem of finding a minimal generating set for the ideal of invariants can be reduced to the linear algebra problem of finding a basis for a certain lattice (that is, a free ℤ-module). They also conjectured ...