
Showing papers in "Journal of Computational Biology in 2000"


Journal ArticleDOI
TL;DR: A new greedy alignment algorithm is introduced with particularly good performance and it is shown that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data.
Abstract: For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.

4,628 citations
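
A minimal sketch of the greedy idea, in the spirit of the O(ND) furthest-reaching method this line of work builds on (not the authors' exact algorithm): matches are followed along a diagonal for free, so the work is proportional to the number of differences rather than to the product of the sequence lengths.

```python
def greedy_indel_distance(a, b):
    """Greedy furthest-reaching alignment sketch (Myers-style, not the paper's
    exact algorithm): returns the number of insertions + deletions needed to
    turn a into b. Fast when the sequences are nearly identical."""
    n, m = len(a), len(b)
    V = {1: 0}  # V[k] = furthest row x reached on diagonal k = x - y
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and V.get(k - 1, -1) < V.get(k + 1, -1)):
                x = V[k + 1]          # extend from the diagonal above (gap in a)
            else:
                x = V[k - 1] + 1      # extend from the diagonal below (gap in b)
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1   # matches are free: slide along the diagonal
            V[k] = x
            if x >= n and y >= m:
                return d              # d differences suffice
    return n + m

print(greedy_indel_distance("ACGTACGTACGT", "ACGTACGAACGT"))  # one substitution = 2 indels
```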


Journal ArticleDOI
TL;DR: A new framework for discovering interactions between genes based on multiple expression measurements is proposed, and a method is described for recovering gene interactions from microarray data using tools for learning Bayesian networks.
Abstract: DNA hybridization arrays simultaneously measure the expression level for thousands of genes. These measurements provide a "snapshot" of transcription levels within the cell. A major challenge in computational biology is to uncover, from such measurements, gene/protein interactions and key biological features of cellular systems. In this paper, we propose a new framework for discovering interactions between genes based on multiple expression measurements. This framework builds on the use of Bayesian networks for representing statistical dependencies. A Bayesian network is a graph-based model of joint multivariate probability distributions that captures properties of conditional independence between variables. Such models are attractive for their ability to describe complex stochastic processes and because they provide a clear methodology for learning from (noisy) observations. We start by showing how Bayesian networks can describe interactions between genes. We then describe a method for recovering gene interactions from microarray data using tools for learning Bayesian networks. Finally, we demonstrate this method on the S. cerevisiae cell-cycle measurements of Spellman et al. (1998).

3,507 citations
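
As a hedged illustration of structure learning by scoring (the paper works with discrete local models; this sketch substitutes linear-Gaussian ones), candidate networks can be compared by a BIC score in which each gene is regressed on its parents:

```python
import numpy as np

def bic_score(data, parents):
    """BIC score of a candidate network under linear-Gaussian local models
    (a stand-in for the paper's models): each gene is regressed on its
    parents; higher is better. data: samples x genes."""
    n = data.shape[0]
    score = 0.0
    for child, pa in parents.items():
        y = data[:, child]
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in pa])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(((y - X @ beta) ** 2).sum())
        sigma2 = max(rss / n, 1e-12)
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        score += loglik - 0.5 * (len(pa) + 2) * np.log(n)  # penalize parameters
    return score

# toy data: gene 2 depends on genes 0 and 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.3 * rng.normal(size=200)

print(bic_score(X, {0: [], 1: [], 2: [0, 1]}))  # true structure scores higher
print(bic_score(X, {0: [], 1: [], 2: []}))      # empty graph scores lower
```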


Journal ArticleDOI
TL;DR: It is demonstrated that ANOVA methods can be used to normalize microarray data and provide estimates of changes in gene expression that are corrected for potential confounding effects, establishing a framework for the general analysis and interpretation of microarray data.
Abstract: Spotted cDNA microarrays are emerging as a powerful and cost-effective tool for large-scale analysis of gene expression. Microarrays can be used to measure the relative quantities of specific mRNAs in two or more tissue samples for thousands of genes simultaneously. While the power of this technology has been recognized, many open questions remain about appropriate analysis of microarray data. One question is how to make valid estimates of the relative expression for genes that are not biased by ancillary sources of variation. Recognizing that there is inherent "noise" in microarray data, how does one estimate the error variation associated with an estimated change in expression, i.e., how does one construct the error bars? We demonstrate that ANOVA methods can be used to normalize microarray data and provide estimates of changes in gene expression that are corrected for potential confounding effects. This approach establishes a framework for the general analysis and interpretation of microarray data.

1,392 citations
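
A toy sketch of the normalization idea under an assumed balanced dye-swap design (the paper fits a full ANOVA model; here main effects are estimated by marginal means, which is valid only for balanced layouts):

```python
import numpy as np

rng = np.random.default_rng(0)
n_arrays, n_genes = 6, 500
true_effect = np.zeros(n_genes)
true_effect[:20] = 1.0                      # 20 genes up-regulated in variety 1

# two-dye design with dye swaps: variety on (array a, dye d) is d XOR (a odd)
log_y = np.empty((n_arrays, 2, n_genes))
for a in range(n_arrays):
    for d in range(2):
        v = d ^ (a % 2)
        log_y[a, d] = (8.0 + 0.4 * a - 0.3 * d      # array and dye artifacts
                       + v * true_effect
                       + rng.normal(0, 0.25, n_genes))

# balanced design, so ANOVA main effects are estimated by marginal means
r = log_y - log_y.mean(axis=(1, 2), keepdims=True)  # remove array effects
r -= r.mean(axis=(0, 2), keepdims=True)             # remove dye effects

# variety-by-gene means of the residuals estimate relative expression
vg = np.zeros((2, n_genes))
for a in range(n_arrays):
    for d in range(2):
        vg[d ^ (a % 2)] += r[a, d]
vg /= n_arrays                                      # each variety seen once per array
est = vg[1] - vg[0]
print(est[:3].round(2), est[100:103].round(2))      # ~1.0 for the first genes, ~0 later
```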


Journal ArticleDOI
TL;DR: This work examines three sets of gene expression data measured across sets of tumor and normal clinical samples, and presents results of performing leave-one-out cross validation (LOOCV) experiments on the three data sets, employing a nearest neighbor classifier, SVM, AdaBoost and a novel clustering-based classification technique.
Abstract: Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer-related cellular processes. Gene expression data is also expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor and normal clinical samples: The first set consists of 2,000 genes, measured in 62 epithelial colon samples (Alon et al., 1999). The second consists of approximately 100,000 clones, measured in 32 ovarian samples (unpublished extension of data set described in Schummer et al. (1999)). The third set consists of approximately 7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub et al., 1999). We examine the use of scoring methods, measuring separation of tissue type (e.g., tumors from normals) using individual gene expression levels. These are then coupled with high-dimensional classification methods to assess the classification power of complete expression profiles. We present results of performing leave-one-out cross validation (LOOCV) experiments on the three data sets, employing a nearest neighbor classifier, SVM (Cortes and Vapnik, 1995), AdaBoost (Freund and Schapire, 1997) and a novel clustering-based classification technique. As tumor samples can differ from normal samples in their cell-type composition, we also perform LOOCV experiments using appropriately modified sets of genes, attempting to eliminate the resulting bias. We demonstrate a success rate of at least 90% in tumor versus normal classification, using sets of selected genes, with, as well as without, cellular-contamination-related members. These results are insensitive to the exact selection mechanism, over a certain range.

789 citations
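
An illustrative LOOCV protocol on synthetic data (the dimensions mimic the colon set; parameter choices are hypothetical, and the calls are standard scikit-learn). Note that gene selection is redone inside every fold, since selecting genes on the full data set would bias the estimated success rate:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 62, 2000                       # mimics the colon data set's dimensions
y = rng.integers(0, 2, n)
X = rng.normal(0.0, 1.0, (n, p))
X[:, :30] += y[:, None] * 1.5         # 30 informative genes separate the classes

def loocv_accuracy(clf, k_genes=50):
    hits = 0
    for train, test in LeaveOneOut().split(X):
        # score genes on the training fold only, to avoid selection bias
        m1 = X[train][y[train] == 1].mean(0)
        m0 = X[train][y[train] == 0].mean(0)
        sep = np.abs(m1 - m0) / (X[train].std(0) + 1e-9)
        top = np.argsort(sep)[-k_genes:]
        clf.fit(X[train][:, top], y[train])
        hits += int(clf.predict(X[test][:, top])[0] == y[test][0])
    return hits / n

print("1-NN LOOCV accuracy:", loocv_accuracy(KNeighborsClassifier(n_neighbors=1)))
print("SVM  LOOCV accuracy:", loocv_accuracy(SVC(kernel="linear")))
```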


Journal ArticleDOI
TL;DR: A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily using a new kernel function derived from a generative statistical model for a protein family, in this case a hidden Markov model.
Abstract: A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a protein family, in this case a hidden Markov model. This general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.

530 citations
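
A minimal stand-in for the kernel construction: the paper differentiates the log-likelihood of a profile HMM, while this sketch uses a plain multinomial residue model, whose softmax-parameterized Fisher score is simply observed counts minus expected counts.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def fisher_score(seq, theta):
    """Fisher score U_x = grad log P(x | theta) for a multinomial residue model
    with softmax parameters (a toy stand-in for the paper's profile HMM):
    U_a = n_a - n * theta_a."""
    counts = np.array([seq.count(a) for a in AA], dtype=float)
    return counts - counts.sum() * theta

def fisher_kernel(s1, s2, theta):
    """Inner product of Fisher scores; the paper feeds such kernels to an SVM."""
    return float(fisher_score(s1, theta) @ fisher_score(s2, theta))

theta = np.full(len(AA), 1.0 / len(AA))          # uniform background model
print(fisher_kernel("ACDKLMACD", "ACDACDKLM", theta))  # similar composition
print(fisher_kernel("ACDKLMACD", "WWWYYYWWW", theta))  # dissimilar composition
```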


Journal ArticleDOI
TL;DR: It is proved that the general problem of predicting RNA secondary structures containing pseudoknots is NP-complete for a large class of reasonable models of pseudoknots.
Abstract: RNA molecules are sequences of nucleotides that serve as more than mere intermediaries between DNA and proteins, e.g., as catalytic molecules. Computational prediction of RNA secondary structure is among the few structure prediction problems that can be solved satisfactorily in polynomial time. Most work has been done to predict structures that do not contain pseudoknots. Allowing pseudoknots introduces modeling and computational problems. In this paper we consider the problem of predicting RNA secondary structures with pseudoknots based on free energy minimization. We first give a brief comparison of energy-based methods for predicting RNA secondary structures with pseudoknots. We then prove that the general problem of predicting RNA secondary structures containing pseudoknots is NP-complete for a large class of reasonable models of pseudoknots.

350 citations
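
For contrast with the NP-completeness result, the pseudoknot-free case is polynomial; below is a minimal Nussinov-style O(n^3) sketch that maximizes base pairs (real predictors minimize free energy under much richer models):

```python
def nussinov(seq, min_loop=3):
    """Maximum base-pair, pseudoknot-free structure count by O(n^3) dynamic
    programming (Nussinov-style; energy minimization uses richer models)."""
    wc = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]                       # case 1: i is unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
                if (seq[i], seq[k]) in wc:
                    right = dp[k + 1][j] if k < j else 0
                    best = max(best, 1 + dp[i + 1][k - 1] + right)
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov("GGGAAAUCC"))   # -> 3 (three nested pairs around the AAA loop)
```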


Journal ArticleDOI
TL;DR: A refined test for differentially expressed genes is reported which does not rely on gene expression ratios but directly compares a series of repeated measurements of the two dye intensities for each gene, using a statistical model to describe multiplicative and additive errors influencing an array experiment.
Abstract: Although two-color fluorescent DNA microarrays are now standard equipment in many molecular biology laboratories, methods for identifying differentially expressed genes in microarray data are still evolving. Here, we report a refined test for differentially expressed genes which does not rely on gene expression ratios but directly compares a series of repeated measurements of the two dye intensities for each gene. This test uses a statistical model to describe multiplicative and additive errors influencing an array experiment, where model parameters are estimated from observed intensities for all genes using the method of maximum likelihood. A generalized likelihood ratio test is performed for each gene to determine whether, under the model, these intensities are significantly different. We use this method to identify significant differences in gene expression among yeast cells growing in galactose-stimulating versus non-stimulating conditions and compare our results with current approaches for identifying...

336 citations
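
A simplified version of the per-gene test (the paper's model has multiplicative and additive error components fit by maximum likelihood; this sketch substitutes a plain Gaussian model) still shows the GLRT mechanics:

```python
import numpy as np
from scipy.stats import chi2

def glrt_pvalue(x, y):
    """Generalized likelihood ratio test that two sets of replicate intensities
    share a mean, under a per-gene Gaussian model with common variance (a
    simplification of the paper's multiplicative + additive error model)."""
    z = np.concatenate([x, y]); n = len(z)
    s0 = ((z - z.mean()) ** 2).sum() / n                 # null: one mean
    s1 = (((x - x.mean()) ** 2).sum()
          + ((y - y.mean()) ** 2).sum()) / n             # alternative: two means
    lam = n * (np.log(s0) - np.log(s1))                  # 2 * log likelihood ratio
    return chi2.sf(lam, df=1)                            # asymptotic p-value

rng = np.random.default_rng(0)
same = glrt_pvalue(rng.normal(5.0, 0.3, 8), rng.normal(5.0, 0.3, 8))
diff = glrt_pvalue(rng.normal(5.0, 0.3, 8), rng.normal(6.0, 0.3, 8))
print(f"unchanged gene p = {same:.3f}, changed gene p = {diff:.2e}")
```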


Journal ArticleDOI
TL;DR: This extension of Dayhoff's approach enables the estimation of an amino acid substitution model from alignments of varying degrees of divergence, and the new estimator is shown to accurately recover the exchange frequencies among amino acids.
Abstract: The estimation of amino acid replacement frequencies during molecular evolution is crucial for many applications in sequence analysis. Score matrices for database search programs or phylogenetic analysis rely on such models of protein evolution. Pioneering work was done by Dayhoff et al. (1978) who formulated a Markov model of evolution and derived the famous PAM score matrices. Her estimation procedure for amino acid exchange frequencies is restricted to pairs of proteins that have a constant and small degree of divergence. Here we present an improved estimator, called the resolvent method, that is not subject to these limitations. This extension of Dayhoff's approach enables us to estimate an amino acid substitution model from alignments of varying degree of divergence. Extensive simulations show the capability of the new estimator to recover accurately the exchange frequencies among amino acids. Based on the SYSTERS database of aligned protein families (Krause and Vingron, 1998) we recompute a series of...

312 citations
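
The resolvent estimator itself is not reproduced here; as a hedged illustration of the underlying goal (recovering one substitution model from alignments at varying divergence), one can take the matrix logarithm of empirical transition matrices:

```python
import numpy as np
from scipy.linalg import expm, logm

# toy 4-state "amino acid" alphabet with a true rate matrix Q (rows sum to 0)
rng = np.random.default_rng(0)
K = 4
Q = rng.random((K, K))
np.fill_diagonal(Q, 0.0)
Q[np.diag_indices(K)] = -Q.sum(axis=1)

def estimate_Q(t, n_pairs=200000):
    """Estimate Q from sequence pairs sampled at divergence t under a Markov
    model of evolution: count substitutions, form the empirical transition
    matrix, and take logm(P_hat) / t (a stand-in for the resolvent method)."""
    P = expm(Q * t)
    counts = np.zeros((K, K))
    anc = rng.integers(0, K, n_pairs)
    for i in range(K):
        n_i = int((anc == i).sum())
        counts[i] = np.bincount(rng.choice(K, n_i, p=P[i]), minlength=K)
    P_hat = counts / counts.sum(axis=1, keepdims=True)
    return logm(P_hat).real / t

# unlike a fixed-PAM procedure, the same estimator works across divergences
for t in (0.1, 0.5, 1.5):
    print(t, np.abs(estimate_Q(t) - Q).max().round(3))   # max |Q_hat - Q|
```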


Journal ArticleDOI
TL;DR: Two exact algorithms for extracting conserved structured motifs from a set of DNA sequences are introduced and are efficient enough to be able to infer site consensi, such as promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream from all genes of a genome.
Abstract: This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes - that is, the motifs themselves - are unknown at the start of the algorithm. This is precisely what the algorithms are meant to find. A suffix tree is used for finding such motifs. The algorithms are efficient enough to be able to infer site consensi, such as promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream from all genes of a genome. In particular, both algorithms' time complexity scales linearly with N²n, where n is the average length of the sequences and N is their number. An application to the identification of promoter...

277 citations
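
A naive enumeration sketch for two-box structured motifs (exact boxes, no substitutions; the paper's suffix-tree algorithms handle substitution rates and scale far better):

```python
def structured_motifs(seqs, k=6, d_min=4, d_max=6, quorum=None):
    """Naive enumeration of two-box motifs box1 - spacer(d_min..d_max) - box2,
    exact matches only, occurring in at least `quorum` sequences. Illustrative
    stand-in for the paper's suffix-tree algorithms."""
    quorum = quorum or len(seqs)
    hits = {}
    for s_id, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            for d in range(d_min, d_max + 1):
                j = i + k + d
                if j + k <= len(s):
                    hits.setdefault((s[i:i+k], s[j:j+k]), set()).add(s_id)
    return {m for m, ids in hits.items() if len(ids) >= quorum}

# toy upstream regions hiding a -35/-10 promoter-like pair of boxes
seqs = ["AATTGACAGGCGTATAATCC",
        "GTTGACATTTCGTATAATGA",
        "CTTGACACCACGTATAATTT"]
print(structured_motifs(seqs))   # {('TTGACA', 'TATAAT')}
```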


Journal ArticleDOI
TL;DR: An overview is given of statistical and probabilistic properties of words as they occur in the analysis of biological sequences, with special emphasis on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words.
Abstract: In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein's method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words.

259 citations
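
A small sketch of the first step only: the expected count of a word under a fitted order-1 stationary Markov chain. The variance and the Poisson/compound-Poisson refinements require exactly the overlap corrections that are the subject of the paper.

```python
import numpy as np

def word_count_stats(seq, word):
    """Observed count of `word` vs its expectation under a stationary order-1
    Markov model fitted to seq. (Only the mean is computed here; the variance
    needs the overlap/clump corrections surveyed in the paper.)"""
    alpha = sorted(set(seq))
    idx = {a: i for i, a in enumerate(alpha)}
    T = np.zeros((len(alpha), len(alpha)))
    for a, b in zip(seq, seq[1:]):
        T[idx[a], idx[b]] += 1.0
    start = T.sum(axis=1) / T.sum()            # empirical letter frequencies
    T /= T.sum(axis=1, keepdims=True)          # transition probabilities
    p = start[idx[word[0]]]
    for a, b in zip(word, word[1:]):
        p *= T[idx[a], idx[b]]
    positions = len(seq) - len(word) + 1
    observed = sum(seq[i:i + len(word)] == word for i in range(positions))
    return observed, positions * p

rng = np.random.default_rng(0)
seq = "".join(rng.choice(list("ACGT"), 20000))
print(word_count_stats(seq, "ACGT"))   # observed close to expected (~ 20000/256)
```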


Journal ArticleDOI
TL;DR: This article investigates aspects of pairwise and multiple structure comparison, and the problem of automatically discovering common patterns in a set of structures.
Abstract: This article investigates aspects of pairwise and multiple structure comparison, and the problem of automatically discovering common patterns in a set of structures. Representations of structures and patterns are described, as well as scoring schemes and algorithms for comparison and discovery. A framework and nomenclature are developed for classifying different methods, and many of these are reviewed and placed into this framework.

Journal ArticleDOI
TL;DR: An improved O(m^(ω-2)n^D + mn^(D+ω-3)) time Monte-Carlo type randomized algorithm is presented, where ω is the exponent of matrix multiplication; although simple, the result is nontrivial and the technique can be applied to several related problems.
Abstract: Due to the recent progress of the DNA microarray technology, a large number of gene expression profile data are being produced. How to analyze gene expression data is an important topic in computational molecular biology. Several studies have been done using the Boolean network as a model of a genetic network. This paper proposes efficient algorithms for identifying Boolean networks of bounded indegree and related biological networks, where identification of a Boolean network can be formalized as a problem of identifying many Boolean functions simultaneously. For the identification of a Boolean network, an O(mn^(D+1)) time naive algorithm and a simple O(mn^D) time algorithm are known, where n denotes the number of nodes, m denotes the number of examples, and D denotes the maximum indegree. This paper presents an improved O(m^(ω-2)n^D + mn^(D+ω-3)) time Monte-Carlo type randomized algorithm, where ω is the exponent of matrix multiplication (currently, ω < 2.376). The algorithm is obtained by combining fast matrix multiplication with the randomized fingerprint function for string matching. Although the algorithm and its analysis are simple, the result is nontrivial and the technique can be applied to several related problems.
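
A sketch of the naive identification strategy the paper improves on: for each node, enumerate candidate parent sets of size at most D and keep those for which the observed input/output examples are functionally consistent.

```python
from itertools import combinations

def identify_parents(inputs, outputs, D=2):
    """Naive Boolean-network identification: for each node, find parent sets of
    size <= D such that the node's next state is a function of those parents
    across all m examples (the O(mn^(D+1))-style enumeration; the paper speeds
    this up with fast matrix multiplication and fingerprinting)."""
    m, n = len(inputs), len(inputs[0])
    nets = []
    for node in range(n):
        consistent = []
        for k in range(D + 1):
            for pa in combinations(range(n), k):
                table, ok = {}, True
                for t in range(m):
                    key = tuple(inputs[t][j] for j in pa)
                    if table.setdefault(key, outputs[t][node]) != outputs[t][node]:
                        ok = False
                        break
                if ok:
                    consistent.append(pa)
        nets.append(consistent)
    return nets

# toy network: node0' = x1 AND x2, node1' = NOT x0, node2' = x2
ins = [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
outs = [(i[1] & i[2], 1 - i[0], i[2]) for i in ins]
print(identify_parents(ins, outs))  # true parent sets appear (plus supersets)
```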

Journal ArticleDOI
TL;DR: A probabilistic model of protein sequence/structure relationships in terms of structural segments is developed, and secondary structure prediction is formulated as a general Bayesian inference problem.
Abstract: We present a novel method for predicting the secondary structure of a protein from its amino acid sequence. Most existing methods predict each position in turn based on a local window of residues, sliding this window along the length of the sequence. In contrast, we develop a probabilistic model of protein sequence/structure relationships in terms of structural segments, and formulate secondary structure prediction as a general Bayesian inference problem. A distinctive feature of our approach is the ability to develop explicit probabilistic models for α-helices, β-strands, and other classes of secondary structure, incorporating experimentally and empirically observed aspects of protein structure such as helical capping signals, side chain correlations, and segment length distributions. Our model is Markovian in the segments, permitting efficient exact calculation of the posterior probability distribution over all possible segmentations of the sequence using dynamic programming. The optimal segmentation is...

Journal ArticleDOI
TL;DR: A new notion of spectral similarity is introduced that allows one to identify related spectra even if the corresponding peptides have multiple modifications/mutations, and a new algorithm for mutation-tolerant database search as well as a method for cross-correlating related uncharacterized spectra.
Abstract: Database search in tandem mass spectrometry is a powerful tool for protein identification. High-throughput spectral acquisition raises the problem of dealing with genetic variation and peptide modifications within a population of related proteins. A method that cross-correlates and clusters related spectra in large collections of uncharacterized spectra (i.e., from normal and diseased individuals) would be very valuable in functional proteomics. This problem is far from being simple since very similar peptides may have very different spectra. We introduce a new notion of spectral similarity that allows one to identify related spectra even if the corresponding peptides have multiple modifications/mutations. Based on this notion, we developed a new algorithm for mutation-tolerant database search as well as a method for cross-correlating related uncharacterized spectra.
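
One concrete notion along these lines (a sketch, not the paper's full algorithm) is the spectral convolution: histogram all pairwise mass differences between two peak lists, and shared shifts show up as peaks in the histogram.

```python
from collections import Counter

def spectral_convolution(s1, s2):
    """Spectral convolution of two peak lists: counts, for each mass shift d,
    how many peaks of s2 align with peaks of s1 after subtracting d. A large
    count at d != 0 suggests a modification/mutation of total mass d."""
    return Counter(round(b - a, 2) for a in s1 for b in s2)

# toy spectra: s2 is s1 with every peak after the 3rd shifted by +16 (e.g., oxidation)
s1 = [87.0, 145.1, 204.2, 261.3, 348.4, 435.5]
s2 = [87.0, 145.1, 204.2, 277.3, 364.4, 451.5]
print(spectral_convolution(s1, s2).most_common(3))  # shifts 0.0 and 16.0 dominate
```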

Journal ArticleDOI
TL;DR: The distance-based tree method is illustrated with a CGH data set for renal cancer, showing how it complements the branching tree method, and the distance-based trees are reconstructed using tree-fitting algorithms developed by researchers in phylogenetics.
Abstract: Comparative genomic hybridization (CGH) is a laboratory method to measure gains and losses in the copy number of chromosomal regions in tumor cells. It is hypothesized that certain DNA gains and losses are related to cancer progression and that the patterns of these changes are relevant to the clinical consequences of the cancer. It is therefore of interest to develop models which predict the occurrence of these events, as well as techniques for learning such models from CGH data. We continue our study of the mathematical foundations for inferring a model of tumor progression from a CGH data set that we started in Desper et al. (1999). In that paper, we proposed a class of probabilistic tree models and showed that an algorithm based on maximum-weight branching in a graph correctly infers the topology of the tree, under plausible assumptions. In this paper, we extend that work in the direction of the so-called distance-based trees, in which events are leaves of the tree, in the style of models common in phylogenetics...

Journal ArticleDOI
TL;DR: A new class of metrics is defined, the mountain metrics, on the set of RNA secondary structures of a fixed length, and properties of these metrics are compared with other well-known metrics on RNA secondary structures.
Abstract: Many different programs have been developed for the prediction of the secondary structure of an RNA sequence. Some of these programs generate an ensemble of structures, all of which have free energy close to that of the optimal structure, making it important to be able to quantify how similar these different structures are. To deal with this problem, we define a new class of metrics, the mountain metrics, on the set of RNA secondary structures of a fixed length. We compare properties of these metrics with other well known metrics on RNA secondary structures. We also study some global and local properties of these metrics.
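
A minimal sketch of one mountain metric, assuming the usual convention that the mountain height at a position is the number of base pairs enclosing it:

```python
def mountain(db):
    """Mountain representation of a dot-bracket structure: the height at each
    position is the number of base pairs enclosing it."""
    h, out = 0, []
    for c in db:
        if c == "(":
            h += 1
        out.append(h)
        if c == ")":
            h -= 1
    return out

def mountain_dist(s1, s2, p=1):
    """An l_p mountain metric between two equal-length structures."""
    assert len(s1) == len(s2)
    return sum(abs(a - b) ** p for a, b in zip(mountain(s1), mountain(s2))) ** (1 / p)

print(mountain_dist("((((...))))", "(((.....)))"))   # -> 5.0
```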

Journal ArticleDOI
TL;DR: The simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution is considered, showing that full tree annotations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of significance.
Abstract: Words that are, by some measure, over- or underrepresented in the context of larger sequences have been variously implicated in biological functions and mechanisms. In most approaches to such anomaly detections, the words (up to a certain length) are enumerated more or less exhaustively and are individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. Here we take the global approach of annotating the suffix tree of a sequence with some such values and scores, having in mind to use it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to undergo a more accurate scrutiny. We consider in depth the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. Our main result consists of showing that, within this model, full tree annotations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of significance.

Journal ArticleDOI
TL;DR: The JIGSAW algorithm applies graph algorithms and probabilistic reasoning techniques, enforcing first-principles consistency rules in order to overcome a 5-10% signal-to-noise ratio.
Abstract: High-throughput, data-directed computational protocols for Structural Genomics (or Proteomics) are required in order to evaluate the protein products of genes for structure and function at rates comparable to current gene-sequencing technology. This paper presents the JIGSAW algorithm, a novel high-throughput, automated approach to protein structure characterization with nuclear magnetic resonance (NMR). JIGSAW applies graph algorithms and probabilistic reasoning techniques, enforcing first-principles consistency rules in order to overcome a 5-10% signal-to-noise ratio. It consists of two main components: (1) graph-based secondary structure pattern identification in unassigned heteronuclear NMR data, and (2) assignment of spectral peaks by probabilistic alignment of identified secondary structure elements against the primary sequence. Deferring assignment eliminates the bottleneck faced by traditional approaches, which begin by correlating peaks among dozens of experiments. JIGSAW utilizes only four experiments, none of which requires 13C-labeled protein, thus dramatically reducing both the amount and expense of wet lab molecular biology and the total spectrometer time. Results for three test proteins demonstrate that JIGSAW correctly identifies 79-100% of alpha-helical and 46-65% of beta-sheet NOE connectivities and correctly aligns 33-100% of secondary structure elements. JIGSAW is very fast, running in minutes on a Pentium-class Linux workstation. This approach yields quick and reasonably accurate (as opposed to the traditional slow and extremely accurate) structure calculations. It could be useful for quick structural assays to speed data to the biologist early in an investigation and could in principle be applied in an automation-like fashion to a large fraction of the proteome.

Journal ArticleDOI
TL;DR: This work uses a simple thermodynamic model to cast this design problem in a formal mathematical framework, and derives an efficient construction for the design problem and proves that the construction is near-optimal.
Abstract: Custom-designed DNA arrays offer the possibility of simultaneously monitoring thousands of hybridization reactions. These arrays show great potential for many medical and scientific applications, such as polymorphism analysis and genotyping. Relatively high costs are associated with the need to specifically design and synthesize problem-specific arrays. Recently, an alternative approach was suggested that utilizes fixed, universal arrays. This approach presents an interesting design problem: the arrays should contain as many probes as possible, while minimizing experimental errors caused by cross-hybridization. We use a simple thermodynamic model to cast this design problem in a formal mathematical framework. Employing new combinatorial ideas, we derive an efficient construction for the design problem and prove that our construction is near-optimal.

Journal ArticleDOI
TL;DR: The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties: learning the automaton, for any L, takes O(n) time, and prediction of a string of m symbols by the automaton takes O(m) time.
Abstract: Statistical modeling of sequences is a central paradigm of machine learning that finds multiple uses in computational molecular biology and many other domains. The probabilistic automata typically built in these contexts are subtended by uniform, fixed-memory Markov models. In practice, such automata tend to be unnecessarily bulky and computationally imposing both during their synthesis and use. Recently, D. Ron, Y. Singer, and N. Tishby built much more compact, tree-shaped variants of probabilistic automata under the assumption of an underlying Markov process of variable memory length. These variants, called Probabilistic Suffix Trees (PSTs), were subsequently adapted by G. Bejerano and G. Yona and applied successfully to learning and prediction of protein families. The process of learning the automaton from a given training set S of sequences requires Θ(Ln²) worst-case time, where n is the total length of the sequences in S and L is the length of a longest substring of S to be considered for a candidate state in the automaton. Once the automaton is built, predicting the likelihood of a query sequence of m characters may cost time Θ(m²) in the worst case. The main contribution of this paper is to introduce automata equivalent to PSTs but having the following properties: Learning the automaton, for any L, takes O(n) time. Prediction of a string of m symbols by the automaton takes O(m) time. Along the way, the paper presents an evolving learning scheme and addresses notions of empirical probability and related efficient computation, which is a by-product possibly of more general interest.

Journal ArticleDOI
TL;DR: A new off-lattice bead model, capable of simulating several different fold classes of small proteins, is proposed, and the sequence for an α/β protein resembling the IgG-binding proteins L and G is presented.
Abstract: Simulations of simplified protein folding models have provided much insight into solving the protein folding problem. We propose here a new off-lattice bead model, capable of simulating several different fold classes of small proteins. We present the sequence for an alpha/beta protein resembling the IgG-binding proteins L and G. The thermodynamics of the folding process for this model are characterized using the multiple multihistogram method combined with constant-temperature Langevin simulations. The folding is shown to be highly cooperative, with chain collapse nearly accompanying folding. Two parallel folding pathways are shown to exist on the folding free energy landscape. One pathway contains an intermediate (similar to experiments on protein G), and one pathway contains no intermediates (similar to experiments on protein L). The folding kinetics are characterized by tabulating mean first-passage times, and we show that the onset of glasslike kinetics occurs at much lower temperatures than the folding temperature. This model is expected to be useful in many future contexts: investigating questions of the role of local versus nonlocal interactions in various fold classes, addressing the effect of sequence mutations affecting secondary structure propensities, and providing a computationally feasible model for studying the role of solvation forces in protein folding.

Journal ArticleDOI
TL;DR: A new, more powerful, sequencing algorithm for the gapped-probe scheme is presented and it is proved that the new algorithm exploits the full potential of the SBH technology with high-confidence performance that comes within a small constant factor of the information-theory bound.
Abstract: In a recent paper (Preparata et al., 1999) we introduced a novel probing scheme for DNA sequencing by hybridization (SBH). The new gapped-probe scheme combines natural and universal bases in a well-defined periodic pattern. It has been shown (Preparata et al., 1999) that the performance of the gapped-probe scheme (in terms of the length of a sequence that can be uniquely reconstructed using a given size library of probes) is significantly better than the standard scheme based on oligomer probes. In this paper we present and analyze a new, more powerful, sequencing algorithm for the gapped-probe scheme. We prove that the new algorithm exploits the full potential of the SBH technology with high-confidence performance that comes within a small constant factor (about 2) of the information-theory bound. Moreover, this performance is achieved while maintaining running time linear in the target sequence length.

Journal ArticleDOI
TL;DR: A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of an MSA, based on the use of a solution to the Traveling Salesman Problem, which identifies a circular tour through an evolutionary tree connecting the sequences in a protein family.
Abstract: Multiple sequence alignments (MSAs) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of the structure, functionality and, ultimately, the evolution of proteins. A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of an MSA. It is based on the use of a solution to the Traveling Salesman Problem, which identifies a circular tour through an evolutionary tree connecting the sequences in a protein family. With this approach, the calculation of an evolutionary tree and the errors that it would introduce can be avoided altogether. The algorithm gives an upper bound, the best score that can possibly be achieved by any MSA for a given set of protein sequences. Alternatively, if presented with a specific MSA, the algorithm provides a formal score for the MSA, which serves as an absolute measure of the quality of the MSA. The CS measure yields a direct connection between an MSA and the...
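
A brute-force sketch of the bound (the paper obtains the tour from a Traveling Salesman solution rather than by enumeration): given optimal pairwise scores, the best circular tour yields the upper bound; scoring an MSA's induced pairwise alignments along the same tour and comparing against the bound gives the quality measure.

```python
from itertools import permutations

def cs_upper_bound(pair_score):
    """Circular Sum bound: the maximum-weight circular tour through all
    sequences, scored with optimal pairwise alignment scores (brute force over
    tours; pair_score[i][j] = optimal pairwise score of sequences i and j)."""
    n = len(pair_score)
    best = float("-inf")
    for perm in permutations(range(1, n)):      # fix sequence 0 to kill rotations
        tour = (0,) + perm
        s = sum(pair_score[tour[k]][tour[(k + 1) % n]] for k in range(n))
        best = max(best, s)
    return best

# toy symmetric pairwise scores for 4 sequences
S = [[0, 9, 3, 4],
     [9, 0, 5, 2],
     [3, 5, 0, 8],
     [4, 2, 8, 0]]
print(cs_upper_bound(S))   # -> 26 (tour 0-1-2-3 scores 9 + 5 + 8 + 4)
```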

Journal ArticleDOI
TL;DR: An implementation of McCaskill's algorithm for computing the base pair probabilities of an RNA molecule for massively parallel message passing architectures is presented and applications to complete viral genomes are discussed.
Abstract: We present an implementation of McCaskill’s algorithm for computing the base pair probabilities of an RNA molecule for massively parallel message passing architectures. The program can be used to routinely fold RNA sequences of more than 10,000 nucleotides. Applications to complete viral genomes are discussed.

Journal ArticleDOI
TL;DR: The Bayesian estimator, which is applicable for both short and long segments, is used to obtain the measure of homogeneity and an exact optimal segmentation is found via the dynamic programming technique.
Abstract: We present a new approach to DNA segmentation into compositionally homogeneous blocks. The Bayesian estimator, which is applicable for both short and long segments, is used to obtain the measure of homogeneity. An exact optimal segmentation is found via the dynamic programming technique. After completion of the segmentation procedure, the sequence composition on different scales can be analyzed with filtration of boundaries via the partition function approach.
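
A sketch of the dynamic program with a plug-in log-likelihood as the homogeneity score and a fixed per-segment penalty (the paper uses a Bayesian estimator in place of both):

```python
import math

def segment(seq, penalty=3.0):
    """Optimal segmentation by dynamic programming: best[j] = max over i of
    best[i] + score(seq[i:j]) - penalty. Here score is a plug-in multinomial
    log-likelihood, a simple stand-in for the paper's Bayesian estimator."""
    n = len(seq)
    alpha = sorted(set(seq))

    def score(i, j):
        L = j - i
        counts = (seq[i:j].count(a) for a in alpha)
        return sum(c * math.log(c / L) for c in counts if c)

    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            s = best[i] + score(i, j) - penalty
            if s > best[j]:
                best[j], back[j] = s, i
    cuts, j = [], n
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts)

# homogeneous A-block, mixed block, homogeneous G-block
print(segment("A" * 25 + "ACGT" * 10 + "G" * 25))  # cuts near 25 and 65, plus the end
```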

Journal ArticleDOI
TL;DR: The A* algorithm together with a standard bounding technique is superior to the well-known Carrillo-Lipman bounding since it excludes more nodes from consideration and can speed up computations considerably.
Abstract: Multiple alignment is an important problem in computational biology. It is well known that it can be solved exactly by a dynamic programming algorithm which in turn can be interpreted as a shortest path computation in a directed acyclic graph. The A* algorithm (or goal-directed unidirectional search) is a technique that speeds up the computation of a shortest path by transforming the edge lengths without losing the optimality of the shortest path. We implemented the A* algorithm in a computer program similar to MSA (Gupta et al., 1995) and FMA (Shibuya and Imai, 1997). We incorporated in this program new bounding strategies for both lower and upper bounds and show that the A* algorithm, together with our improvements, can speed up computations considerably. Additionally, we show that the A* algorithm together with a standard bounding technique is superior to the well-known Carrillo-Lipman bounding since it excludes more nodes from consideration.
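
A two-sequence sketch of the goal-directed search idea (the paper works on the multiple-alignment lattice with much stronger bounds); the heuristic below is admissible because the remaining length difference forces at least that many gaps, so the first time the goal is popped the cost is optimal.

```python
import heapq

def astar_align(a, b, mismatch=1, gap=1):
    """A* over the pairwise alignment DAG: node (i, j), goal (len(a), len(b)).
    Admissible, consistent heuristic: gap * |remaining length difference|."""
    n, m = len(a), len(b)
    h = lambda i, j: gap * abs((n - i) - (m - j))
    dist = {(0, 0): 0}
    heap = [(h(0, 0), 0, 0)]
    while heap:
        f, i, j = heapq.heappop(heap)
        g = f - h(i, j)
        if g > dist.get((i, j), g):
            continue                          # stale queue entry
        if (i, j) == (n, m):
            return g                          # optimal alignment cost
        steps = []
        if i < n and j < m:
            steps.append((i + 1, j + 1, 0 if a[i] == b[j] else mismatch))
        if i < n:
            steps.append((i + 1, j, gap))
        if j < m:
            steps.append((i, j + 1, gap))
        for x, y, c in steps:
            if g + c < dist.get((x, y), float("inf")):
                dist[(x, y)] = g + c
                heapq.heappush(heap, (g + c + h(x, y), x, y))

print(astar_align("GATTACA", "GCATGCU"))   # -> 4 (unit-cost edit distance)
```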

Journal ArticleDOI
TL;DR: A mathematical model for error-prone PCR is developed and methods to estimate the mutation rate during error-prone PCR without assuming low mutation rate are presented.
Abstract: Error-prone polymerase chain reaction (PCR) is widely used to introduce point mutations during in vitro evolution experiments. Accurate estimation of the mutation rate during error-prone PCR is important in studying the diversity of error-prone PCR product. Although many methods for estimating the mutation rate during PCR are available, all the existing methods depend on the assumption that the mutation rate is low and mutations occur at different places whenever they occur. The available methods may not be applicable to estimate the mutation rate during error-prone PCR. We develop a mathematical model for error-prone PCR and present methods to estimate the mutation rate during error-prone PCR without assuming low mutation rate. We also develop a computer program to simulate error-prone PCR. Using the program, we compare the newly developed methods with two other methods. We show that when the mutation rate is relatively low (≤ 5 x 10^(-3) per base per PCR cycle, the mutation rate for most error-prone PCR experiments), the previous methods underestimate the mutation rate and the newly developed methods approximate the true mutation rate.
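
A toy model of why naive estimates are biased when the rate is not small (not the authors' estimator): a product molecule has been replicated a random number of times, and a base can mutate more than once, so the observed mutant fraction grows sublinearly in the rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_epcr(mu, cycles, length=1000, sample=2000):
    """Simulate error-prone PCR at full efficiency: a random product has been
    replicated k ~ Binomial(cycles, 1/2) times; each replication mutates each
    base with prob mu (to one of the 3 other bases, Jukes-Cantor style)."""
    k = rng.binomial(cycles, 0.5, size=sample)
    p_diff = 0.75 * (1.0 - (1.0 - 4.0 * mu / 3.0) ** k)  # P(base ends up changed)
    return rng.binomial(length, p_diff) / length          # mutant fraction per molecule

def estimate_mu(frac, cycles):
    """Invert the model at the mean replication count k = cycles/2, without
    assuming mu is small (cf. the naive estimate frac / (cycles/2))."""
    f, k = frac.mean(), cycles / 2.0
    return 0.75 * (1.0 - (1.0 - 4.0 * f / 3.0) ** (1.0 / k))

mu, cycles = 5e-3, 30
frac = simulate_epcr(mu, cycles)
print("naive :", frac.mean() / (cycles / 2))  # biased low
print("model :", estimate_mu(frac, cycles))   # close to the true 5e-3
```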

Journal ArticleDOI
TL;DR: An overview is given of the existing results on the statistical distribution of word counts in a Markovian sequence of letters, and of how these results can be used to study the statistical frequency of motifs in a given sequence.
Abstract: In this paper, we give an overview of the existing results on the statistical distribution of word counts in a Markovian sequence of letters. Results concerning the number of overlapping occurrences, the number of renewals and the number of clumps will be presented. Counts of single words and also multiple words are considered. Most of the results are approximations as the length of the sequence tends to infinity. We will see that Gaussian approximations switch to (compound) Poisson approximations for rare words. Modeling DNA sequences or proteins by stationary Markov chains, these results can be used to study the statistical frequency of motifs in a given sequence.

Journal ArticleDOI
TL;DR: A Bayesian model for the changes in topology caused by recombination events is described here, which relaxes the assumption of one topology for all sites in an alignment and uses the theory of Hidden Markov models to facilitate calculations.
Abstract: Most phylogenetic tree estimation methods assume that there is a single set of hierarchical relationships among sequences in a data set for all sites along an alignment. Mosaic sequences produced by past recombination events will violate this assumption and may lead to misleading results from a phylogenetic analysis due to the imposition of a single tree along the entire alignment. Therefore, the detection of past recombination is an important first step in an analysis. A Bayesian model for the changes in topology caused by recombination events is described here. This model relaxes the assumption of one topology for all sites in an alignment and uses the theory of Hidden Markov models to facilitate calculations, the hidden states being the underlying topologies at each site in the data set. Changes in topology along the multiple sequence alignment are estimated by means of the maximum a posteriori (MAP) estimate. The performance of the MAP estimate is assessed by application of the model to data sets of f...
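
A sketch of the HMM machinery only, assuming per-site log-likelihoods under each candidate topology have already been computed (e.g., by Felsenstein pruning) and using a hypothetical topology-switching probability:

```python
import numpy as np

def forward_loglik(site_loglik, switch=0.01):
    """Scaled forward algorithm for an HMM whose hidden state is the tree
    topology at each alignment site (a sketch of the paper's setup): `switch`
    is the probability of changing topology between adjacent sites, and
    site_loglik[t, s] is the log-likelihood of site t under topology s."""
    T, S = site_loglik.shape
    A = np.full((S, S), switch / (S - 1))
    np.fill_diagonal(A, 1.0 - switch)
    f = np.full(S, 1.0 / S) * np.exp(site_loglik[0])
    total = 0.0
    for t in range(1, T):
        c = f.sum()
        total += np.log(c)
        f = ((f / c) @ A) * np.exp(site_loglik[t])
    return total + np.log(f.sum())

# toy: topology 0 fits sites 0..49, topology 2 fits sites 50..99 (a recombinant block)
rng = np.random.default_rng(0)
ll = rng.normal(-10.0, 0.1, size=(100, 3))
ll[:50, 0] += 2.0
ll[50:, 2] += 2.0
print(forward_loglik(ll))               # higher: a topology change near site 50 is allowed
print(forward_loglik(ll, switch=1e-9))  # lower: effectively forces a single topology
```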

Journal ArticleDOI
TL;DR: An optimal algorithm is presented that uses circular orders to compare the topologies of two trees given by their distance matrices, allowing computation of the Robinson and Foulds topological distance between the trees.
Abstract: It has been postulated that existing species have been linked in the past in a way that can be described using an additive tree structure. Any such tree structure reflecting species relationships is associated with a matrix of distances between the species considered which is called a distance matrix or a tree metric matrix. A circular order of elements of X corresponds to a circular (clockwise) scanning of the subset X of vertices of a tree drawn on a plane. This paper describes an optimal algorithm using circular orders to compare the topology of two trees given by their distance matrices. This algorithm allows us to compute the Robinson and Foulds topological distance between two trees. It employs circular order tree reconstruction to compute an ordered bipartition table of the tree edges for both given distance matrices. These bipartition tables are then compared to determine the Robinson and Foulds topological distance, known to be an important criterion of tree similarity. The described algorithm has optimal...
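
Given the two bipartition tables, the comparison step is simple set arithmetic; below is a sketch on trees written as nested tuples (the paper's contribution is building these tables optimally from distance matrices via circular orders):

```python
def splits(tree, taxa):
    """Non-trivial bipartitions (edge splits) of a tree given as nested tuples.
    Each split is stored canonically as the side not containing the first taxon."""
    taxa = frozenset(taxa)
    first = min(taxa)
    out = set()

    def leaves(t):
        if isinstance(t, tuple):
            s = frozenset().union(*(leaves(c) for c in t))
            if 1 < len(s) < len(taxa) - 1:          # skip trivial splits
                out.add(s if first not in s else taxa - s)
            return s
        return frozenset([t])

    leaves(tree)
    return out

def robinson_foulds(t1, t2, taxa):
    """Symmetric-difference count of the two bipartition tables
    (some authors report half this value)."""
    return len(splits(t1, taxa) ^ splits(t2, taxa))

taxa = "ABCDE"
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2, taxa))   # -> 2
```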