Showing papers in "Journal of Computational Biology in 2002"


Journal ArticleDOI
TL;DR: This paper reviews formalisms that have been employed in mathematical biology and bioinformatics to describe genetic regulatory systems, in particular directed graphs, Bayesian networks, Boolean networks and their generalizations, ordinary and partial differential equations, qualitative differential equations, stochastic equations, and rule-based formalisms.
Abstract: The spatiotemporal expression of genes in an organism is determined by regulatory systems that involve a large number of genes connected through a complex network of interactions. As an intuitive understanding of the behavior of these systems is hard to obtain, computer tools for the modeling and simulation of genetic regulatory networks will be indispensable. This report reviews formalisms that have been employed in mathematical biology and bioinformatics to describe genetic regulatory systems, in particular directed graphs, Bayesian networks, ordinary and partial differential equations, stochastic equations, Boolean networks and their generalizations, qualitative differential equations, and rule-based formalisms. In addition, the report discusses how these formalisms have been used in the modeling and simulation of regulatory systems.
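
The Boolean network formalism mentioned above is simple enough to illustrate concretely. Below is a minimal sketch of a synchronous Boolean network simulator; the three-gene network and its update rules are invented for illustration and are not taken from the review.

```python
def step(state, rules):
    """Apply every update rule to the current state simultaneously."""
    return tuple(rule(state) for rule in rules)

# Hypothetical three-gene network (genes indexed 0, 1, 2):
rules = [
    lambda s: s[2],               # gene 0 is activated by gene 2
    lambda s: s[0] and not s[2],  # gene 1 needs gene 0, is repressed by gene 2
    lambda s: not s[1],           # gene 2 is repressed by gene 1
]

# Iterate from an initial state until a state repeats; because the state
# space is finite, every trajectory ends in a fixed point or a cycle
# (an "attractor"), which is the behavior such models are used to study.
state, seen = (1, 0, 0), []
while state not in seen:
    seen.append(state)
    state = step(state, rules)
print("trajectory:", seen, "-> re-enters", state)
```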

2,739 citations


Journal ArticleDOI
TL;DR: A novel motif-discovery algorithm, PROJECTION, is introduced, designed to enhance the performance of existing motif finders using random projections of the input's substrings, and is robust to nonuniform background sequence distributions and scales to larger amounts of sequence than that specified in the original challenge.
Abstract: The DNA motif discovery problem abstracts the task of discovering short, conserved sites in genomic DNA. Pevzner and Sze recently described a precise combinatorial formulation of motif discovery that motivates the following algorithmic challenge: find twenty planted occurrences of a motif of length fifteen in roughly twelve kilobases of genomic sequence, where each occurrence of the motif differs from its consensus in four randomly chosen positions. Such "subtle" motifs, though statistically highly significant, expose a weakness in existing motif-finding algorithms, which typically fail to discover them. Pevzner and Sze introduced new algorithms to solve their (15,4)-motif challenge, but these methods do not scale efficiently to more difficult problems in the same family, such as the (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif-discovery algorithm, PROJECTION, designed to enhance the performance of existing motif finders using random projections of the input's substrings. Experiments on synthetic data demonstrate that PROJECTION remedies the weakness observed in existing algorithms, typically solving the difficult (14,4)-, (16,5)-, and (18,6)-motif problems. Our algorithm is robust to nonuniform background sequence distributions and scales to larger amounts of sequence than that specified in the original challenge. A probabilistic estimate suggests that related motif-finding problems that PROJECTION fails to solve are in all likelihood inherently intractable. We also test the performance of our algorithm on realistic biological examples, including transcription factor binding sites in eukaryotes and ribosome binding sites in prokaryotes.
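
The core trick of PROJECTION, hashing each l-mer by a random subset of its positions so that occurrences of the planted motif tend to collide, can be sketched in a few lines. This is a minimal sketch under toy assumptions (one projection round, toy parameters, no EM refinement of enriched buckets), not the authors' implementation.

```python
import random
from collections import defaultdict

def projection_buckets(sequences, l, k, seed=0):
    """Hash every l-mer by the letters at k randomly chosen positions.
    l-mers arising from occurrences of the same motif agree in most
    positions, so they tend to land in the same bucket."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for si, seq in enumerate(sequences):
        for off in range(len(seq) - l + 1):
            lmer = seq[off:off + l]
            buckets["".join(lmer[p] for p in positions)].append((si, off))
    return buckets

# In the full algorithm, buckets whose size exceeds a threshold seed an EM
# refinement step and many projection rounds are run; here we just report
# the largest bucket of a single round on toy sequences.
seqs = ["ACGTACGTTTGGCCAAGTT", "TTACGTACGAAACGTACGT"]
best = max(projection_buckets(seqs, l=8, k=4).values(), key=len)
print(best)
```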

517 citations


Journal ArticleDOI
TL;DR: A greedy approach to minimum evolution which produces a starting topology in O(n^2) time and yields a very significant improvement over NJ and other distance-based algorithms, especially with large trees, in terms of topological accuracy.
Abstract: The Minimum Evolution (ME) approach to phylogeny estimation has been shown to be statistically consistent when it is used in conjunction with ordinary least-squares (OLS) fitting of a metric to a tree structure. The traditional approach to using ME has been to start with the Neighbor Joining (NJ) topology for a given matrix and then do a topological search from that starting point. The first stage requires O(n^3) time, where n is the number of taxa, while the current implementations of the second are in O(p n^3) or more, where p is the number of swaps performed by the program. In this paper, we examine a greedy approach to minimum evolution which produces a starting topology in O(n^2) time. Moreover, we provide an algorithm that searches for the best topology using nearest neighbor interchanges (NNIs), where the cost of doing p NNIs is O(n^2 + p n), i.e., O(n^2) in practice because p is always much smaller than n. The Greedy Minimum Evolution (GME) algorithm, when used in combination with NNIs, produces trees which are fairly close to NJ trees in terms of topological accuracy. We also examine ME under a balanced weighting scheme, where sibling subtrees have equal weight, as opposed to the standard "unweighted" OLS, where all taxa have the same weight so that the weight of a subtree is equal to the number of its taxa. The balanced minimum evolution scheme (BME) runs slower than the OLS version, requiring O(n^2 x diam(T)) operations to build the starting tree and O(p n x diam(T)) to perform the NNIs, where diam(T) is the topological diameter of the output tree. Under the usual Yule-Harding distribution on phylogenetic trees, the expected diameter is in O(log n), so our algorithms are in practice faster than NJ. Moreover, this BME scheme yields a very significant improvement over NJ and other distance-based algorithms, especially with large trees, in terms of topological accuracy.

444 citations


Journal ArticleDOI
TL;DR: Two modifications of the original Gibbs sampling algorithm for motif finding are presented: the use of a probability distribution to estimate the number of copies of the motif in a sequence and the technical aspects of the incorporation of a higher-order background model.
Abstract: Microarray experiments can reveal important information about transcriptional regulation. In our case, we look for potential promoter regulatory elements in the upstream region of coexpressed genes. Here we present two modifications of the original Gibbs sampling algorithm for motif finding (Lawrence et al., 1993). First, we introduce the use of a probability distribution to estimate the number of copies of the motif in a sequence. Second, we describe the technical aspects of the incorporation of a higher-order background model whose application we discussed in Thijs et al. (2001). Our implementation is referred to as the Motif Sampler. We successfully validate our algorithm on several data sets. First, we show results for three sets of upstream sequences containing known motifs: 1) the G-box light-response element in plants, 2) elements involved in methionine response in Saccharomyces cerevisiae, and 3) the FNR O2-responsive element in bacteria. We use these data sets to explain the influence of the parameters on the performance of our algorithm. Second, we show results for upstream sequences from four clusters of coexpressed genes identified in a microarray experiment on wounding in Arabidopsis thaliana. Several motifs could be matched to regulatory elements from plant defence pathways in our database of plant cis-acting regulatory elements (PlantCARE). Some other strong motifs do not have corresponding motifs in PlantCARE but are promising candidates for further analysis.
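
For orientation, here is a bare-bones Gibbs motif sampler in the spirit of Lawrence et al. (1993), the starting point of the paper. It assumes exactly one motif copy per sequence and a uniform zero-order background, i.e., it includes neither of the two extensions (copy-number distribution, higher-order background model) that the Motif Sampler adds.

```python
import random

def gibbs_motif(seqs, w, iters=500, seed=1):
    """Sample one motif start position of width w per sequence."""
    rng = random.Random(seed)
    alpha = "ACGT"
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        i = rng.randrange(len(seqs))                          # hold sequence i out
        counts = [{a: 1.0 for a in alpha} for _ in range(w)]  # pseudocounts
        for j, s in enumerate(seqs):
            if j != i:
                for c, ch in enumerate(s[pos[j]:pos[j] + w]):
                    counts[c][ch] += 1
        total = len(seqs) - 1 + len(alpha)
        weights = []                                          # score each offset
        for off in range(len(seqs[i]) - w + 1):
            p = 1.0
            for c, ch in enumerate(seqs[i][off:off + w]):
                p *= (counts[c][ch] / total) / 0.25           # motif vs background
            weights.append(p)
        pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

print(gibbs_motif(["TTATGCATCC", "GGTATGCATT", "CTTATGCAGA"], w=6))
```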

394 citations


Journal ArticleDOI
TL;DR: The prediction paradigm will serve as a good framework for comparing different prediction methods and may accelerate the development of molecular classifiers that are clinically useful.
Abstract: We propose a general framework for prediction of predefined tumor classes using gene expression profiles from microarray experiments. The framework consists of 1) evaluating the appropriateness of class prediction for the given data set, 2) selecting the prediction method, 3) performing cross-validated class prediction, and 4) assessing the significance of prediction results by permutation testing. We describe an application of the prediction paradigm to gene expression profiles from human breast cancers, with specimens classified as positive or negative for BRCA1 mutations and also for BRCA2 mutations. In both cases, the accuracy of class prediction was statistically significant when compared to the accuracy of prediction expected by chance. The framework proposed here for the application of class prediction is designed to reduce the occurrence of spurious findings, a legitimate concern for high-dimensional microarray data. The prediction paradigm will serve as a good framework for comparing different prediction methods and may accelerate the development of molecular classifiers that are clinically useful.
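
Step 4 of the framework (assessing significance by permutation testing) is easy to sketch. The following is a generic illustration with scikit-learn and a synthetic data set, using 3-nearest-neighbors as a stand-in for whatever prediction method step 2 selects; it is not the authors' code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def permutation_pvalue(X, y, n_perm=200, seed=0):
    """Compare cross-validated accuracy on the true labels against the
    null distribution of accuracies obtained after permuting the labels."""
    rng = np.random.default_rng(seed)
    clf = KNeighborsClassifier(n_neighbors=3)
    real = cross_val_score(clf, X, y, cv=5).mean()
    null = [cross_val_score(clf, X, rng.permutation(y), cv=5).mean()
            for _ in range(n_perm)]
    pval = (1 + sum(a >= real for a in null)) / (1 + n_perm)
    return real, pval

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.standard_normal((40, 10)) + y[:, None]   # class-shifted toy features
print(permutation_pvalue(X, y))
```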

324 citations


Journal ArticleDOI
TL;DR: This work considers the problem of inferring gene functional classifications from a heterogeneous data set consisting of DNA microarray expression measurements and phylogenetic profiles from whole-genome sequence comparisons and proposes an SVM kernel function that is explicitly heterogeneous.
Abstract: In our attempts to understand cellular function at the molecular level, we must be able to synthesize information from disparate types of genomic data. We consider the problem of inferring gene functional classifications from a heterogeneous data set consisting of DNA microarray expression measurements and phylogenetic profiles from whole-genome sequence comparisons. We demonstrate the application of the support vector machine (SVM) learning algorithm to this functional inference task. Our results suggest the importance of exploiting prior information about the heterogeneity of the data. In particular, we propose an SVM kernel function that is explicitly heterogeneous. In addition, we describe feature scaling methods for further exploiting prior knowledge of heterogeneity by giving each data type different weights.
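
The "explicitly heterogeneous" kernel idea can be illustrated with the standard fact that a weighted sum of valid kernels is again a valid kernel: one base kernel per data type, with the weights playing the role of feature scaling. The Gaussian base kernels, widths, and weights below are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def rbf_gram(X, gamma):
    """Gaussian (RBF) Gram matrix for one data type."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def heterogeneous_kernel(X_expr, X_phylo, w_expr=1.0, w_phylo=1.0):
    """Weighted sum of per-data-type kernels: expression measurements and
    phylogenetic profiles each get their own kernel and weight."""
    return w_expr * rbf_gram(X_expr, 0.1) + w_phylo * rbf_gram(X_phylo, 0.1)

rng = np.random.default_rng(0)
K = heterogeneous_kernel(rng.standard_normal((6, 80)),   # expression data
                         rng.standard_normal((6, 25)))   # phylogenetic profiles
print(K.shape)  # the Gram matrix can be fed to an SVM with kernel="precomputed"
```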

259 citations


Journal ArticleDOI
TL;DR: The notion of edit distance is proposed to measure the similarity between two RNA secondary and tertiary structures, by incorporating various edit operations performed on both bases and arcs (i.e., base-pairs).
Abstract: Arc-annotated sequences are useful in representing the structural information of RNA sequences. In general, RNA secondary and tertiary structures can be represented as a set of nested arcs and a set of crossing arcs, respectively. Since RNA functions are largely determined by molecular conformation and therefore secondary and tertiary structures, the comparison between RNA secondary and tertiary structures has received much attention recently. In this paper, we propose the notion of edit distance to measure the similarity between two RNA secondary and tertiary structures, by incorporating various edit operations performed on both bases and arcs (i.e., base-pairs). Several algorithms are presented to compute the edit distance between two RNA sequences with various arc structures and under various score schemes, either exactly or approximately, with provably good performance. Preliminary experimental tests confirm that our definition of edit distance and the computation model are among the most reasonable ones ever studied in the literature.

218 citations


Journal ArticleDOI
TL;DR: A new motif-finding problem, the Substring Parsimony Problem, is introduced, which is a formalization of the ideas behind phylogenetic footprinting, and an exact dynamic programming algorithm to solve it is presented.
Abstract: Phylogenetic footprinting is a technique that identifies regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species. We introduce a new motif-finding problem, the Substring Parsimony Problem, which is a formalization of the ideas behind phylogenetic footprinting, and we present an exact dynamic programming algorithm to solve it. We then present a number of algorithmic optimizations that allow our program to run quickly on most biologically interesting datasets. We show how to handle data sets in which only an unknown subset of the sequences contains the regulatory element. Finally, we describe how to empirically assess the statistical significance of the motifs found. Each technique is implemented and successfully identifies a number of known binding sites, as well as several highly conserved but uncharacterized regions. The program is available at http://bio.cs.washington.edu/software.html.
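
The Substring Parsimony Problem itself fits in a few lines if one is willing to brute-force it. The sketch below scores every candidate motif of length k against the best-matching k-mer in each sequence and ignores the phylogeny (effectively a star tree), whereas the paper's dynamic program handles an arbitrary species tree efficiently; it is meant only to make the objective concrete.

```python
from itertools import product

def substring_parsimony_star(leaves, k):
    """Brute force over all 4^k motifs: minimize the summed Hamming
    distance to the closest k-mer of each leaf sequence (star phylogeny)."""
    def best_dist(motif, seq):
        return min(sum(a != b for a, b in zip(motif, seq[i:i + k]))
                   for i in range(len(seq) - k + 1))
    return min(("".join(m) for m in product("ACGT", repeat=k)),
               key=lambda mot: sum(best_dist(mot, s) for s in leaves))

# Three toy "orthologous noncoding" sequences sharing a conserved ACGT:
print(substring_parsimony_star(["AACGTT", "TACGTA", "CACGTC"], k=4))
```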

182 citations


Journal ArticleDOI
TL;DR: A framework for studying protein folding pathways and potential landscapes which is based on techniques recently developed in the robotics motion planning community, and appears to differentiate situations in which secondary structure clearly forms first and those in which the tertiary structure is obtained more directly.
Abstract: We present a framework for studying protein folding pathways and potential landscapes which is based on techniques recently developed in the robotics motion planning community. Our focus in this work is to study the protein folding mechanism assuming we know the native fold. That is, instead of performing fold prediction, we aim to study issues related to the folding process, such as the formation of secondary and tertiary structure, and the dependence of the folding pathway on the initial denatured conformation. Our work uses probabilistic roadmap (PRM) motion planning techniques which have proven successful for problems involving high-dimensional configuration spaces. A strength of these methods is their efficiency in rapidly covering the planning space without becoming trapped in local minima. We have applied our PRM technique to several small proteins (~60 residues) and validated the pathways computed by comparing the secondary structure formation order on our paths to known hydrogen exchange experimental results. An advantage of the PRM framework over other simulation methods is that it enables one to easily and efficiently compute folding pathways from any denatured starting state to the (known) native fold. This aspect makes our approach ideal for studying global properties of the protein's potential landscape, most of which are difficult to simulate and study with other methods. For example, in the proteins we study, the folding pathways starting from different denatured states sometimes share common portions when they are close to the native fold, and moreover, the formation order of the secondary structure appears largely independent of the starting denatured conformation. Another feature of our technique is that the distribution of the sampled conformations is correlated with the formation of secondary structure and, in particular, appears to differentiate situations in which secondary structure clearly forms first and those in which the tertiary structure is obtained more directly. Overall, our results applying PRM techniques are very encouraging and indicate the promise of our approach for studying proteins for which experimental results are not available.
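
The PRM machinery the authors build on is generic and compact: sample configurations, connect nearest neighbors, and search the roadmap. The sketch below works in an abstract space defined by the caller's sample and dist functions; everything protein-specific (conformation encoding, energy-weighted edge feasibility) is deliberately abstracted away, and the demo uses a 2-D toy space.

```python
import heapq
import math
import random

def prm_path(sample, dist, start, goal, n_nodes=200, k=8, seed=0):
    """Probabilistic roadmap: sample nodes, link each to its k nearest
    neighbors, then run Dijkstra from start to goal over the roadmap."""
    rng = random.Random(seed)
    nodes = [sample(rng) for _ in range(n_nodes)] + [start, goal]
    edges = {i: [] for i in range(len(nodes))}
    for i, a in enumerate(nodes):
        for j in sorted(range(len(nodes)),
                        key=lambda j: dist(a, nodes[j]))[1:k + 1]:
            d = dist(a, nodes[j])
            edges[i].append((j, d))
            edges[j].append((i, d))
    s, g = len(nodes) - 2, len(nodes) - 1
    best, prev, pq = {s: 0.0}, {}, [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == g:
            break
        if d > best.get(u, math.inf):
            continue
        for v, w in edges[u]:
            if d + w < best.get(v, math.inf):
                best[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    if g not in prev:
        return None
    path = [g]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return [nodes[i] for i in reversed(path)]

path = prm_path(lambda rng: (rng.random(), rng.random()), math.dist,
                start=(0.0, 0.0), goal=(1.0, 1.0))
print(path if path is None else f"{len(path)} waypoints")
```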

172 citations


Journal ArticleDOI
TL;DR: A model-based clustering toolbox is applied to gene-expression clustering based on cDNA microarrays using real data; results include error tables and graphs, confusion matrices, principal-component plots, and validation measures.
Abstract: There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test t...
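
The evaluation loop described here (generate points from class models, cluster, count misassignments) can be sketched as follows; k-means stands in for the various algorithms the toolbox compares, and the Gaussian noise model and parameters are illustrative.

```python
from itertools import permutations

import numpy as np
from sklearn.cluster import KMeans

def clustering_error(means, sigma, n_per_class=50, seed=0):
    """Generate each class as its mean plus Gaussian noise, cluster, and
    count points clustered inconsistently with the generating class
    (minimized over relabelings of the clusters)."""
    rng = np.random.default_rng(seed)
    k, d = means.shape
    X = np.vstack([m + sigma * rng.standard_normal((n_per_class, d))
                   for m in means])
    truth = np.repeat(np.arange(k), n_per_class)
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    return min(int((np.asarray(p)[truth] != labels).sum())
               for p in permutations(range(k)))

# Error should grow with the process variance sigma**2:
for sigma in (0.5, 1.0, 2.0):
    print(sigma, clustering_error(np.array([[0., 0.], [3., 3.]]), sigma))
```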

167 citations


Journal ArticleDOI
TL;DR: The SVM algorithm can predict up to 96% of the molecules correctly, averaging 81.5% over 30 test sets, which comprised equal numbers of CNS positive and negative molecules.
Abstract: Two different machine-learning algorithms have been used to predict the blood-brain barrier permeability of different classes of molecules, to develop a method to predict the ability of drug compounds to penetrate the CNS. The first algorithm is based on a multilayer perceptron neural network and the second algorithm uses a support vector machine. Both algorithms are trained on an identical data set consisting of 179 CNS active molecules and 145 CNS inactive molecules. The training parameters include molecular weight, lipophilicity, hydrogen bonding, and other variables that govern the ability of a molecule to diffuse through a membrane. The results show that the support vector machine outperforms the neural network. Based on over 30 different validation sets, the SVM can predict up to 96% of the molecules correctly, averaging 81.5% over 30 test sets, which comprised equal numbers of CNS positive and negative molecules. This is quite favorable when compared with the neural network's average performance of 75.7% with the same 30 test sets. The results of the SVM algorithm are very encouraging and suggest that a classification tool like this one will prove to be a valuable prediction approach.
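
The evaluation protocol (identical training data, repeated random test sets, mean accuracy over 30 splits) translates directly into scikit-learn. The sketch below uses random placeholder descriptors and labels of the same shape as the study's data (324 molecules, descriptors such as molecular weight and lipophilicity), so its printed accuracies are meaningless; only the comparison scaffold mirrors the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((324, 6))      # placeholder molecular descriptors
y = rng.integers(0, 2, 324)            # placeholder CNS +/- labels

accs = {"svm": [], "mlp": []}
for seed in range(30):                 # 30 random test sets, as in the paper
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    scaler = StandardScaler().fit(Xtr)
    Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)
    accs["svm"].append(SVC().fit(Xtr, ytr).score(Xte, yte))
    mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                        random_state=seed)
    accs["mlp"].append(mlp.fit(Xtr, ytr).score(Xte, yte))
print({name: round(float(np.mean(a)), 3) for name, a in accs.items()})
```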

Journal ArticleDOI
TL;DR: It is argued that this preprocessing leads to estimates of the expression that have a much larger variance than needed when the expression levels are low.
Abstract: Most microarray scanning software for glass spotted arrays provides estimates for the intensity for the "foreground" and "background" of two channels for every spot. The common approach in further analyzing such data is to first subtract the background from the foreground for each channel and to use the ratio of these two results as the estimate of the expression level. The resulting ratios are, after possible averaging over replicates, the usual inputs for further data analysis, such as clustering. If, with this background correction procedure, the foreground intensity was smaller than the background intensity for a channel, that spot (on that array) yields no usable data. In this paper it is argued that this preprocessing leads to estimates of the expression that have a much larger variance than needed when the expression levels are low.
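
The paper's point is easy to reproduce in simulation: when the true signal is small relative to the background, the ratio of background-subtracted intensities has a huge variance (its denominator is frequently near zero), while the unsubtracted ratio is far more stable, at the price of bias toward one. All numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, bg, signal, sd = 100_000, 100.0, 10.0, 5.0   # weak signal, strong background

fg_r = bg + 2.0 * signal + rng.normal(0, sd, n)  # red foreground, true ratio 2
fg_g = bg + signal + rng.normal(0, sd, n)        # green foreground
bg_r = bg + rng.normal(0, sd, n)                 # estimated local backgrounds
bg_g = bg + rng.normal(0, sd, n)

num, den = fg_r - bg_r, fg_g - bg_g
usable = den > 1.0            # spots the standard pipeline does not discard
print("background-subtracted ratio variance:",
      (num[usable] / den[usable]).var())
print("unsubtracted ratio variance:         ", (fg_r / fg_g).var())
```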

Journal ArticleDOI
TL;DR: By analyzing known regulatory relationships, an edge detection function was designed which identified candidate regulations with greater fidelity than standard correlation methods and developed general methods for integrated analysis of coarse time-series data sets.
Abstract: We address possible limitations of publicly available data sets of yeast gene expression. We study the predictability of known regulators via time-series analysis, and show that less than 20% of known regulatory pairs exhibit strong correlations in the Cho/Spellman data sets. By analyzing known regulatory relationships, we designed an edge detection function which identified candidate regulations with greater fidelity than standard correlation methods. We develop general methods for integrated analysis of coarse time-series data sets. These include 1) methods for automated period detection in a predominately cycling data set and 2) phase detection between phase-shifted cyclic data sets. We show how to properly correct for the problem of comparing correlation coefficients between pairs of sequences of different lengths and small alphabets. Finally, we note that the correlation coefficient of sequences over alphabets of size two can exhibit very counterintuitive behavior when compared with the Hamming distance.
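
The abstract does not spell out the edge detection function, but the flavor of the idea (detecting a regulator's influence in a target's expression changes rather than in raw level-to-level correlation) can be sketched with a derivative-based score; the exact functional form below is an assumption for illustration, not the paper's.

```python
import numpy as np

def edge_score(reg, tgt):
    """Correlate the regulator's level with the target's discrete
    derivative, so activation registers when high regulator levels
    precede rises in the target."""
    r, d = reg[:-1], np.diff(tgt)
    r = (r - r.mean()) / (r.std() + 1e-12)
    d = (d - d.mean()) / (d.std() + 1e-12)
    return float((r * d).mean())

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 40)
reg = np.sin(t)                                       # cycling regulator
tgt = 0.1 * np.cumsum(np.maximum(reg, 0)) + rng.normal(0, 0.05, 40)
print("edge score:", round(edge_score(reg, tgt), 2),
      "plain correlation:", round(float(np.corrcoef(reg, tgt)[0, 1]), 2))
```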

Journal ArticleDOI
TL;DR: A new algorithm that uses Structural Expectation Maximization (EM) for learning maximum likelihood phylogenetic trees, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.
Abstract: A central task in the study of molecular evolution is the reconstruction of a phylogenetic tree from sequences of current-day taxa. The most established approach to tree reconstruction is maximum likelihood (ML) analysis. Unfortunately, searching for the maximum likelihood phylogenetic tree is computationally prohibitive for large data sets. In this paper, we describe a new algorithm that uses Structural Expectation Maximization (EM) for learning maximum likelihood phylogenetic trees. This algorithm is similar to the standard EM method for edge-length estimation, except that during iterations of the Structural EM algorithm the topology is improved as well as the edge length. Our algorithm performs iterations of two steps. In the E-step, we use the current tree topology and edge lengths to compute expected sufficient statistics, which summarize the data. In the M-Step, we search for a topology that maximizes the likelihood with respect to these expected sufficient statistics. We show that searching for better topologies inside the M-step can be done efficiently, as opposed to standard methods for topology search. We prove that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge. This convergence point, however, can be a suboptimal one. To escape from such "local optima," we further enhance our basic EM procedure by incorporating moves in the flavor of simulated annealing. We evaluate these new algorithms on both synthetic and real sequence data and show that for protein sequences even our basic algorithm finds more plausible trees than existing methods for searching maximum likelihood phylogenies. Furthermore, our algorithms are dramatically faster than such methods, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.

Journal ArticleDOI
TL;DR: This work presents a class of mathematical models that help in understanding the connections between transcription factors and functional classes of genes based on genetic and genomic data and introduces a new search method that rapidly learns a model according to a Bayesian score.
Abstract: The recent growth in genomic data and measurements of genome-wide expression patterns allows us to apply computational tools to examine gene regulation by transcription factors. In this work, we present a class of mathematical models that help in understanding the connections between transcription factors and functional classes of genes based on genetic and genomic data. Such a model represents the joint distribution of transcription factor binding sites and of expression levels of a gene in a unified probabilistic model. Learning a combined probability model of binding sites and expression patterns enables us to improve the clustering of the genes based on the discovery of putative binding sites and to detect which binding sites and experiments best characterize a cluster. To learn such models from data, we introduce a new search method that rapidly learns a model according to a Bayesian score. We evaluate our method on synthetic data as well as on real life data and analyze the biological insights it provides. Finally, we demonstrate the applicability of the method to other data analysis problems in gene expression data.

Journal ArticleDOI
TL;DR: This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry.
Abstract: For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often give very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear classifiers...
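
A minimal version of the spreading idea, under the assumption of an isotropic Gaussian spreading distribution and a linear SVM as the designed classifier (the paper treats linear classifiers), looks like this; the data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

def spread_accuracy(X, y, spread_sd, n_copies=20, seed=0):
    """Replace each sample point by noisy copies drawn from a Gaussian
    centered on it (preserving sample geometry), train a linear
    classifier on the spread sample, and report its accuracy there."""
    rng = np.random.default_rng(seed)
    Xs = np.repeat(X, n_copies, axis=0)
    Xs = Xs + spread_sd * rng.standard_normal(Xs.shape)
    ys = np.repeat(y, n_copies)
    clf = SVC(kernel="linear").fit(Xs, ys)
    return clf.score(Xs, ys)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(2, 1, (20, 5))])
y = np.repeat([0, 1], 20)
for sd in (0.5, 1.0, 2.0):   # stronger feature sets stay accurate longer
    print(sd, round(spread_accuracy(X, y, sd), 3))
```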

Journal ArticleDOI
TL;DR: The aim of this paper is to propose a methodological framework for studies that investigate differential gene expression through microarray technology that is based on a fully Bayesian mixture approach (Richardson and Green, 1997).
Abstract: Recent developments in microarray technology enable researchers to study simultaneously the expression of thousands of genes from one cell line or tissue sample. This new technology is often used to assess changes in mRNA expression upon a specified transfection for a cell line in order to identify target genes. For such experiments, the range of differential expression is moderate, and teasing out the modified genes is challenging and calls for detailed modeling. The aim of this paper is to propose a methodological framework for studies that investigate differential gene expression through microarray technology that is based on a fully Bayesian mixture approach (Richardson and Green, 1997). A case study that investigated those genes that were differentially expressed in two cell lines (normal and modified by a gene transfection) is provided to illustrate the performance and usefulness of this approach.
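
As a rough, non-Bayesian stand-in for the mixture idea, one can fit a two-component Gaussian mixture to per-gene log-ratios and read differential expression off the posterior component memberships. The paper instead fits a fully Bayesian mixture with a variable number of components (Richardson and Green, 1997); the EM fit below, on invented data, only illustrates the modeling intuition.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
logratio = np.concatenate([rng.normal(0.0, 0.2, 900),    # unchanged genes
                           rng.normal(1.2, 0.4, 100)])   # induced genes

gm = GaussianMixture(n_components=2, random_state=0).fit(logratio[:, None])
post = gm.predict_proba(logratio[:, None])
induced = int(np.argmax(gm.means_.ravel()))    # component with larger mean
print("genes called induced:", int((post[:, induced] > 0.5).sum()))
```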

Journal ArticleDOI
TL;DR: The generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs, is introduced, and it is shown how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene finding, with applications to DNA-cDNA and DNA-protein alignment.
Abstract: Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene finding and annotation. Alignment problems can be solved with pair HMMs, while gene finding programs rely on generalized HMMs in order to model exon lengths. In this paper, we introduce the generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs. We show how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene finding and describe applications to DNA-cDNA and DNA-protein alignment. GPHMMs provide a unifying and probabilistically sound theory for modeling these problems.

Journal ArticleDOI
TL;DR: An algorithm for detecting motifs in protein sequences based on methods from Data Mining and Knowledge Discovery and implemented a program called GYM, which provides a lot of useful information about a given protein sequence.
Abstract: We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein sequences. The dictionary is subsequently used to detect motifs in new protein sequences. Statistical significance of the detection results is ensured by statistically determining the various parameters of the algorithm. Based on this approach, we have implemented a program called GYM. The Helix-Turn-Helix motif was used as a model system on which to test our program. The program was also extended to detect Homeodomain motifs. The detection results for the two motifs compare favorably with existing programs. In addition, the GYM program provides a lot of useful information about a given protein sequence.

Journal ArticleDOI
TL;DR: This work describes an approach that allows conformational flexibility for the side chains while keeping the protein backbone rigid, and proposes a fast heuristic approach and an exact method that uses branch-and-cut techniques to solve this problem.
Abstract: Rigid-body docking approaches are not sufficient to predict the structure of a protein complex from the unbound (native) structures of the two proteins. Accounting for side chain flexibility is an important step towards fully flexible protein docking. This work describes an approach that allows conformational flexibility for the side chains while keeping the protein backbone rigid. Starting from candidates created by a rigid-docking algorithm, we demangle the side chains of the docking site, thus creating reasonable approximations of the true complex structure. These structures are ranked with respect to the binding free energy. We present two new techniques for side chain demangling. Both approaches are based on a discrete representation of the side chain conformational space by the use of a rotamer library. This leads to a combinatorial optimization problem. For the solution of this problem, we propose a fast heuristic approach and an exact, albeit slower, method that uses branch-and-cut techniques. As a test set, we use the unbound structures of three proteases and the corresponding protein inhibitors. For each of the examples, the highest-ranking conformation produced was a good approximation of the true complex structure.
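
Once side chains are discretized by a rotamer library, the demangling step becomes the combinatorial problem below: pick one rotamer per interface residue minimizing self plus pairwise energies. This exhaustive solver, with made-up energies, only states the objective that the paper's heuristic and branch-and-cut methods solve at scale.

```python
from itertools import product

def best_rotamers(self_E, pair_E):
    """Exhaustively choose one rotamer per residue minimizing the sum of
    self energies and sparse pairwise interaction energies.
    self_E[i][r]: energy of rotamer r at residue i;
    pair_E[(i, j, r, s)]: interaction of rotamers r at i and s at j."""
    n = len(self_E)
    def energy(assign):
        e = sum(self_E[i][r] for i, r in enumerate(assign))
        e += sum(pair_E.get((i, j, assign[i], assign[j]), 0.0)
                 for i in range(n) for j in range(i + 1, n))
        return e
    return min(product(*[range(len(r)) for r in self_E]), key=energy)

self_E = [[0.0, 1.0], [0.5, 0.0], [0.2, 0.2]]     # 3 residues, 2 rotamers each
pair_E = {(0, 1, 0, 0): 2.0, (1, 2, 1, 0): -1.0}  # a clash and a good contact
print(best_rotamers(self_E, pair_E))              # -> (0, 1, 0)
```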

Journal ArticleDOI
TL;DR: For the development of the analysis software CAROl at Freie Universität Berlin, the two problems of identifying protein spots and computing a matching between two images have been reconsidered, yielding new solutions that rely on methods from computational geometry.
Abstract: In proteomics, two-dimensional gel electrophoresis (2-DE) is a separation technique for proteins. The resulting protein spots can be identified either by using picking robots and subsequent mass spectrometry or by visual cross inspection of a new gel image with an already analyzed master gel. Difficulties especially arise from inherent noise and irregular geometric distortions in 2-DE images. Aiming at the automated analysis of large series of 2-DE images, or at the even more difficult interlaboratory gel comparisons, the bottleneck is to solve the two most basic algorithmic problems with high quality: Identifying protein spots and computing a matching between two images. For the development of the analysis software CAROl at Freie Universität Berlin, we have reconsidered these two problems and obtained new solutions which rely on methods from computational geometry. Their novelties are: 1. Spot detection is also possible for complex regions formed by several "merged" (usually saturated) spots; 2. User-defined landmarks are not necessary for the matching. Furthermore, images for comparison are allowed to represent different parts of the entire protein pattern, which only partially "overlap." The implementation is done in a client-server architecture to allow queries via the internet. We also discuss and point out related theoretical questions in computational geometry.

Journal ArticleDOI
TL;DR: The approach proves to be general enough to compute the average order of a secondary structure together with all the r-th moments and to enumerate substructures such as hairpins or bulges in dependence on the order of the secondary structures considered.
Abstract: The secondary structure of an RNA molecule is of great importance and possesses influence, e.g., on the interaction of tRNA molecules with proteins or on the stabilization of mRNA molecules. The classification of secondary structures by means of their order proved useful with respect to numerous applications. In 1978, Waterman, who gave the first precise formal framework for the topic, suggested to determine the number a_{n,p} of secondary structures of size n and given order p. Since then, no satisfactory result has been found. Based on an observation due to Viennot et al., we will derive generating functions for the secondary structures of order p from generating functions for binary tree structures with Horton-Strahler number p. These generating functions enable us to compute a precise asymptotic equivalent for a_{n,p}. Furthermore, we will determine the related number of structures when the number of unpaired bases shows up as an additional parameter. Our approach proves to be general enough to compute the average order of a secondary structure together with all the r-th moments and to enumerate substructures such as hairpins or bulges in dependence on the order of the secondary structures considered.

Journal ArticleDOI
Miklós Csürös
TL;DR: This work presents a novel distance-based algorithm for evolutionary tree reconstruction that reconstructs the topology of a tree with n leaves in O(n^2) time using O(n) working space.
Abstract: We present a novel distance-based algorithm for evolutionary tree reconstruction. Our algorithm reconstructs the topology of a tree with n leaves in O(n^2) time using O(n) working space. In the general Markov model of evolution, the algorithm recovers the topology successfully with (1 - o(1)) probability from sequences with polynomial length in n. Moreover, for almost all trees, our algorithm achieves the same success probability on polylogarithmic sample sizes. The theoretical results are supported by simulation experiments involving trees with 500, 1,895, and 3,135 leaves. The topologies of the trees are recovered with high success from 2,000 bp DNA sequences.

Journal ArticleDOI
TL;DR: This paper proposes approximations of the probability of occurrence in a given sequence of a structured motif, composed of two ordered parts separated by a variable distance and allowing for substitutions; the approximations are applied to evaluate candidate promoters in E. coli and B. subtilis.
Abstract: The problem of extracting from a set of nucleic acid sequences motifs which may have biological function is increasingly important. In this paper, we are interested in particular motifs that may be implicated in the transcription process. These motifs, called structured motifs, are composed of two ordered parts separated by a variable distance and allowing for substitutions. In order to assess their statistical significance, we propose approximations of the probability of occurrences of such a structured motif in a given sequence. An application of our method to evaluate candidate promoters in E. coli and B. subtilis is presented. Simulations confirm the accuracy of the approximations.
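
A structured motif of the kind studied here (two boxes, a variable spacer, substitutions allowed) is straightforward to scan for, which also makes clear what "number of occurrences" means in the statistics. The sigma-70-style boxes and spacer range in the demo are illustrative.

```python
def structured_motif_hits(seq, box1, box2, dmin, dmax, max_subs=1):
    """Find occurrences of box1 and box2 separated by dmin..dmax bases,
    each box tolerating up to max_subs substitutions."""
    def near(word, box):
        return sum(a != b for a, b in zip(word, box)) <= max_subs
    hits = []
    for i in range(len(seq) - len(box1) + 1):
        if near(seq[i:i + len(box1)], box1):
            for d in range(dmin, dmax + 1):
                j = i + len(box1) + d
                if j + len(box2) <= len(seq) and near(seq[j:j + len(box2)], box2):
                    hits.append((i, j))
    return hits

seq = "AA" + "TTGACA" + "GCGCGCGCGCGCGCGCG" + "TATAAT" + "GG"
print(structured_motif_hits(seq, "TTGACA", "TATAAT", dmin=15, dmax=18))
```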

Journal ArticleDOI
TL;DR: A method is presented that uses beta-strand interactions to predict the parallel right-handed beta-helix super-secondary structural motif in protein sequences and may generalize to other beta-structures for which strand topology and profiles of residue accessibility are well conserved.
Abstract: A method is presented that uses beta-strand interactions to predict the parallel right-handed beta-helix super-secondary structural motif in protein sequences. A program called BetaWrap implements this method and is shown to score known beta-helices above non-beta-helices in the Protein Data Bank in cross-validation. It is demonstrated that BetaWrap learns each of the seven known SCOP beta-helix families, when trained primarily on beta-structures that are not beta-helices, together with structural features of known beta-helices from outside the family. BetaWrap also predicts many bacterial proteins of unknown structure to be beta-helices; in particular, these proteins serve as virulence factors, adhesins, and toxins in bacterial pathogenesis and include cell surface proteins from Chlamydia and the intestinal bacterium Helicobacter pylori. The computational method used here may generalize to other beta-structures for which strand topology and profiles of residue accessibility are well conserved.

Journal ArticleDOI
TL;DR: Much of the effect from reusing membranes was attributable to increasing levels of background radiation and can be reduced by using membranes at most four times and by using the same exposure time for all experiments.
Abstract: Microarray experiments involve many steps, including spotting cDNA, extracting RNA, labeling targets, hybridizing, scanning, and analyzing images. Each step introduces variability, confounding our ability to obtain accurate estimates of the biological differences between samples. We ran repeated experiments using high-density cDNA microarray membranes (Research Genetics Human GeneFilters® Microarrays Version I) and 33P-labeled targets. Total RNA was extracted from a Burkitt lymphoma cell line (GA-10). We estimated the components of variation coming from: (1) image analysis, (2) exposure time to PhosphorImager® screens, (3) differences in membranes, (4) reuse of membranes, and (5) differences in targets prepared from two independent RNA extractions. Variation was assessed qualitatively using a clustering algorithm and quantitatively using a version of ANOVA adapted to multivariate microarray data. The largest contribution to variation came from reusing membranes, which contributed 38% of the total variation.

Journal ArticleDOI
TL;DR: This paper uses the method to hypothesize the series of duplication events that may have produced the ZNF45 family that appears on human chromosome 19, and provides algorithms to reconstruct a duplication model for a given data set.
Abstract: Zinc finger genes in mammalian genomes are frequently found to occur in clusters with cluster members appearing in a tandem array on the chromosome. It has been suggested that in situ gene duplication events are primarily responsible for the evolution of such clusters. The problem of inferring the series of duplication events responsible for producing clustered families is different from the standard phylogeny problem. In this paper, we study this inference problem using a graph called duplication model that captures the series of duplication events while taking into account the observed order of the genes on the chromosome. We provide algorithms to reconstruct a duplication model for a given data set. We use our method to hypothesize the series of duplication events that may have produced the ZNF45 family that appears on human chromosome 19.

Journal ArticleDOI
TL;DR: A new algorithm for protein classification and the detection of remote homologs that is an extension of the Smith-Waterman algorithm and computes a local alignment of a single sequence and a multiple alignment.
Abstract: We describe a new algorithm for protein classification and the detection of remote homologs. The rationale is to exploit both vertical and horizontal information of a multiple alignment in a well-balanced manner. This is in contrast to established methods such as profiles and profile hidden Markov models which focus on vertical information as they model the columns of the alignment independently and to family pairwise search which focuses on horizontal information as it treats given sequences separately. In our setting, we want to select from a given database of "candidate sequences" those proteins that belong to a given superfamily. In order to do so, each candidate sequence is separately tested against a multiple alignment of the known members of the superfamily by means of a new jumping alignment algorithm. This algorithm is an extension of the Smith-Waterman algorithm and computes a local alignment of a single sequence and a multiple alignment. In contrast to traditional methods, however, this alignment is not based on a summary of the individual columns of the multiple alignment. Rather, the candidate sequence is at each position aligned to one sequence of the multiple alignment, called the "reference sequence." In addition, the reference sequence may change within the alignment, while each such jump is penalized. To evaluate the discriminative quality of the jumping alignment algorithm, we compare it to profiles, profile hidden Markov models, and family pairwise search on a subset of the SCOP database of protein domains. The discriminative quality is assessed by median false positive counts (med-FP-counts). For moderate med-FP-counts, the number of successful searches with our method is considerably higher than with the competing methods.
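
For reference, the Smith-Waterman recursion that jumping alignment extends is shown below (score-only, linear gap costs, and a toy match/mismatch scheme instead of a substitution matrix). The jump machinery itself, i.e., aligning against a changing reference row of the multiple alignment with a jump penalty, is not reproduced.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # (mis)match
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```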

Journal ArticleDOI
TL;DR: A new algorithmic approach is presented which allows estimation of the more important of the Gumbel parameters at least five times faster than the traditional methods and brings significance estimation into the realm of interactive applications.
Abstract: In order to assess the significance of sequence alignments, it is crucial to know the distribution of alignment scores of pairs of random sequences. For gapped local alignment, it is empirically known that the shape of this distribution is of the Gumbel form. However, the determination of the parameters of this distribution is a computationally very expensive task. We present a new algorithmic approach which allows estimation of the more important of the Gumbel parameters at least five times faster than the traditional methods. Actual runtimes of our algorithm between less than a second and a few minutes on a workstation bring significance estimation into the realm of interactive applications.
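
As context for what is being estimated: given a sample of optimal local alignment scores of random sequence pairs, the Gumbel parameters can be recovered, e.g., by the method of moments, as below. This is the slow, traditional route (the expensive part is generating the scores); the paper's faster estimator is not reproduced here.

```python
import numpy as np

def fit_gumbel_moments(scores):
    """Method-of-moments Gumbel fit. For a Gumbel(mu, 1/lam) maximum
    distribution: variance = pi^2 / (6 lam^2), mean = mu + gamma / lam."""
    gamma = 0.5772156649015329          # Euler-Mascheroni constant
    lam = np.pi / np.sqrt(6.0 * np.var(scores))
    mu = np.mean(scores) - gamma / lam
    return lam, mu

# Sanity check on synthetic Gumbel data (in practice `scores` would be
# optimal gapped alignment scores of random sequence pairs):
rng = np.random.default_rng(0)
lam, mu = fit_gumbel_moments(rng.gumbel(loc=5.0, scale=2.0, size=5000))
print(round(1 / lam, 2), round(mu, 2))  # ~2.0 (scale) and ~5.0 (location)
```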

Journal ArticleDOI
TL;DR: An FFT algorithm that can compute the match score of a sequence against a position-specific scoring matrix (PSSM); the algorithm finds the PSSM score simultaneously over all offsets of the PSSM with the sequence, although, like all previous FFT algorithms, it still disallows gaps.
Abstract: Historically, in computational biology the fast Fourier transform (FFT) has been used almost exclusively to count the number of exact letter matches between two biosequences. This paper presents an FFT algorithm that can compute the match score of a sequence against a position-specific scoring matrix (PSSM). Our algorithm finds the PSSM score simultaneously over all offsets of the PSSM with the sequence, although like all previous FFT algorithms, it still disallows gaps. Although our algorithm is presented in the context of global matching, it can be adapted to local matching without gaps. As a benchmark, our PSSM-modified FFT algorithm computed pairwise match scores. In timing experiments, our most efficient FFT implementation for pairwise scoring appeared to be 10 to 26 times faster than a traditional FFT implementation, with only a factor of 2 in the acceleration attributable to a previously known compression scheme. Many important algorithms for detecting biosequence similarities, e.g., gapped BLAST o...
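
The core construction, computing PSSM scores at every offset via FFT-based correlation, can be sketched with NumPy: one correlation per alphabet letter between that letter's indicator sequence and the corresponding PSSM row, summed over letters. This is a from-scratch illustration of the idea, not the paper's optimized implementation (which also applies a compression scheme).

```python
import numpy as np

def pssm_scores_fft(seq, pssm, alphabet="ACGT"):
    """Gapless PSSM match scores at all full-overlap offsets.
    pssm has one row per alphabet letter, one column per motif position."""
    n, w = len(seq), pssm.shape[1]
    size = 1 << (n + w).bit_length()            # FFT length >= n + w - 1
    total = np.zeros(n - w + 1)
    for k, letter in enumerate(alphabet):
        ind = np.fromiter((c == letter for c in seq), float, n)
        F = np.fft.rfft(ind, size)
        G = np.fft.rfft(pssm[k, ::-1], size)    # reversed row -> correlation
        corr = np.fft.irfft(F * G, size)
        total += corr[w - 1:n]                  # offsets 0 .. n - w
    return total

# Toy log-odds PSSM of width 3 favoring the word "ACG":
probs = np.array([[.7, .1, .1],
                  [.1, .7, .1],
                  [.1, .1, .7],
                  [.1, .1, .1]])
print(pssm_scores_fft("ACGTACG", np.log2(probs / 0.25)))
```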