
Showing papers in "Journal of Computational Biology in 2005"


Journal ArticleDOI
TL;DR: This algorithm provides an objective and noise-resistant method for quantification of qRT-PCR results that is independent of the specific equipment used to perform PCR reactions.
Abstract: Quantitative real-time polymerase chain reactions (qRT-PCR) have become the method of choice for rapid, sensitive, quantitative comparison of RNA transcript abundance. Useful data from this method depend on fitting data to theoretical curves that allow computation of mRNA levels. Calculating accurate mRNA levels requires important parameters such as reaction efficiency and the fractional cycle number at threshold (CT) to be used; however, many algorithms currently in use estimate these important parameters. Here we describe an objective method for quantifying qRT-PCR results using calculations based on the kinetics of individual PCR reactions, without the need for a standard curve and independent of any assumptions or subjective judgments, allowing direct calculation of efficiency and CT. We use a four-parameter logistic model to fit the raw fluorescence data as a function of PCR cycles to identify the exponential phase of the reaction. Next, we use a three-parameter simple exponent model to fit the exponential phase using an iterative nonlinear regression algorithm. Within the exponential portion of the curve, our technique automatically identifies candidate regression values using the P-value of regression and then uses a weighted average to compute a final efficiency for quantification. For CT determination, we chose the first positive second derivative maximum from the logistic model. This algorithm provides an objective and noise-resistant method for quantification of qRT-PCR results that is independent of the specific equipment used to perform PCR reactions.
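
The curve-fitting ideas above translate directly into code. Below is a minimal sketch (not the authors' implementation) of fitting a four-parameter logistic curve to raw fluorescence and taking the first maximum of its second derivative as a CT-like value; the synthetic data and starting values are hypothetical, and the paper's full pipeline (exponential-phase fitting, P-value-weighted efficiency) is not reproduced.

```python
# Minimal sketch: four-parameter logistic fit + second-derivative-maximum CT.
import numpy as np
from scipy.optimize import curve_fit

def logistic4(x, fb, fmax, x0, b):
    """Four-parameter logistic: baseline + sigmoid amplification curve."""
    return fb + fmax / (1.0 + np.exp(-(x - x0) / b))

def ct_from_second_derivative(cycles, fluorescence):
    p0 = [fluorescence.min(), np.ptp(fluorescence), np.median(cycles), 1.0]
    params, _ = curve_fit(logistic4, cycles, fluorescence, p0=p0, maxfev=10000)
    # Evaluate the fitted curve on a fine grid and locate the first
    # (positive) maximum of the numerical second derivative.
    grid = np.linspace(cycles.min(), cycles.max(), 2000)
    d2 = np.gradient(np.gradient(logistic4(grid, *params), grid), grid)
    return grid[np.argmax(d2)], params

# Synthetic example (hypothetical parameter values, for illustration only):
cycles = np.arange(1, 41, dtype=float)
fluor = logistic4(cycles, 0.05, 1.0, 24.0, 1.8) + np.random.normal(0, 0.005, cycles.size)
ct, fitted = ct_from_second_derivative(cycles, fluor)
print(f"CT estimate from second-derivative maximum: {ct:.2f}")
```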

1,186 citations


Journal ArticleDOI
TL;DR: It is suggested that the system producing the measured intensities is too complex to be fully described by these relatively simple physical models, and empirically motivated stochastic models that complement the molecular hybridization theory are proposed to provide a comprehensive description of the data.
Abstract: High density oligonucleotide expression arrays are a widely used tool for the measurement of gene expression on a large scale. Affymetrix GeneChip arrays appear to dominate this market. These arrays use short oligonucleotides to probe for genes in an RNA sample. Due to optical noise, nonspecific hybridization, probe-specific effects, and measurement error, ad hoc measures of expression that summarize probe intensities can lead to imprecise and inaccurate results. Various researchers have demonstrated that expression measures based on simple statistical models can provide great improvements over the ad hoc procedure offered by Affymetrix. Recently, physical models based on molecular hybridization theory have been proposed as useful tools for prediction of, for example, nonspecific hybridization. These physical models show great potential in terms of improving existing expression measures. In this paper, we suggest that the system producing the measured intensities is too complex to be fully described with these relatively simple physical models, and we propose empirically motivated stochastic models that complement the above-mentioned molecular hybridization theory to provide a comprehensive description of the data. We discuss how the proposed model can be used to obtain improved measures of expression useful for the data analysts.

268 citations


Journal ArticleDOI
TL;DR: Generators and Gröbner bases are determined for the Jukes-Cantor and Kimura models on a binary tree and for several widely used models for biological sequences that have transition matrices that can be diagonalized by means of the Fourier transform of an abelian group.
Abstract: Statistical models of evolution are algebraic varieties in the space of joint probability distributions on the leaf colorations of a phylogenetic tree. The phylogenetic invariants of a model are the polynomials which vanish on the variety. Several widely used models for biological sequences have transition matrices that can be diagonalized by means of the Fourier transform of an abelian group. Their phylogenetic invariants form a toric ideal in the Fourier coordinates. We determine generators and Gröbner bases for these toric ideals. For the Jukes–Cantor and Kimura models on a binary tree, our Gröbner bases consist of certain explicitly constructed polynomials of degree at most four.

182 citations


Journal ArticleDOI
TL;DR: A detailed probabilistic model for protein complexes in a single species and a model for the conservation of complexes between two species are developed and it is proposed that the corresponding bacterial proteins function as a coherent cellular membrane transport system.
Abstract: Mounting evidence shows that many protein complexes are conserved in evolution. Here we use conservation to find complexes that are common to the yeast S. cerevisiae and the bacterium H. pylori. Our analysis combines protein interaction data that are available for each of the two species and orthology information based on protein sequence comparison. We develop a detailed probabilistic model for protein complexes in a single species and a model for the conservation of complexes between two species. Using these models, one can recast the question of finding conserved complexes as a problem of searching for heavy subgraphs in an edge- and node-weighted graph, whose nodes are orthologous protein pairs. We tested this approach on the data currently available for yeast and bacteria and detected 11 significantly conserved complexes. Several of these complexes match very well with prior experimental knowledge on complexes in yeast only and serve for validation of our methodology. The complexes suggest new functions for a variety of uncharacterized proteins. By identifying a conserved complex whose yeast proteins function predominantly in the nuclear pore complex, we propose that the corresponding bacterial proteins function as a coherent cellular membrane transport system. We also compare our results to two alternative methods for detecting complexes and demonstrate that our methodology obtains a much higher specificity.

181 citations


Journal ArticleDOI
TL;DR: Two new methods for reconstructing reticulate evolution of species due to events such as horizontal transfer or hybrid speciation are presented, based upon extensions of Wayne Maddison's approach in his seminal 1997 paper.
Abstract: We present new methods for reconstructing reticulate evolution of species due to events such as horizontal transfer or hybrid speciation; both methods are based upon extensions of Wayne Maddison's approach in his seminal 1997 paper. Our first method is a polynomial time algorithm for constructing phylogenetic networks from two gene trees contained inside the network. We allow the network to have an arbitrary number of reticulations, but we limit the reticulation in the network so that the cycles in the network are node-disjoint ("galled"). Our second method is a polynomial time algorithm for constructing networks with one reticulation, where we allow for errors in the estimated gene trees. Using simulations, we demonstrate improved performance of this method over both NeighborNet and Maddison's method.

174 citations


Journal ArticleDOI
TL;DR: A generative probabilistic model for the development of drug resistance in HIV that agrees with biological knowledge is obtained and is statistically validated as a density estimator.
Abstract: We introduce a mixture model of trees to describe evolutionary processes that are characterized by the ordered accumulation of permanent genetic changes. The basic building block of the model is a directed weighted tree that generates a probability distribution on the set of all patterns of genetic events. We present an EM-like algorithm for learning a mixture model of K trees and show how to determine K with a maximum likelihood approach. As a case study, we consider the accumulation of mutations in the HIV-1 reverse transcriptase that are associated with drug resistance. The fitted model is statistically validated as a density estimator, and the stability of the model topology is analyzed. We obtain a generative probabilistic model for the development of drug resistance in HIV that agrees with biological knowledge. Further applications and extensions of the model are discussed.
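
As a rough illustration of how a single weighted tree can generate a probability distribution over patterns of genetic events, the sketch below uses one common "mutagenetic tree" parameterization (an event can occur only after its parent event, with the edge weight as the conditional probability). The paper's exact model and its EM fitting are not reproduced, and the toy events are hypothetical.

```python
# Hedged sketch: probability of an event pattern under a weighted tree,
# and under a K-component mixture of such trees.

def tree_pattern_prob(parent, edge_prob, pattern):
    """parent: dict child -> parent ('root' for top-level events);
    edge_prob: dict child -> probability of child given its parent occurred;
    pattern: dict event -> 0/1 (the root is implicitly always present)."""
    prob = 1.0
    for child, par in parent.items():
        parent_present = 1 if par == "root" else pattern[par]
        if parent_present:
            prob *= edge_prob[child] if pattern[child] else 1.0 - edge_prob[child]
        elif pattern[child]:
            return 0.0  # a child event cannot occur before its parent event
    return prob

def mixture_prob(trees, weights, pattern):
    """Probability of a pattern under a K-component mixture of trees."""
    return sum(w * tree_pattern_prob(p, e, pattern)
               for w, (p, e) in zip(weights, trees))

# Hypothetical toy example with two resistance mutations A and B:
tree = ({"A": "root", "B": "A"}, {"A": 0.6, "B": 0.4})
print(mixture_prob([tree], [1.0], {"A": 1, "B": 0}))  # 0.6 * (1 - 0.4) = 0.36
```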

133 citations


Journal ArticleDOI
TL;DR: This work presents a novel algorithm for protein redesign, which combines a statistical mechanics-derived, ensemble-based approach to computing the binding constant with the speed and completeness of a branch-and-bound pruning algorithm; an efficient deterministic approximation algorithm, capable of approximating the authors' scoring function to arbitrary precision, is also developed.
Abstract: Realization of novel molecular function requires the ability to alter molecular complex formation. Enzymatic function can be altered by changing enzyme-substrate interactions via modification of an enzyme's active site. A redesigned enzyme may either perform a novel reaction on its native substrates or its native reaction on novel substrates. A number of computational approaches have been developed to address the combinatorial nature of the protein redesign problem. These approaches typically search for the global minimum energy conformation among an exponential number of protein conformations. We present a novel algorithm for protein redesign, which combines a statistical mechanics-derived ensemble-based approach to computing the binding constant with the speed and completeness of a branch-and-bound pruning algorithm. In addition, we developed an efficient deterministic approximation algorithm, capable of approximating our scoring function to arbitrary precision. In practice, the approximation algorithm decreases the execution time of the mutation search by a factor of ten. To test our method, we examined the Phe-specific adenylation domain of the nonribosomal peptide synthetase gramicidin synthetase A (GrsA-PheA). Ensemble scoring, using a rotameric approximation to the partition functions of the bound and unbound states for GrsA-PheA, is first used to predict binding of the wildtype protein and a previously described mutant (selective for leucine), and second, to switch the enzyme specificity toward leucine, using two novel active site sequences computationally predicted by searching through the space of possible active site mutations. The top scoring in silico mutants were created in the wetlab and dissociation/binding constants were determined by fluorescence quenching. These tested mutations exhibit the desired change in specificity from Phe to Leu. Our ensemble-based algorithm, which flexibly models both protein and ligand using rotamer-based partition functions, has application in enzyme redesign, the prediction of protein-ligand binding, and computer-aided drug design.

119 citations


Journal ArticleDOI
TL;DR: This study shows that recovering the actual local tree topologies can be done more accurately than estimating the actual number of recombination events, and indicates that the new lower bound is an improvement on earlier bounds.
Abstract: By viewing the ancestral recombination graph as defining a sequence of trees, we show how possible evolutionary histories consistent with given data can be constructed using the minimum number of r...

115 citations


Journal ArticleDOI
TL;DR: This work investigates several regression models for RSA prediction using linear L1-support vector regression (SVR) approaches as well as standard linear least squares (LS) regression, and compares the performance of the SVR with that of LS regression and NN-based methods.
Abstract: The relative solvent accessibility (RSA) of an amino acid residue in a protein structure is a real number that represents the solvent exposed surface area of this residue in relative terms. The problem of predicting the RSA from the primary amino acid sequence can therefore be cast as a regression problem. Nevertheless, RSA prediction has so far typically been cast as a classification problem. Consequently, various machine learning techniques have been used within the classification framework to predict whether a given amino acid exceeds some (arbitrary) RSA threshold and would thus be predicted to be "exposed," as opposed to "buried." We have recently developed novel methods for RSA prediction using nonlinear regression techniques which provide accurate estimates of the real-valued RSA and outperform classification-based approaches with respect to commonly used two-class projections. However, while their performance seems to provide a significant improvement over previously published approaches, these Neural Network (NN) based methods are computationally expensive to train and involve several thousand parameters. In this work, we develop alternative regression models for RSA prediction which are computationally much less expensive, involve orders-of-magnitude fewer parameters, and are still competitive in terms of prediction quality. In particular, we investigate several regression models for RSA prediction using linear L1-support vector regression (SVR) approaches as well as standard linear least squares (LS) regression. Using rigorously derived validation sets of protein structures and extensive cross-validation analysis, we compare the performance of the SVR with that of LS regression and NN-based methods. In particular, we show that the flexibility of the SVR (as encoded by metaparameters such as the error insensitivity and the error penalization terms) can be very beneficial to optimize the prediction accuracy for buried residues. We conclude that the simple and computationally much more efficient linear SVR performs comparably to nonlinear models and thus can be used in order to facilitate further attempts to design more accurate RSA prediction methods, with applications to fold recognition and de novo protein structure prediction methods.
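
A minimal sketch of the linear L1 (epsilon-insensitive) SVR idea, using scikit-learn's LinearSVR. The one-hot sequence-window features, window size, and metaparameter values here are illustrative assumptions, not the paper's encoding.

```python
# Hedged sketch: linear epsilon-insensitive SVR for real-valued RSA prediction.
import numpy as np
from sklearn.svm import LinearSVR

AA = "ACDEFGHIKLMNPQRSTVWY"

def window_features(sequence, center, half_width=7):
    """One-hot encode a sequence window around `center` (zero-padded at ends)."""
    feats = np.zeros((2 * half_width + 1, len(AA)))
    for k, pos in enumerate(range(center - half_width, center + half_width + 1)):
        if 0 <= pos < len(sequence) and sequence[pos] in AA:
            feats[k, AA.index(sequence[pos])] = 1.0
    return feats.ravel()

def train_rsa_model(X, y, epsilon=0.1, C=1.0):
    """X: feature matrix, y: real-valued RSA targets (hypothetical training data).
    epsilon (error-insensitivity) and C (error penalization) are the two
    metaparameters mentioned in the abstract."""
    model = LinearSVR(epsilon=epsilon, C=C, max_iter=10000)
    model.fit(X, y)
    return model
```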

106 citations


Journal ArticleDOI
TL;DR: A novel combination of four knowledge-based potentials recognizing different features of native protein structures is introduced and tested, and the torsion angle potential is found to show the strongest correlation with model quality.
Abstract: Scoring functions are widely used in the final step of model selection in protein structure prediction. This is of interest both for comparative modeling targets, where it is important to select the best model among a set of many good, "correct" ones, as well as for other (fold recognition or novel fold) targets, where the set may contain many incorrect models. A novel combination of four knowledge-based potentials recognizing different features of native protein structures is introduced and tested. The pairwise, solvation, hydrogen bond, and torsion angle potentials contain largely orthogonal information. Of these, the torsion angle potential is found to show the strongest correlation with model quality. Combining these features with a linear weighting function, it was possible to construct a robust energy function capable of discriminating native-like structures on several benchmarking sets. In a recent blind test (CAFASP–4 MQAP), the scoring function ranked consistently well and was able to reliably di...

105 citations


Journal ArticleDOI
TL;DR: The current state of research in word sense disambiguation (WSD) is reviewed and the current direction of research points towards statistically based algorithms that use existing curated data and can be applied to large sets of biomedical literature.
Abstract: There is a trend towards automatic analysis of large amounts of literature in the biomedical domain. However, this can be effective only if the ambiguity in natural language is resolved. In this paper, the current state of research in word sense disambiguation (WSD) is reviewed. Several methods for WSD have already been proposed, but many systems have been tested only on evaluation sets of limited size. There are currently very few applications of WSD in the biomedical domain. The current direction of research points towards statistically based algorithms that use existing curated data and can be applied to large sets of biomedical literature. There is a need for manually tagged evaluation sets to test WSD algorithms in the biomedical domain. WSD algorithms should preferably be able to take into account both known and unknown senses of a word. Without WSD, automatic meta-analysis of large corpora of text will be error-prone.

Journal ArticleDOI
TL;DR: It is found that graphs based on almost-Delaunay edges significantly reduce the number of edges in the graph representation and hence offer a computational advantage, yet the patterns extracted from such graphs have a biological interpretation approximately equivalent to that of patterns extracted from distance-based graphs.

Abstract: We find recurring amino-acid residue packing patterns, or spatial motifs, that are characteristic of protein structural families, by applying a novel frequent subgraph mining algorithm to graph representations of protein three-dimensional structure. Graph nodes represent amino acids, and edges are chosen in one of three ways: first, using a threshold for contact distance between residues; second, using Delaunay tessellation; and third, using the recently developed almost-Delaunay edges. For a set of graphs representing a protein family from the Structural Classification of Proteins (SCOP) database, subgraph mining typically identifies several hundred common subgraphs corresponding to spatial motifs that are frequently found in proteins in the family but rarely found outside of it. We find that some of the large motifs map onto known functional regions in two protein families explored in this study, i.e., serine proteases and kinases. We find that graphs based on almost-Delaunay edges significantly reduce the number of edges in the graph representation and hence offer a computational advantage, yet the patterns extracted from such graphs have a biological interpretation approximately equivalent to that of patterns extracted from distance-based graphs.
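
For the first of the three edge definitions (contact distance below a threshold), a graph can be sketched as follows. The 8.5 Å cutoff and the use of a plain coordinate array are assumptions for illustration, and Delaunay or almost-Delaunay edge construction is not shown.

```python
# Hedged sketch: residue contact-distance graph with a distance threshold.
import itertools
import numpy as np

def contact_graph(coords, labels, threshold=8.5):
    """coords: (n, 3) array of residue coordinates (e.g., C-alpha positions);
    labels: amino-acid types. Returns (nodes, edges) with edges as index pairs
    whose Euclidean distance is at most the threshold."""
    coords = np.asarray(coords, dtype=float)
    edges = []
    for i, j in itertools.combinations(range(len(coords)), 2):
        if np.linalg.norm(coords[i] - coords[j]) <= threshold:
            edges.append((i, j))
    return list(labels), edges
```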

Journal ArticleDOI
TL;DR: In this article, the authors introduce permuted variable length Markov models (PVLMM) which could capture the potentially important dependencies among positions and apply them to the problem of detecting splice and TFB sites.
Abstract: Many short DNA motifs, such as transcription factor binding sites (TFBS) and splice sites, exhibit strong local as well as nonlocal dependence. We introduce permuted variable length Markov models (PVLMM) which could capture the potentially important dependencies among positions and apply them to the problem of detecting splice and TFB sites. They have been satisfactory from the viewpoint of prediction performance and also give ready biological interpretations of the sequence dependence observed. The issue of model selection is also studied.

Journal ArticleDOI
TL;DR: The results presented in this paper provide an efficient way to compute the Fourier power spectrum at N/3 and the noise signal in gene-finding methods by calculating the nucleotide distributions in the three codon positions.
Abstract: The 3-base periodicity, identified as a pronounced peak at the frequency N/3 (N is the length of the DNA sequence) of the Fourier power spectrum of protein coding regions, is used as a marker in gene-finding algorithms to distinguish protein coding regions (exons) and noncoding regions (introns) of genomes. In this paper, we reveal the explanation of this phenomenon which results from a nonuniform distribution of nucleotides in the three coding positions. There is a linear correlation between the nucleotide distributions in the three codon positions and the power spectrum at the frequency N/3. Furthermore, this study indicates the relationship between the length of a DNA sequence and the variance of nucleotide distributions and the average Fourier power spectrum, which is the noise signal in gene-finding methods. The results presented in this paper provide an efficient way to compute the Fourier power spectrum at N/3 and the noise signal in gene-finding methods by calculating the nucleotide distributions ...
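
The quantity at the center of this analysis is easy to compute directly. The sketch below evaluates the Fourier power spectrum of a DNA sequence at frequency N/3 from the four binary nucleotide indicator sequences; normalization conventions may differ from the paper's.

```python
# Sketch: Fourier power spectrum of a DNA sequence at frequency N/3,
# summed over the indicator sequences of the four nucleotides.
import numpy as np

def power_at_N_over_3(seq):
    seq = seq.upper()
    N = len(seq)
    k = N // 3  # frequency N/3 (exact when N is a multiple of 3, as in coding regions)
    total = 0.0
    for base in "ACGT":
        u = np.array([1.0 if c == base else 0.0 for c in seq])
        U_k = np.sum(u * np.exp(-2j * np.pi * k * np.arange(N) / N))
        total += abs(U_k) ** 2
    return total

# Coding-like sequences with strong codon-position bias tend to give a larger
# value here than random sequences of the same composition.
```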

Journal ArticleDOI
TL;DR: This paper presents algorithms for the planted (l, d)-motif problem that always find the correct answer(s) and is confident that the techniques introduced in this paper will find independent applications.
Abstract: The problem of identifying meaningful patterns (i.e., motifs) from biological data has been studied extensively due to its paramount importance. Three versions of this problem have been identified in the literature. One of these three problems is the planted (l, d)-motif problem. Several instances of this problem have been posed as a challenge. Numerous algorithms have been proposed in the literature that address this challenge. Many of these algorithms fall under the category of heuristic algorithms. In this paper we present algorithms for the planted (l, d)-motif problem that always find the correct answer(s). Our algorithms are very simple and are based on some ideas that are fundamentally different from the ones employed in the literature. We believe that the techniques we introduce in this paper will find independent applications.
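
For concreteness, the planted (l, d)-motif problem and a simple exact (though not efficient) way to solve it can be sketched as follows: enumerate the d-neighborhood of every l-mer in the first sequence and keep candidates that occur, up to d mismatches, in every sequence. This is only a baseline for illustration, not the algorithms proposed in the paper.

```python
# Hedged sketch: exact planted (l, d)-motif search by neighborhood enumeration.
from itertools import combinations, product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def neighborhood(lmer, d, alphabet="ACGT"):
    """All strings within Hamming distance d of lmer."""
    result = set()
    for positions in combinations(range(len(lmer)), d):
        for letters in product(alphabet, repeat=d):
            candidate = list(lmer)
            for pos, ch in zip(positions, letters):
                candidate[pos] = ch
            result.add("".join(candidate))
    return result

def planted_motifs(sequences, l, d):
    """Return every l-mer that occurs (up to d mismatches) in all sequences."""
    first = sequences[0]
    candidates = set()
    for i in range(len(first) - l + 1):
        candidates |= neighborhood(first[i:i + l], d)
    def occurs(motif, seq):
        return any(hamming(motif, seq[j:j + l]) <= d
                   for j in range(len(seq) - l + 1))
    return sorted(m for m in candidates
                  if all(occurs(m, s) for s in sequences[1:]))
```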

Journal ArticleDOI
TL;DR: It is proved that various simplified versions of the problem of designing a pair of primers with prescribed degeneracy that match a maximum number of given input sequences are hard, the polynomiality of some restricted cases is shown, and approximation algorithms for one variant are developed.
Abstract: A PCR primer sequence is called degenerate if some of its positions have several possible bases. The degeneracy of the primer is the number of unique sequence combinations it contains. We study the problem of designing a pair of primers with prescribed degeneracy that match a maximum number of given input sequences. Such problems occur when studying a family of genes that is known only in part, or is known in a related species. We prove that various simplified versions of the problem are hard, show the polynomiality of some restricted cases, and develop approximation algorithms for one variant. Based on these algorithms, we implemented a program called HYDEN for designing highly degenerate primers for a set of genomic sequences. We report on the success of the program in several applications, one of which is an experimental scheme for identifying all human olfactory receptor (OR) genes. In that project, HYDEN was used to design primers with degeneracies up to 10^10 that amplified with high specificity many novel genes of that family, tripling the number of OR genes known at the time.
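
Two of the basic notions above, the degeneracy of a primer and how many input sequences it matches, are easy to make concrete. The sketch below is illustrative only and says nothing about HYDEN's actual search strategy.

```python
# Sketch: degeneracy of a degenerate primer and its coverage of input sequences.
from functools import reduce

def degeneracy(primer):
    """primer: list of allowed-base sets per position, e.g. [{'A'}, {'C','T'}, ...]."""
    return reduce(lambda acc, s: acc * len(s), primer, 1)

def coverage(primer, sequences, start):
    """How many sequences match the degenerate primer at offset `start`."""
    L = len(primer)
    return sum(
        all(seq[start + k] in primer[k] for k in range(L))
        for seq in sequences
        if len(seq) >= start + L
    )

# Example: a 3-position primer with degeneracy 1 * 2 * 4 = 8.
primer = [{"A"}, {"C", "T"}, set("ACGT")]
print(degeneracy(primer))  # 8
```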

Journal ArticleDOI
TL;DR: An efficient algorithm is presented for detecting approximate tandem repeats in genomic sequences based on a flexible statistical model which allows a wide range of definitions of approximate tandem repeat definitions.
Abstract: An efficient algorithm is presented for detecting approximate tandem repeats in genomic sequences. The algorithm is based on a flexible statistical model which allows a wide range of definitions of approximate tandem repeats. The ideas and methods underlying the algorithm are described and its effectiveness on genomic data is demonstrated.

Journal ArticleDOI
TL;DR: This work describes algorithmic improvements to seed design that address the problem of designing a set of n seeds to be used simultaneously, and gives a new local search method to optimize the sensitivity of seed sets.
Abstract: The challenge of similarity search in massive DNA sequence databases has inspired major changes in BLAST-style alignment tools, which accelerate search by inspecting only pairs of sequences sharing a common short "seed," or pattern of matching residues. Some of these changes raise the possibility of improving search performance by probing sequence pairs with several distinct seeds, any one of which is sufficient for a seed match. However, designing a set of seeds to maximize their combined sensitivity to biologically meaningful sequence alignments is computationally difficult, even given recent advances in designing single seeds. This work describes algorithmic improvements to seed design that address the problem of designing a set of n seeds to be used simultaneously. We give a new local search method to optimize the sensitivity of seed sets. The method relies on efficient incremental computation of the probability that an alignment contains a match to a seed π, given that it has already failed to match any of the seeds in a set Π. We demonstrate experimentally that multi-seed designs, even with relatively few seeds, can be significantly more sensitive than even optimized single-seed designs.
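
To make the notion of combined seed sensitivity concrete, the sketch below estimates it by Monte Carlo on a simple Bernoulli model of ungapped alignments (each column matches independently with probability p). The paper computes these probabilities exactly and incrementally; the seed patterns and parameters here are only examples.

```python
# Hedged sketch: Monte Carlo estimate of the sensitivity of a spaced-seed set.
import random

def seed_hits(alignment, seed):
    """Does the 0/1 alignment contain a window matching the seed pattern?
    seed is a string over {'1', '*'}; the '1' positions must all be matches."""
    span = len(seed)
    required = [i for i, ch in enumerate(seed) if ch == "1"]
    return any(all(alignment[j + i] for i in required)
               for j in range(len(alignment) - span + 1))

def estimate_sensitivity(seeds, length=64, p=0.7, trials=20000, rng=None):
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        alignment = [rng.random() < p for _ in range(length)]
        if any(seed_hits(alignment, s) for s in seeds):
            hits += 1
    return hits / trials

# A multi-seed set can be more sensitive than a single seed of the same weight:
print(estimate_sensitivity(["111010010100110111"]))                 # one weight-11 spaced seed
print(estimate_sensitivity(["111010010100110111",
                            "110100110010101111"]))                 # plus a second, arbitrary seed
```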

Journal ArticleDOI
TL;DR: This univariate accumulation procedure is compared to new multivariate "greedy" and "maximin" algorithms for choosing marker panels, and although none of the approaches necessarily chooses the panel with optimal performance, the algorithms all likely select panels with performance near enough to the maximum that they all are suitable for practical use.
Abstract: Given a set of potential source populations, genotypes of an individual of unknown origin at a collection of markers can be used to predict the correct source population of the individual. For improved efficiency, informative markers can be chosen from a larger set of markers to maximize the accuracy of this prediction. However, selecting the loci that are individually most informative does not necessarily produce the optimal panel. Here, using genotypes from eight species—carp, cat, chicken, dog, fly, grayling, human, and maize—this univariate accumulation procedure is compared to new multivariate "greedy" and "maximin" algorithms for choosing marker panels. The procedures generally suggest similar panels, although the greedy method often recommends inclusion of loci that are not chosen by the other algorithms. In seven of the eight species, when applied to five or more markers, all methods achieve at least 94% assignment accuracy on simulated individuals, with one species—dog— producing this level of ac...
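
A greedy forward-selection loop of the kind referred to above can be sketched in a few lines. The assignment_accuracy scoring function is a hypothetical placeholder for however panel accuracy is evaluated (for example, on simulated individuals of known origin).

```python
# Hedged sketch: greedy forward selection of a marker panel.
def greedy_panel(all_markers, assignment_accuracy, panel_size):
    """Repeatedly add the marker that most improves the panel-level score.
    assignment_accuracy(panel) -> float is a user-supplied, hypothetical criterion."""
    panel = []
    for _ in range(panel_size):
        remaining = [m for m in all_markers if m not in panel]
        if not remaining:
            break
        best = max(remaining, key=lambda m: assignment_accuracy(panel + [m]))
        panel.append(best)
    return panel
```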

Journal ArticleDOI
TL;DR: An effective integer linear programming (ILP) formulation of the MRHC problem with missing data and a branch-and-bound strategy that utilizes a partial order relationship and some other special relationships among variables to decide the branching order is presented.
Abstract: We study the problem of reconstructing haplotype configurations from genotypes on pedigree data with missing alleles under the Mendelian law of inheritance and the minimum-recombination principle, which is important for the construction of haplotype maps and genetic linkage/association analyses. Our previous results show that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This paper presents an effective integer linear programming (ILP) formulation of the MRHC problem with missing data and a branch-and-bound strategy that utilizes a partial order relationship and some other special relationships among variables to decide the branching order. Nontrivial lower and upper bounds on the optimal number of recombinants are introduced at each branching node to effectively prune the search tree. When multiple solutions exist, a best haplotype configuration is selected based on a maximum likelihood approach. The paper also shows for the first time how to incorporat...

Journal ArticleDOI
TL;DR: A new model and algorithm for identifying conserved gene clusters from pairwise genome comparison is presented, which generalizes a recent model called "gene teams", and removes the constraint in the original model that each gene must have a unique occurrence in each chromosome.
Abstract: The study of conserved gene clusters is important for understanding the forces behind genome organization and evolution, as well as the function of individual genes or gene groups. In this paper, we present a new model and algorithm for identifying conserved gene clusters from pairwise genome comparison. This generalizes a recent model called "gene teams." A gene team is a set of genes that appear homologously in two or more species, possibly in a different order yet with the distance of adjacent genes in the team for each chromosome always no more than a certain threshold. We remove the constraint in the original model that each gene must have a unique occurrence in each chromosome and thus allow the analysis on complex prokaryotic or eukaryotic genomes with extensive paralogs. Our algorithm analyzes a pair of chromosomes in O(mn) time and uses O(m+n) space, where m and n are the number of genes in the respective chromosomes. We demonstrate the utility of our methods by studying two bacterial genomes, E....

Journal ArticleDOI
TL;DR: A new algorithm to recognize TISs with a very high accuracy is presented and a class of new sequence-similarity kernels based on string editing, called edit kernels, are introduced for use with support vector machines (SVMs) in a discriminative approach to predict T ISs.
Abstract: The prediction of translation initiation sites (TISs) in eukaryotic mRNAs has been a challenging problem in computational molecular biology. In this paper, we present a new algorithm to recognize TISs with a very high accuracy. Our algorithm includes two novel ideas. First, we introduce a class of new sequence-similarity kernels based on string editing, called edit kernels, for use with support vector machines (SVMs) in a discriminative approach to predict TISs. The edit kernels are simple and have significant biological and probabilistic interpretations. Although the edit kernels are not positive definite, it is easy to make the kernel matrix positive definite by adjusting the parameters. Second, we convert the region of an input mRNA sequence downstream to a putative TIS into an amino acid sequence before applying SVMs to avoid the high redundancy in the genetic code. The algorithm has been implemented and tested on previously published data. Our experimental results on real mRNA data show that both ideas improve the prediction accuracy greatly and that our method performs significantly better than those based on neural networks and SVMs with polynomial kernels or Salzberg kernels.
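
As a rough illustration of a kernel built from string edit distance, the sketch below exponentiates a negative, scaled Levenshtein distance to form a kernel matrix for use with SVMs that accept precomputed kernels. The paper's edit kernels use biologically motivated edit operations and parameters, so this is only a simplified stand-in, and (as the abstract notes) such kernels are not guaranteed to be positive definite.

```python
# Hedged sketch: an edit-distance-based similarity matrix for SVM input.
import numpy as np

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1,
                           dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return int(dp[len(a), len(b)])

def edit_kernel_matrix(sequences, gamma=0.1):
    """K[i, j] = exp(-gamma * edit_distance(seq_i, seq_j)); gamma is a tunable scale."""
    n = len(sequences)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = np.exp(-gamma * edit_distance(sequences[i], sequences[j]))
    return K  # can be passed to an SVM configured for precomputed kernels
```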

Journal ArticleDOI
TL;DR: A new stochastic model for genotype generation is presented, based on a hidden Markov model whose parameters are inferred by an expectation-maximization algorithm; it reflects a general blocky structure of haplotypes but also allows for "exchange" of haplotypes at nonboundary SNP sites.
Abstract: We present a new stochastic model for genotype generation. The model offers a compromise between rigid block structure and no structure altogether: It reflects a general blocky structure of haplotypes, but also allows for "exchange" of haplotypes at nonboundary SNP sites; it also accommodates rare haplotypes and mutations. We use a hidden Markov model and infer its parameters by an expectation-maximization algorithm. The algorithm was implemented in a software package called HINT (haplotype inference tool) and tested on 58 datasets of genotypes. To evaluate the utility of the model in association studies, we used biological human data to create a simple disease association search scenario. When comparing HINT to three other models, HINT predicted association most accurately.

Journal ArticleDOI
TL;DR: A statistically sound two-stage co-expression detection algorithm that controls both statistical significance (false discovery rate, FDR) and biological significance (minimum acceptable strength, MAS) of the discovered co-expressions is designed and implemented.
Abstract: Motivation: Many exploratory microarray data analysis tools such as gene clustering and relevance networks rely on detecting pairwise gene co-expression. Traditional screening of pairwise co-expression controls either biological significance or statistical significance, but not both. The former approach does not provide stochastic error control, and the latter approach screens many co-expressions with excessively low correlation. Methods: We have designed and implemented a statistically sound two-stage co-expression detection algorithm that controls both statistical significance (false discovery rate, FDR) and biological significance (minimum acceptable strength, MAS) of the discovered co-expressions. Based on estimation of pairwise gene correlation, the algorithm provides an initial co-expression discovery that controls only FDR, which is then followed by a second-stage co-expression discovery that controls both FDR and MAS. It also computes and thresholds the set of FDR p-values for each correlation that satisfied the MAS criterion. Results: We validated the asymptotic null distributions of the Pearson and Kendall correlation coefficients and the two-stage error-control procedure using simulated data. We then used yeast galactose metabolism data (Ideker et al., 2001) to illustrate the advantage of our method for clustering genes and constructing a relevance network. In gene clustering, the algorithm screens a seeded cluster of co-expressed genes with controlled FDR and MAS. In constructing the relevance network, the algorithm discovers a set of edges with controlled FDR and MAS. Availability: The method has been implemented in an R package "GeneNT" that is freely available from http://www-personal.umich.edu/~zhud/genent.htm.
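
One plausible reading of the two-stage procedure, using the Fisher z transform and Benjamini-Hochberg FDR control in both stages, is sketched below; it is not the GeneNT code, and the FDR and MAS values are placeholders.

```python
# Hedged sketch: two-stage screening of pairwise correlations (FDR, then FDR + MAS).
import numpy as np
from scipy.stats import norm

def bh_keep(pvals, q):
    """Benjamini-Hochberg step-up: boolean mask of hypotheses kept at FDR level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])
        keep[order[:cutoff + 1]] = True
    return keep

def two_stage_screen(r, n_samples, fdr=0.05, mas=0.6):
    """r: array of pairwise Pearson correlations; n_samples: experiments per gene pair."""
    r_clipped = np.clip(r, -0.999999, 0.999999)
    z = np.arctanh(r_clipped) * np.sqrt(n_samples - 3)
    stage1 = bh_keep(2 * norm.sf(np.abs(z)), fdr)                    # H0: rho = 0
    z_mas = (np.arctanh(np.abs(r_clipped)) - np.arctanh(mas)) * np.sqrt(n_samples - 3)
    stage2 = bh_keep(norm.sf(z_mas), fdr)                            # H0: |rho| <= MAS
    return stage1 & stage2
```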

Journal ArticleDOI
TL;DR: Investigating possible biases introduced by eliminating short blocks, focusing on the notion of breakpoint reuse introduced by Pevzner and Tesler, shows that reuse is very sensitive to the proportion of blocks excluded.
Abstract: In order to apply gene-order rearrangement algorithms to the comparison of genome sequences, Pevzner and Tesler bypass gene finding and ortholog identification and use the order of homologous blocks of unannotated sequence as input. The method excludes blocks shorter than a threshold length. Here we investigate possible biases introduced by eliminating short blocks, focusing on the notion of breakpoint reuse introduced by these authors. Analytic and simulation methods show that reuse is very sensitive to the proportion of blocks excluded. As is pertinent to the comparison of mammalian genomes, this exclusion risks randomizing the comparison partially or entirely.

Journal ArticleDOI
TL;DR: A novel, motion planning based approach to approximately map the energy landscape of an RNA molecule, based on probabilistic roadmap motion planners that have been successfully applied to protein folding is proposed.
Abstract: We propose a novel, motion planning based approach to approximately map the energy landscape of an RNA molecule. A key feature of our method is that it provides a sparse map that captures the main features of the energy landscape which can be analyzed to compute folding kinetics. Our method is based on probabilistic roadmap motion planners that we have previously successfully applied to protein folding. In this paper, we provide evidence that this approach is also well suited to RNA. We compute population kinetics and transition rates on our roadmaps using the master equation for a few moderately sized RNA and show that our results compare favorably with results of other existing methods.
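
Population kinetics on a roadmap can be sketched with the master equation dp/dt = Kp, where K is a rate matrix built from the roadmap edges. The Metropolis-style rates, energies, and three-state toy roadmap below are assumptions for illustration, not the paper's energy model or roadmap construction.

```python
# Hedged sketch: master-equation population kinetics on a small roadmap.
import numpy as np
from scipy.linalg import expm

def rate_matrix(edges, energies, kT=1.0):
    """Off-diagonal K[i, j] is the rate from conformation j to i (Metropolis-style);
    the diagonal is set so that each column sums to zero (probability conserved)."""
    n = len(energies)
    K = np.zeros((n, n))
    for i, j in edges:  # undirected roadmap edges between conformations
        K[j, i] = min(1.0, np.exp(-(energies[j] - energies[i]) / kT))  # rate i -> j
        K[i, j] = min(1.0, np.exp(-(energies[i] - energies[j]) / kT))  # rate j -> i
    K -= np.diag(K.sum(axis=0))
    return K

def populations(K, p0, times):
    """Solve p(t) = exp(K t) p0 at the requested times."""
    return [expm(K * t) @ p0 for t in times]

# Toy example: three states, all probability initially in state 0.
K = rate_matrix([(0, 1), (1, 2)], energies=[0.0, 1.5, -2.0])
for t, p in zip([0.1, 1.0, 10.0], populations(K, np.array([1.0, 0.0, 0.0]), [0.1, 1.0, 10.0])):
    print(t, np.round(p, 3))
```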

Journal ArticleDOI
TL;DR: A description of the compressed suffix array (CSA) and how it is used to obtain matches is presented, and an implementation is demonstrated by aligning two mammalian genomes on a modest workstation equipped with under 2 GB of free RAM in time superior to that of the implementations of other data structures.
Abstract: The starting point for any alignment of mammalian genomes is the computation of exact matches satisfying various criteria. Time-efficient, O(n), data structures for this computation, such as the suffix tree, require O(n log(n)) space, several times the space of the genomes themselves. Thus, any reasonable whole-genome comparative project finds itself requiring tens of Gigabytes of RAM to maintain time-efficiency. This is beyond most modern workstations. With a new data structure, the compressed suffix array (CSA) implemented via the Burrows–Wheeler transform, we can trade time-efficiency for space-efficiency, taking O(n log(n)) time, but running in O(n) space, typically in total space less than or equal to that of the genomes themselves. If space is more expensive than time, this is an appropriate approach to consider. The most space-efficient implementation of this data structure requires 5 bits per nucleotide character to build on-line, in the worst case, and 2.5 bits per character to store once built. ...
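
The core string machinery behind this kind of matching, the Burrows-Wheeler transform plus backward search for counting exact occurrences, can be sketched compactly. This naive rotation-sorting construction and uncompressed occurrence counting are for illustration only and do not reflect the space-efficient CSA implementation described in the paper.

```python
# Hedged sketch: Burrows-Wheeler transform and backward-search counting.
def bwt(text):
    text = text + "$"                       # unique terminator, smallest symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_str, pattern):
    """Backward search: number of occurrences of pattern in the original text."""
    symbols = sorted(set(bwt_str))
    # C[c] = number of characters in the text strictly smaller than c
    C = {c: sum(bwt_str.count(d) for d in symbols if d < c) for c in symbols}
    def occ(c, k):                          # occurrences of c in bwt_str[:k]
        return bwt_str[:k].count(c)
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

genome = "ACGTACGTGACG"
print(count_occurrences(bwt(genome), "ACG"))  # 3
```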

Journal ArticleDOI
TL;DR: It is proved that OHI is NP-hard and can be formulated as an integer quadratic programming (IQP) problem, and an iterative semidefinite programming-based approximation algorithm (called SDPHapInfer) is proposed, which finds a solution within a factor of O(log n) of the optimal solution.
Abstract: This paper studies haplotype inference by maximum parsimony using population data. We define the optimal haplotype inference (OHI) problem as follows: given a set of genotypes and a set of related haplotypes, find a minimum subset of haplotypes that can resolve all the genotypes. We prove that OHI is NP-hard and can be formulated as an integer quadratic programming (IQP) problem. To solve the IQP problem, we propose an iterative semidefinite programming-based approximation algorithm (called SDPHapInfer). We show that this algorithm finds a solution within a factor of O(log n) of the optimal solution, where n is the number of genotypes. This algorithm has been implemented and tested on a variety of simulated and biological data. In comparison with three other methods, (1) HAPAR, which is based on a branch-and-bound algorithm, (2) HAPLOTYPER, which is based on the expectation-maximization algorithm, and (3) PHASE, which combines Gibbs sampling with an approximate coalescent prior, the experimental results indicate that SDPHapInfer and HAPLOTYPER have similar error rates. In addition, the results generated by PHASE have lower error rates on some data but higher error rates on others. The error rates of HAPAR are higher than the others on biological data. In terms of efficiency, SDPHapInfer, HAPLOTYPER, and PHASE output a solution in a stable and consistent way, and they run much faster than HAPAR when the number of genotypes becomes large.
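
At the heart of the OHI formulation is the question of whether a pair of haplotypes resolves a genotype. The sketch below shows that compatibility check under one common encoding (0/1 homozygous, 2 heterozygous), which the paper may not use verbatim.

```python
# Hedged sketch: does a pair of binary haplotypes resolve a genotype?
def resolves(h1, h2, genotype):
    """genotype entries: 0 or 1 = homozygous for that allele, 2 = heterozygous."""
    for a, b, g in zip(h1, h2, genotype):
        if g == 2:
            if a == b:            # heterozygous site needs differing alleles
                return False
        elif not (a == b == g):   # homozygous site needs both alleles equal to g
            return False
    return True

def resolvable(genotypes, haplotypes):
    """Which genotypes can be explained by some pair from a haplotype set?"""
    from itertools import combinations_with_replacement
    return {
        tuple(g): any(resolves(h1, h2, g)
                      for h1, h2 in combinations_with_replacement(haplotypes, 2))
        for g in genotypes
    }

print(resolves((0, 1, 1), (0, 0, 1), (0, 2, 1)))  # True
```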

Journal ArticleDOI
TL;DR: A new tool for representation and detection of gene clusters in multiple genomes, using PQ trees (Booth and Lueker, 1976), which describes the inner structure and the relations between clusters succinctly, aids in filtering meaningful from apparently meaningless clusters, and gives a natural and meaningful way of visualizing complex clusters.
Abstract: Permutations on strings representing gene clusters on genomes have been studied earlier by Uno and Yagiura (2000), Heber and Stoye (2001), Bergeron et al. (2002), Eres et al. (2003), and Schmidt and Stoye (2004), and the idea of a maximal permutation pattern was introduced by Eres et al. (2003). In this paper, we present a new tool for representation and detection of gene clusters in multiple genomes, using PQ trees (Booth and Lueker, 1976): this describes the inner structure and the relations between clusters succinctly, aids in filtering meaningful from apparently meaningless clusters, and also gives a natural and meaningful way of visualizing complex clusters. We identify a minimal consensus PQ tree and prove that it is equivalent to a maximal π pattern (Eres et al., 2003) and each subgraph of the PQ tree corresponds to a nonmaximal permutation pattern. We present a general scheme to handle multiplicity in permutations and also give a linear time algorithm to construct the minimal consensus PQ tree. Fur...

Journal ArticleDOI
TL;DR: In this article, a permutation group-based algorithm is proposed for solving the block-interchange distance problem on circular chromosomes in 𝒪(δn) time, where n is the length of the circular chromosome and δ is the minimum number of block-interchanges required for the transformation, which can be calculated in 𝒪(n) time in advance.
Abstract: In the study of genome rearrangement, block-interchanges have been proposed recently as a new kind of global rearrangement event affecting a genome by swapping two nonintersecting segments of any length. The so-called block-interchange distance problem, which is equivalent to the sorting-by-block-interchange problem, is to find a minimum series of block-interchanges for transforming one chromosome into another. In this paper, we study this problem by considering circular chromosomes and propose an 𝒪(δn) time algorithm for solving it by making use of permutation groups in algebra, where n is the length of the circular chromosome and δ is the minimum number of block-interchanges required for the transformation, which can be calculated in 𝒪(n) time in advance. Moreover, we obtain analogous results by extending our algorithm to linear chromosomes. Finally, we have implemented our algorithm and applied it to the circular genomic sequences of three human Vibrio pathogens for predicting their evolutionar...