
Showing papers in "Journal of Computational Biology in 2003"


Journal ArticleDOI
TL;DR: A probabilistic model in which an OPSM is hidden within an otherwise random matrix is defined, and an efficient algorithm is developed for finding the hidden OPSM in the random matrix.
Abstract: This paper concerns the discovery of patterns in gene expression matrices, in which each element gives the expression level of a given gene in a given experiment. Most existing methods for pattern discovery in such matrices are based on clustering genes by comparing their expression levels in all experiments, or clustering experiments by comparing their expression levels for all genes. Our work goes beyond such global approaches by looking for local patterns that manifest themselves when we focus simultaneously on a subset G of the genes and a subset T of the experiments. Specifically, we look for order-preserving submatrices (OPSMs), in which the expression levels of all genes induce the same linear ordering of the experiments (we show that the OPSM search problem is NP-hard in the worst case). Such a pattern might arise, for example, if the experiments in T represent distinct stages in the progress of a disease or in a cellular process and the expression levels of all genes in G vary across the stages in the same way. We define a probabilistic model in which an OPSM is hidden within an otherwise random matrix. Guided by this model, we develop an efficient algorithm for finding the hidden OPSM in the random matrix. In data generated according to the model, the algorithm recovers the hidden OPSM with a very high success rate. Application of the methods to breast cancer data seems to reveal significant local patterns.
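
The defining OPSM condition translates directly into code. Below is a minimal sketch (data and function name invented for illustration; ties in expression values are assumed absent) of the check that a gene subset G and an experiment subset T induce one common column ordering:

import numpy as np

def is_opsm(matrix, genes, experiments):
    """True if every gene in `genes` induces the same linear ordering
    of the columns in `experiments` by expression level."""
    sub = matrix[np.ix_(genes, experiments)]
    # Column ordering induced by the first gene's expression levels.
    reference = np.argsort(sub[0])
    # Every other row must sort its columns in the same order.
    return all(np.array_equal(np.argsort(row), reference) for row in sub)

expr = np.array([[0.1, 0.5, 0.9, 0.2],
                 [1.0, 2.0, 3.0, 1.5],
                 [0.9, 0.2, 0.1, 0.5]])
print(is_opsm(expr, genes=[0, 1], experiments=[0, 3, 1, 2]))  # True
print(is_opsm(expr, genes=[0, 2], experiments=[0, 3, 1, 2]))  # False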

503 citations


Journal ArticleDOI
TL;DR: A means of representing proteins using pairwise sequence similarity scores, combined with a discriminative classification algorithm known as the support vector machine (SVM), provides a powerful means of detecting subtle structural and evolutionary relationships among proteins.
Abstract: One key element in understanding the molecular machinery of the cell is to understand the structure and function of each protein encoded in the genome. A very successful means of inferring the structure or function of a previously unannotated protein is via sequence similarity with one or more proteins whose structure or function is already known. Toward this end, we propose a means of representing proteins using pairwise sequence similarity scores. This representation, combined with a discriminative classification algorithm known as the support vector machine (SVM), provides a powerful means of detecting subtle structural and evolutionary relationships among proteins. The algorithm, called SVM-pairwise, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better performance than SVM-Fisher, profile HMMs, and PSI-BLAST.
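
The pairwise representation is easy to sketch. In the toy example below, a positionwise-identity score stands in for the real alignment scores used in the paper, and the panel, sequences, and labels are all invented:

from sklearn.svm import SVC

def toy_similarity(a, b):
    """Fraction of matching positions over the shorter sequence --
    a placeholder for a real alignment score."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

panel = ["MKVLAA", "GGHWTR", "MKVLSS"]           # vectorization basis
train = [("MKVLAT", 1), ("MKVLAG", 1), ("GGHWTA", 0), ("GGHWSR", 0)]

# Each protein becomes a vector of similarities to the panel.
X = [[toy_similarity(seq, ref) for ref in panel] for seq, _ in train]
y = [label for _, label in train]

clf = SVC(kernel="rbf").fit(X, y)
query = [[toy_similarity("MKVLAC", ref) for ref in panel]]
print(clf.predict(query))  # expected: [1]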

360 citations


Journal ArticleDOI
TL;DR: In this article, the authors examine the evolution of graphs by node duplication processes and derive exact analytical relationships between the exponent of the power law and the parameters of the duplication model.
Abstract: Are biological networks different from other large complex networks? Both large biological and nonbiological networks exhibit power-law graphs (the number of nodes with degree k scales as N(k) ~ k^(-β)), yet the exponents, β, fall into different ranges. This may be because duplication of the information in the genome is a dominant evolutionary force in shaping biological networks (like gene regulatory networks and protein-protein interaction networks) and is fundamentally different from the mechanisms thought to dominate the growth of most nonbiological networks (such as the Internet). The preferential choice models used for nonbiological networks like web graphs can only produce power-law graphs with exponents greater than 2. We use combinatorial probabilistic methods to examine the evolution of graphs by node duplication processes and derive exact analytical relationships between the exponent of the power law and the parameters of the model. Both full duplication of nodes (with all their connections) as well as partial duplication ...
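
A toy version of such a duplication process is easy to simulate; the seed graph, retention probability p, and sizes below are illustrative choices, not the paper's analyzed parameters:

import random
from collections import Counter

def grow_by_duplication(n_final, p=0.4, seed=1):
    """Partial duplication: a new node copies each edge of a randomly
    chosen template node independently with probability p."""
    random.seed(seed)
    adj = {0: {1}, 1: {0}}                      # seed graph: one edge
    for new in range(2, n_final):
        template = random.randrange(new)
        kept = {u for u in adj[template] if random.random() < p}
        adj[new] = kept
        for u in kept:
            adj[u].add(new)
    return adj

adj = grow_by_duplication(5000)
degree_counts = Counter(len(nbrs) for nbrs in adj.values())
for k in sorted(degree_counts)[:8]:
    print(k, degree_counts[k])   # N(k) falls off roughly as a power law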

305 citations


Journal ArticleDOI
TL;DR: In this paper, the theory of Markov random fields is employed to infer a protein's functions using protein-protein interaction data and the functional annotations of the protein's interaction partners, predicting the probability that the protein has a given function via Bayesian approaches.
Abstract: Assigning functions to novel proteins is one of the most important problems in the postgenomic era. Several approaches have been applied to this problem, including the analysis of gene expression patterns, phylogenetic profiles, protein fusions, and protein-protein interactions. In this paper, we develop a novel approach that employs the theory of Markov random fields to infer a protein's functions using protein-protein interaction data and the functional annotations of the protein's interaction partners. For each function of interest and each protein, we predict the probability that the protein has that function using Bayesian approaches. Unlike other available approaches to protein annotation, which assert only that a protein does or does not have a function of interest, we give a probability for having the function. This probability indicates how confident we are about the prediction. We employ our method to predict protein functions based on "biochemical function," "subcellular location," and "cellular role" for yeast proteins defined in the Yeast Proteome Database (YPD, www.incyte.com), using the protein-protein interaction data from the Munich Information Center for Protein Sequences (MIPS, mips.gsf.de). We show that our approach outperforms other available methods for function prediction based on protein interaction data. The supplementary data are available at www-hto.usc.edu/~msms/ProteinFunction.
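
For intuition only, here is a naive neighbor-voting baseline, not the paper's Markov random field: the probability estimate is a smoothed fraction of annotated interaction partners, with invented protein names and parameter values:

def function_probability(protein, interactions, annotated, prior=0.3, weight=2.0):
    """interactions: dict protein -> set of partners;
    annotated: set of proteins known to have the function.
    Smooths the partner vote toward an assumed genome-wide prior."""
    partners = interactions.get(protein, set())
    hits = sum(1 for p in partners if p in annotated)
    return (hits + weight * prior) / (len(partners) + weight)

ppi = {"YFG1": {"ACT1", "MYO2", "CDC42"}}
print(round(function_probability("YFG1", ppi, annotated={"ACT1", "MYO2"}), 3))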

302 citations


Journal ArticleDOI
TL;DR: This paper surveys the disciplines involved in unstructured-text analysis, categorizes current work in biomedical literature mining with respect to these disciplines, and provides examples of text analysis methods applied towards meeting some of the current challenges in bioinformatics.
Abstract: The past decade has seen a tremendous growth in the amount of experimental and computational biomedical data, specifically in the areas of genomics and proteomics. This growth is accompanied by an accelerated increase in the number of biomedical publications discussing the findings. In the last few years, there has been a lot of interest within the scientific community in literature-mining tools to help sort through this abundance of literature and find the nuggets of information most relevant and useful for specific analysis tasks. This paper provides a road map to the various literature-mining methods, both in general and within bioinformatics. It surveys the disciplines involved in unstructured-text analysis, categorizes current work in biomedical literature mining with respect to these disciplines, and provides examples of text analysis methods applied towards meeting some of the current challenges in bioinformatics.

277 citations


Journal ArticleDOI
TL;DR: This work presents algorithms for time-series gene expression analysis that permit the principled estimation of unobserved time points, clustering, and dataset alignment, and shows a specific application to yeast knock-out data that produces biologically meaningful results.
Abstract: We present algorithms for time-series gene expression analysis that permit the principled estimation of unobserved time points, clustering, and dataset alignment. Each expression profile is modeled as a cubic spline (piecewise polynomial) that is estimated from the observed data and every time point influences the overall smooth expression curve. We constrain the spline coefficients of genes in the same class to have similar expression patterns, while also allowing for gene specific parameters. We show that unobserved time points can be reconstructed using our method with 10-15% less error when compared to previous best methods. Our clustering algorithm operates directly on the continuous representations of gene expression profiles, and we demonstrate that this is particularly effective when applied to nonuniformly sampled data. Our continuous alignment algorithm also avoids difficulties encountered by discrete approaches. In particular, our method allows for control of the number of degrees of freedom of...
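
The estimation of unobserved time points can be illustrated with an ordinary cubic-spline fit to a single profile (times and values invented); the paper's method additionally ties spline coefficients together across genes in the same class:

import numpy as np
from scipy.interpolate import CubicSpline

t_obs = np.array([0.0, 2.0, 3.0, 7.0, 12.0])    # nonuniform sampling times
expr  = np.array([0.1, 0.9, 1.2, 0.4, 0.2])     # observed log-ratios

# Continuous representation of the expression profile.
profile = CubicSpline(t_obs, expr)
t_missing = [1.0, 5.0, 10.0]
print(np.round(profile(t_missing), 3))          # estimates at unobserved times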

276 citations


Journal ArticleDOI
TL;DR: A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced, based on fitting inverse power-law models to construct empirical learning curves.
Abstract: A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced. The goal is to use existing classification results to estimate dataset size requirements for future classification experiments and to evaluate the gain in accuracy and significance of classifiers built with additional data. The method is based on fitting inverse power-law models to construct empirical learning curves. It also includes a permutation test procedure to assess the statistical significance of classification performance for a given dataset size. This procedure is applied to several molecular classification problems representing a broad spectrum of levels of complexity.
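
A minimal sketch of the curve-fitting step follows, with invented (size, error) pairs; the paper's procedure also includes a permutation test for significance, which this omits:

import numpy as np
from scipy.optimize import curve_fit

def inv_power(n, a, b, c):
    """Inverse power-law learning curve: error(n) = a + b * n^(-c)."""
    return a + b * n ** (-c)

sizes  = np.array([10, 20, 40, 80, 160])        # training-set sizes
errors = np.array([0.35, 0.28, 0.22, 0.19, 0.17])

params, _ = curve_fit(inv_power, sizes, errors, p0=(0.1, 1.0, 0.5))
a, b, c = params
print(f"a={a:.3f} b={b:.3f} c={c:.3f}")
print("predicted error at n=500:", round(inv_power(500, a, b, c), 3))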

274 citations


Journal ArticleDOI
TL;DR: The Q5 method outperforms previous full-spectrum complex sample spectral classification techniques and can provide clues as to the molecular identities of differentially expressed proteins and peptides.
Abstract: We have developed an algorithm called Q5 for probabilistic classification of healthy vs. disease whole serum samples using mass spectrometry. The algorithm employs Principal Components Analysis (PCA) followed by Linear Discriminant Analysis (LDA) on whole-spectrum Surface-Enhanced Laser Desorption/Ionization Time-of-Flight (SELDI-TOF) Mass Spectrometry (MS) data, and is demonstrated on four real datasets from complete, complex SELDI spectra of human blood serum. Q5 is a closed-form, exact solution to the problem of classification of complete mass spectra of a complex protein mixture. Q5 employs a probabilistic classification algorithm built upon a dimension-reduced linear discriminant analysis. Our solution is computationally efficient; it is non-iterative and computes the optimal linear discriminant using closed-form equations. The optimal discriminant is computed and verified for datasets of complete, complex SELDI spectra of human blood serum. Replicate experiments of different training/testing splits of each dataset are employed to verify robustness of the algorithm. The probabilistic classification method achieves excellent performance. We achieve sensitivity, specificity, and positive predictive values above 97% on three ovarian cancer datasets and one prostate cancer dataset. The Q5 method outperforms previous full-spectrum complex sample spectral classification techniques, and can provide clues as to the molecular identities of differentially expressed proteins and peptides.
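
The PCA-then-LDA pipeline is straightforward to sketch with standard tools; here random Gaussian intensities stand in for real SELDI-TOF spectra, and the sample sizes and shift are invented:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(30, 2000))
disease = rng.normal(0.3, 1.0, size=(30, 2000))   # shifted intensities
X = np.vstack([healthy, disease])
y = np.array([0] * 30 + [1] * 30)

# PCA first, so the discriminant is computed in a space where
# samples are not vastly outnumbered by features.
pca = PCA(n_components=10).fit(X)
lda = LinearDiscriminantAnalysis().fit(pca.transform(X), y)
print("training accuracy:", lda.score(pca.transform(X), y))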

187 citations


Journal ArticleDOI
TL;DR: A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays as discussed by the authors.
Abstract: A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures which far exceeds the number of samples (observations) available and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained...

169 citations


Journal ArticleDOI
TL;DR: The PMB (Probability Matrix from Blocks) defines a new evolutionary model for protein evolution that can be used for evolutionary analyses of protein sequences and is directly derived from, and thus compatible with, the BLOSUM matrices.
Abstract: Substitution matrices have been useful for sequence alignment and protein sequence comparisons. The BLOSUM series of matrices, which had been derived from a database of alignments of protein blocks, improved the accuracy of alignments previously obtained from the PAM-type matrices estimated from only closely related sequences. Although BLOSUM matrices are scoring matrices now widely used for protein sequence alignments, they do not describe an evolutionary model. BLOSUM matrices do not permit the estimation of the actual number of amino acid substitutions between sequences by correcting for multiple hits. The method presented here uses the Blocks database of protein alignments, along with the additivity of evolutionary distances, to approximate the amino acid substitution probabilities as a function of actual evolutionary distance. The PMB (Probability Matrix from Blocks) defines a new evolutionary model for protein evolution that can be used for evolutionary analyses of protein sequences. Our model is directly derived from, and thus compatible with, the BLOSUM matrices.
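
The sense in which an evolutionary model corrects for multiple hits can be illustrated with a rate matrix: substitution probabilities at distance t follow P(t) = exp(Qt). The tiny three-letter alphabet and rates below are invented for illustration, not the PMB matrices:

import numpy as np
from scipy.linalg import expm

# Instantaneous rate matrix Q for a toy 3-letter alphabet; rows sum to 0.
Q = np.array([[-0.3,  0.2,  0.1],
              [ 0.2, -0.4,  0.2],
              [ 0.1,  0.2, -0.3]])

for t in (0.1, 1.0, 10.0):
    P = expm(Q * t)                 # substitution probabilities at distance t
    print(f"t={t}: P[0] = {np.round(P[0], 3)}")   # drifts toward equilibrium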

141 citations


Journal ArticleDOI
TL;DR: The algorithmic implications of the no-recombination in long blocks observation are explored for the problem of inferring haplotypes in populations, and a simple, easy-to-program, O(nm²)-time algorithm is established that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all of the solutions.
Abstract: A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). In this paper we explore the algorithmic implications of the no-recombination in long blocks observation, for the problem of inferring haplotypes in populations. This assumption, together with the standard population-genetic assumption of infinite sites, motivates a model of haplotype evolution where the haplotypes in a population are assumed to evolve along a coalescent, which as a rooted tree is a perfect phylogeny. ...

Journal ArticleDOI
TL;DR: Stochastic roadmap simulation (SRS) is introduced as a new computational approach for exploring the kinetics of molecular motion by simultaneously examining multiple pathways, and it is shown to converge to the same distribution as Monte Carlo simulation.
Abstract: Classic molecular motion simulation techniques, such as Monte Carlo (MC) simulation, generate motion pathways one at a time and spend most of their time in the local minima of the energy landscape defined over a molecular conformation space. Their high computational cost prevents them from being used to compute ensemble properties (properties requiring the analysis of many pathways). This paper introduces stochastic roadmap simulation (SRS) as a new computational approach for exploring the kinetics of molecular motion by simultaneously examining multiple pathways. These pathways are compactly encoded in a graph, which is constructed by sampling a molecular conformation space at random. This computation, which does not trace any particular pathway explicitly, circumvents the local-minima problem. Each edge in the graph represents a potential transition of the molecule and is associated with a probability indicating the likelihood of this transition. By viewing the graph as a Markov chain, ensemble properties can be efficiently computed over the entire molecular energy landscape. Furthermore, SRS converges to the same distribution as MC simulation. SRS is applied to two biological problems: computing the probability of folding, an important order parameter that measures the "kinetic distance" of a protein's conformation from its native state; and estimating the expected time to escape from a ligand-protein binding site. Comparison with MC simulations on protein folding shows that SRS produces arguably more accurate results, while reducing computation time by several orders of magnitude. Computational studies on ligand-protein binding also demonstrate SRS as a promising approach to study ligand-protein interactions.
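
The "roadmap as Markov chain" computation can be sketched by first-step analysis: with the folded and unfolded conformations made absorbing, the probability of folding from each transient node solves a small linear system. The four-state transition matrix below is invented for illustration:

import numpy as np

# States: 0,1 transient conformations; 2 = folded; 3 = unfolded (absorbing).
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

Q = P[:2, :2]           # transient -> transient transitions
r = P[:2, 2]            # one-step probability of reaching the folded state
# First-step analysis: p = Q p + r  =>  (I - Q) p = r.
p_fold = np.linalg.solve(np.eye(2) - Q, r)
print(np.round(p_fold, 3))   # folding probability from nodes 0 and 1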

Journal ArticleDOI
TL;DR: The ability to produce large sets of unrelated folding pathways may potentially provide crucial insight into some aspects of folding kinetics, such as proteins that exhibit both two-state and three-state kinetics that are not captured by other theoretical techniques.
Abstract: We investigate a novel approach for studying the kinetics of protein folding. Our framework has evolved from robotics motion planning techniques called probabilistic roadmap methods (PRMs) that have been applied in many diverse fields with great success. In our previous work, we presented our PRM-based technique and obtained encouraging results studying protein folding pathways for several small proteins. In this paper, we describe how our motion planning framework can be used to study protein folding kinetics. In particular, we present a refined version of our PRM-based framework and describe how it can be used to produce potential energy landscapes, free energy landscapes, and many folding pathways all from a single roadmap which is computed in a few hours on a desktop PC. Results are presented for 14 proteins. Our ability to produce large sets of unrelated folding pathways may potentially provide crucial insight into some aspects of folding kinetics, such as proteins that exhibit both two-state and three-state kinetics that are not captured by other theoretical techniques.

Journal ArticleDOI
TL;DR: This work shows how to decrease the complexity of modeling flexibility in proteins by reducing the number of dimensions necessary to model important macromolecular motions such as the induced-fit process by using the principal component analysis method, a dimensionality reduction technique.
Abstract: This work shows how to decrease the complexity of modeling flexibility in proteins by reducing the number of dimensions necessary to model important macromolecular motions such as the induced-fit process. Induced fit occurs during the binding of a protein to other proteins, nucleic acids, or small molecules (ligands) and is a critical part of protein function. It is now widely accepted that conformational changes of proteins can affect their ability to bind other molecules and that any progress in modeling protein motion and flexibility will contribute to the understanding of key biological functions. However, modeling protein flexibility has proven a very difficult task. Experimental laboratory methods, such as x-ray crystallography, produce rather limited information, while computational methods such as molecular dynamics are too slow for routine use with large systems. In this work, we show how to use the principal component analysis method, a dimensionality reduction technique, to transform the original high-dimensional representation of protein motion into a lower dimensional representation that captures the dominant modes of motions of proteins. For a medium-sized protein, this corresponds to reducing a problem with a few thousand degrees of freedom to one with less than fifty. Although there is inevitably some loss in accuracy, we show that we can obtain conformations that have been observed in laboratory experiments, starting from different initial conformations and working in a drastically reduced search space.

Journal ArticleDOI
TL;DR: A model of DNA sequence evolution which can account for biases in mutation rates that depend on the identity of the neighboring bases is introduced and may be used as a null model for various sequence analysis applications.
Abstract: We introduce a model of DNA sequence evolution which can account for biases in mutation rates that depend on the identity of the neighboring bases. An analytic solution for this class of models is developed by adopting well-known methods of nonlinear dynamics. Results are presented for the CpG-methylation-deamination process, which dominates point substitutions in vertebrates. The dinucleotide frequencies generated by the model (using empirically obtained mutation rates) match the overall pattern observed in noncoding DNA. A web-based tool has been constructed to compute single- and dinucleotide frequencies for arbitrary neighbor-dependent mutation rates. Also provided is the backward procedure to infer the mutation rates using maximum likelihood analysis given the observed single- and dinucleotide frequencies. Reasonable estimates of the mutation rates can be obtained very efficiently, using generic noncoding DNA sequences as input, after masking out long homonucleotide subsequences. Our method is much more convenient and versatile to use than the traditional method of deducing mutation rates by counting mutation events in carefully chosen sequences. More generally, our approach provides a more realistic but still tractable description of noncoding genomic DNA and may be used as a null model for various sequence analysis applications.
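
A toy simulation conveys the neighbor-dependent idea: CpG sites mutate faster than the context-independent background, depleting CG dinucleotides. All rates and sizes here are illustrative, not the empirically estimated values the paper works with:

import random
from collections import Counter

random.seed(42)
seq = [random.choice("ACGT") for _ in range(50_000)]

P_BG, P_CPG = 0.002, 0.05          # per-site, per-round mutation probabilities
for _ in range(40):                # evolve for 40 rounds
    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            if random.random() < P_CPG:
                seq[i] = "T"       # deamination of methylated C in CpG
        elif random.random() < P_BG:
            seq[i] = random.choice("ACGT")

pairs = Counter(a + b for a, b in zip(seq, seq[1:]))
total = sum(pairs.values())
# CG is depleted relative to the ~1/16 expected for independent letters.
print("CpG fraction:", round(pairs["CG"] / total, 4))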

Journal ArticleDOI
TL;DR: A novel approach to designing a DNA library for molecular computation that takes into account the ability of DNA strands to hybridize in complex structures like hairpins, internal loops, or bulge loops and computes the stability of the hybrids formed based on thermodynamic data.
Abstract: A novel approach to designing a DNA library for molecular computation is presented. The method is employed for encoding binary information in DNA molecules. It aims to achieve a practical discrimination between perfectly matched DNA oligomers and those with mismatches in a large pool of different molecules. The approach takes into account the ability of DNA strands to hybridize in complex structures like hairpins, internal loops, or bulge loops and computes the stability of the hybrids formed based on thermodynamic data. A dynamic programming algorithm is applied to calculate the partition function for the ensemble of structures, which play a role in the hybridization reaction. The applicability of the method is demonstrated by the design of a twelve-bit DNA library. The library is constructed and experimentally tested using molecular biology tools. The results show a high level of specific hybridization achieved for all library words under identical conditions. The method is also applicable for the design...

Journal ArticleDOI
TL;DR: This paper demonstrates the equivalence of some of the linear models proposed for analyzing two-color microarray data and focuses on choices in micro array data analysis that have a larger impact on the results than the choice of linear model.
Abstract: In the past several years many linear models have been proposed for analyzing two-color microarray data. As presented in the literature, many of these models appear dramatically different. However, many of these models are reformulations of the same basic approach to analyzing microarray data. This paper demonstrates the equivalence of some of these models. Attention is directed at choices in microarray data analysis that have a larger impact on the results than the choice of linear model.

Journal ArticleDOI
TL;DR: It is proved that it is impossible to reconstruct ancestral data at the root of "deep" phylogenetic trees with high mutation rates from a number of characters smaller than a low-degree polynomial in the number of leaves.
Abstract: We prove that it is impossible to reconstruct ancestral data at the root of "deep" phylogenetic trees with high mutation rates. Moreover, we prove that it is impossible to reconstruct the topology of "deep" trees with high mutation rates from a number of characters smaller than a low-degree polynomial in the number of leaves. Our impossibility results hold for all reconstruction methods. The proofs apply tools from information theory and percolation theory.

Journal ArticleDOI
TL;DR: An extensive analysis of the monotonicities of scores used to flag exceptionally frequent or rare words in bio-sequences, carried out for a broader variety of scores, supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
Abstract: The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous work, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
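
One concrete instance of such a score is a normal-approximation z-score against an i.i.d. letter model; the paper treats a much broader family of scores and models, so this is only a sketch with an invented input:

from collections import Counter
from math import sqrt

def kmer_zscores(seq, k):
    """Z-score of each k-mer count against an i.i.d. background whose
    letter probabilities are estimated from the sequence itself."""
    n = len(seq) - k + 1
    base_freq = Counter(seq)
    counts = Counter(seq[i:i + k] for i in range(n))
    scores = {}
    for w, obs in counts.items():
        p = 1.0
        for ch in w:                    # P(word) under independent letters
            p *= base_freq[ch] / len(seq)
        mean, var = n * p, n * p * (1 - p)
        scores[w] = (obs - mean) / sqrt(var)
    return scores

z = kmer_zscores("ACGTACGTACGTACGTAAAA", 2)
print(sorted(z.items(), key=lambda kv: -abs(kv[1]))[:3])  # most unusual 2-mers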

Journal ArticleDOI
TL;DR: Two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques are designed, and a criterion based on the form of the CBG is proposed for choosing a priori the faster of the two.
Abstract: The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK] - x(2,3) - [DE] - x(2,3) - Y, where the brackets match any of the letters inside, and x(2,3) a gap of length between 2 and 3). Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, an RE is more sophisticated than a CBG, and searching for it with an RE pattern matching algorithm complicates the search and makes it slow. This is the reason why we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters, and hence it is usually faster than the first one, but in bad cases may have to read the same text character more than once. We then propose a criterion based on the form of the CBG to choose a priori the faster of the two. We also show how to search permitting a few mistakes in the occurrences. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.
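
The baseline the paper improves on, converting a CBG to a regular expression and scanning, takes only a few lines; note that this regex scan reports non-overlapping matches and is exactly the slower approach the article's algorithms replace. The protein string is invented:

import re

# The PROSITE-style CBG [RK] - x(2,3) - [DE] - x(2,3) - Y as a regex.
cbg = re.compile(r"[RK].{2,3}[DE].{2,3}Y")
protein = "MARKAADEGGYLLKQQESSYW"

for m in cbg.finditer(protein):
    print(m.start(), m.group())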

Journal ArticleDOI
TL;DR: This work gives the first polynomial-time suboptimal algorithm that finds all the suboptimal solutions (peptides) in O(p|E|) time, where p is the number of solutions.
Abstract: Tandem mass spectrometry has emerged to be one of the most powerful high-throughput techniques for protein identification. Tandem mass spectrometry selects and fragments peptides of interest into N-terminal ions and C-terminal ions, and it measures the mass/charge ratios of these ions. The de novo peptide sequencing problem is to derive the peptide sequences from given tandem mass spectral data of k ion peaks without searching against protein databases. By transforming the spectral data into a matrix spectrum graph G = (V, E), where |V| = O(k²) and |E| = O(k³), we give the first polynomial-time suboptimal algorithm that finds all the suboptimal solutions (peptides) in O(p|E|) time, where p is the number of solutions. The algorithm has been implemented and tested on experimental data. The program is available at http://hto-c.usc.edu:8000/msms/menu/denovo.htm.
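
The graph-construction step can be sketched as follows: peaks become vertices, and an edge labeled with an amino acid connects two peaks whose mass difference matches that residue. The peak values are invented, and real spectra also require N-/C-terminal ion handling that this sketch omits:

# Monoisotopic residue masses for a few amino acids (Da).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "P": 97.05276, "V": 99.06841, "L": 113.08406}

def spectrum_graph_edges(peaks, tol=0.02):
    """Connect peak pairs whose mass difference matches a residue."""
    edges = []
    for i, m1 in enumerate(peaks):
        for m2 in peaks[i + 1:]:
            for aa, mass in RESIDUE_MASS.items():
                if abs((m2 - m1) - mass) <= tol:
                    edges.append((m1, m2, aa))
    return edges

peaks = sorted([175.119, 232.140, 303.177, 402.246])
for edge in spectrum_graph_edges(peaks):
    print(edge)   # consecutive peaks differ by G, A, V -> reads "...GAV..."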

Journal ArticleDOI
TL;DR: This paper constructs tests for significant groupings against null hypotheses of random gene order, taking incomplete clusters, multiple genomes, and gene families into account, and considers the significance of individual clusters of prespecified genes and the overall degree of clustering in whole genomes.
Abstract: Comparing chromosomal gene order in two or more related species is an important approach to studying the forces that guide genome organization and evolution. Linked clusters of similar genes found in related genomes are often used to support arguments of evolutionary relatedness or functional selection. However, as the gene order and the gene complement of sister genomes diverge progressively due to large scale rearrangements, horizontal gene transfer, gene duplication and gene loss, it becomes increasingly difficult to determine whether observed similarities in local genomic structure are indeed remnants of common ancestral gene order, or are merely coincidences. A rigorous comparative genomics requires principled methods for distinguishing chance commonalities, within or between genomes, from genuine historical or functional relationships. In this paper, we construct tests for significant groupings against null hypotheses of random gene order, taking incomplete clusters, multiple genomes, and gene families into account. We consider both the significance of individual clusters of prespecified genes and the overall degree of clustering in whole genomes.
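
A bare-bones version of such a test, for a single cluster of prespecified genes under a random-gene-order null, can be done by direct simulation; the genome size and gene positions below are invented:

import random

def cluster_p_value(genome_size, positions, trials=100_000, seed=7):
    """Fraction of random placements of the same number of genes that
    land within a window as tight as the observed one."""
    random.seed(seed)
    observed_span = max(positions) - min(positions)
    k, hits = len(positions), 0
    for _ in range(trials):
        sample = random.sample(range(genome_size), k)
        if max(sample) - min(sample) <= observed_span:
            hits += 1
    return hits / trials

# Five genes found within a 12-gene window of a 100-gene genome:
print(cluster_p_value(100, [10, 13, 15, 20, 22]))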

Journal ArticleDOI
Magnus Åstrand
TL;DR: A nonlinear normalization procedure aimed at normalizing feature intensities is proposed for the Affymetrix high-density oligonucleotide array, which has the capacity to simultaneously measure the abundance of thousands of mRNA sequences in biological samples.
Abstract: Affymetrix high-density oligonucleotide array is a tool that has the capacity to simultaneously measure the abundance of thousands of mRNA sequences in biological samples. In order to allow direct ...

Journal ArticleDOI
TL;DR: It is shown that the SUBTREEVALUE reconstruction problem for the σ-index is NP-hard, and algorithmic and analytical solutions for the inverse problems of the four indices are given.
Abstract: In their original paper, Goldman et al. (2000) launched the study of inverse problems in combinatorial chemistry, which is closely related to the design of combinatorial libraries for drug discovery. Following their ideas, we investigate four other topological indices, namely the σ-index, the c-index, the Z-index, and the M₁-index, with a special emphasis on the σ-index. Like the Wiener index, these four indices are very popular in combinatorial chemistry and reflect many chemical and physical properties. We give algorithmic and analytical solutions for the inverse problems of the four indices. We also show that the SUBTREEVALUE reconstruction problem for the σ-index is NP-hard.

Journal ArticleDOI
TL;DR: It seems likely that methods that do not arbitrarily impose block boundaries among correlated SNPs might perform better than block-based methods.
Abstract: In this report, we examine the validity of the haplotype block concept by comparing block decompositions derived from public data sets by variants of several leading methods of block detection. We first develop a statistical method for assessing the concordance of two block decompositions. We then assess the robustness of inferred haplotype blocks to the specific detection method chosen, to arbitrary choices made in the block-detection algorithms, and to the sample analyzed. Although the block decompositions show levels of concordance that are very unlikely by chance, the absolute magnitude of the concordance may be low enough to limit the utility of the inference. For purposes of SNP selection, it seems likely that methods that do not arbitrarily impose block boundaries among correlated SNPs might perform better than block-based methods.

Journal ArticleDOI
TL;DR: The number of microarrays required to gain reliable results from a common type of study, the pairwise comparison of different classes of samples, is estimated; current knowledge allows for the construction of models that look realistic with respect to searches for individual differentially expressed genes, with prototypical parameters derived from real data sets.
Abstract: We estimate the number of microarrays that is required in order to gain reliable results from a common type of study: the pairwise comparison of different classes of samples. We show that current knowledge allows for the construction of models that look realistic with respect to searches for individual differentially expressed genes and derive prototypical parameters from real data sets. Such models allow investigation of the dependence of the required number of samples on the relevant parameters: the biological variability of the samples within each class, the fold changes in expression that are desired to be detected, the detection sensitivity of the microarrays, and the acceptable error rates of the results. We supply experimentalists with general conclusions as well as a freely accessible Java applet at www.scai.fhg.de/special/bio/howmanyarrays/ for fine tuning simulations to their particular settings.
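
For a single gene, the flavor of the calculation is a standard two-sample sample-size formula; the parameter values below are illustrative, the tiny alpha reflects testing thousands of genes, and this is only a back-of-envelope companion to the paper's model-based simulations:

from math import ceil
from scipy.stats import norm

def arrays_per_group(sigma, delta, alpha=1e-5, power=0.9):
    """sigma: SD of log2 expression within a class;
    delta: log2 fold change to detect; alpha is kept tiny because
    thousands of genes are tested simultaneously."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z * sigma / delta) ** 2)

print(arrays_per_group(sigma=0.7, delta=1.0))   # detect a 2-fold change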

Journal ArticleDOI
TL;DR: An efficient algorithm is presented for statistical multiple alignment based on the TKF91 model of Thorne, Kishino, and Felsenstein (1991) on an arbitrary k-leaved phylogenetic tree; a combinatorial technique sums away the hidden Markov model states, improving the time complexity to O(2^k L^k) and considerably reducing memory requirements.
Abstract: We present an efficient algorithm for statistical multiple alignment based on the TKF91 model of Thorne, Kishino, and Felsenstein (1991) on an arbitrary k-leaved phylogenetic tree. The existing algorithms use a hidden Markov model approach, which requires at least O(√5^k) states and leads to a time complexity of O(5^k L^k), where L is the geometric mean sequence length. Using a combinatorial technique reminiscent of inclusion/exclusion, we are able to sum away the states, thus improving the time complexity to O(2^k L^k) and considerably reducing memory requirements. This makes statistical multiple alignment under the TKF91 model a definite practical possibility in the case of a phylogenetic tree with a modest number of leaves.

Journal ArticleDOI
TL;DR: A set of procedures that corrects most of the sequencing errors, changes quality values accordingly, and produces a list of "overlaps," i.e., pairs of reads that plausibly come from overlapping parts of the genome.
Abstract: The whole-genome shotgun (WGS) assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome. WGS assembly begins with a large collection of short fragments that have been selected at random from a genome. For every fragment, the sequence of bases at each end is determined, albeit imprecisely, resulting in a sequence of letters called a “read”. Each letter in a read is assigned a quality value, an estimate of the probability that a sequencing error occurred in determining that letter. Reads are typically truncated after about 500 letters, where sequencing errors become endemic. We report on a set of procedures that (1) corrects most of the sequencing errors, (2) changes quality values accordingly, and (3) produces a list of “overlaps”, i.e., pairs of reads that plausibly come from overlapping parts of the genome. Our procedures can be run iteratively and as a preprocessor for other assemblers. We tested our procedures on Celera's Drosophila reads. When we replaced Celera's overlap procedures in the front end of their assembler, it was able to produce a significantly improved genome.
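
The "overlaps" output can be illustrated with an exact suffix-prefix scan over toy reads; production overlappers use k-mer indexing rather than this quadratic comparison, and error correction is assumed to have happened already:

def find_overlaps(reads, min_len=5):
    """Report (i, j, length) when a suffix of reads[i] of at least
    min_len characters exactly matches a prefix of reads[j]."""
    overlaps = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            best = 0
            for k in range(min_len, min(len(a), len(b)) + 1):
                if a.endswith(b[:k]):
                    best = k
            if best:
                overlaps.append((i, j, best))
    return overlaps

reads = ["ACGTACGGACT", "CGGACTTTAGA", "TTAGACCCGTA"]
print(find_overlaps(reads))   # [(0, 1, 6), (1, 2, 5)]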

Journal ArticleDOI
TL;DR: A polynomial-time 2-approximation algorithm for finding a minimum-length sorting sequence of short swaps for a given permutation is presented, and bounds for the short-swap diameter are obtained.
Abstract: A short swap is an operation on a permutation that switches two elements that have at most one element between them. This paper investigates the problem of finding a minimum-length sorting sequence of short swaps for a given permutation. A polynomial-time 2-approximation algorithm for this problem is presented, and bounds for the short-swap diameter (the length of the longest minimum sorting sequence among all permutations of a given length) are also obtained.
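
For concreteness, here is a naive way to sort using short swaps only (not the paper's 2-approximation algorithm): march each element leftward in hops of two positions, finishing with at most one adjacent swap:

def sort_by_short_swaps(perm):
    """Sort `perm` using only swaps of elements at most two positions
    apart; returns the sorted list and the swaps performed."""
    perm, swaps = list(perm), []
    for i in range(len(perm)):
        j = perm.index(min(perm[i:]), i)       # locate next smallest element
        while j - i >= 2:                      # hop left two at a time
            perm[j - 2], perm[j] = perm[j], perm[j - 2]
            swaps.append((j - 2, j))
            j -= 2
        if j - i == 1:                         # final adjacent swap
            perm[j - 1], perm[j] = perm[j], perm[j - 1]
            swaps.append((j - 1, j))
    return perm, swaps

sorted_perm, swaps = sort_by_short_swaps([3, 1, 4, 0, 2])
print(sorted_perm, len(swaps))   # [0, 1, 2, 3, 4] using 4 short swaps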

Journal ArticleDOI
TL;DR: The set-association method combines information over SNPs by forming sums of relevant single-marker statistics and successfully addresses the "curse of dimensionality" problem: too many variables must be estimated from a comparatively small number of observations.
Abstract: Common heritable diseases ("complex traits") are assumed to be due to multiple underlying susceptibility genes. While genetic mapping methods for Mendelian disorders have been very successful, the search for genes underlying complex traits has been difficult and often disappointing. One of the reasons may be that most current gene-mapping approaches are still based on conventional methodology of testing one or a few SNPs at a time. Here, we demonstrate a simple strategy that allows for the joint analysis of multiple disease-associated SNPs in different genomic regions. Our set-association method combines information over SNPs by forming sums of relevant single-marker statistics. As previously hypothesized, we show here that this approach successfully addresses the "curse of dimensionality" problem: too many variables must be estimated from a comparatively small number of observations. We also report results of simulation studies showing that our method furnishes unbiased and accurate significance levels. Power calculations demonstrate good power even in the presence of large numbers of nondisease-associated SNPs. We extended our method to microarray expression data, where expression levels for large numbers of genes are to be compared between two tissue types. In applications to such data, our approach turned out to be highly efficient.
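
A sketch of the sum-and-permute idea follows, with random genotypes (so the resulting p-value should be unremarkable); the paper's actual single-marker statistics and marker-selection rules differ from this simplified version:

import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 10                                 # individuals, SNPs
genotypes = rng.integers(0, 3, size=(n, m))    # 0/1/2 minor-allele counts
labels = np.array([0] * 100 + [1] * 100)       # controls, cases

def sum_statistic(genotypes, labels):
    """Sum over the SNP set of squared standardized mean differences
    between cases and controls."""
    cases, controls = genotypes[labels == 1], genotypes[labels == 0]
    diff = cases.mean(axis=0) - controls.mean(axis=0)
    se = np.sqrt(cases.var(axis=0) / len(cases) +
                 controls.var(axis=0) / len(controls))
    return np.sum((diff / se) ** 2)

observed = sum_statistic(genotypes, labels)
# Calibrate the sum by permuting case/control labels.
perms = [sum_statistic(genotypes, rng.permutation(labels))
         for _ in range(2000)]
print("p =", np.mean([p >= observed for p in perms]))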