
Showing papers in "Statistical Applications in Genetics and Molecular Biology in 2005"


Journal ArticleDOI
TL;DR: A general framework for 'soft' thresholding that assigns a connection weight to each gene pair is described, and several node connectivity measures are introduced along with empirical evidence that they can be important for predicting the biological significance of a gene.
Abstract: Gene co-expression networks are increasingly used to explore the system-level functionality of genes. The network construction is conceptually straightforward: nodes represent genes and nodes are connected if the corresponding genes are significantly co-expressed across appropriately chosen tissue samples. In reality, it is tricky to define the connections between the nodes in such networks. An important question is whether it is biologically meaningful to encode gene co-expression using binary information (connected=1, unconnected=0). We describe a general framework for 'soft' thresholding that assigns a connection weight to each gene pair. This leads us to define the notion of a weighted gene co-expression network. For soft thresholding we propose several adjacency functions that convert the co-expression measure to a connection weight. For determining the parameters of the adjacency function, we propose a biologically motivated criterion (referred to as the scale-free topology criterion). We generalize the following important network concepts to the case of weighted networks. First, we introduce several node connectivity measures and provide empirical evidence that they can be important for predicting the biological significance of a gene. Second, we provide theoretical and empirical evidence that the 'weighted' topological overlap measure (used to define gene modules) leads to more cohesive modules than its 'unweighted' counterpart. Third, we generalize the clustering coefficient to weighted networks. Unlike the unweighted clustering coefficient, the weighted clustering coefficient is not inversely related to the connectivity. We provide a model that shows how an inverse relationship between clustering coefficient and connectivity arises from hard thresholding. We apply our methods to simulated data, a cancer microarray data set, and a yeast microarray data set.
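To make the soft-thresholding idea concrete, below is a minimal sketch (not the authors' code) of a power adjacency function and the resulting weighted connectivity; the power beta=6 and the simulated expression matrix are illustrative assumptions.

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """expr: samples x genes matrix; returns a gene x gene weighted adjacency matrix."""
    corr = np.corrcoef(expr, rowvar=False)   # co-expression measure
    adj = np.abs(corr) ** beta               # 'soft' power adjacency function
    np.fill_diagonal(adj, 0.0)               # no self-connections
    return adj

def connectivity(adj):
    """Weighted node connectivity: sum of connection weights to all other genes."""
    return adj.sum(axis=1)

rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 100))            # 30 samples, 100 genes (simulated)
k = connectivity(soft_threshold_adjacency(expr, beta=6))
```

In the paper, beta would be chosen with the scale-free topology criterion rather than fixed in advance.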

4,448 citations


Journal ArticleDOI
TL;DR: This work proposes a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation of the optimal shrinkage intensity and applies it to the problem of inferring large-scale gene association networks.
Abstract: Inferring large-scale covariance matrices from sparse genomic data is a ubiquitous problem in bioinformatics. Clearly, the widely used standard covariance and correlation estimators are ill-suited for this purpose. As a statistically efficient and computationally fast alternative, we propose a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation of the optimal shrinkage intensity. Subsequently, we apply this improved covariance estimator (which has guaranteed minimum mean squared error, is well-conditioned, and is always positive definite even for small sample sizes) to the problem of inferring large-scale gene association networks. We show that it performs very favorably compared to competing approaches both in simulations and in application to real expression data.
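As a rough illustration of the shrinkage idea (not the authors' estimator), scikit-learn's LedoitWolf class shrinks the sample covariance toward a scaled identity target with an analytically chosen intensity; the paper's estimator uses a different shrinkage target, but the same principle of obtaining a well-conditioned, positive definite estimate when p >> n.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 500))        # 20 samples, 500 variables: p >> n, as in genomic data

lw = LedoitWolf().fit(X)              # analytic shrinkage intensity, no cross-validation
S_shrunk = lw.covariance_             # well-conditioned, positive definite estimate
print("estimated shrinkage intensity:", lw.shrinkage_)
```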

1,641 citations


Journal ArticleDOI
TL;DR: The authors apply these concepts to a seminal paper in bioinformatics, The Molecular Classification of Cancer by Golub et al (1999), and, rather than reproducing that paper exactly, demonstrate that such a reproduction is possible while concentrating on the usefulness of the compendium concept itself.
Abstract: While scientific research and the methodologies involved have gone through substantial technological evolution, the technology involved in the publication of the results of these endeavors has remained relatively stagnant. Publication is largely done in the same manner today as it was fifty years ago. Many journals have adopted electronic formats; however, their orientation and style differ little from a printed document. The documents tend to be static and take little advantage of computational resources that might be available. Recent work, Gentleman and Temple Lang (2003), suggests a methodology and basic infrastructure that can be used to publish documents in a substantially different way. Their approach is suitable for the publication of papers whose message relies on computation. Stated quite simply, Gentleman and Temple Lang (2003) propose a paradigm where documents are mixtures of code and text. Such documents may be self-contained or they may be a component of a compendium which provides the infrastructure needed to provide access to data and supporting software. These documents, or compendiums, can be processed in a number of different ways. One transformation will be to replace the code with its output -- thereby providing the familiar, but limited, static document. In this paper we apply these concepts to a seminal paper in bioinformatics, namely The Molecular Classification of Cancer, Golub et al (1999). The authors of that paper have generously provided data and other information that have allowed us to largely reproduce their results. Rather than reproduce this paper exactly, we demonstrate that such a reproduction is possible and instead concentrate on demonstrating the usefulness of the compendium concept itself.

142 citations


Journal ArticleDOI
TL;DR: Evidence from these analyses is reported that the impact of correlation between gene expression levels on statistical inference based on the empirical Bayes methodology may be quite strong, leading to a high variance of the number of differentially expressed genes.
Abstract: Stochastic dependence between gene expression levels in microarray data is of critical importance for the methods of statistical inference that resort to pooling test statistics across genes. The empirical Bayes methodology in the nonparametric and parametric formulations, as well as closely related methods employing a two-component mixture model, represent typical examples. It is frequently assumed that dependence between gene expressions (or associated test statistics) is sufficiently weak to justify the application of such methods for selecting differentially expressed genes. By applying resampling techniques to simulated and real biological data sets, we have studied a potential impact of the correlation between gene expression levels on the statistical inference based on the empirical Bayes methodology. We report evidence from these analyses that this impact may be quite strong, leading to a high variance of the number of differentially expressed genes. This study also pinpoints specific components of the empirical Bayes method where the reported effect manifests itself.

116 citations


Journal ArticleDOI
TL;DR: An extension of the simulation algorithm characterized by a much more general penetrance function, which allows for the joint action of up to two genes and up to two environmental covariates in the simulated pedigrees, with all possible multiplicative interaction effects between them.
Abstract: We have previously distributed a software package, SIMLA (SIMulation of Linkage and Association), which can be used to generate disease phenotype and marker genotype data in three-generational pedigrees of user-specified structure. To our knowledge, SIMLA is the only publicly available program that can simulate variable levels of both linkage (recombination) and linkage disequilibrium (LD) between marker and disease loci in general pedigrees. While the previous SIMLA version provided flexibility in choosing many parameters relevant for linkage and association mapping of complex human diseases, it did not allow for the segregation of more than one disease locus in a given pedigree and did not incorporate environmental covariates possibly interacting with disease susceptibility genes. Here, we present an extension of the simulation algorithm characterized by a much more general penetrance function, which allows for the joint action of up to two genes and up to two environmental covariates in the simulated pedigrees, with all possible multiplicative interaction effects between them. This makes the program even more useful for comparing the performance of different linkage and association analysis methods applied to complex human phenotypes. SIMLA can assist investigators in planning and designing a variety of linkage and association studies, and can help interpret results of real data analyses by comparing them to results obtained under a user-controlled data generation mechanism. A free download of the SIMLA package is available at http://wwwchg.duhs.duke.edu/software.

64 citations


Journal ArticleDOI
TL;DR: The aim is to detect avoided and/or favored distances between two motifs, for instance, suggesting possible interactions at a molecular level, and the method can be of great interest for functional motif detection, or to improve knowledge of some biological mechanisms.
Abstract: We propose an original statistical method to estimate how the occurrences of a given process along a genome, genes or motifs for instance, may be influenced by the occurrences of a second process. More precisely, the aim is to detect avoided and/or favored distances between two motifs, for instance, suggesting possible interactions at a molecular level. For this, we consider occurrences along the genome as point processes and we use the so-called Hawkes' model. In such a model, the intensity at position t depends linearly on the distances to past occurrences of both processes via two unknown profile functions to be estimated. We perform a nonparametric estimation of both profiles by using B-spline decompositions and a constrained maximum likelihood method. Finally, we use the AIC criterion for model selection. Simulations show the excellent behavior of our estimation procedure. We then apply it to study (i) the dependence between gene occurrences along the E. coli genome and the occurrences of a motif known to be part of the major promoter for this bacterium, and (ii) the dependence between the yeast S. cerevisiae genes and the occurrences of putative polyadenylation signals. The results are consistent with known biological properties or previous predictions, suggesting that this method can be of great interest for functional motif detection, or to improve knowledge of some biological mechanisms.
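The sketch below illustrates the linear (Hawkes-type) intensity described above for a single pair of processes, using a piecewise-constant profile function as a crude stand-in for the B-spline profiles estimated in the paper; all parameter values are made up.

```python
import numpy as np

def intensity(t, occurrences, mu, h_values, bin_width):
    """lambda(t) = mu + sum over past occurrences u < t of h(t - u), with h piecewise constant."""
    dists = t - occurrences[occurrences < t]
    bins = (dists // bin_width).astype(int)
    within = bins < len(h_values)                 # h is zero beyond its support
    return mu + h_values[bins[within]].sum()

occ = np.array([10.0, 40.0, 55.0])                # occurrences of the second process
h = np.array([0.5, 0.2, -0.1, 0.0])               # profile on [0,5), [5,10), [10,15), [15,20)
print(intensity(57.0, occ, mu=0.05, h_values=h, bin_width=5.0))
```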

54 citations


Journal ArticleDOI
TL;DR: The maximum likelihood estimator can be found by the expectation maximization (EM) algorithm, an expression for the information matrix is derived, and explicit analytical solutions for both the EM algorithm and the information matrix are provided.
Abstract: We describe statistical inference in continuous time Markov processes of DNA sequences related by a phylogenetic tree. The maximum likelihood estimator can be found by the expectation maximization (EM) algorithm and an expression for the information matrix is also derived. We provide explicit analytical solutions for the EM algorithm and information matrix.

53 citations


Journal ArticleDOI
TL;DR: A new re-sampling based multiple testing procedure asymptotically controlling the probability that the proportion of false positives among the set of rejections exceeds q at level alpha, where q and alpha are user supplied numbers.
Abstract: Simultaneously testing a collection of null hypotheses about a data generating distribution based on a sample of independent and identically distributed observations is a fundamental and important statistical problem involving many applications. In this article we propose a new re-sampling based multiple testing procedure asymptotically controlling the probability that the proportion of false positives among the set of rejections exceeds q at level alpha, where q and alpha are user supplied numbers. The procedure involves 1) specifying a conditional distribution for a guessed set of true null hypotheses, given the data, which asymptotically is degenerate at the true set of null hypotheses, and 2) specifying a generally valid null distribution for the vector of test-statistics proposed in Pollard & van der Laan (2003), and generalized in our subsequent articles Dudoit, van der Laan, & Pollard (2004), van der Laan, Dudoit, & Pollard (2004), and van der Laan, Dudoit, & Pollard (2004b). Ingredient 1) is established by fitting the empirical Bayes two-component mixture model (Efron (2001b)) to the data to obtain an upper bound for marginal posterior probabilities of the null being true, given the data. We establish the finite-sample rationale behind our proposal, and prove that this new multiple testing procedure asymptotically controls the desired tail probability for the proportion of false positives under general data generating distributions. In addition, we provide simulation studies establishing that this method is generally more powerful in finite samples than our previously proposed augmentation multiple testing procedure (van der Laan, Dudoit, & Pollard (2004b)) and competing procedures from the literature. Finally, we illustrate our methodology with a data analysis.

49 citations


Journal ArticleDOI
TL;DR: This model has been employed to map QTL affecting stem height and diameter growth trajectories in an interspecific hybrid progeny of Populus, leading to the successful discovery of three pleiotropic QTL on different linkage groups.
Abstract: In this article, we present a statistical model for mapping quantitative trait loci (QTL) that determine growth trajectories of two correlated traits during ontogenetic development. This model is derived within the maximum likelihood context, incorporating mathematical aspects of growth processes to model the mean vector and structured antedependence (SAD) models to approximate time-dependent covariance matrices for longitudinal traits. It provides a quantitative framework for testing the relative importance of two mechanisms, pleiotropy and linkage, in contributing to genetic correlations during ontogeny. This model has been employed to map QTL affecting stem height and diameter growth trajectories in an interspecific hybrid progeny of Populus, leading to the successful discovery of three pleiotropic QTL on different linkage groups. The implications of this model for genetic mapping within a broader context are discussed.

47 citations


Journal ArticleDOI
TL;DR: In this article, the problem is cast as Bayesian estimation of a vector known to have a large number of zero components in order to compare the expression of thousands of genes in two different cell lines, and the prior knowledge on expression changes is described using mixture priors that incorporate a mass at zero.
Abstract: Gene microarray technology is often used to compare the expression of thousands of genes in two different cell lines. Typically, one does not expect measurable changes in transcription amounts for a large number of genes; furthermore, the noise level of array experiments is rather high in relation to the available number of replicates. For the purpose of statistical analysis, inference on the "population" difference in expression for genes across the two cell lines is often cast in the framework of hypothesis testing, with the null hypothesis being no change in expression. Given that thousands of genes are investigated at the same time, this requires some multiple comparison correction procedure to be in place. We argue that hypothesis testing, with its emphasis on type I error and its family-wise analogues, may not address the exploratory nature of most microarray experiments. We instead propose viewing the problem as one of estimation of a vector known to have a large number of zero components. In a Bayesian framework, we describe the prior knowledge on expression changes using mixture priors that incorporate a mass at zero, and we choose a loss function that favors the selection of sparse solutions. We consider two different models applicable to the microarray problem, depending on the nature of replicates available, and show how to explore the posterior distributions of the parameters using MCMC. Simulations show an interesting connection between this Bayesian estimation framework and false discovery rate (FDR) control. Finally, two empirical examples illustrate the practical advantages of this Bayesian estimation paradigm.
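A minimal sketch of the mass-at-zero idea, assuming each gene's true log expression change is zero with prior probability pi0 and N(0, tau^2) otherwise, and that the observed change is normal around the truth; the numerical values are illustrative, and the paper estimates such quantities within an MCMC scheme rather than in closed form.

```python
import numpy as np
from scipy.stats import norm

def posterior_summary(y, sigma, pi0, tau):
    """Posterior probability of a nonzero effect and a sparse posterior-mean estimate."""
    m0 = norm.pdf(y, loc=0.0, scale=sigma)                        # marginal density if effect = 0
    m1 = norm.pdf(y, loc=0.0, scale=np.sqrt(sigma**2 + tau**2))   # marginal density under the slab
    p_nonzero = (1 - pi0) * m1 / (pi0 * m0 + (1 - pi0) * m1)
    shrink = tau**2 / (tau**2 + sigma**2)                         # linear shrinkage, given nonzero
    return p_nonzero, p_nonzero * shrink * y

y = np.array([0.1, 2.5, -3.0, 0.4])                               # observed log fold-changes
print(posterior_summary(y, sigma=0.5, pi0=0.9, tau=1.0))
```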

35 citations


Journal ArticleDOI
Wei Pan
TL;DR: This work considers how to fuse biological information into a mixture model to analyze microarray data, and finds that the proposal improves over the standard approach, resulting in reduced false discovery rates (FDR), and hence it is a useful alternative to the current practice.
Abstract: Currently the practice of using existing biological knowledge in analyzing high throughput genomic and proteomic data is mainly for the purpose of validations. Here we take a different approach of incorporating biological knowledge into statistical analysis to improve statistical power and efficiency. Specifically, we consider how to fuse biological information into a mixture model to analyze microarray data. In contrast to a standard mixture model where it is assumed that all the genes come from the same (marginal) distribution, including an equal prior probability of having an event, such as having differential expression or being bound by a transcription factor (TF), our proposed mixture model allows the genes in different groups to have different distributions while the grouping of the genes reflects biological information. Using a list of about 800 putative cell cycle-regulated genes as prior biological knowledge, we analyze genome-wide location data to detect binding sites of the TF Fkh1. We find that our proposal improves over the standard approach, resulting in reduced false discovery rates (FDR), and hence it is a useful alternative to the current practice.
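The core idea can be sketched as a two-component mixture in which the prior probability of the "event" (e.g. being bound by the TF) differs between gene groups defined by prior biological knowledge; the component densities and prior values below are illustrative assumptions, not the fitted model from the paper.

```python
import numpy as np
from scipy.stats import norm

def posterior_event_prob(z, group, prior_by_group):
    """z: per-gene test statistics; group: group label per gene (e.g. 1 = in the prior gene list)."""
    pi1 = np.asarray([prior_by_group[g] for g in group])   # group-specific prior probabilities
    f0 = norm.pdf(z, loc=0.0, scale=1.0)                   # null component density
    f1 = norm.pdf(z, loc=2.0, scale=1.0)                   # "event" component density
    return pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)

z = np.array([0.3, 2.4, 1.8, -0.5])
grp = np.array([0, 1, 1, 0])
print(posterior_event_prob(z, grp, prior_by_group={0: 0.05, 1: 0.4}))
```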

Journal ArticleDOI
TL;DR: A probabilistic methodology for making disease-mapping inferences in large-scale case-control genetic studies using a semi-Bayesian approach that automatically adjusts for the effects of diffuse population stratification.
Abstract: Recent analytic and technological breakthroughs have set the stage for genome-wide linkage disequilibrium studies to map disease-susceptibility variants. This paper discusses a probabilistic methodology for making disease-mapping inferences in large-scale case-control genetic studies. The semi-Bayesian approach promoted compares the probability of the observed data under disease hypotheses to the probability of the data under a null hypothesis defined by data at all the markers interrogated in a large study. This method automatically adjusts for the effects of diffuse population stratification. It is claimed that this characterization of the evidence for or against disease models may facilitate more appropriate inductions for large-scale genetic studies. Results include (i) an analytic solution for the population stratification-adjusted Bayes' factor, (ii) the relationship between sample size and Bayes' factors, (iii) an extension to an approximate Bayes' factor calculated across closely-linked sites, and (iv) an extension across multiple studies. Although this paper deals exclusively with genetic studies, it is possible to generalize the approach to treat many different large-scale experiments including studies of gene expression and proteomics.

Journal ArticleDOI
TL;DR: A method for the generation of knowledge-based potentials is described and applied to the observed torsional angles of known protein structures using Bayesian reasoning, which allows for the use of the function as a force term in the energy minimization of appropriately described structures.
Abstract: We describe a method for the generation of knowledge-based potentials and apply it to the observed torsional angles of known protein structures. The potential is derived using Bayesian reasoning, and is useful as a prior for further such reasoning in the presence of additional data. The potential takes the form of a probability density function, which is described by a small number of coefficients with the number of necessary coefficients determined by tests based on statistical significance and entropy. We demonstrate the methods in deriving one such potential corresponding to two dimensions, the Ramachandran plot. In contrast to traditional histogram-based methods, the function is continuous and differentiable. These properties allow us to use the function as a force term in the energy minimization of appropriately described structures. The method can easily be extended to other observable angles and higher dimensions, or to include sequence dependence and should find applications in structure determination and validation.
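The sketch below shows how a smooth, periodic density over the Ramachandran angles can be turned into a differentiable potential usable as a force term; the Fourier-type parameterization and coefficient values are hypothetical stand-ins, not the coefficients or exact functional form derived in the paper.

```python
import numpy as np

COEFFS = {(1, 0): 0.3, (0, 1): -0.2, (1, 1): 0.1}   # hypothetical coefficients of the log density

def log_density(phi, psi):
    """Unnormalized, smooth, periodic log density over (phi, psi)."""
    return sum(c * np.cos(m * phi + n * psi) for (m, n), c in COEFFS.items())

def potential(phi, psi, kT=1.0):
    """Knowledge-based potential E = -kT * log p(phi, psi)."""
    return -kT * log_density(phi, psi)

def force(phi, psi, kT=1.0):
    """Analytic negative gradient of the potential, usable in energy minimization."""
    dphi = sum(c * m * np.sin(m * phi + n * psi) for (m, n), c in COEFFS.items())
    dpsi = sum(c * n * np.sin(m * phi + n * psi) for (m, n), c in COEFFS.items())
    return -kT * dphi, -kT * dpsi

print(potential(1.0, -0.5), force(1.0, -0.5))
```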

Journal ArticleDOI
TL;DR: This work proposes an alternative reference structure, denoted the blocked reference design (BRD), in which each set of treated samples is co-hybridized to an independent experimental unit of the control (reference) group, and shows that the BRD is more efficient and less expensive than the traditional reference designs.
Abstract: We compare four variants of the reference design for microarray experiments in terms of their relative efficiency. A common reference sample across arrays is the most extensively used variation in practice, but independent samples from a reference group have also been considered in previous works. The relative efficiency of these designs depends on the number of treatments and the ratio between biological and technical variances. Here, we propose another alternative of reference structure, denoted by blocked reference design (BRD), in which each set (replication) of the treated samples is co-hybridized to an independent experimental unit of the control (reference) group. We provide efficiency curves for each pair of designs under different scenarios of variance ratio and number of treatment groups. The results show that the BRD is more efficient and less expensive than the traditional reference designs. Among the situations where the BRD is likely to be preferable we list time course experiments with a baseline and drug experiments with a placebo group.

Journal ArticleDOI
TL;DR: A graph-theoretic/statistical algorithm is discussed for local dynamic modeling of protein complexes using data from affinity purification-mass spectrometry experiments that readily accommodates multicomplex membership by individual proteins and dynamic complex composition, two biological realities not accounted for in existing topological descriptions of the overall protein network.
Abstract: Accurate systems biology modeling requires a complete catalog of protein complexes and their constituent proteins. We discuss a graph-theoretic/statistical algorithm for local dynamic modeling of protein complexes using data from affinity purification-mass spectrometry experiments. The algorithm readily accommodates multicomplex membership by individual proteins and dynamic complex composition, two biological realities not accounted for in existing topological descriptions of the overall protein network. A likelihood-based objective function guides the protein complex modeling algorithm. With an accurate complex membership catalog in place, systems biology can proceed with greater precision.

Journal ArticleDOI
TL;DR: A statistical model and a corresponding analysis method are suggested for experiments with pairing, including designs with individuals observed before and after treatment and many experiments with two-colour spotted arrays; the model is of mixed type, with some parameters estimated by an empirical Bayes method.
Abstract: In microarray experiments quality often varies, for example between samples and between arrays. The need for quality control is therefore strong. A statistical model and a corresponding analysis method are suggested for experiments with pairing, including designs with individuals observed before and after treatment and many experiments with two-colour spotted arrays. The model is of mixed type with some parameters estimated by an empirical Bayes method. Differences in quality are modelled by individual variances and correlations between repetitions. The method is applied to three real and several simulated datasets. Two of the real datasets are of Affymetrix type with patients profiled before and after treatment, and the third dataset is of two-colour spotted cDNA type. In all cases, the patients or arrays had different estimated variances, leading to distinctly unequal weights in the analysis. We also suggest plots which illustrate the variances and correlations that affect the weights computed by our analysis method. For simulated data the improvement relative to previously published methods without weighting is shown to be substantial.

Journal ArticleDOI
TL;DR: A loss-based, cross-validated Deletion/Substitution/Addition regression algorithm is applied to a data set consisting of 317 patients, each with 282 sequenced protease and reverse transcriptase codons to obtain a prediction of viral replication capacity based on an entire mutant/non-mutant sequence profile.
Abstract: Analysis of viral strand sequence data and viral replication capacity could potentially lead to biological insights regarding the replication ability of HIV-1. Determining specific target codons on the viral strand will facilitate the manufacturing of target-specific antiretrovirals. Various algorithmic and analysis techniques can be applied to this application. In this paper, we apply two techniques to a data set consisting of 317 patients, each with 282 sequenced protease and reverse transcriptase codons. The first applies recently developed multiple testing procedures to find codons which have significant univariate associations with the replication capacity of the virus. A single-step multiple testing procedure (Pollard and van der Laan, 2003) was used to control the family-wise error rate (FWER) at the five percent alpha level, together with augmentation multiple testing procedures to control the generalized family-wise error rate (gFWER) or the tail probability of the proportion of false positives (TPPFP). We also applied a data adaptive multiple regression algorithm to obtain a prediction of viral replication capacity based on an entire mutant/non-mutant sequence profile. This is a loss-based, cross-validated Deletion/Substitution/Addition regression algorithm (Sinisi and van der Laan 2004), which builds candidate estimators in the prediction of a univariate outcome by minimizing an empirical risk. These methods are two separate techniques with distinct goals used to analyze this structure of viral data.

Journal ArticleDOI
TL;DR: The study showed that the predictive capability of the different methods depends on the data; hence the ideal choice of method and number of components differs for each data set.
Abstract: Gene expression microarray experiments generate data sets with multiple missing expression values. In some cases, analysis of gene expression requires a complete matrix as input. Either genes with missing values can be removed, or the missing values can be replaced using prediction. We propose six imputation methods. A comparative study of the methods was performed on data from mice and data from the bacterium Enterococcus faecalis, and a linear mixed model was used to test for differences between the methods. The study showed that the different methods' capability to predict depends on the data; hence the ideal choice of method and number of components differs for each data set. For data with correlation structure, methods based on K-nearest neighbours seemed to be best, while for data without correlation structure, using the average of the gene was preferable.
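As a hedged illustration of one of the method families compared (K-nearest-neighbour imputation), the fragment below uses scikit-learn's generic KNNImputer on a simulated expression matrix; it is not the authors' implementation and the data are artificial.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(2)
expr = rng.normal(size=(200, 12))               # 200 genes x 12 arrays (simulated)
expr[rng.random(expr.shape) < 0.05] = np.nan    # ~5% missing expression values

imputed = KNNImputer(n_neighbors=10).fit_transform(expr)   # complete matrix for downstream analysis
```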

Journal ArticleDOI
TL;DR: This work examines the application of statistical model selection methods to reverse-engineering the control of galactose utilization in yeast from DNA microarray experiment data by selecting predictors using a variety of methods, taking into account the variance in each measurement.
Abstract: We examine the application of statistical model selection methods to reverse-engineering the control of galactose utilization in yeast from DNA microarray experiment data. In these experiments, relationships among gene expression values are revealed through modifications of galactose sugar level and genetic perturbations through knockouts. For each gene variable, we select predictors using a variety of methods, taking into account the variance in each measurement. These methods include maximization of log-likelihood with Cp, AIC, and BIC penalties, bootstrap and cross-validation error estimation, and coefficient shrinkage via the Lasso.
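A minimal sketch of one ingredient of such model selection, exhaustive search over small predictor subsets scored by BIC on simulated data; the galactose data, the measurement variances and the other criteria (Cp, AIC, cross-validation, Lasso) discussed in the paper are not reproduced here.

```python
import numpy as np
from itertools import combinations

def bic(y, X):
    """BIC (up to an additive constant) of an ordinary least-squares fit of y on X."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + X.shape[1] * np.log(n)

rng = np.random.default_rng(3)
regulators = rng.normal(size=(50, 6))     # expression of six candidate predictor genes
target = 2.0 * regulators[:, 0] - 1.5 * regulators[:, 3] + rng.normal(scale=0.5, size=50)

subsets = (s for k in range(1, 4) for s in combinations(range(6), k))
best = min(subsets, key=lambda s: bic(target, regulators[:, list(s)]))
print("selected predictors:", best)       # expected: (0, 3)
```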

Journal ArticleDOI
TL;DR: Three potential one-way ANOVA statistics in a parametric statistical framework are presented to separate genes that are differentially regulated across several treatment conditions from those with equal regulation.
Abstract: In the exploding field of gene expression techniques such as DNA microarrays, there are still few general probabilistic methods for analysis of variance. Linear models and ANOVA are heavily used to ...
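For context, the simplest parametric building block here is a per-gene one-way ANOVA F test across treatment conditions; the sketch below uses scipy's standard test on simulated data and is not the specific statistics proposed in the paper.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(4)
cond_a = rng.normal(0.0, 1.0, size=(1000, 5))   # 1000 genes, 5 replicates per condition
cond_b = rng.normal(0.2, 1.0, size=(1000, 5))   # condition with a small shift for all genes
cond_c = rng.normal(0.0, 1.0, size=(1000, 5))

# One F statistic and p-value per gene, testing equal means across the three conditions.
results = [f_oneway(a, b, c) for a, b, c in zip(cond_a, cond_b, cond_c)]
p_values = np.array([r.pvalue for r in results])
```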

Journal ArticleDOI
TL;DR: A linear mixed model is proposed in which the random effects are assumed to follow a mixture distribution, leading to a new type of non-linear shrinkage estimation in which a proportion of estimates is shrunk to zero while the rest follow standard linear shrinkage.
Abstract: Microarray experiments produce expression measurements for thousands of genes simultaneously, though usually for a small number of RNA samples. The most common problem is the identification of genes that are differentially expressed between different groups of samples or biological conditions. As the number of genes far exceeds the number of RNA samples, the inherent multiplicity poses a severe problem in both hypothesis testing and effect estimation. While much of the recent literature is focused on the hypothesis aspects, we concentrate in this paper on effect estimation as a tool for the identification of differentially expressed genes. We propose a linear mixed model where the random effects are assumed to follow a mixture distribution, and study in detail the case of three normals, corresponding to genes that are down-, up- or non-regulated. Our approach leads to a new type of non-linear shrinkage estimation, where a proportion of estimates is shrunk to zero, while the rest follows standard linear shrinkage. This allows us to estimate the log fold-change of the genes involved and to identify those that are differentially expressed within the same model framework. We investigate the operating characteristics of our method using simulation and spike-in studies, and illustrate its application to real data using a breast-cancer dataset.
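A hedged sketch of the shrinkage behaviour under a three-component normal prior for the random gene effects (down-, non- and up-regulated); the component means, variances and weights below are illustrative assumptions rather than the mixed-model estimates obtained in the paper.

```python
import numpy as np
from scipy.stats import norm

COMPONENTS = [(-1.5, 0.5, 0.1), (0.0, 0.01, 0.8), (1.5, 0.5, 0.1)]   # (mean, variance, weight)

def posterior_mean(y, sigma2=0.25):
    """Posterior mean of the log fold-change given y ~ N(effect, sigma2) and the mixture prior."""
    num = den = 0.0
    for m, v, w in COMPONENTS:
        evidence = w * norm.pdf(y, loc=m, scale=np.sqrt(v + sigma2))   # marginal weight of component
        shrunk = (v * y + sigma2 * m) / (v + sigma2)                   # linear shrinkage within component
        num += evidence * shrunk
        den += evidence
    return num / den

for y in (0.1, 0.8, 2.0):
    print(y, "->", round(posterior_mean(y), 3))    # small observed effects are shrunk essentially to zero
```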

Journal ArticleDOI
TL;DR: The statistical problems of microarray fingerprinting are described, similarities with and differences from more conventional microarray applications are outlined, and a statistical measurement error model is used to illustrate the fingerprinting problem for 10 closely related strains from three Bacillus species and 3 strains from non-Bacillus species.
Abstract: Detecting subtle genetic differences between microorganisms is an important problem in molecular epidemiology and microbial forensics. In a typical investigation, gel electrophoresis is used to compare randomly amplified DNA fragments between microbial strains, where the patterns of DNA fragment sizes are proxies for a microbe's genotype. The limited genomic sample captured on a gel is often insufficient to discriminate nearly identical strains. This paper examines the application of microarray technology to DNA fingerprinting as a high-resolution alternative to gel-based methods. The so-called universal microarray, which uses short oligonucleotide probes that do not target specific genes or species, is intended to be applicable to all microorganisms because it does not require prior knowledge of genomic sequence. In principle, closely related strains can be distinguished if the number of probes on the microarray is sufficiently large, i.e., if the genome is sufficiently sampled. In practice, we confront noisy data, imperfectly matched hybridizations, and a high-dimensional inference problem. We describe the statistical problems of microarray fingerprinting, outline similarities with and differences from more conventional microarray applications, and illustrate the statistical fingerprinting problem for 10 closely related strains from three Bacillus species, and 3 strains from non-Bacillus species.

Journal ArticleDOI
TL;DR: In this article, the authors develop permutation-based generalized Wilcoxon rank-sum tests for two-group comparisons of replicated microarray data, and consider generalized signed-rank tests for microarray experiments with a randomized block design.
Abstract: Gene expression data from microarray experiments have been studied using several statistical models. Significance Analysis of Microarrays (SAM), for example, has proved to be useful in analyzing microarray data. In the spirit of the SAM procedures, we develop permutation-based generalized Wilcoxon rank-sum tests for two-group comparisons of replicated microarray data. Also, for microarray experiments with a randomized block design, we consider a generalized signed-rank test. The statistical analysis software is written in R and is freely available as a package.
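The software described is an R package; the Python fragment below only sketches the underlying permutation idea for a single gene, computing a two-sided p-value for a rank-sum statistic (the variable names and the two-sided centering are illustrative choices, not the package's test).

```python
import numpy as np
from scipy.stats import rankdata

def perm_ranksum_pvalue(x, y, n_perm=10000, seed=0):
    """Permutation p-value for the rank sum of group x in a two-group comparison."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n_x = len(x)
    center = (len(pooled) + 1) * n_x / 2                 # expected rank sum under the null
    obs_dev = abs(rankdata(pooled)[:n_x].sum() - center)
    count = 0
    for _ in range(n_perm):
        perm_ranks = rankdata(rng.permutation(pooled))
        count += abs(perm_ranks[:n_x].sum() - center) >= obs_dev
    return count / n_perm

x = np.array([5.1, 4.8, 6.0, 5.5])                       # expression of one gene, group 1
y = np.array([4.2, 3.9, 4.5, 4.1])                       # group 2
print(perm_ranksum_pvalue(x, y))
```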

Journal ArticleDOI
TL;DR: It is found that among a collection of primate sequences, even an optimal sequence-weights approach is only 51% as efficient as the maximum-likelihood approach in inferences of base frequency parameters.
Abstract: Approaches based upon sequence weights, to construct a position weight matrix of nucleotides from aligned inputs, are popular but little effort has been expended to measure their quality. We derive optimal sequence weights that minimize the sum of the variances of the estimators of base frequency parameters for sequences related by a phylogenetic tree. Using these we find that approaches based upon sequence weights can perform very poorly in comparison to approaches based upon a theoretically optimal maximum-likelihood method in the inference of the parameters of a position-weight matrix. Specifically, we find that among a collection of primate sequences, even an optimal sequence-weights approach is only 51% as efficient as the maximum-likelihood approach in inferences of base frequency parameters. We also show how to employ the variance estimators to obtain a greedy ordering of species for sequencing. Application of this ordering for the weighted estimators to a primate collection yields a curve with a long plateau that is not observed with maximum-likelihood estimators. This plateau indicates that the use of weighted estimators on these data seriously limits the utility of obtaining the sequences of more than two or three additional species.
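To illustrate what a sequence-weights estimator does, the sketch below computes position-specific base frequencies where each aligned sequence contributes according to its weight; the toy alignment and weights are made up, whereas the paper derives the weights that minimize the summed variance of these estimators under a phylogenetic tree.

```python
import numpy as np

def weighted_base_frequencies(alignment, weights, alphabet="ACGT"):
    """alignment: equal-length sequences; weights: one non-negative weight per sequence."""
    weights = np.asarray(weights, dtype=float)
    freqs = np.zeros((len(alignment[0]), len(alphabet)))
    for seq, w in zip(alignment, weights):
        for pos, base in enumerate(seq):
            freqs[pos, alphabet.index(base)] += w     # weighted count of each base per column
    return freqs / weights.sum()                      # rows sum to one: a position weight matrix

aln = ["ACGT", "ACGA", "TCGT"]
print(weighted_base_frequencies(aln, weights=[0.5, 0.3, 0.2]))
```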

Journal ArticleDOI
TL;DR: In the experimental evaluation of superfamily-based screening of the SCOP database, it is demonstrated that semi-continuous Profile HMMs significantly outperform their discrete counterparts and that the number of false positive predictions can be reduced substantially.
Abstract: The detection of remote homologies is of major importance for molecular biology applications like drug discovery. The problem is still very challenging even for state-of-the-art probabilistic models of protein families, namely Profile HMMs. In order to improve remote homology detection we propose feature-based semi-continuous Profile HMMs. Based on a richer sequence representation consisting of features which capture the biochemical properties of residues in their local context, family-specific semi-continuous models are estimated in a completely data-driven manner. Additionally, for substantially reducing the number of false predictions an explicit rejection model is estimated. Both the family-specific semi-continuous Profile HMM and the non-target model are competitively evaluated. In the experimental evaluation of superfamily-based screening of the SCOP database, we demonstrate that semi-continuous Profile HMMs significantly outperform their discrete counterparts. Using the rejection model, the number of false positive predictions could be reduced substantially, which is an important prerequisite for target identification applications.

Journal ArticleDOI
TL;DR: A multinomial logistic regression method is proposed that permits estimation and likelihood ratio tests for allele effects and their interactions with continuous covariates, as well as assessment of the degree of population stratification, in genetic association studies of case-parent triads, and has an important application in epidemiological family-based association studies.
Abstract: We propose a multinomial logistic regression method which permits estimation and likelihood ratio tests for allele effects, their interactions with continuous covariates, and assessment of the degree of population stratification in genetic association studies of case-parent triads. Our approach overcomes the constraint imposed by the categorical nature of explanatory variables in the log-linear model. We also demonstrate that the multinomial logistic method can yield efficient inference in the presence of missing parental genotype data via the use of the Expectation-Maximization (EM) algorithm. We performed simulations to compare the multinomial logistic model with the case-pseudosibling conditional logistic model approach, both of which permit the incorporation of continuous covariates. Simulation results indicate that the multinomial logistic model and the conditional logistic model lead to similar estimates in large samples. A simulation-based method of sample size estimation is also used to show that the two models are approximately equivalent in sample size requirements. When parental genotype data are missing, either completely at random or dependent on covariates, the use of the EM algorithm gives the multinomial logistic model greater power. Since the multinomial logistic model offers the possibility of assessing the degree of population stratification in the sample and can also provide efficient inference in the presence of missing parental genotypes, the proposed model has an important application in epidemiological family-based association studies.

Journal ArticleDOI
TL;DR: The conclusion is that association tests are comparatively more efficient than linkage tests for strong association, weak penetrance models, small families and non-extreme phenotypes, whereas the linkage test is more efficient for weak association, strong penetrance models, large families and extreme phenotypes.
Abstract: A combined score test for association and linkage analysis is introduced, based on a biologically plausible model with association between markers and causal genes and penetrance between phenotypes and the causal gene. The test is based on a retrospective likelihood of marker data given phenotypes, treating the alleles of the causal gene as hidden data. It is defined for arbitrary outbred pedigrees and a wide class of genetic models, including polygenic and shared environmental effects, and allows for missing marker data. It is multipoint, taking marker genotypes from several loci into account simultaneously. The score vector has one association and one linkage component, which can be used to define separate tests for association and linkage. For complete marker data, we give closed form expressions for the efficiency of the linkage, association and combined tests. These are exemplified for binary and quantitative phenotypes with or without polygenic effects. The conclusion is that association tests are comparatively more efficient than linkage tests for strong association, weak penetrance models, small families and non-extreme phenotypes, whereas the linkage test is more efficient for weak association, strong penetrance models, large families and extreme phenotypes. The combined test is a robust alternative, which never performs much worse than the best of the linkage and association tests, and sometimes significantly better than both of them. It should be particularly useful when little is known about the genetic model.

Journal ArticleDOI
TL;DR: A novel, cost-efficient two-phase design for predictive clinical gene expression studies, early marker panel determination (EMPD), is presented; analysis of five published datasets finds that as few as 16 patients per group in Phase-1 are sufficient to determine a suitable marker panel of 10 genes, and that this early decision compromises the final performance only marginally.
Abstract: We present a novel, cost-efficient two-phase design for predictive clinical gene expression studies: early marker panel determination (EMPD). In Phase-1, genome-wide microarrays are used only for a small number of individual patient samples. From this Phase-1 data a panel of marker genes is derived. In Phase-2, the expression values of these marker panel genes are measured for a large group of patients and a predictive classification model is learned from this data. Phase-2 does not require the use of expensive whole genome microarrays, thus making EMPD a cost-efficient alternative for current trials. The expected performance loss of EMPD is compared to designs which use genome-wide microarrays for all patients. We also examine the trade-off between the number of patients included in Phase-1 and the number of marker genes required in Phase-2. By analysis of five published datasets, we find that as few as 16 patients per group in Phase-1 are sufficient to determine a suitable marker panel of 10 genes, and that this early decision compromises the final performance only marginally.

Journal ArticleDOI
TL;DR: A new method is proposed that generalizes the model of Epstein and Satten to incorporate both (i) and (ii) and is recommended with data from two genotypes, a recessive or dominant model linking haplotypes to disease, and estimates of haplotype effects among haplotypes with a frequency greater than 10%.
Abstract: Because haplotypes may parsimoniously summarize the effect of genes on disease, there is great interest in using haplotypes in case-control studies of unphased genotype data. Previous methods for investigating haplotype effects in case-control studies have not allowed for both of the following two scenarios that could have a large impact on results: (i) departures from Hardy-Weinberg equilibrium in controls as well as cases, and (ii) an interactive effect of haplotypes and environmental covariates on the probability of disease. A new method is proposed that generalizes the model of Epstein and Satten to incorporate both (i) and (ii). Computations are relatively simple, involving a single loglinear design matrix for parameters modeling the distribution of haplotype frequencies in controls, parameters modeling the effect of haplotypes and covariate-haplotype interactions on disease, and nuisance parameters required for correct inference. Based on simulations with realistic sample sizes, the method is recommended with data from two genotypes, a recessive or dominant model linking haplotypes to disease, and estimates of haplotype effects among haplotypes with a frequency greater than 10%. The methodology is most useful with candidate genotype pairs or for searching through pairs of genotypes when scenarios (i) and (ii) are likely. An example without a covariate illustrates the importance of modeling a departure from Hardy-Weinberg equilibrium in controls.

Journal ArticleDOI
TL;DR: Five spatial correlation structures, exponential, Gaussian, linear, rational quadratic and spherical, are compared for a dataset with 50-mer two-colour oligonucleotide microarrays and 452 probes for selected Arabidopsis genes, and it is concluded that for the data set analysed the correlation seems negligible for non-neighbouring pixels.
Abstract: Statistical models for spot shapes and signal intensities are used in image analysis of laser scans of microarrays. Most models have essentially been based on the assumption of independent pixel intensity values, but models that allow for spatial correlation among neighbouring pixels can accommodate errors in the microarray slide and should improve the model fit. Five spatial correlation structures, exponential, Gaussian, linear, rational quadratic and spherical, are compared for a dataset with 50-mer two-colour oligonucleotide microarrays and 452 probes for selected Arabidopsis genes. Substantial improvement in model fit is obtained for all five correlation structures compared to the model with independent pixel values, and the Gaussian and the spherical models seem to be slightly better than the other three models. We also conclude that for the data set analysed the correlation seems negligible for non-neighbouring pixels.
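For reference, three of the named correlation structures can be written as simple functions of the distance d between pixels with a range parameter r, as sketched below; exact parameterizations vary across software, and the values here are illustrative rather than the fitted ones.

```python
import numpy as np

def corr_exponential(d, r):
    return np.exp(-np.asarray(d, dtype=float) / r)

def corr_gaussian(d, r):
    return np.exp(-(np.asarray(d, dtype=float) / r) ** 2)

def corr_spherical(d, r):
    d = np.asarray(d, dtype=float)
    c = 1.0 - 1.5 * (d / r) + 0.5 * (d / r) ** 3     # zero correlation beyond the range r
    return np.where(d < r, c, 0.0)

d = np.arange(5)                                     # pixel distances 0..4
print(corr_exponential(d, r=2.0))
print(corr_spherical(d, r=2.0))
```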