
Showing papers in "Statistical Applications in Genetics and Molecular Biology in 2009"


Journal ArticleDOI
TL;DR: Two extensions to sparse canonical correlation analysis (sparse CCA) are proposed: sparse supervised CCA, which identifies sparse linear combinations of the two sets of variables that are correlated with each other and associated with an outcome, and sparse multiple CCA for more than two data sets.
Abstract: In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data, when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension to sparse CCA, which we call sparse supervised CCA, which results in the identification of linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data on more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA in order to extend the sparse CCA methodology to the case of more than two data sets. We demonstrate these new methods on simulated data and on a recently published and publicly available diffuse large B-cell lymphoma data set.

417 citations
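
The core sparse CCA step described above can be illustrated with a small numerical sketch. The code below is an illustrative implementation under the common "diagonal covariance" simplification (alternating soft-thresholded updates of the two weight vectors); it is not the authors' software, and the penalty values, function names, and toy data are assumptions.

```python
import numpy as np

def soft_threshold(a, lam):
    """Element-wise soft-thresholding operator."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca(X, Y, lam_u=1.0, lam_v=1.0, n_iter=100):
    """Illustrative sparse CCA: maximize u' X'Y v with L1 shrinkage on u and v,
    treating the within-set covariances as identity (diagonal simplification)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    C = X.T @ Y                                   # cross-covariance (p x q)
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = soft_threshold(C @ v, lam_u)
        if np.linalg.norm(u) > 0:
            u /= np.linalg.norm(u)
        v = soft_threshold(C.T @ u, lam_v)
        if np.linalg.norm(v) > 0:
            v /= np.linalg.norm(v)
    return u, v                                   # sparse canonical weight vectors

# toy example: two assays measured on the same 20 samples
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 50)), rng.normal(size=(20, 30))
u, v = sparse_cca(X, Y, lam_u=4.0, lam_v=4.0)
print((u != 0).sum(), (v != 0).sum())             # number of variables retained per set
```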


Journal ArticleDOI
TL;DR: This paper presents Sparse Canonical Correlation Analysis (SCCA), which examines the relationships between two types of variables and provides sparse solutions that include only small subsets of variables of each type, by maximizing the correlation between the subsets of variables of different types while performing variable selection.
Abstract: Large scale genomic studies with multiple phenotypic or genotypic measures may require the identification of complex multivariate relationships. In multivariate analysis a common way to inspect the relationship between two sets of variables based on their correlation is canonical correlation analysis, which determines linear combinations of all variables of each type with maximal correlation between the two linear combinations. However, in high dimensional data analysis, when the number of variables under consideration exceeds tens of thousands, linear combinations of the entire sets of features may lack biological plausibility and interpretability. In addition, insufficient sample size may lead to computational problems, inaccurate estimates of parameters and non-generalizable results. These problems may be solved by selecting sparse subsets of variables, i.e. obtaining sparse loadings in the linear combinations of variables of each type. In this paper we present Sparse Canonical Correlation Analysis (SCCA) which examines the relationships between two types of variables and provides sparse solutions that include only small subsets of variables of each type by maximizing the correlation between the subsets of variables of different types while performing variable selection. We also present an extension of SCCA--adaptive SCCA. We evaluate their properties using simulated data and illustrate practical use by applying both methods to the study of natural variation in human gene expression.

350 citations
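
As a hedged summary of the kind of criterion involved (notation mine, not necessarily the exact penalties used in this paper), sparse CCA-type methods seek canonical weight vectors that maximize correlation subject to sparsity-inducing constraints:

```latex
\begin{aligned}
\max_{u,\,v}\;& u^{\top} X^{\top} Y v \\
\text{subject to }\;& u^{\top} X^{\top} X u \le 1,\quad v^{\top} Y^{\top} Y v \le 1,\\
& \|u\|_{1} \le c_{1},\quad \|v\|_{1} \le c_{2},
\end{aligned}
```

where the L1 constraints force many loadings in u and v to zero, so only small subsets of variables of each type enter the canonical variates.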


Journal ArticleDOI
TL;DR: In this article, a low-order conditional dependence graph approach is proposed for dynamic Bayesian networks, making it possible to handle a number of time measurements n much smaller than the number of genes p.
Abstract: In this paper, we propose a novel inference method for dynamic genetic networks which makes it possible to cope with a number of time measurements n much smaller than the number of genes p. The approach is based on the concept of low order conditional dependence graph that we extend here to the case of Dynamic Bayesian Networks. Most of our results are based on the theory of graphical models associated with the Directed Acyclic Graphs (DAGs). In this way, we define a minimal DAG G which describes exactly the full order conditional dependencies given the past of the process. Then, to cope with the large p and small n estimation case, we propose to approximate DAG G by considering low order conditional independencies. We introduce partial qth order conditional dependence DAGs G(q) and analyze their probabilistic properties. In general, DAGs G(q) differ from DAG G but still reflect relevant dependence facts for sparse networks such as genetic networks. By using this approximation, we set out a non-Bayesian inference method and demonstrate the effectiveness of this approach on both simulated and real data. The inference procedure is implemented in the R package 'G1DBN', freely available from the CRAN archive.

104 citations
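
A minimal sketch of the first-order (q = 1) conditional dependence idea, assuming a genes-by-time expression matrix: an edge from gene i at time t to gene j at time t+1 is kept only if the dependence survives conditioning on every single other gene. The function names, partial-correlation screening, and threshold are illustrative simplifications, not the G1DBN package's actual code.

```python
import numpy as np

def partial_corr(x, y, z):
    """Partial correlation of x and y given a single conditioning variable z."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

def first_order_edges(expr, threshold=0.5):
    """expr: p x n matrix (genes x time points). Returns a boolean p x p matrix
    whose (i, j) entry indicates a retained edge X_i(t) -> X_j(t+1)."""
    p, n = expr.shape
    past, future = expr[:, :-1], expr[:, 1:]
    edges = np.zeros((p, p), dtype=bool)
    for i in range(p):
        for j in range(p):
            # dependence must survive conditioning on each single other gene
            score = min(abs(partial_corr(past[i], future[j], past[k]))
                        for k in range(p) if k != i)
            edges[i, j] = score > threshold
    return edges
```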


Journal ArticleDOI
TL;DR: The procedure is called Cox univariate shrinkage and has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics.
Abstract: We propose a method for prediction in Cox's proportional hazards model, when the number of features (regressors), p, exceeds the number of observations, n. The method assumes that the features are independent in each risk set, so that the partial likelihood factors into a product. As such, it is analogous to univariate thresholding in linear regression and nearest shrunken centroids in classification. We call the procedure Cox univariate shrinkage and demonstrate its usefulness on real and simulated data. The method has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics. We compare the new method to other proposed methods for survival prediction with a large number of predictors.

76 citations
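
A minimal sketch of the univariate Cox score statistic that drives the feature ranking, assuming right-censored data and Breslow-style risk sets; the function and variable names are mine, and this is not the authors' implementation.

```python
import numpy as np

def cox_score_z(x, time, event):
    """Score statistic (z-scale) for a single covariate in Cox's model at beta = 0:
    U = sum over events of (x_i - risk-set mean of x); I = sum of risk-set variances."""
    U, I = 0.0, 0.0
    for i in np.where(event == 1)[0]:
        at_risk = x[time >= time[i]]
        U += x[i] - at_risk.mean()
        I += at_risk.var()
    return U / np.sqrt(I)

def cox_univariate_ranking(X, time, event):
    """Rank features (columns of X) by the absolute value of their univariate Cox score."""
    scores = np.array([cox_score_z(X[:, j], time, event) for j in range(X.shape[1])])
    return np.argsort(-np.abs(scores)), scores
```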


Journal ArticleDOI
TL;DR: This paper proposes an outlier detection method based on principal component analysis (PCA) and robust estimation of Mahalanobis distances that is fully automatic and demonstrates that outlier removal improves the prediction accuracy of classifiers.
Abstract: In this paper, we address the problem of detecting outlier samples with highly different expression patterns in microarray data. Although outliers are not common, they appear even in widely used benchmark data sets and can negatively affect microarray data analysis. It is important to identify outliers in order to explore underlying experimental or biological problems and remove erroneous data. We propose an outlier detection method based on principal component analysis (PCA) and robust estimation of Mahalanobis distances that is fully automatic. We demonstrate that our outlier detection method identifies biologically significant outliers with high accuracy and that outlier removal improves the prediction accuracy of classifiers. Our outlier detection method is closely related to existing robust PCA methods, so we compare our outlier detection method to a prominent robust PCA method.

75 citations
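
A hedged sketch of the general recipe (PCA followed by robust Mahalanobis distances with a chi-squared cutoff), using scikit-learn's minimum covariance determinant estimator as one possible robust estimator. The number of components, cutoff level, and estimator choice are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA
from sklearn.covariance import MinCovDet

def flag_outlier_arrays(expr, n_components=5, alpha=0.01):
    """expr: samples x genes expression matrix.
    Returns a boolean mask marking putative outlier samples."""
    scores = PCA(n_components=n_components).fit_transform(expr)
    mcd = MinCovDet(random_state=0).fit(scores)          # robust location/scatter
    d2 = mcd.mahalanobis(scores)                         # squared robust distances
    cutoff = chi2.ppf(1 - alpha, df=n_components)        # chi-squared reference
    return d2 > cutoff
```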


Journal ArticleDOI
TL;DR: In this article, adaptations of the elastic net approach are presented for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model for time-to-event data where censoring is present.
Abstract: Use of microarray technology often leads to high-dimensional and low-sample size (HDLSS) data settings. A variety of approaches have been proposed for variable selection in this context. However, only a small number of these have been adapted for time-to-event data where censoring is present. Among standard variable selection methods shown both to have good predictive accuracy and to be computationally efficient is the elastic net penalization approach. In this paper, adaptations of the elastic net approach are presented for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model. Assessment of the two methods is conducted through simulation studies and through analysis of microarray data obtained from a set of patients with diffuse large B-cell lymphoma where time to survival is of interest. The approaches are shown to match or exceed the predictive performance of a Cox-based and an AFT-based variable selection method. The methods are moreover shown to be much more computationally efficient than their respective Cox- and AFT-based counterparts.

66 citations
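
In generic notation (a sketch, not necessarily the exact parameterization used in the paper), elastic net variable selection under the Cox model maximizes the penalized partial log-likelihood:

```latex
\hat{\beta} = \arg\max_{\beta}\;\ell_{\text{partial}}(\beta)
  - \lambda\Big(\alpha \|\beta\|_{1} + \tfrac{1-\alpha}{2}\|\beta\|_{2}^{2}\Big),
\qquad
\ell_{\text{partial}}(\beta) = \sum_{i:\,\delta_i = 1}\Big[x_i^{\top}\beta
  - \log \sum_{l \in R(t_i)} \exp(x_l^{\top}\beta)\Big],
```

where delta_i is the event indicator and R(t_i) the risk set at time t_i; the AFT variant applies the same combined L1/L2 penalty to a regression criterion on (log) survival times rather than to the partial likelihood.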


Journal ArticleDOI
TL;DR: In this paper, an empirical likelihood ratio test (LRT) is proposed for simultaneously testing the null hypothesis of no difference in point-mass proportions and no difference in means of the continuous component.
Abstract: Data composed of a continuous component plus a point-mass frequently arises in genomic studies. The distribution of this type of data is characterized by the proportion of observations in the point mass and the distribution of the continuous component. Standard statistical methods focus on one of these effects at a time and can fail to detect differences between experimental groups. We propose a novel empirical likelihood ratio test (LRT) statistic for simultaneously testing the null hypothesis of no difference in point-mass proportions and no difference in means of the continuous component. This study evaluates the performance of the empirical LRT and three existing point-mass mixture statistics: 1) Two-part statistic with a t-test for testing mean differences (Two-part t), 2) Two-part statistic with Wilcoxon test for testing mean differences (Two-part W), and 3) parametric LRT. Our investigations begin with an analysis of metabolomics data from Arabidopsis thaliana, which contains many metabolites with a large proportion of observed concentrations in a point-mass at zero. All four point-mass mixture statistics identify more significant differences than standard t-tests and Wilcoxon tests. The empirical LRT appears particularly effective. These findings motivate a large simulation study that assesses Type I and Type II error of the four test statistics with various choices of null distribution. The parametric LRT is frequently the most powerful test, as long as the model assumptions are correct. As is common in 'omics data, the Arabidopsis metabolites have widely varying concentration distributions. A single parametric distribution cannot effectively represent all of these distributions, and individually selecting the optimal parametric distribution to use in the LRT for each metabolite is not practical. The empirical LRT, which does not require parametric assumptions, provides an attractive alternative to parametric and standard methods.

43 citations
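
For orientation, here is a minimal sketch of one of the comparator statistics evaluated in the paper, the two-part statistic with a t-test for the continuous component; the proposed empirical LRT itself is not shown, and the pooled-proportion z-test form used here is an assumption.

```python
import numpy as np
from scipy import stats

def two_part_t(x, y):
    """Two-part statistic for point-mass-at-zero data: combines a z-test for the
    difference in zero proportions with a t-test on the nonzero values.
    Returns the chi-squared statistic (2 df) and its p-value."""
    n1, n2 = len(x), len(y)
    z1, z2 = np.sum(x == 0), np.sum(y == 0)
    p_pool = (z1 + z2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z_prop = 0.0 if se == 0 else (z1 / n1 - z2 / n2) / se
    t_cont, _ = stats.ttest_ind(x[x != 0], y[y != 0])
    x2 = z_prop**2 + t_cont**2
    return x2, stats.chi2.sf(x2, df=2)
```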


Journal ArticleDOI
TL;DR: A unified statistical model for analyzing time course experiments at the gene set level is proposed using random coefficient models, which fall into the more general class of mixed effects models, and is demonstrated on a mouse colon development time course dataset.
Abstract: Methods for gene set analysis test for coordinated changes of a group of genes involved in the same biological process or molecular pathway. Higher statistical power is gained for gene set analysis by combining weak signals from a number of individual genes in each group. Although many gene set analysis methods have been proposed for microarray experiments with two groups, few can be applied to time course experiments. We propose a unified statistical model for analyzing time course experiments at the gene set level using random coefficient models, which fall into the more general class of mixed effects models. These models include a systematic component that models the mean trajectory for the group of genes, and a random component (the random coefficients) that models how each gene's trajectory varies about the mean trajectory. We show that the proposed model (1) outperforms currently available methods at discriminating gene sets differentially changed over time from null gene sets; (2) provides more stable results that are less affected by sampling variations; (3) models dependency among genes adequately and preserves type I error rate; and (4) allows for gene ranking based on predicted values of the random effects. We describe simulation studies using gene expression data with “real life” correlations and we demonstrate the proposed random coefficient model using a mouse colon development time course dataset. The agreement between results of the proposed random coefficient model and the previous reports for this proof-of-concept trial further validates this methodology, which provides a unified statistical model for systems analysis of microarray experiments with complex experimental designs when re-sampling based methods are difficult to apply.

40 citations
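
A minimal sketch of a random coefficient fit (random intercept and slope per gene) for a single gene set using statsmodels' mixed-effects interface. The long-format column names are assumptions, and the model in the paper includes additional structure; this only illustrates the systematic mean trajectory plus gene-level random deviations.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_random_coefficient(df: pd.DataFrame):
    """df: long-format data for one gene set, one row per (gene, sample), with
    columns 'expr' (expression), 'time' (time point) and 'gene' (gene identifier).
    Fixed effects give the mean trajectory of the set; each gene gets a random
    intercept and slope describing how its trajectory varies about the mean."""
    model = smf.mixedlm("expr ~ time", data=df, groups=df["gene"], re_formula="~time")
    return model.fit()

# result = fit_random_coefficient(gene_set_df)
# print(result.summary())   # fixed effects = mean trajectory of the gene set
```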


Journal ArticleDOI
TL;DR: In vivo factor profiling of an in vitro signature generates biological insights related to underlying pathway activities and chromosomal structure, and leads to refinements of cancer recurrence risk stratification across several cancer studies.
Abstract: We describe a strategy for the analysis of experimentally derived gene expression signatures and their translation to human observational data. Sparse multivariate regression models are used to identify expression signature gene sets representing downstream biological pathway events following interventions in designed experiments. When translated into in vivo human observational data, analysis using sparse latent factor models can yield multiple quantitative factors characterizing expression patterns that are often more complex than in the controlled, in vitro setting. The estimation of common patterns in expression that reflect all aspects of covariation evident in vivo offers an enhanced, modular view of the complexity of biological associations of signature genes. This can identify substructure in the biological process under experimental investigation and improved biomarkers of clinical outcomes. We illustrate the approach in a detailed study from an oncogene intervention experiment where in vivo factor profiling of an in vitro signature generates biological insights related to underlying pathway activities and chromosomal structure, and leads to refinements of cancer recurrence risk stratification across several cancer studies.

34 citations


Journal ArticleDOI
TL;DR: In this paper, a score test is proposed to detect the existence of quantitative trait loci (QTL) in a backcross population; it is computationally simpler than the likelihood ratio test (LRT) because it only requires the MLEs of parameters under the null hypothesis.
Abstract: We propose a method to detect the existence of quantitative trait loci (QTL) in a backcross population using a score test. Since the score test only uses the MLEs of parameters under the null hypothesis, it is computationally simpler than the likelihood ratio test (LRT). Moreover, because the location parameter of the QTL is unidentifiable under the null hypothesis, the distribution of the maximum of the LRT statistics, typically the statistic of choice for testing H0: no QTL, does not have the standard chi-square distribution asymptotically under the null hypothesis. From the simple structure of the score test statistics, the asymptotic null distribution can be derived for the maximum of the square of score test statistics. Numerical methods are proposed to compute the asymptotic null distribution and the critical thresholds can be obtained accordingly. We show that the maximum of the LR test statistics and the maximum of the square of score statistics are asymptotically equivalent. Therefore, the critical threshold for the score test can be used for the LR test also. A simple backcross design is used to demonstrate the application of the score test to QTL mapping.

32 citations
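
In schematic notation (mine, not the paper's), at each candidate QTL location d the squared score statistic and the genome-wide test are

```latex
S(d) = \frac{U(d)^{2}}{I(d)},
\qquad
T_{\max} = \max_{d}\, S(d),
```

where U(d) is the score (the derivative of the log-likelihood in the QTL effect, evaluated at the null MLEs) and I(d) the corresponding Fisher information. The paper shows that T_max is asymptotically equivalent to the maximum of the LRT statistics, so the critical threshold derived for the score test also applies to the LRT.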


Journal ArticleDOI
TL;DR: A unified framework to model and test the two sets of genetic conflicts via a regularized regression approach is developed and offers a testable framework for the genetic conflict hypothesis previously proposed.
Abstract: Human diseases developed during pregnancy could be caused by the direct effects of both maternal and fetal genes, and/or by the indirect effects caused by genetic conflicts. Genetic conflicts exist when the effects of fetal genes are opposed by the effects of maternal genes, or when there is a conflict between the maternal and paternal genes within the fetal genome. The two types of genetic conflicts involve the functions of different genes in different genomes and are genetically distinct. Differentiating and further dissecting the two sets of genetic conflict effects that increase disease risk during pregnancy present statistical challenges, and have been traditionally pursued as two separate endeavors. In this article, we develop a unified framework to model and test the two sets of genetic conflicts via a regularized regression approach. Our model is developed considering real situations in which the paternal information is often completely missing; an assumption that fails most of the current family-based studies. A mixture model-based penalized logistic regression is proposed for data sampled from a natural population. We develop a variable selection procedure to select significant genetic features. Simulation studies show that the model has high power and good false positive control under reasonable sample sizes and disease allele frequency. A case study of small for gestational age (SGA) is provided to show the utility of the proposed approach. Our model provides a powerful tool for dissecting genetic conflicts that increase disease risk during pregnancy, and offers a testable framework for the genetic conflict hypothesis previously proposed.

Journal ArticleDOI
TL;DR: This study extends previous work by examining analytically, via the non-centrality parameter of the asymptotic distributions of the chi-squared test and linear trend test, how differential genotype misclassification in SNP data inflates the rejection rate when case and control genotype frequencies do not differ.
Abstract: Genotyping error adversely affects the statistical power of case-control association studies and introduces bias in the estimated parameters when the same error mechanism and probabilities apply to both affected and unaffected individuals; that is, when there is non-differential genotype misclassification. Simulation studies have shown that differential genotype misclassification leads to a rejection rate that is higher than the nominal significance level (type I error rate) for some tests of association. This study extends previous work by examining this issue analytically using the non-centrality parameter of the asymptotic distribution of the chi-squared test and linear trend test (LTT) when there is no difference between case and control genotype frequencies, but there is differential misclassification with SNP data. The parameters examined are the minor allele frequency (MAF) and sample size. When MAF is less than 0.2, differential genotyping errors lead to a rejection rate much larger than the nominal significance level. As the MAF decreases to zero, the increase in the rejection rate becomes larger. The errors that most increase the rejection rate are differential recording of the more common homozygote as the other homozygote and differential recording of the more common homozygote as the heterozygote. The rejection rate increases as the sample size increases for fixed differential genotyping error rates and nominal significance level for each test.
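
The inflation described can be reproduced with a small simulation. This sketch is mine: it uses a plain chi-squared test on the 2 x 3 genotype table and a single, arbitrary differential error (a fraction of cases' common homozygotes recorded as heterozygotes), not the paper's analytic non-centrality calculation.

```python
import numpy as np
from scipy.stats import chi2_contingency

def hwe_probs(maf):
    """Genotype probabilities (common hom., het., rare hom.) under Hardy-Weinberg."""
    return np.array([(1 - maf)**2, 2 * maf * (1 - maf), maf**2])

def rejection_rate(maf=0.1, n=2000, err=0.02, n_sim=2000, alpha=0.05, seed=1):
    """Type I error of the 2x3 chi-squared test when true genotype frequencies are
    equal in cases and controls but only cases are misclassified."""
    rng = np.random.default_rng(seed)
    p = hwe_probs(maf)
    # differential error: fraction `err` of cases' common homozygotes -> heterozygotes
    p_cases = p + np.array([-err * p[0], err * p[0], 0.0])
    rejections = 0
    for _ in range(n_sim):
        cases = rng.multinomial(n, p_cases)
        controls = rng.multinomial(n, p)
        _, pval, _, _ = chi2_contingency(np.vstack([cases, controls]))
        rejections += pval < alpha
    return rejections / n_sim

print(rejection_rate())   # typically noticeably above the nominal 0.05
```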

Journal ArticleDOI
TL;DR: In this paper, a rotation test was proposed for estimating significance in GSEA of direct comparison data with a limited number of samples, which can in addition be used on indirect comparison data and for testing significance of other types of test statistics outside the GSEA framework.
Abstract: Gene Set Enrichment Analysis (GSEA) is a method for analysing gene expression data with a focus on a priori defined gene sets. The permutation test generally used in GSEA for testing the significance of gene set enrichment involves permutation of a phenotype vector and is developed for data from an indirect comparison design, i.e. unpaired data. In some studies the samples representing two phenotypes are paired, e.g. samples taken from a patient before and after treatment, or if samples representing two phenotypes are hybridised to the same two-channel array (direct comparison design). In this paper we will focus on data from direct comparison experiments, but the methods can be applied to paired data in general. For these types of data, a standard permutation test for paired data that randomly re-signs samples can be used. However, if the sample size is very small, which is often the case for a direct comparison design, a permutation test will give very imprecise estimates of the p-values. Here we propose using a rotation test rather than a permutation test for estimation of significance in GSEA of direct comparison data with a limited number of samples. Our proposed rotation test makes GSEA applicable to direct comparison data with few samples, by depending on rotations of the data instead of permutations. The rotation test is a generalisation of the permutation test, and can in addition be used on indirect comparison data and for testing significance of other types of test statistics outside the GSEA framework.
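
A minimal sketch of the rotation idea for one-sample (direct comparison) data, assuming approximately Gaussian log-ratios with mean zero under the null. The gene-set statistic (mean of gene-wise t-statistics), the way the rotation matrix is drawn, and all names here are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def gene_t(d):
    """One-sample t-statistics per gene; d is samples x genes log-ratios."""
    n = d.shape[0]
    return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n))

def random_rotation(n, rng):
    """Random orthogonal matrix (QR of a Gaussian matrix, sign-corrected)."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

def rotation_pvalue(d, gene_set, n_rot=999, seed=0):
    """p-value for enrichment of `gene_set` (column indices), based on rotations
    of the sample space instead of permutations or sign flips."""
    rng = np.random.default_rng(seed)
    observed = gene_t(d)[gene_set].mean()
    null = [gene_t(random_rotation(d.shape[0], rng) @ d)[gene_set].mean()
            for _ in range(n_rot)]
    return (1 + np.sum(np.abs(null) >= abs(observed))) / (n_rot + 1)
```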

Journal ArticleDOI
TL;DR: Sequential (group sequential/adaptive) designs are investigated for microarray experiments, and a meta-analysis approach is proposed for combining the results of interim analyses at different stages; the results show that applying sequential methods can reduce the number of microarrays without substantial loss of power.
Abstract: MOTIVATION: Transcriptomic studies using microarray technology have become a standard tool in life sciences in the last decade. Nevertheless the cost of these experiments remains high and forces scientists to work with small sample sizes at the expense of statistical power. In many cases, little or no prior knowledge on the underlying variability is available, which would allow an accurate estimation of the number of samples (microarrays) required to answer a particular biological question of interest. We investigate sequential methods, also called group sequential or adaptive designs in the context of clinical trials, for microarray analysis. Through interim analyses at different stages of the experiment and application of a stopping rule a decision can be made as to whether more samples should be studied or whether the experiment has yielded enough information already. RESULTS: The high dimensionality of microarray data facilitates the sequential approach. Since thousands of genes simultaneously contribute to the stopping decision, the marginal distribution of any single gene is nearly independent of the global stopping rule. For this reason, the interim analysis does not seriously bias the final p-values. We propose a meta-analysis approach to combining the results of the interim analyses at different stages. We consider stopping rules that are either based on the estimated number of true positives or on a sensitivity estimate and particularly discuss the difficulty of estimating the latter. We study this sequential method in an extensive simulation study and also apply it to several real data sets. The results show that applying sequential methods can reduce the number of microarrays without substantial loss of power. An R-package SequentialMA implementing the approach is available from the authors.
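
One standard way to combine stage-wise results is shown below as a hedged sketch: the weighted inverse-normal (Stouffer) method with weights proportional to the square root of the per-stage sample sizes. The SequentialMA package may combine stages differently; names and weighting are assumptions.

```python
import numpy as np
from scipy.stats import norm

def combine_stage_pvalues(p_stages, n_stages):
    """p_stages: list of per-gene one-sided p-value arrays, one per interim stage.
    n_stages: number of arrays analysed at each stage.
    Returns combined per-gene p-values via the weighted inverse-normal method."""
    w = np.sqrt(np.asarray(n_stages, dtype=float))
    w /= np.sqrt((w**2).sum())                              # squared weights sum to 1
    z = sum(wi * norm.isf(pi) for wi, pi in zip(w, p_stages))
    return norm.sf(z)

# e.g. two interim analyses with 8 and 6 new arrays:
# p_combined = combine_stage_pvalues([p_stage1, p_stage2], [8, 6])
```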

Journal ArticleDOI
TL;DR: Theoretical proof and/or simulation studies show that the GS Šidák procedure can have higher power than the GS Bonferroni procedure when their corresponding optimal weights are used, and that both of these GS procedures can have much higher power than the weighted Šidák and the weighted Bonferroni procedures.
Abstract: Multiple hypothesis testing is commonly used in genome research such as genome-wide studies and gene expression data analysis (Lin, 2005). The widely used Bonferroni procedure controls the family-wise error rate (FWER) for multiple hypothesis testing, but has limited statistical power as the number of hypotheses tested increases. The power of multiple testing procedures can be increased by using weighted p-values (Genovese et al., 2006). The weights for the p-values can be estimated by using certain prior information. Wasserman and Roeder (2006) described a weighted Bonferroni procedure, which incorporates weighted p-values into the Bonferroni procedure, and Rubin et al. (2006) and Wasserman and Roeder (2006) estimated the optimal weights that maximize the power of the weighted Bonferroni procedure under the assumption that the means of the test statistics in the multiple testing are known (these weights are called optimal Bonferroni weights). This weighted Bonferroni procedure controls FWER and can have higher power than the Bonferroni procedure, especially when the optimal Bonferroni weights are used. To further improve the power of the weighted Bonferroni procedure, first we propose a weighted Sidak procedure that incorporates weighted p-values into the Sidak procedure, and then we estimate the optimal weights that maximize the average power of the weighted Sidak procedure under the assumption that the means of the test statistics in the multiple testing are known (these weights are called optimal Sidak weights). This weighted Sidak procedure can have higher power than the weighted Bonferroni procedure. Second, we develop a generalized sequential (GS) Sidak procedure that incorporates weighted p-values into the sequential Sidak procedure (Scherrer, 1984). This GS Sidak procedure is an extension of and has higher power than the GS Bonferroni procedure of Holm (1979). Finally, under the assumption that the means of the test statistics in the multiple testing are known, we incorporate the optimal Sidak weights and the optimal Bonferroni weights into the GS Sidak procedure and the GS Bonferroni procedure, respectively. Theoretical proof and/or simulation studies show that the GS Sidak procedure can have higher power than the GS Bonferroni procedure when their corresponding optimal weights are used, and that both of these GS procedures can have much higher power than the weighted Sidak and the weighted Bonferroni procedures. All proposed procedures control the FWER well and are useful when prior information is available to estimate the weights.
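
For concreteness, here is a sketch of the single-step weighted thresholds in one common formulation (weights averaging to one); this is not the paper's code, and the generalized sequential (GS) versions apply such thresholds step by step and are not reproduced here.

```python
import numpy as np

def weighted_rejections(p, w, alpha=0.05):
    """Single-step weighted Bonferroni and weighted Sidak rejections.
    p: p-values; w: non-negative weights with mean 1 (sum equal to len(p))."""
    p, w = np.asarray(p), np.asarray(w)
    m = len(p)
    bonf = p <= alpha * w / m                     # weighted Bonferroni threshold
    sidak = p <= 1 - (1 - alpha) ** (w / m)       # weighted Sidak (independence)
    return bonf, sidak
```

Both rules control the FWER at alpha when the weights sum to the number of tests (the Šidák version additionally assumes independence), which is what allows prior information to shift power toward promising hypotheses.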

Journal ArticleDOI
TL;DR: In this article, the exact value of the variance of the D2 statistic for the case of a uniform letter distribution is computed, and a method to provide accurate approximations to the variance in the remaining cases is introduced.
Abstract: Word matches are often used in sequence comparison methods, either as a measure of sequence similarity or in the first search steps of algorithms such as BLAST or BLAT. The D2 statistic is the number of matches of words of k letters between two sequences. Recent advances have been made in the characterization of this statistic and in the approximation of its distribution. Here, these results are extended to the case of approximate word matches. We compute the exact value of the variance of the D2 statistic for the case of a uniform letter distribution, and introduce a method to provide accurate approximations of the variance in the remaining cases. This enables the distribution of D2 to be approximated for typical situations arising in biological research. We apply these results to the identification of cis-regulatory modules, and show that this method detects such sequences with a high accuracy. The ability to approximate the distribution of D2 for both exact and approximate word matches will enable the use of this statistic in a more precise manner for sequence comparison, database searches, and identification of transcription factor binding sites.
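
The exact-match version of the statistic is easy to state in code. This sketch, which counts shared k-word occurrences via a product of word counts, is mine and does not cover the approximate-match extension analysed in the paper.

```python
from collections import Counter

def d2_exact(seq_a: str, seq_b: str, k: int) -> int:
    """D2 = number of pairs of positions (i, j) at which the k-word starting at
    position i of seq_a equals the k-word starting at position j of seq_b."""
    words_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    words_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    return sum(count * words_b[word] for word, count in words_a.items())

print(d2_exact("ACGTACGT", "TACGTT", 3))
```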

Journal ArticleDOI
TL;DR: A statistical algorithm is presented for constructing a joint linkage-linkage disequilibrium map from a set of random unrelated families, each including a father, a mother and a varying number of offspring, sampled from a population at Hardy-Weinberg equilibrium.
Abstract: The extent and pattern of linkage disequilibrium (LD) determine the feasibility of association studies to map genes that underlie complex traits. Here we present a statistical algorithm for constructing a joint linkage-linkage disequilibrium map by simultaneously estimating the recombination fraction and linkage disequilibrium between different molecular markers in a natural human population. This algorithm was devised with a set of random unrelated families, each including a father, a mother and a varying number of offspring, sampled from a population at Hardy-Weinberg equilibrium. A two-level hierarchical mixture model framework was built, in which the likelihood of genotype data for the parents was formulated in terms of linkage disequilibrium at an upper level, whereas the likelihood of genetic transmission from the parents to offspring was formulated in terms of the recombination fraction at a lower level. The EM algorithm was implemented to obtain a closed system of maximum likelihood estimates of marker co-segregation and co-transmission. The model allows a number of testable hypotheses about population genetic parameters, opening a broad gateway to understand the genetic structure and dynamics of an outcrossing population under natural selection. The new strategy will provide a platform for studying the genetic control of inherited diseases in which genetic material is accurately copied before being passed onto the offspring from a parent.

Journal ArticleDOI
TL;DR: A normal random effects model that allows for right-censored observations and includes covariates is applied, with likelihood-based inference, to the hereditary nonpolyposis colorectal cancer/Lynch syndrome family cohort from the national Danish HNPCC register.
Abstract: Anticipation, i.e. a decreasing age-at-onset in subsequent generations has been observed in a number of genetically triggered diseases. The impact of anticipation is generally studied in affected parent-child pairs. These analyses are restricted to pairs in which both individuals have been affected and are sensitive to right truncation of the data. We propose a normal random effects model that allows for right-censored observations and includes covariates, and draw statistical inference based on the likelihood function. We applied the model to the hereditary nonpolyposis colorectal cancer (HNPCC)/Lynch syndrome family cohort from the national Danish HNPCC register. Age-at-onset was analyzed in 824 individuals from 2-4 generations in 125 families with proved disease-predisposing mutations. A significant effect from anticipation was identified with a mean of 3 years earlier age-at-onset per generation. The suggested model corrects for incomplete observations and considers families rather than affected pairs and thereby allows for studies of large sample sets, facilitates subgroup analyses and provides generation effect estimates.
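
In schematic form (notation mine), the model treats age-at-onset within a family as normally distributed around a generation-dependent mean with a family-level random effect:

```latex
y_{ij} = \mu + \beta\, g_{ij} + x_{ij}^{\top}\gamma + b_i + \varepsilon_{ij},
\qquad b_i \sim N(0,\sigma_b^{2}),\quad \varepsilon_{ij} \sim N(0,\sigma^{2}),
```

where y_ij is the age-at-onset of individual j in family i, g_ij the generation, x_ij covariates, and beta the anticipation effect (estimated in the study at about 3 years earlier onset per generation). Right-censored individuals contribute P(Y_ij > c_ij) to the likelihood instead of the normal density, which is what corrects for incomplete observation.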

Journal ArticleDOI
TL;DR: A novel neighboring sites model is introduced as an alternative methodology for considering dependence in methylation patterns, and three models are compared in their ability to generate simulated sequences statistically similar to Sat2 and NBL2 carcinoma samples.
Abstract: Changes in cytosine methylation at CpG nucleotides are observed in many cancers and offer great potential for translational research. Diseases such as ovarian cancer that are especially challenging to diagnose and treat are of particular interest, and abnormal methylation in the tandem repeats Sat2 and NBL2 has been observed in a collection of ovarian carcinomas. In earlier analyses of double-stranded methylation patterns in 0.2 kb regions of Sat2 and NBL2, we detected clusters of identically methylated sites in close proximity. These clusters could not be explained by random variation, and our findings suggested a high degree of site-to-site dependence. However, previously developed stochastic models for methylation change have either treated CpG sites independently or employed a context dependent approach to adjust model parameters according to regional methylation levels. In this paper, we introduce a novel neighboring sites model as an alternative methodology for considering dependence in methylation patterns, and we compare the three models in their ability to generate simulated sequences statistically similar to our Sat2 and NBL2 carcinoma samples.

Journal ArticleDOI
TL;DR: A non-homogeneous hidden-state model based on first order differences of experimental data along genomic coordinates that bypasses the need for local detrending and can automatically detect nucleosome positions of various occupancy levels is proposed.
Abstract: The ability to map individual nucleosomes accurately across genomes enables the study of relationships between dynamic changes in nucleosome positioning/occupancy and gene regulation. However, the highly heterogeneous nature of nucleosome densities across genomes and short linker regions pose challenges in mapping nucleosome positions based on high-throughput microarray data of micrococcal nuclease (MNase) digested DNA. Previous works rely on additional detrending and careful visual examination to detect low-signal nucleosomes, which may exist in a subpopulation of cells. We propose a non-homogeneous hidden-state model based on first order differences of experimental data along genomic coordinates that bypasses the need for local detrending and can automatically detect nucleosome positions of various occupancy levels. Our proposed approach is applicable to both low and high resolution MNase-Chip and MNase-Seq (high throughput sequencing) data, and is able to map nucleosome-linker boundaries accurately. This automated algorithm is also computationally efficient and only requires a simple preprocessing step. We provide several examples illustrating the pitfalls of existing methods, the difficulties of detrending the observed hybridization signals and demonstrate the advantages of utilizing first order differences in detecting nucleosome occupancies via simulations and case studies involving MNase-Chip and MNase-Seq data of nucleosome occupancy in yeast S. cerevisiae.

Journal ArticleDOI
TL;DR: A hierarchical mixture model framework is proposed to simultaneously identify non-differentially expressed genes and normalize arrays using these genes, and the Fisher information matrix corresponding to array effects is derived, which provides useful intuition for guiding the choice of array normalization method.
Abstract: Normalization is an important step in the analysis of microarray data of transcription profiles as systematic non-biological variations often arise from the multiple steps involved in any transcription profiling experiment. Existing methods for data normalization often assume that differentially expressed genes are few or that differential expression is symmetric, but this assumption does not always hold. Alternatively, non-differentially expressed genes may be used for array normalization. However, it is unknown at the outset which genes are non-differentially expressed. In this paper we propose a hierarchical mixture model framework to simultaneously identify non-differentially expressed genes and normalize arrays using these genes. The Fisher information matrix corresponding to array effects is derived, which provides useful intuition for guiding the choice of array normalization method. The operating characteristics of the proposed method are evaluated using simulated data. The simulations conducted under a wide range of parametric configurations suggest that the proposed method provides a useful alternative for array normalization. For example, the proposed method has better sensitivity than median normalization under modest prevalence of differentially expressed genes and when the magnitudes of over-expression and under-expression are not the same. Further, the proposed method has properties similar to median normalization when the prevalence of differentially expressed genes is very small. Empirical illustration of the proposed method is provided using a liposarcoma study from MSKCC to identify genes differentially expressed between normal fat tissue and liposarcoma tissue samples.

Journal ArticleDOI
TL;DR: A genetic signature for the basal-like subtype of breast cancer, found across a number of previous gene expression array studies, is reported; this signature also arises from clustering on the microRNA expression data and appears to be derived from these data.
Abstract: We propose Bayesian generative models for unsupervised learning with two types of data and an assumed dependency of one type of data on the other. We consider two algorithmic approaches, based on a correspondence model, where latent variables are shared across datasets. These models indicate the appropriate number of clusters in addition to indicating relevant features in both types of data. We evaluate the model on artificially created data. We then apply the method to a breast cancer dataset consisting of gene expression and microRNA array data derived from the same patients. We assume partial dependence of gene expression on microRNA expression in this study. The method ranks genes within subtypes which have statistically significant abnormal expression and ranks associated abnormally expressing microRNA. We report a genetic signature for the basal-like subtype of breast cancer found across a number of previous gene expression array studies. Using the two algorithmic approaches we find that this signature also arises from clustering on the microRNA expression data and appears derivative from this data.

Journal ArticleDOI
Reiji Teramoto
TL;DR: The authors propose balanced gradient boosting (BalaBoost), which reformulates gradient boosting to avoid overfitting to the majority class and to remain sensitive to the minority class by using the equal class distribution instead of the empirical class distribution.
Abstract: In clinical outcome prediction, such as disease diagnosis and prognosis, it is often assumed that the class, e.g., disease and control, is equally distributed. However, in practice we often encounter biological or clinical data whose class distribution is highly skewed. Since standard supervised learning algorithms intend to maximize the overall prediction accuracy, a prediction model tends to show a strong bias toward the majority class when it is trained on such imbalanced data. Therefore, the class distribution should be incorporated appropriately to learn from imbalanced data. To address this practically important problem, we proposed balanced gradient boosting (BalaBoost) which reformulates gradient boosting to avoid the overfitting to the majority class and is sensitive to the minority class by making use of the equal class distribution instead of the empirical class distribution. We applied BalaBoost to cancer tissue diagnosis based on miRNA expression data, premature death prediction for diabetes patients based on biochemical and clinical variables and tumor grade prediction of renal cell carcinoma based on tumor marker expressions whose class distribution is highly skewed. Experimental results showed that BalaBoost outperformed the representative supervised learning algorithms, i.e., gradient boosting, Random Forests and Support Vector Machine. Our results led us to the conclusion that BalaBoost is promising for clinical outcome prediction from imbalanced data.
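
BalaBoost itself reformulates the boosting updates; as a loose, hedged approximation of the underlying idea (making both classes contribute equally rather than in proportion to their empirical frequencies), one can reweight samples inversely to class frequency in an off-the-shelf gradient boosting fit. This is not the authors' algorithm, and all names below are mine.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def balanced_weights(y):
    """Sample weights that give each class equal total weight."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    per_class = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([per_class[v] for v in y])

def fit_balanced_gbm(X, y, **kwargs):
    """Gradient boosting fitted with class-balanced sample weights."""
    clf = GradientBoostingClassifier(**kwargs)
    clf.fit(X, y, sample_weight=balanced_weights(y))
    return clf
```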

Journal ArticleDOI
TL;DR: A modified marginal Benjamini & Hochberg step-up FDR controlling procedure for multi-stage analyses (FDR-MSA), which correctly controls Type I error in terms of the entire variable set when only a subset of the initial set of variables is tested.
Abstract: Multiple testing has become an integral component in genomic analyses involving microarray experiments where a large number of hypotheses are tested simultaneously. However, before applying more computationally intensive methods, it is often desirable to complete an initial truncation of the variable set using a simpler and faster supervised method such as univariate regression. Once such a truncation is completed, multiple testing methods applied to any subsequent analysis no longer control the appropriate Type I error rates. Here we propose a modified marginal Benjamini & Hochberg step-up FDR controlling procedure for multi-stage analyses (FDR-MSA), which correctly controls Type I error in terms of the entire variable set when only a subset of the initial set of variables is tested. The method is presented with respect to a variable importance application. As the initial subset size increases, we observe convergence to the standard Benjamini & Hochberg step-up FDR controlling multiple testing procedures. We demonstrate the power and Type I error control through simulation and application to the Golub Leukemia data from 1999.
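
For reference, a minimal sketch of the standard Benjamini-Hochberg step-up procedure that FDR-MSA modifies; how the procedure is adjusted to account for the variables screened out at the first stage is described in the paper and not reproduced here.

```python
import numpy as np

def bh_stepup(pvals, q=0.05):
    """Standard Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest i with p_(i) <= i * q / m. Returns a boolean mask."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])        # largest index satisfying the bound
        reject[order[:k + 1]] = True
    return reject
```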

Journal ArticleDOI
TL;DR: A new method based on M-quantile regression is presented to detect genes whose temporal expression is significantly different across a number of biological conditions, and it is applied to detect differentially expressed genes from time-course microarray data on muscular dystrophy.
Abstract: In this paper, we explore the use of M-quantile regression and M-quantile coefficients to detect statistical differences between temporal curves that belong to different experimental conditions. In particular, we consider the application of temporal gene expression data. Here, the aim is to detect genes whose temporal expression is significantly different across a number of biological conditions. We present a new method to approach this problem. Firstly, the temporal profiles of the genes are modelled by a parametric M-quantile regression model. This model is particularly appealing to small-sample gene expression data, as it is very robust against outliers and it does not make any assumption on the error distribution. Secondly, we further increase the robustness of the method by summarising the M-quantile regression models for a large range of quantile values into an M-quantile coefficient. Finally, we fit a polynomial M-quantile regression model to the M-quantile coefficients over time and employ a Hotelling T(2)-test to detect significant differences of the temporal M-quantile coefficients profiles across conditions. Extensive simulations show the increased power and robustness of M-quantile regression methods over standard regression methods and over some of the previously published methods. We conclude by applying the method to detect differentially expressed genes from time-course microarray data on muscular dystrophy.

Journal ArticleDOI
TL;DR: Bayesian methods to estimate selection intensity under k-allele models with overdominance are presented and demonstrated with data at the Human Leukocyte Antigen loci from world-wide populations.
Abstract: A balanced pattern in the allele frequencies of polymorphic loci is a potential sign of selection, particularly of overdominance. Although this type of selection is of some interest in population genetics, there exist no likelihood-based approaches specifically tailored to make inference on selection intensity. To fill this gap, we present Bayesian methods to estimate selection intensity under k-allele models with overdominance. Our model allows for an arbitrary number of loci and alleles within a locus. The neutral and selected variability within each locus are modeled with corresponding k-allele models. To estimate the posterior distribution of the mean selection intensity in a multilocus region, a hierarchical setup between loci is used. The methods are demonstrated with data at the Human Leukocyte Antigen loci from world-wide populations.

Journal ArticleDOI
TL;DR: A new statistical model for detecting specific DNA sequence variants that are responsible for viral infection is proposed, integrating, for the first time, the epidemiological principle of viral infection into genetic mapping.
Abstract: Large-scale studies of genetic variation may be helpful for understanding the genetic control mechanisms of viral infection and, ultimately, predicting and eliminating infectious disease outbreaks. We propose a new statistical model for detecting specific DNA sequence variants that are responsible for viral infection. This model considers additive, dominance and epistatic effects of haplotypes from three different genomes, recipient, transmitter and virus, through an epidemiological process. The model is constructed within the maximum likelihood framework and implemented with the EM algorithm. A number of hypothesis tests about population genetic structure and diversity and the pattern of genetic control are formulated. A series of closed forms for the EM algorithm to estimate haplotype frequencies and haplotype effects in a network of genetic interactions among three genomes are derived. Simulation studies were performed to test the statistical properties of the model, recommending necessary sample sizes for obtaining reasonably good accuracy and precision of parameter estimation. By integrating, for the first time, the epidemiological principle of viral infection into genetic mapping, the new model shall find an immediate application to studying the genetic architecture of viral infection.

Journal ArticleDOI
TL;DR: A method is demonstrated for including duplicate genotype data in linear trend tests of genetic association which yields increased power, and it is found that when the ratio of genotyping cost to phenotyping and sample acquisition costs is less than or equal to the genotyping error rate, it is more powerful to duplicate genotype the entire sample.
Abstract: The genome-wide association (GWA) study is an increasingly popular way to attempt to identify the causal variants in human disease. Duplicate genotyping (or re-genotyping) a portion of the samples in a GWA study is common, though it is typical for these data to be ignored in subsequent tests of genetic association. We demonstrate a method for including duplicate genotype data in linear trend tests of genetic association which yields increased power. We also consider the cost-effectiveness of collecting duplicate genotype data and find that when the relative cost of genotyping to phenotyping and sample acquisition costs is less than or equal to the genotyping error rate it is more powerful to duplicate genotype the entire sample instead of spending the same money to increase the sample size. Duplicate genotyping is particularly cost-effective when SNP minor allele frequencies are low. Practical advice for the implementation of duplicate genotyping is provided. Free software is provided to compute asymptotic and permutation based tests of association using duplicate genotype data as well as to aid in the duplicate genotyping design decision.
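
The cost-effectiveness rule quoted above translates directly into a small helper; the function below is a hedged paraphrase of that rule (names mine), not the authors' software.

```python
def prefer_duplicate_genotyping(cost_genotyping, cost_pheno_and_sample, error_rate):
    """Following the rule stated in the abstract: duplicate genotyping the whole
    sample is more powerful than enlarging the sample when the ratio of genotyping
    cost to phenotyping-plus-acquisition cost is at most the genotyping error rate."""
    return (cost_genotyping / cost_pheno_and_sample) <= error_rate

# e.g. genotyping at $40/sample, phenotyping and recruitment at $4000/sample,
# and a 1% genotyping error rate:
print(prefer_duplicate_genotyping(40, 4000, 0.01))   # True
```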

Journal ArticleDOI
TL;DR: A new efficiency robust procedure, referred to as adaptive TDT (aTDT), that uses the Hardy-Weinberg disequilibrium coefficient to identify the potential genetic model underlying the data and then applies the TDT-type test corresponding to the selected model.
Abstract: The transmission disequilibrium test (TDT) is a standard method to detect association using the family trio design. It is optimal for an additive genetic model. Other TDT-type tests optimal for recessive and dominant models have also been developed. Association tests using family data, including the TDT-type statistics, have been unified into a class of more comprehensive and flexible family-based association tests (FBAT). TDT-type tests have high efficiency when the genetic model is known or correctly specified, but may lose power if the model is mis-specified. Hence tests that are robust to genetic model mis-specification yet efficient are preferred. The constrained likelihood ratio test (CLRT) and MAX-type test have been shown to be efficiency robust. In this paper we propose a new efficiency robust procedure, referred to as adaptive TDT (aTDT). It uses the Hardy-Weinberg disequilibrium coefficient to identify the potential genetic model underlying the data and then applies the TDT-type test (or FBAT for general applications) corresponding to the selected model. Simulation demonstrates that aTDT is efficiency robust to model mis-specifications and generally outperforms the MAX test and CLRT in terms of power. We also show that aTDT has power close to, but is much more robust than, the optimal TDT-type test based on a single genetic model. Applications to real and simulated data from the Genetic Analysis Workshop (GAW) illustrate the use of our adaptive TDT.
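
aTDT selects among TDT-type statistics; for orientation, the classical (additive-model) TDT for trio data is shown below as a sketch with illustrative names.

```python
from scipy.stats import chi2

def tdt_statistic(b, c):
    """Classical TDT: b = number of heterozygous parents transmitting the risk allele,
    c = number transmitting the other allele. McNemar-type chi-squared with 1 df."""
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

print(tdt_statistic(60, 40))   # e.g. 60 vs 40 transmissions from heterozygous parents
```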

Journal ArticleDOI
TL;DR: In this article, the authors provide an analytic solution to the incorporation of locus heterogeneity into power and sample size calculations for the TDT statistic, and verify their analytic solution with simulations.
Abstract: Locus heterogeneity is one of the most important issues in gene mapping and can cause significant reductions in statistical power for gene mapping, yet no research to date has provided power and sample size calculations for family-based association methods in the presence of locus heterogeneity. The purpose of this research is three-fold: (i) to provide an analytic solution to the incorporation of locus heterogeneity into power and sample size calculations for the TDT statistic; (ii) to verify our analytic solution with simulations; and (iii) to study how different factors affect sample size requirement for the TDT in the presence of locus heterogeneity. The detection of association in the presence of locus heterogeneity requires a greater sample size than in its absence. This increase is independent of the prevalence of the disease. In addition, as the proportion of families unlinked to the disease locus increases, the sample size necessary to maintain constant power increases. Finally, as the effect size of the disease locus increases, the sample size necessary to detect association decreases in the presence of locus heterogeneity. We provide freely available software that can perform these calculations.