
Showing papers in "Biometrics in 2014"


Journal ArticleDOI
TL;DR: This work derives a Bayesian meta-analytic-predictive prior from historical data, which is then combined with the new data, and proposes two- or three-component mixtures of standard priors, which allow for good approximations and, for the one-parameter exponential family, straightforward posterior calculations.
Abstract: Historical information is always relevant for clinical trial design. Additionally, if incorporated in the analysis of a new trial, historical data allow the number of subjects to be reduced. This decreases costs and trial duration, facilitates recruitment, and may be more ethical. Yet, under prior-data conflict, an overly optimistic use of historical data may be inappropriate. We address this challenge by deriving a Bayesian meta-analytic-predictive prior from historical data, which is then combined with the new data. This prospective approach is equivalent to a meta-analytic-combined analysis of historical and new data if parameters are exchangeable across trials. The prospective Bayesian version requires a good approximation of the meta-analytic-predictive prior, which is not available analytically. We propose two- or three-component mixtures of standard priors, which allow for good approximations and, for the one-parameter exponential family, straightforward posterior calculations. Moreover, since one of the mixture components is usually vague, mixture priors will often be heavy-tailed and therefore robust. Further robustness and a more rapid reaction to prior-data conflicts can be achieved by adding an extra weakly-informative mixture component. Use of historical prior information is particularly attractive for adaptive trials, as the randomization ratio can then be changed in case of prior-data conflict. Both frequentist operating characteristics and posterior summaries for various data scenarios show that these designs have desirable properties. We illustrate the methodology for a phase II proof-of-concept trial with historical controls from four studies. Robust meta-analytic-predictive priors alleviate prior-data conflicts; they should encourage better and more frequent use of historical data in clinical trials.
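The mixture-prior mechanics described above can be illustrated with a minimal sketch (not the authors' code): for a binomial endpoint, each Beta component is updated conjugately and its mixture weight is re-weighted by that component's beta-binomial marginal likelihood, so the vague component absorbs weight under prior-data conflict. All prior parameters and data below are illustrative assumptions.

```python
# Minimal sketch of a conjugate Beta-mixture prior update for a binomial endpoint,
# with one informative component (standing in for a meta-analytic-predictive prior)
# and one vague "robustifying" component. Numbers are illustrative assumptions.
from scipy.stats import betabinom

def update_beta_mixture(weights, params, y, n):
    """Posterior of a Beta-mixture prior after observing y responders out of n.

    weights: prior mixture weights (sum to 1)
    params:  list of (a, b) Beta parameters, one pair per component
    Each component is updated conjugately; its weight is multiplied by the
    component's beta-binomial marginal likelihood and then renormalized.
    """
    marg = [w * betabinom.pmf(y, n, a, b) for w, (a, b) in zip(weights, params)]
    new_weights = [m / sum(marg) for m in marg]
    new_params = [(a + y, b + n - y) for (a, b) in params]
    return new_weights, new_params

# Informative Beta(15, 35) vs. vague Beta(1, 1), prior weights 0.9 / 0.1.
w, p = update_beta_mixture([0.9, 0.1], [(15, 35), (1, 1)], y=20, n=40)
print(w, p)  # under prior-data conflict, posterior mass shifts toward the vague component
```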

282 citations


Journal ArticleDOI
TL;DR: Simulation results show that the proposed lower bounds always reduce bias over the traditional lower bounds and improve accuracy (as measured by mean squared error) when the heterogeneity of species abundances is relatively high.
Abstract: Summary. It is difficult to accurately estimate species richness if there are many almost undetectable species in a hyperdiverse community. Practically, an accurate lower bound for species richness is preferable to an inaccurate point estimator. The traditional nonparametric lower bound developed by Chao (1984, Scandinavian Journal of Statistics 11, 265–270) for individual-based abundance data uses only the information on the rarest species (the numbers of singletons and doubletons) to estimate the number of undetected species in samples. Applying a modified Good–Turing frequency formula, we derive an approximate formula for the first-order bias of this traditional lower bound. The approximate bias is estimated by using additional information (namely, the numbers of tripletons and quadrupletons). This approximate bias can be corrected, and an improved lower bound is thus obtained. The proposed lower bound is nonparametric in the sense that it is universally valid for any species abundance distribution. A similar type of improved lower bound can be derived for incidence data. We test our proposed lower bounds on simulated data sets generated from various species abundance models. Simulation results show that the proposed lower bounds always reduce bias over the traditional lower bounds and improve accuracy (as measured by mean squared error) when the heterogeneity of species abundances is relatively high. We also apply the proposed new lower bounds to real data for illustration and for comparisons with previously developed estimators.
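For reference, the traditional lower bound referred to above is the Chao1 estimator, built from the singleton count f1 and doubleton count f2; the paper's improvement estimates the first-order bias of this bound using the tripleton and quadrupleton counts f3 and f4. A commonly quoted form of the improved bound is shown second; since the abstract does not give the formula, treat that line as an assumption rather than a verbatim statement of the paper.

```latex
% Traditional Chao1 lower bound (singletons f_1, doubletons f_2):
\[
\hat{S}_{\mathrm{Chao1}} \;=\; S_{\mathrm{obs}} + \frac{f_1^{2}}{2 f_2} \qquad (f_2 > 0).
\]
% A commonly quoted form of the bias-corrected ("improved") bound, which additionally
% uses tripletons f_3 and quadrupletons f_4 (assumed here; not given in the abstract):
\[
\hat{S}_{\mathrm{improved}} \;=\; \hat{S}_{\mathrm{Chao1}}
  \;+\; \frac{f_3}{4 f_4}\,\max\!\Big(f_1 - \frac{f_2 f_3}{2 f_4},\; 0\Big) \qquad (f_4 > 0).
\]
```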

205 citations


Journal ArticleDOI
TL;DR: The linear noise approximation (LNA) is applied to analyze Google Flu Trends data from the North and South Islands of New Zealand, and is able to obtain more accurate short-term forecasts of new flu cases than another recently proposed method, although at a greater computational cost.
Abstract: We consider inference for the reaction rates in discretely observed networks such as those found in models for systems biology, population ecology, and epidemics. Most such networks are neither slow enough nor small enough for inference via the true state-dependent Markov jump process to be feasible. Typically, inference is conducted by approximating the dynamics through an ordinary differential equation (ODE) or a stochastic differential equation (SDE). The former ignores the stochasticity in the true model and can lead to inaccurate inferences. The latter is more accurate but is harder to implement as the transition density of the SDE model is generally unknown. The linear noise approximation (LNA) arises from a first-order Taylor expansion of the approximating SDE about a deterministic solution and can be viewed as a compromise between the ODE and SDE models. It is a stochastic model, but discrete time transition probabilities for the LNA are available through the solution of a series of ordinary differential equations. We describe how a restarting LNA can be efficiently used to perform inference for a general class of reaction networks; evaluate the accuracy of such an approach; and show how and when this approach is either statistically or computationally more efficient than ODE or SDE methods. We apply the LNA to analyze Google Flu Trends data from the North and South Islands of New Zealand, and are able to obtain more accurate short-term forecasts of new flu cases than another recently proposed method, although at a greater computational cost.
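As a pointer to the construction the abstract refers to, the LNA writes the state as a deterministic path plus a Gaussian perturbation; the sketch below follows the standard form of this expansion (the notation is an assumption, not necessarily the paper's).

```latex
% Standard LNA construction: x(t) = \phi(t) + \eta(t), with S the stoichiometry matrix,
% h the rate (hazard) vector, F(t) the Jacobian of S h(\cdot) evaluated at \phi(t),
% and W a Wiener process.
\[
\frac{d\phi}{dt} = S\,h\big(\phi(t)\big), \qquad
d\eta(t) = F(t)\,\eta(t)\,dt
  + \sqrt{S\,\mathrm{diag}\!\big\{h\big(\phi(t)\big)\big\}\,S^{\top}}\;dW(t).
\]
% Since \eta(t) is Gaussian, discrete-time transition densities follow from ODEs for its
% mean and covariance, which is what makes likelihood evaluation tractable.
```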

99 citations


Journal ArticleDOI
TL;DR: In this paper, parametric and non-parametric approaches for the construction of confidence intervals for the pair of sensitivity and specificity proportions that correspond to the Youden index-based optimal cutoff point are presented.
Abstract: After establishing the utility of a continuous diagnostic marker investigators will typically address the question of determining a cut-off point which will be used for diagnostic purposes in clinical decision making. The most commonly used optimality criterion for cut-off point selection in the context of ROC curve analysis is the maximum of the Youden index. The pair of sensitivity and specificity proportions that correspond to the Youden index-based cut-off point characterize the performance of the diagnostic marker. Confidence intervals for sensitivity and specificity are routinely estimated based on the assumption that sensitivity and specificity are independent binomial proportions as they arise from the independent populations of diseased and healthy subjects, respectively. The Youden index-based cut-off point is estimated from the data and as such the resulting sensitivity and specificity proportions are in fact correlated. This correlation needs to be taken into account in order to calculate confidence intervals that result in the anticipated coverage. In this article we study parametric and non-parametric approaches for the construction of confidence intervals for the pair of sensitivity and specificity proportions that correspond to the Youden index-based optimal cut-off point. These approaches result in the anticipated coverage under different scenarios for the distributions of the healthy and diseased subjects. We find that a parametric approach based on a Box-Cox transformation to normality often works well. For biomarkers following more complex distributions a non-parametric procedure using logspline density estimation can be used.
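A minimal empirical sketch (not the paper's parametric or logspline procedures) of the quantities involved: the Youden-index cutoff maximizes J(c) = Se(c) + Sp(c) - 1, and the reported (Se, Sp) pair is read off at that estimated cutoff, which is why the two proportions are correlated. The simulated data and the convention that higher marker values indicate disease are assumptions.

```python
# Empirical Youden-index cutoff and the corresponding (sensitivity, specificity) pair.
import numpy as np

rng = np.random.default_rng(1)
healthy = rng.normal(0.0, 1.0, 200)   # assumed healthy-group marker values
diseased = rng.normal(1.5, 1.2, 150)  # assumed diseased-group marker values

cutoffs = np.unique(np.concatenate([healthy, diseased]))
se = np.array([(diseased >= c).mean() for c in cutoffs])  # sensitivity at each cutoff
sp = np.array([(healthy < c).mean() for c in cutoffs])    # specificity at each cutoff
j = se + sp - 1.0                                         # Youden index J(c)
best = j.argmax()
print(cutoffs[best], se[best], sp[best])
# Because the cutoff itself is estimated, Se and Sp at this cutoff are correlated,
# so joint (not independent-binomial) confidence intervals are needed.
```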

90 citations


Journal ArticleDOI
TL;DR: It is concluded that in emerging and time-critical outbreaks, nowcasting approaches are a valuable tool to gain information about current trends.
Abstract: A Bayesian approach to the prediction of occurred-but-not-yet-reported events is developed for application in real-time public health surveillance. The motivation was the prediction of the daily number of hospitalizations for the hemolytic-uremic syndrome during the large May-July 2011 outbreak of Shiga toxin-producing Escherichia coli (STEC) O104:H4 in Germany. Our novel Bayesian approach addresses the count data nature of the problem using negative binomial sampling and shows that right-truncation of the reporting delay distribution under an assumption of time-homogeneity can be handled in a conjugate prior-posterior framework using the generalized Dirichlet distribution. Since, in retrospect, the true number of hospitalizations is available, proper scoring rules for count data are used to evaluate and compare the predictive quality of the procedures during the outbreak. The results show that it is important to take the count nature of the time series into account and that changes in the delay distribution occurred due to intervention measures. As a consequence, we extend the Bayesian analysis to a hierarchical model, which combines a discrete time survival regression model for the delay distribution with a penalized spline for the dynamics of the epidemic curve. Altogether, we conclude that in emerging and time-critical outbreaks, nowcasting approaches are a valuable tool to gain information about current trends.
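A deliberately simplified sketch of the nowcasting idea, assuming a known reporting-delay distribution: scale up the cases already reported for each recent day by the probability of having been reported by now. The paper replaces this plug-in correction with negative binomial sampling, a generalized Dirichlet treatment of the right-truncated delay distribution, and later a hierarchical survival/spline model; the numbers below are illustrative.

```python
# Naive inverse-reporting-probability nowcast (a simplification, not the paper's model).
import numpy as np

delay_pmf = np.array([0.30, 0.25, 0.20, 0.15, 0.10])  # assumed P(delay = d), d = 0..4 days
reported_so_far = np.array([42, 35, 20, 9])            # counts reported for the last 4 days

def naive_nowcast(reported, delay_pmf):
    cdf = np.cumsum(delay_pmf)
    est = []
    for days_ago, n_rep in enumerate(reversed(reported)):
        p_reported = cdf[min(days_ago, len(cdf) - 1)]  # P(reported within 'days_ago' days)
        est.append(n_rep / p_reported)                 # inverse-probability correction
    return list(reversed(est))

print(naive_nowcast(reported_so_far, delay_pmf))
# Recent days are scaled up the most, since only a small share of their cases
# has had time to be reported.
```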

81 citations


Journal ArticleDOI
TL;DR: A method for combining markers for treatment selection which requires modeling the treatment effect as a function of markers is proposed and compared to existing methods in a simulation study based on the change in expected outcome under marker‐based treatment.
Abstract: Markers that predict treatment effect have the potential to improve patient outcomes. For example, the OncotypeDX® RecurrenceScore® has some ability to predict the benefit of adjuvant chemotherapy over and above hormone therapy for the treatment of estrogen-receptor-positive breast cancer, facilitating the provision of chemotherapy to women most likely to benefit from it. Given that the score was originally developed for predicting outcome given hormone therapy alone, it is of interest to develop alternative combinations of the genes comprising the score that are optimized for treatment selection. However, most methodology for combining markers is useful when predicting outcome under a single treatment. We propose a method for combining markers for treatment selection which requires modeling the treatment effect as a function of markers. Multiple models of treatment effect are fit iteratively by upweighting or "boosting" subjects potentially misclassified according to treatment benefit at the previous stage. The boosting approach is compared to existing methods in a simulation study based on the change in expected outcome under marker-based treatment. The approach improves upon methods in some settings and has comparable performance in others. Our simulation study also provides insights as to the relative merits of the existing methods. Application of the boosting approach to the breast cancer data, using scaled versions of the original markers, produces marker combinations that may have improved performance for treatment selection.

78 citations


Journal ArticleDOI
TL;DR: Simulation study results demonstrate the IPW estimators can yield unbiased estimates of the direct, indirect, total, and overall effects of vaccination when there is interference provided the untestable no unmeasured confounders assumption holds and the group‐level propensity score model is correctly specified.
Abstract: Interference occurs when the treatment of one person affects the outcome of another. For example, in infectious diseases, whether one individual is vaccinated may affect whether another individual becomes infected or develops disease. Quantifying such indirect (or spillover) effects of vaccination could have important public health or policy implications. In this article we use recently developed inverse-probability weighted (IPW) estimators of treatment effects in the presence of interference to analyze an individually-randomized, placebo-controlled trial of cholera vaccination that targeted 121,982 individuals in Matlab, Bangladesh. Because these IPW estimators have not been employed previously, a simulation study was also conducted to assess the empirical behavior of the estimators in settings similar to the cholera vaccine trial. Simulation study results demonstrate the IPW estimators can yield unbiased estimates of the direct, indirect, total, and overall effects of vaccination when there is interference provided the untestable no unmeasured confounders assumption holds and the group-level propensity score model is correctly specified. Application of the IPW estimators to the cholera vaccine trial indicates the presence of interference. For example, the IPW estimates suggest on average 5.29 fewer cases of cholera per 1000 person-years (95% confidence interval 2.61, 7.96) will occur among unvaccinated individuals within neighborhoods with 60% vaccine coverage compared to neighborhoods with 32% coverage. Our analysis also demonstrates how not accounting for interference can render misleading conclusions about the public health utility of vaccination.
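The four estimands named in the abstract are typically defined by contrasting counterfactual coverage levels α and α′ (for example 60% versus 32% coverage); the definitions below follow the usual two-stage-randomization formulation and are stated as background, not as the paper's exact notation.

```latex
% Population-average effects under counterfactual coverage levels \alpha and \alpha';
% \bar{Y}(a;\alpha) is the mean outcome when an individual receives treatment a under
% coverage \alpha, and \bar{Y}(\alpha) averages over a.
\[
\mathrm{DE}(\alpha) = \bar{Y}(0;\alpha) - \bar{Y}(1;\alpha), \qquad
\mathrm{IE}(\alpha,\alpha') = \bar{Y}(0;\alpha) - \bar{Y}(0;\alpha'),
\]
\[
\mathrm{TE}(\alpha,\alpha') = \bar{Y}(0;\alpha) - \bar{Y}(1;\alpha'), \qquad
\mathrm{OE}(\alpha,\alpha') = \bar{Y}(\alpha) - \bar{Y}(\alpha').
\]
% The 5.29 cases per 1000 person-years quoted in the abstract is an estimate of an
% indirect (spillover) effect of this form.
```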

78 citations


Journal ArticleDOI
TL;DR: The proposed Bayesian localized conditional autoregressive model is flexible spatially, in the sense that it is not only able to model areas of spatial smoothness, but also it is able to capture step changes in the random effects surface.
Abstract: Estimation of the long-term health effects of air pollution is a challenging task, especially when modeling spatial small-area disease incidence data in an ecological study design. The challenge comes from the unobserved underlying spatial autocorrelation structure in these data, which is accounted for using random effects modeled by a globally smooth conditional autoregressive model. These smooth random effects confound the effects of air pollution, which are also globally smooth. To avoid this collinearity a Bayesian localized conditional autoregressive model is developed for the random effects. This localized model is flexible spatially, in the sense that it is not only able to model areas of spatial smoothness, but also it is able to capture step changes in the random effects surface. This methodological development allows us to improve the estimation performance of the covariate effects, compared to using traditional conditional auto-regressive models. These results are established using a simulation study, and are then illustrated with our motivating study on air pollution and respiratory ill health in Greater Glasgow, Scotland in 2011. The model shows substantial health effects of particulate matter air pollution and nitrogen dioxide, whose effects have been consistently attenuated by the currently available globally smooth models.

71 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a method for constructing dynamic treatment regimes that accommodates competing outcomes by recommending sets of treatments at each decision point, where each set of treatments contains all treatments that produce non-inferior outcomes.
Abstract: Dynamic treatment regimes operationalize the clinical decision process as a sequence of functions, one for each clinical decision, where each function maps up-to-date patient information to a single recommended treatment. Current methods for estimating optimal dynamic treatment regimes, for example Q-learning, require the specification of a single outcome by which the ‘goodness’ of competing dynamic treatment regimes is measured. However, this is an over-simplification of the goal of clinical decision making, which aims to balance several potentially competing outcomes, e.g., symptom relief and side-effect burden. When there are competing outcomes and patients do not know or cannot communicate their preferences, formation of a single composite outcome that correctly balances the competing outcomes is not possible. This problem also occurs when patient preferences evolve over time. We propose a method for constructing dynamic treatment regimes that accommodates competing outcomes by recommending sets of treatments at each decision point. Formally, we construct a sequence of set-valued functions that take as input up-to-date patient information and give as output a recommended subset of the possible treatments. For a given patient history, the recommended set of treatments contains all treatments that produce non-inferior outcome vectors. Constructing these set-valued functions requires solving a non-trivial enumeration problem. We offer an exact enumeration algorithm by recasting the problem as a linear mixed integer program. The proposed methods are illustrated using data from the CATIE schizophrenia study.
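The core of the set-valued recommendation is a non-inferiority (Pareto) screen over predicted outcome vectors. The brute-force sketch below illustrates that criterion only; the paper solves the underlying enumeration exactly by recasting it as a linear mixed integer program. The treatments, predicted values, and larger-is-better convention are hypothetical.

```python
# Recommend every treatment whose predicted outcome vector is not dominated by another's.
def non_inferior_set(predicted):
    """predicted: dict mapping treatment -> tuple of predicted outcomes (larger = better)."""
    def dominates(u, v):  # u no worse on every coordinate and strictly better on some
        return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))
    return {t for t, v in predicted.items()
            if not any(dominates(u, v) for s, u in predicted.items() if s != t)}

# Hypothetical predictions: (symptom relief, negated side-effect burden) per treatment.
preds = {"A": (0.7, -0.4), "B": (0.5, -0.1), "C": (0.4, -0.5)}
print(non_inferior_set(preds))  # {'A', 'B'}; C is dominated and is not recommended
```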

62 citations


Journal ArticleDOI
TL;DR: This work proposes a decision-theoretic approach to confounder selection and effect estimation, which first estimates the full standard Bayesian regression model and then post-processes the posterior distribution with a loss function that penalizes models omitting important confounders.
Abstract: When estimating the effect of an exposure or treatment on an outcome it is important to select the proper subset of confounding variables to include in the model. Including too many covariates increases the mean squared error of the effect of interest, while not including confounding variables biases the exposure effect estimate. We propose a decision-theoretic approach to confounder selection and effect estimation. We first estimate the full standard Bayesian regression model and then post-process the posterior distribution with a loss function that penalizes models omitting important confounders. Our method can be fit easily with existing software and in many situations without the use of Markov chain Monte Carlo methods, resulting in computation on the order of the least squares solution. We prove that the proposed estimator has attractive asymptotic properties. In a simulation study we show that our method outperforms existing methods. We demonstrate our method by estimating the effect of fine particulate matter (PM2.5) exposure on birth weight in Mecklenburg County, North Carolina.

56 citations


Journal ArticleDOI
TL;DR: This paper investigates how causal peer effects of traits and behaviors can be identified using genes (or other structurally isomorphic variables) as instrumental variables (IV) in a large set of data generating models with homophily and confounding and shows that IV identification of peer effects remains possible even under multiple complications often regarded as lethal.
Abstract: The identification of causal peer effects (also known as social contagion or induction) from observational data in social networks is challenged by two distinct sources of bias: latent homophily and unobserved confounding. In this paper, we investigate how causal peer effects of traits and behaviors can be identified using genes (or other structurally isomorphic variables) as instrumental variables (IV) in a large set of data generating models with homophily and confounding. We use directed acyclic graphs to represent these models and employ multiple IV strategies and report three main identification results. First, using a single fixed gene (or allele) as an IV will generally fail to identify peer effects if the gene affects past values of the treatment. Second, multiple fixed genes/alleles, or, more promisingly, time-varying gene expression, can identify peer effects if we instrument exclusion violations as well as the focal treatment. Third, we show that IV identification of peer effects remains possible even under multiple complications often regarded as lethal for IV identification of intra-individual effects, such as pleiotropy on observables and unobservables, homophily on past phenotype, past and ongoing homophily on genotype, inter-phenotype peer effects, population stratification, gene expression that is endogenous to past phenotype and past gene expression, and others. We apply our identification results to estimating peer effects of body mass index (BMI) among friends and spouses in the Framingham Heart Study. Results suggest a positive causal peer effect of BMI between friends.

Journal ArticleDOI
TL;DR: This work proposes a sparse covariate dependent Ising model to study both the conditional dependency within the binary data and its relationship with the additional covariates, and uses ℓ1 penalties to induce sparsity in the fitted graphs and the number of selected covariates.
Abstract: Summary. There has been a lot of work fitting Ising models to multivariate binary data in order to understand the conditional dependency relationships between the variables. However, additional covariates are frequently recorded together with the binary data, and may influence the dependence relationships. Motivated by such a dataset on genomic instability collected from tumor samples of several types, we propose a sparse covariate dependent Ising model to study both the conditional dependency within the binary data and its relationship with the additional covariates. This results in subject-specific Ising models, where the subject's covariates influence the strength of association between the genes. As in all exploratory data analysis, interpretability of results is important, and we use penalties to induce sparsity in the fitted graphs and in the number of selected covariates. Two algorithms to fit the model are proposed and compared on a set of simulated data, and asymptotic results are established. The results on the tumor dataset and their biological significance are discussed in detail.
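One way to write down such a covariate-dependent Ising model, with illustrative notation that may differ from the authors', makes each pairwise interaction a sparse linear function of the covariates:

```latex
% Illustrative parameterization of a sparse covariate-dependent Ising model:
% for binary vector y and covariates x,
\[
P_{\theta}(y \mid x) \;\propto\;
\exp\!\Bigg( \sum_{j} \theta_{j}(x)\, y_j \;+\; \sum_{j<k} \theta_{jk}(x)\, y_j y_k \Bigg),
\qquad
\theta_{jk}(x) = \theta_{jk0} + x^{\top}\boldsymbol{\theta}_{jk},
\]
% with \ell_1 penalties on \theta_{jk0} and \boldsymbol{\theta}_{jk} producing sparse,
% subject-specific graphs whose edge strengths vary with the covariates.
```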

Journal ArticleDOI
TL;DR: Two approaches for building robust M-estimators of the regression parameters in the class of generalized linear models are extended to the negative binomial distribution, and a robust weighted maximum likelihood estimator for the overdispersion parameter, specific to the NB distribution, is introduced.
Abstract: Summary. A popular way to model overdispersed count data, such as the number of falls reported during intervention studies, is by means of the negative binomial (NB) distribution. Classical estimating methods are well-known to be sensitive to model misspecifications, taking the form of patients falling much more than expected in such intervention studies where the NB regression model is used. We extend in this article two approaches for building robust M-estimators of the regression parameters in the class of generalized linear models to the NB distribution. The first approach achieves robustness in the response by applying a bounded function on the Pearson residuals arising in the maximum likelihood estimating equations, while the second approach achieves robustness by bounding the unscaled deviance components. For both approaches, we explore different choices for the bounding functions. Through a unified notation, we show how close these approaches may actually be as long as the bounding functions are chosen and tuned appropriately, and provide the asymptotic distributions of the resulting estimators. Moreover, we introduce a robust weighted maximum likelihood estimator for the overdispersion parameter, specific to the NB distribution. Simulations under various settings show that redescending bounding functions yield estimates with smaller biases under contamination while keeping high efficiency at the assumed model, and this for both approaches. We present an application to a recent randomized controlled trial measuring the effectiveness of an exercise program at reducing the number of falls among people suffering from Parkinson's disease to illustrate the diagnostic use of such robust procedures and their need for reliable inference.
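For orientation, the first approach described above (bounding the Pearson residuals) leads to estimating equations of the generic robust quasi-likelihood type shown below; the notation and the form of the consistency correction are assumptions for illustration, not the paper's exact equations.

```latex
% Robust estimating equations obtained by bounding the Pearson residuals
% r_i = (y_i - \mu_i)/\sqrt{V(\mu_i)} with a bounded function \psi_c (e.g., Huber),
% optionally downweighting the design through w(x_i); a(\beta) restores Fisher
% consistency at the assumed NB model.
\[
\sum_{i=1}^{n}\Bigg[\psi_c(r_i)\, w(x_i)\,
  \frac{1}{\sqrt{V(\mu_i)}}\,\frac{\partial \mu_i}{\partial \beta}
  \;-\; a(\beta)\Bigg] = 0,
\qquad
a(\beta) = \frac{1}{n}\sum_{i=1}^{n}
  \mathrm{E}\!\left[\psi_c(r_i)\right] w(x_i)\,
  \frac{1}{\sqrt{V(\mu_i)}}\,\frac{\partial \mu_i}{\partial \beta}.
\]
```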

Journal ArticleDOI
TL;DR: A Bayesian approach based on the non‐central hypergeometric model that can detect moderate but consistent enrichment signals and identify sets of closely related and biologically meaningful functional terms rather than isolated terms is proposed.
Abstract: Summary. Functional enrichment analysis is conducted on high-throughput data to provide functional interpretation for a list of genes or proteins that share a common property, such as being differentially expressed (DE). The hypergeometric P-value has been widely used to investigate whether genes from pre-defined functional terms, for example, Gene Ontology (GO), are enriched in the DE genes. The hypergeometric P-value has three limitations: (1) computed independently for each term, thus neglecting biological dependence; (2) subject to a size constraint that leads to the tendency of selecting less-specific terms; (3) repeated use of information due to overlapping annotations by the true-path rule. We propose a Bayesian approach based on the non-central hypergeometric model. The GO dependence structure is incorporated through a prior on non-centrality parameters. The likelihood function does not include overlapping information. The inference about enrichment is based on posterior probabilities that do not have a size constraint. This method can detect moderate but consistent enrichment signals and identify sets of closely related and biologically meaningful functional terms rather than isolated terms. We also describe the basic ideas of assumption and implementation of different methods to provide some theoretical insights, which are demonstrated via a simulation study. A real application is presented.

Journal ArticleDOI
TL;DR: It is shown under which conditions shared random-effects models proposed for observed-cluster inference do actually describe members with observed Y, and a psoriatic arthritis dataset is used to illustrate the danger of misinterpreting estimates from shared random-effects models.
Abstract: Clustered data commonly arise in epidemiology. We assume each cluster member has an outcome Y and covariates X. When there are missing data in Y, the distribution of Y given X in all cluster members ("complete clusters") may be different from the distribution just in members with observed Y ("observed clusters"). Often the former is of interest, but when data are missing because in a fundamental sense Y does not exist (e.g., quality of life for a person who has died), the latter may be more meaningful (quality of life conditional on being alive). Weighted and doubly weighted generalized estimating equations and shared random-effects models have been proposed for observed-cluster inference when cluster size is informative, that is, the distribution of Y given X in observed clusters depends on observed cluster size. We show these methods can be seen as actually giving inference for complete clusters and may not also give observed-cluster inference. This is true even if observed clusters are complete in themselves rather than being the observed part of larger complete clusters: here methods may describe imaginary complete clusters rather than the observed clusters. We show under which conditions shared random-effects models proposed for observed-cluster inference do actually describe members with observed Y. A psoriatic arthritis dataset is used to illustrate the danger of misinterpreting estimates from shared random-effects models.

Journal ArticleDOI
TL;DR: This work proposes an efficient composite likelihood approach in which the estimation efficiency results from a construction of over-identified joint composite estimating equations, and the statistical theory for the proposed estimation is developed by extending the classical theory of the generalized method of moments.
Abstract: Spatial-clustered data refer to high-dimensional correlated measurements collected from units or subjects that are spatially clustered. Such data arise frequently from studies in social and health sciences. We propose a unified modeling framework, termed GeoCopula, to characterize both large-scale variation and small-scale variation for various data types, including continuous data, binary data, and count data as special cases. To overcome challenges in the estimation and inference for the model parameters, we propose an efficient composite likelihood approach in which the estimation efficiency results from a construction of over-identified joint composite estimating equations. Consequently, the statistical theory for the proposed estimation is developed by extending the classical theory of the generalized method of moments. A clear advantage of the proposed estimation method is its computational feasibility. We conduct several simulation studies to assess the performance of the proposed models and estimation methods for both Gaussian and binary spatial-clustered data. Results show a clear improvement on estimation efficiency over the conventional composite likelihood method. An illustrative data example is included to motivate and demonstrate the proposed method.

Journal ArticleDOI
TL;DR: The naive likelihood is shown to be valid under mixed case interval censoring, but not under an independent inspection process model, in contrast with full maximum likelihood, which is valid under both interval censoring models.
Abstract: Parametric estimation of the cumulative incidence function (CIF) is considered for competing risks data subject to interval censoring. Existing parametric models of the CIF for right censored competing risks data are adapted to the general case of interval censoring. Maximum likelihood estimators for the CIF are considered under the assumed models, extending earlier work on nonparametric estimation. A simple naive likelihood estimator is also considered that utilizes only part of the observed data. The naive estimator enables separate estimation of models for each cause, unlike full maximum likelihood in which all models are fit simultaneously. The naive likelihood is shown to be valid under mixed case interval censoring, but not under an independent inspection process model, in contrast with full maximum likelihood which is valid under both interval censoring models. In simulations, the naive estimator is shown to perform well and yield comparable efficiency to the full likelihood estimator in some settings. The methods are applied to data from a large, recent randomized clinical trial for the prevention of mother-to-child transmission of HIV.

Journal ArticleDOI
TL;DR: This article considers a general setting with multiple markers and proposes a two‐step robust method to derive individualized treatment rules and evaluates their values and proposes procedures for comparing different ITRs, which can be used to quantify the incremental value of new markers in improving treatment selection.
Abstract: A potential avenue to improve healthcare efficiency is to effectively tailor individualized treatment strategies by incorporating patient level predictor information such as environmental exposure, biological, and genetic marker measurements. Many useful statistical methods for deriving individualized treatment rules (ITR) have become available in recent years. Prior to adopting any ITR in clinical practice, it is crucial to evaluate its value in improving patient outcomes. Existing methods for quantifying such values mainly consider either a single marker or semi-parametric methods that are subject to bias under model misspecification. In this paper, we consider a general setting with multiple markers and propose a two-step robust method to derive ITRs and evaluate their values. We also propose procedures for comparing different ITRs, which can be used to quantify the incremental value of new markers in improving treatment selection. While working models are used in step I to approximate optimal ITRs, we add a layer of calibration to guard against model misspecification and further assess the value of the ITR non-parametrically, which ensures the validity of the inference. To account for the sampling variability of the estimated rules and their corresponding values, we propose a resampling procedure to provide valid confidence intervals for the value functions as well as for the incremental value of new markers for treatment selection. Our proposals are examined through extensive simulation studies and illustrated with the data from a clinical trial that studies the effects of two drug combinations on HIV-1 infected patients.

Journal ArticleDOI
TL;DR: A method is proposed to test the correlation of two random fields when they are both spatially autocorrelated; it uses Monte-Carlo methods and focuses on permuting, and then smoothing and scaling, one of the variables to destroy the correlation with the other while maintaining the initial autocorrelation.
Abstract: Summary. We propose a method to test the correlation of two random fields when they are both spatially autocorrelated. In this scenario, the assumption of independence for the pair of observations in the standard test does not hold, and as a result we reject in many cases where there is no effect (the precision of the null distribution is overestimated). Our method recovers the null distribution taking into account the autocorrelation. It uses Monte-Carlo methods, and focuses on permuting, and then smoothing and scaling one of the variables to destroy the correlation with the other, while maintaining at the same time the initial autocorrelation. With this simulation model, any test based on the independence of two (or more) random fields can be constructed. This research was motivated by a project in biodiversity and conservation in the Biology Department at Stanford University.
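A simplified sketch of this Monte-Carlo scheme, assuming gridded fields and a Gaussian smoothing kernel (both assumptions chosen for brevity): permute one field, smooth it to restore spatial autocorrelation, rescale it to the original mean and variance, and compare the observed cross-correlation with the resulting null distribution.

```python
# Permute-then-smooth Monte-Carlo test for correlation between two autocorrelated fields.
import numpy as np
from scipy.ndimage import gaussian_filter

def mc_correlation_test(field_a, field_b, n_sims=999, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    obs = np.corrcoef(field_a.ravel(), field_b.ravel())[0, 1]
    null = np.empty(n_sims)
    for i in range(n_sims):
        perm = rng.permutation(field_b.ravel()).reshape(field_b.shape)
        smooth = gaussian_filter(perm, sigma=sigma)        # restore spatial autocorrelation
        smooth = (smooth - smooth.mean()) / smooth.std()   # rescale to the original
        smooth = smooth * field_b.std() + field_b.mean()   # mean and variance
        null[i] = np.corrcoef(field_a.ravel(), smooth.ravel())[0, 1]
    p = (1 + np.sum(np.abs(null) >= abs(obs))) / (n_sims + 1)
    return obs, p

# Two independent but spatially smooth fields: the test should not reject.
a = gaussian_filter(np.random.default_rng(1).normal(size=(60, 60)), 2.0)
b = gaussian_filter(np.random.default_rng(2).normal(size=(60, 60)), 2.0)
print(mc_correlation_test(a, b))
```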

Journal ArticleDOI
TL;DR: It is argued that rotation-based multiple testing, by allowing for adjustments for confounding, represents an important extension of permutation-based multiple testing procedures.
Abstract: Permutation methods are very useful in several scientific fields. They have the advantage of making fewer assumptions about the data and of providing more reliable inferential results. They are also particularly useful for high-dimensional problems, since they easily account for dependence between tests, thereby allowing for more powerful multiplicity control procedures. Indeed, Westfall and Young's min-p procedure often improves on the Holm procedure by providing more rejections. The advantage of making fewer assumptions about the process generating the data unfortunately involves an inherent limitation in the way the process can be modeled (e.g., through multiple linear models). In this work, we propose a permutation (and rotation) method which allows inference in the multivariate linear model even in the presence of covariates (i.e., nuisance parameters or confounders). The method also allows for the immediate application of the min-p procedure. We make clear how permutations are a particular case of rotations of the data. Permutation tests are exact, while rotation tests retain exactness under the multivariate linear model with normal errors. When errors are not normal, the rotation tests are weakly exchangeable (i.e., approximate and asymptotically exact). A real application to genetic data is presented and discussed.
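For context, the min-p procedure mentioned above can be sketched in its single-step permutation form (the paper extends this to rotations and to models with nuisance covariates); the two-group t-test, data, and permutation count below are illustrative assumptions.

```python
# Single-step min-p multiplicity adjustment by permutation for a two-group comparison.
import numpy as np
from scipy import stats

def minp_adjusted_pvalues(X, groups, n_perm=1000, seed=0):
    """X: n x m data matrix; groups: binary labels of length n."""
    rng = np.random.default_rng(seed)
    p_obs = np.array([stats.ttest_ind(X[groups == 0, j], X[groups == 1, j]).pvalue
                      for j in range(X.shape[1])])
    min_p_null = np.empty(n_perm)
    for b in range(n_perm):
        g = rng.permutation(groups)                      # permute group labels
        p_b = np.array([stats.ttest_ind(X[g == 0, j], X[g == 1, j]).pvalue
                        for j in range(X.shape[1])])
        min_p_null[b] = p_b.min()                        # minimum p-value under the null
    # adjusted p-value: P(minimum null p-value <= observed p-value)
    return np.array([(1 + np.sum(min_p_null <= p)) / (n_perm + 1) for p in p_obs])

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 20)); X[:20, 0] += 1.0          # one truly shifted variable
print(minp_adjusted_pvalues(X, np.repeat([0, 1], 20), n_perm=500))
```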

Journal ArticleDOI
TL;DR: In this paper, a parametric bootstrap method is proposed to test the significance of the multiplicative terms in the final model for a given set of genotypes, for example, crop cultivars.
Abstract: The genotype main effects and genotype-by-environment interaction effects (GGE) model and the additive main effects and multiplicative interaction (AMMI) model are two common models for analysis of genotype-by-environment data. These models are frequently used by agronomists, plant breeders, geneticists and statisticians for analysis of multi-environment trials. In such trials, a set of genotypes, for example, crop cultivars, are compared across a range of environments, for example, locations. The GGE and AMMI models use singular value decomposition to partition genotype-by-environment interaction into an ordered sum of multiplicative terms. This article deals with the problem of testing the significance of these multiplicative terms in order to decide how many terms to retain in the final model. We propose parametric bootstrap methods for this problem. Models with fixed main effects, fixed multiplicative terms and random normally distributed errors are considered. Two methods are derived: a full and a simple parametric bootstrap method. These are compared with the alternatives of using approximate F-tests and cross-validation. In a simulation study based on four multi-environment trials, both bootstrap methods performed well with regard to Type I error rate and power. The simple parametric bootstrap method is particularly easy to use, since it only involves repeated sampling of standard normally distributed values. This method is recommended for selecting the number of multiplicative terms in GGE and AMMI models. The proposed methods can also be used for testing components in principal component analysis.
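A hedged, simplified sketch of the simple parametric bootstrap idea, restricted to the first multiplicative term: compare the share of interaction sum of squares captured by the leading singular value of the double-centered genotype-by-environment matrix with the same share computed from repeatedly sampled standard normal matrices. Dimensions, data, and the restriction to one term are assumptions, not the paper's full procedure.

```python
# Parametric-bootstrap check of the first multiplicative (SVD) term of an interaction matrix.
import numpy as np

def double_center(M):
    return M - M.mean(axis=0) - M.mean(axis=1, keepdims=True) + M.mean()

def first_term_share(M):
    s = np.linalg.svd(double_center(M), compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)   # share of SS explained by the first term

def bootstrap_pvalue(M, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    obs = first_term_share(M)
    null = np.array([first_term_share(rng.standard_normal(M.shape)) for _ in range(n_boot)])
    return (1 + np.sum(null >= obs)) / (n_boot + 1)

rng = np.random.default_rng(4)
interaction = np.outer(rng.normal(size=12), rng.normal(size=8)) + rng.normal(0, 0.5, (12, 8))
print(bootstrap_pvalue(interaction))   # small p-value: retain the first multiplicative term
```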

Journal ArticleDOI
TL;DR: A fully Bayesian semiparametric method for the purpose of attenuating bias and increasing efficiency when jointly modeling time‐to‐event data from two possibly non‐exchangeable sources of information is proposed.
Abstract: Trial investigators often have a primary interest in the estimation of the survival curve in a population for which there exists acceptable historical information from which to borrow strength. However, borrowing strength from a historical trial that is non-exchangeable with the current trial can result in biased conclusions. In this paper we propose a fully Bayesian semiparametric method for the purpose of attenuating bias and increasing efficiency when jointly modeling time-to-event data from two possibly non-exchangeable sources of information. We illustrate the mechanics of our methods by applying them to a pair of post-market surveillance datasets regarding adverse events in persons on dialysis that had either a bare metal or drug-eluting stent implanted during a cardiac revascularization surgery. We finish with a discussion of the advantages and limitations of this approach to evidence synthesis, as well as directions for future work in this area. The paper’s Supplementary Materials offer simulations to show our procedure’s bias, mean squared error, and coverage probability properties in a variety of settings.

Journal ArticleDOI
TL;DR: A novel method called meta-lasso is proposed for variable selection with high-dimensional meta-analyzed data; it possesses gene selection consistency, that is, when the sample size of each data set is large, with high probability, the method can identify all important genes and remove all unimportant genes.
Abstract: Recent advances in biotechnology and its wide applications have led to the generation of many high-dimensional gene expression data sets that can be used to address similar biological questions. Meta-analysis plays an important role in summarizing and synthesizing scientific evidence from multiple studies. When the dimensions of datasets are high, it is desirable to incorporate variable selection into meta-analysis to improve model interpretation and prediction. To our knowledge, all existing methods conduct variable selection with meta-analyzed data in an "all-in-or-all-out" fashion, that is, a gene is either selected in all studies or not selected in any study. However, due to the data heterogeneity that commonly exists in meta-analyzed data, including choices of biospecimens, study population, and measurement sensitivity, it is possible that a gene is important in some studies while unimportant in others. In this article, we propose a novel method called meta-lasso for variable selection with high-dimensional meta-analyzed data. Through a hierarchical decomposition of the regression coefficients, our method not only borrows strength across multiple data sets to boost the power to identify important genes, but also retains selection flexibility among data sets to take data heterogeneity into account. We show that our method possesses gene selection consistency, that is, when the sample size of each data set is large, with high probability, our method can identify all important genes and remove all unimportant genes. Simulation studies demonstrate a good performance of our method. We applied our meta-lasso method to a meta-analysis of five cardiovascular studies. The analysis results are clinically meaningful.
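One way to formalize the hierarchical decomposition described above, using illustrative notation that is not necessarily the authors', is to factor each study-specific coefficient into a gene-level part shared across studies and a study-level part, each carrying its own lasso penalty:

```latex
% Illustrative hierarchical decomposition: study k = 1,...,K, gene j;
% \ell_k is the loss (e.g., negative log-likelihood) in study k.
\[
\beta_{kj} \;=\; \gamma_j\,\tau_{kj},
\qquad
\min_{\gamma,\,\tau}\;\sum_{k=1}^{K} \ell_k(\beta_k)
  \;+\; \lambda_1 \sum_{j} |\gamma_j|
  \;+\; \lambda_2 \sum_{k,j} |\tau_{kj}|.
\]
% Shrinking \gamma_j to zero removes gene j from every study, while the \tau_{kj} let a
% gene retained overall drop out of individual studies, accommodating heterogeneity.
```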

Journal ArticleDOI
TL;DR: A transformed Bernstein polynomial that is centered at standard parametric families, such as Weibull or log-logistic, is proposed for use in the accelerated hazards model, which is further generalized to time-dependent covariates.
Abstract: A transformed Bernstein polynomial that is centered at standard parametric families, such as Weibull or log-logistic, is proposed for use in the accelerated hazards model. This class provides a convenient way towards creating a Bayesian nonparametric prior for smooth densities, blending the merits of parametric and nonparametric methods, that is amenable to standard estimation approaches. For example, optimization methods in SAS or R can yield the posterior mode and asymptotic covariance matrix. This novel nonparametric prior is employed in the accelerated hazards model, which is further generalized to time-dependent covariates. The proposed approach fares considerably better than previous approaches in simulations; data on the effectiveness of biodegradable carmustine polymers on recurrent malignant brain gliomas are investigated.

Journal ArticleDOI
TL;DR: It is shown analytically that the proposed approach can have more power to detect SNPs that are associated with the outcome through transcriptional regulation, compared to tests using the outcome and genotype data alone, and simulations show that the method is relatively robust to misspecification.
Abstract: Integrative genomics offers a promising approach to more powerful genetic association studies. The hope is that combining outcome and genotype data with other types of genomic information can lead to more powerful SNP detection. We present a new association test based on a statistical model that explicitly assumes that genetic variations affect the outcome through perturbing gene expression levels. It is shown analytically that the proposed approach can have more power to detect SNPs that are associated with the outcome through transcriptional regulation, compared to tests using the outcome and genotype data alone, and simulations show that our method is relatively robust to misspecification. We also provide a strategy for applying our approach to high-dimensional genomic data. We use this strategy to identify a potentially new association between a SNP and a yeast cell's response to the natural product tomatidine, which standard association analysis did not detect.

Journal ArticleDOI
TL;DR: A dynamic allocation procedure that increases power and efficiency when measuring an average treatment effect in fixed sample randomized trials with sequential allocation is proposed and estimators for the average treatment effect that combine information from both the matched pairs and unmatched subjects are developed.
Abstract: We propose a dynamic allocation procedure that increases power and efficiency when measuring an average treatment effect in fixed sample randomized trials with sequential allocation. Subjects arrive iteratively and are either randomized or paired via a matching criterion to a previously randomized subject and administered the alternate treatment. We develop estimators for the average treatment effect that combine information from both the matched pairs and unmatched subjects as well as an exact test. The method shows higher efficiency and power than several competing allocation procedures in both simulations and data from a clinical trial.

Journal ArticleDOI
TL;DR: In this article, the model selection problem is considered in a general complex linear model system and in a Bayesian framework; analytic Bayes factors and their approximations are derived to facilitate model selection, and their theoretical and computational properties are discussed.
Abstract: Summary. Motivated by examples from genetic association studies, this article considers the model selection problem in a general complex linear model system and in a Bayesian framework. We discuss formulating model selection problems and incorporating context-dependent a priori information through different levels of prior specifications. We also derive analytic Bayes factors and their approximations to facilitate model selection and discuss their theoretical and computational properties. We demonstrate our Bayesian approach based on an implemented Markov Chain Monte Carlo (MCMC) algorithm in simulations and a real data application of mapping tissue-specific eQTLs. Our novel results on Bayes factors provide a general framework to perform efficient model comparisons in complex linear model systems.

Journal ArticleDOI
TL;DR: A new spectral method to study and exploit complex relationships between model output and monitoring data is proposed, and it is found that CMAQ captures large-scale spatial trends but has low correlation with the monitoring data at small spatial scales.
Abstract: Complex computer models play a crucial role in air quality research. These models are used to evaluate potential regulatory impacts of emission control strategies and to estimate air quality in areas without monitoring data. For both of these purposes, it is important to calibrate model output with monitoring data to adjust for model biases and improve spatial prediction. In this article, we propose a new spectral method to study and exploit complex relationships between model output and monitoring data. Spectral methods allow us to estimate the relationship between model output and monitoring data separately at different spatial scales, and to use model output for prediction only at the appropriate scales. The proposed method is computationally efficient and can be implemented using standard software. We apply the method to compare Community Multiscale Air Quality (CMAQ) model output with ozone measurements in the United States in July 2005. We find that CMAQ captures large-scale spatial trends, but has low correlation with the monitoring data at small spatial scales.

Journal ArticleDOI
TL;DR: A new and flexible probability-enhanced SDR method for binary classification problems is developed using the weighted support vector machine (WSVM), and it is observed that, in order to implement the new slicing scheme, exact probability values are not needed; the only required information is the relative order of the probability values.
Abstract: Summary. In high-dimensional data analysis, it is of primary interest to reduce the data dimensionality without loss of information. Sufficient dimension reduction (SDR) arises in this context, and many successful SDR methods have been developed since the introduction of sliced inverse regression (SIR) [Li (1991) Journal of the American Statistical Association 86, 316–327]. Despite their fast progress, though, most existing methods target regression problems with a continuous response. For binary classification problems, SIR suffers from the limitation of estimating at most one direction since only two slices are available. In this article, we develop a new and flexible probability-enhanced SDR method for binary classification problems by using the weighted support vector machine (WSVM). The key idea is to slice the data based on conditional class probabilities of observations rather than their binary responses. We first show that the central subspace based on the conditional class probability is the same as that based on the binary response. This important result justifies the proposed slicing scheme from a theoretical perspective and assures no information loss. In practice, the true conditional class probability is generally not available, and the problem of probability estimation can be challenging for data with large-dimensional inputs. We observe that, in order to implement the new slicing scheme, one does not need exact probability values and the only required information is the relative order of probability values. Motivated by this fact, our new SDR procedure bypasses the probability estimation step and employs the WSVM to directly estimate the order of probability values, based on which the slicing is performed. The performance of the proposed probability-enhanced SDR scheme is evaluated by both simulated and real data examples.

Journal ArticleDOI
TL;DR: A novel penalized minimization method based on the difference of convex functions algorithm (DCA) is proposed, and the corresponding estimator of marker combinations has a kernel property that allows flexible modeling of linear and nonlinear marker combinations.
Abstract: Treatment-selection markers predict an individual’s response to different therapies, thus allowing for the selection of a therapy with the best predicted outcome. A good marker-based treatment-selection rule can significantly impact public health through the reduction of the disease burden in a cost-effective manner. Our goal in this paper is to use data from randomized trials to identify optimal linear and nonlinear biomarker combinations for treatment selection that minimize the total burden to the population caused by either the targeted disease or its treatment. We frame this objective into a general problem of minimizing a weighted sum of 0–1 loss and propose a novel penalized minimization method that is based on the difference of convex functions algorithm (DCA). The corresponding estimator of marker combinations has a kernel property that allows flexible modeling of linear and nonlinear marker combinations. We compare the proposed methods with existing methods for optimizing treatment regimens such as the logistic regression model and the weighted support vector machine. Performances of different weight functions are also investigated. The application of the proposed method is illustrated using a real example from an HIV vaccine trial: we search for a combination of Fc receptor genes for recommending vaccination in preventing HIV infection.