
Showing papers in "Biometrics in 2011"


Journal ArticleDOI
TL;DR: This article presents how survival probabilities can be estimated for future subjects based on their available longitudinal measurements and a fitted joint model and assesses how well the marker is capable of discriminating between subjects who experience the event within a medically meaningful time frame and subjects who do not.
Abstract: In longitudinal studies it is often of interest to investigate how a marker that is repeatedly measured in time is associated with a time to an event of interest. This type of research question has given rise to a rapidly developing field of biostatistics research that deals with the joint modeling of longitudinal and time-to-event data. In this article, we consider this modeling framework and focus particularly on the assessment of the predictive ability of the longitudinal marker for the time-to-event outcome. In particular, we start by presenting how survival probabilities can be estimated for future subjects based on their available longitudinal measurements and a fitted joint model. We then derive accuracy measures under the joint modeling framework and assess how well the marker is capable of discriminating between subjects who experience the event within a medically meaningful time frame and subjects who do not. We illustrate our proposals on a real data set of human immunodeficiency virus infected patients for which we are interested in predicting the time-to-death using their longitudinal CD4 cell count measurements.
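As a rough illustration of the dynamic prediction idea, the sketch below computes a conditional survival probability P(T >= u | T > t) under a toy parametric joint model with an exponential baseline hazard, a subject-specific linear marker trajectory, and an association parameter. All parameter values (h0, alpha, b0, b1) are made up for illustration and are not taken from the paper.

```python
# Sketch: dynamic survival prediction from a (hypothetical) fitted joint model.
# Assumes an exponential baseline hazard h0, a subject-specific linear marker
# trajectory m_i(s) = b0 + b1*s, and association parameter alpha; all values
# below are illustrative, not estimates from the paper.
import numpy as np
from scipy.integrate import quad

h0, alpha = 0.05, 0.8          # baseline hazard, association with the marker
b0, b1 = 2.0, -0.1             # subject-specific intercept/slope (e.g., log CD4 trend)

def hazard(s):
    return h0 * np.exp(alpha * (b0 + b1 * s))

def survival(t):
    cum_haz, _ = quad(hazard, 0.0, t)
    return np.exp(-cum_haz)

def conditional_survival(u, t):
    """P(T >= u | T > t, longitudinal history), for u > t."""
    return survival(u) / survival(t)

print(conditional_survival(u=5.0, t=2.0))
```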

389 citations


Journal ArticleDOI
TL;DR: A generalization of the N-mixture model is proposed that can be used to formally test the closure assumption, and when applied to an open metapopulation, the generalized model provides estimates of population dynamics parameters and yields abundance estimates that account for imperfect detection probability and do not require the closure assumption.
Abstract: Using only spatially and temporally replicated point counts, Royle (2004b, Biometrics 60, 108-115) developed an N-mixture model to estimate the abundance of an animal population when individual animal detection probability is unknown. One assumption inherent in this model is that the animal populations at each sampled location are closed with respect to migration, births, and deaths throughout the study. In the past this has been verified solely by biological arguments related to the study design as no statistical verification was available. In this article, we propose a generalization of the N-mixture model that can be used to formally test the closure assumption. Additionally, when applied to an open metapopulation, the generalized model provides estimates of population dynamics parameters and yields abundance estimates that account for imperfect detection probability and do not require the closure assumption. A simulation study shows these abundance estimates are less biased than the abundance estimate obtained from the original N-mixture model. The proposed model is then applied to two data sets of avian point counts. The first example demonstrates the closure test on a single-season study of Mallards (Anas platyrhynchos), and the second uses the proposed model to estimate the population dynamics parameters and yearly abundance of American robins (Turdus migratorius) from a multi-year study.
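For context, the sketch below evaluates the site-level likelihood of the original closed N-mixture model that the paper generalizes, truncating the sum over the latent abundance N; the counts, abundance rate lam, and detection probability p are illustrative.

```python
# Sketch of the closed N-mixture site likelihood that the paper generalizes:
# L_i(lambda, p) = sum_N Poisson(N; lambda) * prod_t Binomial(y_it; N, p),
# with the infinite sum over latent abundance N truncated at n_max.
import numpy as np
from scipy.stats import poisson, binom

def site_likelihood(counts, lam, p, n_max=200):
    counts = np.asarray(counts)
    total = 0.0
    for N in range(counts.max(), n_max + 1):
        total += poisson.pmf(N, lam) * np.prod(binom.pmf(counts, N, p))
    return total

y = [3, 5, 4]                       # replicated point counts at one site
print(site_likelihood(y, lam=10.0, p=0.4))
```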

318 citations


Journal ArticleDOI
TL;DR: If any subset of the observed covariates suffices to control for confounding then the set of covariates chosen by the criterion will also suffice, and it is shown that other criteria for confounding control do not have this property.
Abstract: We propose a new criterion for confounder selection when the underlying causal structure is unknown and only limited knowledge is available. We assume all covariates being considered are pretreatment variables and that for each covariate it is known (i) whether the covariate is a cause of treatment, and (ii) whether the covariate is a cause of the outcome. The causal relationships the covariates have with one another are assumed unknown. We propose that control be made for any covariate that is either a cause of treatment or of the outcome or both. We show that irrespective of the actual underlying causal structure, if any subset of the observed covariates suffices to control for confounding then the set of covariates chosen by our criterion will also suffice. We show that other, commonly used, criteria for confounding control do not have this property. We use formal theory concerning causal diagrams to prove our result but the application of the result does not rely on familiarity with causal diagrams. An investigator simply need ask, “Is the covariate a cause of the treatment?” and “Is the covariate a cause of the outcome?” If the answer to either question is “yes” then the covariate is included for confounder control. We discuss some additional covariate selection results that preserve unconfoundedness and that may be of interest when used with our criterion.
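The criterion itself amounts to a simple rule, sketched below with hypothetical covariate annotations: a covariate enters the adjustment set whenever it is flagged as a cause of treatment or a cause of the outcome.

```python
# Sketch of the proposed selection rule: control for any pretreatment covariate
# that is a known cause of the treatment, of the outcome, or of both.
# The covariate annotations below are hypothetical.
covariates = {
    "age":            {"causes_treatment": True,  "causes_outcome": True},
    "blood_pressure": {"causes_treatment": False, "causes_outcome": True},
    "clinic_id":      {"causes_treatment": True,  "causes_outcome": False},
    "hair_color":     {"causes_treatment": False, "causes_outcome": False},
}

adjustment_set = [name for name, info in covariates.items()
                  if info["causes_treatment"] or info["causes_outcome"]]
print(adjustment_set)   # ['age', 'blood_pressure', 'clinic_id']
```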

315 citations


Journal ArticleDOI
TL;DR: This article presents several models that allow the commensurability of the information in the historical and current data to determine how much historical information is used, yielding hierarchical Bayesian methods for incorporating historical data that are adaptively robust to prior information that proves inconsistent with the accumulating experimental data.
Abstract: Bayesian clinical trial designs offer the possibility of a substantially reduced sample size, increased statistical power, and reductions in cost and ethical hazard. However, when prior and current information conflict, Bayesian methods can lead to higher than expected type I error, as well as the possibility of a costlier and lengthier trial. This motivates an investigation of the feasibility of hierarchical Bayesian methods for incorporating historical data that are adaptively robust to prior information that reveals itself to be inconsistent with the accumulating experimental data. In this article, we present several models that allow for the commensurability of the information in the historical and current data to determine how much historical information is used. A primary tool is elaborating the traditional power prior approach based upon a measure of commensurability for Gaussian data. We compare the frequentist performance of several methods using simulations, and close with an example of a colon cancer trial that illustrates a linear models extension of our adaptive borrowing approach. Our proposed methods produce more precise estimates of the model parameters, in particular, conferring statistical significance to the observed reduction in tumor size for the experimental regimen as compared to the control regimen.
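As background for the commensurate elaboration, the sketch below shows the traditional power prior in its simplest conjugate form: a Gaussian mean with known variance, where the historical likelihood is raised to a fixed weight a0 in [0, 1]. The data and a0 are illustrative; in the paper's adaptive approach the degree of borrowing is driven by the commensurability of the two data sources rather than fixed in advance.

```python
# Sketch of the traditional power prior for a Gaussian mean with known variance:
# historical likelihood raised to a0 in [0, 1], combined with a flat initial
# prior, giving a normal posterior with precision a0*n0/sigma2 + n/sigma2.
import numpy as np

sigma2 = 4.0
hist = np.array([1.8, 2.3, 2.1, 1.9, 2.4])     # historical data (illustrative)
curr = np.array([2.9, 3.1, 2.7, 3.3])          # current trial data (illustrative)
a0 = 0.5                                        # fixed degree of borrowing

prec = a0 * len(hist) / sigma2 + len(curr) / sigma2
post_mean = (a0 * len(hist) * hist.mean() / sigma2
             + len(curr) * curr.mean() / sigma2) / prec
post_var = 1.0 / prec
print(post_mean, post_var)
```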

267 citations



Journal ArticleDOI
TL;DR: In this article, approximate Bayesian computation (ABC) is used to estimate the parameters of a stochastic process model for a macroparasite population within a host.
Abstract: We estimate the parameters of a stochastic process model for a macroparasite population within a host using approximate Bayesian computation (ABC). The immunity of the host is an unobserved model variable and only mature macroparasites at sacrifice of the host are counted. With very limited data, process rates are inferred reasonably precisely. Modeling involves a three variable Markov process for which the observed data likelihood is computationally intractable. ABC methods are particularly useful when the likelihood is analytically or computationally intractable. The ABC algorithm we present is based on sequential Monte Carlo, is adaptive in nature, and overcomes some drawbacks of previous approaches to ABC. The algorithm is validated on a test example involving simulated data from an autologistic model before being used to infer parameters of the Markov process model for experimental data. The fitted model explains the observed extra-binomial variation in terms of a zero-one immunity variable, which has a short-lived presence in the host.
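To convey the core ABC idea (though not the adaptive sequential Monte Carlo algorithm the paper develops), the sketch below runs plain ABC rejection on a toy Poisson model: draw parameters from the prior, simulate data, and keep draws whose summary statistic falls within a tolerance of the observed value.

```python
# Minimal ABC rejection sketch (not the adaptive sequential Monte Carlo
# algorithm of the paper): accept parameter draws whose simulated summary
# statistic lies within a tolerance of the observed one. Toy Poisson model.
import numpy as np

rng = np.random.default_rng(1)
observed = rng.poisson(lam=4.0, size=50)        # stand-in for the observed data
obs_mean = observed.mean()

accepted = []
for _ in range(20000):
    lam = rng.uniform(0.0, 20.0)                # draw from the prior
    sim = rng.poisson(lam=lam, size=observed.size)
    if abs(sim.mean() - obs_mean) < 0.2:        # distance on a summary statistic
        accepted.append(lam)

print(len(accepted), np.mean(accepted))         # approximate posterior sample
```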

228 citations


Journal ArticleDOI
TL;DR: It is shown that, with the appropriate filtration, a martingale property holds that allows deriving asymptotic results for the proportional subdistribution hazards model in the same way as for the standard Cox proportional hazards model.
Abstract: The standard estimator for the cause-specific cumulative incidence function in a competing risks setting with left truncated and/or right censored data can be written in two alternative forms. One is a weighted empirical cumulative distribution function and the other a product-limit estimator. This equivalence suggests an alternative view of the analysis of time-to-event data with left truncation and right censoring: individuals who are still at risk or experienced an earlier competing event receive weights from the censoring and truncation mechanisms. As a consequence, inference on the cumulative scale can be performed using weighted versions of standard procedures. This holds for estimation of the cause-specific cumulative incidence function as well as for estimation of the regression parameters in the Fine and Gray proportional subdistribution hazards model. We show that, with the appropriate filtration, a martingale property holds that allows deriving asymptotic results for the proportional subdistribution hazards model in the same way as for the standard Cox proportional hazards model. Estimation of the cause-specific cumulative incidence function and regression on the subdistribution hazard can be performed using standard software for survival analysis if the software allows for inclusion of time-dependent weights. We show the implementation in the R statistical package. The proportional subdistribution hazards model is used to investigate the effect of calendar period as a deterministic external time-varying covariate, which can be seen as a special case of left truncation, on AIDS-related and non-AIDS-related cumulative mortality.
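A minimal sketch of the weighting idea, for right censoring only and ignoring ties and left truncation: estimate the censoring survivor function G by Kaplan-Meier, and keep a subject who earlier failed from a competing cause in the risk set at time t with weight G(t)/G(s), where s is that subject's competing-event time. The data below are illustrative.

```python
# Sketch of the time-dependent censoring weights behind the weighted analysis:
# a subject who failed from a competing cause at time s stays in later risk
# sets with weight G(t)/G(s), where G is the Kaplan-Meier estimate of the
# censoring distribution. Right censoring only, ties ignored; illustrative data.
import numpy as np

time = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
status = np.array([1, 2, 0, 1, 2, 0])      # 1 = event of interest, 2 = competing, 0 = censored

def censoring_km(t):
    """Kaplan-Meier estimate of P(censoring time > t)."""
    g = 1.0
    for ti, si in sorted(zip(time, status)):
        if ti <= t and si == 0:
            at_risk = np.sum(time >= ti)
            g *= 1.0 - 1.0 / at_risk
    return g

def weight(subject, t):
    """Weight of a subject in the risk set at time t."""
    if status[subject] == 2 and time[subject] < t:     # earlier competing event
        return censoring_km(t) / censoring_km(time[subject])
    return 1.0 if time[subject] >= t else 0.0

print(weight(subject=1, t=5.0))
```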

191 citations


Journal ArticleDOI
TL;DR: An adaptive reinforcement learning approach to discover optimal individualized treatment regimens from a specially designed clinical trial of an experimental treatment for patients with advanced NSCLC who have not been treated previously with systemic therapy is presented.
Abstract: Typical regimens for advanced metastatic stage IIIB/IV non-small cell lung cancer (NSCLC) consist of multiple lines of treatment. We present an adaptive reinforcement learning approach to discover optimal individualized treatment regimens from a specially designed clinical trial (a “clinical reinforcement trial”) of an experimental treatment for patients with advanced NSCLC who have not been treated previously with systemic therapy. In addition to the complexity of the problem of selecting optimal compounds for first- and second-line treatments based on prognostic factors, another primary goal is to determine the optimal time to initiate second-line therapy, either immediately or delayed after induction therapy, yielding the longest overall survival time. A reinforcement learning method called Q-learning is utilized which involves learning an optimal regimen from patient data generated from the clinical reinforcement trial. Approximating the Q-function with time-indexed parameters can be achieved by using a modification of support vector regression which can utilize censored data. Within this framework, a simulation study shows that the procedure can extract optimal regimens for two lines of treatment directly from clinical data without prior knowledge of the treatment effect mechanism. In addition, we demonstrate that the design reliably selects the best initial time for second-line therapy while taking into account the heterogeneity of NSCLC across patients.
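A toy backward-induction sketch of the Q-learning step follows, with linear Q-functions and ordinary least squares standing in for the paper's censoring-aware support vector regression; the simulated variables (x1, x2, a1, a2, y) are purely illustrative.

```python
# Toy Q-learning sketch for a two-line regimen: fit the second-line Q-function,
# plug its maximized value into the first-line pseudo-outcome, and fit the
# first-line Q-function. Linear Q-functions and ordinary least squares stand in
# for the paper's censoring-aware support vector regression; data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)                    # baseline prognostic factor
a1 = rng.choice([-1, 1], size=n)           # first-line treatment
x2 = 0.5 * x1 + rng.normal(size=n)         # intermediate state
a2 = rng.choice([-1, 1], size=n)           # second-line treatment
y = 1.0 + x1 + 0.5 * a1 * x1 + 0.8 * a2 * x2 + rng.normal(size=n)  # survival-like reward

def fit_ols(design, outcome):
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta

# Stage 2: Q2(x2, a2) = b0 + b1*x2 + b2*a2 + b3*a2*x2
D2 = np.column_stack([np.ones(n), x2, a2, a2 * x2])
b2hat = fit_ols(D2, y)
q2_opt = (b2hat[0] + b2hat[1] * x2
          + np.abs(b2hat[2] + b2hat[3] * x2))         # value under the best a2

# Stage 1: regress the pseudo-outcome on (x1, a1)
D1 = np.column_stack([np.ones(n), x1, a1, a1 * x1])
b1hat = fit_ols(D1, q2_opt)
print("optimal a1 rule: sign of", b1hat[2], "+", b1hat[3], "* x1")
```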

168 citations


Journal ArticleDOI
TL;DR: A new two-dimensional dose-finding method for multiple-agent trials that simplifies to the continual reassessment method (CRM), introduced by O'Quigley, Pepe, and Fisher (1990, Biometrics 46, 33-48), when the ordering is fully known, enables the assumption of a monotonic dose-toxicity curve to be relaxed.
Abstract: Much of the statistical methodology underlying the experimental design of phase 1 trials in oncology is intended for studies involving a single cytotoxic agent. The goal of these studies is to estimate the maximally tolerated dose, the highest dose that can be administered with an acceptable level of toxicity. A fundamental assumption of these methods is monotonicity of the dose–toxicity curve. This is a reasonable assumption for single-agent trials in which the administration of greater doses of the agent can be expected to produce dose-limiting toxicities in increasing proportions of patients. When studying multiple agents, the assumption may not hold because the ordering of the toxicity probabilities could possibly be unknown for several of the available drug combinations. At the same time, some of the orderings are known and so we describe the whole situation as that of a partial ordering. In this article, we propose a new two-dimensional dose-finding method for multiple-agent trials that simplifies to the continual reassessment method (CRM), introduced by O’Quigley, Pepe, and Fisher (1990, Biometrics 46, 33–48), when the ordering is fully known. This design enables us to relax the assumption of a monotonic dose–toxicity curve. We compare our approach and some simulation results to a CRM design in which the ordering is known as well as to other suggestions for partial orders.
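For reference, the sketch below implements the ordinary one-parameter CRM that the partial-order design reduces to when the ordering is fully known: a power model p_d(a) = skeleton_d^exp(a), a normal prior on a, and posterior mean toxicity probabilities computed on a grid. The skeleton, prior variance, and trial data are illustrative.

```python
# Minimal continual reassessment method (CRM) sketch for a known ordering:
# one-parameter power model, normal prior on a, posterior computed on a grid,
# and the next dose chosen to have posterior mean toxicity closest to target.
import numpy as np
from scipy.stats import norm

skeleton = np.array([0.05, 0.12, 0.25, 0.40, 0.55])   # prior toxicity guesses per dose
target = 0.25
n_tox = np.array([0, 1, 2, 0, 0])                      # toxicities observed so far
n_pat = np.array([3, 3, 6, 0, 0])                      # patients treated so far

a_grid = np.linspace(-4, 4, 2001)
da = a_grid[1] - a_grid[0]
prior = norm.pdf(a_grid, scale=np.sqrt(1.34))

p = skeleton[None, :] ** np.exp(a_grid)[:, None]       # toxicity prob: grid x doses
loglik = (n_tox * np.log(p) + (n_pat - n_tox) * np.log(1 - p)).sum(axis=1)
post = prior * np.exp(loglik - loglik.max())
post /= post.sum() * da

post_tox = (post[:, None] * p).sum(axis=0) * da        # posterior mean toxicity per dose
next_dose = int(np.argmin(np.abs(post_tox - target)))
print(np.round(post_tox, 3), "-> recommend dose level", next_dose + 1)
```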

161 citations


Journal ArticleDOI
TL;DR: This work considers selecting both fixed and random effects in a general class of mixed effects models using maximum penalized likelihood (MPL) estimation along with the smoothly clipped absolute deviation (SCAD) and adaptive least absolute shrinkage and selection operator (ALASSO) penalty functions.
Abstract: We consider selecting both fixed and random effects in a general class of mixed effects models using maximum penalized likelihood (MPL) estimation along with the smoothly clipped absolute deviation (SCAD) and adaptive least absolute shrinkage and selection operator (ALASSO) penalty functions. The MPL estimates are shown to possess consistency and sparsity properties and asymptotic normality. A model selection criterion, called the ICQ statistic, is proposed for selecting the penalty parameters (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648–1658). The variable selection procedure based on ICQ is shown to consistently select important fixed and random effects. The methodology is very general and can be applied to numerous situations involving random effects, including generalized linear mixed models. Simulation studies and a real data set from a Yale infant growth study are used to illustrate the proposed methodology.
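To show just the ALASSO penalty (not the mixed-model machinery, the SCAD alternative, or the ICQ tuning), the sketch below fits an adaptive lasso for fixed effects in an ordinary linear model by rescaling columns with initial OLS estimates and running a standard lasso; data are simulated.

```python
# Sketch of the adaptive lasso (ALASSO) penalty on fixed effects only, in a
# plain linear model: rescale each column by an initial OLS estimate so that a
# standard lasso applies covariate-specific penalties.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
w = 1.0 / np.abs(ols.coef_)                 # adaptive penalty weights
X_scaled = X / w                            # column j rescaled by |beta_ols_j|

lasso = Lasso(alpha=0.05).fit(X_scaled, y)
beta_alasso = lasso.coef_ / w               # transform back to the original scale
print(np.round(beta_alasso, 2))             # zeros for the inactive covariates
```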

151 citations


Journal ArticleDOI
TL;DR: A method applicable to multiple stages of mediation and mixed variable types using generalized linear models is presented and low sensitivity to the counterfactual correlation in most scenarios is found.
Abstract: The goal of mediation analysis is to assess direct and indirect effects of a treatment or exposure on an outcome. More generally, we may be interested in the context of a causal model as characterized by a directed acyclic graph (DAG), where mediation via a specific path from exposure to outcome may involve an arbitrary number of links (or "stages"). Methods for estimating mediation (or pathway) effects are available for a continuous outcome and a continuous mediator related via a linear model, while for a categorical outcome or categorical mediator, methods are usually limited to two-stage mediation. We present a method applicable to multiple stages of mediation and mixed variable types using generalized linear models. We define pathway effects using a potential outcomes framework and present a general formula that provides the effect of exposure through any specified pathway. Some pathway effects are nonidentifiable and their estimation requires an assumption regarding the correlation between counterfactuals. We provide a sensitivity analysis to assess the impact of this assumption. Confidence intervals for pathway effect estimates are obtained via a bootstrap method. The method is applied to a cohort study of dental caries in very low birth weight adolescents. A simulation study demonstrates low bias of pathway effect estimators and close-to-nominal coverage rates of confidence intervals. We also find low sensitivity to the counterfactual correlation in most scenarios.
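The two-stage linear case that the method generalizes reduces to the familiar product-of-coefficients calculation, sketched below on simulated data; the multi-stage pathways, GLM links, and counterfactual-correlation sensitivity analysis of the paper are not shown.

```python
# Sketch of the two-stage linear special case: with a continuous mediator M and
# outcome Y in linear models, the effect of exposure A transmitted through the
# A -> M -> Y pathway is the product of the A -> M and M -> Y coefficients.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
A = rng.binomial(1, 0.5, size=n)                  # exposure
M = 0.6 * A + rng.normal(size=n)                  # mediator model
Y = 1.2 * M + 0.3 * A + rng.normal(size=n)        # outcome model

def ols(design, outcome):
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef

a_coef = ols(np.column_stack([np.ones(n), A]), M)[1]          # A -> M
outcome_fit = ols(np.column_stack([np.ones(n), A, M]), Y)
print("indirect (pathway) effect:", a_coef * outcome_fit[2])  # approx 0.6 * 1.2
print("direct effect:", outcome_fit[1])                       # approx 0.3
```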

Journal ArticleDOI
TL;DR: In order to estimate the functional relationship between the copula parameter and the covariate, this work proposes a nonparametric approach based on local likelihood and derives the asymptotic bias and variance of the resulting local polynomial estimator, and outlines how to construct pointwise confidence intervals.
Abstract: The study of dependence between random variables is a mainstay in statistics. In many cases, the strength of dependence between two or more random variables varies according to the values of a measured covariate. We propose inference for this type of variation using a conditional copula model where the copula function belongs to a parametric copula family and the copula parameter varies with the covariate. In order to estimate the functional relationship between the copula parameter and the covariate, we propose a nonparametric approach based on local likelihood. Of importance is also the choice of the copula family that best represents a given set of data. The proposed framework naturally leads to a novel copula selection method based on cross-validated prediction errors. We derive the asymptotic bias and variance of the resulting local polynomial estimator, and outline how to construct pointwise confidence intervals. The finite-sample performance of our method is investigated using simulation studies and is illustrated using a subset of the Matched Multiple Birth data.

Journal ArticleDOI
TL;DR: It is shown that, although the predicted values can vary with the assumed distribution, the prediction accuracy is little affected for mild-to-moderate violations of the assumptions, and standard approaches, readily available in statistical software, will often suffice.
Abstract: Statistical models that include random effects are commonly used to analyze longitudinal and correlated data, often with the assumption that the random effects follow a Gaussian distribution. Via theoretical and numerical calculations and simulation, we investigate the impact of misspecification of this distribution on both how well the predicted values recover the true underlying distribution and the accuracy of prediction of the realized values of the random effects. We show that, although the predicted values can vary with the assumed distribution, the prediction accuracy, as measured by mean square error, is little affected for mild-to-moderate violations of the assumptions. Thus, standard approaches, readily available in statistical software, will often suffice. The results are illustrated using data from the Heart and Estrogen/Progestin Replacement Study using models to predict future blood pressure values.
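For the simplest case, a Gaussian random-intercept model with known variance components, the predicted random effect is the subject's mean deviation shrunk toward zero, as sketched below on simulated data; the paper's robustness study compares such predictions under misspecified random-effect distributions, which is not reproduced here.

```python
# Sketch of random-effect prediction in a Gaussian random-intercept model with
# known variance components: the best linear unbiased predictor shrinks each
# subject's mean deviation by the ratio of between-subject to total variance.
import numpy as np

rng = np.random.default_rng(10)
n_subj, n_rep = 200, 4
sigma_b, sigma_e, mu = 1.0, 2.0, 120.0                      # treated as known here
b = rng.normal(0, sigma_b, size=n_subj)
y = mu + b[:, None] + rng.normal(0, sigma_e, size=(n_subj, n_rep))

shrink = sigma_b**2 / (sigma_b**2 + sigma_e**2 / n_rep)
b_pred = shrink * (y.mean(axis=1) - mu)                     # predicted random intercepts

print("shrinkage factor:", round(shrink, 3))
print("prediction MSE:", round(float(np.mean((b_pred - b) ** 2)), 3))
```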

Journal ArticleDOI
TL;DR: A modification of generalized estimating equations (GEEs) methodology is proposed for hypothesis testing of high-dimensional data, with particular interest in multivariate abundance data in ecology, and it is shown via theory and simulation that this substantially improves the power of Wald statistics when cluster size is not small.
Abstract: A modification of generalized estimating equations (GEEs) methodology is proposed for hypothesis testing of high-dimensional data, with particular interest in multivariate abundance data in ecology, an important application of interest in thousands of environmental science studies. Such data are typically counts characterized by high dimensionality (in the sense that cluster size exceeds number of clusters, n>K) and over-dispersion relative to the Poisson distribution. Usual GEE methods cannot be applied in this setting primarily because sandwich estimators become numerically unstable as n increases. We propose instead using a regularized sandwich estimator that assumes a common correlation matrix R, and shrinks the sample estimate of R toward the working correlation matrix to improve its numerical stability. It is shown via theory and simulation that this substantially improves the power of Wald statistics when cluster size is not small. We apply the proposed approach to study the effects of nutrient addition on nematode communities, and in doing so discuss important issues in implementation, such as using statistics that have good properties when parameter estimates approach the boundary, and using resampling to enable valid inference that is robust to high dimensionality and to possible model misspecification.
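The shrinkage step itself is simple, as the sketch below illustrates on simulated residuals: the rank-deficient sample correlation matrix is pulled toward the working correlation (here the identity), which greatly improves its conditioning. The GEE estimating equations and resampling-based Wald inference are not shown.

```python
# Sketch of the shrinkage step: stabilize the sample correlation matrix of
# high-dimensional residuals by shrinking it toward a working correlation
# (here the identity, i.e., working independence). Data are simulated.
import numpy as np

rng = np.random.default_rng(4)
K, n = 20, 60                      # K clusters, cluster size n > K
resid = rng.normal(size=(K, n))    # standardized residuals (illustrative)

R_hat = np.corrcoef(resid, rowvar=False)   # n x n sample correlation, rank-deficient
R_work = np.eye(n)                         # working correlation
delta = 0.7                                # shrinkage weight
R_reg = (1 - delta) * R_hat + delta * R_work

print(np.linalg.cond(R_hat), np.linalg.cond(R_reg))   # regularized version is far better conditioned
```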

Journal ArticleDOI
TL;DR: A stratified competing risks regression is considered to allow the baseline hazard to vary across levels of the stratification covariate, and data from a breast cancer clinical trial and a bone marrow transplantation registry illustrate the potential utility of this stratified Fine-Gray model.
Abstract: For competing risks data, the Fine–Gray proportional hazards model for subdistribution has gained popularity for its convenience in directly assessing the effect of covariates on the cumulative incidence function. However, in many important applications, proportional hazards may not be satisfied, including multicenter clinical trials, where the baseline subdistribution hazards may not be common due to varying patient populations. In this article, we consider a stratified competing risks regression, to allow the baseline hazard to vary across levels of the stratification covariate. According to the relative size of the number of strata and strata sizes, two stratification regimes are considered. Using partial likelihood and weighting techniques, we obtain consistent estimators of regression parameters. The corresponding asymptotic properties and resulting inferences are provided for the two regimes separately. Data from a breast cancer clinical trial and from a bone marrow transplantation registry illustrate the potential utility of the stratified Fine–Gray model.

Journal ArticleDOI
TL;DR: Probabilistic Inference for ChIP-seq (PICS) as discussed by the authors identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model.
Abstract: ChIP-seq, which combines chromatin immunoprecipitation with massively parallel short-read sequencing, can profile in vivo genome-wide transcription factor-DNA association with higher sensitivity, specificity and spatial resolution than ChIP-chip. While it presents new opportunities for research, ChIP-seq poses new challenges for statistical analysis that derive from the complexity of the biological systems characterized and the variability and biases in its digital sequence data. We propose a method called PICS (Probabilistic Inference for ChIP-seq) for extracting information from ChIP-seq aligned-read data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. Its per-event fragment length estimates also allow it to remove from analysis regions that have atypical lengths. PICS uses pre-calculated, whole-genome read mappability profiles and a truncated t distribution to adjust binding event models for reads that are missing due to local genome repetitiveness. It estimates uncertainties in model parameters that can be used to define confidence regions on binding event locations and to filter estimates. Finally, PICS calculates a per-event enrichment score relative to a control sample, and can use a control sample to estimate a false discovery rate. We compared PICS to the alternative methods MACS, QuEST, and CisGenome, using published GABP and FOXA1 data sets from human cell lines, and found that PICS’ predicted binding sites were more consistent with computationally predicted binding motifs.

Journal ArticleDOI
TL;DR: A Bayesian framework for censored linear (and nonlinear) models replacing the Gaussian assumptions for the random terms with normal/independent (NI) distributions is developed and can be used to develop Bayesian case-deletion influence diagnostics based on the Kullback-Leibler divergence.
Abstract: HIV RNA viral load measures are often subjected to some upper and lower detection limits depending on the quantification assays. Hence, the responses are either left or right censored. Linear (and nonlinear) mixed-effects models (with modifications to accommodate censoring) are routinely used to analyze this type of data and are based on normality assumptions for the random terms. However, those analyses might not provide robust inference when the normality assumptions are questionable. In this article, we develop a Bayesian framework for censored linear (and nonlinear) models replacing the Gaussian assumptions for the random terms with normal/independent (NI) distributions. The NI is an attractive class of symmetric heavy-tailed densities that includes the normal, Student's t, slash, and the contaminated normal distributions as special cases. The marginal likelihood is tractable (using approximations for nonlinear models) and can be used to develop Bayesian case-deletion influence diagnostics based on the Kullback–Leibler divergence. The newly developed procedures are illustrated with two HIV/AIDS studies on viral loads that were initially analyzed using normal (censored) mixed-effects models, as well as simulations.

Journal ArticleDOI
TL;DR: This article decomposes longitudinal outcomes as a sum of several terms: a population mean function, covariates with time-varying coefficients, functional subject-specific random effects, and residual measurement error processes, and proposes penalized spline-based methods for functional mixed effects models with varying coefficients.
Abstract: In this article, we propose penalized spline (P-spline)-based methods for functional mixed effects models with varying coefficients. We decompose longitudinal outcomes as a sum of several terms: a population mean function, covariates with time-varying coefficients, functional subject-specific random effects, and residual measurement error processes. Using P-splines, we propose nonparametric estimation of the population mean function, varying coefficient, random subject-specific curves, and the associated covariance function that represents between-subject variation and the variance function of the residual measurement errors which represents within-subject variation. Proposed methods offer flexible estimation of both the population- and subject-level curves. In addition, decomposing variability of the outcomes into between- and within-subject sources is useful in identifying the dominant variance component and therefore in optimally modeling the covariance function. We use a likelihood-based method to select multiple smoothing parameters. Furthermore, we study the asymptotics of the baseline P-spline estimator with longitudinal data. We conduct simulation studies to investigate performance of the proposed methods. The benefit of the between- and within-subject covariance decomposition is illustrated through an analysis of Berkeley growth data, where we identified clearly distinct patterns of the between- and within-subject covariance functions of children's heights. We also apply the proposed methods to estimate the effect of antihypertensive treatment from the Framingham Heart Study data.
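As a small illustration of penalized spline smoothing for a single mean function, the sketch below uses a truncated-power basis with a ridge penalty on the knot coefficients in place of the paper's B-spline/difference-penalty construction; the functional random effects and covariance decomposition are not shown.

```python
# Sketch of a penalized spline fit for one smooth mean function, using a
# truncated-power basis and a ridge penalty on the knot coefficients (a simple
# stand-in for the B-spline/difference-penalty P-spline of the paper).
import numpy as np

rng = np.random.default_rng(9)
t = np.sort(rng.uniform(0, 1, 150))
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=t.size)

knots = np.linspace(0.05, 0.95, 20)
B = np.column_stack([np.ones_like(t), t] + [np.clip(t - k, 0, None) for k in knots])
lam = 0.1
P = np.diag([0.0, 0.0] + [1.0] * len(knots))          # penalize only the knot coefficients

coef = np.linalg.solve(B.T @ B + lam * P, B.T @ y)    # penalized least squares
fitted = B @ coef
print(float(np.mean((fitted - np.sin(2 * np.pi * t)) ** 2)))   # rough fit check
```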

Journal ArticleDOI
TL;DR: It is shown that models with the skew-normality assumption may provide more reasonable results if the data exhibit skewness, and the results may be important for HIV/AIDS studies in providing quantitative guidance to better understand the virologic responses to antiretroviral treatment.
Abstract: In recent years, nonlinear mixed-effects (NLME) models have been proposed for modeling complex longitudinal data. Covariates are usually introduced in the models to partially explain intersubject variations. However, one often assumes that both model random error and random effects are normally distributed, which may not always give reliable results if the data exhibit skewness. Moreover, some covariates such as CD4 cell count may be often measured with substantial errors. In this article, we address these issues simultaneously by jointly modeling the response and covariate processes using a Bayesian approach to NLME models with covariate measurement errors and a skew-normal distribution. A real data example is offered to illustrate the methodologies by comparing various potential models with different distribution specifications. It is shown that models with the skew-normality assumption may provide more reasonable results if the data exhibit skewness and the results may be important for HIV/AIDS studies in providing quantitative guidance to better understand the virologic responses to antiretroviral treatment.

Journal ArticleDOI
TL;DR: A new Bayesian approach of sample size determination (SSD) for the design of noninferiority clinical trials with a focus on controlling the type I error and power is developed.
Abstract: We develop a new Bayesian approach of sample size determination (SSD) for the design of noninferiority clinical trials. We extend the fitting and sampling priors of Wang and Gelfand (2002, Statistical Science 17, 193–208) to Bayesian SSD with a focus on controlling the type I error and power. Historical data are incorporated via a hierarchical modeling approach as well as the power prior approach of Ibrahim and Chen (2000, Statistical Science 15, 46–60). Various properties of the proposed Bayesian SSD methodology are examined and a simulation-based computational algorithm is developed. The proposed methodology is applied to the design of a noninferiority medical device clinical trial with historical data from previous trials.
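The simulation logic behind sampling/fitting-prior SSD can be sketched for a one-arm Gaussian mean with known variance: generate trials from the sampling prior, analyze each under the fitting prior, and report the proportion meeting a posterior success criterion. The noninferiority device setting, historical-data priors, and type I error control of the paper are not reproduced; all numbers below are illustrative.

```python
# Simulation sketch of Bayesian sample size determination with separate
# sampling and fitting priors, for a one-arm Gaussian mean with known variance.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sigma = 1.0
margin = 0.0                 # success means posterior P(mu > margin) > 0.975

def bayesian_power(n, n_sim=2000):
    success = 0
    for _ in range(n_sim):
        mu = rng.normal(0.3, 0.05)                     # sampling prior (design belief)
        y = rng.normal(mu, sigma, size=n)
        # fitting prior: N(0, 10^2); conjugate normal posterior for mu
        post_prec = 1 / 10.0**2 + n / sigma**2
        post_mean = (n * y.mean() / sigma**2) / post_prec
        post_sd = post_prec ** -0.5
        if 1 - norm.cdf(margin, post_mean, post_sd) > 0.975:
            success += 1
    return success / n_sim

for n in (50, 100, 150):
    print(n, bayesian_power(n))                        # pick the smallest n with adequate power
```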

Journal ArticleDOI
TL;DR: A novel summary measure (the standardized total gain) is proposed that can be used to compare markers and to assess the incremental value of a new marker and a semiparametric estimated‐likelihood method is developed to estimate the joint surrogate value of multiple biomarkers.
Abstract: Recently a new definition of surrogate endpoint, the ‘principal surrogate’, was proposed based on causal associations between treatment effects on the biomarker and on the clinical endpoint. Despite its appealing interpretation, limited research has been conducted to evaluate principal surrogates, and existing methods focus on risk models that consider a single biomarker. How to compare the principal surrogate value of biomarkers or general risk models that consider multiple biomarkers remains an open research question. We propose to characterize a marker or risk model’s principal surrogate value based on the distribution of risk difference between interventions. In addition, we propose a novel summary measure (the standardized total gain) that can be used to compare markers and to assess the incremental value of a new marker. We develop a semiparametric estimated-likelihood method to estimate the joint surrogate value of multiple biomarkers. This method accommodates two-phase sampling of biomarkers and is more widely applicable than existing nonparametric methods by incorporating continuous baseline covariates to predict the biomarker(s), and is more robust than existing parametric methods by leaving the error distribution of markers unspecified. The methodology is illustrated using a simulated example set and a real data set in the context of HIV vaccine trials.

Journal ArticleDOI
TL;DR: An efficient score test is proposed to assess the overall effect of a set of markers, such as genes within a pathway or a network, on survival outcomes and has the advantage of capturing the potentially nonlinear effects without explicitly specifying a particular nonlinear functional form.
Abstract: There is growing evidence that genomic and proteomic research holds great potential for changing irrevocably the practice of medicine. The ability to identify important genomic and biological markers for risk assessment can have a great impact in public health from disease prevention, to detection, to treatment selection. However, the potentially large number of markers and the complexity in the relationship between the markers and the outcome of interest impose a grand challenge in developing accurate risk prediction models. The standard approach to identifying important markers often assesses the marginal effects of individual markers on a phenotype of interest. When multiple markers relate to the phenotype simultaneously via a complex structure, such a type of marginal analysis may not be effective. To overcome such difficulties, we employ a kernel machine Cox regression framework and propose an efficient score test to assess the overall effect of a set of markers, such as genes within a pathway or a network, on survival outcomes. The proposed test has the advantage of capturing the potentially nonlinear effects without explicitly specifying a particular nonlinear functional form. To approximate the null distribution of the score statistic, we propose a simple resampling procedure that can be easily implemented in practice. Numerical studies suggest that the test performs well with respect to both empirical size and power even when the number of variables in a gene set is not small compared to the sample size.

Journal ArticleDOI
TL;DR: It is shown that under a conditional factor model for genomic data with a fixed sample size, the right singular vectors are asymptotically consistent for the unobserved latent factors as the number of features diverges, and a consistent estimator of the dimension of the underlying conditionalfactor model is proposed.
Abstract: High-dimensional data, such as those obtained from a gene expression microarray or second generation sequencing experiment, consist of a large number of dependent features measured on a small number of samples. One of the key problems in genomics is the identification and estimation of factors that associate with many features simultaneously. Identifying the number of factors is also important for unsupervised statistical analyses such as hierarchical clustering. A conditional factor model is the most common model for many types of genomic data, ranging from gene expression, to single nucleotide polymorphisms, to methylation. Here we show that under a conditional factor model for genomic data with a fixed sample size, the right singular vectors are asymptotically consistent for the unobserved latent factors as the number of features diverges. We also propose a consistent estimator of the dimension of the underlying conditional factor model for a finite fixed sample size and an infinite number of features based on a scaled eigen-decomposition. We propose a practical approach for selection of the number of factors in real data sets, and we illustrate the utility of these results for capturing batch and other unmodeled effects in a microarray experiment using the dependence kernel approach of Leek and Storey (2008, Proceedings of the National Academy of Sciences of the United States of America 105, 18718-18723).
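The sketch below illustrates the phenomenon on simulated data from a two-factor model: with many features and few samples, the leading right singular vectors lie essentially in the span of the latent factors, and the scaled squared singular values separate the factor dimensions from noise. The paper's formal dimension estimator is not reproduced.

```python
# Sketch: for a features x samples matrix from a factor model, the leading
# right singular vectors of the row-centered data track the latent factor
# space, and scaled eigenvalues flag how many factors are present.
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 5000, 20, 2                      # many features, few samples, 2 factors
factors = rng.normal(size=(k, n))          # latent factors across samples
loadings = rng.normal(size=(m, k))
data = loadings @ factors + rng.normal(size=(m, n))

centered = data - data.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

scaled_eig = s**2 / (m * n)                # scaled eigenvalues of the sample covariance
print(np.round(scaled_eig[:6], 3))         # the first k values stand clear of the rest

# The leading right singular vectors lie (nearly) in the span of the factors:
F = np.column_stack([np.ones(n), factors.T])
for j in range(k):
    v = Vt[j]
    fit, *_ = np.linalg.lstsq(F, v, rcond=None)
    r2 = 1 - np.sum((v - F @ fit) ** 2) / np.sum((v - v.mean()) ** 2)
    print(f"R^2 of singular vector {j} on the latent factors: {r2:.3f}")
```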

Journal ArticleDOI
TL;DR: This work provides a new class of frailty-based competing risks models for clustered failure times data based on expanding the competing risks model of Prentice et al. (1978) to incorporate frailty variates, with the use of cause-specific proportional hazards frailty models for all the causes.
Abstract: In this work, we provide a new class of frailty-based competing risks models for clustered failure times data. This class is based on expanding the competing risks model of Prentice et al. (1978, Biometrics 34, 541-554) to incorporate frailty variates, with the use of cause-specific proportional hazards frailty models for all the causes. Parametric and nonparametric maximum likelihood estimators are proposed. The main advantages of the proposed class of models, in contrast to the existing models, are: (1) the inclusion of covariates; (2) the flexible structure of the dependency among the various types of failure times within a cluster; and (3) the unspecified within-subject dependency structure. The proposed estimation procedures produce the most efficient parametric and semiparametric estimators and are easy to implement. Simulation studies show that the proposed methods perform very well in practical situations.

Journal ArticleDOI
TL;DR: This work proposed to use the SAEM (stochastic approximation expectation‐maximization) algorithm, a powerful maximum likelihood estimation algorithm, to analyze simultaneously the HIV viral load decrease and the CD4 increase in patients using a long‐term HIV dynamic system.
Abstract: HIV dynamics studies, based on differential equations, have significantly improved the knowledge on HIV infection. While first studies used simplified short-term dynamic models, recent works considered more complex long-term models combined with a global analysis of whole patient data based on nonlinear mixed models, increasing the accuracy of the HIV dynamic analysis. However, statistical issues remain, given the complexity of the problem. We proposed to use the SAEM (stochastic approximation expectation-maximization) algorithm, a powerful maximum likelihood estimation algorithm, to analyze simultaneously the HIV viral load decrease and the CD4 increase in patients using a long-term HIV dynamic system. We applied the proposed methodology to the prospective COPHAR2-ANRS 111 trial. Very satisfactory results were obtained with a model with latent CD4 cells defined by five differential equations. One parameter was fixed; the 10 remaining parameters (eight with between-patient variability) of this model were well estimated. We showed that the efficacy of nelfinavir was reduced compared to indinavir and lopinavir.

Journal ArticleDOI
TL;DR: Methods to estimate group- (e.g., treatment-) specific differences in restricted mean lifetime for studies where treatment is not randomized and lifetimes are subject to both dependent and independent censoring are proposed.
Abstract: In epidemiologic studies of time to an event, mean lifetime is often of direct interest. We propose methods to estimate group- (e.g., treatment-) specific differences in restricted mean lifetime for studies where treatment is not randomized and lifetimes are subject to both dependent and independent censoring. The proposed methods may be viewed as a hybrid of two general approaches to accounting for confounders. Specifically, treatment-specific proportional hazards models are employed to account for baseline covariates, while inverse probability of censoring weighting is used to accommodate time-dependent predictors of censoring. The average causal effect is then obtained by averaging over differences in fitted values based on the proportional hazards models. Large-sample properties of the proposed estimators are derived and simulation studies are conducted to assess their finite-sample applicability. We apply the proposed methods to liver wait list mortality data from the Scientific Registry of Transplant Recipients.
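The estimand itself, restricted mean lifetime, is the area under the survival curve up to a horizon tau; the sketch below computes it from a hand-rolled Kaplan-Meier estimate and compares two simulated groups. The paper's confounder adjustment (treatment-specific Cox models plus inverse probability of censoring weighting) is not reproduced.

```python
# Sketch of restricted mean lifetime: integrate a Kaplan-Meier survival curve
# up to tau and compare groups. Simulated exponential lifetimes with censoring.
import numpy as np

rng = np.random.default_rng(7)

def km(time, event):
    """Event/censoring times in order and the KM survival just after each."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, s = [], 1.0
    for i, (t, d) in enumerate(zip(time, event)):
        if d:
            s *= 1.0 - 1.0 / (len(time) - i)
        surv.append(s)
    return time, np.array(surv)

def restricted_mean(time, event, tau):
    t, s = km(time, event)
    keep = t <= tau
    grid = np.concatenate([[0.0], t[keep], [tau]])
    step = np.concatenate([[1.0], s[keep]])          # survival on each interval
    return np.sum(np.diff(grid) * step)              # step-function integral of S(t)

for lam, label in [(0.10, "group A"), (0.15, "group B")]:
    t_true = rng.exponential(1 / lam, size=300)
    c = rng.exponential(15.0, size=300)
    time, event = np.minimum(t_true, c), (t_true <= c)
    print(label, round(restricted_mean(time, event, tau=10.0), 2))
```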

Journal ArticleDOI
TL;DR: An algorithm for a bias-corrected point estimate of the relative risk using a risk set regression calibration (RRC) approach is presented, followed by the derivation of an estimate of its variance, resulting in a sandwich estimator.
Abstract: Occupational, environmental, and nutritional epidemiologists are often interested in estimating the prospective effect of time-varying exposure variables such as cumulative exposure or cumulative updated average exposure, in relation to chronic disease endpoints such as cancer incidence and mortality. From exposure validation studies, it is apparent that many of the variables of interest are measured with moderate to substantial error. Although the ordinary regression calibration approach is approximately valid and efficient for measurement error correction of relative risk estimates from the Cox model with time-independent point exposures when the disease is rare, it is not adaptable for use with time-varying exposures. By re-calibrating the measurement error model within each risk set, a risk set regression calibration method is proposed for this setting. An algorithm for a bias-corrected point estimate of the relative risk using an RRC approach is presented, followed by the derivation of an estimate of its variance, resulting in a sandwich estimator. Emphasis is on methods applicable to the main study/external validation study design, which arises in important applications. Simulation studies under several assumptions about the error model were carried out, which demonstrated the validity and efficiency of the method in finite samples. The method was applied to a study of diet and cancer from Harvard’s Health Professionals Follow-up Study (HPFS).
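The building block that RRC re-applies within each risk set is ordinary regression calibration, sketched below: use a validation sample to model the true exposure given its error-prone surrogate and substitute the calibrated value in the main study. The Cox model fit and sandwich variance estimation are not shown, and all data are simulated.

```python
# Sketch of ordinary regression calibration: estimate E[X | W] from a
# validation study and impute the calibrated exposure in the main study.
import numpy as np

rng = np.random.default_rng(11)

# Validation study: both true exposure X and error-prone surrogate W observed
x_val = rng.normal(5.0, 1.0, size=300)
w_val = x_val + rng.normal(0.0, 0.8, size=300)
A = np.column_stack([np.ones_like(w_val), w_val])
gamma, *_ = np.linalg.lstsq(A, x_val, rcond=None)      # E[X | W] = g0 + g1 * W

# Main study: only W observed; replace it by the calibrated exposure
w_main = rng.normal(5.0, 1.3, size=2000)
x_calibrated = gamma[0] + gamma[1] * w_main
print("calibration coefficients:", np.round(gamma, 3))
```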

Journal ArticleDOI
TL;DR: A Bayesian two-part latent class model is proposed to characterize the effect of parity on mental health use and expenditures and confirmed that parity had an impact only on the moderate spender class.
Abstract: In 2001, the U.S. Office of Personnel Management required all health plans participating in the Federal Employees Health Benefits Program to offer mental health and substance abuse benefits on par with general medical benefits. The initial evaluation found that, on average, parity did not result in either large spending increases or increased service use over the four-year observational period. However, some groups of enrollees may have benefited from parity more than others. To address this question, we propose a Bayesian two-part latent class model to characterize the effect of parity on mental health use and expenditures. Within each class, we fit a two-part random effects model to separately model the probability of mental health or substance abuse use and mean spending trajectories among those having used services. The regression coefficients and random effect covariances vary across classes, thus permitting class-varying correlation structures between the two components of the model. Our analysis identified three classes of subjects: a group of low spenders that tended to be male, had relatively rare use of services, and decreased their spending pattern over time; a group of moderate spenders, primarily female, that had an increase in both use and mean spending after the introduction of parity; and a group of high spenders that tended to have chronic service use and constant spending patterns. By examining the joint 95% highest probability density regions of expected changes in use and spending for each class, we confirmed that parity had an impact only on the moderate spender class.

Journal ArticleDOI
TL;DR: Three measures are proposed, including the ratio of cumulative hazards, relative risk, and difference in restricted mean lifetime, and a double inverse-weighted estimator, constructed by using inverse probability of treatment weighting to balance the treatment-specific covariate distributions.
Abstract: In medical studies of time-to-event data, nonproportional hazards and dependent censoring are very common issues when estimating the treatment effect. A traditional method for dealing with time-dependent treatment effects is to model the time-dependence parametrically. Limitations of this approach include the difficulty to verify the correctness of the specified functional form and the fact that, in the presence of a treatment effect that varies over time, investigators are usually interested in the cumulative as opposed to instantaneous treatment effect. In many applications, censoring time is not independent of event time. Therefore, we propose methods for estimating the cumulative treatment effect in the presence of nonproportional hazards and dependent censoring. Three measures are proposed, including the ratio of cumulative hazards, relative risk, and difference in restricted mean lifetime. For each measure, we propose a double inverse-weighted estimator, constructed by first using inverse probability of treatment weighting (IPTW) to balance the treatment-specific covariate distributions, then using inverse probability of censoring weighting (IPCW) to overcome the dependent censoring. The proposed estimators are shown to be consistent and asymptotically normal. We study their finite-sample properties through simulation. The proposed methods are used to compare kidney wait-list mortality by race.
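The first weighting layer can be sketched on simulated data: estimate a logistic propensity score and form inverse probability of treatment weights, which balance the treatment-specific covariate distributions. The censoring weights (IPCW) and the three cumulative treatment-effect measures of the paper are not shown.

```python
# Sketch of inverse probability of treatment weighting (IPTW): weights from an
# estimated propensity score balance covariates across nonrandomized arms.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 2000
X = rng.normal(size=(n, 3))                                 # baseline covariates
p_treat = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.8 * X[:, 2])))
A = rng.binomial(1, p_treat)                                # nonrandomized treatment

ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]  # estimated propensity score
w = np.where(A == 1, 1 / ps, 1 / (1 - ps))                  # IPTW weights

# Check balance: weighted covariate means should agree across treatment arms
for j in range(X.shape[1]):
    m1 = np.average(X[A == 1, j], weights=w[A == 1])
    m0 = np.average(X[A == 0, j], weights=w[A == 0])
    print(f"covariate {j}: weighted mean treated {m1:.3f} vs control {m0:.3f}")
```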

Journal ArticleDOI
TL;DR: A flexible class of time series models are proposed to estimate the relative risk of mortality associated with heat waves and Bayesian model averaging is conducted to account for the multiplicity of potential models.
Abstract: Estimating the risks heat waves pose to human health is a critical part of assessing the future impact of climate change. In this article, we propose a flexible class of time series models to estimate the relative risk of mortality associated with heat waves and conduct Bayesian model averaging (BMA) to account for the multiplicity of potential models. Applying these methods to data from 105 U.S. cities for the period 1987-2005, we identify those cities having a high posterior probability of increased mortality risk during heat waves, examine the heterogeneity of the posterior distributions of mortality risk across cities, assess sensitivity of the results to the selection of prior distributions, and compare our BMA results to a model selection approach. Our results show that no single model best predicts risk across the majority of cities, and that for some cities heat-wave risk estimation is sensitive to model choice. Although model averaging leads to posterior distributions with increased variance as compared to statistical inference conditional on a model obtained through model selection, we find that the posterior mean of heat wave mortality risk is robust to accounting for model uncertainty over a broad class of models.
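The model-averaging step on its own is straightforward, as sketched below with BIC-approximated posterior model probabilities; the log-likelihoods, parameter counts, and per-model relative risk estimates are purely illustrative placeholders, not results from the 105-city analysis.

```python
# Sketch of Bayesian model averaging via BIC weights: combine per-model
# relative risk estimates with approximate posterior model probabilities.
import numpy as np

n_obs = 6935                                          # e.g., days of data (illustrative)
loglik = np.array([-12400.2, -12398.9, -12401.7])     # per-model maximized log-likelihoods
n_par = np.array([6, 8, 7])                           # parameters per model
rr_hat = np.array([1.04, 1.07, 1.05])                 # per-model relative risk estimates

bic = -2 * loglik + n_par * np.log(n_obs)
w = np.exp(-0.5 * (bic - bic.min()))
w /= w.sum()                                          # approximate posterior model probabilities

print(np.round(w, 3), "BMA relative risk:", round(float(w @ rr_hat), 3))
```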