
Showing papers in "Biostatistics in 2008"


Journal ArticleDOI
TL;DR: Using a coordinate descent procedure for the lasso, a simple algorithm is developed that solves a 1000-node problem in at most a minute and is 30-4000 times faster than competing methods.
Abstract: We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm--the graphical lasso--that is remarkably fast: It solves a 1000-node problem (approximately 500,000 parameters) in at most a minute and is 30-4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.

5,577 citations
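
For readers who want to try the approach, scikit-learn ships a GraphicalLasso estimator based on this algorithm; the sketch below runs it on simulated data (sample size, dimension, and penalty value are illustrative, not taken from the paper).

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    # Simulated data: n = 200 observations of p = 20 variables.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 20))

    # alpha is the lasso penalty on the entries of the inverse covariance;
    # a larger alpha yields a sparser estimated graph.
    model = GraphicalLasso(alpha=0.2).fit(X)

    precision = model.precision_       # estimated (sparse) inverse covariance
    edges = np.abs(precision) > 1e-8   # nonzero off-diagonal entries = graph edges
    np.fill_diagonal(edges, False)
    print("number of estimated edges:", int(edges.sum()) // 2)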


Journal ArticleDOI
TL;DR: This work proposes using a variant of logistic regression with L2-regularization to fit gene-gene and gene-environment interaction models and demonstrates that this method outperforms other methods in the identification of the interaction structures as well as prediction accuracy.
Abstract: We propose using a variant of logistic regression (LR) with L2-regularization to fit gene-gene and gene-environment interaction models. Studies have shown that many common diseases are influenced by interaction of certain genes. LR models with quadratic penalization not only correctly characterize the influential genes along with their interaction structures but also yield additional benefits in handling high-dimensional, discrete factors with a binary response. We illustrate the advantages of using an L2-regularization scheme and compare its performance with that of "multifactor dimensionality reduction" and "FlexTree," 2 recent tools for identifying gene-gene interactions. Through simulated and real data sets, we demonstrate that our method outperforms other methods in the identification of the interaction structures as well as prediction accuracy. In addition, we validate the significance of the factors selected through bootstrap analyses.

394 citations
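
As a rough, generic sketch of the idea, ridge (L2) penalized logistic regression on main effects plus all pairwise interaction terms, rather than the authors' specific penalization scheme, one could write something like the following; the genotype coding and effect sizes are hypothetical.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n, p = 500, 8
    G = rng.integers(0, 3, size=(n, p)).astype(float)   # genotype codes 0/1/2
    # Hypothetical truth: a main effect of SNP 0 plus a SNP 1 x SNP 2 interaction.
    logit = -1.0 + 0.6 * G[:, 0] + 0.8 * G[:, 1] * G[:, 2]
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    # Expand to main effects + pairwise products, then fit with an L2 penalty
    # (C is the inverse regularization strength in scikit-learn).
    X = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(G)
    fit = LogisticRegression(penalty="l2", C=1.0, max_iter=5000).fit(X, y)
    print(fit.coef_.round(2))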


Journal ArticleDOI
TL;DR: DIC is shown to be an approximation to a penalized loss function based on the deviance, with a penalty derived from a cross-validation argument; when the effective number of parameters is not small relative to the number of observations, as in disease mapping, DIC under-penalizes more complex models.
Abstract: The deviance information criterion (DIC) is widely used for Bayesian model comparison, despite the lack of a clear theoretical foundation. DIC is shown to be an approximation to a penalized loss function based on the deviance, with a penalty derived from a cross-validation argument. This approximation is valid only when the effective number of parameters in the model is much smaller than the number of independent observations. In disease mapping, a typical application of DIC, this assumption does not hold and DIC under-penalizes more complex models. Another deviance-based loss function, derived from the same decision-theoretic framework, is applied to mixture models, which have previously been considered an unsuitable application for DIC.

388 citations
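
For orientation, DIC is usually computed from posterior draws as DIC = Dbar + pD with pD = Dbar − D(theta_bar), where D is the deviance (−2 times the log-likelihood); a minimal sketch for a toy normal-mean model (not an example from the paper) is:

    import numpy as np

    rng = np.random.default_rng(2)
    y = rng.normal(1.0, 1.0, size=50)    # data
    sigma = 1.0                          # known sd, for simplicity

    # Posterior of the mean under a flat prior: Normal(ybar, sigma^2 / n).
    draws = rng.normal(y.mean(), sigma / np.sqrt(len(y)), size=10_000)

    def deviance(theta):
        # D(theta) = -2 * log-likelihood of a Normal(theta, sigma^2) model
        return -2 * np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                           - (y - theta) ** 2 / (2 * sigma**2))

    D_bar = np.mean([deviance(t) for t in draws])   # posterior mean deviance
    p_D = D_bar - deviance(draws.mean())            # effective number of parameters
    print(f"pD = {p_D:.2f}, DIC = {D_bar + p_D:.1f}")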


Journal ArticleDOI
TL;DR: The fused lasso criterion leads to a convex optimization problem, and a fast algorithm is provided for its solution, which generally outperforms competing methods for calling gains and losses in CGH data.
Abstract: We apply the “fused lasso” regression method of Tibshirani and others (2004) to the problem of “hotspot detection”, in particular, detection of regions of gain or loss in comparative genomic hybridization (CGH) data. The fused lasso criterion leads to a convex optimization problem, and we provide a fast algorithm for its solution. Estimates of false-discovery rate are also provided. Our studies show that the new method generally outperforms competing methods for calling gains and losses in CGH data.

377 citations
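
The fused lasso criterion itself is easy to write down with a generic convex solver; the sketch below uses cvxpy on simulated log-ratio data and does not reproduce the paper's fast algorithm or its false-discovery-rate estimates.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(3)
    n = 200
    truth = np.zeros(n)
    truth[80:110] = 1.5                    # a region of copy-number gain
    y = truth + rng.normal(0, 0.6, n)      # noisy CGH log-ratios

    beta = cp.Variable(n)
    lam1, lam2 = 0.5, 3.0                  # sparsity and smoothness penalties
    objective = cp.Minimize(0.5 * cp.sum_squares(y - beta)
                            + lam1 * cp.norm1(beta)
                            + lam2 * cp.norm1(cp.diff(beta)))
    cp.Problem(objective).solve()
    print("loci called as gain:", int(np.sum(beta.value > 0.5)))

The first penalty pushes individual coefficients to exactly 0; the second penalizes differences between neighboring coefficients, which yields piecewise-constant estimates of gain and loss regions.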


Journal ArticleDOI
TL;DR: In this paper, the authors recommend formats that improve clarity in text and tables and emphasize that an estimate and its associated uncertainty should be "connected at the hip" and reported as a single unit rather than separately.
Abstract: When reporting estimates and associated standard errors (ses) or confidence intervals (CIs), the standard formats, “estimate (se)” and “estimate (95% CI: [lower, upper]),” can be confusing in text; in tables they hinder comparisons. Furthermore, some readers can misinterpret the CI format as indicating equal support for all reported values. To remedy these deficits, we recommend formats that (1) improve clarity in text and tables and (2) emphasize that an estimate and its associated uncertainty should be “connected at the hip” as a single unit.

189 citations


Journal ArticleDOI
TL;DR: 3 bias-reduced estimators are proposed and evaluated, along with weighted estimators that combine corrected and uncorrected estimators to reduce selection bias; corresponding CIs are also proposed.
Abstract: Genome-wide association studies (GWAS) provide an important approach to identifying common genetic variants that predispose to human disease. A typical GWAS may genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) located throughout the human genome in a set of cases and controls. Logistic regression is often used to test for association between a SNP genotype and case versus control status, with corresponding odds ratios (ORs) typically reported only for those SNPs meeting selection criteria. However, when these estimates are based on the original data used to detect the variant, the results are affected by a selection bias sometimes referred to as the “winner's curse” (Capen and others, 1971). The actual genetic association is typically overestimated. We show that such selection bias may be severe in the sense that the conditional expectation of the standard OR estimator may be quite far away from the underlying parameter. Also, standard confidence intervals (CIs) may have coverage rates far from the desired level for the selected ORs. We propose and evaluate 3 bias-reduced estimators, and also corresponding weighted estimators that combine corrected and uncorrected estimators, to reduce selection bias. Their corresponding CIs are also proposed. We study the performance of these estimators using simulated data sets and show that they reduce the bias and give CI coverage close to the desired level under various scenarios, even for associations having only small statistical power.

166 citations
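
The selection-bias phenomenon is easy to reproduce in a few lines; the simulation below (illustrative numbers, not the paper's bias-reduced estimators) shows how the mean log-OR among genome-wide-significant SNPs overshoots the true value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    true_logor, se = 0.10, 0.05        # modest true effect, GWAS-scale standard error
    n_snps, alpha = 100_000, 5e-8

    est = rng.normal(true_logor, se, n_snps)     # sampling distribution of the log-OR
    pvals = 2 * stats.norm.sf(np.abs(est / se))
    selected = est[pvals < alpha]                # SNPs passing the significance threshold

    print("true log-OR:            ", true_logor)
    print("mean over all SNPs:     ", round(est.mean(), 3))
    print("mean over selected SNPs:", round(selected.mean(), 3))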


Journal ArticleDOI
TL;DR: 1 and 2 degree-of-freedom tests for association are proposed that do not assume Hardy–Weinberg equilibrium and that treat males as homozygous females; the tests remain valid when phenotype varies between sexes, provided the allele frequency does not, and avoid the loss of power resulting from stratification by sex in such circumstances.
Abstract: The problem of testing for genotype-phenotype association with loci on the X chromosome in mixed-sex samples has received surprisingly little attention. A simple test can be constructed by counting alleles, with males contributing a single allele and females 2. This approach assumes not only Hardy-Weinberg equilibrium in the population from which the study subjects are sampled but also, perhaps, an unrealistic alternative hypothesis. This paper proposes 1 and 2 degree-of-freedom tests for association which do not assume Hardy-Weinberg equilibrium and which treat males as homozygous females. The proposed method remains valid when phenotype varies between sexes, provided the allele frequency does not, and avoids the loss of power resulting from stratification by sex in such circumstances.

147 citations
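
A sketch of the coding described above, males scored as 0 or 2 copies ("homozygous females") and females as 0/1/2, fed into an ordinary 1 degree-of-freedom logistic regression test; this illustrates the coding only, not the paper's proposed test statistics or their variance estimators, and the data are simulated.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n, maf = 2000, 0.3
    sex = rng.integers(0, 2, n)                  # 0 = female, 1 = male
    # Males carry one X allele but are scored 0 or 2; females are scored 0/1/2.
    g = np.where(sex == 1,
                 2 * rng.binomial(1, maf, n),
                 rng.binomial(2, maf, n)).astype(float)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.3 * g))))

    # One-degree-of-freedom association test via logistic regression on the coded dose.
    fit = sm.Logit(y, sm.add_constant(g)).fit(disp=0)
    print("log-OR per coded allele:", round(fit.params[1], 3),
          " p =", format(fit.pvalues[1], ".2e"))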


Journal ArticleDOI
TL;DR: This paper considers synthesis of 2 correlated endpoints and proposes an alternative model for bivariate random-effects meta-analysis (BRMA), which maintains the individual weighting of each study in the analysis but includes only one overall correlation parameter, rho, which removes the need to know the within-study correlations.
Abstract: Multivariate meta-analysis models can be used to synthesize multiple, correlated endpoints such as overall and disease-free survival. A hierarchical framework for multivariate random-effects meta-analysis includes both within-study and between-study correlation. The within-study correlations are assumed known, but they are usually unavailable, which limits the multivariate approach in practice. In this paper, we consider synthesis of 2 correlated endpoints and propose an alternative model for bivariate random-effects meta-analysis (BRMA). This model maintains the individual weighting of each study in the analysis but includes only one overall correlation parameter, ρ, which removes the need to know the within-study correlations. Further, the only data needed to fit the model are those required for a separate univariate random-effects meta-analysis (URMA) of each endpoint, currently the common approach in practice. This makes the alternative model immediately applicable to a wide variety of evidence synthesis situations, including studies of prognosis and surrogate outcomes. We examine the performance of the alternative model through analytic assessment, a realistic simulation study, and application to data sets from the literature. Our results show that, unless the estimate of ρ is very close to 1 or −1, the alternative model produces appropriate pooled estimates with little bias that (i) are very similar to those from a fully hierarchical BRMA model where the within-study correlations are known and (ii) have better statistical properties than those from separate URMAs, especially given missing data. The alternative model is also less prone to estimation at parameter space boundaries than the fully hierarchical model and thus may be preferred even when the within-study correlations are known. It also suitably estimates a function of the pooled estimates and their correlation; however, it only provides an approximate indication of the between-study variation. The alternative model greatly facilitates the utilization of correlation in meta-analysis and should allow an increased application of BRMA in practice.

143 citations
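
The only inputs the alternative model needs are those of a separate univariate random-effects meta-analysis (URMA) per endpoint; as background, a minimal DerSimonian-Laird URMA looks like the sketch below (hypothetical study estimates; the bivariate model itself is not reproduced here).

    import numpy as np

    def dersimonian_laird(y, v):
        """Univariate random-effects pooling of study effects y with
        within-study variances v (method-of-moments estimate of tau^2)."""
        w = 1 / v
        y_fixed = np.sum(w * y) / np.sum(w)
        Q = np.sum(w * (y - y_fixed) ** 2)
        tau2 = max(0.0, (Q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
        w_re = 1 / (v + tau2)
        pooled = np.sum(w_re * y) / np.sum(w_re)
        se = np.sqrt(1 / np.sum(w_re))
        return pooled, se, tau2

    # Hypothetical log hazard ratios and within-study variances for one endpoint.
    y = np.array([-0.30, -0.10, -0.25, 0.05, -0.40, -0.15])
    v = np.array([0.04, 0.02, 0.05, 0.03, 0.06, 0.02])
    print(dersimonian_laird(y, v))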


Journal ArticleDOI
TL;DR: This paper shows how logic regression can be employed to identify SNP interactions explanatory for the disease status in a case-control study and proposes 2 measures for quantifying the importance of these interactions for classification.
Abstract: Interactions of single nucleotide polymorphisms (SNPs) are assumed to be responsible for complex diseases such as sporadic breast cancer. Important goals of studies concerned with such genetic data are thus to identify combinations of SNPs that lead to a higher risk of developing a disease and to measure the importance of these interactions. There are many approaches based on classification methods such as CART and random forests that allow measuring the importance of single variables. But none of these methods enable the importance of combinations of variables to be quantified directly. In this paper, we show how logic regression can be employed to identify SNP interactions explanatory for the disease status in a case-control study and propose 2 measures for quantifying the importance of these interactions for classification. These approaches are then applied on the one hand to simulated data sets and on the other hand to the SNP data of the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer.

140 citations


Journal ArticleDOI
TL;DR: The method can help to determine whether selection bias is present and thus confirm the validity of study conclusions when no evidence of selection bias is found; simulations demonstrate that the odds ratio estimates produced by the method are consistently closer to the true odds ratio than standard odds ratio estimates from logistic regression.
Abstract: Retrospective case-control studies are more susceptible to selection bias than other epidemiologic studies as by design they require that both cases and controls are representative of the same population. However, as case and control recruitment processes are often different, it is not always obvious that the necessary exchangeability conditions hold. Selection bias typically arises when the selection criteria are associated with the risk factor under investigation. We develop a method which produces bias-adjusted estimates for the odds ratio. Our method hinges on 2 conditions. The first is that a variable that separates the risk factor from the selection criteria can be identified. This is termed the "bias breaking" variable. The second condition is that data can be found such that a bias-corrected estimate of the distribution of the bias breaking variable can be obtained. We show by means of a set of examples that such bias breaking variables are not uncommon in epidemiologic settings. We demonstrate using simulations that the estimates of the odds ratios produced by our method are consistently closer to the true odds ratio than standard odds ratio estimates using logistic regression. Further, by applying it to a case-control study, we show that our method can help to determine whether selection bias is present and thus confirm the validity of study conclusions when no evidence of selection bias can be found.

136 citations
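
A generic simulation of the mechanism, selection probabilities that depend on the risk factor among controls, shows how far the naive odds ratio can drift; this illustrates the problem only, not the paper's bias-breaking adjustment.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200_000
    x = rng.binomial(1, 0.3, n)                                # risk factor
    y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.7 * x))))   # disease, true log-OR = 0.7

    def odds_ratio(x, y):
        a = np.sum((x == 1) & (y == 1))
        b = np.sum((x == 1) & (y == 0))
        c = np.sum((x == 0) & (y == 1))
        d = np.sum((x == 0) & (y == 0))
        return (a * d) / (b * c)

    # Cases are always recruited; exposed controls are recruited at half the
    # rate of unexposed controls, so selection is associated with the risk factor.
    p_select = np.where(y == 1, 1.0, np.where(x == 1, 0.05, 0.10))
    keep = rng.binomial(1, p_select) == 1

    print("true OR:          ", round(np.exp(0.7), 2))
    print("full-cohort OR:   ", round(odds_ratio(x, y), 2))
    print("selected-data OR: ", round(odds_ratio(x[keep], y[keep]), 2))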


Journal ArticleDOI
TL;DR: A new method is developed for analyzing case series data in situations where occurrence of the event censors, curtails, or otherwise affects post-event exposures.
Abstract: A new method is developed for analyzing case series data in situations where occurrence of the event censors, curtails, or otherwise affects post-event exposures. Unbiased estimating equations derived from the self-controlled case series model are adapted to allow for exposures whose occurrence or observation is influenced by the event. The method applies to transient point exposures and rare nonrecurrent events. Asymptotic efficiency is studied in some special cases. A computational scheme based on a pseudo-likelihood is proposed to make the computations feasible in complex models. Simulations, a validation study, and 2 applications are described.

Journal ArticleDOI
TL;DR: A method is developed that estimates the class probabilities via some procedure, for example multinomial logistic regression; bootstrap inferences are proposed to account for variability in the estimated probabilities and perform well in simulations.
Abstract: The accuracy of a single diagnostic test for binary outcome can be summarized by the area under the receiver operating characteristic (ROC) curve. Volume under the surface and hypervolume under the manifold have been proposed as extensions for multiple class diagnosis (Scurfield, 1996, 1998). However, the lack of simple inferential procedures for such measures has limited their practical utility. Part of the difficulty is that calculating such quantities may not be straightforward, even with a single test. The decision rule used to generate the ROC surface requires class probability assessments, which are not provided by the tests. We develop a method based on estimating the probabilities via some procedure, for example, multinomial logistic regression. Bootstrap inferences are proposed to account for variability in estimating the probabilities and perform well in simulations. The ROC measures are compared to the correct classification rate, which depends heavily on class prevalences. An example of tumor classification with microarray data demonstrates that this property may lead to substantially different analyses. The ROC-based analysis yields notable decreases in model complexity over previous analyses.
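
A generic sketch of the two ingredients named above, class-probability estimation by multinomial logistic regression and bootstrap inference, applied here for brevity to the correct classification rate rather than to the volume under the ROC surface; the three-class data are simulated.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)
    n = 300
    cls = rng.integers(0, 3, n)                                      # 3 tumor classes
    X = rng.normal(loc=cls[:, None] * 0.8, scale=1.0, size=(n, 4))   # 4 continuous "tests"

    model = LogisticRegression(max_iter=2000).fit(X, cls)   # multinomial for 3 classes
    probs = model.predict_proba(X)                          # estimated class probabilities
    ccr = np.mean(np.argmax(probs, axis=1) == cls)          # correct classification rate

    # Bootstrap the whole procedure (in-sample accuracy, for brevity) to get an interval.
    boot = []
    for _ in range(500):
        idx = rng.integers(0, n, n)
        m = LogisticRegression(max_iter=2000).fit(X[idx], cls[idx])
        boot.append(np.mean(m.predict(X[idx]) == cls[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"CCR = {ccr:.2f}, bootstrap 95% interval ({lo:.2f}, {hi:.2f})")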

Journal ArticleDOI
TL;DR: The intimate relationship of discrete covariates and multistate models is used to naturally treat time-dependent covariates within the subdistribution hazards framework, and the proposed methodology provides a useful synthesis of separate cause-specific hazards analyses.
Abstract: Separate Cox analyses of all cause-specific hazards are the standard technique of choice to study the effect of a covariate in competing risks, but a synopsis of these results in terms of cumulative event probabilities is challenging. This difficulty has led to the development of the proportional subdistribution hazards model. If the covariate is known at baseline, the model allows for a summarizing assessment in terms of the cumulative incidence function. Mathematically, the model also allows for including random time-dependent covariates, but practical implementation has remained unclear due to a certain risk set peculiarity. We use the intimate relationship of discrete covariates and multistate models to naturally treat time-dependent covariates within the subdistribution hazards framework. The methodology then straightforwardly translates to real-valued time-dependent covariates. As with classical survival analysis, including time-dependent covariates does not result in a model for probability functions anymore. Nevertheless, the proposed methodology provides a useful synthesis of separate cause-specific hazards analyses. We illustrate this with hospital infection data, where time-dependent covariates and competing risks are essential to the subject research question.

Journal ArticleDOI
TL;DR: A new class of marginal structural models for so-called partial exposure regimes is introduced to describe the effect on the hazard of death of acquiring infection on a given day s in the ICU, versus not acquiring infection up to that day, had patients stayed in the ICU for at least s days.
Abstract: Intensive care unit (ICU) patients are highly susceptible to hospital-acquired infections due to their poor health and many invasive therapeutic treatments. The effect on mortality of acquiring such infections is, however, poorly understood. Our goal is to quantify this using data from the National Surveillance Study of Nosocomial Infections in ICUs (Belgium). This is challenging because of the presence of time-dependent confounders, such as mechanical ventilation, which lie on the causal path from infection to mortality. Standard statistical analyses may be severely misleading in such settings and have shown contradictory results. Inverse probability weighting for marginal structural models may instead be used but is not directly applicable because these models parameterize the effect of acquiring infection on a given day in ICU, versus "never" acquiring infection in ICU, and this is ill-defined when ICU discharge precedes that day. Additional complications arise from the informative censoring of the survival time by hospital discharge and the instability of the inverse weighting estimation procedure. We accommodate this by introducing a new class of marginal structural models for so-called partial exposure regimes. These describe the effect on the hazard of death of acquiring infection on a given day s, versus not acquiring infection "up to that day," had patients stayed in the ICU for at least s days.
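
For readers unfamiliar with marginal structural models, the sketch below shows ordinary inverse-probability-of-treatment weighting for a single binary exposure and one confounder on simulated data; the partial-exposure-regime models and censoring corrections developed in the paper are considerably more involved and are not reproduced here.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(8)
    n = 20_000
    L = rng.normal(size=n)                           # confounder (e.g. illness severity)
    A = rng.binomial(1, 1 / (1 + np.exp(-0.8 * L)))  # exposure depends on the confounder
    Y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.5 * A + 1.0 * L))))  # outcome depends on both

    # Propensity scores and stabilized inverse-probability-of-treatment weights.
    ps = LogisticRegression(C=1e6).fit(L.reshape(-1, 1), A).predict_proba(L.reshape(-1, 1))[:, 1]
    w = np.where(A == 1, A.mean() / ps, (1 - A.mean()) / (1 - ps))

    # Outcome model with exposure only; weighting stands in for confounder adjustment.
    naive = LogisticRegression(C=1e6).fit(A.reshape(-1, 1), Y)          # confounded
    msm = LogisticRegression(C=1e6).fit(A.reshape(-1, 1), Y, sample_weight=w)
    print("unweighted log-OR:  ", round(float(naive.coef_[0, 0]), 2))
    print("IPW-weighted log-OR:", round(float(msm.coef_[0, 0]), 2))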

Journal ArticleDOI
TL;DR: A two-phase analysis with model selection is proposed for the case-control design: the difference in Hardy-Weinberg disequilibrium coefficients between cases and controls is used to select a genetic model, and the optimal Cochran-Armitage trend test corresponding to the selected model is then used for testing association.
Abstract: The Cochran-Armitage trend test (CATT) is well suited for testing association between a marker and a disease in case-control studies. When the underlying genetic model for the disease is known, the CATT optimal for the genetic model is used. For complex diseases, however, the genetic models of the true disease loci are unknown. In this situation, robust tests are preferable. We propose a two-phase analysis with model selection for the case-control design. In the first phase, we use the difference of Hardy-Weinberg disequilibrium coefficients between the cases and the controls for model selection. Then, an optimal CATT corresponding to the selected model is used for testing association. The correlation of the statistics used for selection and the test for association is derived to adjust the two-phase analysis with control of the Type-I error rate. The simulation studies show that this new approach has greater efficiency robustness than the existing methods.
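
For reference, the Cochran-Armitage trend test itself can be computed directly from the 2 x 3 genotype table; the sketch below implements the standard statistic on hypothetical counts, without the paper's two-phase model-selection layer.

    import numpy as np
    from scipy import stats

    def catt(cases, controls, scores=(0, 1, 2)):
        """Cochran-Armitage trend test for a 2 x 3 table of genotype counts
        (aa, Aa, AA) in cases and controls; additive scores by default."""
        cases = np.asarray(cases, float)
        controls = np.asarray(controls, float)
        t = np.asarray(scores, float)
        R1, R2 = cases.sum(), controls.sum()
        C = cases + controls
        N = R1 + R2
        T = np.sum(t * (cases * R2 - controls * R1))
        var = R1 * R2 / N * (np.sum(t**2 * C * (N - C))
                             - 2 * sum(t[i] * t[j] * C[i] * C[j]
                                       for i in range(3) for j in range(i + 1, 3)))
        z = T / np.sqrt(var)
        return z, 2 * stats.norm.sf(abs(z))

    print(catt(cases=[120, 210, 70], controls=[180, 190, 50]))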

Journal ArticleDOI
TL;DR: WECCA is presented, a method for weighted clustering of samples on the basis of the ordinal aCGH data, and a new form of linkage, especially suited for ordinal data, is introduced.
Abstract: Array comparative genomic hybridization (aCGH) is a laboratory technique to measure chromosomal copy number changes. A clear biological interpretation of the measurements is obtained by mapping these onto an ordinal scale with categories loss/normal/gain of a copy. The pattern of gains and losses harbors a level of tumor specificity. Here, we present WECCA (weighted clustering of called aCGH data), a method for weighted clustering of samples on the basis of the ordinal aCGH data. Two similarities to be used in the clustering and particularly suited for ordinal data are proposed, which are generalized to deal with weighted observations. In addition, a new form of linkage, especially suited for ordinal data, is introduced. In a simulation study, we show that the proposed cluster method is competitive to clustering using the continuous data. We illustrate WECCA using an application to a breast cancer data set, where WECCA finds a clustering that relates better with survival than the original one.

Journal ArticleDOI
TL;DR: A new fitting procedure using the expectation-maximization algorithm is introduced, treating the cause of death as missing data; because the procedure is a generalization of the Cox model, all the wealth of options in existing software for the Cox model can be used in relative survival.
Abstract: The goal of relative survival methodology is to compare the survival experience of a cohort with that of the background population. Most often an additive excess hazard model is employed, which assumes that each person’s hazard is a sum of 2 components—the population hazard obtained from life tables and an excess hazard attributable to the specific condition. Usually covariate effects on the excess hazard are assumed to have a proportional hazards structure with parametrically modelled baseline. In this paper, we introduce a new fitting procedure using the expectation–maximization algorithm, treating the cause of death as missing data. The method requires no assumptions about the baseline excess hazard, thus reducing the risk of bias through misspecification. It accommodates the possibility of knowledge of cause of death for some patients, and as a side effect, the method yields an estimate of the ratio between the excess and the population hazard for each subject. More importantly, it estimates the baseline excess hazard flexibly with no additional degrees of freedom spent. Finally, it is a generalization of the Cox model, meaning that all the wealth of options in existing software for the Cox model can be used in relative survival. The method is applied to a data set on survival after myocardial infarction, where it shows how a particular form of the hazard function could be missed using the existing methods.

Journal ArticleDOI
TL;DR: Pairwise-fitted multivariate mixed-model estimates are used in a Bayes rule to obtain, at each point in time, the prognosis for long-term success of the transplant, and it is shown that allowing the markers to be correlated can improve this prognosis.
Abstract: Patients who have undergone renal transplantation are monitored longitudinally at irregular time intervals over 10 years or more. This yields a set of biochemical and physiological markers containing valuable information to anticipate a failure of the graft. A general linear, generalized linear, or nonlinear mixed model is used to describe the longitudinal profile of each marker. To account for the correlation between markers, the univariate mixed models are combined into a multivariate mixed model (MMM) by specifying a joint distribution for the random effects. Due to the high number of markers, a pairwise modeling strategy, where all possible pairs of bivariate mixed models are fitted, is used to obtain parameter estimates for the MMM. These estimates are used in a Bayes rule to obtain, at each point in time, the prognosis for long-term success of the transplant. It is shown that allowing the markers to be correlated can improve this prognosis.

Journal ArticleDOI
TL;DR: This approach applies to a wide variety of semiparametric and nonparametric problems in biostatistics; it does not require solving estimating equations and is thus much faster than existing resampling procedures.
Abstract: We propose a simple and general resampling strategy to estimate variances for parameter estimators derived from nonsmooth estimating functions. This approach applies to a wide variety of semiparametric and nonparametric problems in biostatistics. It does not require solving estimating equations and is thus much faster than the existing resampling procedures. Its usefulness is illustrated with heteroscedastic quantile regression and censored data rank regression. Numerical results based on simulated and real data are provided.

Journal ArticleDOI
TL;DR: A model for estimation of temperature effects on mortality that is able to capture jointly the typical features of every temperature-death relationship, that is, nonlinearity and delayed effect of cold and heat over a few days is presented.
Abstract: We present a model for estimation of temperature effects on mortality that is able to capture jointly the typical features of every temperature-death relationship, that is, nonlinearity and delayed effect of cold and heat over a few days. Using a segmented approximation along with a doubly penalized spline-based distributed lag parameterization, estimates and relevant standard errors of the cold- and heat-related risks and the heat tolerance are provided. The model is applied to data from Milano, Italy.

Journal ArticleDOI
TL;DR: A new statistic called the maximum ordered subset t-statistic (MOST) is proposed, which seems natural when the number of activated samples is unknown; comparisons with other statistics show that the proposed method often has more power than its competitors.
Abstract: We propose a new statistic for the detection of differentially expressed genes when the genes are activated only in a subset of the samples. Statistics designed for this unconventional circumstance have proved to be valuable for most cancer studies, where oncogenes are activated for a small number of disease samples. Previous efforts made in this direction include cancer outlier profile analysis (Tomlins and others, 2005), outlier sum (Tibshirani and Hastie, 2007), and outlier robust t-statistics (Wu, 2007). We propose a new statistic called the maximum ordered subset t-statistic (MOST), which seems to be natural when the number of activated samples is unknown. We compare MOST to other statistics and find that the proposed method often has more power than its competitors.
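
A simplified sketch of the ordered-subset idea: standardize the case values by the controls, sort them, and take the best standardized partial sum over the top k samples. The exact MOST normalization via expected order statistics is not reproduced here, so this only approximates the spirit of the statistic; the data are simulated.

    import numpy as np

    def ordered_subset_score(ctrl, case):
        """Simplified ordered-subset statistic for one gene: standardize case
        values by the control mean/sd, sort them in decreasing order, and take
        the largest partial sum over the top-k samples, normalized by sqrt(k)."""
        z = (np.sort(case)[::-1] - ctrl.mean()) / ctrl.std(ddof=1)
        csum = np.cumsum(z)
        k = np.arange(1, len(z) + 1)
        return np.max(csum / np.sqrt(k))

    rng = np.random.default_rng(9)
    ctrl = rng.normal(0, 1, 30)
    case = rng.normal(0, 1, 30)
    case[:5] += 3.0                    # gene activated in only 5 of 30 disease samples
    print(round(ordered_subset_score(ctrl, case), 2))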

Journal ArticleDOI
TL;DR: Evidence consistency and the main drivers of model inferences are investigated, and a cross-validation technique is considered that reveals data conflict and leverage when each data source is in turn removed from the model.
Abstract: Multiparameter evidence synthesis is becoming widely used as a way of combining evidence from multiple and often disparate sources of information concerning a number of parameters. Synthesizing data in one encompassing model allows propagation of evidence and learning. We demonstrate the use of such an approach in estimating the number of people infected with the hepatitis C virus (HCV) in England and Wales. Data are obtained from seroprevalence studies conducted in different subpopulations. Each subpopulation is modeled as a composition of 3 main HCV risk groups (current injecting drug users (IDUs), ex-IDUs, and non-IDUs). Further, data obtained on the prevalence (size) of each risk group provide an estimate of the prevalence of HCV in the whole population. We simultaneously estimate all model parameters through the use of Bayesian Markov chain Monte Carlo techniques. The main emphasis of this paper is the assessment of evidence consistency and investigation of the main drivers for model inferences. We consider a cross-validation technique to reveal data conflict and leverage when each data source is in turn removed from the model.

Journal ArticleDOI
TL;DR: Results from analysis of a breast cancer microarray gene expression data set indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer-specific survival.
Abstract: One important problem in genomic research is to identify genomic features such as gene expression data or DNA single nucleotide polymorphisms (SNPs) that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups such as genes belonging to the same pathways or SNPs within genes. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that by dividing the variables into appropriate groups, we can obtain better identification of the group features that are related to the phenotypes. In addition, the prediction mean square errors are also smaller than the component-wise boosting procedure. We demonstrate the application of the methods to pathway-based analysis of microarray gene expression data of breast cancer. Results from analysis of a breast cancer microarray gene expression data set indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer-specific survival.

Journal ArticleDOI
TL;DR: This work considers model-based clustering of data that lie on a unit sphere and proposes to model the clusters on the sphere with inverse stereographic projections of multivariate normal distributions.
Abstract: We consider model-based clustering of data that lie on a unit sphere. Such data arise in the analysis of microarray experiments when the gene expressions are standardized so that they have mean 0 and variance 1 across the arrays. We propose to model the clusters on the sphere with inverse stereographic projections of multivariate normal distributions. The corresponding model-based clustering algorithm is described. This algorithm is applied first to simulated data sets to assess the performance of several criteria for determining the number of clusters and to compare its performance with existing methods and second to a real reference data set of standardized gene expression profiles.

Journal ArticleDOI
TL;DR: A Bayesian dose-finding method similar to the TITE-CRM in which doses are chosen using time-to-toxicity data, with a set of rules, based on predictive probabilities, that temporarily suspend accrual if the risk of toxicity at prospective doses for future patients is unacceptably high.
Abstract: Late-onset (LO) toxicities are a serious concern in many phase I trials. Since most dose-limiting toxicities occur soon after therapy begins, most dose-finding methods use a binary indicator of toxicity occurring within a short initial time period. If an agent causes LO toxicities, however, an undesirably large number of patients may be treated at toxic doses before any toxicities are observed. A method addressing this problem is the time-to-event continual reassessment method (TITE-CRM, Cheung and Chappell, 2000). We propose a Bayesian dose-finding method similar to the TITE-CRM in which doses are chosen using time-to-toxicity data. The new aspect of our method is a set of rules, based on predictive probabilities, that temporarily suspend accrual if the risk of toxicity at prospective doses for future patients is unacceptably high. If additional follow-up data reduce the predicted risk of toxicity to an acceptable level, then accrual is restarted, and this process may be repeated several times during the trial. A simulation study shows that the proposed method provides a greater degree of safety than the TITE-CRM, while still reliably choosing the preferred dose. This advantage increases with accrual rate, but the price of this additional safety is that the trial takes longer to complete on average.

Journal ArticleDOI
TL;DR: This work proposes to use additional information (compliance-predictive covariates) to help identify the causal effects and to help describe characteristics of the subpopulations in behavioral medicine trials.
Abstract: In behavioral medicine trials, such as smoking cessation trials, 2 or more active treatments are often compared. Noncompliance by some subjects with their assigned treatment poses a challenge to the data analyst. The principal stratification framework permits inference about causal effects among subpopulations characterized by potential compliance. However, in the absence of prior information, there are 2 significant limitations: (1) the causal effects cannot be point identified for some strata and (2) individuals in the subpopulations (strata) cannot be identified. We propose to use additional information (compliance-predictive covariates) to help identify the causal effects and to help describe characteristics of the subpopulations. The probability of membership in each principal stratum is modeled as a function of these covariates. The model is constructed using marginal compliance models (which are identified) and a sensitivity parameter that captures the association between the 2 marginal distributions. We illustrate our methods in both a simulation study and an analysis of data from a smoking cessation trial.

Journal ArticleDOI
TL;DR: It is shown that when combined with a baseline risk model and information about the population distribution of Y given X, covariate-specific predictiveness curves can be estimated and are useful to an individual in deciding if ascertainment of Y is likely to be informative or not for him.
Abstract: Consider a set of baseline predictors X to predict a binary outcome D and let Y be a novel marker or predictor. This paper is concerned with evaluating the performance of the augmented risk model P(D = 1|Y,X) compared with the baseline model P(D = 1|X). The diagnostic likelihood ratio, DLR_X(y), quantifies the change in risk obtained with knowledge of Y = y for a subject with baseline risk factors X. The notion is commonly used in clinical medicine to quantify the increment in risk prediction due to Y. It is contrasted here with the notion of covariate-adjusted effect of Y in the augmented risk model. We also propose methods for making inference about DLR_X(y). Case–control study designs are accommodated. The methods provide a mechanism to investigate if the predictive information in Y varies with baseline covariates. In addition, we show that when combined with a baseline risk model and information about the population distribution of Y given X, covariate-specific predictiveness curves can be estimated. These curves are useful to an individual in deciding if ascertainment of Y is likely to be informative or not for him. We illustrate with data from 2 studies: one is a study of the performance of hearing screening tests for infants, and the other concerns the value of serum creatinine in diagnosing renal artery stenosis.
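
The arithmetic behind a diagnostic likelihood ratio is the usual pre-test/post-test odds update (post-test odds = pre-test odds x DLR); a small sketch with purely illustrative numbers:

    def updated_risk(baseline_risk, dlr):
        """Post-test risk from a baseline (covariate-specific) risk and a
        diagnostic likelihood ratio for the observed marker value."""
        pre_odds = baseline_risk / (1 - baseline_risk)
        post_odds = pre_odds * dlr
        return post_odds / (1 + post_odds)

    # Hypothetical: 10% baseline risk given X; marker value y with DLR of 4.
    print(round(updated_risk(0.10, 4.0), 3))   # about 0.308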

Journal ArticleDOI
TL;DR: A novel clustering method is developed that uses an initial pairwise curve alignment to adjust for time variation within likely clusters; it shows excellent performance over standard clustering methods in terms of cluster quality measures in simulations and for yeast and human fibroblast data sets.
Abstract: Current clustering methods are routinely applied to gene expression time course data to find genes with similar activation patterns and ultimately to understand the dynamics of biological processes. As the dynamic unfolding of a biological process often involves the activation of genes at different rates, successful clustering in this context requires dealing with varying time and shape patterns simultaneously. This motivates the combination of a novel pairwise warping with a suitable clustering method to discover expression shape clusters. We develop a novel clustering method that incorporates an initial pairwise curve alignment to adjust for time variation within likely clusters. The cluster-specific time synchronization method shows excellent performance over standard clustering methods in terms of cluster quality measures in simulations and for yeast and human fibroblast data sets. In the yeast example, the discovered clusters have high concordance with the known biological processes.

Journal ArticleDOI
TL;DR: It is confirmed that height skewness and kurtosis change with age during puberty, a model is devised to explain why, and adjusting for age at PHV on a multiplicative scale largely removed the trends in the higher moments.
Abstract: Higher moments of the frequency distribution of child height and weight change with age, particularly during puberty, though why is not known. Our aims were to confirm that height skewness and kurtosis change with age during puberty, to devise a model to explain why, and to test the model by analyzing the data longitudinally. Heights of 3245 Christ’s Hospital School boys born during 1927–1956 were measured twice termly from 9 to 20 years (n = 129 508). Treating the data as independent, the mean, standard deviation (SD), skewness, and kurtosis were calculated in 40 age groups and plotted as functions of age t. The data were also analyzed longitudinally using the nonlinear random-effects growth model H(t) = h(t − e) + α, with H(t) the cross-sectional data, h(t) the individual mean curve, and e and α subject-specific random effects reflecting variability in age and height at peak height velocity (PHV). Mean height increased monotonically with age, while the SD, skewness, and kurtosis changed cyclically with, respectively, 1, 2, and 3 turning points. Surprisingly, their age curves corresponded closely in shape to the first, second, and third derivatives of the mean height curve. The growth model expanded as a Taylor series in e predicted such a pattern, and the longitudinal analysis showed that adjusting for age at PHV on a multiplicative scale largely removed the trends in the higher moments. A nonlinear growth process where subjects grow at different rates, such as in puberty, generates cyclical changes in the higher moments of the frequency distribution.
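
The model H(t) = h(t − e) + α is straightforward to simulate; the sketch below uses a hypothetical mean curve with a logistic pubertal spurt and prints how the cross-sectional SD, skewness, and kurtosis vary with age as the random shift e moves individual growth curves along the time axis.

    import numpy as np
    from scipy.stats import skew, kurtosis

    rng = np.random.default_rng(10)
    n = 5000
    e = rng.normal(0, 1.0, n)     # individual shift in age at peak height velocity
    a = rng.normal(0, 3.0, n)     # individual shift in height at PHV

    def h(t):
        # Hypothetical mean curve: steady pre-pubertal growth plus a logistic spurt.
        return 95 + 4.0 * t + 18.0 / (1 + np.exp(-1.5 * (t - 13.5)))

    for t in (9.0, 11.0, 13.0, 14.0, 15.0, 17.0):
        H = h(t - e) + a          # cross-sectional heights at age t
        print(f"age {t:4.1f}: sd={H.std():5.2f}  "
              f"skew={skew(H):+.2f}  kurt={kurtosis(H):+.2f}")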

Journal ArticleDOI
TL;DR: The goal of this manuscript is to promote the consideration of outcome-dependent longitudinal sampling designs and to both outline and evaluate the basic conditional likelihood analysis allowing for valid statistical inference.
Abstract: A typical longitudinal study prospectively collects both repeated measures of a health status outcome as well as covariates that are used either as the primary predictor of interest or as important adjustment factors. In many situations, all covariates are measured on the entire study cohort. However, in some scenarios the primary covariates are time dependent yet may be ascertained retrospectively after completion of the study. One common example would be covariate measurements based on stored biological specimens such as blood plasma. While authors have previously proposed generalizations of the standard case-control design in which the clustered outcome measurements are used to selectively ascertain covariates (Neuhaus and Jewell, 1990) and therefore provide resource efficient collection of information, these designs do not appear to be commonly used. One potential barrier to the use of longitudinal outcome-dependent sampling designs would be the lack of a flexible class of likelihood-based analysis methods. With the relatively recent development of flexible and practical methods such as generalized linear mixed models (Breslow and Clayton, 1993) and marginalized models for categorical longitudinal data (see Heagerty and Zeger, 2000, for an overview), the class of likelihood-based methods is now sufficiently well developed to capture the major forms of longitudinal correlation found in biomedical repeated measures data. Therefore, the goal of this manuscript is to promote the consideration of outcome-dependent longitudinal sampling designs and to both outline and evaluate the basic conditional likelihood analysis allowing for valid statistical inference.