
Showing papers in "Biometrics in 2010"


Journal ArticleDOI
TL;DR: Sparse singular value decomposition (SSVD) is proposed as a new exploratory analysis tool for biclustering or identifying interpretable row–column associations within high‐dimensional data matrices.
Abstract: Sparse singular value decomposition (SSVD) is proposed as a new exploratory analysis tool for biclustering or identifying interpretable row-column associations within high-dimensional data matrices. SSVD seeks a low-rank, checkerboard-structured matrix approximation to data matrices. The desired checkerboard structure is achieved by forcing both the left- and right-singular vectors to be sparse, that is, having many zero entries. By interpreting singular vectors as regression coefficient vectors for certain linear regressions, sparsity-inducing regularization penalties are imposed on the least squares regression to produce sparse singular vectors. An efficient iterative algorithm is proposed for computing the sparse singular vectors, along with some discussion of penalty parameter selection. A lung cancer microarray dataset and a food nutrition dataset are used to illustrate SSVD as a biclustering method. SSVD is also compared with some existing biclustering methods using simulated datasets.
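
For concreteness, the alternating scheme described above can be sketched in a few lines of Python. This is a minimal rank-1 illustration, not the authors' implementation: fixed lasso-style thresholds lam_u and lam_v stand in for the adaptive penalties and data-driven tuning discussed in the abstract.

import numpy as np

def soft_threshold(z, lam):
    # component-wise soft-thresholding operator
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def rank1_ssvd(X, lam_u=1.0, lam_v=1.0, n_iter=100, tol=1e-6):
    """Rank-1 sparse SVD via alternating soft-thresholded regressions
    (plain lasso penalties with fixed tuning constants; a sketch only)."""
    # initialize with the leading singular vectors
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    u, v = U[:, 0], Vt[0, :]
    for _ in range(n_iter):
        v_old = v.copy()
        # for fixed u, the sparse update of v is soft-thresholding of X'u
        v = soft_threshold(X.T @ u, lam_v)
        if np.linalg.norm(v) > 0:
            v = v / np.linalg.norm(v)
        # for fixed v, the sparse update of u is soft-thresholding of Xv
        u = soft_threshold(X @ v, lam_u)
        if np.linalg.norm(u) > 0:
            u = u / np.linalg.norm(u)
        if np.linalg.norm(v - v_old) < tol:
            break
    s1 = u @ X @ v          # rank-1 scale ("singular value")
    return u, s1, v          # sparse left/right vectors and scale

In practice the penalty parameters would be chosen by a criterion such as BIC, as the paper discusses; the sketch leaves them fixed.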

271 citations


Journal ArticleDOI
TL;DR: This method is based on a penalized joint log likelihood with an adaptive penalty for the selection and estimation of both the fixed and random effects, and enjoys the oracle property: asymptotically, it performs as well as if the true model were known beforehand.
Abstract: It is of great practical interest to simultaneously identify the important predictors that correspond to both the fixed and random effects components in a linear mixed-effects (LME) model. Typical approaches perform selection separately on each of the fixed and random effect components. However, changing the structure of one set of effects can lead to different choices of variables for the other set of effects. We propose simultaneous selection of the fixed and random factors in an LME model using a modified Cholesky decomposition. Our method is based on a penalized joint log likelihood with an adaptive penalty for the selection and estimation of both the fixed and random effects. It performs model selection by allowing fixed effects or standard deviations of random effects to be exactly zero. A constrained expectation-maximization algorithm is then used to obtain the final estimates. It is further shown that the proposed penalized estimator enjoys the oracle property: asymptotically, it performs as well as if the true model were known beforehand. We demonstrate the performance of our method based on a simulation study and a real data example.

219 citations


Journal ArticleDOI
TL;DR: In this article, an incremental mixture importance sampling (IMIS) algorithm is proposed that iteratively builds up a better importance sampling function; it retains the simplicity and transparency of sampling importance resampling but is much more efficient computationally.
Abstract: The Joint United Nations Programme on HIV/AIDS (UNAIDS) has decided to use Bayesian melding as the basis for its probabilistic projections of HIV prevalence in countries with generalized epidemics. This combines a mechanistic epidemiological model, prevalence data, and expert opinion. Initially, the posterior distribution was approximated by sampling-importance-resampling, which is simple to implement, easy to interpret, transparent to users, and gave acceptable results for most countries. For some countries, however, this is not computationally efficient because the posterior distribution tends to be concentrated around nonlinear ridges and can also be multimodal. We propose instead incremental mixture importance sampling (IMIS), which iteratively builds up a better importance sampling function. This retains the simplicity and transparency of sampling importance resampling, but is much more efficient computationally. It also leads to a simple estimator of the integrated likelihood that is the basis for Bayesian model comparison and model averaging. In simulation experiments and on real data, it outperformed both sampling importance resampling and three publicly available generic Markov chain Monte Carlo algorithms for this kind of problem.
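
As a rough illustration of the iterative scheme, the following Python sketch implements a generic incremental mixture importance sampler: draws start from the prior, and at each step a multivariate Gaussian component centred at the current highest-weight point is added to the proposal mixture. The function names (log_lik, prior_sample, prior_logpdf), the covariance choice, and the fixed number of iterations are illustrative assumptions, not the UNAIDS implementation.

import numpy as np
from scipy import stats

def imis(log_lik, prior_sample, prior_logpdf, n_init=1000, n_add=100, n_iter=20, seed=1):
    """Generic incremental mixture importance sampling sketch.
    prior_sample(n) must return an (n, d) array of prior draws (d >= 2);
    log_lik and prior_logpdf evaluate one point at a time."""
    rng = np.random.default_rng(seed)
    X = prior_sample(n_init)                       # initial draws from the prior
    log_w = np.array([log_lik(x) for x in X])      # prior proposal: weights proportional to likelihood
    comps = []                                     # Gaussian components added so far
    for _ in range(n_iter):
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        center = X[np.argmax(w)]                   # current highest-weight point
        # covariance from the n_add points nearest the centre
        d = np.linalg.norm(X - center, axis=1)
        nbrs = X[np.argsort(d)[:n_add]]
        cov = np.cov(nbrs, rowvar=False) + 1e-8 * np.eye(X.shape[1])
        comps.append((center, cov))
        X = np.vstack([X, stats.multivariate_normal.rvs(center, cov, size=n_add)])
        # importance weights: target density over the current mixture proposal
        log_prior = np.array([prior_logpdf(x) for x in X])
        log_target = np.array([log_lik(x) for x in X]) + log_prior
        mix = n_init * np.exp(log_prior)
        for m, c in comps:
            mix += n_add * stats.multivariate_normal.pdf(X, m, c)
        log_w = log_target - np.log(mix / (n_init + n_add * len(comps)))
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    idx = rng.choice(len(X), size=n_init, replace=True, p=w)
    return X[idx]                                   # approximate posterior sample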

155 citations


Journal ArticleDOI
TL;DR: An illness-death model with shared frailty is considered, which in its most restrictive form is identical to the semicompeting risks model that has been proposed and analyzed, but that allows for many generalizations and the simple incorporation of covariates.
Abstract: In many instances, a subject can experience both a nonterminal and terminal event where the terminal event (e.g., death) censors the nonterminal event (e.g., relapse) but not vice versa. Typically, the two events are correlated. This situation has been termed semicompeting risks (e.g., Fine, Jiang, and Chappell, 2001, Biometrika 88, 907-939; Wang, 2003, Journal of the Royal Statistical Society, Series B 65, 257-273), and analysis has been based on a joint survival function of two event times over the positive quadrant but with observation restricted to the upper wedge. Implicitly, this approach entertains the idea of latent failure times and leads to discussion of a marginal distribution of the nonterminal event that is not grounded in reality. We argue that, similar to models for competing risks, latent failure times should generally be avoided in modeling such data. We note that semicompeting risks have more classically been described as an illness-death model and this formulation avoids any reference to latent times. We consider an illness-death model with shared frailty, which in its most restrictive form is identical to the semicompeting risks model that has been proposed and analyzed, but that allows for many generalizations and the simple incorporation of covariates. Nonparametric maximum likelihood estimation is used for inference and resulting estimates for the correlation parameter are compared with other proposed approaches. Asymptotic properties, simulation studies, and application to a randomized clinical trial in nasopharyngeal cancer evaluate and illustrate the methods. A simple and fast algorithm is developed for its numerical implementation.

138 citations


Journal ArticleDOI
TL;DR: A theoretical justification for the use of principal components in functional regression is presented, and FPCR is extended in two directions: from linear to generalized linear modeling, and from univariate signal predictors to high-resolution image predictors.
Abstract: Functional principal component regression (FPCR) is a promising new method for regressing scalar outcomes on functional predictors. In this article, we present a theoretical justification for the use of principal components in functional regression. FPCR is then extended in two directions: from linear to generalized linear modeling, and from univariate signal predictors to high-resolution image predictors. We show how to implement the method efficiently by adapting generalized additive model technology to the functional regression context. A technique is proposed for estimating simultaneous confidence bands for the coefficient function; in the neuroimaging setting, this yields a novel means to identify brain regions that are associated with a clinical outcome. A new application of likelihood ratio testing is described for assessing the null hypothesis of a constant coefficient function. The performance of the methodology is illustrated via simulations and real data analyses with positron emission tomography images as predictors.
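
The basic FPCR idea (project the curves onto leading principal components, then regress the scalar outcome on the scores) can be sketched as follows. This is an assumption-laden toy version: logistic regression stands in for the generalized linear model, and the smoothing, simultaneous confidence bands, and likelihood ratio test from the paper are omitted.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fpcr_fit(curves, y, n_components=5):
    """Regress a scalar outcome y on functional predictors by
    (i) extracting principal component scores of the discretized curves and
    (ii) fitting a GLM (logistic here, as one example) to the scores.
    `curves` is an (n_subjects, n_gridpoints) array sampled on a common grid."""
    mean_curve = curves.mean(axis=0)
    Xc = curves - mean_curve
    # principal components of the centred curves via SVD
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    phi = Vt[:n_components]                 # estimated eigenfunctions on the grid
    scores = Xc @ phi.T                     # FPC scores used as regressors
    glm = LogisticRegression().fit(scores, y)
    # coefficient function on the grid: sum_k beta_k * phi_k
    coef_function = glm.coef_.ravel() @ phi
    return glm, phi, coef_function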

135 citations


Journal ArticleDOI
TL;DR: A grouped penalty based on the Lγ‐norm that smoothes the regression coefficients of the predictors over the network is proposed that performs best in variable selection across all simulation set‐ups considered.
Abstract: We consider penalized linear regression, especially for “large p, small n” problems, for which the relationships among predictors are described a priori by a network. A class of motivating examples includes modeling a phenotype through gene expression profiles while accounting for coordinated functioning of genes in the form of biological pathways or networks. To incorporate the prior knowledge of the similar effect sizes of neighboring predictors in a network, we propose a grouped penalty based on the Lγ-norm that smoothes the regression coefficients of the predictors over the network. The main feature of the proposed method is its ability to automatically realize grouped variable selection and exploit grouping effects. We also discuss the effects of the choice of γ and of the weights inside the Lγ-norm. Simulation studies demonstrate the superior finite sample performance of the proposed method as compared to the Lasso, the elastic net, and a recently proposed network-based method. The new method performs best in variable selection across all simulation set-ups considered. For illustration, the method is applied to a microarray dataset to predict survival times for some glioblastoma patients using a gene expression dataset and a gene network compiled from some KEGG pathways.

127 citations


Journal ArticleDOI
TL;DR: Compared to conventional mean regression, quantile regression can characterize the entire conditional distribution of the outcome variable, and is more robust to outliers and misspecification of the error distribution.
Abstract: We study quantile regression (QR) for longitudinal measurements with nonignorable intermittent missing data and dropout. Compared to conventional mean regression, quantile regression can characterize the entire conditional distribution of the outcome variable, and is more robust to outliers and misspecification of the error distribution. We account for the within-subject correlation by introducing an ℓ2 penalty in the usual QR check function to shrink the subject-specific intercepts and slopes toward the common population values. The informative missing data are assumed to be related to the longitudinal outcome process through the shared latent random effects. We assess the performance of the proposed method using simulation studies, and illustrate it with data from a pediatric AIDS clinical trial.
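
The penalized check-function objective described above can be written down directly; the sketch below shows only that piece, under simplified assumptions about the design, and omits the shared-random-effects model for the informative missingness that the paper builds around it.

import numpy as np

def check_loss(u, tau):
    # quantile regression check function rho_tau(u)
    return u * (tau - (u < 0))

def penalized_qr_objective(beta, b, X, Z, y, subj, tau, lam):
    """Check-function loss plus an l2 penalty shrinking the subject-specific
    effects b_i toward zero, i.e., toward the common population values.
    X: (n_obs, p) fixed-effect design, Z: (n_obs, q) subject-level design,
    subj: (n_obs,) integer subject index, b: (n_subjects, q)."""
    resid = y - X @ beta - np.einsum('ij,ij->i', Z, b[subj])
    return check_loss(resid, tau).sum() + lam * np.sum(b ** 2)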

123 citations


Journal ArticleDOI
TL;DR: This article proposes two estimating equation approaches for estimating the covariate coefficients under the Cox model and uses modern stochastic process and martingale theory to develop the asymptotic properties of the estimators.
Abstract: Length-biased time-to-event data are commonly encountered in applications ranging from epidemiological cohort studies or cancer prevention trials to studies of labor economy. A longstanding statistical problem is how to assess the association of risk factors with survival in the target population given the observed length-biased data. In this article, we demonstrate how to estimate these effects under the semiparametric Cox proportional hazards model. The structure of the Cox model is changed under length-biased sampling in general. Although the existing partial likelihood approach for left-truncated data can be used to estimate covariate effects, it may not be efficient for analyzing length-biased data. We propose two estimating equation approaches for estimating the covariate coefficients under the Cox model. We use modern stochastic process and martingale theory to develop the asymptotic properties of the estimators. We evaluate the empirical performance and efficiency of the two methods through extensive simulation studies. We use data from a dementia study to illustrate the proposed methodology, and demonstrate the computational algorithms for point estimates, which can be directly linked to the existing functions in S-PLUS or R.

119 citations


Journal ArticleDOI
TL;DR: This work discusses currently available models for multiple tests and argues in favor of an extension of a model developed by Dendukuri and Joseph; it further develops Goodman's technique and makes geometric arguments to give further insight into the nature of models that lack identifiability.
Abstract: We discuss the issue of identifiability of models for multiple dichotomous diagnostic tests in the absence of a gold standard (GS) test. Data arise as multinomial or product-multinomial counts depending upon the number of populations sampled. Models are generally posited in terms of population prevalences, test sensitivities and specificities, and test dependence terms. It is commonly believed that if the degrees of freedom in the data meet or exceed the number of parameters in a fitted model then the model is identifiable. Goodman (1974, Biometrika 61, 215–231) established long ago that this is not the case. We discuss currently available models for multiple tests and argue in favor of an extension of a model that was developed by Dendukuri and Joseph (2001, Biometrics 57, 158–167). Subsequently, we further develop Goodman's technique, and make geometric arguments to give further insight into the nature of models that lack identifiability. We present illustrations using simulated and real data.

113 citations


Journal ArticleDOI
TL;DR: A modeling and estimation strategy is proposed that incorporates the regret functions of Murphy into a regression model for observed responses; estimation is quick and diagnostics are available, so a variety of candidate models can be compared.
Abstract: We consider optimal dynamic treatment regime determination in practice. Model building, checking, and comparison have had little or no attention so far in this literature. Motivated by an application on optimal dosage of anticoagulants, we propose a modeling and estimation strategy that incorporates the regret functions of Murphy (2003, Journal of the Royal Statistical Society, Series B 65, 331-366) into a regression model for observed responses. Estimation is quick and diagnostics are available, meaning a variety of candidate models can be compared. The method is illustrated using simulation and the anticoagulation application.

106 citations


Journal ArticleDOI
TL;DR: The two major advances in this article are the provision of realistic abundance estimates that take account of heterogeneity of capture, and an appraisal of the amount of overestimation of survival arising from conditioning on the first capture when heterogeneity of survival is present.
Abstract: Estimation of abundance is important in both open and closed population capture–recapture analysis, but unmodeled heterogeneity of capture probability leads to negative bias in abundance estimates. This article defines and develops a suite of open population capture–recapture models using finite mixtures to model heterogeneity of capture and survival probabilities. Model comparisons and parameter estimation use likelihood-based methods. A real example is analyzed, and simulations are used to check the main features of the heterogeneous models, especially the quality of estimation of abundance, survival, recruitment, and turnover. The two major advances in this article are the provision of realistic abundance estimates that take account of heterogeneity of capture, and an appraisal of the amount of overestimation of survival arising from conditioning on the first capture when heterogeneity of survival is present.

Journal ArticleDOI
TL;DR: This article presents a simple model to capture uncertainty arising in the base‐calling procedure of the Illumina/Solexa GA platform, and provides informative and easily interpretable metrics that capture the variability in sequencing quality.
Abstract: Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. The prospect of comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T, between 30 and 100 characters long), which are the result of complex processing of noisy continuous fluorescence intensity measurements known as base-calling. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base-calling performance.

Journal ArticleDOI
TL;DR: The adaptively weighted logrank test maintains optimality at the proportional alternatives, while improving the power over a wide range of nonproportional alternatives, as illustrated in several real data examples.
Abstract: For testing for treatment effects with time-to-event data, the logrank test is the most popular choice and has some optimality properties under proportional hazards alternatives. It may also be combined with other tests when a range of nonproportional alternatives are entertained. We introduce some versatile tests that use adaptively weighted logrank statistics. The adaptive weights utilize the hazard ratio obtained by fitting the model of Yang and Prentice (2005, Biometrika 92, 1-17). Extensive numerical studies have been performed under proportional and nonproportional alternatives, with a wide range of hazard ratios patterns. These studies show that these new tests typically improve the tests they are designed to modify. In particular, the adaptively weighted logrank test maintains optimality at the proportional alternatives, while improving the power over a wide range of nonproportional alternatives. The new tests are illustrated in several real data examples.

Journal ArticleDOI
TL;DR: In this article, time-dependent accuracy measures are proposed for a marker when survival times are censored and competing risks are present, incorporating the cause of failure for competing risks outcomes.
Abstract: Competing risks arise naturally in time-to-event studies. In this article, we propose time-dependent accuracy measures for a marker when we have censored survival times and competing risks. Time-dependent versions of sensitivity or true positive (TP) fraction naturally correspond to consideration of either cumulative (or prevalent) cases that accrue over a fixed time period, or alternatively to incident cases that are observed among event-free subjects at any select time. Time-dependent (dynamic) specificity (1-false positive (FP)) can be based on the marker distribution among event-free subjects. We extend these definitions to incorporate cause of failure for competing risks outcomes. The proposed estimation for cause-specific cumulative TP/dynamic FP is based on the nearest neighbor estimation of bivariate distribution function of the marker and the event time. On the other hand, incident TP/dynamic FP can be estimated using a possibly nonproportional hazards Cox model for the cause-specific hazards and riskset reweighting of the marker distribution. The proposed methods extend the time-dependent predictive accuracy measures of Heagerty, Lumley, and Pepe (2000, Biometrics 56, 337-344) and Heagerty and Zheng (2005, Biometrics 61, 92-105).

Journal ArticleDOI
TL;DR: This work presents a general framework for Bayesian analysis of categorical data arising from a latent multinomial distribution and illustrates the approach using two data sets with individual misidentification, one simulated, the other summarizing recapture data for salamanders based on natural marks.
Abstract: Natural tags based on DNA fingerprints or natural features of animals are now becoming very widely used in wildlife population biology. However, classic capture-recapture models do not allow for misidentification of animals which is a potentially very serious problem with natural tags. Statistical analysis of misidentification processes is extremely difficult using traditional likelihood methods but is easily handled using Bayesian methods. We present a general framework for Bayesian analysis of categorical data arising from a latent multinomial distribution. Although our work is motivated by a specific model for misidentification in closed population capture-recapture analyses, with crucial assumptions which may not always be appropriate, the methods we develop extend naturally to a variety of other models with similar structure. Suppose that observed frequencies f are a known linear transformation f = A'x of a latent multinomial variable x with cell probability vector π = π(θ). Given that full conditional distributions [θ | x] can be sampled, implementation of Gibbs sampling requires only that we can sample from the full conditional distribution [x | f, θ], which is made possible by knowledge of the null space of A'. We illustrate the approach using two data sets with individual misidentification, one simulated, the other summarizing recapture data for salamanders based on natural marks.

Journal ArticleDOI
TL;DR: A pairwise variable selection method for high-dimensional model-based clustering is proposed, based on a new pairwise penalty, that performs better than alternative approaches that use ℓ1 and ℓ∞ penalties and offers better interpretation.
Abstract: Variable selection for clustering is an important and challenging problem in high-dimensional data analysis. Existing variable selection methods for model-based clustering select informative variables in a "one-in-all-out" manner; that is, a variable is selected if at least one pair of clusters is separable by this variable and removed if it cannot separate any of the clusters. In many applications, however, it is of interest to further establish exactly which clusters are separable by each informative variable. To address this question, we propose a pairwise variable selection method for high-dimensional model-based clustering. The method is based on a new pairwise penalty. Results on simulated and real data show that the new method performs better than alternative approaches that use ℓ1 and ℓ∞ penalties and offers better interpretation.

Journal ArticleDOI
TL;DR: An outcome‐adaptive Bayesian design is proposed for choosing the optimal dose pair of a chemotherapeutic agent and a biological agent used in combination in a phase I/II clinical trial.
Abstract: An outcome-adaptive Bayesian design is proposed for choosing the optimal dose pair of a chemotherapeutic agent and a biological agent used in combination in a phase I/II clinical trial. Patient outcome is characterized as a vector of two ordinal variables accounting for toxicity and treatment efficacy. A generalization of the Aranda-Ordaz model (1981, Biometrika 68, 357-363) is used for the marginal outcome probabilities as functions of a dose pair, and a Gaussian copula is assumed to obtain joint distributions. Numerical utilities of all elementary patient outcomes, allowing the possibility that efficacy is inevaluable due to severe toxicity, are obtained using an elicitation method aimed to establish consensus among the physicians planning the trial. For each successive patient cohort, a dose pair is chosen to maximize the posterior mean utility. The method is illustrated by a trial in bladder cancer, including simulation studies of the method's sensitivity to prior parameters, the numerical utilities, correlation between the outcomes, sample size, cohort size, and starting dose pair.

Journal ArticleDOI
TL;DR: A sample size formula is derived that accounts for two levels of clustering: that of subjects within clusters and that of evaluations within subjects. The formula reveals that sample size is inflated, relative to a design with completely independent evaluations, by a product of two variance inflation factors.
Abstract: Cluster randomized trials in health care may involve three instead of two levels, for instance, in trials where different interventions to improve quality of care are compared. In such trials, the intervention is implemented in health care units ("clusters") and aims at changing the behavior of health care professionals working in this unit ("subjects"), while the effects are measured at the patient level ("evaluations"). Within the generalized estimating equations approach, we derive a sample size formula that accounts for two levels of clustering: that of subjects within clusters and that of evaluations within subjects. The formula reveals that sample size is inflated, relative to a design with completely independent evaluations, by a multiplicative term that can be expressed as a product of two variance inflation factors, one that quantifies the impact of within-subject correlation of evaluations on the variance of subject-level means and the other that quantifies the impact of the correlation between subject-level means on the variance of the cluster means. Power levels as predicted by the sample size formula agreed well with the simulated power for more than 10 clusters in total, when data were analyzed using bias-corrected estimating equations for the correlation parameters in combination with the model-based covariance estimator or the sandwich estimator with a finite sample correction.
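
One way to read the design-effect structure described above is sketched below. The mapping from the model's correlation parameters to the correlation between subject-level means follows the paper and is not reproduced here; the function, its argument names, and the example numbers are illustrative assumptions.

def inflated_sample_size(n_indep, n_eval, rho_eval, n_subj, rho_subj_mean):
    """Inflate the number of evaluations needed under independence by the
    product of two variance inflation factors (a hypothetical reading of
    the structure described in the abstract).
    n_indep       : evaluations needed if all evaluations were independent
    n_eval        : evaluations per subject
    rho_eval      : within-subject correlation of evaluations
    n_subj        : subjects per cluster
    rho_subj_mean : correlation between subject-level means within a cluster"""
    vif_within_subject = 1 + (n_eval - 1) * rho_eval
    vif_within_cluster = 1 + (n_subj - 1) * rho_subj_mean
    return n_indep * vif_within_subject * vif_within_cluster

# e.g. 200 independent evaluations, 20 evaluations/subject, 10 subjects/cluster
print(inflated_sample_size(200, 20, 0.05, 10, 0.1))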

Journal ArticleDOI
TL;DR: A log-linear model is used to directly model the association between the potential outcomes of S and T through odds ratios; prior distributions encoding beliefs plausible in the surrogate context are incorporated to reduce the nonidentifiability problem and increase the precision of statistical inferences.
Abstract: A surrogate marker (S) is a variable that can be measured earlier and often more easily than the true endpoint (T) in a clinical trial. Most previous research has been devoted to developing surrogacy measures to quantify how well S can replace T or examining the use of S in predicting the effect of a treatment (Z). However, the research often requires one to fit models for the distribution of T given S and Z. It is well known that such models do not have causal interpretations because the models condition on a postrandomization variable S. In this article, we directly model the relationship among T, S, and Z using a potential outcomes framework introduced by Frangakis and Rubin (2002, Biometrics 58, 21-29). We propose a Bayesian estimation method to evaluate the causal probabilities associated with the cross-classification of the potential outcomes of S and T when S and T are both binary. We use a log-linear model to directly model the association between the potential outcomes of S and T through the odds ratios. The quantities derived from this approach always have causal interpretations. However, this causal model is not identifiable from the data without additional assumptions. To reduce the nonidentifiability problem and increase the precision of statistical inferences, we assume monotonicity and incorporate prior belief that is plausible in the surrogate context by using prior distributions. We also explore the relationship among the surrogacy measures based on traditional models and this counterfactual model. The method is applied to the data from a glaucoma treatment study.

Journal ArticleDOI
TL;DR: A hierarchical model for the probability of dose-limiting toxicity (DLT) for combinations of doses of two therapeutic agents is proposed and methods for generating prior distributions for the parameters in this model are described from a basic set of information elicited from clinical investigators.
Abstract: We propose a hierarchical model for the probability of dose-limiting toxicity (DLT) for combinations of doses of two therapeutic agents. We apply this model to an adaptive Bayesian trial algorithm whose goal is to identify combinations with DLT rates close to a prespecified target rate. We describe methods for generating prior distributions for the parameters in our model from a basic set of information elicited from clinical investigators. We survey the performance of our algorithm in a series of simulations of a hypothetical trial that examines combinations of four doses of two agents. We also compare the performance of our approach to two existing methods and assess the sensitivity of our approach to the chosen prior distribution.

Journal ArticleDOI
TL;DR: A fully model-based approach for the analysis of distance sampling data is presented; it allows complex and opportunistic transect designs to be employed, allows estimation of abundance in small subregions, and provides a framework to assess the effects of habitat or experimental manipulation on density.
Abstract: We consider a fully model-based approach for the analysis of distance sampling data. Distance sampling has been widely used to estimate abundance (or density) of animals or plants in a spatially explicit study area. There is, however, no readily available method of making statistical inference on the relationships between abundance and environmental covariates. Spatial Poisson process likelihoods can be used to simultaneously estimate detection and intensity parameters by modeling distance sampling data as a thinned spatial point process. A model-based spatial approach to distance sampling data has three main benefits: it allows complex and opportunistic transect designs to be employed, it allows estimation of abundance in small subregions, and it provides a framework to assess the effects of habitat or experimental manipulation on density. We demonstrate the model-based methodology with a small simulation study and analysis of the Dubbo weed data set. In addition, a simple ad hoc method for handling overdispersion is also proposed. The simulation study showed that the model-based approach compared favorably to conventional distance sampling methods for abundance estimation. In addition, the overdispersion correction performed adequately when the number of transects was high. Analysis of the Dubbo data set indicated a transect effect on abundance via Akaike's information criterion model selection. Further goodness-of-fit analysis, however, indicated some potential confounding of intensity with the detection function.
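
The thinned-point-process likelihood at the heart of this approach can be sketched compactly. The version below is a simplified assumption-laden illustration: a log-linear intensity in a single covariate, a half-normal detection function of perpendicular distance, and a grid approximation to the area integral; transect geometry and the paper's overdispersion correction are omitted.

import numpy as np

def thinned_pp_loglik(params, obs_dist, obs_covar, grid_dist, grid_covar, cell_area):
    """Log-likelihood of a thinned spatial Poisson process for distance sampling:
    intensity exp(beta0 + beta1 * covariate) thinned by a half-normal detection
    function of perpendicular distance; the integral over the study area is
    approximated on a user-supplied grid."""
    beta0, beta1, log_sigma = params
    sigma = np.exp(log_sigma)

    def detect(d):                      # half-normal detection probability
        return np.exp(-d ** 2 / (2 * sigma ** 2))

    def intensity(x):                   # log-linear intensity in a covariate
        return np.exp(beta0 + beta1 * x)

    # contribution of the observed (detected) points
    ll_points = np.sum(np.log(intensity(obs_covar) * detect(obs_dist)))
    # approximate integral of the thinned intensity over the study area
    ll_integral = np.sum(intensity(grid_covar) * detect(grid_dist)) * cell_area
    return ll_points - ll_integral

The negative of this function can be passed to a numerical optimizer (e.g., scipy.optimize.minimize) to estimate the detection and intensity parameters jointly.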

Journal ArticleDOI
TL;DR: It is concluded that large bias may result if the position of samplers is not randomized and analysis methods fail to account for nonuniformity.
Abstract: Distance sampling is a widely used methodology for assessing animal abundance. A key requirement of distance sampling is that samplers (lines or points) are placed according to a randomized design, which ensures that samplers are positioned independently of animals. Often samplers are placed along linear features such as roads, so that bias is expected if animals are not uniformly distributed with respect to distance from the linear feature. We present an approach for analyzing distance data from a survey when the samplers are points placed along a linear feature. Based on results from a simulation study and from a survey of Irish hares in Northern Ireland conducted from roads, we conclude that large bias may result if the position of samplers is not randomized, and analysis methods fail to account for nonuniformity.

Journal ArticleDOI
TL;DR: This article jointly models HIV viral dynamics and time to decrease in the CD4/CD8 ratio in the presence of a CD4 process with measurement errors, and estimates the model parameters simultaneously via a method based on a Laplace approximation and the commonly used Monte Carlo EM algorithm.
Abstract: In an attempt to provide a tool to assess antiretroviral therapy and to monitor disease progression, this article studies the association of human immunodeficiency virus (HIV) viral suppression and immune restoration. The data from a recent acquired immune deficiency syndrome (AIDS) study are used for illustration. We jointly model HIV viral dynamics and time to decrease in the CD4/CD8 ratio in the presence of a CD4 process with measurement errors, and estimate the model parameters simultaneously via a method based on a Laplace approximation and the commonly used Monte Carlo EM algorithm. The approaches and many of the points presented apply generally.

Journal ArticleDOI
TL;DR: This article proposes a multiple-imputation-based approach for creating multiple versions of the completed data set under the assumed joint model so that residuals and diagnostic plots for the complete data model can be calculated based on these imputed data sets.
Abstract: The majority of the statistical literature for the joint modeling of longitudinal and time-to-event data has focused on the development of models that aim at capturing specific aspects of the motivating case studies. However, little attention has been given to the development of diagnostic and model-assessment tools. The main difficulty in using standard model diagnostics in joint models is the nonrandom dropout in the longitudinal outcome caused by the occurrence of events. In particular, the reference distribution of statistics, such as the residuals, in missing data settings is not directly available and complex calculations are required to derive it. In this article, we propose a multiple-imputation-based approach for creating multiple versions of the completed data set under the assumed joint model. Residuals and diagnostic plots for the complete data model can then be calculated based on these imputed data sets. Our proposals are exemplified using two real data sets.

Journal ArticleDOI
TL;DR: A double-penalized likelihood approach is proposed for simultaneous model selection and estimation in semiparametric mixed models for longitudinal data; it provides valid inference for data that are missing at random and is more efficient if the specified model is correct.
Abstract: We propose a double-penalized likelihood approach for simultaneous model selection and estimation in semiparametric mixed models for longitudinal data. Two types of penalties are jointly imposed on the ordinary log-likelihood: the roughness penalty on the nonparametric baseline function and a nonconcave shrinkage penalty on linear coefficients to achieve model sparsity. Compared to existing estimating-equation-based approaches, our procedure provides valid inference for data that are missing at random, and will be more efficient if the specified model is correct. Another advantage of the new procedure is its easy computation for both regression components and variance parameters. We show that the double-penalized problem can be conveniently reformulated into a linear mixed model framework, so that existing software can be directly used to implement our method. For the purpose of model inference, we derive both frequentist and Bayesian variance estimation for estimated parametric and nonparametric components. Simulation is used to evaluate and compare the performance of our method to the existing ones. We then apply the new method to a real data set from a lactation study.

Journal ArticleDOI
TL;DR: By modeling the covariance in detection probabilities with distance, this framework can be used to provide more reliable analysis of double-observer line transect data and is illustrated through analysis of minke whale sightings data from the North Sea and adjacent waters.
Abstract: Double-observer line transect methods are becoming increasingly widespread, especially for the estimation of marine mammal abundance from aerial and shipboard surveys when detection of animals on the line is uncertain. The resulting data supplement conventional distance sampling data with two-sample mark-recapture data. Like conventional mark-recapture data, these have inherent problems for estimating abundance in the presence of heterogeneity. Unlike conventional mark-recapture methods, line transect methods use knowledge of the distribution of a covariate, which affects detection probability (namely, distance from the transect line) in inference. This knowledge can be used to diagnose unmodeled heterogeneity in the mark-recapture component of the data. By modeling the covariance in detection probabilities with distance, we show how the estimation problem can be formulated in terms of different levels of independence. At one extreme, full independence is assumed, as in the Petersen estimator (which does not use distance data); at the other extreme, independence only occurs in the limit as detection probability tends to one. Between the two extremes, there is a range of models, including those currently in common use, which have intermediate levels of independence. We show how this framework can be used to provide more reliable analysis of double-observer line transect data. We test the methods by simulation, and by analysis of a dataset for which true abundance is known. We illustrate the approach through analysis of minke whale sightings data from the North Sea and adjacent waters.

Journal ArticleDOI
TL;DR: A clinical trial with a primary and a secondary endpoint is considered in which the secondary endpoint is tested only if the primary endpoint is significant, and an ad hoc boundary is proposed for the secondary endpoint to address a practical concern that may be at issue in some applications.
Abstract: We consider a clinical trial with a primary and a secondary endpoint where the secondary endpoint is tested only if the primary endpoint is significant. The trial uses a group sequential procedure with two stages. The familywise error rate (FWER) of falsely concluding significance on either endpoint is to be controlled at a nominal level α. The type I error rate for the primary endpoint is controlled by choosing any α-level stopping boundary, e.g., the standard O'Brien-Fleming or the Pocock boundary. Given any particular α-level boundary for the primary endpoint, we study the problem of determining the boundary for the secondary endpoint to control the FWER. We study this FWER analytically and numerically and find that it is maximized when the correlation coefficient ρ between the two endpoints equals 1. For the four combinations consisting of O'Brien-Fleming and Pocock boundaries for the primary and secondary endpoints, the critical constants required to control the FWER are computed for different values of ρ. An ad hoc boundary is proposed for the secondary endpoint to address a practical concern that may be at issue in some applications. Numerical studies indicate that the O'Brien-Fleming boundary for the primary endpoint and the Pocock boundary for the secondary endpoint generally gives the best primary as well as secondary power performance. The Pocock boundary may be replaced by the ad hoc boundary for the secondary endpoint with very little loss of secondary power if the practical concern is at issue. A clinical trial example is given to illustrate the methods.
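
The gatekeeping rule described above is easy to probe by simulation. The Monte Carlo sketch below estimates the type I error rate for a true secondary null when the primary has a positive drift; the boundary values, drift, and information fractions are illustrative inputs, not the paper's critical constants.

import numpy as np

def secondary_type1_error(c_primary, c_secondary, rho, delta1=3.0,
                          info=(0.5, 1.0), n_sim=200_000, seed=1):
    """Estimate the probability of falsely rejecting the secondary endpoint
    under a two-stage procedure in which the secondary is tested only at the
    stage where the primary stops with a rejection.  The two endpoints'
    score processes have stage-wise correlation rho; delta1 is the primary
    drift on the score scale and the secondary null is true."""
    rng = np.random.default_rng(seed)
    t1, t2 = info
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # independent stage-wise increments of the two correlated score processes
    inc1 = rng.multivariate_normal([0.0, 0.0], cov, size=n_sim) * np.sqrt(t1)
    inc2 = rng.multivariate_normal([0.0, 0.0], cov, size=n_sim) * np.sqrt(t2 - t1)
    z1_prim = (inc1[:, 0] + delta1 * t1) / np.sqrt(t1)
    z1_sec = inc1[:, 1] / np.sqrt(t1)
    z2_prim = (inc1[:, 0] + inc2[:, 0] + delta1 * t2) / np.sqrt(t2)
    z2_sec = (inc1[:, 1] + inc2[:, 1]) / np.sqrt(t2)
    stop1 = z1_prim >= c_primary[0]                  # primary rejected at stage 1
    stop2 = ~stop1 & (z2_prim >= c_primary[1])       # primary rejected at stage 2
    false_rej = (stop1 & (z1_sec >= c_secondary[0])) | \
                (stop2 & (z2_sec >= c_secondary[1]))
    return false_rej.mean()

# illustrative boundaries (roughly O'Brien-Fleming-like primary, Pocock-like secondary)
print(secondary_type1_error(c_primary=(2.80, 1.98), c_secondary=(2.18, 2.18), rho=1.0))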

Journal ArticleDOI
TL;DR: This article proposes a novel semiparametric inference procedure that depends on neither the frailty nor the censoring time distribution, and incorporates both time-dependent and time-independent covariates in the formulation.
Abstract: Recurrent event data analyses are usually conducted under the assumption that the censoring time is independent of the recurrent event process. In many applications the censoring time can be informative about the underlying recurrent event process, especially in situations where a correlated failure event could potentially terminate the observation of recurrent events. In this article, we consider a semiparametric model of recurrent event data that allows correlations between censoring times and recurrent event process via frailty. This flexible framework incorporates both time-dependent and time-independent covariates in the formulation, while leaving the distributions of frailty and censoring times unspecified. We propose a novel semiparametric inference procedure that depends on neither the frailty nor the censoring time distribution. Large sample properties of the regression parameter estimates and the estimated baseline cumulative intensity functions are studied. Numerical studies demonstrate that the proposed methodology performs well for realistic sample sizes. An analysis of hospitalization data for patients in an AIDS cohort study is presented to illustrate the proposed method.

Journal ArticleDOI
TL;DR: A procedure extending that of Benjamini and Yekutieli based on the Bonferroni test for each gene is developed, and a proof is given for its mdFDR control when the underlying test statistics are independent across the genes.
Abstract: Microarray gene expression studies over ordered categories are routinely conducted to gain insights into biological functions of genes and the underlying biological processes. Some common experiments are time-course/dose-response experiments where a tissue or cell line is exposed to different doses and/or durations of time to a chemical. A goal of such studies is to identify gene expression patterns/profiles over the ordered categories. This problem can be formulated as a multiple testing problem where for each gene the null hypothesis of no difference between the successive mean gene expressions is tested and further directional decisions are made if it is rejected. Much of the existing multiple testing procedures are devised for controlling the usual false discovery rate (FDR) rather than the mixed directional FDR (mdFDR), the expected proportion of Type I and directional errors among all rejections. Benjamini and Yekutieli (2005, Journal of the American Statistical Association 100, 71-93) proved that an augmentation of the usual Benjamini-Hochberg (BH) procedure can control the mdFDR while testing simple null hypotheses against two-sided alternatives in terms of one-dimensional parameters. In this article, we consider the problem of controlling the mdFDR involving multidimensional parameters. To deal with this problem, we develop a procedure extending that of Benjamini and Yekutieli based on the Bonferroni test for each gene. A proof is given for its mdFDR control when the underlying test statistics are independent across the genes. The results of a simulation study evaluating its performance under independence as well as under dependence of the underlying test statistics across the genes relative to other relevant procedures are reported. Finally, the proposed methodology is applied to a time-course microarray data obtained by Lobenhofer et al. (2002, Molecular Endocrinology 16, 1215-1229). We identified several important cell-cycle genes, such as DNA replication/repair gene MCM4 and replication factor subunit C2, which were not identified by the previous analyses of the same data by Lobenhofer et al. (2002) and Peddada et al. (2003, Bioinformatics 19, 834-841). Although some of our findings overlap with previous findings, we identify several other genes that complement the results of Lobenhofer et al. (2002).
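
The two-step logic outlined above (gene-level Bonferroni combination, Benjamini-Hochberg screening across genes, then directional calls) can be sketched schematically as follows. This is a simplified illustration with an invented directional cutoff, not the exact augmented critical constants that the paper proves control the mdFDR.

import numpy as np

def gene_level_mdfdr_calls(pvals, signs, alpha=0.05):
    """Schematic of a Bonferroni-then-BH screening with directional calls.
    pvals : (genes, comparisons) array of two-sided p-values for the
            successive pairwise comparisons within each gene
    signs : (genes, comparisons) array of estimated directions (+1/-1)"""
    m, k = pvals.shape
    bonf = np.minimum(k * pvals.min(axis=1), 1.0)       # gene-level Bonferroni p-value
    # Benjamini-Hochberg step across genes
    order = np.argsort(bonf)
    thresh = alpha * np.arange(1, m + 1) / m
    passed = bonf[order] <= thresh
    n_rej = passed.nonzero()[0].max() + 1 if passed.any() else 0
    selected = np.zeros(m, dtype=bool)
    selected[order[:n_rej]] = True
    # directional calls within selected genes (illustrative cutoff only)
    cutoff = (n_rej * alpha) / (m * k) if n_rej else 0.0
    calls = np.zeros_like(pvals, dtype=int)
    calls[selected] = np.where(pvals[selected] <= cutoff, signs[selected], 0)
    return selected, calls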

Journal ArticleDOI
TL;DR: It is shown that the rank abundance distribution (RAD) representation of the data provides a convenient method for quantifying the three attributes constituting biodiversity, and a statistical framework is presented for modeling RADs that allows their multivariate distribution to vary according to environmental gradients.
Abstract: Biodiversity is an important topic of ecological research. A common form of data collected to investigate patterns of biodiversity is the number of individuals of each species at a series of locations. These data contain information on the number of individuals (abundance), the number of species (richness), and the relative proportion of each species within the sampled assemblage (evenness). If there are enough sampled locations across an environmental gradient then the data should contain information on how these three attributes of biodiversity change over gradients. We show that the rank abundance distribution (RAD) representation of the data provides a convenient method for quantifying these three attributes constituting biodiversity. We present a statistical framework for modeling RADs and allow their multivariate distribution to vary according to environmental gradients. The method relies on three models: a negative binomial model, a truncated negative binomial model, and a novel model based on a modified Dirichlet-multinomial that allows for a particular type of heterogeneity observed in RAD data. The method is motivated by, and applied to, a large-scale marine survey off the coast of Western Australia, Australia. It provides a rich description of biodiversity and how it changes with environmental conditions.