
Showing papers in "Biometrics in 2004"


Journal ArticleDOI
TL;DR: A class of models (N-mixture models) is described that allows estimation of population size from spatially replicated count data by viewing site-specific population sizes, N, as independent random variables distributed according to some mixing distribution (e.g., Poisson).
Abstract: Spatial replication is a common theme in count surveys of animals. Such surveys often generate sparse count data from which it is difficult to estimate population size while formally accounting for detection probability. In this article, I describe a class of models (N-mixture models) which allow for estimation of population size from such data. The key idea is to view site-specific population sizes, N, as independent random variables distributed according to some mixing distribution (e.g., Poisson). Prior parameters are estimated from the marginal likelihood of the data, having integrated over the prior distribution for N. Carroll and Lombard (1985, Journal of the American Statistical Association 80, 423-426) proposed a class of estimators based on mixing over a prior distribution for detection probability. Their estimator can be applied in limited settings, but is sensitive to prior parameter values that are fixed a priori. Spatial replication provides additional information regarding the parameters of the prior distribution on N that is exploited by the N-mixture models and which leads to reasonable estimates of abundance from sparse data. A simulation study demonstrates superior operating characteristics (bias, confidence interval coverage) of the N-mixture estimator compared to the Carroll and Lombard estimator. Both estimators are applied to point count data on six species of birds, illustrating the sensitivity to the choice of prior on p and the substantially different estimates of abundance that result.
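To make the key idea concrete, here is a minimal sketch (not the author's code) of the N-mixture marginal log-likelihood for spatially replicated counts: site abundance N is taken as Poisson, counts as binomial given N and a detection probability p, and N is summed out up to a truncation bound K. The function name, data layout, and bound K are illustrative assumptions.

```python
# Hedged sketch of an N-mixture marginal log-likelihood:
# N_i ~ Poisson(lam), y_it | N_i ~ Binomial(N_i, p), with N_i summed out up to K.
import numpy as np
from scipy.stats import poisson, binom

def nmixture_loglik(counts, lam, p, K=200):
    """counts: (n_sites, n_visits) array of replicated counts."""
    counts = np.asarray(counts)
    loglik = 0.0
    for y in counts:                          # loop over sites
        Ns = np.arange(y.max(), K + 1)        # feasible abundances at this site
        prior = poisson.pmf(Ns, lam)          # mixing distribution on N
        detect = np.prod(binom.pmf(y[:, None], Ns[None, :], p), axis=0)
        loglik += np.log(np.sum(prior * detect))
    return loglik
```

Maximizing such a function over (lam, p), e.g., with a generic numerical optimizer, gives the kind of estimates whose operating characteristics the simulation study examines.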

1,291 citations


Journal ArticleDOI
D. Edwards
TL;DR: Chapters on model-based analysis of oligonucleotide arrays, issues in cDNA microarray analysis, and the design and analysis of comparative microarray experiments are presented.
Abstract: Contents: Model-Based Analysis of Oligonucleotide Arrays and Issues in cDNA Microarray Analysis (Cheng Li, George C. Tseng, and Wing Hung Wong): Model-Based Analysis of Oligonucleotide Arrays; Issues in cDNA Microarray Analysis; Acknowledgments. Design and Analysis of Comparative Microarray Experiments (Yee Hwa Yang and Terry Speed): Introduction; Experimental Design; Two-Sample Comparisons; Single-Factor Experiments with More than Two Levels; Factorial Experiments; Some Topics for Further Research. Classification in Microarray Experiments (Sandrine Dudoit and Jane Fridlyand): Introduction; Overview of Different Classifiers; General Issues in Classification; Performance Assessment; Aggregating Predictors; Datasets; Results; Discussion; Software and Datasets; Acknowledgments. Clustering Microarray Data (Hugh Chipman, Trevor J. Hastie, and Robert Tibshirani): An Example; Dissimilarity; Clustering Methods; Partitioning Methods; Hierarchical Methods; Two-Way Clustering; Principal Components, the SVD, and Gene Shaving; Other Approaches; Software. References. Index.

523 citations


Journal ArticleDOI
TL;DR: Computer simulations show that the new adaptive Bayesian method for dose-finding in phase I/II clinical trials, based on trade-offs between the probabilities of treatment efficacy and toxicity, has high probabilities of making correct decisions and treats most patients at doses with desirable efficacy-toxicity trade-offs.
Abstract: We present an adaptive Bayesian method for dose-finding in phase I/II clinical trials based on trade-offs between the probabilities of treatment efficacy and toxicity. The method accommodates either trinary or bivariate binary outcomes, as well as efficacy probabilities that possibly are nonmonotone in dose. Doses are selected for successive patient cohorts based on a set of efficacy-toxicity trade-off contours that partition the two-dimensional outcome probability domain. Priors are established by solving for hyperparameters that optimize the fit of the model to elicited mean outcome probabilities. For trinary outcomes, the new algorithm is compared to the method of Thall and Russell (1998, Biometrics 54, 251-264) by application to a trial of rapid treatment for ischemic stroke. The bivariate binary outcome case is illustrated by a trial of graft-versus-host disease treatment in allogeneic bone marrow transplantation. Computer simulations show that, under a wide range of dose-outcome scenarios, the new method has high probabilities of making correct decisions and treats most patients at doses with desirable efficacy-toxicity trade-offs.

420 citations


Journal ArticleDOI
TL;DR: This model avoids the difficulties encountered in alternative approaches which attempt to specify a dependent joint distribution with marginal proportional hazards and yields an estimate of the degree of dependence.
Abstract: There has been an increasing interest in the analysis of recurrent event data (Cook and Lawless, 2002, Statistical Methods in Medical Research 11, 141-166). In many situations, a terminating event such as death can happen during the follow-up period to preclude further occurrence of the recurrent events. Furthermore, the death time may be dependent on the recurrent event history. In this article we consider frailty proportional hazards models for the recurrent and terminal event processes. The dependence is modeled by conditioning on a shared frailty that is included in both hazard functions. Covariate effects can be taken into account in the model as well. Maximum likelihood estimation and inference are carried out through a Monte Carlo EM algorithm with Metropolis-Hastings sampler in the E-step. An analysis of hospitalization and death data for waitlisted dialysis patients is presented to illustrate the proposed methods. Methods to check the validity of the proposed model are also demonstrated. This model avoids the difficulties encountered in alternative approaches which attempt to specify a dependent joint distribution with marginal proportional hazards and yields an estimate of the degree of dependence.
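As a rough sketch of the kind of shared-frailty specification described here (a generic form stated as an assumption, not necessarily the authors' exact model), the recurrent-event intensity and the death hazard for subject i, conditional on a frailty u_i, might be written as

$$ r_i(t \mid u_i) = u_i \, r_0(t)\, e^{\beta^{\top} Z_i}, \qquad \lambda_i(t \mid u_i) = u_i^{\gamma}\, \lambda_0(t)\, e^{\alpha^{\top} Z_i}, $$

so that the same frailty enters both hazards and a parameter such as $\gamma$ indexes the direction and degree of dependence between the recurrent events and death.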

316 citations


Journal ArticleDOI
TL;DR: A procedure is proposed to compare two segmented line regression functions, specifically to test whether the two functions are identical or whether the two mean functions are parallel, allowing different intercepts.
Abstract: Segmented line regression models, composed of continuous linear phases, have been applied to describe changes in rate trends. In this article we propose a procedure to compare two segmented line regression functions, specifically to test (i) whether the two segmented line regression functions are identical, or (ii) whether the two mean functions are parallel, allowing different intercepts. A general form of the test statistic is described, and a permutation procedure is then proposed to estimate the p-value of the test. The permutation test is compared with an approximate F test in terms of p-value estimation, and the performance of the permutation test is studied by simulation. The tests are applied to compare female lung cancer mortality rates between two census regions, and also to compare female breast cancer mortality rates between two states.
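Since the p-value is obtained by permutation, a short generic sketch may help; the function names and the form of the test statistic are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of a permutation p-value: shuffle the labels assigning
# observations to the two regression functions and recompute the statistic.
import numpy as np

def permutation_pvalue(x, y, groups, stat, n_perm=999, seed=None):
    rng = np.random.default_rng(seed)
    observed = stat(x, y, groups)          # e.g., an F-type comparison statistic
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(groups)     # permute the group labels
        if stat(x, y, perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)     # standard permutation estimate
```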

219 citations


Journal ArticleDOI
TL;DR: It is demonstrated that with small numbers of tests, likelihood comparisons and other model diagnostics may not be able to distinguish between models with different dependence structures, and estimators of sensitivity, specificity, and prevalence can be biased.
Abstract: Modeling diagnostic error without a gold standard has been an active area of biostatistical research. In a majority of the approaches, model-based estimates of sensitivity, specificity, and prevalence are derived from a latent class model in which the latent variable represents an individual's true unobserved disease status. For simplicity, initial approaches assumed that the diagnostic test results on the same subject were independent given the true disease status (i.e., the conditional independence assumption). More recently, various authors have proposed approaches for modeling the dependence structure between test results given true disease status. This note discusses a potential problem with these approaches. Namely, we show that when the conditional dependence between tests is misspecified, estimators of sensitivity, specificity, and prevalence can be biased. Importantly, we demonstrate that with small numbers of tests, likelihood comparisons and other model diagnostics may not be able to distinguish between models with different dependence structures. We present asymptotic results that show the generality of the problem. Further, data analysis and simulations demonstrate the practical implications of model misspecification. Finally, we present some guidelines about the use of these models for practitioners.

208 citations


Journal ArticleDOI
TL;DR: A new type of multivariate logistic distribution is proposed that can be used to construct a likelihood for multivariate logistic regression analysis of binary and categorical data; a Bayesian approach to estimation and inference is followed, with an efficient data augmentation algorithm developed for posterior computation.
Abstract: Bayesian analyses of multivariate binary or categorical outcomes typically rely on probit or mixed effects logistic regression models that do not have a marginal logistic structure for the individual outcomes. In addition, difficulties arise when simple noninformative priors are chosen for the covariance parameters. Motivated by these problems, we propose a new type of multivariate logistic distribution that can be used to construct a likelihood for multivariate logistic regression analysis of binary and categorical data. The model for individual outcomes has a marginal logistic structure, simplifying interpretation. We follow a Bayesian approach to estimation and inference, developing an efficient data augmentation algorithm for posterior computation. The method is illustrated with application to a neurotoxicology study.

168 citations


Journal ArticleDOI
TL;DR: The method makes use of mixture priors and Markov chain Monte Carlo techniques to select sets of variables that differ among the classes and applies the methodology to a problem in functional genomics using gene expression profiling data.
Abstract: Here we focus on discrimination problems where the number of predictors substantially exceeds the sample size and we propose a Bayesian variable selection approach to multinomial probit models. Our method makes use of mixture priors and Markov chain Monte Carlo techniques to select sets of variables that differ among the classes. We apply our methodology to a problem in functional genomics using gene expression profiling data. The aim of the analysis is to identify molecular signatures that characterize two different stages of rheumatoid arthritis.
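A standard spike-and-slab form of such a mixture prior (given here as a generic sketch, not necessarily the authors' exact specification) puts, for each predictor j,

$$ \beta_j \mid \gamma_j \sim \gamma_j\, N(0, c\,\sigma^2) + (1 - \gamma_j)\, \delta_0, \qquad \gamma_j \sim \mathrm{Bernoulli}(w), $$

so that Markov chain Monte Carlo sampling over the inclusion indicators $\gamma_j$ identifies which variables separate the classes.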

148 citations


Journal ArticleDOI
TL;DR: A new framework for Bayesian isotonic regression and order-restricted inference for categorical outcomes and multiple predictors is proposed, and the approach is applied to an epidemiology application.
Abstract: In many applications, the mean of a response variable can be assumed to be a nondecreasing function of a continuous predictor, controlling for covariates. In such cases, interest often focuses on estimating the regression function, while also assessing evidence of an association. This article proposes a new framework for Bayesian isotonic regression and order-restricted inference. Approximating the regression function with a high-dimensional piecewise linear model, the nondecreasing constraint is incorporated through a prior distribution for the slopes consisting of a product mixture of point masses (accounting for flat regions) and truncated normal densities. To borrow information across the intervals and smooth the curve, the prior is formulated as a latent autoregressive normal process. This structure facilitates efficient posterior computation, since the full conditional distributions of the parameters have simple conjugate forms. Point and interval estimates of the regression function and posterior probabilities of an association for different regions of the predictor can be estimated from a single MCMC run. Generalizations to categorical outcomes and multiple predictors are described, and the approach is applied to an epidemiology application.

146 citations


Journal ArticleDOI
TL;DR: This work proposes a three-level hierarchical mixed model of adverse events that allows for borrowing across body systems, with greater potential, depending on the actual data, for borrowing within each body system.
Abstract: Multiple comparisons and other multiplicities are among the most difficult of problems that face statisticians, frequentist and Bayesian alike. An example is the analysis of the many types of adverse events (AEs) that are recorded in drug clinical trials. We propose a three-level hierarchical mixed model. The most basic level is type of AE. The second level is body system, each of which contains a number of types of possibly related AEs. The highest level is the collection of all body systems. Our analysis allows for borrowing across body systems, but there is greater potential, depending on the actual data, for borrowing within each body system. The probability that a drug has caused a type of AE is greater if its rate is elevated for several types of AEs within the same body system than if the AEs with elevated rates were in different body systems. We give examples to illustrate our method and we describe its application to other types of problems.

139 citations


Journal ArticleDOI
TL;DR: Using simulations, it is shown that, when the markers are independent and when they are correlated, the two-stage approach provides a substantial reduction in the total number of marker evaluations for a minimal loss of power.
Abstract: Gene-disease association studies based on case-control designs may often be used to identify candidate polymorphisms (markers) conferring disease risk. If a large number of markers are studied, genotyping all markers on all samples is inefficient in resource utilization. Here, we propose an alternative two-stage method to identify disease-susceptibility markers. In the first stage all markers are evaluated on a fraction of the available subjects. The most promising markers are then evaluated on the remaining individuals in Stage 2. This approach can be cost effective since markers unlikely to be associated with the disease can be eliminated in the first stage. Using simulations we show that, when the markers are independent and when they are correlated, the two-stage approach provides a substantial reduction in the total number of marker evaluations for a minimal loss of power. The power of the two-stage approach is evaluated when a single marker is associated with the disease, and in the presence of multiple disease-susceptibility markers. As a general guideline, the simulations over a wide range of parametric configurations indicate that evaluating all the markers on 50% of the individuals in Stage 1 and evaluating the most promising 10% of the markers on the remaining individuals in Stage 2 provides near-optimal power while resulting in a 45% decrease in the total number of marker evaluations.
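The stated 45% saving follows directly from the design arithmetic; the sketch below simply checks it for hypothetical numbers of markers and subjects (M and N are illustrative, not from the paper).

```python
# Cost of the two-stage design: all M markers on half the N subjects,
# then the most promising 10% of markers on the remaining subjects.
M, N = 1000, 2000                      # hypothetical marker and subject counts
one_stage = M * N                      # genotype everything on everyone
two_stage = M * (N // 2) + int(0.10 * M) * (N - N // 2)
print(1 - two_stage / one_stage)       # -> 0.45, i.e., a 45% reduction
```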

Journal ArticleDOI
TL;DR: Multivariate regression tree methodology is developed and illustrated in a study predicting the abundance of several cooccurring plant species in Missouri Ozark forests.
Abstract: Multivariate regression tree methodology is developed and illustrated in a study predicting the abundance of several cooccurring plant species in Missouri Ozark forests. The technique is a variation of the approach of Segal (1992) for longitudinal data. It has the potential to be applied to many different types of problems in which analysts want to predict the simultaneous cooccurrence of several dependent variables. Multivariate regression trees can also be used as an alternative to cluster analysis in situations where clusters are defined by a set of independent variables and the researcher wants clusters as homogeneous as possible with respect to a group of dependent variables.

Journal ArticleDOI
TL;DR: Simulation results indicate that confidence intervals based on the estimator proposed by Fleiss and Cuzick provide coverage levels close to nominal over a wide range of parameter combinations.
Abstract: We obtain closed-form asymptotic variance formulae for three point estimators of the intraclass correlation coefficient that may be applied to binary outcome data arising in clusters of variable size. Our results include as special cases those that have previously appeared in the literature (Fleiss and Cuzick, 1979, Applied Psychological Measurement 3, 537-542; Bloch and Kraemer, 1989, Biometrics 45, 269-287; Altaye, Donner, and Klar, 2001, Biometrics 57, 584-588). Simulation results indicate that confidence intervals based on the estimator proposed by Fleiss and Cuzick provide coverage levels close to nominal over a wide range of parameter combinations. Two examples are presented.

Journal ArticleDOI
TL;DR: A linear mixed model with a smooth random effects density is proposed and applied to the cholesterol data first analyzed by Zhang and Davidian; a simulation study shows that it yields almost unbiased estimates of the regression and smoothing parameters in small sample settings.
Abstract: A linear mixed model with a smooth random effects density is proposed. An approach similar to the P-spline smoothing of Eilers and Marx (1996, Statistical Science 11, 89-121) is applied to yield a more flexible estimate of the random effects density. Our approach differs from theirs in that the B-spline basis functions are replaced by approximating Gaussian densities. Fitting the model involves maximizing a penalized marginal likelihood. The best penalty parameters minimize Akaike's Information Criterion, employing the results of Gray (1992, Journal of the American Statistical Association 87, 942-951). Although our method is applicable to any dimension of the random effects structure, in this article the two-dimensional case is explored. Our methodology is conceptually simple and relatively easy to fit in practice; it is applied to the cholesterol data first analyzed by Zhang and Davidian (2001, Biometrics 57, 795-802). A simulation study shows that our approach yields almost unbiased estimates of the regression and the smoothing parameters in small sample settings. Consistency of the estimates is shown in a particular case.

Journal ArticleDOI
TL;DR: This work proposes multivariate multilevel nonlinear mixed effects models for describing several plot-level timber quantity characteristics simultaneously and describes how such models can be used to produce future predictions of timber volume (yield).
Abstract: Nonlinear mixed effects models have become important tools for growth and yield modeling in forestry. To date, applications have concentrated on modeling single growth variables such as tree height or bole volume. Here, we propose multivariate multilevel nonlinear mixed effects models for describing several plot-level timber quantity characteristics simultaneously. We describe how such models can be used to produce future predictions of timber volume (yield). The class of models and methods of estimation and prediction are developed and then illustrated on data from a University of Georgia study of the effects of various site preparation methods on the growth of slash pine (Pinus elliottii Engelm.).

Journal ArticleDOI
TL;DR: This work derives a best linear unbiased estimator (BLUE) of allele frequency, which is equivalent to the quasi-likelihood estimator for this problem, describes an efficient algorithm for computing the estimate and its variance, and applies the method to allele-frequency estimation in a Hutterite data set.
Abstract: Many types of genetic analyses depend on estimates of allele frequencies. We consider the problem of allele-frequency estimation based on data from related individuals. The motivation for this work is data collected on the Hutterites, an isolated founder population, so we focus particularly on the case in which the relationships among the sampled individuals are specified by a large, complex pedigree for which maximum likelihood estimation is impractical. For this case, we propose to use the best linear unbiased estimator (BLUE) of allele frequency. We derive this estimator, which is equivalent to the quasi-likelihood estimator for this problem, and we describe an efficient algorithm for computing the estimate and its variance. We show that our estimator has certain desirable small-sample properties in common with the maximum likelihood estimator (MLE) for this problem. We treat both the case when parental origin of each allele is known and when it is unknown. The results are extended to prediction of allele frequency in some set of individuals S based on genotype data collected on a set of individuals R. We compare the mean-squared error of the BLUE, the commonly used naive estimator (sample frequency) and the MLE when the latter is feasible to calculate. The results indicate that although the MLE performs the best of the three, the BLUE is close in performance to the MLE and is substantially easier to calculate, making it particularly useful for large complex pedigrees in which MLE calculation is impractical or infeasible. We apply our method to allele-frequency estimation in a Hutterite data set.
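For intuition, a minimal sketch of a best linear unbiased (generalized least squares) estimator of a common mean from correlated observations is given below; here Sigma stands in for a covariance matrix derived from the pedigree (e.g., built from kinship coefficients), an assumption for illustration rather than the paper's notation.

```python
# Hedged sketch: BLUE of a common mean mu from observations y with covariance Sigma,
# mu_hat = (1' Sigma^{-1} y) / (1' Sigma^{-1} 1), Var(mu_hat) = 1 / (1' Sigma^{-1} 1).
import numpy as np

def blue_mean(y, Sigma):
    y = np.asarray(y, dtype=float)
    w = np.linalg.solve(Sigma, np.ones_like(y))   # Sigma^{-1} 1
    estimate = w @ y / w.sum()                    # weighted average of the data
    variance = 1.0 / w.sum()                      # variance of the estimator
    return estimate, variance
```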

Journal ArticleDOI
TL;DR: A Bayesian method that identifies segments of similar structure by using a Markov chain governed by a hidden Markov model is described and applied to the segmentation of the bacteriophage lambda genome.
Abstract: Many deoxyribonucleic acid (DNA) sequences display compositional heterogeneity in the form of segments of similar structure. This article describes a Bayesian method that identifies such segments by using a Markov chain governed by a hidden Markov model. Markov chain Monte Carlo (MCMC) techniques are employed to compute all posterior quantities of interest and, in particular, allow inferences to be made regarding the number of segment types and the order of Markov dependence in the DNA sequence. The method is applied to the segmentation of the bacteriophage lambda genome, a common benchmark sequence used for the comparison of statistical segmentation algorithms.

Journal ArticleDOI
TL;DR: This article presents a design for phase I trials in which the toxicity probabilities follow a partial order, meaning that there are pairs of treatments for which the ordering of the toxicity probabilities is not known at the start of the trial.
Abstract: Phase I trials of cytotoxic agents in oncology are usually dose-finding studies that involve a single cytotoxic agent. Many statistical methods have been proposed for these trials, all of which are based on the assumption of a monotonic dose-toxicity curve. For single-agent trials, this is a valid assumption. In many trials, however, investigators are interested in finding the maximally tolerated dose based on escalating multiple cytotoxic agents. When there are multiple agents, monotonicity of the dose-toxicity curve is not clearly defined. In this article we present a design for phase I trials in which the toxicity probabilities follow a partial order, meaning that there are pairs of treatments for which the ordering of the toxicity probabilities is not known at the start of the trial. We compare the new design to existing methods for simple orders and investigate the properties of the design for two partial orders.


Journal ArticleDOI
TL;DR: In this paper, the authors propose a graphical display, the selection impact (SI) curve, which shows the population response rate as a function of treatment selection criteria based on the marker and can be useful for choosing a treatment policy that incorporates information on whether the patient's marker value exceeds a threshold.
Abstract: Selecting the best treatment for a patient's disease may be facilitated by evaluating clinical characteristics or biomarker measurements at diagnosis. We consider how to evaluate the potential impact of such measurements on treatment selection algorithms. For example, magnetic resonance neurographic imaging is potentially useful for deciding whether a patient should be treated surgically for Carpal Tunnel Syndrome or should receive less-invasive conservative therapy. We propose a graphical display, the selection impact (SI) curve, that shows the population response rate as a function of treatment selection criteria based on the marker. The curve can be useful for choosing a treatment policy that incorporates information on the patient's marker value exceeding a threshold. The SI curve can be estimated using data from a comparative randomized trial conducted in the population as long as treatment assignment in the trial is independent of the predictive marker. Estimating the SI curve is therefore part of a post hoc analysis to determine whether the marker identifies patients that are more likely to benefit from one treatment over another. Nonparametric and parametric estimates of the SI curve are proposed in this article. Asymptotic distribution theory is used to evaluate the relative efficiencies of the estimators. Simulation studies show that inference is straightforward with realistic sample sizes. We illustrate the SI curve and statistical inference for it with data motivated by an ongoing trial of surgery versus conservative therapy for Carpal Tunnel Syndrome.
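A nonparametric estimate of a selection-impact-style curve can be sketched as follows; the data layout and the "A"/"B" treatment coding are assumptions for illustration, and the estimator shown is simply the plug-in mixture of the two arm-specific response rates above and below the threshold.

```python
# Hedged sketch: population response rate if patients with marker > c get
# treatment A and the rest get treatment B, estimated from a randomized trial.
import numpy as np

def si_curve(marker, response, treatment, thresholds):
    marker = np.asarray(marker)
    response = np.asarray(response, dtype=float)
    treatment = np.asarray(treatment)
    rates = []
    for c in thresholds:
        hi = marker > c
        a_mask = hi & (treatment == "A")
        b_mask = ~hi & (treatment == "B")
        rate_a = response[a_mask].mean() if a_mask.any() else 0.0
        rate_b = response[b_mask].mean() if b_mask.any() else 0.0
        rates.append(hi.mean() * rate_a + (1 - hi.mean()) * rate_b)
    return np.array(rates)
```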

Journal ArticleDOI
TL;DR: A latent pattern mixture model (LPMM) is proposed in which the mixture patterns are formed from latent classes that link the longitudinal response and the missingness process; the fitted model suggests the presence of four latent classes linking subject visit patterns to homeless outcomes.
Abstract: A frequently encountered problem in longitudinal studies is data that are missing due to missed visits or dropouts. In the statistical literature, interest has primarily focused on monotone missing data (dropout) with much less work on intermittent missing data in which a subject may return after one or more missed visits. Intermittent missing data have broader applicability that can include the frequent situation in which subjects do not have common sets of visit times or they visit at nonprescheduled times. In this article, we propose a latent pattern mixture model (LPMM), where the mixture patterns are formed from latent classes that link the longitudinal response and the missingness process. This allows us to handle arbitrary patterns of missing data embodied by subjects' visit process, and avoids the need to specify the mixture patterns a priori. One key assumption of our model is that the missingness process is conditionally independent of the longitudinal outcomes given the latent classes. We propose a noniterative approach to assess this key assumption. The LPMM is illustrated with a data set from a health service research study in which homeless people with mental illness were randomized to three different service packages and measures of homelessness were recorded at multiple time points. Our model suggests the presence of four latent classes linking subject visit patterns to homeless outcomes.

Journal ArticleDOI
TL;DR: This article derives estimators that are easy to compute and are more efficient than previous estimators for the survival distribution and mean restricted survival time under different treatment policies, and applies these estimators to a leukemia clinical trial data set that motivated this study.
Abstract: Two-stage designs are common in therapeutic clinical trials such as cancer or AIDS treatment trials. In a two-stage design, patients are initially treated with one induction (primary) therapy and then, depending upon their response and consent, are treated with a maintenance therapy, sometimes to intensify the effect of the first-stage therapy. The goal is to compare different combinations of primary and maintenance (intensification) therapies to find the combination that is most beneficial. To achieve this goal, patients are initially randomized to one of several induction therapies and then, if they are eligible for the second-stage randomization, are offered randomization to one of several maintenance therapies. In practice, the analysis is usually conducted in two separate stages, which does not directly address the major objective of finding the best combination. Recently, Lunceford et al. (2002, Biometrics 58, 48-57) introduced ad hoc estimators for the survival distribution and mean restricted survival time under different treatment policies. These estimators are consistent but not efficient, and do not include information from auxiliary covariates. In this article we derive estimators that are easy to compute and are more efficient than previous estimators. We also show how to improve efficiency further by taking into account additional information from auxiliary variables. Large sample properties of these estimators are derived and comparisons with other estimators are made using simulation. We apply our estimators to a leukemia clinical trial data set that motivated this study.

Journal ArticleDOI
TL;DR: This article extends sample size requirements for estimating the prevalence of disease in the case of a single imperfect test to include two conditionally independent imperfect tests, and applies several different criteria for Bayesian sample size determination to the design of such studies.
Abstract: Planning studies involving diagnostic tests is complicated by the fact that virtually no test provides perfectly accurate results. The misclassification induced by imperfect sensitivities and specificities of diagnostic tests must be taken into account, whether the primary goal of the study is to estimate the prevalence of a disease in a population or to investigate the properties of a new diagnostic test. Previous work on sample size requirements for estimating the prevalence of disease in the case of a single imperfect test showed very large discrepancies in size when compared to methods that assume a perfect test. In this article we extend these methods to include two conditionally independent imperfect tests, and apply several different criteria for Bayesian sample size determination to the design of such studies. We consider both disease prevalence studies and studies designed to estimate the sensitivity and specificity of diagnostic tests. As the problem is typically nonidentifiable, we investigate the limits on the accuracy of parameter estimation as the sample size approaches infinity. Through two examples from infectious diseases, we illustrate the changes in sample sizes that arise when two tests are applied to individuals in a study rather than a single test. Although smaller sample sizes are often found in the two-test situation, they can still be prohibitively large unless accurate information is available about the sensitivities and specificities of the tests being used.

Journal ArticleDOI
TL;DR: It is shown that the ROC curve can be interpreted as a cumulative distribution function for the discriminatory measure Y in the affected population after Y has been standardized to the distribution in the reference population (D = 0).
Abstract: The idea of using measurements such as biomarkers, clinical data, or molecular biology assays for classification and prediction is popular in modern medicine. The scientific evaluation of such measures includes assessing the accuracy with which they predict the outcome of interest. Receiver operating characteristic curves are commonly used for evaluating the accuracy of diagnostic tests. They can be applied more broadly, indeed to any problem involving classification to two states or populations (D = 0 or 1). We show that the ROC curve can be interpreted as a cumulative distribution function for the discriminatory measure Y in the affected population (D = 1) after Y has been standardized to the distribution in the reference population (D = 0). The standardized values are called placement values. If the placement values have a uniform(0, 1) distribution, then Y is not discriminatory, because its distribution in the affected population is the same as that in the reference population. The degree to which the distribution of the standardized measure differs from uniform(0, 1) is a natural way to characterize the discriminatory capacity of Y and provides a nontraditional interpretation for the ROC curve. Statistical methods for making inference about distribution functions therefore motivate new approaches to making inference about ROC curves. We demonstrate this by considering the ROC-GLM regression model and observing that it is equivalent to a regression model for the distribution of placement values. The likelihood of the placement values provides a new approach to ROC parameter estimation that appears to be more efficient than previously proposed methods. The method is applied to evaluate a pulmonary function measure in cystic fibrosis patients as a predictor of future occurrence of severe acute pulmonary infection requiring hospitalization. Finally, we note the relationship between regression models for the mean placement value and recently proposed models for the area under the ROC curve, which is the classic summary index of discrimination.
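The placement-value construction is easy to state empirically; the sketch below (illustrative function names, empirical rather than model-based estimates) standardizes the diseased measurements against the reference distribution and reads the ROC curve off as their cumulative distribution function.

```python
# Hedged sketch of placement values and the induced empirical ROC curve.
import numpy as np

def placement_values(y_cases, y_controls):
    y_controls = np.sort(np.asarray(y_controls))
    # proportion of reference (D = 0) values exceeding each case measurement
    return 1.0 - np.searchsorted(y_controls, y_cases, side="right") / y_controls.size

def empirical_roc(y_cases, y_controls, t):
    pv = placement_values(y_cases, y_controls)
    return np.mean(pv <= t)   # ROC(t) = CDF of the placement values at false-positive rate t
```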

Journal ArticleDOI
TL;DR: This article describes two tests for the case-cohort design, which can be treated as natural generalizations of the log-rank test in the full cohort design, and derives an explicit form for power/sample size calculation based on these two tests.
Abstract: In epidemiologic studies and disease prevention trials, interest often involves estimation of the relationship between some disease endpoints and individual exposure. In some studies, due to the rarity of the disease and the cost of collecting the exposure information for the entire cohort, a case-cohort design, which consists of a small random sample of the whole cohort and all the diseased subjects, is often used. Previous work has focused on analyzing data from the case-cohort design and few have discussed the sample size issues. In this article, we describe two tests for the case-cohort design, which can be treated as natural generalizations of the log-rank test in the full cohort design. We derive an explicit form for power/sample size calculation based on these two tests. A number of simulation studies have been used to illustrate the efficiency of the tests for the case-cohort design. An example is provided to illustrate how to use the formula.

Journal ArticleDOI
TL;DR: It is found that for both of these approaches, imputation by LOCF can lead to substantial biases in estimators of treatment effects, the type I error rates of associated tests can be greatly inflated, and the coverage probability can be far from the nominal level.
Abstract: In recent years there has been considerable research devoted to the development of methods for the analysis of incomplete data in longitudinal studies. Despite these advances, the methods used in practice have changed relatively little, particularly in the reporting of pharmaceutical trials. In this setting, perhaps the most widely adopted strategy for dealing with incomplete longitudinal data is imputation by the "last observation carried forward" (LOCF) approach, in which values for missing responses are imputed using observations from the most recently completed assessment. We examine the asymptotic and empirical bias, the empirical type I error rate, and the empirical coverage probability associated with estimators and tests of treatment effect based on the LOCF imputation strategy. We consider a setting involving longitudinal binary data with longitudinal analyses based on generalized estimating equations, and an analysis based simply on the response at the end of the scheduled follow-up. We find that for both of these approaches, imputation by LOCF can lead to substantial biases in estimators of treatment effects, the type I error rates of associated tests can be greatly inflated, and the coverage probability can be far from the nominal level. Alternative analyses based on all available data lead to estimators with comparatively small bias, and inverse probability weighted analyses yield consistent estimators subject to correct specification of the missing data process. We illustrate the differences between various methods of dealing with drop-outs using data from a study of smoking behavior.
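For readers unfamiliar with the imputation scheme being criticized, a minimal LOCF sketch follows (the NaN-for-missing data layout is an assumption); the point of the article is precisely that this simple fill-in can badly distort treatment-effect estimates and tests.

```python
# Hedged sketch of last-observation-carried-forward (LOCF) imputation.
import numpy as np

def locf(y):
    """y: (subjects, visits) array with NaN for missed visits."""
    y = np.asarray(y, dtype=float).copy()
    for row in y:
        for j in range(1, row.size):
            if np.isnan(row[j]):
                row[j] = row[j - 1]    # carry the last observed value forward
    return y                            # leading NaNs (no prior value) remain missing
```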

Journal ArticleDOI
TL;DR: The approach is Bayesian, with posterior summaries obtained via a hybrid Markov chain Monte Carlo algorithm; a broad collection of rather high-dimensional hierarchical models is compared using the deviance information criterion, a tool recently developed for just this purpose.
Abstract: Several recent papers (e.g., Chen, Ibrahim, and Sinha, 1999, Journal of the American Statistical Association 94, 909-919; Ibrahim, Chen, and Sinha, 2001a, Biometrics 57, 383-388) have described statistical methods for use with time-to-event data featuring a surviving fraction (i.e., a proportion of the population that never experiences the event). Such cure rate models and their multivariate generalizations are quite useful in studies of multiple diseases to which an individual may never succumb, or from which an individual may reasonably be expected to recover following treatment (e.g., various types of cancer). In this article we extend these models to allow for spatial correlation (estimable via zip code identifiers for the subjects) as well as interval censoring. Our approach is Bayesian, where posterior summaries are obtained via a hybrid Markov chain Monte Carlo algorithm. We compare across a broad collection of rather high-dimensional hierarchical models using the deviance information criterion, a tool recently developed for just this purpose. We apply our approach to the analysis of a smoking cessation study where the subjects reside in 53 southeastern Minnesota zip codes. In addition to the usual posterior estimates, our approach yields smoothed zip code level maps of model parameters related to the relapse rates over time and the ultimate proportion of quitters (the cure rates).

Journal ArticleDOI
TL;DR: A new family of distributions for circular random variables is proposed, based on nonnegative trigonometric sums, which can be used to model data sets which present skewness and/or multimodality.
Abstract: A new family of distributions for circular random variables is proposed. It is based on nonnegative trigonometric sums and can be used to model data sets which present skewness and/or multimodality. In this family of distributions, the trigonometric moments are easily expressed in terms of the parameters of the distribution. The proposed family is applied to two data sets, one related to the directions taken by ants and the other to the directions taken by turtles, to compare its goodness of fit with that of common distributions used in the literature.
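One standard way such trigonometric sums are kept nonnegative (stated here as a general construction, an assumption rather than a quotation of the paper) is to write the density as the squared modulus of a trigonometric polynomial:

$$ f(\theta) = \frac{1}{2\pi}\left|\sum_{k=0}^{M} c_k e^{ik\theta}\right|^{2}, \qquad \sum_{k=0}^{M} |c_k|^{2} = 1, $$

which is nonnegative by construction and integrates to one over the circle; skewness and multimodality come from the choice of the complex coefficients $c_k$.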

Journal ArticleDOI
Rongling Wu, Chang-Xing Ma, Min Lin, Zuoheng Wang, George Casella
TL;DR: This article presents a new statistical model for mapping growth QTL, which also addresses the problem of variance stationarity, by using a transform-both-sides (TBS) model advocated by Carroll and Ruppert (1984).
Abstract: The incorporation of developmental control mechanisms of growth has proven to be a powerful tool in mapping quantitative trait loci (QTL) underlying growth trajectories. A theoretical framework for implementing a QTL mapping strategy with growth laws has been established. This framework can be generalized to an arbitrary number of time points, where growth is measured, and becomes computationally more tractable when the assumption of variance stationarity is made. In practice, however, this assumption is likely to be violated for age-specific growth traits due to a scale effect. In this article, we present a new statistical model for mapping growth QTL, which also addresses the problem of variance stationarity, by using a transform-both-sides (TBS) model advocated by Carroll and Ruppert (1984, Journal of the American Statistical Association 79, 321-328). The TBS-based model for mapping growth QTL can not only maintain the original biological properties of a growth model, but also increase the accuracy and precision of parameter estimation and the power to detect a QTL responsible for growth differentiation. Using the TBS-based model, we successfully map a QTL governing growth trajectories to a linkage group in an example of forest trees. The statistical and biological properties of the estimates of this growth QTL position and effect are investigated using Monte Carlo simulation studies. The implications of our model for understanding the genetic architecture of growth are discussed.
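As a hedged sketch of what a transform-both-sides specification looks like in this setting (a generic Carroll and Ruppert form, with the Box-Cox transform used purely for illustration), the same transformation is applied to the observed trait and to the growth-law mean:

$$ g_\lambda\!\left(y_{it}\right) = g_\lambda\!\left(m(t;\theta)\right) + e_{it}, \qquad g_\lambda(y) = \frac{y^{\lambda} - 1}{\lambda}, $$

so the biological parameters $\theta$ of the growth curve $m(t;\theta)$ keep their interpretation while the transformation stabilizes the variance of the residuals $e_{it}$.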

Journal ArticleDOI
TL;DR: An application to Los Angeles County wildfire data is given, in which it is shown that the separability hypothesis is invalidated largely due to clustering of fires of similar sizes within periods of up to 3.9 years.
Abstract: Nonparametric tests for investigating the separability of a spatial-temporal marked point process are described and compared. It is shown that a Cramer-von Mises-type test is very powerful at detecting gradual departures from separability, and that a residual test based on randomly rescaling the process is powerful at detecting nonseparable clustering or inhibition of the marks. An application to Los Angeles County wildfire data is given, in which it is shown that the separability hypothesis is invalidated largely due to clustering of fires of similar sizes within periods of up to 3.9 years.