
Showing papers in "Biostatistics in 2004"


Journal ArticleDOI
TL;DR: A modification of binary segmentation, called circular binary segmentation, is developed to translate noisy intensity measurements of DNA sequence copy number into regions of equal copy number.
Abstract: DNA sequence copy number is the number of copies of DNA at a region of a genome. Cancer progression often involves alterations in DNA copy number. Newly developed microarray technologies enable simultaneous measurement of copy number at thousands of sites in a genome. We have developed a modification of binary segmentation, which we call circular binary segmentation, to translate noisy intensity measurements into regions of equal copy number. The method is evaluated by simulation and is demonstrated on cell line data with known copy number alterations and on a breast cancer cell line data set.
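
To make the segmentation idea concrete, here is a minimal Python sketch of circular-binary-segmentation-style change-point detection: find the arc whose mean differs most from the rest of the series, test it by permutation, and recurse. The statistic, permutation settings and toy data below are simplifications for illustration, not the published algorithm.

```python
import numpy as np

def cbs_statistic(x):
    """Max over arcs x[i:j] of a standardized difference between the mean
    inside the arc and the mean of the remaining points (unit variance
    assumed for simplicity); O(n^2), so suitable only for short series."""
    n, s = len(x), np.concatenate([[0.0], np.cumsum(x)])
    best, best_seg = 0.0, None
    for i in range(n - 1):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue                      # the whole series is not an arc
            mean_in = (s[j] - s[i]) / k
            mean_out = (s[n] - s[j] + s[i]) / (n - k)
            z = abs(mean_in - mean_out) / np.sqrt(1.0 / k + 1.0 / (n - k))
            if z > best:
                best, best_seg = z, (i, j)
    return best, best_seg

def segment(x, n_perm=200, alpha=0.01):
    """Recursively split x into regions of (approximately) equal mean,
    accepting a split only if its statistic is extreme under permutation."""
    rng = np.random.default_rng(0)
    stat, seg = cbs_statistic(x)
    if seg is None:
        return [x]
    null = [cbs_statistic(rng.permutation(x))[0] for _ in range(n_perm)]
    if np.mean(np.array(null) >= stat) > alpha:
        return [x]                            # no significant change point
    i, j = seg
    parts = [] if i == 0 else segment(x[:i], n_perm, alpha)
    parts.append(x[i:j])
    if j < len(x):
        parts += segment(x[j:], n_perm, alpha)
    return parts

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 20),     # normal copy number
                    rng.normal(1.5, 1, 20),   # gained region
                    rng.normal(0, 1, 20)])
print([len(s) for s in segment(x)])           # roughly [20, 20, 20]
```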

2,269 citations


Journal ArticleDOI
TL;DR: This work proposes a hierarchical mixture model to provide methodology that is both sensitive in detecting differential expression and sufficiently flexible to account for the complex variability of normalized microarray data.
Abstract: SUMMARY Mixture modeling provides an effective approach to the differential expression problem in microarray data analysis. Methods based on fully parametric mixture models are available, but lack of fit in some examples indicates that more flexible models may be beneficial. Existing, more flexible, mixture models work at the level of one-dimensional gene-specific summary statistics, and so when there are relatively few measurements per gene these methods may not provide sensitive detectors of differential expression. We propose a hierarchical mixture model to provide methodology that is both sensitive in detecting differential expression and sufficiently flexible to account for the complex variability of normalized microarray data. EM-based algorithms are used to fit both parametric and semiparametric versions of the model. We restrict attention to the two-sample comparison problem; an experiment involving Affymetrix microarrays and yeast translation provides the motivating case study. Gene-specific posterior probabilities of differential expression form the basis of statistical inference; they define short gene lists and false discovery rates. Compared to several competing methodologies, the proposed methodology exhibits good operating characteristics in a simulation study, on the analysis of spike-in data, and in a cross-validation calculation.
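
The flavour of the approach can be conveyed with a much simpler, fully parametric stand-in: a two-component Gaussian mixture fitted by EM to one-dimensional gene-level statistics, whose E-step posteriors play the role of posterior probabilities of differential expression. The paper's hierarchical and semiparametric models are considerably richer; every distributional choice below is an illustrative assumption.

```python
import numpy as np

def em_two_component(z, n_iter=200):
    """EM for a two-component Gaussian mixture on gene-level statistics z:
    component 0 = equivalently expressed (mean fixed at 0),
    component 1 = differentially expressed (mean and variance free)."""
    z = np.asarray(z, float)
    pi1, mu1, s0, s1 = 0.1, z.mean() + z.std(), z.std(), z.std()
    for _ in range(n_iter):
        # E-step: posterior probability each gene is differentially expressed
        f0 = np.exp(-0.5 * (z / s0) ** 2) / (s0 * np.sqrt(2 * np.pi))
        f1 = np.exp(-0.5 * ((z - mu1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
        post = pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)
        # M-step: update mixing proportion and component parameters
        pi1 = post.mean()
        mu1 = np.sum(post * z) / post.sum()
        s1 = np.sqrt(np.sum(post * (z - mu1) ** 2) / post.sum())
        s0 = np.sqrt(np.sum((1 - post) * z ** 2) / (1 - post).sum())
    return post

rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)])
post = em_two_component(z)
gene_list = np.where(post > 0.9)[0]          # a "short gene list"
fdr = np.mean(1 - post[gene_list])           # its estimated false discovery rate
print(len(gene_list), round(fdr, 3))
```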

622 citations


Journal ArticleDOI
TL;DR: It is argued that some simple but commonly used methods to handle incomplete longitudinal clinical trial data require restrictive assumptions and stand on a weaker theoretical foundation than likelihood-based methods developed under the missing at random (MAR) framework, and their optimal place is within sensitivity analysis.
Abstract: Using standard missing data taxonomy, due to Rubin and co-workers, and simple algebraic derivations, it is argued that some simple but commonly used methods to handle incomplete longitudinal clinical trial data, such as complete case analyses and methods based on last observation carried forward, require restrictive assumptions and stand on a weaker theoretical foundation than likelihood-based methods developed under the missing at random (MAR) framework. Given the availability of flexible software for analyzing longitudinal sequences of unequal length, implementation of likelihood-based MAR analyses is not limited by computational considerations. While such analyses are valid under the comparatively weak assumption of MAR, the possibility of data missing not at random (MNAR) is difficult to rule out. It is argued, however, that MNAR analyses are, themselves, surrounded with problems and therefore, rather than ignoring MNAR analyses altogether or blindly shifting to them, their optimal place is within sensitivity analysis. The concepts developed here are illustrated using data from three clinical trials, where it is shown that the analysis method may have an impact on the conclusions of the study.

417 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem and showed that when using the same set of genes, PLR and SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability.
Abstract: Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.
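
A rough analogue of the PLR-plus-RFE pipeline can be assembled from scikit-learn's L2-penalized LogisticRegression and RFE. The simulated data, penalty value and number of selected genes below are arbitrary placeholders, and the paper's dedicated fast algorithm for PLR is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

# Toy stand-in for expression data: 60 samples x 500 genes, two classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 500))
y = rng.integers(0, 2, size=60)
X[y == 1, :10] += 1.0                 # make the first 10 "genes" informative

# L2-penalized (ridge) logistic regression; C is the inverse penalty.
plr = LogisticRegression(penalty="l2", C=0.1, max_iter=5000)

# Recursive feature elimination down to 20 genes (10% dropped per step).
selector = RFE(plr, n_features_to_select=20, step=0.1).fit(X, y)
genes = np.where(selector.support_)[0]

# Unlike a bare SVM class label, PLR yields class probabilities per sample.
probs = selector.estimator_.predict_proba(X[:, genes])[:, 1]
print(genes[:10], np.round(probs[:5], 2))
print(cross_val_score(selector, X, y, cv=5).mean())
```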

383 citations


Journal ArticleDOI
TL;DR: The use of optimal multivariate matching prior to randomization to improve covariate balance for many variables at the same time is discussed, presenting an algorithm and a case-study of its performance.
Abstract: SUMMARY Although blocking or pairing before randomization is a basic principle of experimental design, the principle is almost invariably applied to at most one or two blocking variables. Here, we discuss the use of optimal multivariate matching prior to randomization to improve covariate balance for many variables at the same time, presenting an algorithm and a case-study of its performance. The method is useful when all subjects, or large groups of subjects, are randomized at the same time. Optimal matching divides a single group of 2n subjects into n pairs to minimize covariate differences within pairs—the so-called nonbipartite matching problem—then one subject in each pair is picked at random for treatment, the other being assigned to control. Using the baseline covariate data for 132 patients from an actual, unmatched, randomized experiment, we construct 66 pairs matching for 14 covariates. We then create 10 000 unmatched and 10 000 matched randomized experiments by repeatedly randomizing the 132 patients, and compare the covariate balance with and without matching. By every measure, every one of the 14 covariates was substantially better balanced when randomization was performed within matched pairs. Even after covariance adjustment for chance imbalances in the 14 covariates, matched randomizations provided more accurate estimates than unmatched randomizations, the increase in accuracy being equivalent to, on average, a 7% increase in sample size. In randomization tests of no treatment effect, matched randomizations using the signed rank test had substantially higher power than unmatched randomizations using the rank sum test, even when only 2 of 14 covariates were relevant to a simulated response. Unmatched randomizations experienced rare disasters which were consistently avoided by matched randomizations.
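
The core computational step, optimal nonbipartite matching of 2n subjects into n pairs followed by randomization within pairs, can be sketched with networkx's general maximum-weight matching. The distance measure and data below are illustrative assumptions; the paper's own matching implementation may differ.

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(132, 14))        # baseline covariates, 132 subjects

# Pairwise distances (here simply Euclidean on standardized covariates).
Z = (X - X.mean(0)) / X.std(0)
D = squareform(pdist(Z))

# Optimal nonbipartite matching: minimize total within-pair distance by
# maximizing the negated distances over the complete graph.
n = len(Z)
G = nx.Graph()
for i in range(n):
    for j in range(i + 1, n):
        G.add_edge(i, j, weight=-D[i, j])
pairs = nx.max_weight_matching(G, maxcardinality=True)

# Randomize within each matched pair: one to treatment, one to control.
assignment = np.empty(n, dtype=int)
for a, b in pairs:
    t = rng.integers(2)
    assignment[a], assignment[b] = t, 1 - t
print(len(pairs), assignment[:10])    # 66 pairs, balanced on the covariates
```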

210 citations


Journal ArticleDOI
TL;DR: A 'structured' hidden Markov model is developed in which the underlying Markov chain is generated by a simple transmission model; compared with a standard hidden Markov model it is more parsimonious, more biologically plausible, and allows key epidemiological parameters to be estimated.
Abstract: SUMMARY Surveillance data for communicable nosocomial pathogens usually consist of short time series of low-numbered counts of infected patients. These often show overdispersion and autocorrelation. To date, almost all analyses of such data have ignored the communicable nature of the organisms and have used methods appropriate only for independent outcomes. Inferences that depend on such analyses cannot be considered reliable when patient-to-patient transmission is important. We propose a new method for analysing these data based on a mechanistic model of the epidemic process. Since important nosocomial pathogens are often carried asymptomatically with overt infection developing in only a proportion of patients, the epidemic process is usually only partially observed by routine surveillance data. We therefore develop a ‘structured’ hidden Markov model where the underlying Markov chain is generated by a simple transmission model. We apply both structured and standard (unstructured) hidden Markov models to time series for three important pathogens. We find that both methods can offer marked improvements over currently used approaches when nosocomial spread is important. Compared to the standard hidden Markov model, the new approach is more parsimonious, is more biologically plausible, and allows key epidemiological parameters to be estimated.
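
A stripped-down version of a 'structured' hidden Markov model can be written directly: the hidden state is the unobserved number of colonized patients, the transition matrix is generated by a small transmission model, and the likelihood comes from the standard forward recursion. The particular transmission model, state-space cap and emission distribution below are illustrative assumptions, not the authors' specification.

```python
import numpy as np
from scipy.stats import binom, poisson

def transition_matrix(K, beta, nu, gamma):
    """P[i, j] = P(I_{t+1} = j | I_t = i) under a simple transmission model:
    each colonized patient clears w.p. gamma; new acquisitions are
    Poisson(nu + beta * i), truncated so the state stays in 0..K."""
    P = np.zeros((K + 1, K + 1))
    for i in range(K + 1):
        stay = binom.pmf(np.arange(i + 1), i, 1 - gamma)     # remain colonized
        new = poisson.pmf(np.arange(K + 1), nu + beta * i)   # acquisitions
        new[-1] += 1 - new.sum()                             # lump the tail
        for s, ps in enumerate(stay):
            for a, pa in enumerate(new):
                P[i, min(s + a, K)] += ps * pa
    return P

def log_likelihood(counts, K, beta, nu, gamma, p_detect):
    """HMM forward algorithm; observed counts ~ Binomial(I_t, p_detect)."""
    P = transition_matrix(K, beta, nu, gamma)
    alpha = np.full(K + 1, 1.0 / (K + 1))                    # uniform start
    ll = 0.0
    for y in counts:
        emit = binom.pmf(y, np.arange(K + 1), p_detect)
        alpha = alpha * emit
        ll += np.log(alpha.sum())
        alpha = (alpha / alpha.sum()) @ P
    return ll

print(log_likelihood([0, 1, 0, 2, 3, 1, 0, 0], K=10,
                     beta=0.2, nu=0.1, gamma=0.3, p_detect=0.5))
```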

173 citations


Journal ArticleDOI
TL;DR: This work focuses on the case in which patient subgroups are defined to contain patients having increasingly larger values of one particular covariate of interest, with the intent of exploring the possible interaction between treatment effect and that covariate.
Abstract: SUMMARY We discuss the practice of examining patterns of treatment effects across overlapping patient subpopulations. In particular, we focus on the case in which patient subgroups are defined to contain patients having increasingly larger (or smaller) values of one particular covariate of interest, with the intent of exploring the possible interaction between treatment effect and that covariate. We formalize these subgroup approaches (STEPP: subpopulation treatment effect pattern plots) and implement them when treatment effect is defined as the difference in survival at a fixed time point between two treatment arms. The joint asymptotic distribution of the treatment effect estimates is derived, and used to construct simultaneous confidence bands around the estimates and to test the null hypothesis of no interaction. These methods are illustrated using data from a clinical trial conducted by the International Breast Cancer Study Group, which demonstrates the critical role of estrogen receptor content of the primary breast cancer for selecting appropriate adjuvant therapy. The considerations are also relevant for general subset analysis, since information from the same patients is typically used in the estimation of treatment effects within two or more subgroups of patients defined with respect to different covariates.
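
The basic STEPP construction, treatment-effect estimates over overlapping subpopulations defined by sliding windows of a covariate, is easy to sketch. The version below uses simple event-free proportions at a fixed time on uncensored toy data (real use would substitute Kaplan-Meier estimates) and omits the paper's simultaneous confidence bands and interaction test.

```python
import numpy as np

def stepp_effects(time, treat, covariate, t0=24.0, window=0.4, step=0.1):
    """Difference between arms in the proportion event-free at t0, computed
    over overlapping subpopulations defined by sliding quantile windows of
    the covariate.  With censoring, Kaplan-Meier estimates at t0 would
    replace the raw proportions."""
    effects = []
    for q in np.arange(0.0, 1.0 - window + 1e-9, step):
        lo, hi = np.quantile(covariate, [q, q + window])
        sub = (covariate >= lo) & (covariate <= hi)
        surv = [np.mean(time[sub & (treat == arm)] > t0) for arm in (1, 0)]
        effects.append((lo, hi, surv[0] - surv[1]))
    return effects

# Toy data with a treatment-by-covariate interaction (no censoring).
rng = np.random.default_rng(2)
n = 600
cov = rng.uniform(0, 100, n)                  # e.g. estrogen receptor content
trt = rng.integers(0, 2, n)
time = rng.exponential(30 * (1 + 0.02 * cov * trt), n)
for lo, hi, d in stepp_effects(time, trt, cov):
    print(f"covariate window [{lo:5.1f}, {hi:5.1f}]: effect = {d:+.2f}")
```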

170 citations


Journal ArticleDOI
TL;DR: These designs represent a substantial practical improvement over classical experimental designs which work in terms of standard interactions and main effects, and are superior to both the popular reference designs, which are highly inefficient, and to designs incorporating all possible direct pairwise comparisons.
Abstract: SUMMARY Microarrays are powerful tools for surveying the expression levels of many thousands of genes simultaneously. They belong to the new genomics technologies which have important applications in the biological, agricultural and pharmaceutical sciences. There are myriad sources of uncertainty in microarray experiments, and rigorous experimental design is essential for fully realizing the potential of these valuable resources. Two questions frequently asked by biologists on the brink of conducting cDNA or two-colour, spotted microarray experiments are ‘Which mRNA samples should be competitively hybridized together on the same slide?’ and ‘How many times should each slide be replicated?’ Early experience has shown that whilst the field of classical experimental design has much to offer this emerging multi-disciplinary area, new approaches which accommodate features specific to the microarray context are needed. In this paper, we propose optimal designs for factorial and time course experiments, which are special designs arising quite frequently in microarray experimentation. Our criterion for optimality is statistical efficiency based on a new notion of admissible designs; our approach enables efficient designs to be selected subject to the information available on the effects of most interest to biologists, the number of arrays available for the experiment, and other resource or practical constraints, including limitations on the amount of mRNA probe. We show that our designs are superior to both the popular reference designs, which are highly inefficient, and to designs incorporating all possible direct pairwise comparisons. Moreover, our proposed designs represent a substantial practical improvement over classical experimental designs which work in terms of standard interactions and main effects. The latter do not provide a basis for meaningful inference on the effects of most interest to biologists, nor make the most efficient use of valuable and limited resources.
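
The inefficiency of reference designs relative to direct comparisons can be checked with a few lines of linear algebra: treat each two-colour array as measuring a difference of two condition effects and compare the average variance of all pairwise contrasts under a reference design and a loop design using the same number of arrays. This is a textbook-style calculation, not the paper's admissibility-based optimality criterion.

```python
import numpy as np
from itertools import combinations

def avg_pairwise_variance(arrays, n_labels, n_interest):
    """Average variance (in units of sigma^2) of all pairwise contrasts
    among the first `n_interest` condition labels, for a two-colour design
    given as a list of (red, green) label pairs, one per array."""
    X = np.zeros((len(arrays), n_labels))
    for k, (r, g) in enumerate(arrays):
        X[k, r], X[k, g] = 1.0, -1.0
    cov = np.linalg.pinv(X.T @ X)             # pseudo-inverse of information
    out = []
    for i, j in combinations(range(n_interest), 2):
        c = np.zeros(n_labels)
        c[i], c[j] = 1.0, -1.0
        out.append(c @ cov @ c)
    return float(np.mean(out))

T = 5                                          # five mRNA samples of interest
reference_design = [(i, T) for i in range(T)]       # every sample vs reference
loop_design = [(i, (i + 1) % T) for i in range(T)]  # 1->2->3->4->5->1

print("reference design:", avg_pairwise_variance(reference_design, T + 1, T))  # ~2.0
print("loop design     :", avg_pairwise_variance(loop_design, T, T))           # ~1.0
```

With the same five arrays, the loop design halves the average contrast variance, which is the kind of gap the abstract refers to when calling reference designs highly inefficient.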

139 citations


Journal ArticleDOI
TL;DR: This article exposes a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks, and shows that dramatic computational savings are possible over naive implementations.
Abstract: SUMMARY Gene expression arrays typically have 50 to 100 samples and 1000 to 20 000 variables (genes). There have been many attempts to adapt statistical models for regression and classification to these data, and in many cases these attempts have challenged the computational resources. In this article we expose a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks. For all of these models, we show that dramatic computational savings are possible over naive implementations, using standard transformations in numerical linear algebra.
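
The kind of computational saving the article describes is easiest to see in the ridge-regression case: when there are far more genes than samples, the p x p normal equations can be replaced by an n x n system involving the Gram matrix, giving identical estimates. The sketch below shows only this one instance of the general reduction.

```python
import numpy as np

def ridge_naive(X, y, lam):
    """O(p^3): solve (X'X + lam I_p) beta = X'y directly."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_dual(X, y, lam):
    """O(n^3): the same estimator via the n x n Gram matrix,
    beta = X'(XX' + lam I_n)^{-1} y -- a large saving when p >> n."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha

rng = np.random.default_rng(0)
n, p, lam = 50, 2000, 1.0                 # 50 samples, 2000 genes
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
print(np.allclose(ridge_naive(X, y, lam), ridge_dual(X, y, lam)))  # True
```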

119 citations


Journal ArticleDOI
TL;DR: A stochastic epidemic model is proposed which incorporates heterogeneity in the spread of a disease through a population based on the spatial location of an individual's home and the household and school class to which the individual belongs.
Abstract: A stochastic epidemic model is proposed which incorporates heterogeneity in the spread of a disease through a population. In particular, three factors are considered: the spatial location of an individual's home and the household and school class to which the individual belongs. The model is applied to an extremely informative measles data set and the model is compared with nested models, which incorporate some, but not all, of the aforementioned factors. A reversible jump Markov chain Monte Carlo algorithm is then introduced which assists in selecting the most appropriate model to fit the data.

94 citations


Journal ArticleDOI
TL;DR: Estimates of the complier average causal effect based on the missing at random assumption are compared with an estimator based on an alternative, nonignorable model for the missing data process developed by Frangakis and Rubin, and with a new missing data model that is specially suited for models with instrumental variables but makes different substantive assumptions.
Abstract: SUMMARY Recently, instrumental variables methods have been used to address non-compliance in randomized experiments. Complicating such analyses is often the presence of missing data. The standard model for missing data, missing at random (MAR), has some unattractive features in this context. In this paper we compare MAR-based estimates of the complier average causal effect (CACE) with an estimator based on an alternative, nonignorable model for the missing data process, developed by Frangakis and Rubin (1999). We also introduce a new missing data model that, like the Frangakis–Rubin model, is specially suited for models with instrumental variables, but makes different substantive assumptions. We analyze these issues in the context of a randomized trial of breast self-examination (BSE). In the study two methods of teaching BSE, consisting of either mailed information about BSE (the standard treatment) or the attendance of a course involving theoretical and practical sessions (the new treatment), were compared with the aim of assessing whether teaching programs could increase BSE practice and improve examination skills. The study was affected by the two sources of bias mentioned above: only 55% of women assigned to receive the new treatment complied with their assignment and 35% of the women did not respond to the post-test questionnaire. Comparing the causal estimand of the new treatment using the MAR, Frangakis–Rubin, and our new approach, the results suggest that for these data the MAR assumption appears least plausible, and that the new model appears most plausible among the three choices.
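
For orientation, the complete-data version of the complier average causal effect is the classical instrumental-variables (Wald) ratio: the intention-to-treat effect on the outcome divided by the effect of assignment on treatment received. The sketch below simulates one-sided non-compliance with fully observed outcomes, so it deliberately ignores the missing-data problem that is the focus of the paper; all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
z = rng.integers(0, 2, n)                      # randomized assignment
complier = rng.random(n) < 0.55                # 55% would attend the course
d = ((z == 1) & complier).astype(int)          # treatment actually received
y = rng.binomial(1, 0.3 + 0.2 * d)             # outcome, improved by treatment

itt_y = y[z == 1].mean() - y[z == 0].mean()    # intention-to-treat effect
itt_d = d[z == 1].mean() - d[z == 0].mean()    # compliance rate
cace = itt_y / itt_d                           # Wald / IV estimator (true 0.2)
print(f"ITT = {itt_y:.3f}, compliance = {itt_d:.3f}, CACE = {cace:.3f}")
```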

Journal ArticleDOI
TL;DR: Two new methods are proposed for estimating the intercept and slope parameters of a binormal ROC curve that assesses the accuracy of a continuous diagnostic test: one uses the profile likelihood and a simple algorithm, the other a pseudo-maximum likelihood that can adjust for covariates.
Abstract: SUMMARY Not until recently has much attention been given to deriving maximum likelihood methods for estimating the intercept and slope parameters from a binormal ROC curve that assesses the accuracy of a continuous diagnostic test. We propose two new methods for estimating these parameters. The first method uses the profile likelihood and a simple algorithm to produce fully efficient estimates. The second method is based on a pseudo-maximum likelihood that can easily accommodate adjusting for covariates that could affect the accuracy of the continuous test.
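
For reference, the binormal model assumes normally distributed test results in the diseased and non-diseased groups, so that ROC(t) = Φ(a + b Φ⁻¹(t)) with intercept a = (μ₁ − μ₀)/σ₁ and slope b = σ₀/σ₁. The sketch below uses the simple plug-in (sample moment) estimates of a and b on simulated data, not the profile-likelihood or pseudo-maximum-likelihood methods proposed in the paper.

```python
import numpy as np
from scipy.stats import norm

def binormal_roc(controls, cases):
    """Plug-in estimates of the binormal intercept a and slope b, so that
    ROC(t) = Phi(a + b * Phi^{-1}(t)), plus the implied AUC."""
    m0, s0 = np.mean(controls), np.std(controls, ddof=1)
    m1, s1 = np.mean(cases), np.std(cases, ddof=1)
    a, b = (m1 - m0) / s1, s0 / s1
    auc = norm.cdf(a / np.sqrt(1 + b ** 2))
    return a, b, auc

rng = np.random.default_rng(4)
controls = rng.normal(0.0, 1.0, 200)      # test results, non-diseased
cases = rng.normal(1.2, 1.5, 100)         # test results, diseased
a, b, auc = binormal_roc(controls, cases)
t = np.linspace(0.01, 0.99, 5)
print(round(a, 2), round(b, 2), round(auc, 2))
print(norm.cdf(a + b * norm.ppf(t)))      # ROC curve at a few FPRs
```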

Journal ArticleDOI
TL;DR: A semiparametric estimation method for time-dependent ROC curves that adopts a regression quantile approach for longitudinal data introduced by Heagerty and Pepe and develops asymptotic distribution theory for the ROC estimators where the distributional shape for the marker is allowed to depend on covariates.
Abstract: One approach to evaluating the strength of association between a longitudinal marker process and a key clinical event time is through predictive regression methods such as a time-dependent covariate hazard model. For example, a Cox model with time-varying covariates specifies the instantaneous risk of the event as a function of the time-varying marker and additional covariates. In this manuscript we explore a second complementary approach which characterizes the distribution of the marker as a function of both the measurement time and the ultimate event time. Our goal is to extend the standard diagnostic accuracy concepts of sensitivity and specificity so as to recognize explicitly both the timing of the marker measurement and the timing of disease. The accuracy of a longitudinal marker can be fully characterized using time-dependent receiver operating characteristic (ROC) curves. We detail a semiparametric estimation method for time-dependent ROC curves that adopts a regression quantile approach for longitudinal data introduced by Heagerty and Pepe (1999, Applied Statistics, 48, 533-551). We extend the work of Heagerty and Pepe (1999, Applied Statistics, 48, 533-551) by developing asymptotic distribution theory for the ROC estimators where the distributional shape for the marker is allowed to depend on covariates. To illustrate our method, we analyze pulmonary function measurements among cystic fibrosis subjects and estimate ROC curves that assess how well the pulmonary function measurement can distinguish subjects that progress to death from subjects that remain alive. Comparing the results from our semiparametric analysis to a fully parametric method discussed by Etzioni et al. (1999, Medical Decision Making, 19, 242-251) suggests that the ability to relax distributional assumptions may be important in practice.

Journal ArticleDOI
TL;DR: It is shown how maximum likelihood estimation can be used to reconstruct a tree model for the dependences between genetic alterations in a given tumour type.
Abstract: SUMMARY We present a new approach for modelling the dependences between genetic changes in human tumours. In solid tumours, data on genetic alterations are usually only available at a single point in time, allowing no direct insight into the sequential order of genetic events. In our approach, genetic tumour development and progression is assumed to follow a probabilistic tree model. We show how maximum likelihood estimation can be used to reconstruct a tree model for the dependences between genetic alterations in a given tumour type. We illustrate the use of the proposed method by applying it to cytogenetic data from 173 cases of clear cell renal cell carcinoma, arriving at a model for the karyotypic evolution of this tumour.

Journal ArticleDOI
TL;DR: Two approaches for regression analysis of multilevel binary data when clusters are not necessarily nested are proposed and compared: a GEE method that relies on a working independence assumption coupled with a three-step method for obtaining empirical standard errors and a likelihood-based method implemented using Bayesian computational techniques.
Abstract: We propose and compare two approaches for regression analysis of multilevel binary data when clusters are not necessarily nested: a GEE method that relies on a working independence assumption coupled with a three-step method for obtaining empirical standard errors, and a likelihood-based method implemented using Bayesian computational techniques. Implications of time-varying endogenous covariates are addressed. The methods are illustrated using data from the Breast Cancer Surveillance Consortium to estimate mammography accuracy from a repeatedly screened population.

Journal ArticleDOI
Tianxi Cai
TL;DR: Simulation studies suggest that the new estimator under the semi-parametric model, while always being more robust, has efficiency that is comparable to or better than the Alonzo and Pepe (2002) estimator from the parametric model.
Abstract: SUMMARY Advances in technology provide new diagnostic tests for early detection of disease. Frequently, these tests have continuous outcomes. One popular method to summarize the accuracy of such a test is the Receiver Operating Characteristic (ROC) curve. Methods for estimating ROC curves have long been available. To examine covariate effects, Pepe (1997, 2000) and Alonzo and Pepe (2002) proposed distribution-free approaches based on a parametric regression model for the ROC curve. Cai and Pepe (2002) extended the parametric ROC regression model by allowing an arbitrary non-parametric baseline function. In this paper, while we follow the same semi-parametric setting as in that paper, we highlight a new estimator that offers several improvements over the earlier work: superior efficiency, the ability to estimate the covariate effects without estimating the non-parametric baseline function and easy implementation with standard software. The methodology is applied to a case control dataset where we evaluate the accuracy of the prostate-specific antigen as a biomarker for early detection of prostate cancer. Simulation studies suggest that the new estimator under the semi-parametric model, while always being more robust, has efficiency that is comparable to or better than the Alonzo and Pepe (2002) estimator from the parametric model.

Journal ArticleDOI
TL;DR: The standard definitions of the predictive values are extended to accommodate prognostic factors that are measured on a continuous scale and a corresponding graphical method is suggested to summarize predictive accuracy.
Abstract: SUMMARY The positive and negative predictive values are standard ways of quantifying predictive accuracy when both the outcome and the prognostic factor are binary. Methods for comparing the predictive values of two or more binary factors have been discussed previously (Leisenring et al., 2000, Biometrics 56, 345–351). We propose extending the standard definitions of the predictive values to accommodate prognostic factors that are measured on a continuous scale and suggest a corresponding graphical method to summarize predictive accuracy. Drawing on the work of Leisenring et al. we make use of a marginal regression framework and discuss methods for estimating these predictive value functions and their differences within this framework. The methods presented in this paper have the potential to be useful in a number of areas including the design of clinical trials and health policy analysis.
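
The basic objects, predictive values indexed by a threshold on the continuous prognostic factor, have a simple empirical analogue, sketched below on simulated data. The paper's marginal-regression estimation framework and the methods for comparing predictive value functions are not reproduced here.

```python
import numpy as np

def predictive_value_curves(marker, outcome, thresholds):
    """PPV(c) = P(D = 1 | Y >= c) and NPV(c) = P(D = 0 | Y < c),
    evaluated over a grid of thresholds c for a continuous marker Y."""
    marker, outcome = np.asarray(marker), np.asarray(outcome)
    ppv, npv = [], []
    for c in thresholds:
        pos, neg = marker >= c, marker < c
        ppv.append(outcome[pos].mean() if pos.any() else np.nan)
        npv.append(1 - outcome[neg].mean() if neg.any() else np.nan)
    return np.array(ppv), np.array(npv)

rng = np.random.default_rng(5)
n = 1000
y = rng.normal(size=n)                            # continuous prognostic factor
d = rng.binomial(1, 1 / (1 + np.exp(-(y - 1))))   # outcome; risk rises with y
grid = np.quantile(y, np.linspace(0.1, 0.9, 9))
ppv, npv = predictive_value_curves(y, d, grid)
print(np.round(ppv, 2))                           # plotted against the grid,
print(np.round(npv, 2))                           # these give the PPV/NPV curves
```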

Journal ArticleDOI
TL;DR: A mixed model is developed in conjunction with maximum pseudolikelihood and generalized linear mixed modeling by extending Baddeley and Turner's (2000, Australian and New Zealand Journal of Statistics 42, 283-322) work on pseudolikelihood for single patterns.
Abstract: SUMMARY The statistical methodology for the analysis of replicated spatial point patterns in complex designs such as those including replications is fairly undeveloped. A mixed model is developed in conjunction with maximum pseudolikelihood and generalized linear mixed modeling by extending Baddeley and Turner’s (2000, Australian and New Zealand Journal of Statistics 42, 283–322) work on pseudolikelihood for single patterns. A simulation experiment is performed on parameter estimation. Fixed- and mixed-effect models are compared, and in some respects the mixed model is found to be superior. An example using the Strauss process for modeling neuron locations in post-mortem brain slices is shown.

Journal ArticleDOI
TL;DR: This framework provides a general method for handling missing covariate values when fitting generalized additive models and demonstrates that standard complete-case methods can yield biased estimates of the spatial variation of cancer risk.
Abstract: SUMMARY Maps depicting cancer incidence rates have become useful tools in public health research, giving valuable information about the spatial variation in rates of disease. Typically, these maps are generated using count data aggregated over areas such as counties or census blocks. However, with the proliferation of geographic information systems and related databases, it is becoming easier to obtain exact spatial locations for the cancer cases and suitable control subjects. The use of such point data allows us to adjust for individual-level covariates, such as age and smoking status, when estimating the spatial variation in disease risk. Unfortunately, such covariate information is often subject to missingness. We propose a method for mapping cancer risk when covariates are not completely observed. We model these data using a logistic generalized additive model. Estimates of the linear and non-linear effects are obtained using a mixed effects model representation. We develop an EM algorithm to account for missing data and the random effects. Since the expectation step involves an intractable integral, we estimate the E-step with a Laplace approximation. This framework provides a general method for handling missing covariate values when fitting generalized additive models. We illustrate our method through an analysis of cancer incidence data from Cape Cod, Massachusetts. These analyses demonstrate that standard complete-case methods can yield biased estimates of the spatial variation of cancer risk.

Journal ArticleDOI
TL;DR: A new maximum likelihood estimator (MLE) for nested case-control sampling in the context of Cox's proportional hazards model is presented and how the maximum likelihood framework can be used to obtain information additional to the relative risk estimates of covariates is illustrated.
Abstract: Nested case-control sampling is designed to reduce the costs of large cohort studies. It is important to estimate the parameters of interest as efficiently as possible. We present a new maximum likelihood estimator (MLE) for nested case-control sampling in the context of Cox's proportional hazards model. The MLE is computed by the EM-algorithm, which is easy to implement in the proportional hazards setting. Standard errors are estimated by a numerical profile likelihood approach based on EM aided differentiation. The work was motivated by a nested case-control study that hypothesized that insulin-like growth factor I was associated with ischemic heart disease. The study was based on a population of 3784 Danes and 231 cases of ischemic heart disease where controls were matched on age and gender. We illustrate the use of the MLE for these data and show how the maximum likelihood framework can be used to obtain information additional to the relative risk estimates of covariates.

Journal ArticleDOI
TL;DR: A simple algorithm is obtained to compute the maximum likelihood estimator of p(1),.
Abstract: The pool adjacent violator algorithm (Ayer et al., 1955, The Annals of Mathematical Statistics, 26, 641-647) has long been known to give the maximum likelihood estimator of a series of ordered binomial parameters, based on an independent observation from each distribution (see Barlow et al., 1972, Statistical Inference under Order Restrictions, Wiley, New York). This result has immediate application to estimation of a survival distribution based on current survival status at a set of monitoring times. This paper considers an extended problem of maximum likelihood estimation of a series of 'ordered' multinomial parameters p(i) = (p(1i), p(2i), ..., p(mi)) for 1
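
The pool adjacent violators algorithm itself fits in a dozen lines; the sketch below computes the isotonic (weighted least squares) fit, which coincides with the maximum likelihood estimate for ordered binomial proportions such as current-status survival data. The extension to ordered multinomial parameters treated in the paper is not shown.

```python
import numpy as np

def pava(y, w=None):
    """Pool adjacent violators: non-decreasing weighted least squares fit to y,
    which is the MLE for ordered binomial proportions when y are observed
    proportions and w the numbers at risk."""
    y = np.asarray(y, float)
    w = np.ones_like(y) if w is None else np.asarray(w, float)
    merged = []                                   # blocks: [sum, weight, indices]
    for i in range(len(y)):
        merged.append([y[i] * w[i], w[i], [i]])
        # pool while the last two block means violate monotonicity
        while len(merged) > 1 and (merged[-2][0] / merged[-2][1]
                                   > merged[-1][0] / merged[-1][1]):
            s, wt, idx = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += wt
            merged[-1][2] += idx
    fit = np.empty_like(y)
    for s, wt, idx in merged:
        fit[idx] = s / wt
    return fit

# Current-status data: proportion failed by each monitoring time must be
# non-decreasing in time.
events = np.array([1, 0, 3, 2, 6, 5, 9])          # failures observed
n_at = np.array([10, 10, 10, 10, 10, 10, 10])     # examined at each time
print(pava(events / n_at, n_at))   # [0.05 0.05 0.25 0.25 0.55 0.55 0.9]
```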

Journal ArticleDOI
TL;DR: The probability of overdiagnosis is studied numerically for prostate cancer and applied to a variety of screening schedules; the investigation indicates that this probability is remarkably high.
Abstract: SUMMARY Overdiagnosis refers to the situation where a screening exam detects a disease that would have otherwise been undetected in a person’s lifetime. The disease would not have been diagnosed because the individual would have died of other causes prior to its clinical onset. Although the probability of overdiagnosis is an important quantity for understanding early detection programs, it has not been rigorously studied. We analyze an idealized early detection program and derive the mathematical expression for the probability of overdiagnosis. The results are studied numerically for prostate cancer and applied to a variety of screening schedules. Our investigation indicates that the probability of overdiagnosis is remarkably high.
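
The definition translates directly into a small calculation: given a screen-detected case, overdiagnosis occurs when residual other-cause survival is shorter than the residual preclinical sojourn time. The distributions and parameter values below are invented purely for illustration and are not the paper's calibrated inputs or its mathematical derivation.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Illustrative assumption: after a screen-detected cancer, the residual
# preclinical sojourn time and the residual lifetime (other-cause mortality)
# are independent exponentials.
mean_sojourn, mean_life = 7.0, 12.0           # years (made-up values)
resid_sojourn = rng.exponential(mean_sojourn, n)
resid_life = rng.exponential(mean_life, n)

# Overdiagnosis: death from other causes comes before the cancer would have
# surfaced clinically.
p_mc = np.mean(resid_life < resid_sojourn)
p_exact = (1 / mean_life) / (1 / mean_life + 1 / mean_sojourn)
print(f"simulated {p_mc:.3f}, closed form {p_exact:.3f}")     # about 0.37
```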

Journal ArticleDOI
TL;DR: This article constructs and studies estimators of the causal effect of a time-dependent treatment on survival in longitudinal studies using a particular marginal structural model (MSM), proposed by Robins (2000), and follows a general methodology for constructing estimating functions in censored data models.
Abstract: SUMMARY In this article we construct and study estimators of the causal effect of a time-dependent treatment on survival in longitudinal studies. We employ a particular marginal structural model (MSM), proposed by Robins (2000), and follow a general methodology for constructing estimating functions in censored data models. The inverse probability of treatment weighted (IPTW) estimator of Robins et al. (2000) is used as an initial estimator and forms the basis for an improved, one-step estimator that is consistent and asymptotically linear when the treatment mechanism is consistently estimated. We extend these methods to handle informative censoring. The proposed methodology is employed to estimate the causal effect of exercise on mortality in a longitudinal study of seniors in Sonoma County. A simulation study demonstrates the bias of naive estimators in the presence of time-dependent confounders and also shows the efficiency gain of the IPTW estimator, even in the absence of such confounding. The efficiency gain of the improved, one-step estimator is demonstrated through simulation.
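
The inverse-probability-of-treatment-weighting idea can be illustrated in the simplest point-treatment setting: weight each subject by the inverse of the estimated probability of the treatment actually received, which removes confounding by measured covariates. The simulation and weighting scheme below are illustrative assumptions only; the paper treats time-dependent treatments, informative censoring and an improved one-step estimator, none of which appear here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

expit = lambda x: 1 / (1 + np.exp(-x))
rng = np.random.default_rng(7)
n = 20_000

conf = rng.normal(size=n)                      # confounder, e.g. baseline frailty
a = rng.binomial(1, expit(-0.8 * conf))        # frailer people exercise less
y = rng.binomial(1, expit(-1.0 + conf - 0.5 * a))   # death indicator

# True marginal risk difference, from simulated potential outcomes.
true = (rng.binomial(1, expit(-1.0 + conf - 0.5)).mean()
        - rng.binomial(1, expit(-1.0 + conf)).mean())

# The naive comparison is confounded; IPTW reweights each subject by the
# inverse of the estimated probability of the treatment actually received.
naive = y[a == 1].mean() - y[a == 0].mean()
ps = LogisticRegression().fit(conf[:, None], a).predict_proba(conf[:, None])[:, 1]
w = np.where(a == 1, 1 / ps, 1 / (1 - ps))
iptw = (np.average(y[a == 1], weights=w[a == 1])
        - np.average(y[a == 0], weights=w[a == 0]))
print(f"true {true:+.3f}  naive {naive:+.3f}  IPTW {iptw:+.3f}")
```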

Journal ArticleDOI
Jennifer Pittman, Erich Huang, Joseph R. Nevins, Quanli Wang, Mike West
TL;DR: Bayesian analysis of a specific class of tree models in which binary response data arise from a retrospective case-control design is described and questions of generating and combining multiple trees via Bayesian model averaging for prediction are discussed.
Abstract: SUMMARY Classification tree models are flexible analysis tools which have the ability to evaluate interactions among predictors as well as generate predictions for responses of interest. We describe Bayesian analysis of a specific class of tree models in which binary response data arise from a retrospective case-control design. We are also particularly interested in problems with potentially very many candidate predictors. This scenario is common in studies concerning gene expression data, which is a key motivating example context. Innovations here include the introduction of tree models that explicitly address and incorporate the retrospective design, and the use of nonparametric Bayesian models involving Dirichlet process priors on the distributions of predictor variables. The model specification influences the generation of trees through Bayes’ factor based tests of association that determine significant binary partitions of nodes during a process of forward generation of trees. We describe this constructive process and discuss questions of generating and combining multiple trees via Bayesian model averaging for prediction. Additional discussion of parameter selection and sensitivity is given in the context of an example which concerns prediction of breast tumour status utilizing high-dimensional gene expression data; the example demonstrates the exploratory/explanatory uses of such models as well as their primary utility in prediction. Shortcomings of the approach and comparison with alternative tree modelling algorithms are also discussed, as are issues of modelling and computational extensions.

Journal ArticleDOI
TL;DR: Methods for testing for differences in mean functions between treatment groups which accommodate the fact that each particular event process is ultimately terminated by death are described.
Abstract: SUMMARY In studies involving diseases associated with high rates of mortality, trials are frequently conducted to evaluate the effects of therapeutic interventions on recurrent event processes terminated by death. In this setting, cumulative mean functions form a natural basis for inference for questions of a health economic nature, and Ghosh and Lin (2000) recently proposed a relevant class of test statistics. Trials of patients with cancer metastatic to bone, however, involve multiple types of skeletal complications, each of which may be repeatedly experienced by patients over their lifetime. Traditionally the distinction between the various types of events is ignored and univariate analyses are conducted based on a composite recurrent event. However, when the events have different impacts on patients’ quality of life, or when they incur different costs, it can be important to gain insight into the relative frequency of the specific types of events and treatment effects thereon. This may be achieved by conducting separate marginal analyses with each analysis focusing on one type of recurrent event. Global inferences regarding treatment benefit can then be achieved by carrying out multiplicity adjusted marginal tests, more formal multiple testing procedures, or by constructing global test statistics. We describe methods for testing for differences in mean functions between treatment groups which accommodate the fact that each particular event process is ultimately terminated by death. The methods are illustrated by application to a motivating study designed to examine the effect of bisphosphonate therapy on the incidence of skeletal complications among patients with breast cancer metastatic to bone. We find that there is a consistent trend towards a reduction in the cumulative mean for all four types of skeletal complications with bisphosphonate therapy; there is a significant reduction in the need for radiation therapy for the treatment of bone. The global test suggests that bisphosphonate therapy significantly reduces the overall number of skeletal complications.

Journal ArticleDOI
TL;DR: A non-linear Bayesian hierarchical model is proposed to combine longitudinal data on PSA growth from three different studies to characterize growth rates of PSA accounting for differences when combining data from different studies, and to investigate the impact of clinical covariates such as advanced disease and unfavorable histology on PSA growth rates.
Abstract: SUMMARY Prostate-specific antigen (PSA) is a biomarker commonly used to screen for prostate cancer. Several studies have examined PSA growth rates prior to prostate cancer diagnosis. However, the resulting estimates are highly variable. In this article we propose a non-linear Bayesian hierarchical model to combine longitudinal data on PSA growth from three different studies. Our model enables novel investigations into patterns of PSA growth that were previously impossible due to sample size limitations. The goals of our analysis are twofold: first, to characterize growth rates of PSA accounting for differences when combining data from different studies; second, to investigate the impact of clinical covariates such as advanced disease and unfavorable histology on PSA growth rates.

Journal ArticleDOI
TL;DR: In this article, register-based family studies provide the motivation for linking a two-stage estimation procedure in copula models for multivariate failure time data with a composite likelihood approach.
Abstract: SUMMARY In this paper register based family studies provide the motivation for linking a two-stage estimation procedure in copula models for multivariate failure time data with a composite likelihood approach. The asymptotic properties of the estimators in both parametric and semi-parametric models are derived, combining the approaches of Parner (2001) and Andersen (2003). The method is mainly studied when the families consist of groups of exchangeable members (e.g. siblings) or members at different levels (e.g. parents and children). The advantages of the proposed method are especially clear in this last case where very flexible modelling is possible. The suggested method is also studied in simulations and found to be efficient compared to maximum likelihood. Finally, the suggested method is applied to a family study of deep venous thromboembolism where it is seen that the association between ages at onset is larger for siblings than for parents or for parents and siblings.

Journal ArticleDOI
TL;DR: A model is presented for comparing several methods of measurement in the situation where replicate measurements by each method are available and a fitting algorithm is presented that allows the estimation of linear relationships between methods as well as relevant variance components.
Abstract: SUMMARY In studies designed to compare different methods of measurement where more than two methods are compared or replicate measurements by each method are available, standard statistical approaches such as computation of limits of agreement are not directly applicable. A model is presented for comparing several methods of measurement in the situation where replicate measurements by each method are available. Measurements are viewed as classified by method, subject and replicate. Models assuming exchangeable as well as non-exchangeable replicates are considered. A fitting algorithm is presented that allows the estimation of linear relationships between methods as well as relevant variance components. The algorithm only uses methods already implemented in most statistical software.

Journal ArticleDOI
TL;DR: It is demonstrated that a model including a frailty effect fits the incidence data well and gives interesting results and interpretations, although this is no proof of the effect's truth.
Abstract: SUMMARY The incidence of testicular cancer is highest among young men, and then decreases sharply with age. This points towards a frailty effect, where some men have a much greater risk of testicular cancer than the majority of the male population. Those with the highest risk get cancer, drain the group of individuals at risk, and leave a healthy male population which has approximately zero risk of testicular cancer. This leads to the observed decrease in incidence. We discuss a frailty model, where the frailty is compound Poisson distributed. This allows for a non-susceptible group (of zero frailty). The model is successfully applied to incidence data from the Danish and Norwegian registries. It is indicated that there was a decrease in incidence for males born during World War II in both countries. Bootstrap analysis is used to find the degree of variation in the estimates. In the Armitage–Doll multistage model, the estimated number of transitions needed for a cell to become malignant is close to 3 for non-seminomas and 4 for seminomas in both the Danish and Norwegian data. This paper demonstrates that a model including a frailty effect fits the incidence data well and gives interesting results and interpretations, although this is no proof of the effect’s truth.

Journal ArticleDOI
TL;DR: A hierarchical bivariate time series model is developed to characterize the relationship between particulate matter less than 10 microns in aerodynamic diameter (PM10) and both mortality and hospital admissions for cardiovascular diseases and can predict the hospitalization log relative rate for a new city for which hospitalization data are unavailable, using that city's estimated mortality relative rate.
Abstract: SUMMARY In this paper we develop a hierarchical bivariate time series model to characterize the relationship between particulate matter less than 10 microns in aerodynamic diameter (PM10) and both mortality and hospital admissions for cardiovascular diseases. The model is applied to time series data on mortality and morbidity for 10 metropolitan areas in the United States from 1986 to 1993. We postulate that these time series should be related through a shared relationship with PM10. At the first stage of the hierarchy, we fit two seemingly unrelated Poisson regression models to produce city-specific estimates of the log relative rates of mortality and morbidity associated with exposure to PM10 within each location. The sample covariance matrix of the estimated log relative rates is obtained using a novel generalized estimating equation approach that takes into account the correlation between the mortality and morbidity time series. At the second stage, we combine information across locations to estimate overall log relative rates of mortality and morbidity and variation of the rates across cities. Using the combined information across the 10 locations we find that a 10 µg/m³ increase in average PM10 at the current day and previous day is associated with a 0.26% increase in mortality (95% posterior interval −0.37, 0.65), and a 0.71% increase in hospital admissions (95% posterior interval 0.35, 0.99).
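
The second-stage idea, combining city-specific log relative rates into an overall estimate, can be mimicked with a simple fixed-effect (inverse-variance) pooling. The numbers below are invented placeholders, and the paper's hierarchical Bayesian second stage additionally estimates between-city variation and exploits the correlation between the mortality and morbidity series.

```python
import numpy as np

# Hypothetical city-specific log relative rates of mortality per 10 ug/m^3
# PM10 (first-stage output) and their standard errors -- made-up numbers.
beta = np.array([0.0021, 0.0045, -0.0010, 0.0030, 0.0052,
                 0.0018, 0.0025, 0.0007, 0.0060, 0.0012])
se = np.array([0.0020, 0.0031, 0.0028, 0.0022, 0.0040,
               0.0019, 0.0026, 0.0033, 0.0045, 0.0024])

# Fixed-effect (inverse-variance) pooling: a crude stand-in for the
# hierarchical second stage.
w = 1 / se ** 2
pooled = np.sum(w * beta) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(f"pooled log RR = {pooled:.4f} (SE {pooled_se:.4f})")
print(f"about a {100 * (np.exp(pooled) - 1):.2f}% increase in mortality "
      f"per 10 ug/m^3 PM10")
```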