
Showing papers in "Biostatistics in 2007"


Journal ArticleDOI
TL;DR: This paper proposes parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples.
Abstract: SUMMARY Non-biological experimental variation or “batch effects” are commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches difficult. The ability to combine microarray data sets is advantageous to researchers to increase statistical power to detect biological phenomena from studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes (>25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.
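As a rough illustration of the kind of adjustment involved (not the paper's empirical Bayes procedure, which additionally shrinks the per-batch estimates across genes to remain stable for small batches), a naive per-gene location/scale batch correction might look like the following Python sketch; the function and variable names are hypothetical.

```python
# Illustrative sketch only: naive per-gene location/scale batch correction.
# The paper's method adds empirical Bayes shrinkage across genes, omitted here.
import numpy as np

def naive_batch_adjust(expr, batch):
    """expr: genes x samples matrix; batch: per-sample batch labels."""
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    adjusted = expr.copy()
    grand_mean = expr.mean(axis=1, keepdims=True)
    pooled_sd = expr.std(axis=1, ddof=1, keepdims=True)
    for b in np.unique(batch):
        cols = np.where(batch == b)[0]
        mu_b = expr[:, cols].mean(axis=1, keepdims=True)
        sd_b = expr[:, cols].std(axis=1, ddof=1, keepdims=True)
        sd_b[sd_b == 0] = 1.0  # guard against genes that are constant in a batch
        adjusted[:, cols] = (expr[:, cols] - mu_b) / sd_b * pooled_sd + grand_mean
    return adjusted
```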

6,319 citations


Journal ArticleDOI
TL;DR: A quantile-adjusted conditional maximum likelihood estimator for the dispersion parameter of the negative binomial distribution is derived, along with an "exact" test that outperforms the standard approximate asymptotic tests.
Abstract: We derive a quantile-adjusted conditional maximum likelihood estimator for the dispersion parameter of the negative binomial distribution and compare its performance, in terms of bias, to various other methods. Our estimation scheme outperforms all other methods in very small samples, typical of those from serial analysis of gene expression studies, the motivating data for this study. The impact of dispersion estimation on hypothesis testing is studied. We derive an "exact" test that outperforms the standard approximate asymptotic tests.
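For orientation, here is a minimal sketch of plain (unadjusted) maximum likelihood estimation of the dispersion for a single gene, with the variance parameterized as mu + phi*mu^2; the paper's quantile-adjusted conditional ML estimator is designed to reduce the small-sample bias that naive approaches of this kind suffer from. Function names and the example counts are hypothetical.

```python
# Sketch: ordinary ML for the negative binomial dispersion of one gene.
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def nb_loglik(phi, counts, mu):
    r = 1.0 / phi  # "size" parameter; variance = mu + phi * mu^2
    return np.sum(gammaln(counts + r) - gammaln(r) - gammaln(counts + 1)
                  + r * np.log(r / (r + mu)) + counts * np.log(mu / (r + mu)))

def ml_dispersion(counts):
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean()
    res = minimize_scalar(lambda phi: -nb_loglik(phi, counts, mu),
                          bounds=(1e-6, 10.0), method="bounded")
    return res.x

print(ml_dispersion([3, 7, 2, 11, 5]))  # hypothetical SAGE-like counts
```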

1,038 citations


Journal ArticleDOI
TL;DR: It is shown that the hierarchical summary receiver operating characteristic (ROC) model and bivariate random-effects meta-analysis are very closely related, and the circumstances in which they are identical are defined.
Abstract: Studies of diagnostic accuracy require more sophisticated methods for their meta-analysis than studies of therapeutic interventions. A number of different, and apparently divergent, methods for meta-analysis of diagnostic studies have been proposed, including two alternative approaches that are statistically rigorous and allow for between-study variability: the hierarchical summary receiver operating characteristic (ROC) model (Rutter and Gatsonis, 2001) and bivariate random-effects meta-analysis (van Houwelingen and others, 1993), (van Houwelingen and others, 2002), (Reitsma and others, 2005). We show that these two models are very closely related, and define the circumstances in which they are identical. We discuss the different forms of summary model output suggested by the two approaches, including summary ROC curves, summary points, confidence regions, and prediction regions.
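In commonly used notation (assumed here rather than quoted from the paper), the bivariate random-effects model places a joint normal distribution on each study's logit sensitivity and logit specificity; the hierarchical summary ROC model reparameterizes the same pair of random effects in terms of a threshold and an accuracy parameter.

```latex
\begin{align*}
\begin{pmatrix} \mu_{Ai} \\ \mu_{Bi} \end{pmatrix}
  &\sim N\!\left(\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},
        \begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}\right),\\
\mu_{Ai} &= \operatorname{logit}(\text{sensitivity}_i), \qquad
\mu_{Bi} = \operatorname{logit}(\text{specificity}_i).
\end{align*}
```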

637 citations


Journal ArticleDOI
TL;DR: Through both simulated data and real-life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method and can be as competitive as support vector machine classifiers.
Abstract: In this paper, we introduce a modified version of linear discriminant analysis, called the "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) (Tibshirani and others, 2003) to classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real-life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method (using the NSC algorithm) and can be as competitive as support vector machine classifiers. It is also suitable for feature elimination purposes and can be used as a gene selection method. The open source R package for this method (named "rda") is available on CRAN (http://www.r-project.org) for download and testing.
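A rough sketch of the nearest-shrunken-centroid ingredient that SCRDA builds on is given below (soft-thresholding class centroids toward the overall centroid); the full SCRDA method additionally regularizes the within-class covariance, which is omitted here, and the shrinkage amount delta is a hypothetical tuning value.

```python
# Sketch of soft-thresholded ("shrunken") class centroids and nearest-centroid
# classification; not the full SCRDA estimator.
import numpy as np

def shrunken_centroids(X, y, delta):
    """X: samples x genes, y: class labels. Returns dict of shrunken centroids."""
    overall = X.mean(axis=0)
    centroids = {}
    for k in np.unique(y):
        diff = X[y == k].mean(axis=0) - overall
        shrunk = np.sign(diff) * np.maximum(np.abs(diff) - delta, 0.0)  # soft threshold
        centroids[k] = overall + shrunk
    return centroids

def classify(x, centroids):
    return min(centroids, key=lambda k: np.sum((x - centroids[k]) ** 2))
```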

602 citations


Journal ArticleDOI
TL;DR: A novel linear model for quantile regression (QR) that includes random effects in order to account for the dependence between serial observations on the same subject is proposed and appears to be a robust alternative to the mean regression with random effects when the location parameter of the conditional distribution of the response is of interest.
Abstract: In longitudinal studies, measurements of the same individuals are taken repeatedly through time. Often, the primary goal is to characterize the change in response over time and the factors that influence change. Factors can affect not only the location but also more generally the shape of the distribution of the response over time. To make inference about the shape of a population distribution, the widely popular mixed-effects regression, for example, would be inadequate if the distribution is not approximately Gaussian. We propose a novel linear model for quantile regression (QR) that includes random effects in order to account for the dependence between serial observations on the same subject. The notion of QR is synonymous with robust analysis of the conditional distribution of the response variable. We present a likelihood-based approach to the estimation of the regression quantiles that uses the asymmetric Laplace density. In a simulation study, the proposed method had an advantage in terms of mean squared error of the QR estimator when compared with the approach that considers penalized fixed effects. Following our strategy, a nearly optimal degree of shrinkage of the individual effects is automatically selected by the data and their likelihood. Also, our model appears to be a robust alternative to mean regression with random effects when the location parameter of the conditional distribution of the response is of interest. We apply our model to a real data set consisting of self-reported labor pain measurements taken on women repeatedly over time, whose distribution is characterized by skewness, and the significance of the parameters is evaluated by the likelihood ratio statistic.
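The working likelihood mentioned above can be written in the standard asymmetric Laplace form (notation assumed): maximizing it in the location parameter is equivalent to minimizing the usual quantile-regression check loss for the tau-th quantile.

```latex
\begin{align*}
f(y \mid \mu, \sigma, \tau) &= \frac{\tau(1-\tau)}{\sigma}
   \exp\!\left\{-\rho_\tau\!\left(\frac{y-\mu}{\sigma}\right)\right\},\\
\rho_\tau(u) &= u\,\bigl(\tau - I(u < 0)\bigr).
\end{align*}
```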

392 citations


Journal ArticleDOI
TL;DR: A new method for picking prior distributions is introduced, and a number of refinements of previously used models are proposed, including ecological bias, mutual standardization, and choice of both spatial model and prior specification.
Abstract: SUMMARY In this paper, we provide critical reviews of methods suggested for the analysis of aggregate count data in the context of disease mapping and spatial regression. We introduce a new method for picking prior distributions, and propose a number of refinements of previously used models. We also consider ecological bias, mutual standardization, and choice of both spatial model and prior specification. We analyze male lip cancer incidence data collected in Scotland over the period 1975–1980, and outline a number of problems with previous analyses of these data. In disease mapping studies, hierarchical models can provide robust estimation of area-level risk parameters, though care is required in the choice of covariate model, and it is important to assess the sensitivity of estimates to the spatial model chosen, and to the prior specifications on the variance parameters. Spatial ecological regression is a far more hazardous enterprise for two reasons. First, there is always the possibility of ecological bias, and this can only be alleviated by the inclusion of individual-level data. For the Scottish data, we show that the previously used mean model has limited interpretation from an individual perspective. Second, when residual spatial dependence is modeled, and if the exposure has spatial structure, then estimates of exposure association parameters will change when compared with those obtained from the independence across space model, and the data alone cannot choose the form and extent of spatial correlation that is appropriate.
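For readers unfamiliar with the model class under review, a generic hierarchical disease-mapping model has the following structure (notation assumed; this is not the specific covariate or prior model the paper recommends): observed counts with expected counts, an area-level covariate, and unstructured plus spatially structured random effects.

```latex
\begin{align*}
Y_i \mid \beta, \epsilon_i, S_i &\sim \text{Poisson}\!\bigl(E_i \exp(\beta_0 + \beta_1 x_i + \epsilon_i + S_i)\bigr),\\
\epsilon_i &\sim N(0, \sigma_\epsilon^2), \qquad
S_i \;\text{given a spatially structured (e.g. ICAR) prior.}
\end{align*}
```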

341 citations


Journal ArticleDOI
TL;DR: A preprocessing methodology for a technology designed for the identification of DNA sequence variants in specific genes or regions of the human genome that are associated with phenotypes of interest such as disease is described.
Abstract: SUMMARY In most microarray technologies, a number of critical steps are required to convert raw intensity measurements into the data relied upon by data analysts, biologists, and clinicians. These data manipulations, referred to as preprocessing, can influence the quality of the ultimate measurements. In the last few years, the high-throughput measurement of gene expression has been the most popular application of microarray technology. For this application, various groups have demonstrated that the use of modern statistical methodology can substantially improve accuracy and precision of the gene expression measurements, relative to ad hoc procedures introduced by designers and manufacturers of the technology. Currently, other applications of microarrays are becoming more and more popular. In this paper, we describe a preprocessing methodology for a technology designed for the identification of DNA sequence variants in specific genes or regions of the human genome that are associated with phenotypes of interest such as disease. In particular, we describe a methodology useful for preprocessing Affymetrix single-nucleotide polymorphism chips and obtaining genotype calls with the preprocessed data. We demonstrate how our procedure improves existing approaches using data from 3 relatively large studies, including one in which large numbers of independent calls are available. The proposed methods are implemented in the package oligo available from Bioconductor.

265 citations


Journal ArticleDOI
TL;DR: The parametric non-mixture cure fraction model is extended to incorporate background mortality, thus providing estimates of the cure fraction in population-based cancer studies, and the estimates of relative survival and the cure fraction from the 2 types of model are compared.
Abstract: SUMMARY In population-based cancer studies, cure is said to occur when the mortality (hazard) rate in the diseased group of individuals returns to the same level as that expected in the general population. The cure fraction (the proportion of patients cured of disease) is of interest to patients and is a useful measure to monitor trends in survival of curable disease. There are 2 main types of cure fraction model, the mixture cure fraction model and the non-mixture cure fraction model, with most previous work concentrating on the mixture cure fraction model. In this paper, we extend the parametric non-mixture cure fraction model to incorporate background mortality, thus providing estimates of the cure fraction in population-based cancer studies. We compare the estimates of relative survival and the cure fraction between the 2 types of model and also investigate the importance of modeling the ancillary parameters in the selected parametric distribution for both types of model.
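In one common notation (assumed here), the non-mixture cure model and its extension to incorporate background mortality can be written as follows, where pi is the cure fraction, F(t) a proper distribution function, and S*(t) the expected survival in the general population; relative survival levels off at pi as t grows.

```latex
\begin{align*}
R(t) &= \pi^{F(t)} = \exp\{\ln(\pi)\,F(t)\},\\
S(t) &= S^{*}(t)\, R(t).
\end{align*}
```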

234 citations


Journal ArticleDOI
TL;DR: It is shown that case-crossover using conditional logistic regression is a special case of time series analysis when there is a common exposure such as in air pollution studies, and this equivalence provides computational convenience for case-crossover analyses and a better understanding of time series models.
Abstract: The case-crossover design was introduced in epidemiology 15 years ago as a method for studying the effects of a risk factor on a health event using only cases. The idea is to compare a case's exposure immediately prior to or during the case-defining event with that same person's exposure at otherwise similar "reference" times. An alternative approach to the analysis of daily exposure and case-only data is time series analysis. Here, log-linear regression models express the expected total number of events on each day as a function of the exposure level and potential confounding variables. In time series analyses of air pollution, smooth functions of time and weather are the main confounders. Time series and case-crossover methods are often viewed as competing methods. In this paper, we show that case-crossover using conditional logistic regression is a special case of time series analysis when there is a common exposure such as in air pollution studies. This equivalence provides computational convenience for case-crossover analyses and a better understanding of time series models. Time series log-linear regression accounts for overdispersion of the Poisson variance, while case-crossover analyses typically do not. This equivalence also permits model checking for case-crossover data using standard log-linear model diagnostics.
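The time-series log-linear model referred to above has the generic form below (notation assumed), with daily counts Y_t, a shared exposure x_t, and smooth functions of time and weather as confounders; the paper's point is that the conditional logistic case-crossover likelihood is a special case of such a model when the exposure is common across subjects on a given day.

```latex
\[
Y_t \sim \text{Poisson}(\mu_t), \qquad
\log \mu_t = \beta x_t + s_1(t) + s_2(\text{weather}_t).
\]
```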

221 citations


Journal ArticleDOI
TL;DR: By averaging the genes within the clusters obtained from hierarchical clustering, supergenes are defined and used to fit regression models, thereby attaining concise interpretation and accuracy in regression of DNA microarray data.
Abstract: SUMMARY Although averaging is a simple technique, it plays an important role in reducing variance. We use this essential property of averaging in regression of the DNA microarray data, which poses the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) Lasso. By averaging the genes within the clusters obtained from hierarchical clustering, we define supergenes and use them to fit regression models, thereby attaining concise interpretation and accuracy. Our methods are supported with theoretical justifications and demonstrated on simulated and real data sets.
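A minimal sketch of the two-step procedure is given below, assuming hypothetical choices for the linkage, the number of clusters, and the lasso penalty (the paper's actual tuning strategy may differ): genes are hierarchically clustered, averaged within clusters into "supergenes", and the supergenes are used as predictors in a lasso regression.

```python
# Sketch: (1) cluster genes, (2) average within clusters, (3) lasso on supergenes.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import Lasso

def supergene_lasso(X, y, n_clusters=50, alpha=0.1):
    """X: samples x genes expression matrix, y: continuous response."""
    Z = linkage(X.T, method="average", metric="correlation")  # cluster the genes
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    supergenes = np.column_stack([X[:, labels == k].mean(axis=1)
                                  for k in np.unique(labels)])
    model = Lasso(alpha=alpha).fit(supergenes, y)
    return model, labels
```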

179 citations


Journal ArticleDOI
TL;DR: It is concluded that careful justification of assumptions about the dependence between tests in diseased and nondiseased subjects is necessary in order to ensure unbiased estimates of prevalence and test operating characteristics and to provide these estimates with clinical interpretations.
Abstract: Latent class analysis is used to assess diagnostic test accuracy when a gold standard assessment of disease is not available but results of multiple imperfect tests are. We consider the simplest setting, where 3 tests are observed and conditional independence (CI) is assumed. Closed-form expressions for maximum likelihood parameter estimates are derived. They show explicitly how observed 2- and 3-way associations between test results are used to infer disease prevalence and test true- and false-positive rates. Although interesting and reasonable under CI, the estimators clearly have no basis when it fails. Intuition for bias induced by conditional dependence follows from the analytic expressions. Further intuition derives from an Expectation Maximization (EM) approach to calculating the estimates. We discuss implications of our results and related work for settings where more than 3 tests are available. We conclude that careful justification of assumptions about the dependence between tests in diseased and nondiseased subjects is necessary in order to ensure unbiased estimates of prevalence and test operating characteristics and to provide these estimates with clinical interpretations. Such justification must be based in part on a clear clinical definition of disease and biological knowledge about mechanisms giving rise to test results.
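Under conditional independence, the eight cell probabilities for three binary tests take the standard latent-class form below (notation assumed): pi is the disease prevalence, s_j the sensitivity, and f_j the false-positive rate of test j; the closed-form estimators discussed in the paper invert expressions of this type.

```latex
\[
P(T_1=t_1, T_2=t_2, T_3=t_3)
 = \pi \prod_{j=1}^{3} s_j^{\,t_j} (1-s_j)^{1-t_j}
 + (1-\pi) \prod_{j=1}^{3} f_j^{\,t_j} (1-f_j)^{1-t_j}.
\]
```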

Journal ArticleDOI
TL;DR: A method is proposed for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples; this can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples.
Abstract: We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).
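The flavor of an outlier-sum-type statistic of this kind is sketched below in Python; this is a paraphrase of the general idea rather than the exact published definition, and the fence rule (75th percentile plus one IQR) is an assumption.

```python
# Sketch: robustly standardize with median/MAD, then sum disease-group values
# lying beyond an upper fence.
import numpy as np

def outlier_sum(x_control, x_disease):
    x_all = np.concatenate([x_control, x_disease])
    med = np.median(x_all)
    mad = np.median(np.abs(x_all - med))
    if mad == 0:
        mad = 1.0                          # guard against zero spread
    z_all = (x_all - med) / mad            # robustly standardized values
    z_dis = (np.asarray(x_disease) - med) / mad
    q25, q75 = np.percentile(z_all, [25, 75])
    fence = q75 + (q75 - q25)              # upper fence: 75th percentile + IQR
    return z_dis[z_dis > fence].sum()      # evidence of outlying high expression
```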

Journal ArticleDOI
TL;DR: The connection between reproducibility and prediction accuracy is exploited to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized, and the IGP is found to be the best measure of prediction accuracy.
Abstract: In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure, which all use the IGP but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).
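A sketch of an in-group-proportion-style computation is shown below; it reflects our reading of the idea (nearest-centroid assignment followed by a nearest-neighbour agreement rate) and may not match the published definition in every detail.

```python
# Sketch: assign new samples to previously defined clusters by nearest centroid,
# then compute, per cluster, the fraction of members whose nearest neighbour
# among the new samples is assigned to the same cluster.
import numpy as np
from scipy.spatial.distance import cdist

def igp(new_X, centroids):
    """new_X: samples x genes; centroids: clusters x genes. Returns {cluster: IGP}."""
    assign = np.argmin(cdist(new_X, centroids), axis=1)
    d = cdist(new_X, new_X)
    np.fill_diagonal(d, np.inf)
    nn = np.argmin(d, axis=1)              # nearest neighbour of each new sample
    out = {}
    for k in range(centroids.shape[0]):
        members = np.where(assign == k)[0]
        if members.size:
            out[k] = np.mean(assign[nn[members]] == k)
    return out
```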

Journal ArticleDOI
TL;DR: This work proposes a generally applicable estimate of the optimal discovery procedure (ODP), which has recently been introduced and theoretically shown to optimally perform multiple significance tests in a high-dimensional study, and shows favorable performance over five highly used existing methods.
Abstract: As much of the focus of genetics and molecular biology has shifted toward the systems level, it has become increasingly important to accurately extract biologically relevant signal from thousands of related measurements. The common property among these high-dimensional biological studies is that the measured features have a rich and largely unknown underlying structure. One example of much recent interest is identifying differentially expressed genes in comparative microarray experiments. We propose a new approach aimed at optimally performing many hypothesis tests in a high-dimensional study. This approach estimates the optimal discovery procedure (ODP), which has recently been introduced and theoretically shown to optimally perform multiple significance tests. Whereas existing procedures essentially use data from only one feature at a time, the ODP approach uses the relevant information from the entire data set when testing each feature. In particular, we propose a generally applicable estimate of the ODP for identifying differentially expressed genes in microarray experiments. This microarray method consistently shows favorable performance over five highly used existing methods. For example, in testing for differential expression between two breast cancer tumor types, the ODP provides increases from 72% to 185% in the number of genes called significant at a false discovery rate of 3%. Our proposed microarray method is freely available to academic users in the open-source, point-and-click EDGE software package.
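Schematically, the estimated ODP statistic for feature i evaluates that feature's data under the fitted alternative and null densities of all m features and takes their ratio (notation assumed; this omits the details of how the densities are estimated and thresholded).

```latex
\[
\widehat{S}_{\mathrm{ODP}}(x_i)
 = \frac{\sum_{j=1}^{m} \hat{f}_{1j}(x_i)}{\sum_{j=1}^{m} \hat{f}_{0j}(x_i)},
\qquad \text{feature } i \text{ is called significant when } \widehat{S}_{\mathrm{ODP}}(x_i) \text{ is large.}
\]
```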

Journal ArticleDOI
TL;DR: The logistic transformation, originally suggested by Johnson (1949), is applied to analyze responses that are restricted to a finite interval, so-called bounded outcome scores, and comes close to ordinal probit (OP) regression for a bounded outcome score of the second type with equal variances.
Abstract: The logistic transformation, originally suggested by Johnson (1949), is applied to analyze responses that are restricted to a finite interval (e.g. (0,1)), so-called bounded outcome scores. Bounded outcome scores often have a non-standard distribution, e.g. J- or U-shaped, precluding classical parametric statistical approaches for analysis. Applying the logistic transformation to a normally distributed random variable gives rise to a logit-normal (LN) distribution. This distribution can take a variety of shapes on (0,1). Further, the model can be extended to correct for (baseline) covariates. Therefore, the method could be useful for comparative clinical trials. Bounded outcomes can be found in many research areas, e.g. drug compliance research, quality-of-life studies, and pain (and pain relief) studies using visual analog scores, but all these scores can attain the boundary values 0 or 1. A natural extension of the above approach is therefore to assume a latent score on (0,1) having a LN distribution. Two cases are considered: (a) the bounded outcome score is a proportion where the true probabilities have a LN distribution on (0,1) and (b) the bounded outcome score on [0,1] is a coarsened version of a latent score with a LN distribution on (0,1). We also allow the variance (on the transformed scale) to depend on treatment. The usefulness of our approach for comparative clinical trials will be assessed in this paper. It turns out to be important to distinguish the case of equal and unequal variances. For a bounded outcome score of the second type and with equal variances, our approach comes close to ordinal probit (OP) regression. However, ignoring the inequality of variances can lead to highly biased parameter estimates. A simulation study compares the performance of our approach with the two-sample Wilcoxon test and with OP regression. Finally, the different methods are illustrated on two data sets.
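The logit-normal construction is simply the inverse-logit image of a normal latent variable (standard notation assumed), which yields a flexible family of J-, U-, or bell-shaped distributions on (0,1).

```latex
\[
Z \sim N(\mu, \sigma^2), \qquad
Y = \frac{e^{Z}}{1 + e^{Z}} \in (0,1), \qquad
\operatorname{logit}(Y) = \log\frac{Y}{1-Y} = Z.
\]
```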

Journal ArticleDOI
TL;DR: The heightened sensitivity/specificity relation obtained on a large data set shows that longitudinal monitoring of an athlete's steroid profile may be used efficiently to detect the abuse of testosterone and its precursors in sports.
Abstract: SUMMARY We developed a test that compares sequential measurements of a biomarker against previous readings performed on the same individual. A probability mass function expresses prior information on interindividual variations of intraindividual parameters. Then, the model progressively integrates new readings to more accurately quantify the characteristics of the individual. This Bayesian framework generalizes the two main approaches currently used in forensic toxicology for the detection of abnormal values of a biomarker. The specificity is independent of the number n of previous test results, with a model that gradually evolves from population-derived limits when n = 0 to individual-based cutoff thresholds when n is large. We applied this model to detect abnormal values in an athlete’s steroid profile characterized by the testosterone over epitestosterone (T/E) marker. A cross-validation procedure was used for the estimation of prior densities as well as model validation. The heightened sensitivity/specificity relation obtained on a large data set shows that longitudinal monitoring of an athlete’s steroid profile may be used efficiently to detect the abuse of testosterone and its precursors in sports. Mild assumptions make the model interesting for other areas of forensic toxicology.
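A toy normal-normal analogue of the idea is sketched below (heavily simplified; the paper works with priors on intraindividual T/E parameters rather than this conjugate model): with no previous readings the threshold is population-based, and as readings accrue it converges to an individual-specific limit. All parameter names are hypothetical.

```python
# Toy sketch: conjugate normal-normal updating of an individual mean, then an
# upper limit for the next reading based on the predictive distribution.
import numpy as np

def individual_threshold(readings, prior_mean, prior_var, within_var, z=2.33):
    n = len(readings)
    if n == 0:
        post_mean, post_var = prior_mean, prior_var       # population-derived limit
    else:
        w = prior_var / (prior_var + within_var / n)      # shrinkage weight
        post_mean = w * np.mean(readings) + (1 - w) * prior_mean
        post_var = 1.0 / (1.0 / prior_var + n / within_var)
    return post_mean + z * np.sqrt(post_var + within_var)  # predictive upper limit
```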

Journal ArticleDOI
TL;DR: This work introduces a methodology for sample size determination for prediction in the context of high-dimensional data that captures variability in both steps of predictor development and finds that many prediction problems do not require a large training set of arrays for classifier development.
Abstract: Many gene expression studies attempt to develop a predictor of pre-defined diagnostic or prognostic classes. If the classes are similar biologically, then the number of genes that are differentially expressed between the classes is likely to be small compared to the total number of genes measured. This motivates a two-step process for predictor development: a subset of differentially expressed genes is selected for use in the predictor, and the predictor is then constructed from these. Both these steps will introduce variability into the resulting classifier, so both must be incorporated in sample size estimation. We introduce a methodology for sample size determination for prediction in the context of high-dimensional data that captures variability in both steps of predictor development. The methodology is based on a parametric probability model, but permits sample size computations to be carried out in a practical manner without extensive requirements for preliminary data. We find that many prediction problems do not require a large training set of arrays for classifier development.

Journal ArticleDOI
TL;DR: The results demonstrate that the likelihood-based parametric analyses for the cumulative incidence function are a practically useful alternative to the semiparametric analyses.
Abstract: We propose parametric regression analysis of cumulative incidence function with competing risks data. A simple form of Gompertz distribution is used for the improper baseline subdistribution of the event of interest. Maximum likelihood inferences on regression parameters and associated cumulative incidence function are developed for parametric models, including a flexible generalized odds rate model. Estimation of the long-term proportion of patients with cause-specific events is straightforward in the parametric setting. Simple goodness-of-fit tests are discussed for evaluating a fixed odds rate assumption. The parametric regression methods are compared with an existing semiparametric regression analysis on a breast cancer data set where the cumulative incidence of recurrence is of interest. The results demonstrate that the likelihood-based parametric analyses for the cumulative incidence function are a practically useful alternative to the semiparametric analyses.
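In one standard parameterization (assumed here), the improper Gompertz subdistribution used as the baseline has the form below; with a negative shape parameter the cumulative incidence plateaus below 1, and the plateau is the long-term proportion of patients experiencing the event of interest.

```latex
\[
F(t) = 1 - \exp\!\left\{\frac{\eta}{\alpha}\bigl(1 - e^{\alpha t}\bigr)\right\}, \qquad
\alpha < 0 \;\Rightarrow\; F(\infty) = 1 - e^{\eta/\alpha} < 1 .
\]
```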

Journal ArticleDOI
TL;DR: This work proposes the outlier robust t-statistic (ORT), which is intuitively motivated from the t-statistic (the most commonly used differential gene expression detection method), for detecting cancer genes that are over- or down-expressed in some but not all samples in a disease group.
Abstract: SUMMARY We study statistical methods to detect cancer genes that are over- or down-expressed in some but not all samples in a disease group. This has proven useful in cancer studies where oncogenes are activated only in a small subset of samples. We propose the outlier robust t-statistic (ORT), which is intuitively motivated from the t-statistic, the most commonly used differential gene expression detection method. Using real and simulation studies, we compare the ORT to the recently proposed cancer outlier profile analysis (Tomlins and others, 2005) and the outlier sum statistic of Tibshirani and Hastie (2006). The proposed method often has more detection power and smaller false discovery rates. Supplementary information can be found at http://www.biostat.umn.edu/∼baolin/research/ort.html.

Journal ArticleDOI
TL;DR: This work develops and evaluates a gradient descent boosting procedure for nonparametric pathway-based regression (NPR) analysis to efficiently integrate genomic data and metadata, and observes that incorporating the pathway information achieves better prediction of cancer recurrence.
Abstract: SUMMARY High-throughput genomic data provide an opportunity for identifying pathways and genes that are related to various clinical phenotypes. Besides these genomic data, another valuable source of data is the biological knowledge about genes and pathways that might be related to the phenotypes of many complex diseases. Databases of such knowledge are often called the metadata. In microarray data analysis, such metadata are currently explored in post hoc ways by gene set enrichment analysis but have hardly been utilized in the modeling step. We propose and evaluate a pathway-based gradient descent boosting procedure for nonparametric pathway-based regression (NPR) analysis to efficiently integrate genomic data and metadata. Such NPR models consider multiple pathways simultaneously, allow complex interactions among genes within the pathways, and can be applied to identify pathways and genes that are related to variations of the phenotypes. These methods also provide a way of mitigating the problem of a large number of potential interactions by limiting analysis to biologically plausible interactions between genes in related pathways. Our simulation studies indicate that the proposed boosting procedure can indeed identify relevant pathways. Application to a gene expression data set on breast cancer distant metastasis indicated that the Wnt, apoptosis, and cell cycle-regulated pathways are more likely related to the risk of distant metastasis among lymph-node-negative breast cancer patients. Results from analysis of two other breast cancer gene expression data sets indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance, are important to breast cancer relapse and survival. We also observed that by incorporating the pathway information, we achieved better prediction for cancer recurrence.

Journal ArticleDOI
TL;DR: A fitting procedure is proposed that incorporates monotonicity assumptions on one or more smooth components within a GAM framework, uses a monotonicity restriction on the B-spline coefficients, and provides componentwise selection of smooth components.
Abstract: In many studies where it is known that one or more of the covariates have a monotonic effect on the response variable, common fitting methods for generalized additive models (GAM) may be affected by a sparse design and often generate implausible results. A fitting procedure is proposed that incorporates the monotonicity assumptions on one or more smooth components within a GAM framework. The flexible likelihood-based boosting algorithm uses the monotonicity restriction for B-spline coefficients and provides componentwise selection of smooth components. Stopping criteria and approximate pointwise confidence bands are derived. The method is applied to data from a study conducted in the metropolitan area of São Paulo, Brazil, where the influence of several air pollutants like SO2 on respiratory mortality of children is investigated.
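The monotonicity device can be stated generically as follows (notation assumed): for a smooth component expanded in B-splines, nondecreasing coefficients are a sufficient condition for a nondecreasing fitted component, so the boosting updates are restricted to respect this ordering.

```latex
\[
f(x) = \sum_{j=1}^{m} \beta_j B_j(x), \qquad
\beta_1 \le \beta_2 \le \cdots \le \beta_m \;\Rightarrow\; f \text{ is nondecreasing.}
\]
```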

Journal ArticleDOI
TL;DR: This paper proposes a new method, Peptide Element Alignment (PETAL), that uses raw spectrum data and detected peaks to simultaneously align features from multiple LC-MS experiments and enjoys greater flexibility than time warping methods.
Abstract: SUMMARY Integrated liquid-chromatography mass-spectrometry (LC-MS) is becoming a widely used approach for quantifying the protein composition of complex samples. The output of the LC-MS system measures the intensity of a peptide with a specific mass-charge ratio and retention time. In the last few years, this technology has been used to compare complex biological samples across multiple conditions. One challenge for comparative proteomic profiling with LC-MS is to match corresponding peptide features from different experiments. In this paper, we propose a new method—Peptide Element Alignment (PETAL) that uses raw spectrum data and detected peaks to simultaneously align features from multiple LC-MS experiments. PETAL creates spectrum elements, each of which represents the mass spectrum of a single peptide in a single scan. Peptides detected in different LC-MS data are aligned if they can be represented by the same elements. By considering each peptide separately, PETAL enjoys greater flexibility than time warping methods. While most existing methods process multiple data sets by sequentially aligning each data set to an arbitrarily chosen template data set, PETAL treats all experiments symmetrically and can analyze all experiments simultaneously. We illustrate the performance of PETAL on example data sets.

Journal ArticleDOI
TL;DR: This work investigates the effect of vision loss on emotional distress in a population with a high mortality rate among those who would live both with and without vision loss, and introduces a set of scientifically driven assumptions to identify the causal effect.
Abstract: Evaluation of the causal effect of a baseline exposure on a morbidity outcome at a fixed time point is often complicated when study participants die before morbidity outcomes are measured. In this setting, the causal effect is only well defined for the principal stratum of subjects who would live regardless of the exposure. Motivated by gerontologic researchers interested in understanding the causal effect of vision loss on emotional distress in a population with a high mortality rate, we investigate the effect among those who would live both with and without vision loss. Since this subpopulation is not readily identifiable from the data and vision loss is not randomized, we introduce a set of scientifically driven assumptions to identify the causal effect. Since these assumptions are not empirically verifiable, we embed our methodology within a sensitivity analysis framework. We apply our method using the first three rounds of survey data from the Salisbury Eye Evaluation, a population-based cohort study of older adults. We also present a simulation study that validates our method.

Journal ArticleDOI
TL;DR: A stochastic epidemic model developed to infer transmission rates of asymptomatic communicable pathogens within a hospital ward is described and reversible jump Markov chain Monte Carlo methods are used within a Bayesian framework to make inferences about the colonization rates and unknown colonization times.
Abstract: Summary This paper describes a stochastic epidemic model developed to infer transmission rates of asymptomatic communicable pathogens within a hospital ward. Inference is complicated by partial observation of the epidemic process and dependencies within the data. The epidemic process of nosocomial communicable pathogens can be partially observed by routine swab tests.

Journal ArticleDOI
TL;DR: In this paper, the authors developed likelihood-based methods for panel data from a semi-Markov process, where transition intensities depend on the duration of time in the current state, and applied the results to model the natural history of oncogenic genital human papillomavirus infections in women.
Abstract: SUMMARY Continuous-time, multistate processes can be used to represent a variety of biological processes in the public health sciences; yet the analysis of such processes is complex when they are observed only at a limited number of time points. Inference methods for such panel data have been developed for time-homogeneous Markov models, but there has been little research done for other classes of processes. We develop likelihood-based methods for panel data from a semi-Markov process, where transition intensities depend on the duration of time in the current state. The proposed methods account for possible misclassification of states. To illustrate the methods, we investigate a three-state and a four-state model in detail and apply the results to model the natural history of oncogenic genital human papillomavirus infections in women.
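The distinguishing assumption can be written generically as follows (notation assumed): in a time-homogeneous Markov model the r-to-s transition intensity is constant, whereas in the semi-Markov model it depends on the duration u already spent in the current state, for example through a Weibull-type form.

```latex
\[
\lambda_{rs}^{\text{Markov}}(u) = \lambda_{rs}, \qquad
\lambda_{rs}^{\text{semi-Markov}}(u) = \lambda_{rs}\,\kappa_{rs}\,u^{\kappa_{rs}-1}.
\]
```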

Journal ArticleDOI
TL;DR: It is illustrated that it is possible to identify a substantial biologically relevant subset of the human genome within which hybridization results are reliable and to develop simple expression measures that allow comparison across platforms, studies, laboratories and populations.
Abstract: Investigations of transcript levels on a genomic scale using hybridization-based arrays have led to formidable advances in our understanding of the biology of many human illnesses. At the same time, these investigations have generated controversy because of the probabilistic nature of the conclusions and the surfacing of noticeable discrepancies between the results of studies addressing the same biological question. In this article, we present simple and effective data analysis and visualization tools for gauging the degree to which the findings of one study are reproduced by others and for integrating multiple studies in a single analysis. We describe these approaches in the context of studies of breast cancer and illustrate that it is possible to identify a substantial biologically relevant subset of the human genome within which hybridization results are reliable. The subset generally varies with the platforms used, the tissues studied, and the populations being sampled. Despite important differences, it is also possible to develop simple expression measures that allow comparison across platforms, studies, laboratories and populations. Important biological signals are often preserved or enhanced. Cross-study validation and combination of microarray results requires careful, but not overly complex, statistical thinking and can become a routine component of genomic analysis.

Journal ArticleDOI
TL;DR: A probe-level allele-specific quantitation procedure to determine copy number contributions from each of the parental chromosomes in cancer cells from single-nucleotide polymorphism (SNP) microarray data is described, based upon a generalized linear model that takes advantage of a novel classification of probes on the array.
Abstract: SUMMARY Human cancer is largely driven by the acquisition of mutations. One class of such mutations is copy number polymorphisms, comprised of deviations from the normal diploid two copies of each autosomal chromosome per cell. We describe a probe-level allele-specific quantitation (PLASQ) procedure to determine copy number contributions from each of the parental chromosomes in cancer cells from single-nucleotide polymorphism (SNP) microarray data. Our approach is based upon a generalized linear model that takes advantage of a novel classification of probes on the array. As a result of this classification, we are able to fit the model to the data using an expectation-maximization algorithm designed for the purpose. We demonstrate a strong model fit to data from a variety of cell types. In normal diploid samples, PLASQ is able to genotype with very high accuracy. Moreover, we are able to provide a generalized genotype in cancer samples (e.g. CCCCT at an amplified SNP). Our approach is illustrated on a variety of lung cancer cell lines and tumors, and a number of events are validated by independent computational and experimental means. An R software package containing the methods is freely available.

Journal ArticleDOI
TL;DR: A method is provided for simultaneous likelihood inference on the 2 models that allows for nonignorable missing data, and the approach is illustrated with a recent AIDS study by jointly modeling HIV viral dynamics and time to viral rebound.
Abstract: In many longitudinal studies, the individual characteristics associated with the repeated measures may be possible covariates of the time to an event of interest, and thus, it is desirable to model the time-to-event process and the longitudinal process jointly. Statistical analyses may be further complicated in such studies with missing data such as informative dropouts. This article considers a nonlinear mixed-effects model for the longitudinal process and the Cox proportional hazards model for the time-to-event process. We provide a method for simultaneous likelihood inference on the 2 models and allow for nonignorable missing data. The approach is illustrated with a recent AIDS study by jointly modeling HIV viral dynamics and time to viral rebound.
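A generic joint-model structure of this kind is shown below (notation assumed, not the authors' exact specification): a nonlinear mixed-effects trajectory is linked, through its subject-specific value, to a proportional hazards model for the event time.

```latex
\begin{align*}
y_{ij} &= g(t_{ij}, \beta, b_i) + e_{ij}, \qquad b_i \sim N(0, D),\\
\lambda_i(t) &= \lambda_0(t) \exp\{\gamma^{\top} z_i + \alpha\, g(t, \beta, b_i)\}.
\end{align*}
```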

Journal ArticleDOI
TL;DR: A structural equation framework to assess source-specific health effects using speciated elemental data and suggests that the proposed informative Bayesian SEM is effective in eliminating the bias incurred by the 2 existing approaches, even when the number of exposures is limited.
Abstract: A primary objective of current air pollution research is the assessment of health effects related to specific sources of air particles or particulate matter (PM). Quantifying source-specific risk is a challenge because most PM health studies do not directly observe the contributions of the pollution sources themselves. Instead, given knowledge of the chemical characteristics of known sources, investigators infer pollution source contributions via a source apportionment or multivariate receptor analysis applied to a large number of observed elemental concentrations. Although source apportionment methods are well established for exposure assessment, little work has been done to evaluate the appropriateness of characterizing unobservable sources thus in health effects analyses. In this article, we propose a structural equation framework to assess source-specific health effects using speciated elemental data. This approach corresponds to fitting a receptor model and the health outcome model jointly, such that inferences on the health effects account for the fact that uncertainty is associated with the source contributions. Since the structural equation model (SEM) typically involves a large number of parameters, for small-sample settings, we propose a fully Bayesian estimation approach that leverages historical exposure data from previous related exposure studies. We compare via simulation the performance of our approach in estimating source-specific health effects to that of 2 existing approaches, a tracer approach and a 2-stage approach. Simulation results suggest that the proposed informative Bayesian SEM is effective in eliminating the bias incurred by the 2 existing approaches, even when the number of exposures is limited. We employ the proposed methods in the analysis of a concentrator study investigating the association between ST-segment, a cardiovascular outcome, and major sources of Boston PM and discuss the implications of our findings with respect to the design of future PM concentrator studies.
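The essential two-part structure of such a model can be written as follows (generic notation assumed): observed elemental concentrations load on latent source contributions, and the health outcome is regressed on those same latent contributions, so the uncertainty in the source apportionment propagates into the health-effect estimates.

```latex
\begin{align*}
x_t &= \Lambda f_t + \epsilon_t  &&\text{(receptor / measurement model)},\\
E(y_t) &= h^{-1}\!\bigl(\alpha + \beta^{\top} f_t\bigr) &&\text{(health outcome model)}.
\end{align*}
```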

Journal ArticleDOI
TL;DR: This work proposes a smoothed PR estimator which maximizes a smooth approximation of the PR objective function and uses the weighted bootstrap, which is more stable than the usual sandwich technique with smoothing parameters, for estimating the standard error.
Abstract: SUMMARY The nonparametric transformation model makes no parametric assumptions on the forms of the transformation function and the error distribution. This model is appealing in its flexibility for modeling censored survival data. Current approaches for estimation of the regression parameters involve maximizing discontinuous objective functions, which are numerically infeasible to implement with multiple covariates. Based on the partial rank (PR) estimator (Khan and Tamer, 2004), we propose a smoothed PR estimator which maximizes a smooth approximation of the PR objective function. The estimator is shown to be asymptotically equivalent to the PR estimator but is much easier to compute when there are multiple covariates. We further propose using the weighted bootstrap, which is more stable than the usual sandwich technique with smoothing parameters, for estimating the standard error. The estimator is evaluated via simulation studies and illustrated with the Veterans Administration lung cancer data set.
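The smoothing device the abstract refers to amounts to replacing the indicator functions in the rank-based objective with a smooth sigmoid (our illustration; h is a bandwidth or smoothing parameter), which makes the criterion differentiable in the regression parameters.

```latex
\[
I(u \ge 0) \;\approx\; s_h(u) = \frac{1}{1 + e^{-u/h}}, \qquad h \downarrow 0 .
\]
```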