
Showing papers in "Biometrics in 2013"


Journal ArticleDOI
TL;DR: It is shown that MAXENT is equivalent to a Poisson regression model and hence is related to a Poisson point process model, differing only in the intercept term, which is scale-dependent in MAXENT.
Abstract: Summary Modeling the spatial distribution of a species is a fundamental problem in ecology. A number of modeling methods have been developed, an extremely popular one being MAXENT, a maximum entropy modeling approach. In this article, we show that MAXENT is equivalent to a Poisson regression model and hence is related to a Poisson point process model, differing only in the intercept term, which is scale-dependent in MAXENT. We illustrate a number of improvements to MAXENT that follow from these relations. In particular, a point process model approach facilitates methods for choosing the appropriate spatial resolution, assessing model adequacy, and choosing the LASSO penalty parameter, all currently unavailable to MAXENT. The equivalence result represents a significant step in the unification of the species distribution modeling literature.

390 citations
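
As a hedged illustration of the Poisson point process view described in the abstract above (not the authors' implementation), the sketch below fits a Poisson log-linear model to gridded presence counts with a log-area offset; the slope plays the role of a MAXENT feature weight while the intercept absorbs the scale of the grid. The covariate, grid, and effect sizes are illustrative assumptions.

```python
# Sketch: species intensity as a log-linear Poisson model on a discretized region.
# Hypothetical covariate and data; illustrates the Poisson-regression view of MAXENT,
# not the authors' software.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Discretize a 1-D study region into cells with one environmental covariate each.
n_cells = 200
cell_area = 0.05                      # area of each grid cell
x = rng.uniform(-2, 2, n_cells)       # environmental covariate per cell

# True intensity lambda(s) = exp(b0 + b1 * x); expected count = lambda * area.
b0_true, b1_true = 1.0, 0.8
counts = rng.poisson(np.exp(b0_true + b1_true * x) * cell_area)

def neg_loglik(beta):
    # Poisson log-likelihood with a log(area) offset (constant terms dropped).
    eta = beta[0] + beta[1] * x + np.log(cell_area)
    return np.sum(np.exp(eta) - counts * eta)

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print("estimated (intercept, slope):", fit.x)
# The slope estimate is comparable across grid resolutions; the intercept is not,
# mirroring the scale-dependence of the MAXENT intercept noted in the abstract.
```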


Journal ArticleDOI
TL;DR: This article proposes a method for obtaining correct curve estimates by accounting for uncertainty in FPC decompositions, and applies this method to sparse observations of CD4 cell counts and to dense white-matter tract profiles.
Abstract: Functional principal components (FPC) analysis is widely used to decompose and express functional observations. Curve estimates implicitly condition on basis functions and other quantities derived from FPC decompositions; however these objects are unknown in practice. In this article, we propose a method for obtaining correct curve estimates by accounting for uncertainty in FPC decompositions. Additionally, pointwise and simultaneous confidence intervals that account for both model- and decomposition-based variability are constructed. Standard mixed model representations of functional expansions are used to construct curve estimates and variances conditional on a specific decomposition. Iterated expectation and variance formulas combine model-based conditional estimates across the distribution of decompositions. A bootstrap procedure is implemented to understand the uncertainty in principal component decomposition quantities. Our method compares favorably to competing approaches in simulation studies that include both densely and sparsely observed functions. We apply our method to sparse observations of CD4 cell counts and to dense white-matter tract profiles. Code for the analyses and simulations is publicly available, and our method is implemented in the R package refund on CRAN.

130 citations
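
A minimal sketch, on simulated curves, of the general idea of propagating FPC-decomposition uncertainty by re-estimating the decomposition on bootstrap resamples and combining the resulting curve reconstructions; it is not the refund implementation, and the data, basis, and number of components are illustrative assumptions.

```python
# Sketch: propagate uncertainty in a functional PCA decomposition by bootstrapping
# the decomposition and summarizing reconstructions across bootstrap draws.
# Simulated data; a simplified stand-in for the approach in the abstract.
import numpy as np

rng = np.random.default_rng(2)
n, T, K = 100, 50, 2                      # curves, grid points, retained FPCs
t = np.linspace(0, 1, T)

# Simulate curves from two smooth components plus noise.
scores = rng.normal(size=(n, 2)) * [2.0, 1.0]
phi = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
Y = scores @ phi + rng.normal(scale=0.3, size=(n, T))

def fpca_reconstruct(sample, target, K):
    """Estimate mean/FPCs from `sample`, reconstruct `target` with K components."""
    mu = sample.mean(axis=0)
    C = np.cov(sample - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    basis = vecs[:, ::-1][:, :K]          # leading K eigenvectors
    xi = (target - mu) @ basis            # scores of the target curve
    return mu + basis @ xi

target = Y[0]
B = 200
recons = np.empty((B, T))
for b in range(B):
    boot = Y[rng.integers(0, n, n)]       # resample curves to perturb the decomposition
    recons[b] = fpca_reconstruct(boot, target, K)

# Pointwise summary of decomposition-based variability in the curve estimate.
est = recons.mean(axis=0)
band = 1.96 * recons.std(axis=0)
print("max half-width of the decomposition-uncertainty band:", band.max().round(3))
```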


Journal ArticleDOI
TL;DR: This work proposes to use an additive logistic normal multinomial regression model to associate the covariates to bacterial composition and develops a Monte Carlo expectation‐maximization algorithm to implement the penalized likelihood estimation.
Abstract: Changes in human microbiome are associated with many human diseases. Next generation sequencing technologies make it possible to quantify the microbial composition without the need for laboratory cultivation. One important problem of microbiome data analysis is to identify the environmental/biological covariates that are associated with different bacterial taxa. Taxa count data in microbiome studies are often over-dispersed and include many zeros. To account for such an over-dispersion, we propose to use an additive logistic normal multinomial regression model to associate the covariates to bacterial composition. The model can naturally account for sampling variabilities and zero observations and also allow for a flexible covariance structure among the bacterial taxa. In order to select the relevant covariates and to estimate the corresponding regression coefficients, we propose a group l1 penalized likelihood estimation method for variable selection and estimation. We develop a Monte Carlo expectation-maximization algorithm to implement the penalized likelihood estimation. Our simulation results show that the proposed method outperforms the group l1 penalized multinomial logistic regression and the Dirichlet multinomial regression models in variable selection. We demonstrate the methods using a data set that links human gut microbiome to micro-nutrients in order to identify the nutrients that are associated with the human gut microbiome enterotype.

126 citations


Journal ArticleDOI
TL;DR: The methodology for giving the probability of recurrence for a new patient, as implemented on a web-based calculator, uses a joint longitudinal survival model together with the longitudinal PSA measures from that patient.
Abstract: Patients who were previously treated for prostate cancer with radiation therapy are monitored at regular intervals using a laboratory test called Prostate Specific Antigen (PSA). If the value of the PSA test starts to rise, this is an indication that the prostate cancer is more likely to recur, and the patient may wish to initiate new treatments. Such patients could be helped in making medical decisions by an accurate estimate of the probability of recurrence of the cancer in the next few years. In this article, we describe the methodology for giving the probability of recurrence for a new patient, as implemented on a web-based calculator. The methods use a joint longitudinal survival model. The model is developed on a training dataset of 2386 patients and tested on a dataset of 846 patients. Bayesian estimation methods are used with one Markov chain Monte Carlo (MCMC) algorithm developed for estimation of the parameters from the training dataset and a second quick MCMC developed for prediction of the risk of recurrence that uses the longitudinal PSA measures from a new patient.

118 citations


Journal ArticleDOI
TL;DR: A rigorous assessment of Bayesian propensity score estimation is provided to show that model feedback can produce poor estimates of causal effects absent strategies that augment propensity score adjustment with adjustment for individual covariates.
Abstract: Methods based on the propensity score comprise one set of valuable tools for comparative effectiveness research and for estimating causal effects more generally. These methods typically consist of two distinct stages: (1) a propensity score stage where a model is fit to predict the propensity to receive treatment (the propensity score), and (2) an outcome stage where responses are compared in treated and untreated units having similar values of the estimated propensity score. Traditional techniques conduct estimation in these two stages separately; estimates from the first stage are treated as fixed and known for use in the second stage. Bayesian methods have natural appeal in these settings because separate likelihoods for the two stages can be combined into a single joint likelihood, with estimation of the two stages carried out simultaneously. One key feature of joint estimation in this context is "feedback" between the outcome stage and the propensity score stage, meaning that quantities in a model for the outcome contribute information to posterior distributions of quantities in the model for the propensity score. We provide a rigorous assessment of Bayesian propensity score estimation to show that model feedback can produce poor estimates of causal effects absent strategies that augment propensity score adjustment with adjustment for individual covariates. We illustrate this phenomenon with a simulation study and with a comparative effectiveness investigation of carotid artery stenting versus carotid endarterectomy among 123,286 Medicare beneficiaries hospitalized for stroke in 2006 and 2007.

107 citations
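
A hedged sketch of the conventional two-stage ("cut") propensity-score analysis that serves as the benchmark in the abstract above, with the outcome stage additionally adjusting for individual covariates, the strategy the authors find necessary to guard against feedback-induced bias. The data-generating model is an illustrative assumption, and the joint Bayesian model itself is not reproduced.

```python
# Sketch: two-stage propensity-score estimation with additional covariate adjustment
# in the outcome model; a frequentist stand-in for the strategies discussed above.
# Simulated data with hypothetical effect sizes.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=(n, 2))                         # measured confounders
lin = 0.8 * X[:, 0] - 0.5 * X[:, 1]
A = rng.binomial(1, 1 / (1 + np.exp(-lin)))         # treatment depends on X
Y = 1.0 * A + 1.5 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(size=n)   # true effect = 1.0

# Stage 1: propensity score by maximum-likelihood logistic regression.
Xd = np.column_stack([np.ones(n), X])
def nll(b):
    eta = Xd @ b
    return np.sum(np.logaddexp(0.0, eta) - A * eta)
bhat = minimize(nll, np.zeros(Xd.shape[1]), method="BFGS").x
ps = 1 / (1 + np.exp(-(Xd @ bhat)))                 # estimated propensity scores

# Stage 2: outcome model with the estimated score *and* the individual covariates.
Z = np.column_stack([np.ones(n), A, ps, X])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print("effect estimate (propensity + covariate adjustment):", coef[1].round(3))

# Score-only adjustment, for comparison; the abstract's point is that joint Bayesian
# estimation with feedback can perform poorly without the covariate augmentation.
Z0 = np.column_stack([np.ones(n), A, ps])
coef0, *_ = np.linalg.lstsq(Z0, Y, rcond=None)
print("effect estimate (propensity adjustment only):", coef0[1].round(3))
```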


Journal ArticleDOI
TL;DR: This work considers the m-out-of-n bootstrap for constructing confidence intervals for the parameters indexing the optimal dynamic regime and proposes an adaptive choice of m and shows that it produces asymptotically correct confidence sets under fixed alternatives.
Abstract: Summary A dynamic treatment regime consists of a set of decision rules that dictate how to individualize treatment to patients based on available treatment and covariate history. A common method for estimating an optimal dynamic treatment regime from data is Q-learning, which involves nonsmooth operations of the data. This nonsmoothness causes standard asymptotic approaches for inference like the bootstrap or Taylor series arguments to break down if applied without correction. Here, we consider the m-out-of-n bootstrap for constructing confidence intervals for the parameters indexing the optimal dynamic regime. We propose an adaptive choice of m and show that it produces asymptotically correct confidence sets under fixed alternatives. Furthermore, the proposed method has the advantage of being conceptually and computationally much simpler than competing methods possessing this same theoretical property. We provide an extensive simulation study to compare the proposed method with currently available inference procedures. The results suggest that the proposed method delivers nominal coverage while being less conservative than alternatives. The proposed methods are implemented in the qLearn R-package and have been made available on the Comprehensive R-Archive Network (http://cran.r-project.org/). Analysis of the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study is used as an illustrative example.

104 citations
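
A minimal sketch of the m-out-of-n bootstrap idea for a nonsmooth functional, here theta = max(0, E[Y]) as a toy stand-in for the nonsmooth maximization in Q-learning. The fixed choice m = n^0.8 is an illustrative assumption, not the authors' adaptive rule, and this is not the qLearn implementation.

```python
# Sketch: m-out-of-n bootstrap confidence interval for a nonsmooth functional,
# theta = max(0, E[Y]); a toy stand-in for the nonsmooth maximization in Q-learning.
import numpy as np

rng = np.random.default_rng(4)
n = 500
y = rng.normal(loc=0.0, scale=1.0, size=n)     # true theta = max(0, 0): a nonregular point

theta_hat = max(0.0, y.mean())
m = int(n ** 0.8)                              # subsample size m grows slower than n (illustrative)
B = 2000
roots = np.empty(B)
for b in range(B):
    yb = y[rng.integers(0, n, m)]              # resample m out of n with replacement
    roots[b] = np.sqrt(m) * (max(0.0, yb.mean()) - theta_hat)

lo_q, hi_q = np.quantile(roots, [0.025, 0.975])
ci = (theta_hat - hi_q / np.sqrt(n), theta_hat - lo_q / np.sqrt(n))
print("estimate:", round(theta_hat, 3), " 95% m-out-of-n bootstrap CI:", np.round(ci, 3))
```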


Journal ArticleDOI
TL;DR: A Bayesian technique to perform a sparse joint selection of significant predictor variables and significant inverse covariance matrix elements of the response variables in a high‐dimensional linear Gaussian sparse seemingly unrelated regression (SSUR) setting and perform an association analysis between the high-dimensional sets of predictors and responses in such a setting is described.
Abstract: Summary We describe a Bayesian technique to (a) perform a sparse joint selection of significant predictor variables and significant inverse covariance matrix elements of the response variables in a high-dimensional linear Gaussian sparse seemingly unrelated regression (SSUR) setting and (b) perform an association analysis between the high-dimensional sets of predictors and responses in such a setting. To search the high-dimensional model space, where both the number of predictors and the number of possibly correlated responses can be larger than the sample size, we demonstrate that a marginalization-based collapsed Gibbs sampler, in combination with spike and slab type of priors, offers a computationally feasible and efficient solution. As an example, we apply our method to an expression quantitative trait loci (eQTL) analysis on publicly available single nucleotide polymorphism (SNP) and gene expression data for humans where the primary interest lies in finding the significant associations between the sets of SNPs and possibly correlated genetic transcripts. Our method also allows for inference on the sparse interaction network of the transcripts (response variables) after accounting for the effect of the SNPs (predictor variables). We exploit properties of Gaussian graphical models to make statements concerning conditional independence of the responses. Our method compares favorably to existing Bayesian approaches developed for this purpose.

98 citations


Journal ArticleDOI
TL;DR: It is shown that for the surrogate paradox to be manifest it must be the case that either there is a direct effect of treatment on the outcome not through the surrogate and in the opposite direction as that through the surrogate, or confounding for the effect of the surrogate on the outcome, or a lack of transitivity so that treatment does not positively affect the surrogate for all the same individuals for whom the surrogate positively affects the outcome.
Abstract: Surrogates which allow one to predict the effect of the treatment on the outcome of interest from the effect of the treatment on the surrogate are of importance when it is difficult or expensive to measure the primary outcome. Unfortunately, the use of such surrogates can give rise to paradoxical situations in which the effect of the treatment on the surrogate is positive, the surrogate and outcome are strongly positively correlated, but the effect of the treatment on the outcome is negative, a phenomenon sometimes referred to as the "surrogate paradox." New results are given for consistent surrogates that extend the existing literature on sufficient conditions that ensure the surrogate paradox is not manifest. Specifically, it is shown that for the surrogate paradox to be manifest it must be the case that either there is (i) a direct effect of treatment on the outcome not through the surrogate and in the opposite direction as that through the surrogate or (ii) confounding for the effect of the surrogate on the outcome, or (iii) a lack of transitivity so that treatment does not positively affect the surrogate for all the same individuals for whom the surrogate positively affects the outcome. The conditions for consistent surrogates and the results of the article are important because they allow investigators to predict the direction of the effect of the treatment on the outcome simply from the direction of the effect of the treatment on the surrogate. These results on consistent surrogates are then related to the four approaches to surrogate outcomes described by Joffe and Greene (2009, Biometrics 65, 530-538) to assess whether the standard criteria used by these approaches to assess whether a surrogate is "good" suffice to avoid the surrogate paradox.

96 citations
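
A small simulation, with hypothetical effect sizes, illustrating condition (i) in the abstract above: a direct effect of treatment on the outcome that opposes the pathway through the surrogate can produce the surrogate paradox.

```python
# Sketch: the surrogate paradox via a direct treatment effect opposing the surrogate
# pathway (condition (i) in the abstract). Hypothetical effect sizes.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
A = rng.binomial(1, 0.5, n)                    # randomized treatment
S = 1.0 * A + rng.normal(size=n)               # treatment raises the surrogate
Y = 2.0 * S - 4.0 * A + rng.normal(size=n)     # surrogate helps, but a direct harm dominates

print("effect of A on S:", (S[A == 1].mean() - S[A == 0].mean()).round(2))   # positive
print("corr(S, Y):      ", np.corrcoef(S, Y)[0, 1].round(2))                 # positive
print("effect of A on Y:", (Y[A == 1].mean() - Y[A == 0].mean()).round(2))   # negative
```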


Journal ArticleDOI
TL;DR: In simulations, the parameter estimators with the proposed GEE method for a marginal cumulative probit model appear to be less biased and more efficient than those with the independence "working" model, especially for studies having time-varying covariates and strong correlation.
Abstract: In this article, we propose a generalized estimating equations (GEE) approach for correlated ordinal or nominal multinomial responses using a local odds ratios parameterization. Our motivation lies upon observing that: (i) modeling the dependence between correlated multinomial responses via the local odds ratios is meaningful both for ordinal and nominal response scales and (ii) ordinary GEE methods might not ensure the joint existence of the estimates of the marginal regression parameters and of the dependence structure. To avoid (ii), we treat the so-called "working" association vector α as a "nuisance" parameter vector that defines the local odds ratios structure at the marginalized contingency tables after tabulating the responses without a covariate adjustment at each time pair. To estimate α and simultaneously approximate adequately possible underlying dependence structures, we employ the family of association models proposed by Goodman. In simulations, the parameter estimators with the proposed GEE method for a marginal cumulative probit model appear to be less biased and more efficient than those with the independence "working" model, especially for studies having time-varying covariates and strong correlation.

92 citations


Journal ArticleDOI
TL;DR: A new spatially balanced design is presented that can be used to select a sample from discrete and continuous populations in multi-dimensional space and utilizes the Halton sequence to assure spatial diversity of selected locations.
Abstract: To design an efficient survey or monitoring program for a natural resource it is important to consider the spatial distribution of the resource. Generally, sample designs that are spatially balanced are more efficient than designs which are not. A spatially balanced design selects a sample that is evenly distributed over the extent of the resource. In this article we present a new spatially balanced design that can be used to select a sample from discrete and continuous populations in multi-dimensional space. The design, which we call balanced acceptance sampling (BAS), utilizes the Halton sequence to assure spatial diversity of selected locations. Targeted inclusion probabilities are achieved by acceptance sampling. The BAS design is conceptually simpler than competing spatially balanced designs, executes faster, and achieves better spatial balance as measured by a number of quantities. The algorithm has been programmed in an R package freely available for download.

69 citations
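
A hedged sketch of the Halton-sequence idea behind balanced acceptance sampling: stream 2-D Halton points, keep those that fall in the resource's extent, and apply acceptance sampling for unequal inclusion probabilities. The study region and the inclusion-probability surface are illustrative assumptions; this is not the authors' R package.

```python
# Sketch: spatially balanced sampling via the Halton sequence with acceptance sampling
# for unequal inclusion probabilities. Region and probability surface are illustrative.
import numpy as np

def halton(index, base):
    """Radical-inverse (van der Corput) value of `index` in the given base."""
    f, r = 1.0, 0.0
    while index > 0:
        f /= base
        r += f * (index % base)
        index //= base
    return r

def in_region(x, y):
    # Study region: unit disc of radius 0.5 centred at (0.5, 0.5).
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25

def target_prob(x, y):
    # Relative inclusion probability, higher toward the east; values in (0, 1].
    return 0.2 + 0.8 * x

rng = np.random.default_rng(6)
sample, k = [], 1
while len(sample) < 50:
    x, y = halton(k, 2), halton(k, 3)          # 2-D Halton point (bases 2 and 3)
    k += 1
    if in_region(x, y) and rng.uniform() < target_prob(x, y):
        sample.append((x, y))                  # accepted: spatially spread draw

print("first five accepted locations:", np.round(sample[:5], 3))
```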


Journal ArticleDOI
TL;DR: This paper takes a semiparametric approach to quantile regression, representing the quantile process as a linear combination of basis functions, and finds that the Bayesian model often gives smaller measures of uncertainty than its competitors, and thus identifies more significant effects.
Abstract: In this paper we propose a semiparametric quantile regression model for censored survival data. Quantile regression permits covariates to affect survival differently at different stages in the follow-up period, thus providing a comprehensive study of the survival distribution. We take a semiparametric approach, representing the quantile process as a linear combination of basis functions. The basis functions are chosen so that the prior for the quantile process is centered on a simple location-scale model, but flexible enough to accommodate a wide range of quantile processes. We show in a simulation study that this approach is competitive with existing methods. The method is illustrated using data from a drug treatment study, where we find that the Bayesian model often gives smaller measures of uncertainty than its competitors, and thus identifies more significant effects.

Journal ArticleDOI
TL;DR: In this paper, a wavelet decomposition of the signal for both fixed and random effects is proposed for high-dimensional curve clustering in the presence of interindividual variability, which is based on wavelet thresholding adapted to multiple curves and using an appropriate structure for the random effect variance.
Abstract: We propose a method for high-dimensional curve clustering in the presence of interindividual variability. Curve clustering has long been studied, especially using splines to account for functional random effects. However, splines are not appropriate when dealing with high-dimensional data and cannot be used to model irregular curves such as peak-like data. Our method is based on a wavelet decomposition of the signal for both fixed and random effects. We propose an efficient dimension reduction step based on wavelet thresholding adapted to multiple curves, and, using an appropriate structure for the random effect variance, we ensure that both fixed and random effects lie in the same functional space even when dealing with irregular functions that belong to Besov spaces. In the wavelet domain our model reduces to a linear mixed-effects model that can be used for a model-based clustering algorithm and for which we develop an EM algorithm for maximum likelihood estimation. The properties of the overall procedure are validated by an extensive simulation study. Then, we illustrate our method on mass spectrometry data and we propose an original application of functional data analysis on microarray comparative genomic hybridization (CGH) data. Our procedure is available through the R package curvclust, which is the first publicly available package that performs curve clustering with random effects in the high-dimensional framework (available on CRAN).

Journal ArticleDOI
TL;DR: This work proposes an approach that calibrates the values of the sensitivity parameters to the observed covariates and is more interpretable to subject matter experts and will illustrate the method using data from the U.S. National Health and Nutrition Examination Survey regarding the relationship between cigarette smoking and blood lead levels.
Abstract: Summary In medical sciences, statistical analyses based on observational studies are common phenomena. One peril of drawing inferences about the effect of a treatment on subjects using observational studies is the lack of randomized assignment of subjects to the treatment. After adjusting for measured pretreatment covariates, perhaps by matching, a sensitivity analysis examines the impact of an unobserved covariate, u, in an observational study. One type of sensitivity analysis uses two sensitivity parameters to measure the degree of departure of an observational study from randomized assignment. One sensitivity parameter relates u to treatment and the other relates u to response. For subject matter experts, it may be difficult to specify plausible ranges of values for the sensitivity parameters on their absolute scales. We propose an approach that calibrates the values of the sensitivity parameters to the observed covariates and is more interpretable to subject matter experts. We will illustrate our method using data from the U.S. National Health and Nutrition Examination Survey regarding the relationship between cigarette smoking and blood lead levels.

Journal ArticleDOI
TL;DR: The research shows that group testing can offer large cost savings when classifying individuals for multiple infections and can provide prevalence estimates that are actually more efficient than those from individual testing.
Abstract: Screening for sexually transmitted diseases has benefited greatly from the use of group testing (pooled testing) to lower costs. With the development of assays that detect multiple infections, screening practices now involve testing pools of individuals for multiple infections simultaneously. Building on the research for single infection group testing procedures, we examine the performance of group testing for multiple infections. Our work is motivated by chlamydia and gonorrhea testing for the Infertility Prevention Project (IPP), a national program in the United States. We consider a two-stage pooling algorithm currently used to perform testing for the IPP. We first derive the operating characteristics of this algorithm for classification purposes (e.g., expected number of tests, misclassification probabilities, etc.) and identify pool sizes that minimize the expected number of tests. We then develop an expectation-maximization algorithm to estimate probabilities of infection using both group and individual retest responses. Our research shows that group testing can offer large cost savings when classifying individuals for multiple infections and can provide prevalence estimates that are actually more efficient than those from individual testing.
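
A back-of-the-envelope sketch of the classification-stage calculation described above: under independent infections and a perfect assay, the expected number of tests per individual for two-stage (Dorfman-type) pooling is 1/n plus the probability that the pool tests positive for at least one infection; minimizing over n gives a pool size. The prevalences are illustrative, and the misclassification probabilities handled by the authors are ignored here.

```python
# Sketch: expected tests per individual for two-stage pooled testing of two infections
# under a perfect assay and independent infections (illustrative prevalences only).
import numpy as np

p = np.array([0.05, 0.02])                       # assumed prevalences of the two infections
q_neg = np.prod(1 - p)                           # P(an individual is free of both infections)

pool_sizes = np.arange(2, 31)
# One pooled test, plus n individual retests if the pool is positive for either infection.
exp_tests_per_person = 1 / pool_sizes + (1 - q_neg ** pool_sizes)

best = pool_sizes[np.argmin(exp_tests_per_person)]
print("pool size minimizing expected tests:", best)
print("expected tests per individual:", exp_tests_per_person.min().round(3),
      "(vs. 1.0 for one individual multiplex test per person)")
```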


Journal ArticleDOI
TL;DR: A generalization of the Kruskal-Wallis test that incorporates group uncertainty when comparing k samples and follows an asymptotic chi-square distribution with k - 1 degrees of freedom under the null hypothesis is proposed.
Abstract: Motivated by genetic association studies of SNPs with genotype uncertainty, we propose a generalization of the Kruskal-Wallis test that incorporates group uncertainty when comparing k samples. The extended test statistic is based on probability-weighted rank-sums and follows an asymptotic chi-square distribution with k - 1 degrees of freedom under the null hypothesis. Simulation studies confirm the validity and robustness of the proposed test in finite samples. Application to a genome-wide association study of type 1 diabetic complications further demonstrates the utilities of this generalized Kruskal-Wallis test for studies with group uncertainty. The method has been implemented as an open-resource R program, GKW.
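
A hedged sketch of a probability-weighted rank-sum statistic under group (genotype) uncertainty. For simplicity the null distribution is obtained by permuting the membership-probability rows rather than by the authors' asymptotic chi-square result with k - 1 degrees of freedom, and the GKW program is not reproduced; the data and membership probabilities are simulated assumptions.

```python
# Sketch: a probability-weighted rank-sum test when group membership is uncertain
# (e.g., imputed genotype probabilities). The statistic mirrors Kruskal-Wallis, but a
# Monte Carlo permutation null replaces the asymptotic chi-square calibration.
import numpy as np

rng = np.random.default_rng(7)
n, k = 300, 3
P = rng.dirichlet(alpha=[6, 3, 1], size=n)        # soft group memberships (rows sum to 1)
group_shift = np.array([0.0, 0.3, 0.6])
y = P @ group_shift + rng.normal(size=n)          # phenotype weakly linked to the groups

def weighted_kw(y, P):
    n = len(y)
    r = np.argsort(np.argsort(y)) + 1.0           # ranks 1..n (no ties in simulated y)
    W = P.T @ r                                   # probability-weighted rank sums
    m = P.sum(axis=0)                             # expected group sizes
    return 12.0 / (n * (n + 1)) * np.sum((W - m * (n + 1) / 2) ** 2 / m)

obs = weighted_kw(y, P)
perm = np.array([weighted_kw(y, P[rng.permutation(n)]) for _ in range(2000)])
print("statistic:", round(obs, 2), " permutation p-value:", (perm >= obs).mean().round(4))
```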

Journal ArticleDOI
TL;DR: The approach combines a commonly used methodology for robust experimental design, based on Markov chain Monte Carlo sampling, with approximate Bayesian computation (ABC) to ensure that no likelihood evaluations are required.
Abstract: In this paper we present a methodology for designing experiments for efficiently estimating the parameters of models with computationally intractable likelihoods. The approach combines a commonly used methodology for robust experimental design, based on Markov chain Monte Carlo sampling, with approximate Bayesian computation (ABC) to ensure that no likelihood evaluations are required. The utility function considered for precise parameter estimation is based upon the precision of the ABC posterior distribution, which we form efficiently via the ABC rejection algorithm based on pre-computed model simulations. Our focus is on stochastic models and, in particular, we investigate the methodology for Markov process models of epidemics and macroparasite population evolution. The macroparasite example involves a multivariate process and we assess the loss of information from not observing all variables.
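
A hedged sketch of the ABC rejection step described above, applied to a simple stochastic chain-binomial epidemic: pre-compute model simulations over prior draws of the infection rate, then keep the parameter values whose simulated summaries fall closest to the observed summary. The model, prior, and distance are illustrative assumptions, and the design-optimization loop is not shown.

```python
# Sketch: ABC rejection for a discrete-time chain-binomial epidemic (illustrative model).
import numpy as np

rng = np.random.default_rng(8)
N, T = 100, 15                                   # population size, days observed

def simulate(beta):
    """Chain-binomial SI epidemic; returns number infected at each day."""
    S, I = N - 1, 1
    path = [I]
    for _ in range(T):
        p_inf = 1 - np.exp(-beta * I / N)        # per-susceptible daily infection probability
        new = rng.binomial(S, p_inf)
        S, I = S - new, I + new
        path.append(I)
    return np.array(path)

beta_true = 0.6
observed = simulate(beta_true)

# Pre-compute simulations over draws from a uniform prior, as in ABC rejection.
M = 20_000
prior_draws = rng.uniform(0.05, 2.0, M)
distance = np.array([np.linalg.norm(simulate(b) - observed) for b in prior_draws])

keep = distance <= np.quantile(distance, 0.01)   # accept the closest 1% of simulations
posterior = prior_draws[keep]
print("ABC posterior mean / 95% interval for beta:",
      posterior.mean().round(2), np.quantile(posterior, [0.025, 0.975]).round(2))
```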

Journal ArticleDOI
TL;DR: It is shown that the simple t-test without using any covariate is conservative under covariate-adaptive biased coin randomization in terms of its Type I error rate, and that a valid test using the bootstrap can be constructed.
Abstract: Some covariate-adaptive randomization methods have been used in clinical trials for a long time, but little theoretical work had been done about testing hypotheses under covariate-adaptive randomization until Shao et al. (2010), who provided a theory with detailed discussion for responses under linear models. In this article, we establish some asymptotic results for covariate-adaptive biased coin randomization under generalized linear models with possibly unknown link functions. We show that the simple t-test without using any covariate is conservative under covariate-adaptive biased coin randomization in terms of its Type I error rate, and that a valid test using the bootstrap can be constructed. This bootstrap test, utilizing covariates in the randomization scheme, is shown to be asymptotically as efficient as Wald's test correctly using covariates in the analysis. Thus, the efficiency loss due to not using covariates in the analysis can be recovered by utilizing covariates in covariate-adaptive biased coin randomization. Our theory is illustrated with the two most popular types of discrete outcomes, binary responses and event counts under the Poisson model, and with exponentially distributed continuous responses. We also show that an alternative simple test without using any covariate under the Poisson model has an inflated Type I error rate under simple randomization, but is valid under covariate-adaptive biased coin randomization. Effects on the validity of tests due to model misspecification are also discussed. Simulation studies about the Type I errors and powers of several tests are presented for both discrete and continuous responses.
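
A hedged sketch of a stratified biased-coin assignment of the kind discussed above: within each covariate stratum, the under-represented arm is favoured with probability 2/3. The stratum definition and coin probability are illustrative choices, and the hypothesis-testing theory in the abstract is not reproduced.

```python
# Sketch: covariate-adaptive biased-coin randomization within covariate strata.
# The 2/3 coin probability and the two binary covariates are illustrative choices.
import numpy as np

rng = np.random.default_rng(9)
n = 400
covariates = rng.binomial(1, 0.5, size=(n, 2))       # two binary baseline covariates

counts = {}                                          # (stratum, arm) -> count so far
assignment = np.empty(n, dtype=int)
for i in range(n):
    stratum = (int(covariates[i, 0]), int(covariates[i, 1]))
    n0 = counts.get((stratum, 0), 0)
    n1 = counts.get((stratum, 1), 0)
    if n0 == n1:
        p_treat = 0.5                                # balanced so far: fair coin
    else:
        p_treat = 2 / 3 if n1 < n0 else 1 / 3        # favour the under-represented arm
    arm = int(rng.uniform() < p_treat)
    assignment[i] = arm
    counts[(stratum, arm)] = counts.get((stratum, arm), 0) + 1

# Within-stratum imbalance stays small relative to simple randomization.
for s in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print("stratum", s, "imbalance:", counts.get((s, 1), 0) - counts.get((s, 0), 0))
```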

Journal ArticleDOI
TL;DR: Estimators for line transect surveys of animals that are stochastically unavailable for detection while within detection range are developed and shown to be more general and more flexible than existing estimators based on parametric models of the availability process.
Abstract: Summary We develop estimators for line transect surveys of animals that are stochastically unavailable for detection while within detection range. The detection process is formulated as a hidden Markov model with a binary state-dependent observation model that depends on both perpendicular and forward distances. This provides a parametric method of dealing with availability bias when estimates of availability process parameters are available even if series of availability events themselves are not. We apply the estimators to an aerial and a shipboard survey of whales, and investigate their properties by simulation. They are shown to be more general and more flexible than existing estimators based on parametric models of the availability process. We also find that methods using availability correction factors can be very biased when surveys are not close to being instantaneous, as can estimators that assume temporal independence in availability when there is temporal dependence.

Journal ArticleDOI
TL;DR: It is demonstrated how recent advances in Gaussian process‐based nonparametric inference for Poisson processes can be extended to BayesianNonparametric estimation of population size dynamics under the coalescent.
Abstract: Summary Changes in population size influence genetic diversity of the population and, as a result, leave a signature of these changes in individual genomes in the population. We are interested in the inverse problem of reconstructing past population dynamics from genomic data. We start with a standard framework based on the coalescent, a stochastic process that generates genealogies connecting randomly sampled individuals from the population of interest. These genealogies serve as a glue between the population demographic history and genomic sequences. It turns out that only the times of genealogical lineage coalescences contain information about population size dynamics. Viewing these coalescent times as a point process, estimating population size trajectories is equivalent to estimating a conditional intensity of this point process. Therefore, our inverse problem is similar to estimating an inhomogeneous Poisson process intensity function. We demonstrate how recent advances in Gaussian process-based nonparametric inference for Poisson processes can be extended to Bayesian nonparametric estimation of population size dynamics under the coalescent. We compare our Gaussian process (GP) approach to one of the state-of-the-art Gaussian Markov random field (GMRF) methods for estimating population trajectories. Using simulated data, we demonstrate that our method has better accuracy and precision. Next, we analyze two genealogies reconstructed from real sequences of hepatitis C and human Influenza A viruses. In both cases, we recover more believed aspects of the viral demographic histories than the GMRF approach. We also find that our GP method produces more reasonable uncertainty estimates than the GMRF method.

Journal ArticleDOI
TL;DR: A pseudo-score type estimator suitable for the augmented design is proposed and used to optimize the sampling of a biomarker in a vaccine efficacy trial for efficiently estimating its surrogate effect, as characterized by the vaccine efficacy curve (a causal effect predictiveness curve) and by the predicted overall vaccine efficacy using the biomarker.
Abstract: In vaccine research, immune biomarkers that can reliably predict a vaccine’s effect on the clinical endpoint (i.e., surrogate markers) are important tools for guiding vaccine development. This paper addresses issues on optimizing two-phase sampling study design for evaluating surrogate markers in a principal surrogate framework, motivated by the design of a future HIV vaccine trial. To address the problem of missing potential outcomes in a standard trial design, novel trial designs have been proposed that utilize baseline predictors of the immune response biomarker(s) and/or augment the trial by vaccinating uninfected placebo recipients at the end of the trial and measuring their immune biomarkers. However, inefficient use of the augmented information can lead to counterintuitive results on the precision of estimation. To remedy this problem, we propose a pseudo-score type estimator suitable for the augmented design and characterize its asymptotic properties. This estimator has superior performance compared with existing estimators and allows calculation of analytical variances useful for guiding study design. Based on the new estimator we investigate in detail the problem of optimizing the sampling scheme of a biomarker in a vaccine efficacy trial for efficiently estimating its surrogate effect, as characterized by the vaccine efficacy curve (a causal effect predictiveness curve) and by the predicted overall vaccine efficacy using the biomarker.

Journal ArticleDOI
TL;DR: A regularized multiple SCCS approach that incorporates potentially thousands or more of time‐varying confounders such as other drugs is proposed, which successfully handles the high dimensionality and can provide a sparse solution via an L1 regularizer.
Abstract: Characterization of relationships between time-varying drug exposures and adverse events (AEs) related to health outcomes represents the primary objective in postmarketing drug safety surveillance. Such surveillance increasingly utilizes large-scale longitudinal observational databases (LODs), containing time-stamped patient-level medical information including periods of drug exposure and dates of diagnoses for millions of patients. Statistical methods for LODs must confront computational challenges related to the scale of the data, and must also address confounding and other biases that can undermine efforts to estimate effect sizes. Methods that compare on-drug with off-drug periods within patient offer specific advantages over between patient analysis on both counts. To accomplish these aims, we extend the self-controlled case series (SCCS) for LODs. SCCS implicitly controls for fixed multiplicative baseline covariates since each individual acts as their own control. In addition, only exposed cases are required for the analysis, which is computationally advantageous. The standard SCCS approach is usually used to assess single drugs and therefore estimates marginal associations between individual drugs and particular AEs. Such analyses ignore confounding drugs and interactions and have the potential to give misleading results. In order to avoid these difficulties, we propose a regularized multiple SCCS approach that incorporates potentially thousands or more of time-varying confounders such as other drugs. The approach successfully handles the high dimensionality and can provide a sparse solution via an L₁ regularizer. We present details of the model and the associated optimization procedure, as well as results of empirical investigations.

Journal ArticleDOI
TL;DR: Semiparametric theory is used to derive a doubly robust estimator of the treatment-specific survival distribution in cases where it is believed that all the potential confounders are captured.
Abstract: Observational studies are frequently conducted to compare the effects of two treatments on survival. For such studies we must be concerned about confounding; that is, there are covariates that affect both the treatment assignment and the survival distribution. With confounding the usual treatment-specific Kaplan-Meier estimator might be a biased estimator of the underlying treatment-specific survival distribution. This article has two aims. In the first aim we use semiparametric theory to derive a doubly robust estimator of the treatment-specific survival distribution in cases where it is believed that all the potential confounders are captured. In cases where not all potential confounders have been captured, one may conduct a substudy using a stratified sampling scheme to capture additional covariates that may account for confounding. The second aim is to derive a doubly robust estimator for the treatment-specific survival distributions and its variance estimator with such a stratified sampling scheme. Simulation studies are conducted to show consistency and double robustness. These estimators are then applied to the data from the ASCERT study that motivated this research.
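
A hedged, simplified sketch of the doubly robust idea behind the article's first aim, written for an uncensored outcome (an augmented inverse-probability-weighted mean difference) rather than the treatment-specific survival curves; the stratified-sampling extension is not shown, and the data-generating model is an illustrative assumption.

```python
# Sketch: augmented inverse-probability-weighted (doubly robust) treatment-effect
# estimation for an uncensored outcome; a simplified stand-in for the survival-curve
# estimators in the abstract. Simulated confounded data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
n = 5000
X = rng.normal(size=(n, 2))
A = rng.binomial(1, 1 / (1 + np.exp(-(0.7 * X[:, 0] - 0.4 * X[:, 1]))))
Y = 1.0 * A + 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)   # true effect = 1.0

Xd = np.column_stack([np.ones(n), X])

# Propensity model (logistic regression by maximum likelihood).
nll = lambda b: np.sum(np.logaddexp(0.0, Xd @ b) - A * (Xd @ b))
e = 1 / (1 + np.exp(-(Xd @ minimize(nll, np.zeros(3), method="BFGS").x)))

# Outcome regressions fit separately within each treatment arm.
def ols_fit_predict(mask):
    coef, *_ = np.linalg.lstsq(Xd[mask], Y[mask], rcond=None)
    return Xd @ coef
m1, m0 = ols_fit_predict(A == 1), ols_fit_predict(A == 0)

# AIPW estimators of E[Y(1)] and E[Y(0)]; consistent if either model is correct.
mu1 = np.mean(A * (Y - m1) / e + m1)
mu0 = np.mean((1 - A) * (Y - m0) / (1 - e) + m0)
print("doubly robust treatment-effect estimate:", round(mu1 - mu0, 3))
```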

Journal ArticleDOI
TL;DR: Goodness-of-fit tests are useful in assessing whether a statistical model is consistent with available data as discussed by the authors, however, the usual χ² asymptotics often fail, either because of the paucity of the data or because a nonstandard test statistic is of interest.
Abstract: Goodness-of-fit tests are useful in assessing whether a statistical model is consistent with available data. However, the usual χ² asymptotics often fail, either because of the paucity of the data or because a nonstandard test statistic is of interest. In this article, we describe exact goodness-of-fit tests for first- and higher order Markov chains, with particular attention given to time-reversible ones. The tests are obtained by conditioning on the sufficient statistics for the transition probabilities and are implemented by simple Monte Carlo sampling or by Markov chain Monte Carlo. They apply both to single and to multiple sequences and allow a free choice of test statistic. Three examples are given. The first concerns multiple sequences of dry and wet January days for the years 1948-1983 at Snoqualmie Falls, Washington State, and suggests that standard analysis may be misleading. The second one is for a four-state DNA sequence and lends support to the original conclusion that a second-order Markov chain provides an adequate fit to the data. The last one is six-state atomistic data arising in molecular conformational dynamics simulation of solvated alanine dipeptide and points to strong evidence against a first-order reversible Markov chain at 6 picosecond time steps.

Journal ArticleDOI
TL;DR: A novel penalized regression method based on a weaker prior assumption that the parameters of neighboring nodes in a network are likely to be zero (or non‐zero) at the same time, regardless of their specific magnitudes is proposed.
Abstract: Summary. Penalized regression approaches are attractive in dealing with high-dimensional data such as those arising in high-throughput genomic studies. New methods have been introduced to utilize the network structure of predictors, for example, gene networks, to improve parameter estimation and variable selection. All the existing network-based penalized methods are based on an assumption that parameters, for example, regression coefficients, of neighboring nodes in a network are close in magnitude, which, however, may not hold. Here we propose a novel penalized regression method based on a weaker prior assumption that the parameters of neighboring nodes in a network are likely to be zero (or non-zero) at the same time, regardless of their specific magnitudes. We propose a novel non-convex penalty function to incorporate this prior, and an algorithm based on difference convex programming. We use simulated data and two breast cancer gene expression datasets to demonstrate the advantages of the proposed methods over some existing methods. Our proposed methods can be applied to more general problems for group variable selection.

Journal ArticleDOI
TL;DR: A class of hierarchical low‐rank spatial factor models is proposed that pursues stochastic selection of the latent factors without resorting to complex computational strategies by utilizing certain identifiability characterizations for the spatial factor model.
Abstract: This article deals with jointly modeling a large number of geographically referenced outcomes observed over a very large number of locations. We seek to capture associations among the variables as well as the strength of spatial association for each variable. In addition, we reckon with the common setting where not all the variables have been observed over all locations, which leads to spatial misalignment. Dimension reduction is needed in two aspects: (i) the length of the vector of outcomes, and (ii) the very large number of spatial locations. Latent variable (factor) models are usually used to address the former, although low-rank spatial processes offer a rich and flexible modeling option for dealing with a large number of locations. We merge these two ideas to propose a class of hierarchical low-rank spatial factor models. Our framework pursues stochastic selection of the latent factors without resorting to complex computational strategies (such as reversible jump algorithms) by utilizing certain identifiability characterizations for the spatial factor model. A Markov chain Monte Carlo algorithm is developed for estimation that also deals with the spatial misalignment problem. We recover the full posterior distribution of the missing values (along with model parameters) in a Bayesian predictive framework. Various additional modeling and implementation issues are discussed as well. We illustrate our methodology with simulation experiments and an environmental data set involving air pollutants in California.

Journal ArticleDOI
TL;DR: A Bayesian two‐stage phase I–II design for optimizing administration schedule and dose of an experimental agent based on the times to response and toxicity in the case where schedules are non‐nested and qualitatively different is proposed.
Abstract: Summary. A Bayesian two-stage phase I–II design is proposed for optimizing administration schedule and dose of an experimental agent based on the times to response and toxicity in the case where schedules are non-nested and qualitatively different. Sequentially adaptive decisions are based on the joint utility of the two event times. A utility function is constructed by partitioning the two-dimensional positive real quadrant of possible event time pairs into rectangles, eliciting a numerical utility for each rectangle, and fitting a smooth parametric function to the elicited values. We assume that each event time follows a gamma distribution with shape and scale parameters both modeled as functions of schedule and dose. A copula is assumed to obtain a bivariate distribution. To ensure an ethical trial, adaptive safety and efficacy acceptability conditions are imposed on the (schedule, dose) regimes. In stage 1 of the design, patients are randomized fairly among schedules and, within each schedule, a dose is chosen using a hybrid algorithm that either maximizes posterior mean utility or randomizes among acceptable doses. In stage 2, fair randomization among schedules is replaced by the hybrid algorithm. A modified version of this algorithm is used for nested schedules. Extensions of the model and utility function to accommodate death or discontinuation of follow up are described. The method is illustrated by an autologous stem cell transplantation trial in multiple myeloma, including a simulation study.

Journal ArticleDOI
TL;DR: Analysis of data on the strains of S. pneumoniae carried in attendees of day care units in the metropolitan area of Oslo, Norway finds evidence for strong between‐strain competition, as the acquisition of other strains in the already colonized hosts is estimated to have a relative rate of 0.09.
Abstract: Summary. Streptococcus pneumoniae is a typical commensal bacterium causing severe diseases. Its prevalence is high among young children attending day care units, due to lower levels of acquired immunity and a high rate of infectious contacts between the attendees. Understanding the population dynamics of different strains of S. pneumoniae is necessary, for example, for making successful predictions of changes in the composition of the strain community under intervention policies. Here we analyze data on the strains of S. pneumoniae carried in attendees of day care units in the metropolitan area of Oslo, Norway. We introduce a variant of approximate Bayesian computation methods, which is suitable for estimating the parameters governing the transmission dynamics in a setting where small local populations of hosts are subject to epidemics of different pathogenic strains due to infections independently acquired from the community. We find evidence for strong between-strain competition, as the acquisition of other strains in the already colonized hosts is estimated to have a relative rate of 0.09 (95% credibility interval [0.06, 0.14]). We also predict the frequency and size distributions for epidemics within the day care unit, as well as other epidemiologically relevant features. The assumption of ecological neutrality between the strains is observed to be compatible with the data. Model validation checks and the consistency of our results with previous research support the validity of our conclusions.

Journal ArticleDOI
TL;DR: A new moment-based approximation that performs well in simulations is developed; the resulting test possesses most of the characteristics of a good association test, especially when compared to existing quadratic score tests or restricted likelihood ratio tests.
Abstract: Summary Following the rapid development of genome-scale genotyping technologies, genetic association mapping has become a popular tool to detect genomic regions responsible for certain (disease) phenotypes, especially in early-phase pharmacogenomic studies with limited sample size. In response to such applications, a good association test needs to be (1) applicable to a wide range of possible genetic models, including, but not limited to, the presence of gene-by-environment or gene-by-gene interactions and non-linearity of a group of marker effects, (2) accurate in small samples, fast to compute on the genomic scale, and amenable to large scale multiple testing corrections, and (3) reasonably powerful to locate causal genomic regions. The kernel machine method represented in linear mixed models provides a viable solution by transforming the problem into testing the nullity of variance components. In this study, we consider score-based tests by choosing a statistic linear in the score function. When the model under the null hypothesis has only one error variance parameter, our test is exact in finite samples. When the null model has more than one variance parameter, we develop a new moment-based approximation that performs well in simulations. Through simulations and analysis of real data, we demonstrate that the new test possesses most of the aforementioned characteristics, especially when compared to existing quadratic score tests or restricted likelihood ratio tests.
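
A hedged sketch of a kernel-machine variance-component score statistic with a two-moment (Satterthwaite-type) scaled chi-square approximation in the one-error-variance case. It illustrates the general moment-matching idea, not the authors' new approximation or their exact small-sample test; the data and linear kernel are illustrative assumptions.

```python
# Sketch: kernel-machine variance-component score test with a two-moment
# (Satterthwaite-type) chi-square approximation in the one-error-variance case.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)
n, p, q = 300, 3, 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # null-model covariates
G = rng.binomial(2, 0.3, size=(n, q)).astype(float)              # marker genotypes
y = (X @ np.array([1.0, 0.5, -0.5])
     + G[:, :3] @ np.array([0.2, 0.2, 0.2])
     + rng.normal(size=n))

# Null model: OLS of y on X only.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
sigma2 = resid @ resid / (n - p)

# Score-type statistic for the kernel variance component, with a linear kernel K = GG'.
K = G @ G.T
Q = resid @ K @ resid / sigma2

# Two-moment approximation: Q ~ a * chi2(d) with a, d matched to tr(A) and tr(A^2),
# where A = (I - H) K (I - H) governs the null distribution of the quadratic form.
A = (np.eye(n) - H) @ K @ (np.eye(n) - H)
trA, trA2 = np.trace(A), np.trace(A @ A)
a, d = trA2 / trA, trA ** 2 / trA2
pval = chi2.sf(Q / a, df=d)
print("score statistic:", round(Q, 1), " approx. p-value:", round(pval, 4))
```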

Journal ArticleDOI
TL;DR: The issue is examined for Huber's m-statistics, including the t-test; the examination has three components: an example, asymptotic calculations using design sensitivity, and a simulation. A by-product is a new result giving the design sensitivity for the permutation distribution of m-statistics.
Abstract: Summary In an observational study, one treated subject may be matched for observed covariates to either one or several untreated controls. The common motivation for using several controls rather than one is to increase the power of a test of no effect under the doubtful assumption that matching for observed covariates suffices to remove bias from nonrandom treatment assignment. Does the choice between one or several matched controls affect the sensitivity of conclusions to violations of this doubtful assumption? With continuous responses, it is known that reducing the heterogeneity of matched pair differences reduces sensitivity to unmeasured biases, but increasing the sample size has a highly circumscribed effect on sensitivity to bias. Is the use of several controls rather than one analogous to a reduction in heterogeneity or to an increase in sample size? The issue is examined for Huber's m-statistics, including the t-test, the examination having three components: an example, asymptotic calculations using design sensitivity, and a simulation. Use of multiple controls with continuous responses yields a nontrivial reduction in sensitivity to unmeasured biases. An example looks at lead and cadmium in the blood of smokers from the 2008 National Health and Nutrition Examination Survey. A by-product of the discussion is a new result giving the design sensitivity for the permutation distribution of m-statistics.
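
A hedged sketch of a matched-pair sensitivity analysis at bias level Gamma using Wilcoxon's signed-rank statistic (one member of the family of m-statistics studied above): under the worst-case unobserved bias, the sign of each pair difference is positive with probability at most Gamma/(1 + Gamma), which yields a normal-approximation upper bound on the one-sided p-value. The simulated pair differences and the Gamma grid are illustrative assumptions.

```python
# Sketch: Rosenbaum-style sensitivity bounds for matched pairs using the Wilcoxon
# signed-rank statistic. Simulated pair differences; bias levels Gamma are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(12)
d = rng.normal(loc=0.4, scale=1.0, size=150)          # matched-pair differences
ranks = np.argsort(np.argsort(np.abs(d))) + 1.0       # ranks of |differences|
T_plus = ranks[d > 0].sum()                           # signed-rank statistic

for gamma in [1.0, 1.5, 2.0, 3.0]:
    p_plus = gamma / (1 + gamma)                      # worst-case P(positive sign) per pair
    mean = p_plus * ranks.sum()
    var = p_plus * (1 - p_plus) * np.sum(ranks ** 2)
    upper_p = norm.sf((T_plus - mean) / np.sqrt(var)) # upper bound on one-sided p-value
    print(f"Gamma = {gamma:.1f}: upper-bound p-value = {upper_p:.4f}")
```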