
Showing papers in "Biometrics in 2019"


Journal ArticleDOI
TL;DR: In this paper, the authors consider methods for causal inference in randomized trials nested within cohorts of trial-eligible individuals, including those who are not randomized, and show how baseline covariate data from the entire cohort, and treatment and outcome data only from randomized individuals, can be used to identify potential (counterfactual) outcome means and average treatment effects in the target population of all eligible individuals.
Abstract: We consider methods for causal inference in randomized trials nested within cohorts of trial-eligible individuals, including those who are not randomized. We show how baseline covariate data from the entire cohort, and treatment and outcome data only from randomized individuals, can be used to identify potential (counterfactual) outcome means and average treatment effects in the target population of all eligible individuals. We review identifiability conditions, propose estimators, and assess the estimators' finite-sample performance in simulation studies. As an illustration, we apply the estimators in a trial nested within a cohort of trial-eligible individuals to compare coronary artery bypass grafting surgery plus medical therapy vs. medical therapy alone for chronic coronary artery disease.
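As one concrete instance (our notation; the paper proposes several estimators), the outcome-model-based estimator implied by these identifiability conditions averages, over the entire cohort, an outcome regression fit only in the randomized subset:

\hat{\mu}_a = \frac{1}{n} \sum_{i=1}^{n} \hat{g}_a(X_i), \qquad \hat{g}_a(x) = \hat{E}[\, Y \mid X = x, S = 1, A = a \,],

where n is the size of the full cohort, S = 1 indicates randomization, and \hat{g}_a is fit among randomized individuals assigned to arm a; the average treatment effect in the target population is then \hat{\mu}_1 - \hat{\mu}_0.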

105 citations


Journal ArticleDOI
TL;DR: By reanalyzing completed randomized trials for mild cognitive impairment, schizophrenia, and depression, it is demonstrated how ANCOVA can achieve variance reductions of 4 to 32%.
Abstract: "Covariate adjustment" in the randomized trial context refers to an estimator of the average treatment effect that adjusts for chance imbalances between study arms in baseline variables (called "covariates"). The baseline variables could include, for example, age, sex, disease severity, and biomarkers. According to two surveys of clinical trial reports, there is confusion about the statistical properties of covariate adjustment. We focus on the analysis of covariance (ANCOVA) estimator, which involves fitting a linear model for the outcome given the treatment arm and baseline variables, and trials that use simple randomization with equal probability of assignment to treatment and control. We prove the following new (to the best of our knowledge) robustness property of ANCOVA to arbitrary model misspecification: Not only is the ANCOVA point estimate consistent (as proved by Yang and Tsiatis, 2001) but so is its standard error. This implies that confidence intervals and hypothesis tests conducted as if the linear model were correct are still asymptotically valid even when the linear model is arbitrarily misspecified, for example, when the baseline variables are nonlinearly related to the outcome or there is treatment effect heterogeneity. We also give a simple, robust formula for the variance reduction (equivalently, sample size reduction) from using ANCOVA. By reanalyzing completed randomized trials for mild cognitive impairment, schizophrenia, and depression, we demonstrate how ANCOVA can achieve variance reductions of 4 to 32%.
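A minimal sketch of the ANCOVA estimator described above (toy data and illustrative variable names of ours, not the authors' code):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))                                   # baseline covariates
a = rng.integers(0, 2, size=n)                                # 1:1 simple randomization
y = 0.5 * a + x @ np.array([1.0, -0.5]) + rng.normal(size=n)  # toy outcome

# ANCOVA: linear model for the outcome given treatment arm and baseline covariates
design = sm.add_constant(np.column_stack([a, x]))
fit = sm.OLS(y, design).fit()
print(fit.params[1], fit.bse[1])                              # treatment-effect estimate and its standard error

Per the robustness result summarized above, this model-based standard error (and the confidence interval built from it) remains asymptotically valid even if the linear model is arbitrarily misspecified.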

49 citations


Journal ArticleDOI
TL;DR: New estimands are defined that describe average potential outcomes for realistic counterfactual treatment allocation programs, extending existing estimands to take into consideration the units' covariates and dependence between units' treatment assignment.
Abstract: Interference arises when an individual's potential outcome depends not only on the individual's treatment level, but also on the treatment level of others. A common assumption in the causal inference literature in the presence of interference is partial interference, implying that the population can be partitioned into clusters of individuals whose potential outcomes only depend on the treatment of units within the same cluster. Previous literature has defined average potential outcomes under counterfactual scenarios where treatments are randomly allocated to units within a cluster. However, within clusters there may be units that are more or less likely to receive treatment based on covariates or neighbors' treatment. We define new estimands that describe average potential outcomes for realistic counterfactual treatment allocation programs, extending existing estimands to take into consideration the units' covariates and dependence between units' treatment assignment. We further propose entirely new estimands for population-level interventions over the collection of clusters, which correspond in the motivating setting to regulations at the federal (vs. cluster or regional) level. We discuss these estimands, propose unbiased estimators and derive asymptotic results as the number of clusters grows. For a small number of observed clusters, a bootstrap approach for confidence intervals is proposed. Finally, we estimate effects in a comparative effectiveness study of power plant emission reduction technologies on ambient ozone pollution.

44 citations



Journal ArticleDOI
TL;DR: SLIDE is a linked component model that directly incorporates partially-shared structures, allows for joint identification of the number of components of each type (in contrast to existing sequential approaches), and demonstrates excellent performance in both signal estimation and component selection.
Abstract: The increased availability of multi-view data (data on the same samples from multiple sources) has led to strong interest in models based on low-rank matrix factorizations. These models represent each data view via shared and individual components, and have been successfully applied for exploratory dimension reduction, association analysis between the views, and consensus clustering. Despite these advances, there remain challenges in modeling partially-shared components and identifying the number of components of each type (shared/partially-shared/individual). We formulate a novel linked component model that directly incorporates partially-shared structures. We call this model SLIDE for Structural Learning and Integrative DEcomposition of multi-view data. The proposed model-fitting and selection techniques allow for joint identification of the number of components of each type, in contrast to existing sequential approaches. In our empirical studies, SLIDE demonstrates excellent performance in both signal estimation and component selection. We further illustrate the methodology on the breast cancer data from The Cancer Genome Atlas repository.

39 citations



Journal ArticleDOI
TL;DR: In this paper, the authors introduce a novel class of factor analysis methodologies for the joint analysis of multiple studies, which separately identify and estimate common factors shared across multiple studies and study-specific factors.
Abstract: We introduce a novel class of factor analysis methodologies for the joint analysis of multiple studies. The goal is to separately identify and estimate (1) common factors shared across multiple studies, and (2) study-specific factors. We develop an Expectation Conditional-Maximization algorithm for parameter estimates and we provide a procedure for choosing the numbers of common and specific factors. We present simulations for evaluating the performance of the method and we illustrate it by applying it to gene expression data in ovarian cancer. In both, we clarify the benefits of a joint analysis compared to the standard factor analysis. We have provided a tool to accelerate the pace at which we can combine unsupervised analysis across multiple studies, and understand the cross-study reproducibility of signal in multivariate data. An R package (MSFA), is implemented and is available on GitHub.
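In our reading, the underlying multi-study factor model decomposes each observation i in study s roughly as

x_{si} = \Phi f_{si} + \Lambda_s l_{si} + e_{si}, \qquad f_{si} \sim N(0, I_K), \quad l_{si} \sim N(0, I_{J_s}), \quad e_{si} \sim N(0, \Psi_s),

where \Phi carries the loadings on the K common factors shared across studies and \Lambda_s the loadings on the J_s study-specific factors; the procedure referred to above chooses K and the J_s.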

38 citations


Journal ArticleDOI
TL;DR: The method is applied to a twelve-year survey of male jaguars in the Cockscomb Basin Wildlife Sanctuary, Belize, to estimate survival probability and population abundance over time and provides a lower root-mean-square error in predicting population density compared to closed population models.
Abstract: Open population capture-recapture models are widely used to estimate population demographics and abundance over time. Bayesian methods exist to incorporate open population modeling with spatial capture-recapture (SCR), allowing for estimation of the effective area sampled and population density. Here, open population SCR is formulated as a hidden Markov model (HMM), allowing inference by maximum likelihood for both Cormack-Jolly-Seber and Jolly-Seber models, with and without activity center movement. The method is applied to a 12-year survey of male jaguars (Panthera onca) in the Cockscomb Basin Wildlife Sanctuary, Belize, to estimate survival probability and population abundance over time. For this application, inference is shown to be biased when assuming activity centers are fixed over time, while including a model for activity center movement provides negligible bias and nominal confidence interval coverage, as demonstrated by a simulation study. The HMM approach is compared with Bayesian data augmentation and closed population models for this application. The method is substantially more computationally efficient than the Bayesian approach and provides a lower root-mean-square error in predicting population density compared to closed population models.

38 citations


Journal ArticleDOI
TL;DR: In this article, a generalized linear regression analysis with compositional covariates is proposed, where a group of linear constraints on regression coefficients are imposed to account for the compositional nature of the data and to achieve subcompositional coherence.
Abstract: Motivated by regression analysis for microbiome compositional data, this article considers generalized linear regression analysis with compositional covariates, where a group of linear constraints on regression coefficients are imposed to account for the compositional nature of the data and to achieve subcompositional coherence. A penalized likelihood estimation procedure using a generalized accelerated proximal gradient method is developed to efficiently estimate the regression coefficients. A de-biased procedure is developed to obtain asymptotically unbiased and normally distributed estimates, which leads to valid confidence intervals of the regression coefficients. Simulation results show the correctness of the coverage probability of the confidence intervals and smaller variances of the estimates when the appropriate linear constraints are imposed. The methods are illustrated by a microbiome study in order to identify bacterial species that are associated with inflammatory bowel disease (IBD) and to predict IBD using the fecal microbiome.
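A standard way to encode subcompositional coherence (our sketch; the paper's constraints may be more general) is the log-contrast form with zero-sum coefficients,

g(E[y_i]) = \beta_0 + \sum_{j=1}^{p} \beta_j \log x_{ij}, \qquad \text{subject to} \quad \sum_{j \in G_k} \beta_j = 0 \ \text{for each constrained group } G_k,

which makes the fitted model invariant to rescaling of the composition within each constrained group.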

35 citations


Journal ArticleDOI
TL;DR: A novel method for separating amplitude and phase variability in exponential family functional data is introduced, and simulations designed to mimic the application indicate that the proposed methods outperform competing approaches in terms of estimation accuracy and computational efficiency.
Abstract: We consider the problem of aligning curves from exponential family distributions. The approach is based on the combination of alignment and functional principal components analysis, and is facilitated by recent extensions of FPCA to non-Gaussian settings. Our work is motivated by the study of physical activity using accelerometers, wearable devices that provide around-the-clock monitoring of activity and produce non-Gaussian measurements. We apply the proposed methods to activity counts using a Poisson distribution, and to a binary “active” vs “inactive” indicator using a binomial distribution. After alignment, the trajectories show clear peaks of activity in the morning and afternoon with a dip in the middle of the day.

33 citations


Journal ArticleDOI
TL;DR: In this article, the authors leverage a large observational data set and compare, in terms of mortality and CD4 cell count, the dynamic treatment initiation rules for human immunodeficiency virus-infected adolescents.
Abstract: Evidence supporting the current World Health Organization recommendations of early antiretroviral therapy (ART) initiation for adolescents is inconclusive. We leverage a large observational data set and compare, in terms of mortality and CD4 cell count, the dynamic treatment initiation rules for human immunodeficiency virus-infected adolescents. Our approaches extend the marginal structural model for estimating outcome distributions under dynamic treatment regimes, developed in Robins et al. (2008), to allow the causal comparisons of both specific regimes and regimes along a continuum. Furthermore, we propose strategies to address three challenges posed by the complex data set: continuous-time measurement of the treatment initiation process; sparse measurement of longitudinal outcomes of interest, leading to incomplete data; and censoring due to dropout and death. We derive a weighting strategy for continuous-time treatment initiation, use imputation to deal with missingness caused by sparse measurements and dropout, and define a composite outcome that incorporates both death and CD4 count as a basis for comparing treatment regimes. Our analysis suggests that immediate ART initiation leads to lower mortality and higher median values of the composite outcome, relative to other initiation rules.

Journal ArticleDOI
TL;DR: This work proposes a new sparse CCA (SCCA) method that recasts high-dimensional CCA as an iterative penalized least squares problem and produces nested solutions, thus providing great convenience in practice.
Abstract: It is increasingly interesting to model the relationship between two sets of high-dimensional measurements with potentially high correlations. Canonical correlation analysis (CCA) is a classical tool that explores the dependency of two multivariate random variables and extracts canonical pairs of highly correlated linear combinations. Driven by applications in genomics, text mining, and imaging research, among others, many recent studies generalize CCA to high-dimensional settings. However, most of them either rely on strong assumptions on covariance matrices, or do not produce nested solutions. We propose a new sparse CCA (SCCA) method that recasts high-dimensional CCA as an iterative penalized least squares problem. Thanks to the new iterative penalized least squares formulation, our method directly estimates the sparse CCA directions with efficient algorithms. Therefore, in contrast to some existing methods, the new SCCA does not impose any sparsity assumptions on the covariance matrices. The proposed SCCA is also very flexible in the sense that it can be easily combined with properly chosen penalty functions to perform structured variable selection and incorporate prior information. Moreover, our proposal of SCCA produces nested solutions and thus provides great convenience in practice. Theoretical results show that SCCA can consistently estimate the true canonical pairs with an overwhelming probability in ultra-high dimensions. Numerical results also demonstrate the competitive performance of SCCA.

Journal ArticleDOI
TL;DR: A metric of the information content of cluster-period cells, entire treatment sequences, and entire periods of the standard stepped wedge design is defined as the increase in variance of the estimator of the treatment effect when that cell, sequence, or period is omitted.
Abstract: Stepped wedge and other multiple-period cluster randomized trials, which collect data from multiple clusters across multiple time periods, are being conducted with increasing frequency; statistical research into these designs has not kept pace. In particular, some stepped wedge designs with missing cluster-period "cells" have been proposed without any formal justification. Indeed, there are no general guidelines regarding which cells of a stepped wedge design contribute the least information toward estimation of the treatment effect, and correspondingly which may be preferentially omitted. In this article, we define a metric of the information content of cluster-period cells, entire treatment sequences, and entire periods of the standard stepped wedge design as the increase in variance of the estimator of the treatment effect when that cell, sequence, or period is omitted. We show that the most information-rich cells are those that occur immediately before or after treatment switches, but also that there are additional cells that contribute almost as much to the estimation of the treatment effect. However, the information content patterns depend on the assumed correlation structure for the repeated measurements within a cluster.
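Written out (our notation), if \mathrm{Var}(\hat{\theta}) is the variance of the treatment-effect estimator under the full design and \mathrm{Var}_{(-c)}(\hat{\theta}) the variance when cell, sequence, or period c is omitted, the metric summarizes the increase as, for example,

IC(c) = \frac{\mathrm{Var}_{(-c)}(\hat{\theta})}{\mathrm{Var}(\hat{\theta})} \; (\geq 1),

so the most information-rich design elements are those with the largest IC(c); as noted above, the ranking depends on the assumed within-cluster correlation structure.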

Journal ArticleDOI
TL;DR: An automated feature selection method based entirely on unlabeled observations is presented. It is demonstrated that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model, which reduces the number of gold-standard labels necessary for phenotyping, thereby harnessing the automated power of EHR data and improving efficiency.
Abstract: The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.
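A minimal sketch of the two-stage idea under simplifying assumptions of ours (a two-component Gaussian mixture as the unsupervised surrogate and a lasso in the second stage; the paper's exact clustering and regression choices may differ):

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LassoCV

# S: highly predictive surrogate features (e.g., diagnosis-code counts, NLP mention counts)
# W: remaining candidate EHR features; both are unlabeled
rng = np.random.default_rng(1)
n, q = 2000, 50
true = rng.integers(0, 2, size=n)                  # unobserved phenotype (simulated)
S = true[:, None] * 2 + rng.normal(size=(n, 2))
W = rng.normal(size=(n, q))
W[:, :5] += true[:, None]                          # first 5 features informative

# Step 1: unsupervised surrogate for the phenotype from the surrogate features
gm = GaussianMixture(n_components=2, random_state=0).fit(S)
p_hat = gm.predict_proba(S)[:, 1]                  # estimated class membership probability

# Step 2: sparse regression of the surrogate outcome on the remaining features
sel = LassoCV(cv=5).fit(W, p_hat)
informative = np.flatnonzero(sel.coef_ != 0)       # features carried forward to supervised modeling
print(informative)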

Journal ArticleDOI
TL;DR: The likelihood for this model is derived, and an expectation-maximization algorithm is developed to maximize an L1-penalized version of this likelihood in order to limit the number of factors and associated biomarkers.
Abstract: We propose a model for high dimensional mediation analysis that includes latent variables. We describe our model in the context of an epidemiologic study for incident breast cancer with one exposure and a large number of biomarkers (i.e., potential mediators). We assume that the exposure directly influences a group of latent, or unmeasured, factors which are associated with both the outcome and a subset of the biomarkers. The biomarkers associated with the latent factors linking the exposure to the outcome are considered "mediators." We derive the likelihood for this model and develop an expectation-maximization algorithm to maximize an L1-penalized version of this likelihood to limit the number of factors and associated biomarkers. We show that the resulting estimates are consistent and that the estimates of the nonzero parameters have an asymptotically normal distribution. In simulations, procedures based on this new model can have significantly higher power for detecting the mediating biomarkers compared with the simpler approaches. We apply our method to a study that evaluates the relationship between body mass index, 481 metabolic measurements, and estrogen-receptor positive breast cancer.

Journal ArticleDOI
TL;DR: This work considers a novel class of semiparametric additive hazards models which leave the effects of covariates unspecified, and proposes two different estimation approaches for the hazard difference and the relative chance of survival, which yield estimators that are doubly robust.
Abstract: The estimation of conditional treatment effects in an observational study with a survival outcome typically involves fitting a hazards regression model adjusted for a high-dimensional covariate. Standard estimation of the treatment effect is then not entirely satisfactory, as the misspecification of the effect of this covariate may induce a large bias. Such misspecification is a particular concern when inferring the hazard difference, because it is difficult to postulate additive hazards models that guarantee non-negative hazards over the entire observed covariate range. We therefore consider a novel class of semiparametric additive hazards models which leave the effects of covariates unspecified. The efficient score under this model is derived. We then propose two different estimation approaches for the hazard difference (and hence also the relative chance of survival), both of which yield estimators that are doubly robust. The approaches are illustrated using simulation studies and data on right heart catheterization and mortality from the SUPPORT study.

Journal ArticleDOI
TL;DR: An exact, unconditional, non-randomized procedure for producing confidence intervals for the grand mean in a normal-normal random effects meta-analysis is described and illustrated on meta-analyses investigating the effect of calcium intake on bone mineral density.
Abstract: We describe an exact, unconditional, non-randomized procedure for producing confidence intervals for the grand mean in a normal-normal random effects meta-analysis. The procedure targets meta-analyses based on too few primary studies, ≤7, say, to allow for the conventional asymptotic estimators, e.g., DerSimonian and Laird (1986), or non-parametric resampling-based procedures, e.g., Liu et al. (2017). Meta-analyses with so few studies are common, with one recent sample of 22,453 health-related meta-analyses finding a median of 3 primary studies per meta-analysis (Davey et al., 2011). Reliable and efficient inference procedures are therefore needed to address this setting. The coverage level of the resulting CI is guaranteed to be above the nominal level, up to Monte Carlo error, provided the meta-analysis contains more than 1 study and the model assumptions are met. After employing several techniques to accelerate computation, the new CI can be easily constructed on a personal computer. Simulations suggest that the proposed CI typically is not overly conservative. We illustrate the approach on several contrasting examples of meta-analyses investigating the effect of calcium intake on bone mineral density.
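For reference, the normal-normal random-effects model in question is the standard one (generic notation, not specific to this paper):

\hat{\theta}_i \mid \theta_i \sim N(\theta_i, \sigma_i^2), \qquad \theta_i \sim N(\mu, \tau^2), \qquad i = 1, \ldots, K,

where K is the number of primary studies, the within-study variances \sigma_i^2 are treated as known, and \mu is the grand mean targeted by the proposed exact confidence interval.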

Journal ArticleDOI
TL;DR: The proposed method avoids Kalman filter approximations and Monte Carlo simulations; it is exact, flexible, and allows the use of standard techniques of classical inference for systems that have a discrete state space.
Abstract: Integrated population modelling is widely used in statistical ecology. It allows data from population time series and independent surveys to be analysed simultaneously. In classical analysis the time-series likelihood component can be conveniently approximated using Kalman filter methodology. However, the natural way to model systems which have a discrete state space is to use hidden Markov models (HMMs). The proposed method avoids the Kalman filter approximations and Monte Carlo simulations. Subject to possible numerical sensitivity analysis, it is exact, flexible, and allows the use of standard techniques of classical inference. We apply the approach to data on Little owls, where the model is shown to require a one-dimensional state space, and Northern lapwings, with a two-dimensional state space. In the former example the method identifies a parameter redundancy which changes the perception of the data needed to estimate immigration in integrated population modelling. The latter example may be analysed using either first- or second-order HMMs, describing numbers of one-year-olds and adults or adults only, respectively. The use of first-order chains is found to be more efficient, mainly due to the smaller number of one-year-olds than adults in this application. For the lapwing modelling it is necessary to group the states in order to reduce the large dimension of the state space. Results agree with Bayesian and Kalman filter analyses, and avenues for future research are identified.

Journal ArticleDOI
TL;DR: In this article, an embedding of the log-ratio parameter space into a space of much lower dimension is proposed, which serves as the foundation for a two-step fitting procedure that combines a convex filtering step with a second nonconvex pruning step to yield highly sparse solutions.
Abstract: Positive-valued signal data is common in the biological and medical sciences, due to the prevalence of mass spectrometry and other imaging techniques. With such data, only the relative intensities of the raw measurements are meaningful. It is desirable to consider models consisting of the log-ratios of all pairs of the raw features, since log-ratios are the simplest meaningful derived features. In this case, however, the dimensionality of the predictor space becomes large, and computationally efficient estimation procedures are required. In this work, we introduce an embedding of the log-ratio parameter space into a space of much lower dimension and use this representation to develop an efficient penalized fitting procedure. This procedure serves as the foundation for a two-step fitting procedure that combines a convex filtering step with a second non-convex pruning step to yield highly sparse solutions. On a cancer proteomics data set, the proposed method fits a highly sparse model consisting of features of known biological relevance while greatly improving upon the predictive accuracy of less interpretable methods.
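To make the dimensionality point concrete, here is a small illustration of building the all-pairs log-ratio design (ours, not the authors' software); any linear model in these p(p-1)/2 log-ratios is equivalent to a linear model in the p log-features with coefficients summing to zero, which is the kind of low-dimensional representation the embedding exploits:

import numpy as np
from itertools import combinations

def all_pairs_log_ratios(X):
    """X: (n, p) positive-valued raw intensities -> (n, p*(p-1)/2) log-ratio features."""
    L = np.log(X)
    pairs = list(combinations(range(X.shape[1]), 2))
    return np.column_stack([L[:, j] - L[:, k] for j, k in pairs]), pairs

rng = np.random.default_rng(0)
X = rng.lognormal(size=(100, 20))     # 20 raw features
Z, pairs = all_pairs_log_ratios(X)
print(Z.shape)                        # (100, 190): quadratic growth motivates the embedding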

Journal ArticleDOI
TL;DR: This paper fits a nested multivariate max-stable model to the maxima of air pollution concentrations and temperatures recorded at a number of sites in the Los Angeles area, showing that the proposed model succeeds in capturing their complex tail dependence structure.
Abstract: Capturing the potentially strong dependence among the peak concentrations of multiple air pollutants across a spatial region is crucial for assessing the related public health risks. In order to investigate the multivariate spatial dependence properties of air pollution extremes, we introduce a new class of multivariate max-stable processes. Our proposed model admits a hierarchical tree-based formulation, in which the data are conditionally independent given some latent nested positive stable random factors. The hierarchical structure facilitates Bayesian inference and offers a convenient and interpretable characterization. We fit this nested multivariate max-stable model to the maxima of air pollution concentrations and temperatures recorded at a number of sites in the Los Angeles area, showing that the proposed model succeeds in capturing their complex tail dependence structure.

Journal ArticleDOI
TL;DR: An ENsemble Deep Learning Optimal Treatment (EndLot) approach is proposed for personalized medicine problems that employs an ensemble of deep neural networks (DNNs) to learn the optimal decision rule.
Abstract: An ENsemble Deep Learning Optimal Treatment (EndLot) approach is proposed for personalized medicine problems. The statistical framework of the proposed method is based on the outcome weighted learning (OWL) framework which transforms the optimal decision rule problem into a weighted classification problem. We further employ an ensemble of deep neural networks (DNNs) to learn the optimal decision rule. Utilizing the flexibility of DNNs and the stability of bootstrap aggregation, the proposed method achieves a considerable improvement over existing methods. An R package "ITRlearn" is developed to implement the proposed method. Numerical performance is demonstrated via simulation studies and a real data analysis of the Cancer Cell Line Encyclopedia data.
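The OWL reduction underlying the method is easy to state: with randomized treatment A in {-1, +1}, reward R (larger is better), and known propensity pi(A|X), the optimal rule maximizes E[ R * 1{A = d(X)} / pi(A|X) ], i.e., a classification of A with sample weights proportional to R/pi. A toy sketch with a single weighted logistic classifier standing in for the paper's ensemble of deep networks (all data and names here are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
A = rng.choice([-1, 1], size=n)                 # randomized treatment, propensity 0.5
R = 1 + X[:, 0] * A + rng.normal(size=n)        # reward: treatment helps when X[:,0] > 0

w = (R - R.min()) / 0.5                         # OWL weights: (shifted) reward / propensity
clf = LogisticRegression().fit(X, (A == 1).astype(int), sample_weight=w)
d_hat = np.where(clf.predict(X) == 1, 1, -1)    # estimated individualized treatment rule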

Journal ArticleDOI
TL;DR: This extension substantially widens the range of alternatives against which the win ratio is known to be consistent and incorporates alternatives induced by simple and popular copula models that are left out by the characterization of Luo et al. (2015).
Abstract: We extend the results of Luo et al. (2015, Biometrics 71, 139-145) regarding the alternative hypotheses for the win ratio from hazard orders to the upper quadrant stochastic order on the plane. This extension substantially widens the range of alternatives against which the win ratio is known to be consistent. It incorporates alternatives induced by simple and popular copula models that are left out by the characterization of Luo et al. (2015). We also discuss how our results may be generalized to win ratios in multivariate and stratified settings.

Journal ArticleDOI
TL;DR: This work proposes using the fused sparse group lasso (FSGL) penalty to encourage structured, sparse, and interpretable solutions by incorporating prior information about spatial and group structure among voxels and presents optimization steps for FSGL penalized regression using the alternating direction method of multipliers algorithm.
Abstract: Predicting clinical variables from whole-brain neuroimages is a high-dimensional problem that can potentially benefit from feature selection or extraction. Penalized regression is a popular embedded feature selection method for high-dimensional data. For neuroimaging applications, spatial regularization using the l1 or l2 norm of the image gradient has shown good performance, yielding smooth solutions in spatially contiguous brain regions. Enormous resources have been devoted to establishing structural and functional brain connectivity networks that can be used to define spatially distributed yet related groups of voxels. We propose using the fused sparse group lasso (FSGL) penalty to encourage structured, sparse, and interpretable solutions by incorporating prior information about spatial and group structure among voxels. We present optimization steps for FSGL penalized regression using the alternating direction method of multipliers algorithm. With simulation studies and in application to real functional magnetic resonance imaging data from the Autism Brain Imaging Data Exchange, we demonstrate conditions under which fusion and group penalty terms together outperform either of them alone.
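For orientation, a fused sparse group lasso penalty of the kind described combines three terms (our notation; the exact weights and groupings in the paper may differ):

P(\beta) = \lambda_1 \sum_j |\beta_j| \;+\; \lambda_2 \sum_{(j,k) \in E} |\beta_j - \beta_k| \;+\; \lambda_3 \sum_{g \in G} \|\beta_g\|_2,

where E indexes pairs of spatially adjacent voxels (the fusion term), G the voxel groups defined by connectivity networks (the group term), and the first term gives voxel-level sparsity.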

Journal ArticleDOI
TL;DR: This paper proposes a single-index model for the mixing probabilities, a flexible model that at the same time does not suffer from curse-of-dimensionality problems, and estimates it using a maximum likelihood approach.
Abstract: In survival analysis, it often happens that a certain fraction of the subjects under study never experience the event of interest, that is, they are considered "cured." In the presence of covariates, a common model for this type of data is the mixture cure model, which assumes that the population consists of two subpopulations, namely the cured and the non-cured ones, and it writes the survival function of the whole population given a set of covariates as a mixture of the survival function of the cured subjects (which equals one), and the survival function of the non-cured ones. In the literature, one usually assumes that the mixing probabilities follow a logistic model. This is, however, a strong modeling assumption, which might not be met in practice. Therefore, in order to have a flexible model which at the same time does not suffer from curse-of-dimensionality problems, we propose in this paper a single-index model for the mixing probabilities. For the survival function of the non-cured subjects we assume a Cox proportional hazards model. We estimate this model using a maximum likelihood approach. We also carry out a simulation study, in which we compare the estimators under the single-index model and under the logistic model for various model settings, and we apply the new model and estimation method on a breast cancer data set.
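In symbols (our notation), the model writes the population survival function as

S(t \mid x, z) = 1 - p(\gamma^{\top} x) + p(\gamma^{\top} x)\, S_u(t \mid z),

where p(\cdot) is a flexible function of the single index \gamma^{\top} x (replacing the usual logistic specification for the non-cure probability) and S_u(t \mid z) is the survival function of the non-cured subjects, modeled here by a Cox proportional hazards model.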

Journal ArticleDOI
TL;DR: It is the experience of the author that two or three adjustments, guided by balance diagnostics, can substantially improve covariate balance, perhaps requiring fifteen minutes' effort sitting at the computer.
Abstract: Multivariate matching in observational studies tends to view covariate differences symmetrically: a difference in age of 10 years is thought equally problematic whether the treated subject is older or younger than the matched control. If matching is correcting an imbalance in age, such that treated subjects are typically older than controls, then the situation in need of correction is asymmetric: a matched pair with a difference in age of 10 years is much more likely to have an older treated subject and a younger control than the opposite. Correcting the bias may be easier if matching tries to avoid the typical case that creates the bias. We describe several easily used, asymmetric, directional penalties and illustrate how they can improve covariate balance in a matched sample. The investigator starts with a matched sample built in a conventional way, then diagnoses residual covariate imbalances in need of reduction, and achieves the needed reduction by slightly altering the distance matrix with directional penalties, creating a new matched sample. Unlike penalties commonly used in matching, a directional penalty can go too far, reversing the direction of the bias rather than reducing the bias, so the magnitude of the directional penalty matters and may need adjustment. Our experience is that two or three adjustments, guided by balance diagnostics, can substantially improve covariate balance, perhaps requiring fifteen minutes' effort sitting at the computer. We also explore the connection between directional penalties and a widely used technique in integer programming, namely Lagrangian relaxation of problematic linear side constraints in a minimum cost flow problem. In effect, many directional penalties are Lagrange multipliers, pushing a matched sample in the direction of satisfying a linear constraint that would not be satisfied without penalization. The method and example are in the R package DiPs, available on CRAN.
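A minimal illustration of a directional penalty (ours; the DiPs package embeds this in a minimum-cost-flow match, and a simple assignment solver stands in for it here): penalize a pair only when the covariate difference runs in the direction responsible for the bias.

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
age_t = rng.normal(62, 8, size=30)              # treated subjects tend to be older
age_c = rng.normal(55, 8, size=60)              # pool of potential controls

D = np.abs(age_t[:, None] - age_c[None, :])     # symmetric base distance
lam = 2.0
directional = np.maximum(age_t[:, None] - age_c[None, :], 0)  # penalize only "older treated, younger control"
rows, cols = linear_sum_assignment(D + lam * directional)

print(np.mean(age_t[rows] - age_c[cols]))       # residual age imbalance; adjust lam if the penalty overshoots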

Journal ArticleDOI
TL;DR: The proposed integrative reduced-rank regression (iRRR) seamlessly bridges group-sparse and low-rank methods and can achieve a substantially faster convergence rate under realistic settings of multi-view learning.
Abstract: Multi-view data have been routinely collected in various fields of science and engineering. A general problem is to study the predictive association between multivariate responses and multi-view predictor sets, all of which can be of high dimensionality. It is likely that only a few views are relevant to prediction, and the predictors within each relevant view contribute to the prediction collectively rather than sparsely. We cast this new problem under the familiar multivariate regression framework and propose an integrative reduced-rank regression (iRRR), where each view has its own low-rank coefficient matrix. As such, latent features are extracted from each view in a supervised fashion. For model estimation, we develop a convex composite nuclear norm penalization approach, which admits an efficient algorithm via the alternating direction method of multipliers. Extensions to non-Gaussian and incomplete data are discussed. Theoretically, we derive non-asymptotic oracle bounds of iRRR under a restricted eigenvalue condition. Our results recover oracle bounds of several special cases of iRRR including Lasso, group Lasso, and nuclear norm penalized regression. Therefore, iRRR seamlessly bridges group-sparse and low-rank methods and can achieve a substantially faster convergence rate under realistic settings of multi-view learning. Simulation studies and an application in the Longitudinal Studies of Aging further showcase the efficacy of the proposed methods.
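The composite nuclear norm criterion described above can be written (our notation) as

\min_{B_1, \ldots, B_K} \; \frac{1}{2n} \Big\| Y - \sum_{k=1}^{K} X_k B_k \Big\|_F^2 \;+\; \lambda \sum_{k=1}^{K} w_k \| B_k \|_* ,

where X_k and B_k are the predictor matrix and coefficient matrix of the k-th view, \|\cdot\|_* is the nuclear norm, and the w_k are view-specific weights; the penalty shrinks some B_k to zero (irrelevant views) and drives the remaining ones toward low rank.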

Journal ArticleDOI
TL;DR: A semi-supervised method is proposed to make inferences about both the accuracy of multiple available algorithms and the effect of genetic markers on the true phenotype, leveraging information from a large set of unlabeled data, where genetic markers and algorithm outputs are available, and a small validation data set, where labels are additionally available.
Abstract: The Electronic Medical Records (EMR) data linked with genomic data have facilitated efficient and large scale translational studies. One major challenge in using EMR for translational research is the difficulty in accurately and efficiently annotating disease phenotypes due to the low accuracy of billing codes and the time involved with manual chart review. Recent efforts such as those by the Electronic Medical Records and Genomics (eMERGE) Network and Informatics for Integrating Biology & the Bedside (i2b2) have led to an increasing number of algorithms available for classifying various disease phenotypes. Investigators can apply such algorithms to obtain predicted phenotypes for their specific EMR study. They typically perform a small validation study within their cohort to assess the algorithm performance and then subsequently treat the algorithm classification as the true phenotype for downstream genetic association analyses. Despite the superior performance compared to simple billing codes, these algorithms may not port well across institutions, leading to bias and low power for association studies. In this paper, we propose a semi-supervised method to make inferences about both the accuracy of multiple available algorithms and the effect of genetic markers on the true phenotype, leveraging information from both a large set of unlabeled data, where genetic markers and algorithm outputs are available, and a small validation data set, where labels are additionally available. The simulation studies show that the proposed method substantially outperforms existing methods from the missing data literature. The proposed methods are applied to an EMR study of how low density lipoprotein risk alleles affect the risk of cardiovascular disease among patients with rheumatoid arthritis.

Journal ArticleDOI
Yen-Tsung Huang
TL;DR: A novel test is proposed that conducts J hypothesis tests accounting for the composite null hypothesis by adjusting for the variances of the normally distributed statistics for the S-M and M-Y associations.
Abstract: Mediation effects of multiple mediators are determined by two associations: one between an exposure and mediators (S-M) and the other between the mediators and an outcome conditional on the exposure (M-Y). The test for mediation effects is conducted under a composite null hypothesis, that is, either one of the S-M and M-Y associations is zero or both are zero. Without accounting for the composite null, the type 1 error rate within a study containing a large number of multimediator tests may be much less than expected. We propose a novel test to address the issue. For each mediation test j, j = 1, …, J, we examine the S-M and M-Y associations using two separate variance component tests. Assuming a zero-mean working distribution with a common variance for the element-wise S-M (and M-Y) associations, score tests for the variance components are constructed. We transform the test statistics into two normally distributed statistics under the null. Using a recently developed result, we conduct J hypothesis tests accounting for the composite null hypothesis by adjusting for the variances of the normally distributed statistics for the S-M and M-Y associations. Advantages of the proposed test over other methods are illustrated in simulation studies and a data application where we analyze lung cancer data from The Cancer Genome Atlas to investigate the smoking effect on gene expression through DNA methylation in 15,114 genes.

Journal ArticleDOI
TL;DR: The asymptotic bias in the disease progression marker's change over time (slope) is evaluated for a specific class of joint models, termed shared-random-effects models (SREMs), under MAR drop-out, and an alternative SREM is proposed.
Abstract: Missing data are common in longitudinal studies. Likelihood-based methods ignoring the missingness mechanism are unbiased provided missingness is at random (MAR); under not-at-random missingness (MNAR), joint modeling is commonly used, often as part of sensitivity analyses. In our motivating example of modeling CD4 count trajectories during untreated HIV infection, CD4 counts are mainly censored due to treatment initiation, with the nature of this mechanism remaining debatable. Here, we evaluate the bias in the disease progression marker's change over time (slope) of a specific class of joint models, termed shared-random-effects-models (SREMs), under MAR drop-out and propose an alternative SREM model. Our proposed model relates drop-out to both the observed marker's data and the corresponding random effects, in contrast to most SREMs, which assume that the marker and the drop-out processes are independent given the random effects. We analytically calculate the asymptotic bias in two SREMs under specific MAR drop-out mechanisms, showing that the bias in marker's slope increases as the drop-out probability increases. The performance of the proposed model, and other commonly used SREMs, is evaluated under specific MAR and MNAR scenarios through simulation studies. Under MAR, the proposed model yields nearly unbiased slope estimates, whereas the other SREMs yield seriously biased estimates. Under MNAR, the proposed model estimates are approximately unbiased, whereas those from the other SREMs are moderately to heavily biased, depending on the parameterization used. The examined models are also fitted to real data and results are compared/discussed in the light of our analytical and simulation-based findings.

Journal ArticleDOI
TL;DR: The purpose of the paper is to propose the first group testing algorithms for multiplex assays that take advantage of individual risk‐factor information as expressed by these probabilities, and to show that the methods significantly reduce the number of tests required while preserving accuracy.
Abstract: Infectious disease testing frequently takes advantage of two tools-group testing and multiplex assays-to make testing timely and cost effective. Until the work of Tebbs et al. (2013) and Hou et al. (2017), there was no research available to understand how best to apply these tools simultaneously. This recent work focused on applications where each individual is considered to be identical in terms of the probability of disease. However, risk-factor information, such as past behavior and presence of symptoms, is very often available on each individual to allow one to estimate individual-specific probabilities. The purpose of our paper is to propose the first group testing algorithms for multiplex assays that take advantage of individual risk-factor information as expressed by these probabilities. We show that our methods significantly reduce the number of tests required while preserving accuracy. Throughout this paper, we focus on applying our methods with the Aptima Combo 2 Assay that is used worldwide for chlamydia and gonorrhea screening.
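For intuition about why the individual-specific probabilities matter, consider the simplest single-disease Dorfman scheme (a deliberately simplified stand-in for the multiplex algorithms studied in the paper): testing a master pool of size n with independent infection probabilities p_1, …, p_n, followed by individual retests only if the pool is positive, requires on average

E[T] = 1 + n \Big( 1 - \prod_{i=1}^{n} (1 - p_i) \Big)

tests, so grouping low-risk individuals together keeps the pool-positivity probability, and hence the expected retesting, small; the proposed algorithms exploit the same heterogeneity for two-disease (multiplex) assays.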