
Showing papers in "Biometrics in 2021"


Journal ArticleDOI
TL;DR: This work studies a generalization of the analysis of variance variable importance measure and discusses how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features.
Abstract: In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often suboptimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. This measure is a property of the true data-generating mechanism. Specifically, we discuss a generalization of the analysis of variance variable importance measure and discuss how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. The importance of each feature or group of features in the data can then be described individually, using this measure. We describe how to construct an efficient estimator of this measure as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa.
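
For readers who want the target quantity in symbols: one common ANOVA-type formalization of the importance of a feature subset s, consistent with (though not necessarily identical to) the measure studied here, is the drop in explained outcome variance when the features in s are removed,

    \psi_s = \frac{E\big[\{E(Y \mid X) - E(Y \mid X_{-s})\}^2\big]}{\operatorname{Var}(Y)},

where X_{-s} denotes the covariate vector with the features in s removed; both conditional means can be estimated with flexible machine learning, which is what makes the measure technique-agnostic.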

64 citations


Journal ArticleDOI
TL;DR: Theory from renewal processes is partially adopted by considering the incubation period as the inter-arrival time, and the duration between departure from Wuhan and onset of symptoms as a mixture of forward time and inter-arrival time with censored intervals.
Abstract: The incubation period and generation time are key characteristics in the analysis of infectious diseases. The commonly used contact-tracing-based estimation of incubation distribution is highly influenced by the individuals' judgment on the possible date of exposure and might lead to significant errors. On the other hand, interval censoring-based methods are able to utilize a much larger set of traveling data but may encounter biased sampling problems. The distribution of generation time is usually approximated by observed serial intervals. However, this may result in a biased estimation of generation time, especially when the disease is infectious during incubation. In this paper, the theory of renewal processes is partially adopted by considering the incubation period as the interarrival time, and the duration between departure from Wuhan and onset of symptoms as the mixture of forward time and interarrival time with censored intervals. In addition, a consistent estimator for the distribution of generation time based on the incubation period and serial interval is proposed for diseases that are infectious during incubation. A real case application to the current outbreak of COVID-19 is implemented. We find that the incubation period has a median of 8.50 days (95% confidence interval [CI] [7.22; 9.15]). The basic reproduction number in the early phase of the COVID-19 outbreak based on the proposed generation time estimation is estimated to be 2.96 (95% CI [2.15; 3.86]).
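
As background on the renewal-process device (standard renewal theory, not a result specific to this paper): if the incubation period T has distribution function F and mean \mu, the forward (residual) time observed for a traveler infected before departure has density

    g(t) = \frac{1 - F(t)}{\mu},

so the duration between departure and symptom onset is naturally modeled as a mixture of this forward-time distribution and the incubation distribution itself, subject to interval censoring; the precise mixture weights and censoring scheme are those specified in the paper.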

39 citations


Journal ArticleDOI
TL;DR: The proposed zero‐inflated Poisson factor analysis model provides valuable insights into the relation between subgingival microbiome and periodontal disease and an efficient and robust expectation‐maximization algorithm for parameter estimation is developed.
Abstract: Dimension reduction of high-dimensional microbiome data facilitates subsequent analysis such as regression and clustering. Most existing reduction methods cannot fully accommodate the special features of the data such as count-valued and excessive zero reads. We propose a zero-inflated Poisson factor analysis model in this paper. The model assumes that microbiome read counts follow zero-inflated Poisson distributions with library size as offset and Poisson rates negatively related to the inflated zero occurrences. The latent parameters of the model form a low-rank matrix consisting of interpretable loadings and low-dimensional scores that can be used for further analyses. We develop an efficient and robust expectation-maximization algorithm for parameter estimation. We demonstrate the efficacy of the proposed method using comprehensive simulation studies. The application to the Oral Infections, Glucose Intolerance, and Insulin Resistance Study provides valuable insights into the relation between subgingival microbiome and periodontal disease.
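
As a rough illustration of the zero-inflated Poisson building block (a minimal sketch only; the low-rank factor structure, library-size offsets, and EM updates are as described above and are not reproduced here):

    import numpy as np
    from scipy.stats import poisson

    def zip_loglik(y, rate, pi_zero):
        """Log-likelihood of counts y under a zero-inflated Poisson:
        with probability pi_zero a count is a structural zero,
        otherwise it is Poisson(rate)."""
        y = np.asarray(y)
        logp_pois = poisson.logpmf(y, rate)
        # zeros may come from either component; positive counts only from the Poisson part
        ll_zero = np.log(pi_zero + (1 - pi_zero) * np.exp(logp_pois))
        ll_pos = np.log1p(-pi_zero) + logp_pois
        return np.where(y == 0, ll_zero, ll_pos).sum()

    counts = np.array([0, 0, 3, 0, 7, 0, 1])   # toy counts with excess zeros
    print(zip_loglik(counts, rate=2.0, pi_zero=0.4))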

22 citations


Journal ArticleDOI
TL;DR: In this article, a Bayesian inverse problem approach applied to UK data on first-wave Covid-19 deaths and the disease duration distribution suggests that fatal infections were in decline before full UK lockdown (24 March 2020).
Abstract: The number of new infections per day is a key quantity for effective epidemic management. It can be estimated relatively directly by testing of random population samples. Without such direct epidemiological measurement, other approaches are required to infer whether the number of new cases is likely to be increasing or decreasing: for example, estimating the pathogen-effective reproduction number, R, using data gathered from the clinical response to the disease. For coronavirus disease 2019 (Covid-19/SARS-Cov-2), such R estimation is heavily dependent on modelling assumptions, because the available clinical case data are opportunistic observational data subject to severe temporal confounding. Given this difficulty, it is useful to retrospectively reconstruct the time course of infections from the least compromised available data, using minimal prior assumptions. A Bayesian inverse problem approach applied to UK data on first-wave Covid-19 deaths and the disease duration distribution suggests that fatal infections were in decline before full UK lockdown (24 March 2020), and that fatal infections in Sweden started to decline only a day or two later. An analysis of UK data using the model of Flaxman et al. gives the same result under relaxation of its prior assumptions on R, suggesting an enhanced role for non-pharmaceutical interventions short of full lockdown in the UK context. Similar patterns appear to have occurred in the subsequent two lockdowns.
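
The forward model that the Bayesian inverse-problem formulation inverts is, at heart, a convolution of daily fatal infections with the infection-to-death duration distribution. A minimal sketch of that forward map follows; the delay distribution here is made up purely for illustration and is not the one used in the paper.

    import numpy as np

    def expected_deaths(fatal_infections, delay_pmf):
        """Expected deaths per day: convolve daily fatal infections with a
        discrete infection-to-death delay distribution."""
        n = len(fatal_infections)
        deaths = np.zeros(n)
        for t in range(n):
            for d, p in enumerate(delay_pmf):
                if t - d >= 0:
                    deaths[t] += fatal_infections[t - d] * p
        return deaths

    infections = np.linspace(100, 10, 60)                   # toy declining curve
    delay_pmf = np.array([0.0, 0.1, 0.2, 0.3, 0.25, 0.15])  # illustrative only
    print(expected_deaths(infections, delay_pmf)[:10])

The paper's contribution is the inverse direction, inferring the infection curve from observed deaths under minimal prior assumptions; the snippet only shows the forward map.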

20 citations


Journal ArticleDOI
TL;DR: In this article, the authors tailor post-selection inference methods toward changepoint detection, focusing on copy number variation data, and implement some of the latest developments in postselection inference theory, mainly auxiliary randomization.
Abstract: Changepoint detection methods are used in many areas of science and engineering, for example, in the analysis of copy number variation data to detect abnormalities in copy numbers along the genome. Despite the broad array of available tools, methodology for quantifying our uncertainty in the strength (or the presence) of given changepoints post-selection is lacking. Post-selection inference offers a framework to fill this gap, but the most straightforward application of these methods results in low-powered hypothesis tests and leaves open several important questions about practical usability. In this work, we carefully tailor post-selection inference methods toward changepoint detection, focusing on copy number variation data. To accomplish this, we study commonly used changepoint algorithms: binary segmentation, as well as two of its most popular variants, wild and circular, and the fused lasso. We implement some of the latest developments in post-selection inference theory, mainly auxiliary randomization. This improves power but requires Markov chain Monte Carlo algorithms (importance sampling and hit-and-run sampling) to carry out our tests. We also provide recommendations for improving practical usability, detailed simulations, and example analyses on array comparative genomic hybridization as well as sequencing data.
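
For orientation, the selection event that the post-selection machinery conditions on comes from algorithms like binary segmentation; a bare-bones CUSUM-based version is sketched below (illustration only, with none of the selective inference, auxiliary randomization, or MCMC components discussed above):

    import numpy as np

    def cusum(x, s, e, b):
        """Standard CUSUM contrast for a mean change at position b within x[s:e]."""
        n1, n2 = b - s + 1, e - b
        left, right = x[s:b + 1].mean(), x[b + 1:e + 1].mean()
        return np.sqrt(n1 * n2 / (n1 + n2)) * abs(left - right)

    def binary_segmentation(x, s, e, threshold, found):
        """Recursively split x[s:e] at the largest CUSUM exceeding threshold."""
        if e - s < 2:
            return found
        stats = [cusum(x, s, e, b) for b in range(s, e)]
        b = s + int(np.argmax(stats))
        if stats[b - s] > threshold:
            found.append(b)
            binary_segmentation(x, s, b, threshold, found)
            binary_segmentation(x, b + 1, e, threshold, found)
        return found

    x = np.concatenate([np.zeros(50), 2 + np.zeros(50)]) + np.random.normal(size=100)
    print(sorted(binary_segmentation(x, 0, len(x) - 1, threshold=3.0, found=[])))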

20 citations


Journal ArticleDOI
TL;DR: Simulation and two applications on brain signal studies confirm the excellent performance of the proposed method including a better prediction accuracy than the competitors and the scientific interpretability of the solution.
Acknowledgments: The authors thank the editor, the associate editor, and two referees for their constructive and helpful comments on the earlier version of this paper. Shen's research was partially supported by the Simons Foundation Award 512620 and the NSF Grant DMS 1509023. Hu's effort was partially supported by the National Institutes of Health Grants R01AI143886, R01CA219896, and CCSG P30 CA013696. Fortin's research was supported by NIH grant R01-MH115697, NSF awards IOS-1150292 and BCS-1439267, and Whitehall Foundation award 2010-05-84. Frostig's research was partially supported by the Leducq Foundation.

19 citations


Journal ArticleDOI
TL;DR: This work proposes a repeated measures random forest (RMRF) algorithm that can handle nonlinear relationships and interactions and the correlated responses from patients evaluated over several nights and finds that nocturnal hypoglycemia is associated with HbA1c, bedtime blood glucose (BG), insulin on board, time system activated, exercise intensity, and daytime hypoglycemia.
Abstract: Nocturnal hypoglycemia is a common phenomenon among patients with diabetes and can lead to a broad range of adverse events and complications. Identifying factors associated with hypoglycemia can improve glucose control and patient care. We propose a repeated measures random forest (RMRF) algorithm that can handle nonlinear relationships and interactions and the correlated responses from patients evaluated over several nights. Simulation results show that our proposed algorithm captures the informative variable more often than naively assuming independence. RMRF also outperforms standard random forest and extremely randomized trees algorithms. We demonstrate scenarios where RMRF attains greater prediction accuracy than generalized linear models. We apply the RMRF algorithm to analyze a diabetes study with 2524 nights from 127 patients with type 1 diabetes. We find that nocturnal hypoglycemia is associated with HbA1c, bedtime blood glucose (BG), insulin on board, time system activated, exercise intensity, and daytime hypoglycemia. The RMRF can accurately classify nights at high risk of nocturnal hypoglycemia.
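
The RMRF algorithm itself is the paper's contribution and is not reproduced here. For reference, a standard random forest baseline that at least respects patient-level clustering when estimating prediction accuracy (the kind of comparator the authors outperform) could be set up as below; all variable names and dimensions are made up.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(0)
    n_nights, n_features = 500, 6
    X = rng.normal(size=(n_nights, n_features))        # e.g., HbA1c, bedtime BG, ...
    y = rng.integers(0, 2, size=n_nights)              # nocturnal hypoglycemia yes/no
    patient_id = rng.integers(0, 50, size=n_nights)    # repeated nights per patient

    # group-wise CV keeps all nights from a patient in the same fold, so accuracy
    # is not inflated by within-patient correlation
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    scores = cross_val_score(rf, X, y, groups=patient_id, cv=GroupKFold(n_splits=5))
    print(scores.mean())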

18 citations



Journal ArticleDOI
TL;DR: A new variance estimator is proposed that combines the estimation procedures for the hazard ratio and weights using stacked estimating equations, with additional adjustments for the sum of terms that are not independently and identically distributed in a Cox partial likelihood score equation.
Abstract: Inverse probability weighted Cox models can be used to estimate marginal hazard ratios under different point treatments in observational studies. To obtain variance estimates, the robust sandwich variance estimator is often recommended to account for the induced correlation among weighted observations. However, this estimator does not incorporate the uncertainty in estimating the weights and tends to overestimate the variance, leading to inefficient inference. Here we propose a new variance estimator that combines the estimation procedures for the hazard ratio and weights using stacked estimating equations, with additional adjustments for the sum of terms that are not independently and identically distributed in a Cox partial likelihood score equation. We prove analytically that the robust sandwich variance estimator is conservative and establish the asymptotic equivalence between the proposed variance estimator and one obtained through linearization by Hajage et al. in 2018. In addition, we extend our proposed variance estimator to accommodate clustered data. We compare the finite sample performance of the proposed method with alternative methods through simulation studies. We illustrate these different variance methods in both independent and clustered data settings, using a bariatric surgery dataset and a multiple readmission dataset, respectively. To facilitate implementation of the proposed method, we have developed an R package ipwCoxCSV.
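
The first stage, estimating stabilized inverse probability of treatment weights from a propensity model, is standard and might look like the sketch below (hypothetical column names); the weighted Cox fit and the corrected variance described above are then obtained with the authors' R package ipwCoxCSV rather than reimplemented here.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def stabilized_iptw(df, treatment, covariates):
        """Stabilized IPT weights from a pandas DataFrame df: marginal treatment
        probability divided by the fitted propensity score (or its complement),
        depending on the treatment arm."""
        ps = LogisticRegression(max_iter=1000).fit(
            df[covariates], df[treatment]).predict_proba(df[covariates])[:, 1]
        p_treat = df[treatment].mean()
        a = df[treatment].to_numpy()
        return np.where(a == 1, p_treat / ps, (1 - p_treat) / (1 - ps))

    # hypothetical usage:
    # df["w"] = stabilized_iptw(df, "treated", ["age", "sex", "bmi"])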

17 citations


Journal ArticleDOI
TL;DR: An application of both noniterative and iterative conditional expectation to answer "when to start" treatment questions using data from the HIV-CAUSAL Collaboration is described, and it is shown that both are at least as efficient as the classical iterative conditional expectation estimator.
Abstract: The g-formula can be used to estimate the survival curve under a sustained treatment strategy. Two available estimators of the g-formula are noniterative conditional expectation and iterative conditional expectation. We propose a version of the iterative conditional expectation estimator and describe its procedures for deterministic and random treatment strategies. Also, because little is known about the comparative performance of noniterative and iterative conditional expectation estimators, we explore their relative efficiency via simulation studies. Our simulations show that, in the absence of model misspecification and unmeasured confounding, our proposed iterative conditional expectation estimator and the noniterative conditional expectation estimator are similarly efficient, and that both are at least as efficient as the classical iterative conditional expectation estimator. We describe an application of both noniterative and iterative conditional expectation to answer "when to start" treatment questions using data from the HIV-CAUSAL Collaboration.
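
For concreteness, the iterative conditional expectation (ICE) recursion for the g-formula under a strategy g can be written, in generic notation (indexing conventions and the paper's specific variant may differ), as

    \hat{Q}_{K+1} = Y, \qquad
    \hat{Q}_k = \hat{E}\big[\hat{Q}_{k+1} \mid \bar{L}_k, \bar{A}_k = \bar{g}_k\big],
    \quad k = K, K-1, \dots, 0,

with the g-formula estimate given by the sample average of \hat{Q}_0; the noniterative estimator instead plugs estimated conditional densities of the covariates directly into the g-formula sum or integral.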

15 citations


Journal ArticleDOI
TL;DR: This paper uses the continuation-ratio (CR) model to characterize the trinomial response outcomes and the cause-specific hazard rate method to model the competing-risk survival outcomes and develops a Bayesian data augmentation method to impute the missing data from the observations.
Abstract: Early-phase dose-finding clinical trials are often subject to the issue of late-onset outcomes. In phase I/II clinical trials, the issue becomes more intractable because toxicity and efficacy can be competing risk outcomes such that the occurrence of the first outcome will terminate the other one. In this paper, we propose a novel Bayesian adaptive phase I/II clinical trial design to address the issue of late-onset competing risk outcomes. We use the continuation-ratio model to characterize the trinomial response outcomes and the cause-specific hazard rate method to model the competing-risk survival outcomes. We treat the late-onset outcomes as missing data and develop a Bayesian data augmentation method to impute the missing data from the observations. We also propose an adaptive dose-finding algorithm to allocate patients and identify the optimal biological dose during the trial. Simulation studies show that the proposed design yields desirable operating characteristics.
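
The continuation-ratio model referred to above has, in its generic form (covariates, time-to-event components, and priors as in the paper), a sequence of conditional logits for the ordered trinomial outcome Y in {0, 1, 2} at dose d:

    \operatorname{logit} P(Y = j \mid Y \ge j, d) = \alpha_j + \beta_j d, \qquad j = 0, 1.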

Journal ArticleDOI
TL;DR: In this paper, the authors frame the design of an automatic algorithmic change protocol (aACP) as an online hypothesis testing problem, and investigate how repeated testing and adoption of modifications might lead to gradual deterioration in prediction accuracy.
Abstract: Successful deployment of machine learning algorithms in healthcare requires careful assessments of their performance and safety. To date, the FDA approves locked algorithms prior to marketing and requires future updates to undergo separate premarket reviews. However, this negates a key feature of machine learning: the ability to learn from a growing dataset and improve over time. This paper frames the design of an approval policy, which we refer to as an automatic algorithmic change protocol (aACP), as an online hypothesis testing problem. As this process has obvious analogy with noninferiority testing of new drugs, we investigate how repeated testing and adoption of modifications might lead to gradual deterioration in prediction accuracy, also known as "biocreep" in the drug development literature. We consider simple policies that do not necessarily offer any error-rate guarantees, as well as policies that do provide error-rate control. For the latter, we define two online error-rates appropriate for this context: bad approval count (BAC) and bad approval and benchmark ratios (BABR). We control these rates in the simple setting of a constant population and data source using policies aACP-BAC and aACP-BABR, which combine alpha-investing, group-sequential, and gate-keeping methods. In simulation studies, biocreep regularly occurred when using policies with no error-rate guarantees, whereas aACP-BAC and aACP-BABR controlled the rate of biocreep without substantially impacting our ability to approve beneficial modifications.

Journal ArticleDOI
TL;DR: A continuous‐time hidden Markov model is developed to analyze longitudinal data accounting for irregular visits and different types of observations and focuses on Bayesian inference for the model, facilitated by an expectation‐maximization algorithm and Markov chain Monte Carlo methods.
Abstract: Large amounts of longitudinal health records are now available for dynamic monitoring of the underlying processes governing the observations. However, the health status progression across time is not typically observed directly: records are observed only when a subject interacts with the system, yielding irregular and often sparse observations. This suggests that the observed trajectories should be modeled via a latent continuous-time process potentially as a function of time-varying covariates. We develop a continuous-time hidden Markov model to analyze longitudinal data accounting for irregular visits and different types of observations. By employing a specific missing data likelihood formulation, we can construct an efficient computational algorithm. We focus on Bayesian inference for the model: this is facilitated by an expectation-maximization algorithm and Markov chain Monte Carlo methods. Simulation studies demonstrate that these approaches can be implemented efficiently for large data sets in a fully Bayesian setting. We apply this model to a real cohort where patients suffer from chronic obstructive pulmonary disease with the outcome being the number of drugs taken, using health care utilization indicators and patient characteristics as covariates.
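
A core computation in any continuous-time hidden Markov model is converting a generator (rate) matrix Q into transition probabilities over the arbitrary gap between two visits, P(t) = exp(Qt). A minimal illustration with a toy three-state generator (not the fitted model):

    import numpy as np
    from scipy.linalg import expm

    # toy generator: rows sum to zero, off-diagonal entries are transition rates
    Q = np.array([[-0.30,  0.20,  0.10],
                  [ 0.05, -0.15,  0.10],
                  [ 0.00,  0.08, -0.08]])

    def transition_matrix(Q, t):
        """P(t) = expm(Q * t): state-to-state probabilities after elapsed time t."""
        return expm(Q * t)

    print(transition_matrix(Q, 2.5))   # e.g., 2.5 time units between two visits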

Journal ArticleDOI
TL;DR: An adaptive estimation procedure is developed, which uses the combined information to determine the degree of information borrowing from the aggregate data of the external resource, and yields a substantial gain in statistical efficiency over the conventional method using the primary cohort only.
Abstract: In comparative effectiveness research (CER) for rare types of cancer, it is appealing to combine primary cohort data containing detailed tumor profiles together with aggregate information derived from cancer registry databases. Such integration of data may improve statistical efficiency in CER. A major challenge in combining information from different resources, however, is that the aggregate information from the cancer registry databases could be incomparable with the primary cohort data, which are often collected from a single cancer center or a clinical trial. We develop an adaptive estimation procedure, which uses the combined information to determine the degree of information borrowing from the aggregate data of the external resource. We establish the asymptotic properties of the estimators and evaluate the finite sample performance via simulation studies. The proposed method yields a substantial gain in statistical efficiency over the conventional method using the primary cohort only, and avoids undesirable biases when the given external information is incomparable to the primary cohort. We apply the proposed method to evaluate the long-term effect of trimodality treatment to inflammatory breast cancer (IBC) by tumor subtypes, while combining the IBC patient cohort at The University of Texas MD Anderson Cancer Center and the external aggregate information from the National Cancer Data Base.

Journal ArticleDOI
TL;DR: A two‐step compositional knockoff filter is proposed to provide the effective finite‐sample false discovery rate (FDR) control in high‐dimensional linear log‐contrast regression analysis of microbiome compositional data.
Abstract: A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter to provide the effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate these properties in numerical simulation studies to compare our methods to some existing ones and show power gain of the new method while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease data set to identify microbial taxa that influence host gene expressions.
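
The linear log-contrast model that both steps operate on takes, for a composition x_i = (x_{i1}, ..., x_{ip}) and response y_i, the form

    y_i = \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i,
    \qquad \text{subject to } \sum_{j=1}^{p} \beta_j = 0,

and preserving the sum-to-zero constraint is precisely what the proposed screening step is designed to do.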

Journal ArticleDOI
TL;DR: This paper proposes a fully Bayesian methodology to make inference on the causal effects of any intervention in the system and demonstrates the merits of the methodology in simulation studies, wherein comparisons with current state‐of‐the‐art procedures turn out to be highly satisfactory.
Abstract: We assume that multivariate observational data are generated from a distribution whose conditional independencies are encoded in a Directed Acyclic Graph (DAG). For any given DAG, the causal effect of a variable onto another one can be evaluated through intervention calculus. A DAG is typically not identifiable from observational data alone. However, its Markov equivalence class (a collection of DAGs) can be estimated from the data. As a consequence, for the same intervention a set of causal effects, one for each DAG in the equivalence class, can be evaluated. In this paper, we propose a fully Bayesian methodology to make inference on the causal effects of any intervention in the system. Main features of our method are: (a) both uncertainty on the equivalence class and the causal effects are jointly modeled; (b) priors on the parameters of the modified Cholesky decomposition of the precision matrices across all DAG models are constructively assigned starting from a unique prior on the complete (unrestricted) DAG; (c) an efficient algorithm to sample from the posterior distribution on graph space is adopted; (d) an objective Bayes approach, requiring virtually no user specification, is used throughout. We demonstrate the merits of our methodology in simulation studies, wherein comparisons with current state-of-the-art procedures turn out to be highly satisfactory. Finally we examine a real data set of gene expressions for Arabidopsis thaliana.

Journal ArticleDOI
TL;DR: A penalized fusion approach for heterogeneity analysis based on the Gaussian Graphical Model (GGM) applies penalization to the mean and precision matrix parameters to generate regularized and interpretable estimates.
Abstract: Heterogeneity is a hallmark of cancer, diabetes, cardiovascular diseases, and many other complex diseases. This study has been partly motivated by the unsupervised heterogeneity analysis for complex diseases based on molecular and imaging data, for which, network-based analysis, by accommodating the interconnections among variables, can be more informative than that limited to mean, variance, and other simple distributional properties. In the literature, there has been very limited research on network-based heterogeneity analysis, and a common limitation shared by the existing techniques is that the number of subgroups needs to be specified a priori or in an ad hoc manner. In this article, we develop a penalized fusion approach for heterogeneity analysis based on the Gaussian graphical model. It applies penalization to the mean and precision matrix parameters to generate regularized and interpretable estimates. More importantly, a fusion penalty is imposed to "automatedly" determine the number of subgroups and generate more concise, reliable, and interpretable estimation. Consistency properties are rigorously established, and an effective computational algorithm is developed. The heterogeneity analysis of non-small-cell lung cancer based on single-cell gene expression data of the Wnt pathway and that of lung adenocarcinoma based on histopathological imaging data not only demonstrate the practical applicability of the proposed approach but also lead to interesting new findings.

Journal ArticleDOI
TL;DR: This work proposes graphical proportional hazards measurement error models, and develops inferential procedures for the parameters of interest that significantly enlarge the scope of the usual Cox PH model and have great flexibility in characterizing survival data.
Abstract: In survival data analysis, the Cox proportional hazards (PH) model is perhaps the most widely used model to feature the dependence of survival times on covariates. While many inference methods have been developed under such a model or its variants, those models are not adequate for handling data with complex structured covariates. High-dimensional survival data often entail several features: (1) many covariates are inactive in explaining the survival information, (2) active covariates are associated in a network structure, and (3) some covariates are error-contaminated. To handle such kinds of survival data, we propose graphical PH measurement error models and develop inferential procedures for the parameters of interest. Our proposed models significantly enlarge the scope of the usual Cox PH model and have great flexibility in characterizing survival data. Theoretical results are established to justify the proposed methods. Numerical studies are conducted to assess the performance of the proposed methods.

Journal ArticleDOI
TL;DR: In this article, the authors proposed an elastic prior approach to control the behavior of information borrowing and type I errors by incorporating a well-known concept of clinically significant difference through an elastic function, defined as a monotonic function of a congruence measure between historical data and trial data.
Abstract: Use of historical data and real-world evidence holds great potential to improve the efficiency of clinical trials. One major challenge is to effectively borrow information from historical data while maintaining a reasonable type I error and minimal bias. We propose the elastic prior approach to address this challenge. Unlike existing approaches, this approach proactively controls the behavior of information borrowing and type I errors by incorporating a well-known concept of clinically significant difference through an elastic function, defined as a monotonic function of a congruence measure between historical data and trial data. The elastic function is constructed to satisfy a set of prespecified criteria such that the resulting prior will strongly borrow information when historical and trial data are congruent, but refrain from information borrowing when historical and trial data are incongruent. The elastic prior approach has the desirable property of being information-borrowing consistent, i.e., it asymptotically controls the type I error at the nominal value whether or not the historical data are congruent with the trial data. Our simulation study evaluating finite-sample characteristics confirms that, compared to existing methods, the elastic prior has better type I error control and yields competitive or higher power. The proposed approach is applicable to binary, continuous, and survival endpoints.

Journal ArticleDOI
TL;DR: In this paper, a hierarchical group spike and slab prior for logistic regression models with group-structured covariates is proposed to establish strong group selection consistency of the induced posterior, which is the first theoretical result in the Bayesian literature.
Abstract: We consider Bayesian logistic regression models with group-structured covariates. In high-dimensional settings, it is often assumed that only a small portion of groups are significant, and thus, consistent group selection is of significant importance. While consistent frequentist group selection methods have been proposed, theoretical properties of Bayesian group selection methods for logistic regression models have not been investigated yet. In this paper, we consider a hierarchical group spike and slab prior for logistic regression models in high-dimensional settings. Under mild conditions, we establish strong group selection consistency of the induced posterior, which is the first theoretical result in the Bayesian literature. Through simulation studies, we demonstrate that the proposed method outperforms existing state-of-the-art methods in various settings. We further apply our method to a magnetic resonance imaging data set for predicting Parkinson's disease and show its benefits over other contenders.
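
A generic group spike-and-slab prior of the kind referred to above (the paper's hierarchical specification adds further structure) places, on each group of coefficients \beta_g,

    \beta_g \mid z_g \sim z_g \, \mathrm{N}\big(0, \tau^2 I_{|g|}\big) + (1 - z_g)\, \delta_0,
    \qquad z_g \sim \mathrm{Bernoulli}(\pi),

so that z_g = 0 zeroes out the entire group and the posterior over the indicators z_g drives group selection.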

Journal ArticleDOI
TL;DR: In this paper, the null variance of the test statistic is derived using U-statistic theory in terms of a dispersion parameter called the standard rank deviation, an intrinsic characteristic of the null outcome distribution and the user-defined rule of comparison.
Abstract: Originally proposed for the analysis of prioritized composite endpoints, the win ratio has now expanded into a broad class of methodology based on general pairwise comparisons. Complicated by the non-i.i.d. structure of the test statistic, however, sample size estimation for the win ratio has lagged behind. In this article, we develop general and easy-to-use formulas to calculate sample size for win ratio analysis of different outcome types. In a nonparametric setting, the null variance of the test statistic is derived using U-statistic theory in terms of a dispersion parameter called the standard rank deviation, an intrinsic characteristic of the null outcome distribution and the user-defined rule of comparison. The effect size can be hypothesized either on the original scale of the population win ratio, or on the scale of a "usual" effect size suited to the outcome type. The latter approach allows one to measure the effect size by, for example, odds/continuation ratio for totally/partially ordered outcomes and hazard ratios for composite time-to-event outcomes. Simulation studies show that the derived formulas provide accurate estimates for the required sample size across different settings. As illustration, real data from two clinical studies of hepatic and cardiovascular diseases are used as pilot data to calculate sample sizes for future trials.
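
To fix ideas about the estimand: given a user-defined pairwise rule that declares a treatment patient the winner, the loser, or tied against a control patient, the sample win ratio is simply total wins divided by total losses. A minimal sketch with a toy rule (larger outcome wins) and made-up data:

    import numpy as np

    def win_ratio(treatment, control, compare):
        """compare(t, c) returns 1 if t wins, -1 if t loses, 0 if tied."""
        wins = losses = 0
        for t in treatment:
            for c in control:
                result = compare(t, c)
                wins += (result == 1)
                losses += (result == -1)
        return wins / losses

    higher_wins = lambda t, c: int(np.sign(t - c))
    trt = np.array([5.1, 6.3, 4.8, 7.0])
    ctl = np.array([4.9, 5.0, 3.2, 6.1])
    print(win_ratio(trt, ctl, higher_wins))   # values above 1 favor treatment

The sample size formulas described above then combine a hypothesized effect size on this (or a more familiar) scale with the standard rank deviation of the null outcome distribution.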

Journal ArticleDOI
TL;DR: In this article, a generalized robust allele-based regression model with individual allele as the response variable was proposed to develop a new generalized robust allelic test, and the score test statistic derived from this robust and unifying regression framework contains a correction factor that explicitly adjusts for potential departure from the Hardy-Weinberg equilibrium (HWE) assumption.
Abstract: The allele-based association test, comparing allele frequency difference between case and control groups, is locally most powerful. However, application of the classical allelic test is limited in practice, because the method is sensitive to the Hardy-Weinberg equilibrium (HWE) assumption, not applicable to continuous traits, and not easy to account for covariate effect or sample correlation. To develop a generalized robust allelic test, we propose a new allele-based regression model with individual allele as the response variable. We show that the score test statistic derived from this robust and unifying regression framework contains a correction factor that explicitly adjusts for potential departure from HWE and encompasses the classical allelic test as a special case. When the trait of interest is continuous, the corresponding allelic test evaluates a weighted difference between individual-level allele frequency estimate and sample estimate where the weight is proportional to an individual's trait value, and the test remains valid under Y-dependent sampling. Finally, the proposed allele-based method can analyze multiple (continuous or binary) phenotypes simultaneously and multiallelic genetic markers, while accounting for covariate effect, sample correlation, and population heterogeneity. To support our analytical findings, we provide empirical evidence from both simulation and application studies.
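
For reference, the classical allelic test that this work generalizes compares case and control allele frequencies \hat{p}_1 and \hat{p}_2, estimated from n_1 and n_2 individuals (2n_1 and 2n_2 alleles), via

    Z = \frac{\hat{p}_1 - \hat{p}_2}
             {\sqrt{\hat{p}(1 - \hat{p})\left(\tfrac{1}{2 n_1} + \tfrac{1}{2 n_2}\right)}},

where \hat{p} is the pooled allele frequency; this null variance is valid only under HWE, which is exactly the sensitivity the proposed regression formulation adjusts for.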

Journal ArticleDOI
TL;DR: This work proposes a new distance-based ICC (dbICC), defined in terms of arbitrary distances among observations, and extends the Spearman-Brown formula, which shows how more intensive measurement increases reliability, to encompass the dbICC.
Abstract: The intraclass correlation coefficient (ICC) is a classical index of measurement reliability. With the advent of new and complex types of data for which the ICC is not defined, there is a need for new ways to assess reliability. To meet this need, we propose a new distance-based ICC (dbICC), defined in terms of arbitrary distances among observations. We introduce a bias correction to improve the coverage of bootstrap confidence intervals for the dbICC, and demonstrate its efficacy via simulation. We illustrate the proposed method by analyzing the test-retest reliability of brain connectivity matrices derived from a set of repeated functional magnetic resonance imaging scans. The Spearman-Brown formula, which shows how more intensive measurement increases reliability, is extended to encompass the dbICC.
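
The exact dbICC definition and its bias correction are given in the paper; the flavor of a distance-based reliability index, contrasting within-subject with overall dissimilarity, can be conveyed by the rough sketch below. This is one plausible formalization offered for intuition, not necessarily the authors' estimator.

    import numpy as np
    from itertools import combinations

    def distance_reliability(items, subject_ids, dist):
        """1 - (mean squared within-subject distance) / (mean squared distance over
        all pairs); values near 1 suggest high test-retest reliability."""
        within, overall = [], []
        for (i, a), (j, b) in combinations(list(enumerate(items)), 2):
            d2 = dist(a, b) ** 2
            overall.append(d2)
            if subject_ids[i] == subject_ids[j]:
                within.append(d2)
        return 1.0 - np.mean(within) / np.mean(overall)

    # toy data: two noisy "connectivity matrices" per subject, Frobenius distance
    rng = np.random.default_rng(1)
    base = [rng.normal(size=(4, 4)) for _ in range(5)]
    scans = [b + 0.1 * rng.normal(size=(4, 4)) for b in base for _ in range(2)]
    ids = [s for s in range(5) for _ in range(2)]
    frobenius = lambda a, b: np.linalg.norm(a - b)
    print(distance_reliability(scans, ids, frobenius))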

Journal ArticleDOI
TL;DR: In this paper, a Bayesian multiple index model framework was proposed to combine the strengths of response-surface and exposure-index models, allowing for non-linear and non-additive relationships between exposure indices and a health outcome.
Abstract: An important goal of environmental health research is to assess the risk posed by mixtures of environmental exposures. Two popular classes of models for mixtures analyses are response-surface methods and exposure-index methods. Response-surface methods estimate high-dimensional surfaces and are thus highly flexible but difficult to interpret. In contrast, exposure-index methods decompose coefficients from a linear model into an overall mixture effect and individual index weights; these models yield easily interpretable effect estimates and efficient inferences when model assumptions hold, but, like most parsimonious models, incur bias when these assumptions do not hold. In this paper, we propose a Bayesian multiple index model framework that combines the strengths of each, allowing for non-linear and non-additive relationships between exposure indices and a health outcome, while reducing the dimensionality of the exposure vector and estimating index weights with variable selection. This framework contains response-surface and exposure-index models as special cases, thereby unifying the two analysis strategies. This unification increases the range of models possible for analysing environmental mixtures and health, allowing one to select an appropriate analysis from a spectrum of models varying in flexibility and interpretability. In an analysis of the association between telomere length and 18 organic pollutants in the National Health and Nutrition Examination Survey (NHANES), the proposed approach fits the data as well as more complex response-surface methods and yields more interpretable results.
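
Schematically, a multiple index model for an exposure vector x and health outcome y takes the form

    y = h\big(x^{\top} w_1, \, x^{\top} w_2, \, \dots, \, x^{\top} w_K\big) + \varepsilon,

with the index weights w_k estimated under variable selection and h modeled flexibly, so nonlinearity and interaction among indices are allowed. A single index with linear h recovers an exposure-index model, while letting every exposure form its own index recovers a response-surface model, which is the sense in which the framework unifies the two.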

Journal ArticleDOI
TL;DR: The study illustrates (a) the need for relevant process covariates and (b) the benefits of using externally estimated observation variances for inference about nonlinear stage‐structured SSMs.
Abstract: State-space models (SSMs) are a popular tool for modeling animal abundances. Inference difficulties for simple linear SSMs are well known, particularly in relation to simultaneous estimation of process and observation variances. Several remedies to overcome estimation problems have been studied for relatively simple SSMs, but whether these challenges and proposed remedies apply for nonlinear stage-structured SSMs, an important class of ecological models, is less well understood. Here we identify improvements for inference about nonlinear stage-structured SSMs fit with biased sequential life stage data. Theoretical analyses indicate parameter identifiability requires covariates in the state processes. Simulation studies show that plugging in externally estimated observation variances, as opposed to jointly estimating them with other parameters, reduces bias and standard error of estimates. In contrast to previous results for simple linear SSMs, strong confounding between jointly estimated process and observation variance parameters was not found in the models explored here. However, when observation variance was also estimated in the motivating case study, the resulting process variance estimates were implausibly low (near-zero). As SSMs are used in increasingly complex ways, understanding when inference can be expected to be successful, and what aids it, becomes more important. Our study illustrates (a) the need for relevant process covariates and (b) the benefits of using externally estimated observation variances for inference about nonlinear stage-structured SSMs.
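
In generic notation, the models in question pair a (possibly nonlinear, stage-structured) process equation with an observation equation, for example with additive errors on a suitable scale:

    n_t = f\big(n_{t-1}, x_t; \theta\big) + \epsilon_t, \qquad
    \epsilon_t \sim \mathrm{N}\big(0, \sigma^2_{\text{proc}}\big),
    y_t = n_t + \eta_t, \qquad
    \eta_t \sim \mathrm{N}\big(0, \sigma^2_{\text{obs}}\big).

The findings above concern when \sigma^2_{\text{proc}} and \sigma^2_{\text{obs}} can be separated: identifiability requires covariates x_t in the process equation, and plugging in an external estimate of \sigma^2_{\text{obs}} stabilizes estimation of the rest.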

Journal ArticleDOI
TL;DR: The authors proposed a unified framework to summarize various forms of aggregated information via estimating equations and developed a penalized empirical likelihood approach to incorporate such information in logistic regression when the homogeneity assumption is violated, and extended the method to account for population heterogeneity among different sources of information.
Abstract: With the increasing availability of data in the public domain, there has been a growing interest in exploiting information from external sources to improve the analysis of smaller scale studies. An emerging challenge in the era of big data is that the subject-level data are high dimensional, but the external information is at an aggregate level and of a lower dimension. Moreover, heterogeneity and uncertainty in the auxiliary information are often not accounted for in information synthesis. In this paper, we propose a unified framework to summarize various forms of aggregated information via estimating equations and develop a penalized empirical likelihood approach to incorporate such information in logistic regression. When the homogeneity assumption is violated, we extend the method to account for population heterogeneity among different sources of information. When the uncertainty in the external information is not negligible, we propose a variance estimator adjusting for the uncertainty. The proposed estimators are asymptotically more efficient than the conventional penalized maximum likelihood estimator and enjoy the oracle property even with a diverging number of predictors. Simulation studies show that the proposed approaches yield higher accuracy in variable selection compared with competitors. We illustrate the proposed methodologies with a pediatric kidney transplant study.

Journal ArticleDOI
TL;DR: A flexible and scalable modelling framework for case-crossover models with linear and semi-parametric effects which retains the flexibility and computational advantages of INLA is developed and applied to quantify non-linear associations between mortality and extreme temperatures in India.
Abstract: A case-crossover analysis is used as a simple but powerful tool for estimating the effect of short-term environmental factors such as extreme temperatures or poor air quality on mortality. The environment on the day of each death is compared to the one or more "control days" in previous weeks, and higher levels of exposure on death days than control days provide evidence of an effect. Current state-of-the-art methodology and software (integrated nested Laplace approximation [INLA]) cannot be used to fit the most flexible case-crossover models to large datasets, because the likelihood for case-crossover models cannot be expressed in a manner compatible with this methodology. In this paper, we develop a flexible and scalable modeling framework for case-crossover models with linear and semiparametric effects which retains the flexibility and computational advantages of INLA. We apply our method to quantify nonlinear associations between mortality and extreme temperatures in India. An R package implementing our methods is publicly available.
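
The building block that creates the computational difficulty is the conditional-logistic likelihood contribution of each death i, comparing exposure x on the death day with exposure on the days in the referent set R_i (the death day plus its control days):

    L_i(\beta) = \frac{\exp\big(x_{i,\text{death}}^{\top}\beta\big)}
                      {\sum_{d \in R_i} \exp\big(x_{id}^{\top}\beta\big)},

which does not have the latent Gaussian form that INLA exploits; the framework described above reworks the model so that linear and semiparametric effects can still be fit with INLA-style flexibility and speed.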

Journal ArticleDOI
TL;DR: In this paper, the authors present two approaches to estimate the birth, death, and growth rates of a discretely observed linear birth-and-death process: via an embedded Galton-Watson process and by maximizing a saddlepoint approximation to the likelihood.
Abstract: Birth-and-death processes are widely used to model the development of biological populations. Although they are relatively simple models, their parameters can be challenging to estimate, as the likelihood can become numerically unstable when data arise from the most common sampling schemes, such as annual population censuses. A further difficulty arises when the discrete observations are not equi-spaced, for example, when census data are unavailable for some years. We present two approaches to estimating the birth, death, and growth rates of a discretely observed linear birth-and-death process: via an embedded Galton-Watson process and by maximizing a saddlepoint approximation to the likelihood. We study asymptotic properties of the estimators, compare them on numerical examples, and apply the methodology to data on monitored populations.
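
To make the model concrete, a linear birth-and-death process with per-capita birth rate lambda and death rate mu can be simulated exactly with the Gillespie algorithm; annual snapshots of such a simulation are the kind of discretely observed census data the two estimators are aimed at (simulation only, not the estimators themselves):

    import numpy as np

    def simulate_linear_bd(n0, lam, mu, t_max, rng):
        """Exact (Gillespie) simulation of a linear birth-death process;
        returns event times and population sizes."""
        t, n = 0.0, n0
        times, sizes = [t], [n]
        while t < t_max and n > 0:
            total_rate = (lam + mu) * n
            t += rng.exponential(1.0 / total_rate)
            n += 1 if rng.random() < lam / (lam + mu) else -1
            times.append(t)
            sizes.append(n)
        return np.array(times), np.array(sizes)

    rng = np.random.default_rng(42)
    times, sizes = simulate_linear_bd(n0=20, lam=0.6, mu=0.5, t_max=10, rng=rng)
    # annual "census": population size at integer times 0, 1, ..., 10
    census = [sizes[np.searchsorted(times, yr, side="right") - 1] for yr in range(11)]
    print(census)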

Journal ArticleDOI
TL;DR: In this paper, a smoothed robust estimator was proposed to directly target the parameter corresponding to the Bayes decision rule for optimal treatment regimes estimation, which is shown to have an asymptotic normal distribution.
Abstract: We propose a new procedure for inference on optimal treatment regimes in the model-free setting, which does not require specifying an outcome regression model. Existing model-free estimators for optimal treatment regimes are usually not suitable for the purpose of inference, because they either have nonstandard asymptotic distributions or do not necessarily guarantee consistent estimation of the parameter indexing the Bayes rule due to the use of surrogate loss. We first study a smoothed robust estimator that directly targets the parameter corresponding to the Bayes decision rule for optimal treatment regimes estimation. This estimator is shown to have an asymptotic normal distribution. Furthermore, we verify that a resampling procedure provides asymptotically accurate inference for both the parameter indexing the optimal treatment regime and the optimal value function. A new algorithm is developed to calculate the proposed estimator with substantially improved speed and stability. Numerical results demonstrate the satisfactory performance of the new methods.
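
The smoothing idea can be stated schematically: in the inverse-probability-weighted value of a linear rule d(x) = 1{x'beta > 0}, the indicators are replaced by a smooth surrogate, for instance

    \hat{V}_n(\beta) = \frac{1}{n}\sum_{i=1}^{n}
      \frac{Y_i\Big[A_i\,\Phi\big(X_i^{\top}\beta / h_n\big)
            + (1 - A_i)\big\{1 - \Phi\big(X_i^{\top}\beta / h_n\big)\big\}\Big]}
           {A_i\,\pi(X_i) + (1 - A_i)\{1 - \pi(X_i)\}},

with \Phi a smooth distribution function, bandwidth h_n \to 0, and \pi the propensity score, so that the maximizer has a tractable (asymptotically normal) limit. This is a generic rendering of a smoothed value estimator; the exact estimator and resampling scheme are as described in the paper.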

Journal ArticleDOI
TL;DR: A variable selection technique with the following novel features: a generalized transformation and z‐prior to handle the compositional constraint, and an Ising prior that encourages the joint selection of microbiome features that are closely related in terms of their genetic sequence similarity.
Abstract: The microbiome plays a critical role in human health and disease, and there is a strong scientific interest in linking specific features of the microbiome to clinical outcomes. There are key aspects of microbiome data, however, that limit the applicability of standard variable selection methods. In particular, the observed data are compositional, as the counts within each sample have a fixed-sum constraint. In addition, microbiome features, typically quantified as operational taxonomic units (OTUs), often reflect microorganisms that are similar in function, and may therefore have a similar influence on the response variable. To address the challenges posed by these aspects of the data structure, we propose a variable selection technique with the following novel features: a generalized transformation and z-prior to handle the compositional constraint, and an Ising prior that encourages the joint selection of microbiome features that are closely related in terms of their genetic sequence similarity. We demonstrate that our proposed method outperforms existing penalized approaches for microbiome variable selection in both simulation and the analysis of real data exploring the relationship of the gut microbiome to body mass index (BMI).
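
The Ising prior on the vector of inclusion indicators gamma in {0,1}^p referred to above has the generic form

    p(\gamma) \propto \exp\!\Big( a \sum_{j} \gamma_j
        + b \sum_{j \sim k} \gamma_j \gamma_k \Big),

where j ~ k ranges over pairs of OTUs deemed similar (here, by genetic sequence similarity), b > 0 rewards selecting related features together, and a controls overall sparsity; the generalized transformation and z-prior handling the compositional constraint are specific to the paper and not reproduced here.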