
Showing papers in "Biometrics in 2018"


Journal ArticleDOI
TL;DR: A new measure is introduced, the skewness of the standardized deviates, to quantify publication bias, which describes the asymmetry of the collected studies' distribution and is illustrated using simulations and three case studies.
Abstract: Publication bias is a serious problem in systematic reviews and meta-analyses, which can affect the validity and generalization of conclusions. Currently, approaches to dealing with publication bias can be distinguished into two classes: selection models and funnel-plot-based methods. Selection models use weight functions to adjust the overall effect size estimate and are usually employed as sensitivity analyses to assess the potential impact of publication bias. Funnel-plot-based methods include visual examination of a funnel plot, regression and rank tests, and the nonparametric trim and fill method. Although these approaches have been widely used in applications, measures for quantifying publication bias are seldom studied in the literature. Such measures can be used as a characteristic of a meta-analysis; also, they permit comparisons of publication biases between different meta-analyses. Egger's regression intercept may be considered as a candidate measure, but it lacks an intuitive interpretation. This article introduces a new measure, the skewness of the standardized deviates, to quantify publication bias. This measure describes the asymmetry of the collected studies' distribution. In addition, a new test for publication bias is derived based on the skewness. Large sample properties of the new measure are studied, and its performance is illustrated using simulations and three case studies.

528 citations
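As a rough illustration of the skewness measure described in the abstract above, the sketch below computes standardized deviates from an inverse-variance pooled effect and takes their sample skewness. The pooling choice, the `deviate_skewness` helper, and the toy numbers are illustrative assumptions, not the authors' exact estimator or test.

```python
# Minimal sketch: skewness of standardized deviates as a publication-bias measure.
import numpy as np
from scipy.stats import skew

def deviate_skewness(effects, std_errors):
    """Skewness of standardized deviates for a meta-analysis.

    effects, std_errors: per-study effect estimates and standard errors.
    A fixed-effect (inverse-variance) pooled estimate is used here; the
    article's exact standardization may differ.
    """
    effects = np.asarray(effects, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    w = 1.0 / se**2
    pooled = np.sum(w * effects) / np.sum(w)   # inverse-variance pooled effect
    deviates = (effects - pooled) / se         # standardized deviates
    return skew(deviates, bias=False)          # sample skewness

# Strong asymmetry (e.g., only "significant" small studies reported)
# pushes the skewness away from zero.
print(deviate_skewness([0.10, 0.35, 0.60, 0.85, 1.20],
                       [0.05, 0.15, 0.25, 0.35, 0.45]))
```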


Journal ArticleDOI
TL;DR: It is concluded that under the constant p assumption reliable inference is only possible for relative abundance in the absence of questionable and/or untestable assumptions or with better quality data than seen in typical applications.
Abstract: N-mixture models describe count data replicated in time and across sites in terms of abundance N and detectability p. They are popular because they allow inference about N while controlling for factors that influence p without the need for marking animals. Using a capture-recapture perspective, we show that the loss of information that results from not marking animals is critical, making reliable statistical modeling of N and p problematic using just count data. One cannot reliably fit a model in which the detection probabilities are distinct among repeat visits, as this model is overspecified. This makes uncontrolled variation in p problematic. By counterexample, we show that even if p is constant after adjusting for covariate effects (the "constant p" assumption), scientifically plausible alternative models in which N (or its expectation) is non-identifiable or does not even exist as a parameter lead to data that are practically indistinguishable from data generated under an N-mixture model. This is particularly the case for sparse data, as is commonly seen in applications. We conclude that under the constant p assumption, reliable inference is only possible for relative abundance in the absence of questionable and/or untestable assumptions or with better quality data than seen in typical applications. Relative abundance models for counts can be readily fitted using Poisson regression in standard software such as R and are sufficiently flexible to allow controlling for p through the use of covariates while simultaneously modeling variation in relative abundance. If users require estimates of absolute abundance, they should collect auxiliary data that help with estimation of p.

186 citations
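The closing recommendation above, modeling relative abundance directly with Poisson regression while controlling for detectability through covariates, can be sketched as follows. The simulated data, covariate names, and the log-effort term are hypothetical choices, not taken from the article.

```python
# Minimal sketch: relative-abundance modeling of replicated counts via Poisson regression.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_sites, n_visits = 50, 3
habitat = np.repeat(rng.normal(size=n_sites), n_visits)   # site-level covariate for abundance
effort = rng.uniform(0.5, 1.5, size=n_sites * n_visits)   # visit-level covariate standing in for p
counts = rng.poisson(np.exp(0.2 + 0.5 * habitat) * effort)
df = pd.DataFrame({"y": counts, "habitat": habitat, "effort": effort})

# exp(coefficient on habitat) is a ratio of *relative* abundance; absolute N is not estimated.
fit = smf.glm("y ~ habitat + np.log(effort)", data=df, family=sm.families.Poisson()).fit()
print(fit.summary())
```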


Journal ArticleDOI
TL;DR: It is demonstrated that analytical power agrees well with simulated power for as few as eight clusters, when data are analyzed using bias-corrected estimating equations for the correlation parameters concurrently with a bias-corrected sandwich variance estimator.
Abstract: In stepped wedge cluster randomized trials, intact clusters of individuals switch from control to intervention from a randomly assigned period onwards. Such trials are becoming increasingly popular in health services research. When a closed cohort is recruited from each cluster for longitudinal follow-up, proper sample size calculation should account for three distinct types of intraclass correlations: the within-period, the inter-period, and the within-individual correlations. Setting the latter two correlation parameters to be equal accommodates cross-sectional designs. We propose sample size procedures for continuous and binary responses within the framework of generalized estimating equations that employ a block exchangeable within-cluster correlation structure defined from the distinct correlation types. For continuous responses, we show that the intraclass correlations affect power only through two eigenvalues of the correlation matrix. We demonstrate that analytical power agrees well with simulated power for as few as eight clusters, when data are analyzed using bias-corrected estimating equations for the correlation parameters concurrently with a bias-corrected sandwich variance estimator.

66 citations
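A small sketch of the correlation structure described above: build the block exchangeable within-cluster matrix from the three correlation types and inspect its distinct eigenvalues (the abstract notes that, for continuous responses, power depends on the matrix only through two of them). The function name and the numeric values below are illustrative assumptions.

```python
# Minimal sketch: block exchangeable within-cluster correlation matrix and its eigenvalues.
import numpy as np

def block_exchangeable(n_periods, n_subjects, within_period, inter_period, within_subject):
    """Correlation matrix for a closed cohort; rows/columns indexed by (period, subject).

    within_period  : different subjects, same period
    inter_period   : different subjects, different periods
    within_subject : same subject, different periods
    """
    I_t, J_t = np.eye(n_periods), np.ones((n_periods, n_periods))
    I_n, J_n = np.eye(n_subjects), np.ones((n_subjects, n_subjects))
    same_period = (1 - within_period) * I_n + within_period * J_n
    diff_period = (within_subject - inter_period) * I_n + inter_period * J_n
    return np.kron(I_t, same_period) + np.kron(J_t - I_t, diff_period)

R = block_exchangeable(4, 10, 0.03, 0.015, 0.2)
print(np.unique(np.round(np.linalg.eigvalsh(R), 6)))  # the few distinct eigenvalues
```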



Journal ArticleDOI
TL;DR: The relative efficiency of the hazard ratio and t-MST tests with respect to statistical power is assessed theoretically and empirically under various PH and non-PH models.
Abstract: In comparing two treatments with the event time observations, the hazard ratio (HR) estimate is routinely used to quantify the treatment difference. However, this model dependent estimate may be difficult to interpret clinically, especially when the proportional hazards (PH) assumption is violated. An alternative estimation procedure for treatment efficacy based on the restricted mean survival time or t-year mean survival time (t-MST) has been discussed extensively in the statistical and clinical literature. On the other hand, a statistical test via the HR or its asymptotically equivalent counterpart, the logrank test, is asymptotically distribution-free. In this article, we assess the relative efficiency of the hazard ratio and t-MST tests with respect to statistical power under various PH and non-PH models, theoretically and empirically. When the PH assumption is valid, the t-MST test performs almost as well as the HR test. For non-PH models, the t-MST test can substantially outperform its HR counterpart. On the other hand, the HR test can be powerful when the true difference of two survival functions is quite large at the end but not at the beginning of the study. Unfortunately, for this case, the HR estimate may not have a simple clinical interpretation for the treatment effect due to the violation of the PH assumption.

59 citations
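A bare-bones sketch of the t-MST summary discussed above: estimate the Kaplan-Meier curve and integrate it up to the truncation time tau. The `rmst` helper is an illustrative implementation (simple tie handling, no variance), not the article's test procedure; comparing arms would use the difference in RMST with an appropriate standard error.

```python
# Minimal sketch: restricted mean survival time (t-MST) as the area under Kaplan-Meier up to tau.
import numpy as np

def rmst(time, event, tau):
    """RMST up to tau from right-censored data (event = 1 observed, 0 censored)."""
    time, event = np.asarray(time, float), np.asarray(event, float)
    order = np.lexsort((1 - event, time))        # at ties, events before censorings
    time, event = time[order], event[order]
    at_risk = len(time) - np.arange(len(time))
    surv = np.cumprod(1.0 - event / at_risk)     # Kaplan-Meier survival at each observed time
    grid = np.clip(np.concatenate(([0.0], time, [tau])), 0.0, tau)
    s_left = np.concatenate(([1.0], surv))       # survival value on each step of the grid
    return float(np.sum(np.diff(grid) * s_left))

time = [2, 3, 3, 5, 8, 9, 12, 15]
event = [1, 1, 0, 1, 0, 1, 1, 0]
print(rmst(time, event, tau=10))                 # area under the KM curve on [0, 10]
```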




Journal ArticleDOI
TL;DR: This work defines population parameters for both partial and conditional Spearman's correlation through concordance-discordance probabilities, describes estimation and inference, and highlights the use of semiparametric cumulative probability models, which allow preservation of the rank-based nature of Spearman's correlation.
Abstract: It is desirable to adjust Spearman's rank correlation for covariates, yet existing approaches have limitations. For example, the traditionally defined partial Spearman's correlation does not have a sensible population parameter, and the conditional Spearman's correlation defined with copulas cannot be easily generalized to discrete variables. We define population parameters for both partial and conditional Spearman's correlation through concordance-discordance probabilities. The definitions are natural extensions of Spearman's rank correlation in the presence of covariates and are general for any orderable random variables. We show that they can be neatly expressed using probability-scale residuals (PSRs). This connection allows us to derive simple estimators. Our partial estimator for Spearman's correlation between X and Y adjusted for Z is the correlation of PSRs from models of X on Z and of Y on Z, which is analogous to the partial Pearson's correlation derived as the correlation of observed-minus-expected residuals. Our conditional estimator is the conditional correlation of PSRs. We describe estimation and inference, and highlight the use of semiparametric cumulative probability models, which allow preservation of the rank-based nature of Spearman's correlation. We conduct simulations to evaluate the performance of our estimators and compare them with other popular measures of association, demonstrating their robustness and efficiency. We illustrate our method in two applications, a biomarker study and a large survey.

54 citations
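The sketch below mimics the partial estimator described above, the correlation of probability-scale residuals (PSRs) from models of X on Z and of Y on Z, but substitutes ordinary normal linear working models for the semiparametric cumulative probability models purely to keep the illustration short. The `psr_normal` helper and the simulated data are assumptions.

```python
# Minimal sketch: partial Spearman's correlation as the correlation of PSRs.
import numpy as np
from scipy.stats import norm

def psr_normal(v, Z):
    """PSR = 2*F(v) - 1 from a normal linear working model of v on Z (continuous case)."""
    Zd = np.column_stack([np.ones(len(v)), Z])
    beta, *_ = np.linalg.lstsq(Zd, v, rcond=None)
    resid = v - Zd @ beta
    sigma = resid.std(ddof=Zd.shape[1])
    return 2.0 * norm.cdf(resid / sigma) - 1.0

def partial_spearman(x, y, Z):
    # Pearson correlation of the two PSR vectors, analogous to partial Pearson correlation.
    return np.corrcoef(psr_normal(x, Z), psr_normal(y, Z))[0, 1]

rng = np.random.default_rng(0)
z = rng.normal(size=500)
x = z + rng.normal(size=500)
y = 0.5 * z + rng.normal(size=500)   # X and Y related only through Z
print(partial_spearman(x, y, z))      # should be near zero
```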


Journal ArticleDOI
TL;DR: This article proposes robust and powerful multiple phenotype testing procedures by jointly testing a common mean and a variance component in linear mixed models for summary statistics, and develops genetic association tests for multiple phenotypes that account for between-phenotype correlation without the need to access individual-level data.
Abstract: We study in this article jointly testing the associations of a genetic variant with correlated multiple phenotypes using the summary statistics of individual phenotype analysis from Genome-Wide Association Studies (GWASs). We estimated the between-phenotype correlation matrix using the summary statistics of individual phenotype GWAS analyses, and developed genetic association tests for multiple phenotypes by accounting for between-phenotype correlation without the need to access individual-level data. Since genetic variants often affect multiple phenotypes differently across the genome and the between-phenotype correlation can be arbitrary, we proposed robust and powerful multiple phenotype testing procedures by jointly testing a common mean and a variance component in linear mixed models for summary statistics. We computed the p-values of the proposed tests analytically. This computational advantage makes our methods practically appealing in large-scale GWASs. We performed simulation studies to show that the proposed tests maintained correct type I error rates, and to compare their powers in various settings with the existing methods. We applied the proposed tests to a GWAS Global Lipids Genetics Consortium summary statistics data set and identified additional genetic variants that were missed by the original single-trait analysis.

49 citations
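For intuition about working purely from summary statistics, the sketch below estimates the between-phenotype correlation from Z-scores of approximately null variants and applies a simple K-degree-of-freedom omnibus chi-square test. This is a simplified stand-in, not the article's mixed-model score tests; all names and numbers are illustrative.

```python
# Minimal sketch: omnibus test of one variant against K phenotypes from summary Z-scores.
import numpy as np
from scipy.stats import chi2

def phenotype_correlation(null_z):
    """Estimate between-phenotype correlation from Z-scores of (approximately) null
    variants across the genome; null_z has shape (n_variants, K)."""
    return np.corrcoef(null_z, rowvar=False)

def omnibus_test(z, R):
    """Wald-type test of H0: no association with any of the K phenotypes."""
    z = np.asarray(z, float)
    stat = z @ np.linalg.solve(R, z)
    return stat, chi2.sf(stat, df=len(z))

rng = np.random.default_rng(2)
R_true = 0.4 * np.ones((3, 3)) + 0.6 * np.eye(3)
null_z = rng.multivariate_normal(np.zeros(3), R_true, size=5000)
print(omnibus_test([2.5, 2.2, 1.8], phenotype_correlation(null_z)))
```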


Journal ArticleDOI
TL;DR: In this paper, the authors proposed matching on both the estimated propensity score and the estimated prognostic scores when the number of covariates is large relative to the total number of observations, and derived asymptotic results for the matching estimator.
Abstract: Valid estimation of treatment effects from observational data requires proper control of confounding. If the number of covariates is large relative to the number of observations, then controlling for all available covariates is infeasible. In cases where a sparsity condition holds, variable selection or penalization can reduce the dimension of the covariate space in a manner that allows for valid estimation of treatment effects. In this article, we propose matching on both the estimated propensity score and the estimated prognostic scores when the number of covariates is large relative to the number of observations. We derive asymptotic results for the matching estimator and show that it is doubly robust in the sense that only one of the two score models need be correct to obtain a consistent estimator. We show via simulation its effectiveness in controlling for confounding and highlight its potential to address nonlinear confounding. Finally, we apply the proposed procedure to analyze the effect of gender on prescription opioid use using insurance claims data.

49 citations
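A compact sketch of the double-score matching idea described above: estimate a propensity score and a prognostic score with penalized regressions, then nearest-neighbor match treated units to controls on the two scores. The penalization choices, caliper-free 1:1 matching, and the function name are illustrative assumptions rather than the authors' exact procedure.

```python
# Minimal sketch: 1:1 matching on (logit propensity score, prognostic score).
import numpy as np
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.neighbors import NearestNeighbors

def match_on_two_scores(X, treat, y):
    """Return the index of the matched control for each treated unit.

    X: covariate array, treat: 0/1 array, y: outcome array.
    """
    ps = LogisticRegression(penalty="l1", solver="liblinear").fit(X, treat).predict_proba(X)[:, 1]
    prog = LassoCV(cv=5).fit(X[treat == 0], y[treat == 0]).predict(X)  # prognostic score fit on controls
    scores = np.column_stack([np.log(ps / (1 - ps)), prog])            # match on both scores
    controls = np.where(treat == 0)[0]
    nn = NearestNeighbors(n_neighbors=1).fit(scores[controls])
    _, idx = nn.kneighbors(scores[treat == 1])
    return controls[idx.ravel()]

# The matched-pair outcome differences then estimate the effect of treatment on the treated.
```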


Journal ArticleDOI
TL;DR: In this paper, a Bayesian joint model was proposed to link the longitudinal and the survival processes, using P-splines, in order to improve dynamic predictions of valve function over time.
Abstract: In the field of cardio-thoracic surgery, valve function is monitored over time after surgery. The motivation for our research comes from a study which includes patients who received a human tissue valve in the aortic position. These patients are followed prospectively over time by standardized echocardiographic assessment of valve function. Loss of follow-up could be caused by valve intervention or the death of the patient. One of the main characteristics of the human valve is that its durability is limited. Therefore, it is of interest to obtain a prognostic model in order for the physicians to scan trends in valve function over time and plan their next intervention, accounting for the characteristics of the data. Several authors have focused on deriving predictions under the standard joint modeling of longitudinal and survival data framework that assumes a constant effect for the coefficient that links the longitudinal and survival outcomes. However, in our case, this may be a restrictive assumption. Since the valve degenerates, the association between the biomarker with survival may change over time. To improve dynamic predictions, we propose a Bayesian joint model that allows a time-varying coefficient to link the longitudinal and the survival processes, using P-splines. We evaluate the performance of the model in terms of discrimination and calibration, while accounting for censoring.

Journal ArticleDOI
TL;DR: GLiDeR (Group Lasso and Doubly Robust Estimation), a novel variable selection technique for identifying confounders and predictors of outcome using an adaptive group lasso approach that simultaneously performs coefficient selection, regularization, and estimation across the treatment and outcome models, is proposed.
Abstract: The efficiency of doubly robust estimators of the average causal effect (ACE) of a treatment can be improved by including in the treatment and outcome models only those covariates which are related to both treatment and outcome (i.e., confounders) or related only to the outcome. However, it is often challenging to identify such covariates among the large number that may be measured in a given study. In this article, we propose GLiDeR (Group Lasso and Doubly Robust Estimation), a novel variable selection technique for identifying confounders and predictors of outcome using an adaptive group lasso approach that simultaneously performs coefficient selection, regularization, and estimation across the treatment and outcome models. The selected variables and corresponding coefficient estimates are used in a standard doubly robust ACE estimator. We provide asymptotic results showing that, for a broad class of data generating mechanisms, GLiDeR yields a consistent estimator of the ACE when either the outcome or treatment model is correctly specified. A comprehensive simulation study shows that GLiDeR is more efficient than doubly robust methods using standard variable selection techniques and has substantial computational advantages over a recently proposed doubly robust Bayesian model averaging method. We illustrate our method by estimating the causal treatment effect of bilateral versus single-lung transplant on forced expiratory volume in one year after transplant using an observational registry.

Journal ArticleDOI
TL;DR: An algorithm is developed to calculate the maximum approximate partial likelihood estimates of unknown finite and infinite dimensional parameters; real data analyses show that high-dimensional hippocampus surface data may be an important marker for predicting time to conversion to Alzheimer's disease.
Abstract: We consider a functional linear Cox regression model for characterizing the association between time-to-event data and a set of functional and scalar predictors. The functional linear Cox regression model incorporates a functional principal component analysis for modeling the functional predictors and a high-dimensional Cox regression model to characterize the joint effects of both functional and scalar predictors on the time-to-event data. We develop an algorithm to calculate the maximum approximate partial likelihood estimates of unknown finite and infinite dimensional parameters. We also systematically investigate the rate of convergence of the maximum approximate partial likelihood estimates and a score test statistic for testing the nullity of the slope function associated with the functional predictors. We demonstrate our estimation and testing procedures by using simulations and the analysis of the Alzheimer's Disease Neuroimaging Initiative (ADNI) data. Our real data analyses show that high-dimensional hippocampus surface data may be an important marker for predicting time to conversion to Alzheimer's disease. Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu).

Journal ArticleDOI
TL;DR: A theoretical result is used to show that transformation cannot reasonably be expected to stabilize variances for small counts, a finding with clear implications for the analysis of counts as often implemented in the applied sciences, and particularly for multivariate analysis in ecology.
Abstract: While data transformation is a common strategy to satisfy linear modeling assumptions, a theoretical result is used to show that transformation cannot reasonably be expected to stabilize variances for small counts. Under broad assumptions, as counts get smaller, it is shown that the variance becomes proportional to the mean under monotonic transformations g(·) that satisfy g(0)=0, excepting a few pathological cases. A suggested rule-of-thumb is that if many predicted counts are less than one, then data transformation cannot reasonably be expected to stabilize variances, even for a well-chosen transformation. This result has clear implications for the analysis of counts as often implemented in the applied sciences, but particularly for multivariate analysis in ecology. Multivariate discrete data are often collected in ecology, typically with a large proportion of zeros, and it is currently widespread to use methods of analysis that do not account for differences in variance across observations or across responses. Simulations demonstrate that failure to account for the mean–variance relationship can have particularly severe consequences in this context, and also in the univariate context if the sampling design is unbalanced.
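The small-count claim is easy to check numerically: for Poisson counts with mean well below one, the variance of log(y+1) stays roughly proportional to the mean rather than being stabilized. The transformation and the grid of means below are arbitrary illustrative choices.

```python
# Minimal sketch: transformation does not stabilize variance for small counts.
import numpy as np

rng = np.random.default_rng(3)
for mean in [0.1, 0.5, 1.0, 5.0, 20.0]:
    y = rng.poisson(mean, size=200_000)
    v = np.log(y + 1).var()
    print(f"mean={mean:5.1f}  var(log(y+1))={v:.4f}  ratio to mean={v / mean:.3f}")
# For means well below one, the transformed variance is roughly proportional to the mean
# (near-constant ratio), while for large means it behaves like 1/mean -- so no single
# transformation stabilizes variance across observations with small predicted counts.
```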

Journal ArticleDOI
TL;DR: This article describes an approach using numerical quadrature to obtain PA estimates from their SS counterparts in models with multiple random effects, and illustrates the proposed method using data from a smoking cessation study in which a binary outcome was measured longitudinally.
Abstract: This article discusses marginalization of the regression parameters in mixed models for correlated binary outcomes. As is well known, the regression parameters in such models have the "subject-specific" (SS) or conditional interpretation, in contrast to the "population-averaged" (PA) or marginal estimates that represent the unconditional covariate effects. We describe an approach using numerical quadrature to obtain PA estimates from their SS counterparts in models with multiple random effects. Standard errors for the PA estimates are derived using the delta method. We illustrate our proposed method using data from a smoking cessation study in which a binary outcome (smoking, Y/N) was measured longitudinally. We compare our estimates to those obtained using GEE and marginalized multilevel models, and present results from a simulation study.
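A minimal sketch of the quadrature step for the simplest case of a single random intercept: the population-averaged probability is the subject-specific probability integrated over the random-effect distribution using Gauss-Hermite nodes. The article's setting (multiple random effects, delta-method standard errors) is more general; the names and values below are illustrative.

```python
# Minimal sketch: population-averaged probabilities from a random-intercept logistic model.
import numpy as np

def marginal_probability(linear_predictor, sd_random_intercept, n_nodes=30):
    """Integrate expit(eta + sigma*u) over u ~ N(0, 1) with Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * nodes                 # change of variable to a standard normal
    w = weights / np.sqrt(np.pi)
    eta = np.atleast_1d(linear_predictor)[:, None] + sd_random_intercept * u[None, :]
    return (w * (1.0 / (1.0 + np.exp(-eta)))).sum(axis=1)

# The PA (marginal) effects are attenuated relative to the SS (conditional) ones.
eta_ss = np.array([-1.0, 0.0, 1.0])          # hypothetical subject-specific linear predictors
print(marginal_probability(eta_ss, sd_random_intercept=2.0))
```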

Journal ArticleDOI
TL;DR: A non-myopic, covariate-adjusted response adaptive (CARA) allocation design for multi-armed clinical trials is introduced and simple modifications of the proposed CARA rule can be incorporated so that an ethical advantage can be offered without sacrificing power in comparison with balanced designs.
Abstract: We introduce a non-myopic, covariate-adjusted response adaptive (CARA) allocation design for multi-armed clinical trials. The allocation scheme is a computationally tractable procedure based on the Gittins index solution to the classic multi-armed bandit problem and extends the procedure recently proposed in Villar et al. (2015). Our proposed CARA randomization procedure is defined by reformulating the bandit problem with covariates into a classic bandit problem in which there are multiple combination arms, considering every arm per each covariate category as a distinct treatment arm. We then apply a heuristically modified Gittins index rule to solve the problem and define allocation probabilities from the resulting solution. We report the efficiency, balance, and ethical performance of our approach compared to existing CARA methods using a recently published clinical trial as motivation. The net savings in terms of expected number of treatment failures is considerably larger and probably enough to make this design attractive for certain studies where known covariates are expected to be important, stratification is not desired, treatment failures have a high ethical cost, and the disease under study is rare. In a two-armed context, this patient benefit advantage comes at the expense of increased variability in the allocation proportions and a reduction in statistical power. However, in a multi-armed context, simple modifications of the proposed CARA rule can be incorporated so that an ethical advantage can be offered without sacrificing power in comparison with balanced designs.

Journal ArticleDOI
TL;DR: A wild bootstrap resampling technique is suggested for nonparametric inference on transition probabilities in a general time-inhomogeneous Markov multistate model, and is applied to investigate a non-standard time-to-event outcome with data from a recent study of prophylactic treatment in allogeneic transplanted leukemia patients.
Abstract: We suggest a wild bootstrap resampling technique for nonparametric inference on transition probabilities in a general time-inhomogeneous Markov multistate model. We first approximate the limiting distribution of the Nelson-Aalen estimator by repeatedly generating standard normal wild bootstrap variates, while the data is kept fixed. Next, a transformation using a functional delta method argument is applied. The approach is conceptually easier than direct resampling for the transition probabilities. It is used to investigate a non-standard time-to-event outcome, currently being alive without immunosuppressive treatment, with data from a recent study of prophylactic treatment in allogeneic transplanted leukemia patients. Due to non-monotonic outcome probabilities in time, neither standard survival nor competing risks techniques apply, which highlights the need for the present methodology. Finite sample performance of time-simultaneous confidence bands for the outcome probabilities is assessed in an extensive simulation study motivated by the clinical trial data. Example code is provided in the web-based Supplementary Materials.
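For intuition, the sketch below applies the wild (multiplier) bootstrap to a single-transition Nelson-Aalen estimator: the increments 1/Y at event times are multiplied by standard normal variates and cumulated while the data stay fixed. Extending this transition-wise and mapping to transition probabilities via the functional delta method, as in the article, is not shown; the function name and inputs are illustrative.

```python
# Minimal sketch: wild (multiplier) bootstrap for the Nelson-Aalen estimator.
import numpy as np

def nelson_aalen_wild(time, event, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    time, event = np.asarray(time, float), np.asarray(event, float)
    order = np.lexsort((1 - event, time))
    time, event = time[order], event[order]
    at_risk = len(time) - np.arange(len(time))
    inc = event / at_risk                          # Nelson-Aalen increments 1/Y at event times
    na = np.cumsum(inc)                            # estimator evaluated at each observed time
    G = rng.standard_normal((n_boot, len(time)))   # multipliers; only event times contribute
    boot_dev = np.cumsum(G * inc, axis=1)          # wild-bootstrap approximations of NA* - NA
    return time, na, boot_dev

t = [3, 5, 6, 6, 8, 11, 14]
e = [1, 0, 1, 1, 0, 1, 0]
times, na_hat, dev = nelson_aalen_wild(t, e, n_boot=500)
# Pointwise or time-simultaneous bands follow from quantiles of |dev| across bootstrap draws.
```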

Journal ArticleDOI
TL;DR: A general Bayesian nonparametric (BNP) approach to causal inference in the point treatment setting is proposed; it is well-suited for causal inference problems, as it does not require parametric assumptions about the distribution of confounders and naturally leads to a computationally efficient Gibbs sampling algorithm.
Abstract: We propose a general Bayesian nonparametric (BNP) approach to causal inference in the point treatment setting. The joint distribution of the observed data (outcome, treatment, and confounders) is modeled using an enriched Dirichlet process. The combination of the observed data model and causal assumptions allows us to identify any type of causal effect-differences, ratios, or quantile effects, either marginally or for subpopulations of interest. The proposed BNP model is well-suited for causal inference problems, as it does not require parametric assumptions about the distribution of confounders and naturally leads to a computationally efficient Gibbs sampling algorithm. By flexibly modeling the joint distribution, we are also able to impute (via data augmentation) values for missing covariates within the algorithm under an assumption of ignorable missingness, obviating the need to create separate imputed data sets. This approach for imputing the missing covariates has the additional advantage of guaranteeing congeniality between the imputation model and the analysis model, and because we use a BNP approach, parametric models are avoided for imputation. The performance of the method is assessed using simulation studies. The method is applied to data from a cohort study of human immunodeficiency virus/hepatitis C virus co-infected patients.

Journal ArticleDOI
TL;DR: A highly flexible structural proportional hazards model is developed and applied to data on 4903 individuals with HIV/TB co-infection, derived from electronic health records in a large HIV care program in Kenya, to provide 'higher resolution' information about the relationship between ART timing and mortality.
Abstract: The timing of antiretroviral therapy (ART) initiation for HIV and tuberculosis (TB) co-infected patients needs to be considered carefully. CD4 cell count can be used to guide decision making about when to initiate ART. Evidence from recent randomized trials and observational studies generally supports early initiation but does not provide information about effects of initiation time on a continuous scale. In this article, we develop and apply a highly flexible structural proportional hazards model for characterizing the effect of treatment initiation time on a survival distribution. The model can be fitted using a weighted partial likelihood score function. Construction of both the score function and the weights must accommodate censoring of the treatment initiation time, the outcome, or both. The methods are applied to data on 4903 individuals with HIV/TB co-infection, derived from electronic health records in a large HIV care program in Kenya. We use a model formulation that flexibly captures the joint effects of ART initiation time and ART duration using natural cubic splines. The model is used to generate survival curves corresponding to specific treatment initiation times, and to identify optimal times for ART initiation for subgroups defined by CD4 count at time of TB diagnosis. Our findings potentially provide 'higher resolution' information about the relationship between ART timing and mortality, and about the differential effect of ART timing within CD4 subgroups.

Journal ArticleDOI
TL;DR: The proposed adaptive platform design improves power by as much as 51% with limited type‐I error inflation and effectuates more balance with respect to the distribution of acquired information among study arms, with more patients randomized to experimental regimens.
Abstract: Traditional paradigms for clinical translation are challenged in settings where multiple contemporaneous therapeutic strategies have been identified as potentially beneficial. Platform trials have emerged as an approach for sequentially comparing multiple trials using a single protocol. The Ebola virus disease outbreak in West Africa represents one recent example which utilized a platform design. Specifically, the PREVAIL II master protocol sequentially tested new combinations of therapies against the concurrent, optimal standard of care (oSOC) strategy. Once a treatment demonstrated sufficient evidence of benefit, the treatment was added to the oSOC for all future comparisons (denoted as segments throughout the manuscript). In the interest of avoiding bias stemming from population drift, PREVAIL II considered only within-segment comparisons between the oSOC and novel treatments and failed to leverage data from oSOC patients in prior segments. This article describes adaptive design methodology aimed at boosting statistical power through Bayesian modeling and adaptive randomization. Specifically, the design uses multi-source exchangeability models to combine data from multiple segments and adaptive randomization to achieve information balance within a segment. When compared to the PREVAIL II design, we demonstrate that our proposed adaptive platform design improves power by as much as 51% with limited type-I error inflation. Further, the adaptive platform effectuates more balance with respect to the distribution of acquired information among study arms, with more patients randomized to experimental regimens.

Journal ArticleDOI
TL;DR: In this paper, the authors propose modeling the causal structure with a probabilistic graphical model, estimating the graph from observed data, and selecting the target confounder subsets given the estimated graph. The approach is evaluated by simulation both in a high-dimensional setting where unconfoundedness holds given X and in a setting where unconfoundedness only holds given subsets of X. The proposed method is implemented with existing software that can easily handle high-dimensional data.
Abstract: To unbiasedly estimate a causal effect on an outcome, unconfoundedness is often assumed. If there is sufficient knowledge of the underlying causal structure, then existing confounder selection criteria can be used to select subsets of the observed pretreatment covariates, X, sufficient for unconfoundedness, if such subsets exist. Here, estimation of these target subsets is considered when the underlying causal structure is unknown. The proposed method is to model the causal structure by a probabilistic graphical model, for example, a Markov or Bayesian network, estimate this graph from observed data, and select the target subsets given the estimated graph. The approach is evaluated by simulation both in a high-dimensional setting where unconfoundedness holds given X and in a setting where unconfoundedness only holds given subsets of X. Several common target subsets are investigated and the selected subsets are compared with respect to accuracy in estimating the average causal effect. The proposed method is implemented with existing software that can easily handle high-dimensional data, in terms of large samples and a large number of covariates. The results from the simulation study show that, if unconfoundedness holds given X, this approach is very successful in selecting the target subsets, outperforming alternative approaches based on random forests and LASSO, and that the subset estimating the target subset containing all causes of outcome yields the smallest MSE in the average causal effect estimation.

Journal ArticleDOI
TL;DR: A new, simple method of estimating the power parameter is proposed for the case when only one historical dataset is available; it is based on predictive distributions and parameterized in such a way that the type I error can be controlled by calibrating to the degree of similarity between the new and historical data.
Abstract: In order for historical data to be considered for inclusion in the design and analysis of clinical trials, prospective rules are essential. Incorporation of historical data may be of particular interest in the case of small populations, where available data are scarce and heterogeneity is not as well understood, and thus conventional methods for evidence synthesis might fall short. The concept of power priors can be particularly useful for borrowing evidence from a single historical study. Power priors employ a parameter γ ∈ [0, 1] that quantifies the heterogeneity between the historical study and the new study. However, the possibility of borrowing data from a historical trial will usually be associated with an inflation of the type I error. We suggest a new, simple method of estimating the power parameter suitable for the case when only one historical dataset is available. The method is based on predictive distributions and parameterized in such a way that the type I error can be controlled by calibrating to the degree of similarity between the new and historical data. The method is demonstrated for normal responses in a one- or two-group setting. Generalization to other models is straightforward.
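A toy illustration of the power prior for a normal mean with known sigma and a flat initial prior: raising the historical likelihood to the power gamma makes the historical study count as gamma * n0 observations. The calibration of gamma to a predictive measure of similarity, which is the article's contribution, is not shown; the data and sigma below are made up.

```python
# Minimal sketch: power-prior posterior for a normal mean with known variance.
import numpy as np

def power_prior_posterior(x_new, x_hist, sigma, gamma):
    """Posterior mean and sd under a flat initial prior; the historical likelihood is
    raised to the power gamma, so the historical sample counts as gamma * n0 observations."""
    n, n0 = len(x_new), len(x_hist)
    prec = n / sigma**2 + gamma * n0 / sigma**2
    mean = (n * np.mean(x_new) + gamma * n0 * np.mean(x_hist)) / (sigma**2 * prec)
    return mean, np.sqrt(1.0 / prec)

# gamma = 0 ignores the historical trial, gamma = 1 pools it fully.
x_hist = np.array([1.1, 0.8, 1.3, 0.9, 1.0])
x_new = np.array([0.6, 0.9, 0.7, 1.0])
print(power_prior_posterior(x_new, x_hist, sigma=0.5, gamma=0.5))
```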

Journal ArticleDOI
TL;DR: A suite of likelihood-based spatial mark-resight models is developed that either include the marking phase or require a known distribution of marked animals (narrow-sense "mark-resight"); simulations suggest the resulting density estimates have low bias and adequate confidence interval coverage under typical sampling conditions.
Abstract: Sightings of previously marked animals can extend a capture-recapture dataset without the added cost of capturing new animals for marking. Combined marking and resighting methods are therefore an attractive option in animal population studies, and there exist various likelihood-based non-spatial models, and some spatial versions fitted by Markov chain Monte Carlo sampling. As implemented to date, the focus has been on modeling sightings only, which requires that the spatial distribution of pre-marked animals is known. We develop a suite of likelihood-based spatial mark-resight models that either include the marking phase ("capture-mark-resight" models) or require a known distribution of marked animals (narrow-sense "mark-resight"). The new models sacrifice some information in the covariance structure of the counts of unmarked animals; estimation is by maximizing a pseudolikelihood with a simulation-based adjustment for overdispersion in the sightings of unmarked animals. Simulations suggest that the resulting estimates of population density have low bias and adequate confidence interval coverage under typical sampling conditions. Further work is needed to specify the conditions under which ignoring covariance results in unacceptable loss of precision, or to modify the pseudolikelihood to include that information. The methods are applied to a study of ship rats Rattus rattus using live traps and video cameras in a New Zealand forest, and to previously published data.

Journal ArticleDOI
TL;DR: A Tukey-like circular boxplot is introduced, which would be especially useful in all fields where circular measures arise: biometrics, astronomy, environmetrics, Earth sciences, to cite just a few.
Abstract: The box-and-whiskers plot is an extraordinary graphical tool that provides a quick visual summary of an observed distribution. In spite of its many extensions, a really suitable boxplot to display circular data is not yet available. Thanks to its simplicity and strong visual impact, such a tool would be especially useful in all fields where circular measures arise: biometrics, astronomy, environmetrics, Earth sciences, to cite just a few. For this reason, in line with Tukey's original idea, a Tukey-like circular boxplot is introduced. Several simulated and real datasets arising in biology are used to illustrate the proposed graphical tool.

Journal ArticleDOI
TL;DR: The proposed Bayesian nonparametric approach for causal inference on quantiles in the presence of many confounders is used to answer an important clinical question involving acute kidney injury using electronic health records.
Abstract: We propose a Bayesian nonparametric approach (BNP) for causal inference on quantiles in the presence of many confounders. In particular, we define relevant causal quantities and specify BNP models to avoid bias from restrictive parametric assumptions. We first use Bayesian additive regression trees (BART) to model the propensity score and then construct the distribution of potential outcomes given the propensity score using a Dirichlet process mixture (DPM) of normals model. We thoroughly evaluate the operating characteristics of our approach and compare it to Bayesian and frequentist competitors. We use our approach to answer an important clinical question involving acute kidney injury using electronic health records.

Journal ArticleDOI
TL;DR: A Cox regression model is proposed to adjust for double truncation using a weighted estimating equation approach, where the weights are estimated from the data both parametrically and nonparametrically, and are inversely proportional to the probability that a subject is observed.
Abstract: Truncation is a well-known phenomenon that may be present in observational studies of time-to-event data. While many methods exist to adjust for either left or right truncation, there are very few methods that adjust for simultaneous left and right truncation, also known as double truncation. We propose a Cox regression model to adjust for this double truncation using a weighted estimating equation approach, where the weights are estimated from the data both parametrically and nonparametrically, and are inversely proportional to the probability that a subject is observed. The resulting weighted estimators of the hazard ratio are consistent. The parametric weighted estimator is asymptotically normal and a consistent estimator of the asymptotic variance is provided. For the nonparametric weighted estimator, we apply the bootstrap technique to estimate the variance and confidence intervals. We demonstrate through extensive simulations that the proposed estimators greatly reduce the bias compared to the unweighted Cox regression estimator which ignores truncation. We illustrate our approach in an analysis of autopsy-confirmed Alzheimer's disease patients to assess the effect of education on survival.
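The weighting idea can be sketched with an off-the-shelf Cox fitter once the observation probabilities are in hand; estimating those probabilities under double truncation (parametrically or nonparametrically) is the substantive step in the article and is assumed already done here. The column names and the use of lifelines are illustrative assumptions.

```python
# Minimal sketch: inverse-probability-weighted Cox regression with precomputed
# observation probabilities (probability of falling inside the truncation window).
from lifelines import CoxPHFitter

def weighted_cox(df):
    """df must contain: time, event, covariate columns, and p_obs = estimated
    probability that the subject is observed given double truncation."""
    df = df.assign(ipw=1.0 / df["p_obs"]).drop(columns=["p_obs"])
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event", weights_col="ipw", robust=True)
    return cph

# For the nonparametrically estimated weights, variance and confidence intervals
# would instead come from a bootstrap over subjects, as described above.
```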

Journal ArticleDOI
TL;DR: This article proposes a GLM‐based Ordination Method for Microbiome Samples (GOMMS), which uses a zero‐inflated quasi–Poisson (ZIQP) latent factor model and an EM algorithm based on the quasi‐likelihood is developed to estimate parameters.
Abstract: Distance-based ordination methods, such as principal coordinates analysis (PCoA), are widely used in the analysis of microbiome data. However, these methods risk misinterpretation of compositional differences in samples across different populations if there is a difference in dispersion effects. Accounting for high sparsity and overdispersion of microbiome data, we propose a GLM-based Ordination Method for Microbiome Samples (GOMMS) in this article. This method uses a zero-inflated quasi-Poisson (ZIQP) latent factor model. An EM algorithm based on the quasi-likelihood is developed to estimate parameters. It performs comparably to the distance-based approach when dispersion effects are negligible and consistently better when dispersion effects are strong, where the distance-based approach sometimes yields undesirable results. The estimated latent factors from GOMMS can be used to associate the microbiome community with covariates or outcomes using the standard multivariate tests, which can be investigated in future confirmatory experiments. We illustrate the method in simulations and an analysis of microbiome samples from the nasopharynx and oropharynx.

Journal ArticleDOI
TL;DR: This work simultaneously estimates the optimal individualized treatment rule for all composite outcomes representable as a convex combination of the (suitably transformed) outcomes and proves that as the number of subjects and items on the questionnaire diverge, the estimator is consistent for an oracle optimal individualized treatment rule wherein each patient's preference is known a priori.
Abstract: Precision medicine seeks to provide treatment only if, when, to whom, and at the dose it is needed. Thus, precision medicine is a vehicle by which healthcare can be made both more effective and efficient. Individualized treatment rules operationalize precision medicine as a map from current patient information to a recommended treatment. An optimal individualized treatment rule is defined as maximizing the mean of a pre-specified scalar outcome. However, in settings with multiple outcomes, choosing a scalar composite outcome by which to define optimality is difficult. Furthermore, when there is heterogeneity across patient preferences for these outcomes, it may not be possible to construct a single composite outcome that leads to high-quality treatment recommendations for all patients. We simultaneously estimate the optimal individualized treatment rule for all composite outcomes representable as a convex combination of the (suitably transformed) outcomes. For each patient, we use a preference elicitation questionnaire and item response theory to derive the posterior distribution over preferences for these composite outcomes and subsequently derive an estimator of an optimal individualized treatment rule tailored to patient preferences. We prove that as the number of subjects and items on the questionnaire diverge, our estimator is consistent for an oracle optimal individualized treatment rule wherein each patient's preference is known a priori. We illustrate the proposed method using data from a clinical trial on antipsychotic medications for schizophrenia.

Journal ArticleDOI
TL;DR: An optimal surrogate for the current study is defined as the function of the data generating distribution collected by the intermediate time point that satisfies the Prentice definition of a valid surrogate endpoint and that optimally predicts the final outcome: this optimal surrogate is an unknown parameter.
Abstract: A common scientific problem is to determine a surrogate outcome for a long-term outcome so that future randomized studies can restrict themselves to only collecting the surrogate outcome. We consider the setting that we observe n independent and identically distributed observations of a random variable consisting of baseline covariates, a treatment, a vector of candidate surrogate outcomes at an intermediate time point, and the final outcome of interest at a final time point. We assume the treatment is randomized, conditional on the baseline covariates. The goal is to use these data to learn a most-promising surrogate for use in future trials for inference about a mean contrast treatment effect on the final outcome. We define an optimal surrogate for the current study as the function of the data generating distribution collected by the intermediate time point that satisfies the Prentice definition of a valid surrogate endpoint and that optimally predicts the final outcome: this optimal surrogate is an unknown parameter. We show that this optimal surrogate is a conditional mean and present super-learner and targeted super-learner based estimators, whose predicted outcomes are used as the surrogate in applications. We demonstrate a number of desirable properties of this optimal surrogate and its estimators, and study the methodology in simulations and an application to dengue vaccine efficacy trials.

Journal ArticleDOI
TL;DR: This work proposes a new varying-coefficient semiparametric model averaging prediction (VC-SMAP) approach to analyze large data sets with abundant covariates, which provides more flexibility than parametric methods, while being more stable and easily implemented than fully multivariate nonparametric varying- coefficient models.
Abstract: Forecasting and predictive inference are fundamental data analysis tasks. Most studies employ parametric approaches making strong assumptions about the data generating process. On the other hand, while nonparametric models are applied, it is sometimes found in situations involving low signal to noise ratios or large numbers of covariates that their performance is unsatisfactory. We propose a new varying-coefficient semiparametric model averaging prediction (VC-SMAP) approach to analyze large data sets with abundant covariates. Performance of the procedure is investigated with numerical examples. Even though model averaging has been extensively investigated in the literature, very few authors have considered averaging a set of semiparametric models. Our proposed model averaging approach provides more flexibility than parametric methods, while being more stable and easily implemented than fully multivariate nonparametric varying-coefficient models. We supply numerical evidence to justify the effectiveness of our methodology.